
Ritu Gupta.

About.

A data engineer focused on building large-scale distributed data systems.

Designed, developed, and implemented high-volume event pipelines and analytics platforms processing hundreds of terabytes of data daily.

Passionate about building innovative, scalable, data-driven products that create meaningful impact and enhance real-world customer experiences.

Core Tech Stack

Spark, AWS, Airflow, Python

Industries served

Advertising, IoT, Telecom

01 Background
02 Foundations in Quality Engineering
03 Building Software and Distributed Systems
04 Designing Large-Scale Reliable Data Pipelines


Key Problems Solved.

Problem

Ad Clickstream Analytics - Data Lakehouse

Impact

Petabyte-scale event-driven data processing systems

Problem

Large-Scale Data Processing - ETL Pipeline Design

Impact

~100 TB of data ingested, transformed, and analyzed daily

Problem

Enhanced Quality - Data Observability & Reliability

Impact

90% fewer customer-reported data completeness issues

Problem

Scaling and Optimization - Distributed Systems

Impact

Resources and processing times optimized by 35%


Deep Dive I. System Architecture.

Objectives

  • Enable petabyte-scale analytics over high-volume advertising event streams owned by multiple engineering teams.
  • Transform raw event records into supply, cost, and click-through performance metrics.

Solutions

  • Designed a distributed Data Lakehouse architecture supporting stream-to-batch analytics across large-scale datasets.
  • Owned the design and implementation of a three-zone ETL architecture, defining ingestion, storage, and compute layers.

Impact

  • Enabled petabyte-scale event processing across the analytics platform.
  • Reduced metrics availability time by ~75%, improving analytics SLAs.

Distributed Data Lakehouse

01 Source (Event Logs) -> 02 Ingestion (Data Streams) -> 03 Processing (Micro-Batches) -> 04 Storage (Transformations) -> 05 Transform (Aggregations) -> 06 Analytics (Metrics & APIs) -> 07 Consumption (Athena + API)
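
A minimal PySpark Structured Streaming sketch of the stream-to-batch path (stages 02-04): events are read from a stream, parsed, and landed as date-partitioned micro-batches in the storage zone. The topic name, event schema, and S3 paths are illustrative assumptions, not the production system.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_date
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("event-ingestion").getOrCreate()

# Hypothetical schema for raw advertising events.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("campaign_id", StringType()),
    StructField("event_type", StringType()),    # e.g. impression, click
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")                          # assumes a Kafka-compatible stream
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "ad-events")         # hypothetical topic name
       .load())

events = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .withColumn("event_date", to_date("event_time")))

# Micro-batches land as date-partitioned Parquet in the storage zone,
# where downstream transforms and aggregations pick them up.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://lake/raw-zone/ad-events/")             # illustrative path
         .option("checkpointLocation", "s3://lake/checkpoints/ad-events/")
         .partitionBy("event_date")
         .trigger(processingTime="5 minutes")    # micro-batch cadence
         .start())
query.awaitTermination()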


Deep Dive II. ETL Pipelines.

Raw Events - Multiple Sources
Schema, File Format Validation
Preprocessing and Filtering
Exploding Nested Data, Groupings
Metadata enrichment, Aggregations
Partitioned and Indexed Datasets
Analytics Tables & Service APIs

Objectives

  • Transform raw event logs into analytics-ready datasets per use case.
  • Support analytics across high-volume event streams (~100TB/day).

Solutions

  • Implemented multi-stage ETL pipelines including validation, filtering, and flattening of nested event data.
  • Designed aggregation and partitioning strategies to optimize query patterns across multiple datasets (see the sketch below).
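
A minimal PySpark sketch of the multi-stage flow above (validate -> filter -> explode -> aggregate -> partitioned write). Column names, the nested layout, and paths are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, explode
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.appName("etl-pipeline").getOrCreate()

# Schema/format validation: enforce an expected schema at read time.
raw = (spark.read
       .schema("request_id STRING, event_date DATE, "
               "impressions ARRAY<STRUCT<campaign_id: STRING, clicks: LONG, cost: DOUBLE>>")
       .parquet("s3://lake/raw-zone/ad-events/"))   # hypothetical source path

# Preprocessing and filtering: drop malformed records.
valid = raw.filter(col("request_id").isNotNull() & col("impressions").isNotNull())

# Explode nested data: one row per impression.
flat = (valid
        .select("request_id", "event_date", explode("impressions").alias("imp"))
        .select("request_id", "event_date", "imp.*"))

# Aggregations: campaign-level daily supply, click, and cost metrics.
metrics = (flat.groupBy("event_date", "campaign_id")
           .agg(count("*").alias("impressions"),
                sum_("clicks").alias("clicks"),
                sum_("cost").alias("cost")))

# Partitioned, analytics-ready output aligned with downstream query patterns.
(metrics.write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("s3://lake/analytics-zone/campaign_daily/"))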

Impact

  • Enabled transformations across ~100TB daily.
  • Improved downstream query performance.


Deep Dive III. System Reliability.

Focus 01

Data Observability Platform

Problem Areas

  • Track completeness and timeliness across critical datasets.
  • Surface integrity issues before they reach customers.

Solutions

  • Built data completeness checks across source, ingestion, and transformation layers (see the sketch below).
  • Automated reprocessing and backfills for late-arriving events.
  • Supported a configurable dataset onboarding process.
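
A minimal sketch of a completeness check between two layers, assuming both hold the same grain of data so row counts are comparable per dataset and date; paths, the threshold, and the backfill hook are illustrative assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("completeness-check").getOrCreate()

def completeness_ratio(source_path: str, target_path: str, event_date: str) -> float:
    """Compare row counts for one date between an upstream and a downstream layer."""
    source_count = (spark.read.parquet(source_path)
                    .filter(f"event_date = DATE'{event_date}'").count())
    target_count = (spark.read.parquet(target_path)
                    .filter(f"event_date = DATE'{event_date}'").count())
    return target_count / source_count if source_count else 1.0

ratio = completeness_ratio(
    "s3://lake/raw-zone/ad-events/",        # hypothetical ingestion layer
    "s3://lake/staging-zone/ad-events/",    # hypothetical transformation layer
    "2024-06-01",
)

# Per-dataset threshold, set during the configurable onboarding process;
# a breach triggers reprocessing/backfill for late-arriving events.
if ratio < 0.99:
    print(f"Completeness {ratio:.2%} below threshold; scheduling backfill")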

Impact

  • 90% reduction in customer-reported data defects.

Focus 02

Monitoring - Metrics, Alarms, Dashboards

Problem Areas

  • Reduce high detection and resolution times for data quality issues.
  • Reduce manual intervention and developer effort in debugging.

Solutions

  • Added threshold-based alarms for failures, delays, and anomalies (see the sketch below).
  • Created custom dashboards for data health and quality trends.
  • Integrated with ticketing systems for real-time alerting.
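
A minimal boto3 sketch of the pattern: pipelines publish data-health values as custom CloudWatch metrics, and a threshold alarm notifies the alerting/ticketing integration on breach. The namespace, metric names, and SNS topic ARN are illustrative assumptions:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a data-health measurement as a custom metric.
cloudwatch.put_metric_data(
    Namespace="DataPlatform",                    # hypothetical namespace
    MetricData=[{
        "MetricName": "CompletenessRatio",
        "Dimensions": [{"Name": "Dataset", "Value": "campaign_daily"}],
        "Value": 0.997,
        "Unit": "None",
    }],
)

# Threshold-based alarm: fires when completeness drops below 99%.
cloudwatch.put_metric_alarm(
    AlarmName="campaign_daily-completeness-low",
    Namespace="DataPlatform",
    MetricName="CompletenessRatio",
    Dimensions=[{"Name": "Dataset", "Value": "campaign_daily"}],
    Statistic="Minimum",
    Period=3600,                   # evaluate hourly
    EvaluationPeriods=1,
    Threshold=0.99,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",  # a silent pipeline is also an incident
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-quality-alerts"],
)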

Impact

  • Reduction in data quality issue detection and resolution time from 2 days to 1 hour.


Distributed Systems. Scaling.

Focus 01

Scaling and Optimization

  • Owned distributed system troubleshooting, debugging large-scale Spark and pipeline failures.
  • Scaled EMR and Glue compute workloads for long-running jobs, achieving ~35% reduction in processing time.
  • Addressed Spark data skew and shuffle bottlenecks with partitioning and salting strategies (see the sketch below).
  • Implemented batch-sizing and parallelization strategies to handle dataset volume spikes.
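
A minimal PySpark sketch of key salting for a skewed join: the hot key on the large side gets a random salt bucket, and the smaller side is replicated once per bucket so every salted key still matches. DataFrame names and the salt factor are illustrative assumptions; in practice this is weighed against options like AQE skew handling and broadcast joins.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, explode, floor, lit, rand, sequence

spark = SparkSession.builder.appName("skew-salting").getOrCreate()
SALT_BUCKETS = 16   # assumed factor; tuned per dataset volume

events = spark.read.parquet("s3://lake/raw-zone/ad-events/")   # large, skewed side
campaigns = spark.read.parquet("s3://lake/dims/campaigns/")    # smaller side

# Large side: append a random bucket to the hot join key.
salted_events = events.withColumn(
    "salted_key",
    concat_ws("#", col("campaign_id"), floor(rand() * SALT_BUCKETS).cast("string")))

# Small side: one replica per bucket, so all salted keys find a match.
salted_campaigns = (campaigns
    .withColumn("salt", explode(sequence(lit(0), lit(SALT_BUCKETS - 1))))
    .withColumn("salted_key",
                concat_ws("#", col("campaign_id"), col("salt").cast("string")))
    .drop("salt", "campaign_id"))

# Shuffle load for each hot key now spreads across SALT_BUCKETS partitions.
joined = salted_events.join(salted_campaigns, "salted_key")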
Focus 02

Operational Excellence

  • Provisioned scalable infrastructure with Infrastructure-as-Code (IaC), maintaining security and access controls (see the sketch below).
  • Improved deployment speed and pipeline stability through automated CI/CD workflows and Automation Frameworks.
  • Increased multi-level test coverage and validation mechanisms across ETL pipelines and systems.
  • Enhanced documentation across systems and products for extensibility and maintainability.
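
A minimal AWS CDK (v2, Python) sketch of the IaC pattern: an encrypted, non-public bucket for one lake zone plus a least-privilege read role. Construct names and settings are illustrative assumptions, not the actual stacks:

import aws_cdk as cdk
from aws_cdk import aws_iam as iam, aws_s3 as s3

class AnalyticsZoneStack(cdk.Stack):
    def __init__(self, scope, construct_id, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # Encrypted, versioned, non-public bucket for the analytics zone.
        bucket = s3.Bucket(
            self, "AnalyticsZone",
            encryption=s3.BucketEncryption.S3_MANAGED,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            versioned=True,
        )

        # Read-only role for downstream consumers (e.g. Glue/Athena queries).
        reader = iam.Role(
            self, "AnalyticsReader",
            assumed_by=iam.ServicePrincipal("glue.amazonaws.com"))
        bucket.grant_read(reader)

app = cdk.App()
AnalyticsZoneStack(app, "AnalyticsZoneStack")
app.synth()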


Impact.

Scale

100TB+

of data processed daily

Quality

90%

fewer data defects reported

Detection

1 hour

detection time, down from 2 days

Optimization

35%

reduction in resource usage and processing times

Volume

Petabytes

of event data extracted, transformed, and loaded

Experience

1000s

of customers served with analytics and performance insights