
Ritu Gupta.

About.

A data engineer focused on building large-scale distributed data systems.

Designed and implemented high-volume event pipelines and analytics platforms that process hundreds of terabytes of data daily.

Recognized for strong ownership and the ability to quickly learn, adapt, and deliver impactful solutions across diverse domains and customer segments.

Passionate about building innovative, scalable, data-driven products that create meaningful impact and enhance real-world experiences.

Core Tech Stack

Python, Spark, AWS, Airflow

Customers served

Advertising, IoT, Telecom

01 Background
02 Foundations in Quality Engineering
03 Building Software and Distributed Systems
04 Designing Large-Scale Reliable Data Pipelines


Key Problems Solved.

Problem: Ad Clickstream Analytics - Data Lakehouse
Impact: Petabyte-scale event-driven data processing systems

Problem: Large-Scale Data Processing - ETL Pipeline Design
Impact: ~100 TB of data ingested, transformed, and analyzed daily

Problem: Enhanced Quality - Data Observability & Reliability
Impact: 90% fewer customer-reported data completeness issues

Problem: Scaling and Optimization - Distributed Systems
Impact: Resources and processing times optimized by 35%


Deep Dive I. System Architecture.

Objectives

  • Enable petabyte-scale analytics from high-volume advertising event streams owned by multiple engineering teams.
  • Transform raw event records into supply, cost, and click-through performance metrics.

Solutions

  • Designed a distributed Data Lakehouse architecture supporting stream-to-batch analytics across large-scale datasets.
  • Owned the design and implementation of a three-zone ETL architecture, defining ingestion, storage, and compute layers (see the ingestion sketch after the pipeline flow below).

Impact

  • Enabled petabyte-scale event processing across the analytics platform.
  • Reduced metrics availability time by ~75%, improving analytics SLAs.

Distributed Data Lakehouse

01 Source -> Event Logs
02 Ingestion -> Data Streams
03 Processing -> Micro-Batches
04 Storage -> Transformations
05 Transform -> Aggregations
06 Analytics -> Metrics & APIs
07 Consumption -> Athena + API
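
To make the stream-to-batch flow concrete, here is a minimal PySpark Structured Streaming sketch of stages 02-04: streaming raw event logs in, deriving partition columns, and writing micro-batches to lakehouse storage. The bucket paths, event schema, and five-minute trigger are illustrative assumptions, not the production configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("event-ingestion").getOrCreate()

# Hypothetical raw ad-event schema (illustrative fields only).
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),    # e.g. impression / click
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

# Stage 02, ingestion: stream raw event logs from a placeholder landing path.
raw = (spark.readStream
       .schema(event_schema)
       .json("s3://example-bucket/raw-events/"))

# Derive partition columns so the storage zone is query-friendly.
events = (raw
          .withColumn("dt", F.to_date("event_time"))
          .withColumn("hour", F.hour("event_time")))

# Stages 03-04, processing and storage: micro-batch writes to partitioned
# Parquet in the lakehouse storage zone.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://example-bucket/lakehouse/events/")
         .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
         .partitionBy("dt", "hour")
         .trigger(processingTime="5 minutes")
         .start())
query.awaitTermination()
```

Partitioning by date and hour at write time is what lets the consumption layer (Athena) prune scans down to the requested window.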


Deep Dive II. ETL Pipelines.

Raw Events (Multiple Sources)
-> Schema & File Format Validation
-> Preprocessing & Filtering
-> Exploding Nested Data, Groupings
-> Metadata Enrichment, Aggregations
-> Partitioned & Indexed Datasets
-> Analytics Tables & Service APIs

Objectives

  • Convert raw event logs into structured, analytics-ready datasets.
  • Support periodic transformations across high-volume event streams (~100TB/day).

Solutions

  • Implemented multi-stage ETL pipelines including validation, filtering, and flattening of nested event data.
  • Designed aggregation and partitioning strategies to optimize storage and query performance (see the sketch below).
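
A minimal PySpark sketch of those stages, under assumed column names, event structure, and paths (none taken from the production system):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-batch").getOrCreate()

# 1. Raw events for one day from a hypothetical landing path.
raw = spark.read.json("s3://example-bucket/raw-events/2024-01-01/")

# 2. Validation and filtering: drop rows missing required fields and keep
#    only the event types this pipeline cares about.
valid = (raw
         .filter(F.col("event_id").isNotNull() & F.col("event_time").isNotNull())
         .filter(F.col("event_type").isin("impression", "click"))
         .withColumn("dt", F.to_date("event_time")))

# 3. Flatten nested data: one row per element of the assumed `ads` array.
exploded = valid.withColumn("ad", F.explode("ads"))

# 4. Group and aggregate into analytics-ready metrics.
metrics = (exploded
           .groupBy("dt", F.col("ad.campaign_id").alias("campaign_id"))
           .agg(F.sum(F.when(F.col("event_type") == "impression", 1).otherwise(0)).alias("impressions"),
                F.sum(F.when(F.col("event_type") == "click", 1).otherwise(0)).alias("clicks"))
           .withColumn("ctr", F.col("clicks") / F.col("impressions")))

# 5. Partitioned, columnar write so downstream queries prune by date.
(metrics.write
 .mode("overwrite")
 .partitionBy("dt")
 .parquet("s3://example-bucket/analytics/campaign_metrics/"))
```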

Impact

  • Enabled transformations across ~100 TB of data daily.
  • Improved downstream query performance.


Deep Dive III. System Reliability.

Focus 01

Data Observability Platform

Objectives

  • Track completeness and timeliness across critical datasets.
  • Surface integrity issues before they reach customers.
  • Create a reliable quality signal across the platform.

Coverage

  • Completeness, timeliness, integrity, and issue resolution.
  • Checks across ingestion, transformations, and serving layers (one such check is sketched below).
  • Actionable defect visibility for data producers and consumers.
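
A minimal sketch of one such completeness check, assuming Spark-registered tables for the ingestion and serving layers; the table names, partition, and threshold are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-completeness").getOrCreate()

DT = "2024-01-01"  # partition under check (illustrative)

# Row counts for the same partition at the ingestion and serving layers.
ingested = spark.table("raw.events").filter(F.col("dt") == DT).count()
served = spark.table("analytics.events").filter(F.col("dt") == DT).count()

# Completeness: fraction of ingested rows that reached the serving layer.
completeness = served / ingested if ingested else 0.0

THRESHOLD = 0.99  # illustrative; real thresholds would be tuned per dataset
if completeness < THRESHOLD:
    # Surface the defect before a customer does: fail the check loudly.
    raise ValueError(f"Completeness {completeness:.2%} < {THRESHOLD:.0%} for dt={DT}")
```
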
Focus 02

Monitoring - Metrics, Alarms, Dashboards

Signals

  • Operational metrics across pipelines, jobs, and datasets.
  • Threshold-based alarms for failures, delays, and anomalies (see the sketch below).
  • Dashboards for execution health and quality trends.
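
A minimal boto3 sketch of the metric-plus-alarm pattern on CloudWatch; the namespace, metric and alarm names, thresholds, and SNS topic are placeholders rather than the platform's actual values:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Emit a per-run operational metric, e.g. rows processed by a pipeline.
cloudwatch.put_metric_data(
    Namespace="DataPlatform/Pipelines",  # hypothetical namespace
    MetricData=[{
        "MetricName": "RowsProcessed",
        "Dimensions": [{"Name": "Pipeline", "Value": "ad-clickstream"}],
        "Value": 123_456_789,
        "Unit": "Count",
    }],
)

# Threshold-based alarm: fire when the pipeline processes suspiciously few rows.
cloudwatch.put_metric_alarm(
    AlarmName="ad-clickstream-low-volume",
    Namespace="DataPlatform/Pipelines",
    MetricName="RowsProcessed",
    Dimensions=[{"Name": "Pipeline", "Value": "ad-clickstream"}],
    Statistic="Sum",
    Period=3600,                    # evaluate hourly
    EvaluationPeriods=1,
    Threshold=1_000_000,            # illustrative volume floor
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",   # no data at all is also a failure
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-oncall"],  # placeholder ARN
)
```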

Outcomes

  • Faster detection and response to data quality issues.
  • Clear ownership across runtime health and reliability signals.
  • Greater confidence in daily platform operations.


Distributed Systems. Scaling.

Focus 01

Scaling & Optimization

Objectives

  • Improve processing efficiency across distributed workloads.
  • Reduce resource pressure caused by skew and scaling constraints.
  • Maintain stable performance as data volumes continue to grow.

Approach

  • Identified bottlenecks in compute distribution and runtime stages.
  • Adjusted partitioning, workload balance, and execution strategies (see the salting sketch below).
  • Optimized resource allocation against observed processing patterns.
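
One common tactic behind the partitioning adjustments above is key salting. A minimal PySpark sketch under assumed table and column names (illustrative, not the actual workloads):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-salting").getOrCreate()

events = spark.table("raw.events")        # large side, skewed on campaign_id
campaigns = spark.table("dim.campaigns")  # small dimension side

SALT_BUCKETS = 16  # illustrative; sized to the observed skew

# Spread each hot key across SALT_BUCKETS synthetic sub-keys on the big side.
salted_events = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate every dimension row once per salt value on the small side.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
salted_campaigns = campaigns.crossJoin(salts)

# Joining on (campaign_id, salt) keeps a single hot campaign_id from
# landing on one executor.
joined = salted_events.join(salted_campaigns, on=["campaign_id", "salt"])
```

The trade-off is a SALT_BUCKETS-fold blow-up of the small side, so the factor is sized to the observed skew rather than set uniformly.
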
Focus 02

Operational Excellence

Practices

  • Established reliable operational ownership across critical pipelines.
  • Improved runtime visibility through monitoring and issue response loops.
  • Supported steady delivery with repeatable deployment and recovery patterns (see the sketch below).
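
A minimal sketch of the retry-and-alert shape of such a recovery loop, assuming Airflow; the DAG, task, command, and callback are illustrative:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_oncall(context):
    """Hypothetical hook into the issue-response loop (pager, ticket, ...)."""
    print(f"Task {context['task_instance'].task_id} failed; paging on-call.")


default_args = {
    "retries": 3,                          # automatic recovery first
    "retry_delay": timedelta(minutes=10),
    "retry_exponential_backoff": True,
    "on_failure_callback": notify_oncall,  # then escalate to a human
}

with DAG(
    dag_id="ad_clickstream_daily",         # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    run_etl = BashOperator(
        task_id="run_etl",
        bash_command="spark-submit etl_job.py --dt {{ ds }}",  # placeholder command
    )
```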

Outcomes

  • Reduced operational friction across large-scale distributed systems.
  • Created clearer execution health and incident response visibility.
  • Helped sustain platform quality while scaling throughput.


Impact.

Scale: 100TB+ of data processed daily

Quality: 90% fewer data defects reported

Detection: 1 hour detection time, down from 2 days

Optimization: 35% optimization in resources and processing times

Volume: Petabytes of event data extracted, transformed, and loaded

Experience: 1000s of customers served with analytics and insights