
Ritu Gupta.

About.

A data engineer focused on building large-scale distributed data systems.

Designed, developed, and implemented high-volume event pipelines and analytics platforms processing hundreds of terabytes of data daily.

Passionate about building innovative, scalable, data-driven products that create meaningful impact and enhance real-world customer experiences.

Core Tech Stack

Spark, AWS, Airflow, Python

Industries served

Advertising, IoT, Telecom

01 Background
02 Foundations in Quality Engineering
03 Building Software and Distributed Systems
04 Designing Large-Scale Reliable Data Pipelines


Key Problems Solved.

Problem

Ad Clickstream Analytics - Data Lakehouse

Impact

Petabyte-scale event-driven data processing systems

Problem

Large-Scale Data Processing - ETL Pipeline Design

Impact

~100 TB of data ingested, transformed, and analyzed daily

Problem

Enhanced Quality - Data Observability & Reliability

Impact

90% fewer customer-reported data completeness issues

Problem

Scaling and Optimization - Distributed Systems

Impact

Resources and processing times optimized by 35%


Deep Dive I. System Architecture.

Objectives

  • Enable petabyte-scale analytics over high-volume advertising event streams owned by multiple engineering teams.
  • Transform raw event records into supply, cost, and click-through performance metrics.

Solutions

  • Designed a distributed Data Lakehouse architecture supporting stream-to-batch analytics across large-scale datasets.
  • Owned the design and implementation of a three-zone ETL architecture, defining ingestion, storage, and compute layers.

Impact

  • Enabled petabyte-scale event processing across the analytics platform.
  • Reduced metrics availability time by ~75%, improving analytics SLAs.

Distributed Data Lakehouse

01 Source (Event Logs) -> 02 Ingestion (Data Streams) -> 03 Processing (Micro-Batches) -> 04 Storage (Transformations) -> 05 Transform (Aggregations) -> 06 Analytics (Metrics & APIs) -> 07 Consumption (Athena + API)
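
A minimal PySpark Structured Streaming sketch of the stream-to-batch path (stages 02-04): events are read from a stream, parsed, and landed as date-partitioned micro-batches in the storage zone. The topic name, event schema, and S3 paths are illustrative assumptions, not the production system.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_date
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("event-ingestion").getOrCreate()

# Hypothetical schema for raw advertising events.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("campaign_id", StringType()),
    StructField("event_type", StringType()),    # e.g. impression, click
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")                          # assumes a Kafka-compatible stream
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "ad-events")         # hypothetical topic name
       .load())

events = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .withColumn("event_date", to_date("event_time")))

# Micro-batches land as date-partitioned Parquet in the storage zone,
# where downstream transforms and aggregations pick them up.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://lake/raw-zone/ad-events/")             # illustrative path
         .option("checkpointLocation", "s3://lake/checkpoints/ad-events/")
         .partitionBy("event_date")
         .trigger(processingTime="5 minutes")    # micro-batch cadence
         .start())
query.awaitTermination()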


Deep Dive II. ETL Pipelines.

Raw Events - Multiple Sources
Schema, File Format Validation
Preprocessing and Filtering
Exploding Nested Data, Groupings
Metadata enrichment, Aggregations
Partitioned and Indexed Datasets
Analytics Tables & Service APIs

Objectives

  • Transform raw event logs into analytics-ready datasets per use case.
  • Support analytics across high-volume event streams (~100TB/day).

Solutions

  • Implemented multi-stage ETL pipelines including validation, filtering, and flattening of nested event data.
  • Designed aggregation and partitioning strategies to optimize query patterns across multiple datasets (see the sketch below).
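
A minimal PySpark sketch of the multi-stage flow above (validate -> filter -> explode -> aggregate -> partitioned write). Column names, the nested layout, and paths are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, explode
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.appName("etl-pipeline").getOrCreate()

# Schema/format validation: enforce an expected schema at read time.
raw = (spark.read
       .schema("request_id STRING, event_date DATE, "
               "impressions ARRAY<STRUCT<campaign_id: STRING, clicks: LONG, cost: DOUBLE>>")
       .parquet("s3://lake/raw-zone/ad-events/"))   # hypothetical source path

# Preprocessing and filtering: drop malformed records.
valid = raw.filter(col("request_id").isNotNull() & col("impressions").isNotNull())

# Explode nested data: one row per impression.
flat = (valid
        .select("request_id", "event_date", explode("impressions").alias("imp"))
        .select("request_id", "event_date", "imp.*"))

# Aggregations: campaign-level daily supply, click, and cost metrics.
metrics = (flat.groupBy("event_date", "campaign_id")
           .agg(count("*").alias("impressions"),
                sum_("clicks").alias("clicks"),
                sum_("cost").alias("cost")))

# Partitioned, analytics-ready output aligned with downstream query patterns.
(metrics.write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("s3://lake/analytics-zone/campaign_daily/"))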

Impact

  • Enabled transformations across ~100TB daily.
  • Improved downstream query performance.


Deep Dive III. System Reliability.

Focus 01

Data Observability Platform

Problem Areas

  • Track completeness and timeliness across critical datasets.
  • Surface integrity issues before they reach customers.

Solutions

  • Built data completeness checks across source, ingestion, and transformation layers (see the sketch below).
  • Automated reprocessing and backfills for late-arriving events.
  • Supported a configurable dataset onboarding process.
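
A minimal sketch of a completeness check between two layers, assuming both hold the same grain of data so row counts are comparable per dataset and date; paths, the threshold, and the backfill hook are illustrative assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("completeness-check").getOrCreate()

def completeness_ratio(source_path: str, target_path: str, event_date: str) -> float:
    """Compare row counts for one date between an upstream and a downstream layer."""
    source_count = (spark.read.parquet(source_path)
                    .filter(f"event_date = DATE'{event_date}'").count())
    target_count = (spark.read.parquet(target_path)
                    .filter(f"event_date = DATE'{event_date}'").count())
    return target_count / source_count if source_count else 1.0

ratio = completeness_ratio(
    "s3://lake/raw-zone/ad-events/",        # hypothetical ingestion layer
    "s3://lake/staging-zone/ad-events/",    # hypothetical transformation layer
    "2024-06-01",
)

# Per-dataset threshold, set during the configurable onboarding process;
# a breach triggers reprocessing/backfill for late-arriving events.
if ratio < 0.99:
    print(f"Completeness {ratio:.2%} below threshold; scheduling backfill")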

Impact

  • 90% reduction in customer-reported data defects.

Focus 02

Monitoring - Metrics, Alarms, Dashboards

Problem Areas

  • Reduce high detection and resolution times for data quality issues.
  • Reduce manual intervention and developer effort in debugging.

Solutions

  • Added threshold-based alarms for failures, delays, and anomalies (see the sketch below).
  • Created custom dashboards for data health and quality trends.
  • Integrated with ticketing systems for real-time alerting.
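
A minimal boto3 sketch of the pattern: pipelines publish data-health values as custom CloudWatch metrics, and a threshold alarm notifies the alerting/ticketing integration on breach. The namespace, metric names, and SNS topic ARN are illustrative assumptions:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a data-health measurement as a custom metric.
cloudwatch.put_metric_data(
    Namespace="DataPlatform",                    # hypothetical namespace
    MetricData=[{
        "MetricName": "CompletenessRatio",
        "Dimensions": [{"Name": "Dataset", "Value": "campaign_daily"}],
        "Value": 0.997,
        "Unit": "None",
    }],
)

# Threshold-based alarm: fires when completeness drops below 99%.
cloudwatch.put_metric_alarm(
    AlarmName="campaign_daily-completeness-low",
    Namespace="DataPlatform",
    MetricName="CompletenessRatio",
    Dimensions=[{"Name": "Dataset", "Value": "campaign_daily"}],
    Statistic="Minimum",
    Period=3600,                   # evaluate hourly
    EvaluationPeriods=1,
    Threshold=0.99,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",  # a silent pipeline is also an incident
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-quality-alerts"],
)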

Impact

  • Reduction in data quality issue detection and resolution time from 2 days to 1 hour.


Distributed Systems. Scaling.

Focus 01

Scaling and Optimization

  • Owned distributed system troubleshooting, debugging large-scale Spark and pipeline failures.
  • Scaled EMR and Glue compute workloads for long-running jobs, achieving ~35% reduction in processing time.
  • Addressed Spark data skew and shuffle bottlenecks with partitioning and salting strategies (see the sketch below).
  • Implemented batch-sizing and parallelization strategies to handle dataset volume spikes.
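
A minimal PySpark sketch of key salting for a skewed join: the hot key on the large side gets a random salt bucket, and the smaller side is replicated once per bucket so every salted key still matches. DataFrame names and the salt factor are illustrative assumptions; in practice this is weighed against options like AQE skew handling and broadcast joins.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, explode, floor, lit, rand, sequence

spark = SparkSession.builder.appName("skew-salting").getOrCreate()
SALT_BUCKETS = 16   # assumed factor; tuned per dataset volume

events = spark.read.parquet("s3://lake/raw-zone/ad-events/")   # large, skewed side
campaigns = spark.read.parquet("s3://lake/dims/campaigns/")    # smaller side

# Large side: append a random bucket to the hot join key.
salted_events = events.withColumn(
    "salted_key",
    concat_ws("#", col("campaign_id"), floor(rand() * SALT_BUCKETS).cast("string")))

# Small side: one replica per bucket, so all salted keys find a match.
salted_campaigns = (campaigns
    .withColumn("salt", explode(sequence(lit(0), lit(SALT_BUCKETS - 1))))
    .withColumn("salted_key",
                concat_ws("#", col("campaign_id"), col("salt").cast("string")))
    .drop("salt", "campaign_id"))

# Shuffle load for each hot key now spreads across SALT_BUCKETS partitions.
joined = salted_events.join(salted_campaigns, "salted_key")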
Focus 02

Operational Excellence

  • Provisioned scalable infrastructure with Infrastructure-as-Code (IaC), maintaining security and access controls (see the sketch below).
  • Improved deployment speed and pipeline stability through automated CI/CD workflows and Automation Frameworks.
  • Increased multi-level test coverage and validation mechanisms across ETL pipelines and systems.
  • Enhanced documentation across systems and products for extensibility and maintainability.
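
A minimal AWS CDK (v2, Python) sketch of the IaC pattern: an encrypted, non-public bucket for one lake zone plus a least-privilege read role. Construct names and settings are illustrative assumptions, not the actual stacks:

import aws_cdk as cdk
from aws_cdk import aws_iam as iam, aws_s3 as s3

class AnalyticsZoneStack(cdk.Stack):
    def __init__(self, scope, construct_id, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # Encrypted, versioned, non-public bucket for the analytics zone.
        bucket = s3.Bucket(
            self, "AnalyticsZone",
            encryption=s3.BucketEncryption.S3_MANAGED,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            versioned=True,
        )

        # Read-only role for downstream consumers (e.g. Glue/Athena queries).
        reader = iam.Role(
            self, "AnalyticsReader",
            assumed_by=iam.ServicePrincipal("glue.amazonaws.com"))
        bucket.grant_read(reader)

app = cdk.App()
AnalyticsZoneStack(app, "AnalyticsZoneStack")
app.synth()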


Impact.

Scale

100TB+

of data processed daily

Quality

90%

fewer data defects reported

Detection

1 hour

detection time, down from 2 days

Optimization

35%

reduction in resource usage and processing times

Volume

Petabytes

of event data extracted, transformed, and loaded

Experience

1000s

of customers served with analytics and performance insights