RG

Ritu Gupta.

Page 02

About.

A data engineer focused on building scalable, distributed data systems.

Designed, developed, and implemented high-volume event pipelines and analytics platforms processing hundreds of terabytes of data daily.

Core Tech Stack

Python, Spark, AWS, Airflow

Industries served

Advertising, IoT, Telecom

01 Background
02 Foundations in Quality Engineering
03 Building Automation and Distributed Systems
04 Designing Large-Scale Reliable Data Pipelines

Page 03

Key Problems Solved.

Problem

Ad Clickstream Analytics - Data Lakehouse

Problem

Large-Scale Data Processing - ETL Pipeline Design

Problem

Enhanced Quality - Data Observability & Reliability

Problem

Scaling and Optimization - Distributed Systems

Impact

Event-driven systems processing petabytes of data

Impact

~100 TB of data ingested, transformed, and analyzed daily

Impact

90% fewer customer-reported data completeness issues

Impact

Resource usage and processing times reduced by 35%

Page 04

Deep Dive I - System Architecture.

Distributed Data Lakehouse

  • Event-driven ingestion
  • Micro-batch Spark processing
  • Layered storage and transformation zones
  • Athena and API-based consumption
Pipeline stages

  • Source: Event Sources
  • Ingestion: Data Stream Ingestion
  • Processing: Raw Micro-Batch Processing
  • Storage: Transformations
  • Transform: Aggregations
  • Analytics: Analytics & APIs
  • Consumption: Athena + API
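The stage flow above can be sketched end to end in plain Python. This is a minimal, framework-free illustration only: stage names mirror the diagram, but the real system used streaming ingestion and Spark micro-batches, and every function body here is an assumption for demonstration.

```python
def ingest(event_sources):
    """Ingestion: merge events from all sources into one stream."""
    for source in event_sources:
        yield from source

def micro_batches(stream, size):
    """Processing: group the stream into fixed-size micro-batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

def transform(batch):
    """Transform: normalize raw events (here, just uppercase the type)."""
    return [{**e, "type": e["type"].upper()} for e in batch]

def aggregate(batches):
    """Aggregations: count events per type across all batches."""
    counts = {}
    for batch in batches:
        for e in batch:
            counts[e["type"]] = counts.get(e["type"], 0) + 1
    return counts

sources = [[{"type": "click"}, {"type": "view"}], [{"type": "click"}]]
result = aggregate(transform(b) for b in micro_batches(ingest(sources), size=2))
print(result)  # {'CLICK': 2, 'VIEW': 1}
```

In the production lakehouse, each stage also wrote to its own storage zone (raw, transformed, aggregated) before the Athena/API consumption layer.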

Page 05

Deep Dive II - ETL Pipelines.

Raw Events → Schema Validation → Preprocessing & Filtering → Explode Nested Data → Aggregations → Partitioned Datasets → Analytics Tables

ETL Transformations

  • Schema-first validation before transformations
  • Progressive refinement from raw to analytics-ready data
  • Nested event flattening for downstream aggregation
  • Partitioned outputs optimized for query performance
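The ETL stages above can be sketched without a framework. In production these would be Spark transformations; the schema, event shape, and field names below (`event_id`, `ts`, `clicks`, `campaign`) are illustrative assumptions, not the original pipeline's.

```python
from collections import defaultdict

# Assumed example schema: required fields and their types.
SCHEMA = {"event_id": str, "ts": str, "clicks": list}

def validate(event):
    """Schema-first validation: reject events missing required typed fields."""
    return all(isinstance(event.get(k), t) for k, t in SCHEMA.items())

def explode(event):
    """Flatten the nested 'clicks' list into one record per click."""
    for click in event["clicks"]:
        yield {"event_id": event["event_id"], "ts": event["ts"], **click}

def run_etl(raw_events):
    valid = [e for e in raw_events if validate(e)]        # schema validation
    flat = [row for e in valid for row in explode(e)]     # explode nested data
    agg = defaultdict(int)
    for row in flat:                                      # aggregate, keyed by
        agg[(row["ts"][:10], row["campaign"])] += 1       # (date, campaign) partition
    return dict(agg)

events = [
    {"event_id": "e1", "ts": "2024-05-01T10:00", "clicks": [{"campaign": "a"}, {"campaign": "b"}]},
    {"event_id": "e2", "ts": "2024-05-01T11:00", "clicks": [{"campaign": "a"}]},
    {"event_id": "bad", "ts": "2024-05-01"},              # rejected: no 'clicks' list
]
print(run_etl(events))  # {('2024-05-01', 'a'): 2, ('2024-05-01', 'b'): 1}
```

Keying the aggregation by date mirrors the partitioned outputs: downstream queries prune to the partitions they need instead of scanning everything.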

Page 06

Deep Dive III - Data Observability.

Focus 01

Platform for data completeness and reliability

Focus 02

Monitoring - Metrics, alerts, dashboards
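A completeness check of the kind this platform ran can be sketched as comparing records landed against records expected per partition and alerting below a threshold. The threshold, partition keys, and message format here are illustrative assumptions.

```python
def completeness_alerts(expected, actual, threshold=0.99):
    """Return an alert message for each partition whose landed/expected
    ratio falls below the completeness threshold (assumed default 99%)."""
    alerts = []
    for partition, exp_count in expected.items():
        ratio = actual.get(partition, 0) / exp_count
        if ratio < threshold:
            alerts.append(f"{partition}: completeness {ratio:.1%} < {threshold:.0%}")
    return alerts

expected = {"2024-05-01": 1000, "2024-05-02": 1000}
actual = {"2024-05-01": 1000, "2024-05-02": 950}
print(completeness_alerts(expected, actual))
# ['2024-05-02: completeness 95.0% < 99%']
```

In practice such ratios would be emitted as metrics and wired to the alerts and dashboards mentioned above, which is what drives detection time down from days to hours.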

Page 07

Distributed Systems at Scale.

Scaling & Optimization

Built around measurable objectives and a system design that keeps data quality visible, actionable, and resilient.

Operational Excellence

CI/CD

Page 08

Impact.

Scale

100TB+

of data processed daily

Quality

90%

fewer data defects reported

Detection

1 hour

detection time, down from 2 days

Optimization

35%

reduction in resource usage and processing times

Volume

Petabytes

of event data extracted, transformed, and loaded

Experience

1000s

of customers served with analytics and insights