Somya Chaudhary
Data Analyst · New Delhi, India


Data Analyst focused on solving business problems using data

3.5 years turning complex datasets into decisions that move revenue. I've worked with 20M+ records across insurance, e-commerce, and technology — building pipelines, dashboards, and analyses that stakeholders actually act on.

3.5 Years Experience
20M+ Records Handled
8+ Projects Delivered
8.5 BITS Pilani GPA

Flagship Projects

End-to-end analytics work that answered real business questions and moved real numbers.

E-Commerce Funnel · 88K Users
🏆 Hero Project

E-Commerce User Behaviour Analytics

Diagnosed a $36K revenue leak across 88,000 users
93.2% of sessions never add to cart — discovery problem, not checkout
Month-1 retention at 1.2% vs 8–15% industry benchmark
Champions (0.3% users) drive 14.3% of all revenue
MySQL · Python · Power BI · RFM · Cohort
14.26% default rate · Financial Risk · 38K Borrowers
🏆 Flagship

Financial Risk & Borrower Analysis

Identified mispriced risk across $436M loan portfolio
Grade A borrowers had highest default rate (18.5%) — classic mispriced risk
52% default spike in Q1 credit card loans — seasonal pattern uncovered
10–15% projected default reduction via verification strategy
Power BI · DAX · Power Query · Risk Analytics
Dairy 1.19x · Frozen 40% margin · 74% churned · Blinkit · 100K+ Transactions
🏆 Flagship

Blinkit Sales Intelligence

Uncovered a 74% churn crisis and growth-margin paradox
74% of customers churned — major retention gap identified
Dairy & Breakfast: highest growth (1.19x) but lowest margin (20%)
Email marketing ROI at ~105% — highest across all channels
MySQL · Power BI · RFM · Cohort · CTEs

Domain Analytics

Focused analytical studies across healthcare, global finance, and Olympic history.

🔬 OncoInsights: Breast Cancer EDA

Python · Pandas · Seaborn · 569 samples

EDA on tumor characteristics to identify features that distinguish malignant from benign — supporting data-driven diagnostic decision-making.

Radius, area & perimeter are strongest malignancy indicators
Malignant avg radius 17.46 vs benign 12.15 — 43% difference
High multicollinearity found — dimensionality reduction recommended
View on GitHub →

🌍 Global Financial Comparison 2024

Power BI · DAX · 30+ Countries

Side-by-side country financial comparison revealing the tensions between growth, stability, and inflation across global markets.

Argentina: 1.56M Merval index on 211% inflation — growth with fragility
India 6.8%, Vietnam 6.4% lead global GDP growth
Stability ≠ Growth: Switzerland, Norway post slowest growth
View on GitHub →

🏅 Olympics Data Analysis

MySQL · CTEs · Window Functions · 270K records

14 SQL queries dissecting 120 years of Olympic history — medal distributions, gender gaps, age performance peaks, and regional dominance patterns.

Peak performance age: 22–26 wins the most medals
Gender gap: 187K male vs 73K female athletes historically
USA, Russia, Germany dominate all-time medal tables
View on GitHub →

Skills & Tools

A full-stack analytics toolkit — from raw data to executive dashboard.

🐍 Languages & AI

SQL (MySQL, Redshift, Snowflake) · 95%
Python (Pandas, NumPy, Seaborn) · 88%
GenAI (SQL generation, summarization) · 82%
CTEs · Window Functions · Matplotlib · Scikit-learn

📊 BI & Visualization

Power BI (DAX, Power Query) · 92%
Advanced Excel · 88%
Tableau · 75%
Dashboard Design · Stakeholder Reporting · KPI Design

☁️ Cloud & Data Engineering

AWS (S3, Redshift, Athena, Glue) · 80%
Snowflake · 78%
dbt + ETL Pipelines · 75%
Data Warehousing · MySQL · Data Modeling

📈 Analytics Methods

Funnel & Cohort Analysis · 95%
RFM & Customer Segmentation · 92%
A/B Testing & Statistical Modeling · 85%
Root Cause Analysis · EDA · KPI Tracking · Churn Analysis

Experience

3.5 years of progressive impact across insurance, e-commerce, and technology.

Jun 2025 – Present
Data Analyst
VSM Infotech Pvt. Ltd. · New Delhi
  • Automated 12+ recurring reports saving 15+ hrs/week; Power BI dashboards contributed to 7% Q3 revenue uplift
  • Built AWS Glue ETL pipelines integrating 4 data sources into Snowflake — improved data freshness from daily to hourly
  • Integrated GenAI tools for SQL generation and anomaly summarization — reduced ad-hoc turnaround by 40%
Feb 2025 – Jun 2025
Data Analyst
Axis Max Life Insurance · Gurugram
  • Cohort retention analysis on 100K+ policy records — identified 31,000+ high-risk churn customers
  • Power BI funnel dashboards for 200+ agents drove targeted coaching, improving renewal rate by 8%
  • Automated monthly reconciliation in Python — cut report generation time by 70%
Oct 2023 – Jan 2025
Data Analyst
DigiTace Tech Solutions Pvt. Ltd. · Gurugram
  • Built sales demand forecasting model achieving 87% accuracy — reduced inventory costs by ₹14L annually
  • A/B testing and cohort churn analysis identified key drop-off points — lowered monthly churn by 6%
  • Unified KPI reporting into single Power BI dashboard — compressed reporting cycle from 3 days to 4 hours
Oct 2022 – Oct 2023
Chat Operations Analyst
Concentrix (Uber: US–Canada) · Gurugram
  • Monitored KPIs (CSAT >92%, AHT) for 40+ agents using SQL-based query pattern analysis
  • Reduced average handle time by 12% and improved SLA adherence by 10%
Jan 2022 – Sep 2022
Data Analytics Intern
AI Variant (ExcelR) · Remote
  • Executed 6+ end-to-end analytics projects — EDA, data cleaning, and stakeholder dashboards
  • Built core skills in Python, SQL, and Excel, translating raw data into actionable business insights

Education & Certifications

M.Sc. Data Science & Artificial Intelligence
BITS Pilani
Expected 2027 · GPA 8.5/10
Bachelor of Commerce (B.Com)
IGNOU
Completed 2024 · GPA 7.0/10
Certifications
Simplilearn
Data Analytics with Generative AI
April 2026
HackerRank
SQL (Intermediate) Certification
September 2025
JPMorgan Chase (Forage)
Quantitative Research Job Simulation
October 2025
McKinsey.org
Forward Learning Programme
December 2025
ExcelR Solutions
Data Analyst Certification (with Distinction)
February 2024

Analytics Playbooks

Frameworks and lessons from 3.5 years of real analytical work.

How I Handled 20M+ Records Without Breaking a Sweat
5 min read · SQL · Python · Cloud
The tools, techniques, and mindset that make massive datasets manageable — without a supercomputer.

The moment that changed how I think about scale

Someone handed me a 20M row dataset and said "figure out what's going on with churn." My laptop froze. Pandas ran for 40 minutes before I killed it. I'd been thinking about data wrong — as a file to open, not a system to query.

The biggest mindset shift: stop moving data to your tool. Move your logic to where the data lives.

Step 1 — Sample aggressively during exploration

A 1–2% sample of 20M rows is still 200K–400K rows — more than enough to understand schema and distributions. In Python: df.sample(frac=0.01). In SQL: TABLESAMPLE BERNOULLI(1). This alone saves hours.
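
A minimal sketch of that exploration pass, assuming a SQLAlchemy engine pointed at the warehouse (the table and column names are placeholders, not a real project schema):

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string; swap in your own driver and credentials.
    engine = create_engine("postgresql://user:pass@host:5432/analytics")

    # Option A: sample inside the warehouse so only ~1% of rows ever leave it.
    # TABLESAMPLE syntax varies by engine; BERNOULLI works on Postgres-style engines,
    # Snowflake uses SAMPLE / TABLESAMPLE with the same idea.
    sample = pd.read_sql("SELECT * FROM events TABLESAMPLE BERNOULLI (1)", engine)

    # Option B: if the data is already local, sample in pandas before exploring.
    # sample = pd.read_parquet("events.parquet").sample(frac=0.01, random_state=42)

    print(sample.shape)
    print(sample.dtypes)
    print(sample.describe(include="all").T.head(20))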

Step 2 — Push computation to the database

SQL is for transformation. Python is for analysis. If you're doing GROUP BY in pandas on 20M rows, you're doing it wrong. Write the aggregation in SQL, pull the summarised result — your query engine is optimised for this in ways pandas never will be.
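
As a sketch of that split, with the same assumed engine and a hypothetical events table (user_id, event_type, event_date):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@host:5432/analytics")  # placeholder

    # The GROUP BY runs inside the warehouse; pandas only ever sees the summary.
    daily_funnel = pd.read_sql(
        """
        SELECT
            event_date,
            event_type,
            COUNT(DISTINCT user_id) AS users
        FROM events
        WHERE event_date >= CURRENT_DATE - INTERVAL '90 days'
        GROUP BY event_date, event_type
        """,
        engine,
    )

    # A few hundred summary rows instead of 20M raw ones; analysis stays in pandas.
    pivot = daily_funnel.pivot(index="event_date", columns="event_type", values="users")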

Step 3 — Use window functions, not self-joins

Self-joins on large tables are query killers. ROW_NUMBER(), LAG(), NTILE() do the same job in a single pass. My cohort retention analysis cut query time from 8 minutes to 22 seconds using window functions.
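
For illustration, the shape of a cohort query built on a window function instead of a self-join (the orders table and column names are assumptions, not the actual schema):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@host:5432/analytics")  # placeholder

    # One pass over orders: each row learns its user's first order month via a
    # window function, so there is no self-join back onto the 20M-row table.
    cohorts = pd.read_sql(
        """
        WITH tagged AS (
            SELECT
                user_id,
                DATE_TRUNC('month', order_date) AS order_month,
                DATE_TRUNC('month', MIN(order_date) OVER (PARTITION BY user_id)) AS cohort_month
            FROM orders
        )
        SELECT
            cohort_month,
            order_month,
            COUNT(DISTINCT user_id) AS active_users
        FROM tagged
        GROUP BY 1, 2
        ORDER BY 1, 2
        """,
        engine,
    )
    # Month offsets and retention percentages are cheap to compute in pandas from here.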

Step 4 — Partition and index deliberately

On cloud warehouses, partitioning by date column is the single biggest performance lever. A properly partitioned table means your query only reads 1/365th of the data when you filter by date.
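
A small illustration of both halves of that idea; the paths and column names are made up, and in practice the write side is usually an ETL job rather than a notebook cell:

    import pandas as pd

    # Toy stand-in for a large events table.
    events = pd.DataFrame({
        "event_date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
        "user_id":    [1, 2, 3, 4],
        "event_type": ["view", "cart", "view", "purchase"],
    })

    # Date-partitioned Parquet: each event_date becomes its own folder, so engines
    # like Athena, Spark, or DuckDB only read the folders a query actually touches.
    events.to_parquet("data/events", partition_cols=["event_date"])

    # Pruning-friendly filter: compare the partition column directly, e.g.
    #   WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
    # Wrapping the column in a function (WHERE YEAR(event_date) = 2024) can block
    # pruning on some engines and force a full scan.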

The real skill isn't knowing which tools to use. It's knowing when to let the database do the heavy lifting.

🤖 GenAI Is My Analyst Superpower — Here's How I Use It
5 min read · GenAI · Productivity · SQL
Not hype. Actual workflows where AI makes me 3x faster without sacrificing accuracy or judgment.

The honest take

GenAI is a force multiplier for analysts who already know what they're doing. It's nearly useless — and sometimes dangerous — for those who don't.

I don't use AI to replace my thinking. I use it to eliminate the 40% of my time that was never thinking to begin with.

Where I actually use it: SQL generation

Writing boilerplate SQL is boring. I describe what I need in plain English and iterate from an AI-generated starting point. The key: I always review the output before running it. I understand what it's doing. I just didn't write the first draft from scratch.

Data summarization

After an EDA, I have 15 charts and 40 observations. Turning that into a 5-bullet executive summary used to take 45 minutes. Now I dump the key stats into a structured prompt and edit the AI draft heavily. Time: 10 minutes.

Where GenAI fails

Anything requiring business context. AI doesn't know that your "active user" definition excludes internal employees, or that the Q3 dip was a planned maintenance window. Domain knowledge is non-negotiable.

The analysts who thrive in the AI era know exactly where it helps and where it lies.

🔧 Feature Engineering: The Unglamorous Skill That Makes or Breaks ML Models
6 min read · Python · ML · EDA
Nobody talks about it at conferences. Everyone struggles with it in production.

Why it clicked for me

In my breast cancer EDA, radius, area, and perimeter were three different proxies for the same thing — tumor size — with correlations above 0.95. Including all three gave a model the same information three times. Feature quality explained more variance than algorithm selection.

A mediocre algorithm with great features beats a great algorithm with mediocre features. Every single time.

The multicollinearity trap

Correlation heatmap first, always. Any pair above 0.85–0.90 gets investigated. Keep the most interpretable feature, drop the rest.
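
A minimal version of that first pass, using the breast cancer dataset that ships with scikit-learn (the 0.85 threshold is the rule of thumb above):

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer

    X = load_breast_cancer(as_frame=True).frame.drop(columns=["target"])
    corr = X.corr().abs()

    # Visual pass: heatmap of absolute correlations.
    sns.heatmap(corr, cmap="coolwarm", vmin=0, vmax=1)
    plt.tight_layout()
    plt.show()

    # Programmatic pass: every feature pair above the threshold, strongest first.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    high_pairs = upper.stack().loc[lambda s: s > 0.85].sort_values(ascending=False)
    print(high_pairs)  # radius / perimeter / area pairs sit near the top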

Creating features that encode business logic

Raw columns rarely capture what drives outcomes. In e-commerce, I don't just use days_since_last_purchase. I create purchase_velocity, basket_size_trend, and category_diversity — these encode behaviour patterns the raw timestamps never surface.
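
A sketch of what those engineered features can look like on a simple orders table (column names like order_date, order_value, and category are illustrative, not the project schema):

    import pandas as pd

    # Toy orders table standing in for the real e-commerce data.
    orders = pd.DataFrame({
        "user_id":     [1, 1, 1, 2, 2, 3],
        "order_date":  pd.to_datetime(["2024-01-05", "2024-02-01", "2024-03-20",
                                       "2024-01-10", "2024-01-12", "2024-02-15"]),
        "order_value": [120.0, 80.0, 200.0, 45.0, 60.0, 300.0],
        "category":    ["electronics", "electronics", "home", "beauty", "beauty", "sports"],
    })
    snapshot_date = orders["order_date"].max()

    features = (
        orders.groupby("user_id")
        .agg(
            n_orders=("order_date", "count"),
            first_order=("order_date", "min"),
            last_order=("order_date", "max"),
            avg_basket=("order_value", "mean"),
            category_diversity=("category", "nunique"),  # breadth of interests
        )
        .assign(
            days_since_last_purchase=lambda d: (snapshot_date - d["last_order"]).dt.days,
            # Orders per active month: how fast the customer buys, not just how recently.
            purchase_velocity=lambda d: d["n_orders"]
            / (((d["last_order"] - d["first_order"]).dt.days / 30).clip(lower=1)),
        )
    )
    print(features)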

My feature engineering checklist

  • Check correlation matrix — flag anything above 0.85
  • Look at distributions — heavy skew often needs a log transform (see the sketch after this list)
  • Plot feature vs target — which features separate classes?
  • Create domain-meaningful aggregates and ratios
  • Scale before distance-based models, not before trees
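
A quick sketch of the skew and scaling items from the checklist (the thresholds and toy data are illustrative):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Toy frame: one heavily right-skewed feature, one roughly symmetric one.
    rng = np.random.default_rng(42)
    df = pd.DataFrame({
        "order_value": rng.lognormal(mean=3, sigma=1, size=1_000),  # right-skewed
        "sessions":    rng.normal(loc=20, scale=5, size=1_000),
    })

    # Flag heavy skew; |skew| > 1 is a common cue to try a log transform.
    skewed = df.skew().loc[lambda s: s.abs() > 1].index
    for col in skewed:
        df[f"log_{col}"] = np.log1p(df[col])

    # Scale for distance-based models (kNN, k-means, SVM, regularised linear models);
    # tree ensembles ignore monotone scaling, so skip it there.
    X_scaled = StandardScaler().fit_transform(df)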

🏗️ Stop Confusing Databases, Warehouses & Lakes — Here's the Real Difference
5 min read · Architecture · Cloud · SQL
A decision framework every analyst needs before touching a cloud data project.

Why this confusion is expensive

I've seen teams store operational data in a warehouse, run analytical queries on a transactional database, and dump raw logs without a schema. Each is a different kind of expensive mistake.

The question isn't which is better. It's: what are you trying to do, and how fast do you need it?

Databases (OLTP) — the operational layer

MySQL, PostgreSQL — built for speed at the row level. INSERT, UPDATE, DELETE. Never run analytics workloads on your production database — you'll slow down operations for real users.

Data Warehouses (OLAP) — the analytical layer

Redshift, Snowflake, BigQuery — column-oriented systems built for complex analytical queries. My pattern: raw tables land from ETL, dbt transforms them into clean models, analysts query the models.

Data Lakes — the raw storage layer

S3, ADLS — object stores at very low cost. No schema, no enforcement. Everything lands here first. With AWS Athena, you can query S3 directly using SQL without loading into a warehouse — useful for ad hoc exploration.
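
A small sketch of that ad hoc pattern using the AWS SDK for pandas (awswrangler); the Glue database and table names are placeholders and assume the S3 prefix has already been registered as an external table:

    import awswrangler as wr

    # Athena scans the S3 objects directly; nothing is loaded into a warehouse first.
    df = wr.athena.read_sql_query(
        sql="""
            SELECT status_code, COUNT(*) AS requests
            FROM access_logs
            WHERE log_date >= DATE '2024-06-01'
            GROUP BY status_code
            ORDER BY requests DESC
        """,
        database="raw",
    )
    print(df)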

The one-line decision rule

Building a product feature → database. Answering a business question → warehouse. Keep everything and figure it out later → lake.

📊 The Funnel Analysis Playbook: How to Find Where Your Business Is Bleeding Revenue
6 min read · Funnel Analysis · SQL · Business Strategy
The exact framework I used to identify $36K in recoverable revenue — and how you can apply it to any conversion problem.

Why most funnel analysis is useless

I've seen dashboards showing conversion rates at every step. A number. No context, no benchmark, no actionability. Funnel analysis isn't about measuring conversion — it's about diagnosing where intent breaks down and why.

Your funnel doesn't have a conversion problem. It has a specific problem, at a specific step, for a specific segment. Your job is to find it.

Step 1 — Build session-level funnels, not event-level

If a user refreshes a page 5 times, that's 5 events but one session. Always deduplicate first, or your funnel will look better than it is. Use ROW_NUMBER() partitioned by user, session, and funnel stage (ordered by event time) and keep only the first row, so each stage counts once per session.
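
As a sketch of that dedup step (the events table, funnel_stage, and event_time columns are assumptions, and the connection is an assumed SQLAlchemy engine):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@host:5432/analytics")  # placeholder

    funnel = pd.read_sql(
        """
        WITH ranked AS (
            SELECT
                user_id,
                session_id,
                funnel_stage,                    -- e.g. view / add_to_cart / purchase
                event_time,
                ROW_NUMBER() OVER (
                    PARTITION BY user_id, session_id, funnel_stage
                    ORDER BY event_time
                ) AS rn
            FROM events
        )
        SELECT
            funnel_stage,
            COUNT(*) AS sessions_reaching_stage  -- one row per session per stage
        FROM ranked
        WHERE rn = 1                             -- refreshes and repeats collapse to one
        GROUP BY funnel_stage
        """,
        engine,
    )
    print(funnel)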

Step 2 — Find the primary bottleneck

In my e-commerce analysis, 93.2% bounced at the view stage. Cart-to-purchase was healthy. This completely changes the recommendation: fix discovery, not checkout. Fix the biggest drop first — everything else is noise.

Step 3 — Segment every way that matters

Overall funnel hides stories. Always segment by: device, user type (new vs returning), category, time of day, traffic source, price band. Smartphones had 50K+ views but sub-2% CVR — the single highest recovery opportunity.
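
Building on a per-session funnel table like the one above, one segmented cut might look like this (the stage flags and device values are illustrative):

    import pandas as pd

    # One row per session, with flags for each funnel stage (output of the dedup step).
    sessions = pd.DataFrame({
        "session_id":    [1, 2, 3, 4, 5, 6],
        "device":        ["mobile", "mobile", "desktop", "mobile", "desktop", "desktop"],
        "viewed":        [1, 1, 1, 1, 1, 1],
        "added_to_cart": [0, 1, 1, 0, 1, 0],
        "purchased":     [0, 0, 1, 0, 1, 0],
    })

    # Same funnel, cut by segment: share of sessions reaching each stage per device.
    by_device = (
        sessions.groupby("device")[["viewed", "added_to_cart", "purchased"]]
        .mean()
        .assign(view_to_purchase=lambda d: d["purchased"] / d["viewed"])
    )
    print(by_device)
    # Repeat the same cut for user type, category, traffic source, time of day, price band.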

Step 4 — Quantify in revenue terms

Don't say "conversion is low." Say: "2,434 October buyers, 1.2% returned vs 8% benchmark. At 5% retention: 120 additional purchases × $300 AOV = $36K recoverable from a single campaign." That's a number a CEO acts on.

The deliverable of funnel analysis isn't a chart. It's a prioritised list of specific interventions, each with an estimated revenue impact.

Let's Work Together

Open to opportunities

Looking for data analyst roles where I can solve real business problems — not just build dashboards. Open to full-time positions, contract work, and interesting analytical challenges.