Somya Chaudhary
Data Analyst · New Delhi, India


Data Analyst focused on solving business problems using data

3.5 years turning complex datasets into decisions that move revenue. I've worked with 20M+ records across insurance, e-commerce, and technology — building pipelines, dashboards, and analyses that stakeholders actually act on.

3.5 Years Experience
20M+ Records Handled
8+ Projects Delivered
8.5 BITS Pilani GPA

Flagship Projects

End-to-end analytics work that answered real business questions and moved real numbers.

E-Commerce Funnel · 88K Users
🏆 Hero Project

E-Commerce User Behaviour Analytics

Diagnosed a $36K revenue leak across 88,000 users
93.2% of sessions never add to cart — discovery problem, not checkout
Month-1 retention at 1.2% vs 8–15% industry benchmark
Champions (0.3% users) drive 14.3% of all revenue
MySQL · Python · Power BI · RFM · Cohort
14.26% default rate · Financial Risk · 38K Borrowers
🏆 Flagship

Financial Risk & Borrower Analysis

Identified mispriced risk across $436M loan portfolio
Grade A borrowers had highest default rate (18.5%) — classic mispriced risk
52% default spike in Q1 credit card loans — seasonal pattern uncovered
10–15% projected default reduction via verification strategy
Power BI · DAX · Power Query · Risk Analytics
Dairy 1.19x · Frozen 40% margin · 74% churned · Blinkit · 100K+ Transactions
🏆 Flagship

Blinkit Sales Intelligence

Uncovered a 74% churn crisis and growth-margin paradox
74% of customers churned — major retention gap identified
Dairy & Breakfast: highest growth (1.19x) but lowest margin (20%)
Email marketing ROI at ~105% — highest across all channels
MySQL · Power BI · RFM · Cohort · CTEs

Domain Analytics

Focused analytical studies across healthcare, global finance, and Olympic history.

🔬 OncoInsights: Breast Cancer EDA

Python · Pandas · Seaborn · 569 samples

EDA on tumor characteristics to identify features that distinguish malignant from benign — supporting data-driven diagnostic decision-making.

Radius, area & perimeter are strongest malignancy indicators
Malignant avg radius 17.46 vs benign 12.15 — 43% difference
High multicollinearity found — dimensionality reduction recommended
View on GitHub →

🌍 Global Financial Comparison 2024

Power BI · DAX · 30+ Countries

Side-by-side country financial comparison revealing the tensions between growth, stability, and inflation across global markets.

Argentina: 1.56M Merval index on 211% inflation — growth with fragility
India 6.8%, Vietnam 6.4% lead global GDP growth
Stability ≠ Growth: Switzerland, Norway post slowest growth
View on GitHub →

🏅 Olympics Data Analysis

MySQL · CTEs · Window Functions · 270K records

14 SQL queries dissecting 120 years of Olympic history — medal distributions, gender gaps, age performance peaks, and regional dominance patterns.

Peak performance age: 22–26 wins the most medals
Gender gap: 187K male vs 73K female athletes historically
USA, Russia, Germany dominate all-time medal tables
View on GitHub →

Skills & Tools

A full-stack analytics toolkit — from raw data to executive dashboard.

🐍 Languages & AI

SQL (MySQL, Redshift, Snowflake) · 95%
Python (Pandas, NumPy, Seaborn) · 88%
GenAI (SQL generation, summarization) · 82%
CTEs · Window Functions · Matplotlib · Scikit-learn

📊 BI & Visualization

Power BI (DAX, Power Query) · 92%
Advanced Excel · 88%
Tableau · 75%
Dashboard Design · Stakeholder Reporting · KPI Design

☁️ Cloud & Data Engineering

AWS (S3, Redshift, Athena, Glue) · 80%
Snowflake · 78%
dbt + ETL Pipelines · 75%
Data Warehousing · MySQL · Data Modeling

📈 Analytics Methods

Funnel & Cohort Analysis · 95%
RFM & Customer Segmentation · 92%
A/B Testing & Statistical Modeling · 85%
Root Cause Analysis · EDA · KPI Tracking · Churn Analysis

Experience

3.5 years of progressive impact across insurance, e-commerce, and technology.

Jun 2025 – Present
Data Analyst
VSM Infotech Pvt. Ltd. · New Delhi
  • Automated 12+ recurring reports saving 15+ hrs/week; Power BI dashboards contributed to 7% Q3 revenue uplift
  • Built AWS Glue ETL pipelines integrating 4 data sources into Snowflake — improved data freshness from daily to hourly
  • Integrated GenAI tools for SQL generation and anomaly summarization — reduced ad-hoc turnaround by 40%
Feb 2025 – Jun 2025
Data Analyst
Axis Max Life Insurance · Gurugram
  • Cohort retention analysis on 100K+ policy records — identified 31,000+ high-risk churn customers
  • Power BI funnel dashboards for 200+ agents drove targeted coaching, improving renewal rate by 8%
  • Automated monthly reconciliation in Python — cut report generation time by 70%
Oct 2023 – Jan 2025
Data Analyst
DigiTace Tech Solutions Pvt. Ltd. · Gurugram
  • Built sales demand forecasting model achieving 87% accuracy — reduced inventory costs by ₹14L annually
  • A/B testing and cohort churn analysis identified key drop-off points — lowered monthly churn by 6%
  • Unified KPI reporting into single Power BI dashboard — compressed reporting cycle from 3 days to 4 hours
Oct 2022 – Oct 2023
Chat Operations Analyst
Concentrix (Uber: US–Canada) · Gurugram
  • Monitored KPIs (CSAT >92%, AHT) for 40+ agents using SQL-based query pattern analysis
  • Reduced average handle time by 12% and improved SLA adherence by 10%
Jan 2022 – Sep 2022
Data Analytics Intern
AI Variant (ExcelR) · Remote
  • Executed 6+ end-to-end analytics projects — EDA, data cleaning, and stakeholder dashboards
  • Built core skills in Python, SQL, and Excel, translating raw data into actionable business insights

Education & Certifications

M.Sc. Data Science & Artificial Intelligence
BITS Pilani
Expected 2027 · GPA 8.5/10
Bachelor of Commerce (B.Com)
IGNOU
Completed 2024 · GPA 7.0/10
Certifications
Simplilearn
Data Analytics with Generative AI
April 2026
HackerRank
SQL (Intermediate) Certification
September 2025
JPMorgan Chase (Forage)
Quantitative Research Job Simulation
October 2025
McKinsey.org
Forward Learning Programme
December 2025
ExcelR Solutions
Data Analyst Certification (with Distinction)
February 2024

Analytics Playbooks

Frameworks and lessons from 3.5 years of real analytical work.

How I Handled 20M+ Records Without Breaking a Sweat
5 min read · SQL · Python · Cloud
The tools, techniques, and mindset that make massive datasets manageable — without a supercomputer.

The moment that changed how I think about scale

Someone handed me a 20M row dataset and said "figure out what's going on with churn." My laptop froze. Pandas ran for 40 minutes before I killed it. I'd been thinking about data wrong — as a file to open, not a system to query.

The biggest mindset shift: stop moving data to your tool. Move your logic to where the data lives.

Step 1 — Sample aggressively during exploration

A 1–2% sample of 20M rows is still 200K–400K rows — more than enough to understand schema and distributions. In Python: df.sample(frac=0.01). In SQL: TABLESAMPLE BERNOULLI(1). This alone saves hours.
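
A minimal sketch of that exploration pass, assuming a SQLAlchemy engine pointed at the warehouse (the table and column names are placeholders, not a real project schema):

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string; swap in your own driver and credentials.
    engine = create_engine("postgresql://user:pass@host:5432/analytics")

    # Option A: sample inside the warehouse so only ~1% of rows ever leave it.
    # TABLESAMPLE syntax varies by engine; BERNOULLI works on Postgres-style engines,
    # Snowflake uses SAMPLE / TABLESAMPLE with the same idea.
    sample = pd.read_sql("SELECT * FROM events TABLESAMPLE BERNOULLI (1)", engine)

    # Option B: if the data is already local, sample in pandas before exploring.
    # sample = pd.read_parquet("events.parquet").sample(frac=0.01, random_state=42)

    print(sample.shape)
    print(sample.dtypes)
    print(sample.describe(include="all").T.head(20))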

Step 2 — Push computation to the database

SQL is for transformation. Python is for analysis. If you're doing GROUP BY in pandas on 20M rows, you're doing it wrong. Write the aggregation in SQL, pull the summarised result — your query engine is optimised for this in ways pandas never will be.
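
As a sketch of that split, with the same assumed engine and a hypothetical events table (user_id, event_type, event_date):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@host:5432/analytics")  # placeholder

    # The GROUP BY runs inside the warehouse; pandas only ever sees the summary.
    daily_funnel = pd.read_sql(
        """
        SELECT
            event_date,
            event_type,
            COUNT(DISTINCT user_id) AS users
        FROM events
        WHERE event_date >= CURRENT_DATE - INTERVAL '90 days'
        GROUP BY event_date, event_type
        """,
        engine,
    )

    # A few hundred summary rows instead of 20M raw ones; analysis stays in pandas.
    pivot = daily_funnel.pivot(index="event_date", columns="event_type", values="users")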

Step 3 — Use window functions, not self-joins

Self-joins on large tables are query killers. ROW_NUMBER(), LAG(), NTILE() do the same job in a single pass. My cohort retention analysis cut query time from 8 minutes to 22 seconds using window functions.
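
For illustration, the shape of a cohort query built on a window function instead of a self-join (the orders table and column names are assumptions, not the actual schema):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@host:5432/analytics")  # placeholder

    # One pass over orders: each row learns its user's first order month via a
    # window function, so there is no self-join back onto the 20M-row table.
    cohorts = pd.read_sql(
        """
        WITH tagged AS (
            SELECT
                user_id,
                DATE_TRUNC('month', order_date) AS order_month,
                DATE_TRUNC('month', MIN(order_date) OVER (PARTITION BY user_id)) AS cohort_month
            FROM orders
        )
        SELECT
            cohort_month,
            order_month,
            COUNT(DISTINCT user_id) AS active_users
        FROM tagged
        GROUP BY 1, 2
        ORDER BY 1, 2
        """,
        engine,
    )
    # Month offsets and retention percentages are cheap to compute in pandas from here.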

Step 4 — Partition and index deliberately

On cloud warehouses, partitioning by date column is the single biggest performance lever. A properly partitioned table means your query only reads 1/365th of the data when you filter by date.
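
A small illustration of both halves of that idea; the paths and column names are made up, and in practice the write side is usually an ETL job rather than a notebook cell:

    import pandas as pd

    # Toy stand-in for a large events table.
    events = pd.DataFrame({
        "event_date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
        "user_id":    [1, 2, 3, 4],
        "event_type": ["view", "cart", "view", "purchase"],
    })

    # Date-partitioned Parquet: each event_date becomes its own folder, so engines
    # like Athena, Spark, or DuckDB only read the folders a query actually touches.
    events.to_parquet("data/events", partition_cols=["event_date"])

    # Pruning-friendly filter: compare the partition column directly, e.g.
    #   WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
    # Wrapping the column in a function (WHERE YEAR(event_date) = 2024) can block
    # pruning on some engines and force a full scan.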

The real skill isn't knowing which tools to use. It's knowing when to let the database do the heavy lifting.

🤖 GenAI Is My Analyst Superpower — Here's How I Use It
5 min read · GenAI · Productivity · SQL
Not hype. Actual workflows where AI makes me 3x faster without sacrificing accuracy or judgment.

The honest take

GenAI is a force multiplier for analysts who already know what they're doing. It's nearly useless — and sometimes dangerous — for those who don't.

I don't use AI to replace my thinking. I use it to eliminate the 40% of my time that was never thinking to begin with.

Where I actually use it: SQL generation

Writing boilerplate SQL is boring. I describe what I need in plain English and iterate from an AI-generated starting point. The key: I always review the output before running it. I understand what it's doing. I just didn't write the first draft from scratch.

Data summarization

After an EDA, I have 15 charts and 40 observations. Turning that into a 5-bullet executive summary used to take 45 minutes. Now I dump the key stats into a structured prompt and edit the AI draft heavily. Time: 10 minutes.

Where GenAI fails

Anything requiring business context. AI doesn't know that your "active user" definition excludes internal employees, or that the Q3 dip was a planned maintenance window. Domain knowledge is non-negotiable.

The analysts who thrive in the AI era know exactly where it helps and where it lies.

🔧 Feature Engineering: The Unglamorous Skill That Makes or Breaks ML Models
6 min read · Python · ML · EDA
Nobody talks about it at conferences. Everyone struggles with it in production.

Why it clicked for me

In my breast cancer EDA, radius, area, and perimeter were three different proxies for the same thing — tumor size — with correlations above 0.95. Including all three gave a model the same information three times. Feature quality explained more variance than algorithm selection.

A mediocre algorithm with great features beats a great algorithm with mediocre features. Every single time.

The multicollinearity trap

Correlation heatmap first, always. Any pair above 0.85–0.90 gets investigated. Keep the most interpretable feature, drop the rest.
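
A minimal version of that first pass, using the breast cancer dataset that ships with scikit-learn (the 0.85 threshold is the rule of thumb above):

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer

    X = load_breast_cancer(as_frame=True).frame.drop(columns=["target"])
    corr = X.corr().abs()

    # Visual pass: heatmap of absolute correlations.
    sns.heatmap(corr, cmap="coolwarm", vmin=0, vmax=1)
    plt.tight_layout()
    plt.show()

    # Programmatic pass: every feature pair above the threshold, strongest first.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    high_pairs = upper.stack().loc[lambda s: s > 0.85].sort_values(ascending=False)
    print(high_pairs)  # radius / perimeter / area pairs sit near the top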

Creating features that encode business logic

Raw columns rarely capture what drives outcomes. In e-commerce, I don't just use days_since_last_purchase. I create purchase_velocity, basket_size_trend, and category_diversity — these encode behaviour patterns the raw timestamps never surface.
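
A sketch of what those engineered features can look like on a simple orders table (column names like order_date, order_value, and category are illustrative, not the project schema):

    import pandas as pd

    # Toy orders table standing in for the real e-commerce data.
    orders = pd.DataFrame({
        "user_id":     [1, 1, 1, 2, 2, 3],
        "order_date":  pd.to_datetime(["2024-01-05", "2024-02-01", "2024-03-20",
                                       "2024-01-10", "2024-01-12", "2024-02-15"]),
        "order_value": [120.0, 80.0, 200.0, 45.0, 60.0, 300.0],
        "category":    ["electronics", "electronics", "home", "beauty", "beauty", "sports"],
    })
    snapshot_date = orders["order_date"].max()

    features = (
        orders.groupby("user_id")
        .agg(
            n_orders=("order_date", "count"),
            first_order=("order_date", "min"),
            last_order=("order_date", "max"),
            avg_basket=("order_value", "mean"),
            category_diversity=("category", "nunique"),  # breadth of interests
        )
        .assign(
            days_since_last_purchase=lambda d: (snapshot_date - d["last_order"]).dt.days,
            # Orders per active month: how fast the customer buys, not just how recently.
            purchase_velocity=lambda d: d["n_orders"]
            / (((d["last_order"] - d["first_order"]).dt.days / 30).clip(lower=1)),
        )
    )
    print(features)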

My feature engineering checklist

  • Check correlation matrix — flag anything above 0.85
  • Look at distributions — heavy skew often needs a log transform (see the sketch after this list)
  • Plot feature vs target — which features separate classes?
  • Create domain-meaningful aggregates and ratios
  • Scale before distance-based models, not before trees
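
A quick sketch of the skew and scaling items from the checklist (the thresholds and toy data are illustrative):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Toy frame: one heavily right-skewed feature, one roughly symmetric one.
    rng = np.random.default_rng(42)
    df = pd.DataFrame({
        "order_value": rng.lognormal(mean=3, sigma=1, size=1_000),  # right-skewed
        "sessions":    rng.normal(loc=20, scale=5, size=1_000),
    })

    # Flag heavy skew; |skew| > 1 is a common cue to try a log transform.
    skewed = df.skew().loc[lambda s: s.abs() > 1].index
    for col in skewed:
        df[f"log_{col}"] = np.log1p(df[col])

    # Scale for distance-based models (kNN, k-means, SVM, regularised linear models);
    # tree ensembles ignore monotone scaling, so skip it there.
    X_scaled = StandardScaler().fit_transform(df)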

🏗️ Stop Confusing Databases, Warehouses & Lakes — Here's the Real Difference
5 min read · Architecture · Cloud · SQL
A decision framework every analyst needs before touching a cloud data project.

Why this confusion is expensive

I've seen teams store operational data in a warehouse, run analytical queries on a transactional database, and dump raw logs without a schema. Each is a different kind of expensive mistake.

The question isn't which is better. It's: what are you trying to do, and how fast do you need it?

Databases (OLTP) — the operational layer

MySQL, PostgreSQL — built for speed at the row level. INSERT, UPDATE, DELETE. Never run analytics workloads on your production database — you'll slow down operations for real users.

Data Warehouses (OLAP) — the analytical layer

Redshift, Snowflake, BigQuery — column-oriented systems built for complex analytical queries. My pattern: raw tables land from ETL, dbt transforms them into clean models, analysts query the models.

Data Lakes — the raw storage layer

S3, ADLS — object stores at very low cost. No schema, no enforcement. Everything lands here first. With AWS Athena, you can query S3 directly using SQL without loading into a warehouse — useful for ad hoc exploration.
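
A small sketch of that ad hoc pattern using the AWS SDK for pandas (awswrangler); the Glue database and table names are placeholders and assume the S3 prefix has already been registered as an external table:

    import awswrangler as wr

    # Athena scans the S3 objects directly; nothing is loaded into a warehouse first.
    df = wr.athena.read_sql_query(
        sql="""
            SELECT status_code, COUNT(*) AS requests
            FROM access_logs
            WHERE log_date >= DATE '2024-06-01'
            GROUP BY status_code
            ORDER BY requests DESC
        """,
        database="raw",
    )
    print(df)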

The one-line decision rule

Building a product feature → database. Answering a business question → warehouse. Keep everything and figure it out later → lake.

📊 The Funnel Analysis Playbook: How to Find Where Your Business Is Bleeding Revenue
6 min read · Funnel Analysis · SQL · Business Strategy
The exact framework I used to identify $36K in recoverable revenue — and how you can apply it to any conversion problem.

Why most funnel analysis is useless

I've seen dashboards showing conversion rates at every step. A number. No context, no benchmark, no actionability. Funnel analysis isn't about measuring conversion — it's about diagnosing where intent breaks down and why.

Your funnel doesn't have a conversion problem. It has a specific problem, at a specific step, for a specific segment. Your job is to find it.

Step 1 — Build session-level funnels, not event-level

If a user refreshes a page 5 times, that's 5 events but one session. Always deduplicate first, or your funnel will look better than it is. Use ROW_NUMBER() partitioned by user, session, and funnel stage (ordered by event time) and keep only the first row, so each stage counts once per session.
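
As a sketch of that dedup step (the events table, funnel_stage, and event_time columns are assumptions, and the connection is an assumed SQLAlchemy engine):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@host:5432/analytics")  # placeholder

    funnel = pd.read_sql(
        """
        WITH ranked AS (
            SELECT
                user_id,
                session_id,
                funnel_stage,                    -- e.g. view / add_to_cart / purchase
                event_time,
                ROW_NUMBER() OVER (
                    PARTITION BY user_id, session_id, funnel_stage
                    ORDER BY event_time
                ) AS rn
            FROM events
        )
        SELECT
            funnel_stage,
            COUNT(*) AS sessions_reaching_stage  -- one row per session per stage
        FROM ranked
        WHERE rn = 1                             -- refreshes and repeats collapse to one
        GROUP BY funnel_stage
        """,
        engine,
    )
    print(funnel)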

Step 2 — Find the primary bottleneck

In my e-commerce analysis, 93.2% bounced at the view stage. Cart-to-purchase was healthy. This completely changes the recommendation: fix discovery, not checkout. Fix the biggest drop first — everything else is noise.

Step 3 — Segment every way that matters

Overall funnel hides stories. Always segment by: device, user type (new vs returning), category, time of day, traffic source, price band. Smartphones had 50K+ views but sub-2% CVR — the single highest recovery opportunity.
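
Building on a per-session funnel table like the one above, one segmented cut might look like this (the stage flags and device values are illustrative):

    import pandas as pd

    # One row per session, with flags for each funnel stage (output of the dedup step).
    sessions = pd.DataFrame({
        "session_id":    [1, 2, 3, 4, 5, 6],
        "device":        ["mobile", "mobile", "desktop", "mobile", "desktop", "desktop"],
        "viewed":        [1, 1, 1, 1, 1, 1],
        "added_to_cart": [0, 1, 1, 0, 1, 0],
        "purchased":     [0, 0, 1, 0, 1, 0],
    })

    # Same funnel, cut by segment: share of sessions reaching each stage per device.
    by_device = (
        sessions.groupby("device")[["viewed", "added_to_cart", "purchased"]]
        .mean()
        .assign(view_to_purchase=lambda d: d["purchased"] / d["viewed"])
    )
    print(by_device)
    # Repeat the same cut for user type, category, traffic source, time of day, price band.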

Step 4 — Quantify in revenue terms

Don't say "conversion is low." Say: "2,434 October buyers, 1.2% returned vs 8% benchmark. At 5% retention: 120 additional purchases × $300 AOV = $36K recoverable from a single campaign." That's a number a CEO acts on.

The deliverable of funnel analysis isn't a chart. It's a prioritised list of specific interventions, each with an estimated revenue impact.

Let's Work Together

Open to opportunities

Looking for data analyst roles where I can solve real business problems — not just build dashboards. Open to full-time positions, contract work, and interesting analytical challenges.