The problem

Last quarter, we launched a seller onboarding training program. Our metrics team found that account managers completing the program achieved 15% higher revenue than those who didn't. Leadership pushed for mandatory rollout. But wait — did the training cause the lift, or did we enroll only our most capable managers first?

This is the counterfactual problem. For any one seller we observe either the outcome with the program, Y(1), or the outcome without it, Y(0), but never both. A simple before-after or treated-vs-untreated comparison confuses correlation with causation. Without careful design, we ship the program to managers who would have succeeded anyway, wasting budget.
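To make the trap concrete, here is a small simulation. Every number in it is invented for illustration: manager skill is an unobserved confounder that drives both enrollment and revenue, so the naive treated-vs-untreated gap badly overstates the true effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Unobserved confounder: manager skill.
skill = rng.normal(0.0, 1.0, n)

# Skilled managers are more likely to enroll in the training.
enrolled = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * skill))

# The training's true causal effect on revenue is +10 units.
TRUE_EFFECT = 10.0
revenue = 100.0 + 20.0 * skill + TRUE_EFFECT * enrolled + rng.normal(0.0, 5.0, n)

# Naive treated-vs-untreated comparison: true effect plus selection bias.
naive_estimate = revenue[enrolled].mean() - revenue[~enrolled].mean()
```

With these parameters the naive estimate lands far above the true +10, because the enrolled group was more skilled to begin with.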

I've watched teams celebrate metrics that collapse after launch because the underlying causal story was never right. This guide helps you avoid that trap.

The intuition

Causality asks: if we could rewind time and not give seller A the training, would their revenue be lower? That's the counterfactual. We never observe both worlds simultaneously. Statistics alone can't solve this — we need design.

The key insight is that randomisation breaks the link between unobserved confounders and treatment. If we randomly assign training to 50% of sellers and hold out the other 50%, then the treated and untreated groups are exchangeable on average. Any difference in outcomes is causal.
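A sketch of that exchangeability argument in simulated data (all parameters made up): skill still drives revenue, but a coin flip, not skill, decides who gets the training, so the simple difference in group means recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

skill = rng.normal(0.0, 1.0, n)   # still affects revenue...
treated = rng.random(n) < 0.5     # ...but no longer affects who is treated

TRUE_EFFECT = 10.0
revenue = 100.0 + 20.0 * skill + TRUE_EFFECT * treated + rng.normal(0.0, 5.0, n)

# With random assignment the two groups are exchangeable on average,
# so the difference in means is an unbiased causal estimate.
randomized_estimate = revenue[treated].mean() - revenue[~treated].mean()
```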

In observational data (no randomisation), we must make assumptions: perhaps we measure all confounders, or find a natural experiment. The architecture of your data and your assumptions determine what you can learn.

Randomisation breaks the link between confounders and treatment. In observational data, your assumptions are your foundation.

In practice

At Amazon, I've seen three approaches. First, run experiments — randomise exposure and measure outcomes. This is gold standard but sometimes slow or logistically hard. Second, leverage natural experiments: a sales tax policy, regional rollout, or machine learning feature flag that effectively randomises. Third, make parametric assumptions and adjust for measured confounders.

For the seller training example, we randomised: sellers were split into control (no training) and treatment (training offered). After two months, the treatment group showed a 12% revenue lift, not 15%. The three-percentage-point gap is likely selection bias in the original observational comparison: our strongest managers enrolled first. Now we can roll out with confidence.
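An experiment readout like this should ship with a confidence interval, not just a point estimate. A minimal sketch with invented per-seller revenue data, where a roughly 12% lift is baked into the simulation to match the story rather than any real numbers:

```python
import numpy as np

rng = np.random.default_rng(7)
control = rng.normal(1000.0, 300.0, 2000)     # holdout sellers
treatment = rng.normal(1120.0, 300.0, 2000)   # training offered, ~12% lift built in

diff = treatment.mean() - control.mean()
lift_pct = 100.0 * diff / control.mean()

# Normal-approximation 95% CI for the difference in means.
se = np.sqrt(treatment.var(ddof=1) / len(treatment)
             + control.var(ddof=1) / len(control))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se
```

Reporting the interval alongside the lift keeps everyone honest about how precise the readout really is.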

The hard part is knowing what you don't know. Document your assumptions, run sensitivity analyses, and replicate findings across cohorts. One clean experiment beats ten observational studies.

Going deeper (optional)

Causal inference sits at the intersection of statistics, econometrics, and computer science. Three schools dominate: the potential outcomes framework (Rubin), structural causal models built on directed acyclic graphs (Pearl), and the econometric quasi-experimental toolkit (difference-in-differences and its relatives). Each has different strengths.

Key concepts include SUTVA (no interference between units), unconfoundedness (all confounders measured), positivity (every confounder stratum has a nonzero chance of receiving both treatment and control), and consistency (well-defined counterfactuals). Violate any of these and your estimate is biased.
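Positivity is the easiest of these to check mechanically: every stratum of the measured confounders must contain both treated and untreated units. A small sketch, using a hypothetical seller-size tier as the lone confounder:

```python
import numpy as np

def positivity_holds(strata, treated):
    """True if every stratum contains both treated and untreated units."""
    for s in np.unique(strata):
        w = treated[strata == s]
        if w.all() or not w.any():
            return False
    return True

rng = np.random.default_rng(1)
tier = rng.integers(0, 4, 5_000)                  # hypothetical seller-size tier
treated = rng.random(5_000) < 0.2 + 0.15 * tier   # uptake rises with tier

ok = positivity_holds(tier, treated)        # both arms present in every tier
broken = positivity_holds(tier, tier == 3)  # tier 3 all treated, others none
```

In real data you would run this on every stratum combination you adjust for; empty cells mean the estimand is undefined there, not just noisy.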

For deeper study: Angrist & Pischke, Pearl & Mackenzie, and Cunningham's Causal Inference: The Mixtape (free online).

In potential outcomes notation, each seller i has two potential revenues, Y_i(0) and Y_i(1), and the treatment indicator W_i determines which one we observe:

Y_i = Y_i(0) if W_i = 0
Y_i = Y_i(1) if W_i = 1

The two most common estimands are the average treatment effect and the average treatment effect on the treated:

ATE = E[Y(1) - Y(0)]
ATT = E[Y(1) - Y(0) | W = 1]
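These two estimands need not coincide. In a toy simulation (all numbers assumed) where the training helps skilled managers more and skilled managers are more likely to be treated, the ATT exceeds the ATE:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
skill = rng.normal(0.0, 1.0, n)

# Potential outcomes: the effect is larger for skilled managers (assumption).
y0 = 100.0 + 20.0 * skill
y1 = y0 + 10.0 + 5.0 * skill

# Skilled managers are more likely to take the training.
w = rng.random(n) < 1.0 / (1.0 + np.exp(-skill))

ate = (y1 - y0).mean()      # population-average effect, about 10
att = (y1 - y0)[w].mean()   # effect among the treated, larger here
```

Only a simulation can lay both potential outcomes side by side; in real data we observe exactly one of y0 and y1 per unit, which is the counterfactual problem this whole piece is about.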