The problem

You want to know if your new seller onboarding flow increases 90-day survival rate. You can't randomise — the programme launched nationally. You want to measure the impact of a price change but interference between users rules out standard A/B. You want to evaluate a recommendation algorithm but cookie-based randomisation leaks between devices.

A/B testing is the gold standard. It's also frequently impossible. This piece is a decision framework for what to do instead.

The intuition

Quasi-experimental methods work by finding as-if randomisation — situations where nature, policy, or operational decisions created variation in treatment assignment that is plausibly independent of potential outcomes.

The four workhorses:

Difference-in-differences (DiD): compare the before-after change in a treated group to the before-after change in an untreated group. Assumes parallel trends — that in the absence of treatment, both groups would have changed at the same rate. Testable pre-treatment.

Regression discontinuity (RDD): exploit a threshold rule. Sellers above a GMV threshold get a dedicated account manager. Sellers just below and just above the threshold are similar except for the treatment — compare them. Requires a sharp or fuzzy threshold and enough units near the boundary.
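A minimal local-linear sketch of the account-manager example, on simulated data with a hypothetical 10,000 GMV cutoff and a hand-picked bandwidth (all numbers illustrative; a real analysis would use a data-driven bandwidth and check for bunching at the cutoff):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated sellers: gmv is the running variable; sellers at or above a
# hypothetical 10,000 GMV cutoff get an account manager (true jump = 0.15).
n = 5_000
gmv = rng.uniform(0, 20_000, n)
treated = (gmv >= 10_000).astype(int)
outcome = 0.4 + 1e-5 * gmv + 0.15 * treated + rng.normal(0, 0.1, n)
df = pd.DataFrame({"gmv": gmv, "treated": treated, "outcome": outcome})

# Local linear RDD: keep units within a bandwidth of the cutoff and let
# the slope differ on each side. The coefficient on `treated` is the jump.
cutoff, bandwidth = 10_000, 2_000
local = df.loc[(df["gmv"] - cutoff).abs() <= bandwidth].copy()
local["dist"] = local["gmv"] - cutoff
rdd = smf.ols("outcome ~ treated * dist", data=local).fit()
effect = rdd.params["treated"]
```

The estimate is local to the threshold: it tells you the effect for sellers near 10,000 GMV, not for everyone.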

Synthetic control: build a weighted counterfactual from untreated units that matches the treated unit pre-treatment. Best for aggregate-level interventions (country, region, product category) with a single treated unit.
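A sketch of the weight-fitting step on simulated aggregates: non-negative donor weights that sum to one, chosen to minimise pre-treatment error. Dedicated synthetic-control packages layer predictor weighting and placebo inference on top of this core idea.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulated aggregates: 5 untreated donor regions observed for 20
# pre-treatment periods; the treated region is, by construction, a noisy
# 60/40 mix of donors 0 and 1.
T_pre, n_donors = 20, 5
donors = 100 + rng.normal(0, 1, (T_pre, n_donors)).cumsum(axis=0)
treated_pre = 0.6 * donors[:, 0] + 0.4 * donors[:, 1] + rng.normal(0, 0.5, T_pre)

# Synthetic control weights: non-negative, sum to one, minimise
# pre-treatment mean squared error.
def pre_mse(w):
    return np.mean((treated_pre - donors @ w) ** 2)

res = minimize(
    pre_mse,
    np.full(n_donors, 1 / n_donors),
    method="SLSQP",
    bounds=[(0, 1)] * n_donors,
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],
)
weights = res.x
synthetic = donors @ weights  # the counterfactual series, pre-treatment
```

The simplex constraint (weights in [0, 1], summing to one) is what keeps the counterfactual an interpolation of real donors rather than an extrapolation.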

Instrumental variables (IV): find a variable that affects treatment assignment but has no direct effect on the outcome (the instrument). Use it to isolate exogenous variation in treatment. Classic example: distance to college as an instrument for education.
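A simulated illustration of why the instrument helps: naive OLS is confounded, while manual two-stage least squares recovers the true effect. This is a point-estimate sketch only — the second-stage OLS standard errors are not valid 2SLS standard errors, so use a dedicated IV routine for inference.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 10_000

# Simulated setup: u is an unobserved confounder, z is an instrument that
# shifts treatment d but has no direct path to the outcome y (exclusion).
u = rng.normal(0, 1, n)
z = rng.normal(0, 1, n)
d = 0.8 * z + u + rng.normal(0, 1, n)
y = 2.0 * d + u + rng.normal(0, 1, n)  # true causal effect = 2.0
df = pd.DataFrame({"z": z, "d": d, "y": y})

# Naive OLS is biased upwards because u drives both d and y.
naive = smf.ols("y ~ d", data=df).fit().params["d"]

# Manual two-stage least squares: regress d on z, then y on fitted d.
df["d_hat"] = smf.ols("d ~ z", data=df).fit().fittedvalues
iv = smf.ols("y ~ d_hat", data=df).fit().params["d_hat"]
```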

The question isn't which method is best. It's which method's assumptions are most defensible given your specific data and context.

In practice

The decision framework I use:

1. Did a sharp threshold determine treatment? → Consider RDD. Check if units can manipulate which side of the threshold they're on (which would invalidate it).

2. Do you have a panel (multiple time periods) and a plausible comparison group? → DiD. Run event studies to verify parallel pre-trends.

3. Is the treatment at an aggregate level (country, store) with multiple untreated units? → Synthetic control. Check pre-treatment fit.

4. Do you have a variable that affects treatment but not outcome directly? → IV. The hardest assumption to verify — exclusion restriction requires genuine domain knowledge.

5. None of the above? → Matching + regression adjustment, but be honest about the assumptions you're making and run sensitivity analyses.
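As a sketch of option 5 on simulated data: propensity scores from a logit, then nearest-neighbour matching on the score. This is illustrative only — a production analysis would add calipers, balance checks, and the sensitivity analyses mentioned above, and the identifying assumption (no unobserved confounders) remains unverifiable.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 4_000

# Simulated observational data: x drives both take-up d and the outcome y;
# the true treatment effect is 1.5.
x = rng.normal(0, 1, n)
d = (rng.uniform(size=n) < 1 / (1 + np.exp(-x))).astype(int)
y = 1.5 * d + 2.0 * x + rng.normal(0, 1, n)
df = pd.DataFrame({"x": x, "d": d, "y": y})

# Propensity scores from a logit, then match each treated unit to the
# nearest control on the score (with replacement).
df["ps"] = smf.logit("d ~ x", data=df).fit(disp=0).predict()
treated = df[df["d"] == 1]
control = df[df["d"] == 0].sort_values("ps").reset_index(drop=True)
idx = np.searchsorted(control["ps"].to_numpy(), treated["ps"].to_numpy())
idx = np.clip(idx, 0, len(control) - 1)
att = (treated["y"].to_numpy() - control["y"].to_numpy()[idx]).mean()
```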

For any of these, document your identifying assumption clearly in the methods section of your analysis. The assumption is the entire foundation of the causal claim — if it's wrong, nothing else saves you.

Going deeper (optional)

The parallel trends assumption in DiD is untestable for the treatment period but testable pre-treatment. The standard approach is an event study: run your DiD specification period-by-period in the pre-treatment window and verify that the coefficients are near zero. If you see pre-trends, your DiD is likely picking up existing differences, not treatment effects.
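A sketch of such an event study on a simulated panel, interacting treatment with period dummies (here with the earliest period as the reference; in practice the last pre-treatment period is a common choice):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)

# Simulated panel: 200 units over periods -4..3; the first 100 units are
# treated from period 0 onward with a true effect of 1.0.
df = pd.DataFrame(
    [(u, t) for u in range(200) for t in range(-4, 4)],
    columns=["unit", "period"],
)
df["treated"] = (df["unit"] < 100).astype(int)
hit = (df["period"] >= 0) & (df["treated"] == 1)
df["outcome"] = 0.2 * df["period"] + 1.0 * hit + rng.normal(0, 0.5, len(df))

# Event study: one treatment-by-period interaction per period.
es = smf.ols(
    "outcome ~ treated * C(period)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["unit"]})

# Interaction coefficients: pre-period ones should be near zero,
# post-period ones near the true effect.
coefs = {k: v for k, v in es.params.items() if k.startswith("treated:")}
pre = [v for k, v in coefs.items() if "[T.-" in k]
post = [v for k, v in coefs.items() if "[T.-" not in k]
```

Plotting `pre` and `post` with confidence intervals is the standard diagnostic: flat around zero before treatment, jumping after.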

Staggered-adoption DiD (where different units are treated at different times) has a now well-documented problem: with treatment effects that vary across cohorts or over time, the classic two-way fixed-effects estimator implicitly uses already-treated units as controls for newly treated ones, and the resulting weighted average can even have the wrong sign. Use the Callaway & Sant'Anna or Sun & Abraham estimators instead.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Simple 2x2 DiD with unit and time fixed effects. Only the interaction
# enters the formula: the treated and post main effects are absorbed by
# C(unit) and C(time), so writing treated * post would be perfectly
# collinear with the fixed effects.
model = smf.ols(
    "outcome ~ treated:post + C(unit) + C(time)",
    data=panel_df
).fit(cov_type="cluster", cov_kwds={"groups": panel_df["unit"]})

print(model.summary().tables[1])
# The treated:post coefficient is the DiD estimate
```