The problem

You want to estimate the causal effect of a pricing intervention on seller revenue. You have 200 features — seller age, category, geography, historical performance. Classic OLS will give you biased estimates if any of those features predict both who got treated and what their revenue is (confounders).

You could manually select controls, but that's fragile at scale. What if you could let machine learning handle the confounding, while still producing a clean, low-bias causal estimate with valid confidence intervals?

The intuition

DoubleML (Double/Debiased Machine Learning) solves the problem by splitting the task in two. Instead of one regression trying to do everything at once, you run two "nuisance" models first:

1. Predict the treatment W from covariates X. Get the residuals Ṽ = W − Ê[W|X].
2. Predict the outcome Y from covariates X. Get the residuals Ỹ = Y − Ê[Y|X].

Then regress Ỹ on Ṽ. The coefficient on Ṽ is your causal estimate.
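The three steps can be sketched on simulated data (the data-generating process and all variable names below are invented for illustration; linear nuisance models keep the example short):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 5))                          # covariates; X[:, 0] is a confounder
W = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)     # treatment depends on X
theta = 2.0                                          # true causal effect
Y = theta * W + 3.0 * X[:, 0] + rng.normal(size=n)   # outcome depends on W and X

# naive OLS of Y on W picks up the confounding through X[:, 0]
naive = LinearRegression().fit(W.reshape(-1, 1), Y).coef_[0]

# step 1: residualise treatment on X
V_res = W - LinearRegression().fit(X, W).predict(X)
# step 2: residualise outcome on X
Y_res = Y - LinearRegression().fit(X, Y).predict(X)
# step 3: regress residual on residual
theta_hat = LinearRegression().fit(V_res.reshape(-1, 1), Y_res).coef_[0]
```

With this setup the residual-on-residual estimate recovers θ ≈ 2, while the naive coefficient is inflated by the confounder.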

Why does this work? By residualising both the outcome and treatment on X, you've removed the variation in both that's explained by confounders. What's left is the variation in treatment that's not due to observed confounders — which, under the assumption that X captures all confounding, is as good as random. Regressing the cleaned outcome on the cleaned treatment gives you a causal effect, not a predictive association.

The key insight: use ML to clean out confounding, then estimate causality on the residuals. Prediction and identification are separate tasks.

In practice

The nuisance models (steps 1 and 2) can be any ML algorithm — XGBoost, random forest, neural nets. This is where the "ML" in DoubleML comes in. Because you're not using these models for inference (just for cleaning), their flexibility is a feature, not a problem.

Cross-fitting is critical: split the data into folds, train each nuisance model on all folds but one, and generate residuals on the held-out fold, rotating so every observation gets an out-of-sample residual. This prevents overfitting in the nuisance models from biasing the final estimate — a subtle but important detail that the original Chernozhukov et al. paper formalises.
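One way to get cross-fitted residuals by hand is scikit-learn's cross_val_predict, which returns only out-of-fold predictions (the data below is simulated purely for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 10))
W = X[:, 0] ** 2 + rng.normal(size=n)  # nonlinear treatment assignment

# out-of-fold predictions: each observation is predicted by a model
# that never saw it during training, which is exactly cross-fitting
W_hat = cross_val_predict(GradientBoostingRegressor(), X, W, cv=2)
V_res = W - W_hat  # cross-fitted treatment residuals
```

Doing the same for the outcome and regressing residual on residual reproduces the DoubleML pipeline by hand.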

At Amazon I used DoubleML to estimate the causal impact of account manager interventions on seller revenue, controlling for ~150 seller features. The approach found effects that OLS had been systematically overestimating by 30–40% due to positive selection into intervention.

Going deeper (optional)

The formal guarantee is that the causal estimator is root-n consistent even when the nuisance models converge at slower rates — this is the "debiasing" part. The Neyman orthogonality condition ensures the final estimate is robust to small errors in the nuisance models.
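Concretely, for the partially linear model Y = θW + g(X) + ε, the orthogonal ("partialling-out") score is

    ψ(θ) = (Ỹ − θṼ) · Ṽ,   where Ỹ = Y − Ê[Y|X] and Ṽ = W − Ê[W|X].

Setting its sample average to zero gives θ̂ = Σᵢ ỸᵢṼᵢ / Σᵢ Ṽᵢ², which is exactly the residual-on-residual regression above. First-order errors in the nuisance estimates cancel in ψ; only their second-order products remain, which is what Neyman orthogonality means.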

The Python package doubleml implements the full pipeline cleanly, including cross-fitting and inference. A minimal example (the DataFrame and column names here are illustrative):

import pandas as pd
from doubleml import DoubleMLData, DoubleMLPLR
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

# df: pandas DataFrame with outcome 'revenue', binary treatment 'treated',
# and the remaining columns used as covariates X
data = DoubleMLData(df, y_col="revenue", d_cols="treated")

ml_l = GradientBoostingRegressor()   # outcome model, Ê[Y|X]
ml_m = GradientBoostingClassifier()  # treatment model, Ê[W|X]

dml = DoubleMLPLR(data, ml_l, ml_m)  # cross-fitting handled internally
dml.fit()
print(dml.summary)  # coefficient + SE + p-value + confidence interval