Writing
Things I've worked with, written down.
Benchmarking GenAI: Beyond Vibes
Designing evaluation systems that give you launch confidence — not just high scores.
When You Can't Run an A/B Test
A decision framework for quasi-experimental methods: RDD, synthetic control, DiD, and uplift modeling.
Prompt Engineering That Holds in Production
What actually works when you're scoring 32K+ products, not just running demo notebooks.
Heterogeneous Treatment Effects and CATE
Why average effects hide the story — and how to find who actually benefits.
Guardrail Metrics: What You're Protecting
Defining the metrics you won't trade off, and how to structure experiment readouts around them.
Semantic Search at Scale: Brand Standardization
Using FAISS and embeddings to map 300K noisy brand strings to a canonical taxonomy.
Synthetic Control for Product Launches
Building counterfactual baselines when you can't run an A/B test.
CUPED and Variance Reduction
How to run faster, cheaper A/B tests using pre-experiment covariates.
Multimodal Evaluation Pipelines
Ingesting images and HTML, extracting structured signals, and measuring quality across proprietary KPIs.
DoubleML: Causal Inference Meets Machine Learning
How Double/Debiased ML separates prediction from causal estimation — with Python code.
Power Analysis: A Practical Primer
How to calculate sample size, set MDE, and avoid underpowering your most important experiments.
LLM-as-Judge: Making Models Evaluate Models
How to build a rubric-based evaluation framework, calibrate scoring, and validate with human audits.
The Potential Outcomes Framework
Rubin's causal model, counterfactuals, and what it actually means to estimate a treatment effect.
Why Causality Matters More Than Correlation
Starting with a hiring manager asking 'did this training program work?' and ending with potential outcomes.