The problem
You run an experiment for two weeks. No significant result. You extend it to four weeks. Still nothing. You extend to six weeks and finally see p = 0.048. You declare victory and ship.
This is p-hacking, and it's more common than anyone admits. The cure is to decide your sample size before you start — and stop when you reach it, regardless of the p-value. That's what power analysis is for.
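The inflation from peeking is easy to demonstrate with an A/A simulation (all numbers here are hypothetical: standard normal arms, six peeks, 2,000 simulated experiments):

```python
import numpy as np
from scipy import stats

# A/A test: both arms draw from the same distribution, so every
# "significant" result is a false positive. Peeking after each batch
# and stopping at the first p < 0.05 inflates the error rate well
# past the nominal 5%.
rng = np.random.default_rng(0)
n_sims = 2000
hits = 0
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, 600)
    b = rng.normal(0.0, 1.0, 600)
    for n in range(100, 601, 100):  # six peeks at n = 100, 200, ..., 600
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            hits += 1
            break

fpr = hits / n_sims
print(fpr)  # substantially above the nominal 0.05
```

Stopping at your pre-committed n is exactly the discipline that keeps the realised false positive rate at the α you planned for.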
The intuition
Power analysis answers: how many samples do I need to detect an effect of a given size with a given level of confidence? It connects four quantities:
- α (significance level): the false positive rate you'll tolerate (typically 0.05)
- 1 − β (power): the probability of detecting a real effect if it exists (typically 0.8); β itself is the false negative rate
- MDE (minimum detectable effect): the smallest effect you care about detecting
- σ (standard deviation): the variability in your outcome metric
Fix any three, and the fourth is determined. Most teams fix α = 0.05, power = 0.8, pick an MDE based on business context, and solve for n.
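With those defaults, solving for n in a difference-of-means test gives the standard normal-approximation formula:

```latex
n_{\text{per arm}} = \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\sigma^{2}}{\mathrm{MDE}^{2}}
```

Here z denotes the standard normal quantile: z ≈ 1.96 for α = 0.05 two-sided, and z ≈ 0.84 for 80% power.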
The MDE is the most important input, and the one teams get wrong most often. It should answer: "what's the smallest effect that would change a business decision?" Not "what effect are we hoping to see?" Those are different questions.
Setting the MDE too low isn't free: because n scales with 1/MDE², halving the MDE quadruples the required sample size. Setting it based on what you hope to see is p-hacking waiting to happen.
In practice
For a two-sided t-test on a continuous metric:
from scipy import stats
import numpy as np

def min_sample_size(baseline_mean, mde_relative, std, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sided two-sample test on a continuous metric."""
    mde_abs = baseline_mean * mde_relative
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = stats.norm.ppf(power)           # quantile for the target power
    # Normal approximation: n = 2 * ((z_alpha + z_beta) * sigma / delta)^2
    n = 2 * ((z_alpha + z_beta) * std / mde_abs) ** 2
    return int(np.ceil(n))
# Example: baseline revenue $100, MDE 5%, std $80
n = min_sample_size(100, 0.05, 80)
# n ≈ 4,019 per arm
For binary metrics (conversion rate), replace the formula with the proportion-based version. For count metrics (orders per user), consider using the delta method or bootstrapping to estimate variance.
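A sketch of the proportion-based version, using the simple approximation that sums the two Bernoulli variances (the function name and the 10% baseline conversion rate are illustrative, not from the text):

```python
import numpy as np
from scipy import stats

def min_sample_size_proportion(p_baseline, mde_relative, alpha=0.05, power=0.8):
    """Per-arm n for a two-sided two-proportion z-test (simple approximation)."""
    p1 = p_baseline
    p2 = p_baseline * (1 + mde_relative)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    # Approximate the variance term by the sum of the two Bernoulli variances
    var = p1 * (1 - p1) + p2 * (1 - p2)
    n = var * ((z_alpha + z_beta) / (p2 - p1)) ** 2
    return int(np.ceil(n))

# Hypothetical: 10% baseline conversion, 5% relative MDE
n = min_sample_size_proportion(0.10, 0.05)
```

Note how much larger this n comes out than in the continuous example: small absolute differences between rates are expensive to detect.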
One thing teams consistently underestimate: the variance of their metric. Pull at least 4 weeks of historical data to estimate it, not 1 week. Metrics with high weekly seasonality will have much larger variance than a snapshot suggests.
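A toy simulation of that effect, under an assumed weekday/weekend split (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical revenue-per-user: weekends run hotter than weekdays, so a
# single-weekday snapshot understates the variance of the pooled metric.
weekday_mean, weekend_mean, within_day_std = 80.0, 160.0, 80.0
day_means = np.tile([weekday_mean] * 5 + [weekend_mean] * 2, 4)  # 4 weeks

four_weeks = np.concatenate(
    [rng.normal(m, within_day_std, 1000) for m in day_means]
)
snapshot = rng.normal(weekday_mean, within_day_std, 1000)  # one weekday only

print(np.std(snapshot), np.std(four_weeks))  # the pooled estimate is larger
```

The between-day component adds to the within-day variance, so the four-week estimate is the one to feed into the sample-size formula.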
Going deeper (optional)
Sequential testing methods (like mSPRT or always-valid inference) let you peek at results continuously without inflating your false positive rate. They're becoming standard at mature experimentation organisations. The tradeoff: they typically require 10–30% more samples than a fixed-horizon test at the same α and power.
Clustered experiments (randomising at the store/city level instead of the user level) need a variance adjustment: standard power formulas assume independent observations, and clustering inflates variance. Use the design effect, DEFF = 1 + (m − 1) × ICC, where m is the cluster size and ICC is the intra-cluster correlation, and divide your nominal sample size by DEFF.
def power_with_clustering(n_per_arm, mde_abs, std, cluster_size, icc, alpha=0.05):
    """Achieved power for a two-sided test once clustering is accounted for."""
    deff = 1 + (cluster_size - 1) * icc   # design effect
    effective_n = n_per_arm / deff        # clustering shrinks the effective sample
    se = std * np.sqrt(2 / effective_n)   # standard error of the difference in means
    z = mde_abs / se
    # One-tail approximation; the other tail's contribution is negligible
    power = 1 - stats.norm.cdf(stats.norm.ppf(1 - alpha / 2) - z)
    return power
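A self-contained check (repeating the function so it runs on its own) shows how fast even a modest ICC erodes power at large cluster sizes. The plan below, roughly 4,000 users per arm with an MDE of $5 and std of $80, is a hypothetical example:

```python
import numpy as np
from scipy import stats

def power_with_clustering(n_per_arm, mde_abs, std, cluster_size, icc, alpha=0.05):
    deff = 1 + (cluster_size - 1) * icc
    effective_n = n_per_arm / deff
    se = std * np.sqrt(2 / effective_n)
    z = mde_abs / se
    return 1 - stats.norm.cdf(stats.norm.ppf(1 - alpha / 2) - z)

# With user-level randomisation (cluster size 1, DEFF = 1) the design hits
# its target power; spreading the same users over clusters of 200 with
# ICC = 0.05 does not.
p_user = power_with_clustering(4000, 5, 80, cluster_size=1, icc=0.05)
p_cluster = power_with_clustering(4000, 5, 80, cluster_size=200, icc=0.05)
print(round(p_user, 2), round(p_cluster, 2))  # 0.8 vs roughly 0.13
```

The fix is either many more clusters or randomising at a finer grain; adding users to existing clusters helps surprisingly little once the ICC term dominates.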