Inferensys

Minimum Detectable Effect

The Minimum Detectable Effect (MDE) is the smallest true effect size that an experiment is statistically powered to detect, given a specified sample size, significance level, and desired power.
A/B TESTING FRAMEWORKS

What is Minimum Detectable Effect?

A core statistical concept in experiment design that determines the sensitivity of an A/B test.

The Minimum Detectable Effect (MDE) is the smallest true effect size—such as a lift in a key performance indicator—that a statistical experiment is powered to detect with a specified degree of confidence, given its sample size, significance level (alpha), and desired statistical power. It is a critical input for calculating the required sample size before launching an A/B test, ensuring the experiment is not underpowered and can reliably identify meaningful performance differences between variants.

A smaller MDE requires a larger sample size to detect subtle changes, increasing experiment cost and duration. In evaluation-driven development, setting the MDE involves a business trade-off: it should reflect the smallest improvement that would justify implementing a change. MDE also trades off against statistical power: at a fixed sample size and significance level, demanding higher power raises the MDE, and holding the MDE fixed, higher power requires a larger sample. The MDE is a design parameter set before the experiment and is distinct from the effect size observed once the experiment completes.

Key Components of MDE Calculation

The Minimum Detectable Effect is not a single number but a function of several interdependent statistical parameters. Understanding these components is essential for designing a properly powered experiment.

01

Statistical Power (1 - β)

Statistical power is the probability that your test will correctly detect a true effect of the specified MDE size. It represents the test's sensitivity.

  • Standard Threshold: 80% or 90%. A power of 80% means there's a 20% chance of a Type II error (false negative).
  • Trade-off with Sample Size: Higher power requires a larger sample size, all else being equal. Raising power from 80% to 90% increases the required N by roughly a third (about 34% for a two-sided test at α = 0.05).
  • Engineering Implication: Choosing 80% vs. 90% power is a business risk decision balancing the cost of a missed opportunity against the cost of running a larger, longer experiment.
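
As a rough illustration of the power and sample size trade-off, the sketch below uses the standard normal approximation, in which the required N is proportional to (z_{1-α/2} + z_{power})². The helper name `sample_size_factor` is made up for this example, and a two-sided test is assumed.

```python
from statistics import NormalDist


def sample_size_factor(alpha: float, power: float) -> float:
    """Return (z_{1-alpha/2} + z_{power})^2 for a two-sided test.

    Under the normal approximation, required N is proportional to this value.
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    return (z(1 - alpha / 2) + z(power)) ** 2


# How much larger must N be at 90% power than at 80% power (alpha = 0.05)?
ratio = sample_size_factor(0.05, 0.90) / sample_size_factor(0.05, 0.80)
print(f"N(90% power) / N(80% power) = {ratio:.2f}")
```

Under these assumptions the ratio comes out at about 1.34, which is the extra traffic the 90% choice costs relative to 80%.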
02

Significance Level (α)

The significance level (alpha) is the probability threshold for rejecting the null hypothesis when it is actually true, i.e., the risk of a Type I error (false positive).

  • Standard Threshold: 5% (α = 0.05). This sets a 5% false positive rate.
  • Relationship to MDE: A stricter alpha (e.g., 0.01) requires stronger evidence to declare a winner, which increases the required sample size for a given MDE and power.
  • Multiple Testing Correction: If running many concurrent experiments or checking multiple metrics, the family-wise error rate inflates. Techniques like the Bonferroni correction adjust alpha downward, which directly increases the MDE or required sample size.
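
To make the multiple-testing point concrete, here is a minimal sketch of a Bonferroni adjustment and its effect on the MDE under the normal approximation. The count of five metrics is a hypothetical input, and `bonferroni_alpha` and `mde_multiplier` are illustrative names, not functions from any particular library.

```python
from statistics import NormalDist


def bonferroni_alpha(alpha: float, num_comparisons: int) -> float:
    """Per-comparison alpha that keeps the family-wise error rate at or below alpha."""
    return alpha / num_comparisons


def mde_multiplier(alpha: float, power: float) -> float:
    """z_{1-alpha/2} + z_{power}; at fixed N the MDE is proportional to this."""
    z = NormalDist().inv_cdf
    return z(1 - alpha / 2) + z(power)


adjusted = bonferroni_alpha(0.05, 5)  # five metrics: a hypothetical example
mde_inflation = mde_multiplier(adjusted, 0.80) / mde_multiplier(0.05, 0.80)
n_inflation = mde_inflation ** 2  # extra N needed to keep the original MDE

print(f"adjusted alpha = {adjusted:.3f}")
print(f"MDE grows by x{mde_inflation:.2f} at fixed N")
print(f"N grows by x{n_inflation:.2f} at fixed MDE")
```

With five comparisons the per-test alpha drops to 0.01, which under these assumptions raises the MDE by roughly 22% at fixed N, or the required N by roughly 49% at fixed MDE.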
03

Baseline Conversion Rate (p)

The baseline conversion rate is the current performance metric of your control variant before any change. For a binary metric (e.g., click-through rate), this is a proportion (p).

  • Critical Input: MDE is often expressed as a relative lift (e.g., 5%). The absolute difference the test must detect is p × (relative MDE). A 5% lift on a 10% baseline is a 0.5 percentage point absolute change.
  • Impact on Variance: The variance of a binary metric is p(1-p), which is highest at p = 0.5 and shrinks as p approaches 0 or 1. Higher variance requires a larger sample size to detect the same absolute difference. For the same relative MDE the picture reverses: the required sample scales roughly with (1-p)/p, so low baseline rates need far more data.
  • Practical Note: Use recent, reliable historical data to estimate p. An inaccurate baseline is a common cause of underpowered experiments.
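
The role of the baseline can be sketched with a common normal-approximation formula for a two-sided two-proportion z-test with equal-sized arms. The function name `n_per_arm_proportions` and the example baselines are assumptions for illustration; production calculators may use slightly different variance terms.

```python
import math
from statistics import NormalDist


def n_per_arm_proportions(baseline: float, relative_mde: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    z = NormalDist().inv_cdf
    treated = baseline * (1 + relative_mde)   # rate under the alternative
    absolute_mde = treated - baseline         # absolute difference to detect
    variance_sum = baseline * (1 - baseline) + treated * (1 - treated)
    n = (z(1 - alpha / 2) + z(power)) ** 2 * variance_sum / absolute_mde ** 2
    return math.ceil(n)


# Same 5% relative lift, different baselines.
for p in (0.02, 0.10, 0.50):
    print(f"baseline {p:.0%}: {n_per_arm_proportions(p, 0.05):,} users per arm")
```

For the 10% baseline this gives on the order of 58,000 users per arm, and the 2% baseline needs several times more for the same relative lift, which is why an inaccurate baseline estimate can leave a test badly underpowered.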
04

Sample Size (N) & Allocation

Sample size (N) is the total number of independent experimental units (e.g., users, sessions) required. It is the output of the MDE calculation but also a primary constraint.

  • Calculation: N is derived from the chosen α, power (1-β), baseline rate (p), and the absolute MDE. Formulas differ for proportions (z-test) and means (t-test).
  • 50/50 Split: The most statistically efficient allocation is an equal split between control and treatment, assuming similar variance in both arms. Deviating from this increases the total N required for the same power; a 90/10 split needs roughly 2.8 times the total N of a 50/50 split.
  • Traffic & Duration: N must be feasible given your available traffic. Required experiment duration = N / (daily traffic * allocation %). Long durations increase the risk of seasonal effects contaminating results.
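
The allocation and duration arithmetic above can be sketched as follows. The inflation factor assumes equal variance in both arms, and the traffic figures in the usage lines are hypothetical; `allocation_inflation` and `duration_days` are illustrative names.

```python
import math


def allocation_inflation(treatment_share: float) -> float:
    """Total-N multiplier relative to a 50/50 split, assuming equal variances.

    The variance of the difference scales with 1 / (f * (1 - f)),
    which is smallest at f = 0.5.
    """
    return 0.25 / (treatment_share * (1 - treatment_share))


def duration_days(total_n: int, daily_traffic: int, experiment_share: float) -> int:
    """Days needed to enroll total_n units when only a share of traffic enters the test."""
    return math.ceil(total_n / (daily_traffic * experiment_share))


print(f"90/10 split needs x{allocation_inflation(0.10):.2f} the total N of 50/50")
# Hypothetical: 115,520 users needed, 20,000 daily users, half enrolled.
print(f"duration: {duration_days(115_520, 20_000, 0.5)} days")
```

In practice many teams round the duration up to whole weeks so that each weekday is represented equally, which is a design choice rather than something the formula requires.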
05

Variance & Standard Deviation (σ)

For continuous metrics (e.g., revenue per user, session duration), the variance (σ²) or standard deviation (σ) of the underlying data is a key driver of MDE.

  • Role in Calculation: The detectable difference in means is proportional to σ / √N. Noisier data (high σ) makes it harder to detect small effects, requiring a larger N or accepting a larger MDE.
  • Estimation Challenge: Accurately estimating σ from historical data is crucial. Overestimating σ leads to an overpowered, wasteful test; underestimating leads to an underpowered test likely to miss real effects.
  • Reducing Variance: Techniques like CUPED (Controlled-experiment Using Pre-Experiment Data) can reduce σ by accounting for user-level covariates, effectively lowering the MDE for a fixed N.
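
For a continuous metric, the relationship between σ, N, and the MDE can be sketched with the normal approximation for a difference in means with equal-sized arms. The CUPED helper relies on the standard result that the adjusted variance is (1 - ρ²) times the original, where ρ is the correlation between the metric and the pre-experiment covariate; the values σ = 20 and ρ = 0.6 are made up for illustration.

```python
import math
from statistics import NormalDist


def mde_means(sigma: float, n_per_arm: int,
              alpha: float = 0.05, power: float = 0.80) -> float:
    """Absolute MDE for a two-sided difference-in-means test (normal approximation)."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * sigma * math.sqrt(2 / n_per_arm)


def cuped_sigma(sigma: float, rho: float) -> float:
    """Standard deviation after CUPED adjustment with covariate correlation rho."""
    return sigma * math.sqrt(1 - rho ** 2)


raw = mde_means(sigma=20.0, n_per_arm=10_000)
adjusted = mde_means(sigma=cuped_sigma(20.0, rho=0.6), n_per_arm=10_000)
print(f"MDE without CUPED: {raw:.3f}")
print(f"MDE with CUPED (rho = 0.6): {adjusted:.3f}")
```

With ρ = 0.6 the standard deviation, and therefore the MDE, shrinks to 80% of its unadjusted value at the same N.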
06

One-Sided vs. Two-Sided Tests

This choice defines the alternative hypothesis. A two-sided test looks for any difference (increase or decrease). A one-sided test looks for a difference in a specific direction (e.g., increase only).

  • Statistical Impact: A one-sided test at alpha = 0.05 has the same critical value as a two-sided test at alpha = 0.10. Therefore, for the same α, a one-sided test has higher power (or a smaller MDE) to detect an effect in the specified direction.
  • Appropriate Use: One-sided tests are only valid when you have a strong a priori reason that the effect cannot be in the opposite direction (e.g., removing a latency bug cannot increase latency). Misuse increases the risk of false positives.
  • Standard Practice: Two-sided tests are the default in A/B testing because they can also detect unexpected negative impacts.
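
The critical-value equivalence described above can be checked numerically. This is a small sketch using the standard normal distribution; `critical_value` and `mde_multiplier` are illustrative helper names.

```python
from statistics import NormalDist

_z = NormalDist().inv_cdf  # standard normal quantile function


def critical_value(alpha: float, two_sided: bool = True) -> float:
    """z critical value; a two-sided test splits alpha across both tails."""
    return _z(1 - alpha / 2) if two_sided else _z(1 - alpha)


def mde_multiplier(alpha: float, power: float, two_sided: bool = True) -> float:
    """At fixed N, the MDE is proportional to critical value + z_{power}."""
    return critical_value(alpha, two_sided) + _z(power)


one_sided = critical_value(0.05, two_sided=False)
two_sided_10 = critical_value(0.10, two_sided=True)
mde_ratio = (mde_multiplier(0.05, 0.80, two_sided=False)
             / mde_multiplier(0.05, 0.80, two_sided=True))

print(f"one-sided, alpha=0.05: {one_sided:.3f}")
print(f"two-sided, alpha=0.10: {two_sided_10:.3f}")
print(f"one-sided MDE / two-sided MDE at alpha=0.05: {mde_ratio:.2f}")
```

At α = 0.05 and 80% power, the one-sided MDE is about 11% smaller than the two-sided one under this approximation, which is the gain being traded against blindness to effects in the opposite direction.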

How is MDE Calculated and Applied in AI Testing?

The Minimum Detectable Effect (MDE) is a foundational statistical parameter in A/B testing that defines the sensitivity of an experiment. This section details its calculation and practical application in evaluating AI model performance.

The Minimum Detectable Effect (MDE) is the smallest true difference in a key performance metric—such as accuracy, click-through rate, or latency—that an A/B test is statistically powered to detect, given a specified sample size, significance level (alpha), and desired statistical power (1-beta). It is calculated prior to an experiment using power analysis, which balances the trade-off between the required sample size and the sensitivity needed to identify a meaningful improvement when comparing two AI models or configurations. Setting the MDE is a critical business and engineering decision, as it directly determines the experiment's duration and resource requirements.

In AI testing, the MDE is applied to determine if a new model variant's performance delta is both statistically significant and practically important. For instance, when testing a new recommendation algorithm, the MDE defines the minimum lift in conversion rate that justifies deployment. Engineers use the MDE to calculate the necessary sample size and monitor guardrail metrics to ensure the primary optimization does not cause regressions. A well-chosen MDE prevents underpowered experiments that miss real effects and overpowered ones that waste resources detecting trivial differences, ensuring efficient and conclusive model evaluation.
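
When traffic is the binding constraint, the calculation can be run in the other direction: given the N that is available, what MDE does the experiment support, and is that small enough to matter? The sketch below assumes a binary metric, equal-sized arms, and the baseline variance in both arms; the function names and the 5% deployment threshold are hypothetical.

```python
import math
from statistics import NormalDist


def absolute_mde_proportions(baseline: float, n_per_arm: int,
                             alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate absolute MDE for a two-sided two-proportion test."""
    z = NormalDist().inv_cdf
    standard_error = math.sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    return (z(1 - alpha / 2) + z(power)) * standard_error


def can_detect(baseline: float, n_per_arm: int,
               smallest_lift_worth_shipping: float) -> bool:
    """True if the relative MDE is no larger than the lift that justifies deployment."""
    relative_mde = absolute_mde_proportions(baseline, n_per_arm) / baseline
    return relative_mde <= smallest_lift_worth_shipping


# Hypothetical recommendation test: 10% baseline conversion, 5% lift worth shipping.
for n in (20_000, 60_000):
    mde = absolute_mde_proportions(0.10, n)
    print(f"n per arm = {n:,}: relative MDE = {mde / 0.10:.1%}, "
          f"adequate = {can_detect(0.10, n, 0.05)}")
```

If the check fails, the options are the ones described throughout this entry: collect more data, reduce variance, or accept that only larger effects will be detectable.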

MINIMUM DETECTABLE EFFECT

Frequently Asked Questions

Essential questions about the Minimum Detectable Effect (MDE), a core statistical concept for designing and powering A/B tests in AI and software systems.

The Minimum Detectable Effect (MDE) is the smallest true effect size that a statistical experiment is powered to detect with a specified probability, given a predetermined sample size, significance level, and statistical power. It represents the practical sensitivity threshold of your test. In the context of A/B testing a new AI model, the MDE answers the question: 'What is the smallest improvement in our primary metric (e.g., click-through rate, accuracy) that this experiment can reliably distinguish from random noise?' It is a critical input for sample size calculation, ensuring you collect enough data to have a reasonable chance of observing the effect you expect or care about.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.