Minimum Detectable Effect

The Minimum Detectable Effect (MDE) is the smallest true effect size that an experiment is statistically powered to detect, given a specified sample size, significance level, and desired power.

A/B TESTING FRAMEWORKS

What is Minimum Detectable Effect?

A core statistical concept in experiment design that determines the sensitivity of an A/B test.

The Minimum Detectable Effect (MDE) is the smallest true effect size—such as a lift in a key performance indicator—that a statistical experiment is powered to detect with a specified degree of confidence, given its sample size, significance level (alpha), and desired statistical power. It is a critical input for calculating the required sample size before launching an A/B test, ensuring the experiment is not underpowered and can reliably identify meaningful performance differences between variants.

A smaller MDE requires a larger sample size to detect subtle changes, increasing experiment cost and duration. In evaluation-driven development, setting the MDE involves a business trade-off: it should reflect the smallest improvement that would justify implementing a change. MDE and statistical power are linked through the sample size: for a fixed sample size and significance level, demanding higher power raises the MDE, while detecting the same MDE at higher power requires a larger sample. The MDE is a design parameter and is distinct from the observed effect size in a completed experiment.
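As a rough sketch of this relationship, under the common normal approximation for a two-sided test, the MDE and the per-arm sample size can be written in terms of each other. The symbols n (units per arm) and σ (per-unit standard deviation, assumed equal in both arms) are introduced here for illustration only.

```latex
\mathrm{MDE} \approx \left(z_{1-\alpha/2} + z_{1-\beta}\right)\sqrt{\frac{2\sigma^{2}}{n}}
\qquad\Longleftrightarrow\qquad
n \approx \frac{2\sigma^{2}\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{\mathrm{MDE}^{2}}
```

Because the MDE enters squared, halving the MDE roughly quadruples the required n under this approximation.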

Key Components of MDE Calculation

The Minimum Detectable Effect is not a single number but a function of several interdependent statistical parameters. Understanding these components is essential for designing a properly powered experiment.

Statistical Power (1 - β)

Statistical power is the probability that your test will correctly detect a true effect of the specified MDE size. It represents the test's sensitivity.

Standard Threshold: 80% or 90%. A power of 80% means there's a 20% chance of a Type II error (false negative).
Trade-off with Sample Size: Higher power requires a larger sample size, all else being equal. Raising power from 80% to 90% increases the required N by roughly a third (for a two-sided test at α = 0.05).
Engineering Implication: Choosing 80% vs. 90% power is a business risk decision balancing the cost of a missed opportunity against the cost of running a larger, longer experiment.
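The cost of moving from 80% to 90% power can be checked with a short calculation. This is a minimal sketch that assumes a two-sided z-test under the normal approximation, where required N scales with the squared sum of the two z-quantiles; the helper name `z_sum_squared` is hypothetical.

```python
from statistics import NormalDist


def z_sum_squared(alpha: float, power: float) -> float:
    """Return (z_{1-alpha/2} + z_{power})^2, the factor that required N scales with."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2


# Ratio of required sample sizes at 90% vs. 80% power, with alpha held at 0.05.
ratio = z_sum_squared(0.05, 0.90) / z_sum_squared(0.05, 0.80)
print(f"N at 90% power / N at 80% power = {ratio:.2f}")
```

The ratio does not depend on the baseline rate or the MDE, because both cancel when everything except power is held fixed.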

Significance Level (α)

The significance level (alpha) is the probability threshold for rejecting the null hypothesis when it is actually true, i.e., the risk of a Type I error (false positive).

Standard Threshold: 5% (α = 0.05). This sets a 5% false positive rate.
Relationship to MDE: A stricter alpha (e.g., 0.01) requires stronger evidence to declare a winner, which increases the required sample size for a given MDE and power.
Multiple Testing Correction: If running many concurrent experiments or checking multiple metrics, the family-wise error rate inflates. Techniques like the Bonferroni correction adjust alpha downward, which directly increases the MDE or required sample size.
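To see how a Bonferroni correction feeds through to sample size and MDE, the sketch below compares the corrected and uncorrected z-quantile sums. It assumes a two-sided z-test, 80% power, and a hypothetical case of five comparisons; the function names are illustrative.

```python
from statistics import NormalDist


def z_total(alpha: float, power: float = 0.80) -> float:
    """Sum of the two-sided critical value and the power quantile."""
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)


def bonferroni_inflation(num_comparisons: int, alpha: float = 0.05, power: float = 0.80):
    """Return (N inflation at fixed MDE, MDE inflation at fixed N)."""
    corrected_alpha = alpha / num_comparisons  # Bonferroni: split alpha evenly
    mde_factor = z_total(corrected_alpha, power) / z_total(alpha, power)
    return mde_factor ** 2, mde_factor


n_factor, mde_factor = bonferroni_inflation(num_comparisons=5)
print(f"Required N grows by a factor of {n_factor:.2f} at a fixed MDE")
print(f"MDE grows by a factor of {mde_factor:.2f} at a fixed N")
```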

Baseline Conversion Rate (p)

The baseline conversion rate is the current performance metric of your control variant before any change. For a binary metric (e.g., click-through rate), this is a proportion (p).

Critical Input: MDE is often expressed as a relative lift (e.g., 5%). The absolute difference the test must detect is p * (MDE relative). A 5% lift on a 10% baseline is a 0.5 percentage point absolute change.
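Impact on Variance: The variance of a proportion is p(1-p). This variance is highest when p = 0.5 and decreases as p approaches 0 or 1, so for a fixed absolute difference, a baseline near 0.5 requires the largest sample. For a fixed relative MDE the pattern reverses: as p falls, the squared absolute difference shrinks faster than the variance, so low baseline rates require larger samples.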
Practical Note: Use recent, reliable historical data to estimate p. An inaccurate baseline is a common cause of underpowered experiments.
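The conversion from relative lift to absolute difference, and its interaction with the baseline variance, can be illustrated as follows. This is a sketch only: `noise_to_signal` is a hypothetical helper returning p(1-p) divided by the squared absolute MDE, a quantity the required N is proportional to under the normal approximation.

```python
def absolute_mde(baseline: float, relative_lift: float) -> float:
    """Absolute difference implied by a relative lift on a baseline proportion."""
    return baseline * relative_lift


def noise_to_signal(baseline: float, relative_lift: float) -> float:
    """p(1-p) / (absolute MDE)^2; required N is proportional to this value."""
    return baseline * (1 - baseline) / absolute_mde(baseline, relative_lift) ** 2


# Same 5% relative lift, three hypothetical baselines.
for p in (0.02, 0.10, 0.50):
    print(
        f"baseline={p:.2f}  absolute MDE={absolute_mde(p, 0.05):.4f}  "
        f"relative N factor={noise_to_signal(p, 0.05):,.0f}"
    )
```

With the relative lift held constant, the factor is largest for the lowest baseline, which is why rare-event metrics tend to need the most traffic.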

Sample Size (N) & Allocation

Sample size (N) is the total number of independent experimental units (e.g., users, sessions) required. It is the output of the MDE calculation but also a primary constraint.

Calculation: N is derived from the chosen α, power (1-β), baseline rate (p), and the absolute MDE. Formulas differ for proportions (z-test) and means (t-test).
50/50 Split: The most statistically efficient allocation is an equal split between control and treatment. Deviating from this (e.g., 90/10) increases the total N required for the same power.
Traffic & Duration: N must be feasible given your available traffic. Required experiment duration = N / (daily traffic * allocation %). Long durations increase the risk of seasonal effects contaminating results.
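The calculation, allocation, and duration points above can be sketched together. The code assumes a two-sided two-proportion z-test with the unpooled normal approximation; the function names, the 20,000 daily users, and the 50% experiment allocation are hypothetical values chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def total_sample_size(p_control, p_treatment, treatment_share=0.5, alpha=0.05, power=0.80):
    """Total units (both arms) needed to detect p_treatment vs. p_control."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    var_c = p_control * (1 - p_control)
    var_t = p_treatment * (1 - p_treatment)
    delta = p_treatment - p_control
    # Var(difference) = var_c / (N * (1 - share)) + var_t / (N * share); solve for N.
    n = z_total ** 2 * (var_c / (1 - treatment_share) + var_t / treatment_share) / delta ** 2
    return ceil(n)


def duration_days(total_n, daily_traffic, allocation=1.0):
    """Days needed when `allocation` is the share of daily traffic in the experiment."""
    return ceil(total_n / (daily_traffic * allocation))


n_equal = total_sample_size(0.10, 0.105)                         # 50/50 split
n_skewed = total_sample_size(0.10, 0.105, treatment_share=0.10)  # 90/10 split
days = duration_days(n_equal, daily_traffic=20_000, allocation=0.5)
print(f"50/50 total N: {n_equal:,}  90/10 total N: {n_skewed:,}  duration: {days} days")
```

In this example the 90/10 split needs well over twice the total traffic of the equal split for the same power.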

Variance & Standard Deviation (σ)

For continuous metrics (e.g., revenue per user, session duration), the variance (σ²) or standard deviation (σ) of the underlying data is a key driver of MDE.

Role in Calculation: The detectable difference in means is proportional to σ / √N. Noisier data (high σ) makes it harder to detect small effects, requiring a larger N or accepting a larger MDE.
Estimation Challenge: Accurately estimating σ from historical data is crucial. Overestimating σ leads to an overpowered, wasteful test; underestimating leads to an underpowered test likely to miss real effects.
Reducing Variance: Techniques like CUPED (Controlled-experiment Using Pre-Experiment Data) can reduce σ by accounting for user-level covariates, effectively lowering the MDE for a fixed N.
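For a continuous metric, the effect of σ and of variance reduction on the MDE can be sketched as below. The code assumes equal arm sizes, equal variance in both arms, and a two-sided z-test. It models CUPED only through the standard result that the adjusted variance is (1 - ρ²) times the original, where ρ is the correlation between the metric and the pre-experiment covariate. The function name and input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mde_for_means(sigma, n_per_arm, alpha=0.05, power=0.80, covariate_corr=0.0):
    """Approximate MDE for a difference in means, optionally with CUPED-style adjustment."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # CUPED shrinks the variance by (1 - rho^2); rho = 0 means no adjustment.
    adjusted_sigma = sigma * sqrt(1 - covariate_corr ** 2)
    return z_total * adjusted_sigma * sqrt(2 / n_per_arm)


plain = mde_for_means(sigma=20.0, n_per_arm=10_000)
with_cuped = mde_for_means(sigma=20.0, n_per_arm=10_000, covariate_corr=0.5)
print(f"MDE without adjustment: {plain:.3f}")
print(f"MDE with rho = 0.5:     {with_cuped:.3f}")
```

A covariate correlation of 0.5 lowers the MDE by about 13% here, which is equivalent to running the unadjusted test with a third more users.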

One-Sided vs. Two-Sided Tests

This choice defines the alternative hypothesis. A two-sided test looks for any difference (increase or decrease). A one-sided test looks for a difference in a specific direction (e.g., increase only).

Statistical Impact: A one-sided test at alpha = 0.05 has the same critical value as a two-sided test at alpha = 0.10. Therefore, for the same α, a one-sided test has higher power (or a smaller MDE) to detect an effect in the specified direction.
Appropriate Use: One-sided tests are only valid when you have a strong a priori reason that the effect cannot be in the opposite direction (e.g., removing a latency bug cannot increase latency). Misuse increases the risk of false positives.
Standard Practice: Two-sided tests are the default in A/B testing for fairness, as they guard against unexpected negative impacts.
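The critical-value equivalence, and what it means for sample size, can be checked numerically. This sketch assumes a z-test under the normal approximation and 80% power; the function names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()


def critical_value(alpha: float, two_sided: bool = True) -> float:
    """Critical z-value for the given alpha and test direction."""
    return z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)


def relative_sample_size(alpha: float = 0.05, power: float = 0.80) -> float:
    """N for a one-sided test divided by N for a two-sided test at the same alpha and power."""
    z_power = z.inv_cdf(power)
    one_sided = critical_value(alpha, two_sided=False) + z_power
    two_sided = critical_value(alpha) + z_power
    return (one_sided / two_sided) ** 2


print(f"one-sided, alpha=0.05: {critical_value(0.05, two_sided=False):.3f}")
print(f"two-sided, alpha=0.10: {critical_value(0.10):.3f}")
print(f"one-sided N / two-sided N at alpha=0.05: {relative_sample_size():.2f}")
```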

How is MDE Calculated and Applied in AI Testing?

The Minimum Detectable Effect (MDE) is a foundational statistical parameter in A/B testing that defines the sensitivity of an experiment. This section details its calculation and practical application in evaluating AI model performance.

The Minimum Detectable Effect (MDE) is the smallest true difference in a key performance metric—such as accuracy, click-through rate, or latency—that an A/B test is statistically powered to detect, given a specified sample size, significance level (alpha), and desired statistical power (1-beta). It is calculated prior to an experiment using power analysis, which balances the trade-off between the required sample size and the sensitivity needed to identify a meaningful improvement when comparing two AI models or configurations. Setting the MDE is a critical business and engineering decision, as it directly determines the experiment's duration and resource requirements.

In AI testing, the MDE is applied to determine if a new model variant's performance delta is both statistically significant and practically important. For instance, when testing a new recommendation algorithm, the MDE defines the minimum lift in conversion rate that justifies deployment. Engineers use the MDE to calculate the necessary sample size and monitor guardrail metrics to ensure the primary optimization does not cause regressions. A well-chosen MDE prevents underpowered experiments that miss real effects and over-powered ones that waste resources detecting trivial differences, ensuring efficient and conclusive model evaluation.
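When traffic is the binding constraint, the same power analysis can be run in reverse to ask what MDE a fixed sample supports. The sketch below assumes a conversion-rate metric, a two-sided z-test, equal arm sizes, and the simplification that both arms have the baseline variance. The 10% baseline and 50,000 users per arm are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mde_for_sample_size(baseline, n_per_arm, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a proportion, given a fixed per-arm sample size."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Standard error of the difference, using the baseline variance for both arms.
    se_diff = sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    return z_total * se_diff


mde_abs = mde_for_sample_size(baseline=0.10, n_per_arm=50_000)
print(f"absolute MDE: {mde_abs:.4f}  relative MDE: {mde_abs / 0.10:.1%}")
```

If the resulting relative MDE is larger than the lift that would justify deploying the new model, the experiment needs more traffic, a longer run, or a lower-variance metric.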

KEY DIFFERENCES

MDE vs. Related Statistical Concepts

This table clarifies the distinct role of Minimum Detectable Effect (MDE) by contrasting it with other fundamental statistical and experimental concepts used in A/B testing and causal inference.

Concept	Definition	Primary Role in Experimentation	Relationship to MDE
Minimum Detectable Effect (MDE)	The smallest true effect size an experiment is statistically powered to detect, given a specified sample size, significance level (α), and power (1-β).	A design parameter used for sample size calculation and power analysis before an experiment begins.	Core concept being defined.
Statistical Power (1-β)	The probability that a test will correctly reject a false null hypothesis (i.e., detect a true effect).	A target probability (e.g., 80%) set during experiment design to ensure adequate sensitivity.	MDE is calculated based on a desired level of power. For a fixed sample size and α, demanding higher power raises the MDE; detecting the same MDE at higher power requires a larger sample.
P-Value	The probability, assuming the null hypothesis is true, of observing an effect at least as extreme as the one in the sample data.	A post-experiment metric used to determine if an observed effect is statistically significant.	The p-value is calculated from observed data, whereas the MDE is fixed at design time. A significant result (p < α) with an observed effect at or above the pre-specified MDE is both statistically and practically meaningful; a non-significant result from a test powered for the MDE suggests that a true effect of that size is unlikely.
Confidence Interval	A range of values, derived from sample data, that is likely to contain the true population parameter (e.g., the true effect size) with a specified confidence level (e.g., 95%).	Provides a post-experiment estimate of the magnitude and uncertainty of an observed effect.	A well-powered experiment (designed with an appropriate MDE) will typically yield a confidence interval that is narrow enough to be informative. If the MDE was 2% and the 95% CI is [0.1%, 0.5%], the true effect is likely smaller than the MDE the test was designed to detect.
Statistical Significance	A determination that an observed effect is unlikely to be due to random chance, typically declared when a p-value falls below a pre-defined significance level (α).	A binary outcome (significant/not significant) based on comparing the p-value to alpha (α).	Achieving statistical significance does not mean the effect is practically important. MDE grounds significance in practical importance by defining the smallest effect considered meaningful for the business.
Average Treatment Effect (ATE)	The average causal difference in outcomes between the treatment and control groups across the entire population.	The target estimand in a causal inference study or A/B test; the 'true effect' we are trying to measure.	MDE is the smallest ATE that the experiment has a good chance (power) of detecting as statistically significant. The observed ATE is compared to the MDE for practical interpretation.
Sample Size (N)	The number of experimental units (e.g., users, sessions) included in the study.	A key design variable that directly impacts an experiment's cost, duration, and statistical precision.	Sample size is calculated from the chosen MDE, significance level (α), and power (1-β). A smaller MDE requires a larger sample size to detect.
Guardrail Metric	A secondary performance or system health indicator monitored during an experiment to ensure optimization of a primary metric does not cause unacceptable degradation.	A risk mitigation tool to protect user experience and business fundamentals during experimentation.	Independent of MDE calculation. While MDE is set for the primary metric, guardrail metrics are monitored for any negative movement, regardless of statistical thresholds.
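Several rows of the table meet in the post-experiment readout, where the observed effect, its confidence interval, and the p-value are compared with the MDE chosen at design time. The following is a minimal sketch for a conversion metric, assuming a Wald interval and an unpooled standard error; the counts and the 0.005 design MDE are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_in_proportions(x_control, n_control, x_treatment, n_treatment, alpha=0.05):
    """Return (observed difference, (ci_low, ci_high), two-sided p-value)."""
    p_c = x_control / n_control
    p_t = x_treatment / n_treatment
    diff = p_t - p_c
    se = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    z = NormalDist()
    half_width = z.inv_cdf(1 - alpha / 2) * se
    p_value = 2 * (1 - z.cdf(abs(diff) / se))
    return diff, (diff - half_width, diff + half_width), p_value


design_mde = 0.005  # absolute MDE fixed before the experiment
diff, (ci_low, ci_high), p_value = diff_in_proportions(5_000, 50_000, 5_300, 50_000)
print(f"observed diff={diff:.4f}  95% CI=[{ci_low:.4f}, {ci_high:.4f}]  p={p_value:.4f}")
print("significant:", p_value < 0.05, "| observed effect at or above MDE:", diff >= design_mde)
```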

MINIMUM DETECTABLE EFFECT

Frequently Asked Questions

Essential questions about the Minimum Detectable Effect (MDE), a core statistical concept for designing and powering A/B tests in AI and software systems.

The Minimum Detectable Effect (MDE) is the smallest true effect size that a statistical experiment is powered to detect with a specified probability, given a predetermined sample size, significance level, and statistical power. It represents the practical sensitivity threshold of your test. In the context of A/B testing a new AI model, the MDE answers the question: 'What is the smallest improvement in our primary metric (e.g., click-through rate, accuracy) that this experiment can reliably distinguish from random noise?' It is a critical input for sample size calculation, ensuring you collect enough data to have a reasonable chance of observing the effect you expect or care about.

Related Terms

The Minimum Detectable Effect is a core parameter in experimental design. Understanding these related statistical and methodological concepts is essential for planning robust A/B tests and interpreting their results.

Statistical Power

Statistical power is the probability that a hypothesis test will correctly reject a false null hypothesis. It represents the test's sensitivity to detect a true effect when one exists.

Directly linked to MDE: A higher desired power (e.g., 90% vs. 80%) requires a larger sample size to detect the same MDE, or results in a larger MDE for the same sample size.
Calculated as 1 - β, where β is the probability of a Type II error (failing to detect a true effect).
Along with significance level (α) and sample size, power is a key input for calculating the MDE before an experiment begins.

Sample Size

Sample size is the number of observations or experimental units (e.g., users, sessions) included in a study. It is a primary determinant of an experiment's precision and its ability to detect an effect.

Has an inverse relationship with MDE: For a fixed power and significance level, a larger sample size enables the detection of a smaller Minimum Detectable Effect.
Sample size calculations are performed during experiment planning to ensure the test is adequately powered to detect the business-relevant MDE.
Insufficient sample size leads to underpowered experiments that are likely to miss meaningful effects, resulting in inconclusive or misleading 'no significant difference' findings.

Effect Size

Effect size is a quantitative measure of the magnitude of a phenomenon or the difference between two groups. Unlike a p-value, the underlying effect size does not depend on sample size; only the precision of its estimate does.

The MDE is a planned or target effect size. It is the threshold of practical importance that the experiment is designed to detect.
Common measures include Cohen's d (standardized mean difference), relative risk, and percentage lift (e.g., a 2% increase in conversion rate).
After an experiment, the observed effect size is calculated from the data. If the observed effect meets or exceeds the pre-defined MDE and is statistically significant, the result is considered both reliable and meaningful.

Significance Level (Alpha)

The significance level, denoted by alpha (α), is the probability of rejecting the null hypothesis when it is actually true (a Type I error or false positive).

It is the threshold for declaring statistical significance. A common standard is α = 0.05, meaning a 5% risk of a false positive.
Along with power and sample size, α is a critical input for calculating the MDE. A more stringent alpha (e.g., 0.01) requires a larger sample size to detect the same MDE, or results in a larger MDE for a fixed sample size.
Setting α involves a trade-off between the risk of false positives and the sensitivity (power) of the test.

Practical Significance

Practical significance asks whether a statistically detected effect is large enough to be of real-world value or worth the cost of implementation. It is a business judgment, not a statistical one.

The MDE is explicitly defined to represent practical significance. An experiment should be powered to detect the smallest effect that would justify a business decision (e.g., launching a new model).
A result can be statistically significant (unlikely due to chance) but not practically significant if the observed effect is tiny and below the MDE threshold.
Defining the MDE forces alignment between data scientists and product/business stakeholders on what constitutes a meaningful outcome before the experiment runs.

Type II Error (Beta)

A Type II error occurs when a hypothesis test fails to reject a false null hypothesis, meaning a true effect is missed. The probability of a Type II error is denoted by beta (β).

Statistical power is 1 - β. If β=0.20, power is 0.80 (80%).
The MDE is defined in relation to β. The test is designed to have a (1-β) probability of detecting an effect as large as or larger than the MDE.
A high β (low power) increases the risk of concluding 'no difference' when a meaningful difference (at or above the MDE) actually exists. This is a primary risk of underpowered experiments.
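The link between β, power, and the MDE can be made concrete by computing power for true effects of different sizes. This sketch assumes a two-sided z-test and ignores the negligible probability of rejecting in the wrong direction; the 0.005 MDE and the function names are hypothetical.

```python
from statistics import NormalDist


def power_for_effect(true_effect, standard_error, alpha=0.05):
    """Approximate power of a two-sided z-test for a given true effect."""
    z = NormalDist()
    return z.cdf(abs(true_effect) / standard_error - z.inv_cdf(1 - alpha / 2))


def standard_error_for_mde(mde, alpha=0.05, power=0.80):
    """Standard error of the difference implied by designing for `mde`."""
    z = NormalDist()
    return mde / (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power))


mde = 0.005
se = standard_error_for_mde(mde)
for effect in (0.5 * mde, mde, 1.5 * mde):
    pw = power_for_effect(effect, se)
    print(f"true effect={effect:.4f}  power={pw:.3f}  beta={1 - pw:.3f}")
```

An effect exactly at the MDE is detected with the designed 80% power, while an effect half that size is missed most of the time.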

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
