A null hypothesis (H₀) is a default statistical proposition that there is no effect, no difference, or no relationship between defined groups or variables. In the context of A/B testing and experimental design, it is the assumption that any observed difference in a key metric between a treatment group (e.g., a new AI model) and a control group is due to random chance alone. The goal of an experiment is to gather sufficient evidence to reject the null hypothesis in favor of an alternative hypothesis (H₁) that asserts a true effect exists.

What is a Null Hypothesis?
The foundational statistical assumption tested in controlled experiments like A/B tests.
Rejecting the null hypothesis typically involves calculating a p-value and comparing it to a pre-defined significance level (alpha). A low p-value indicates that the observed data is unlikely under the null hypothesis, providing statistical grounds for rejection. Crucially, failing to reject the null does not prove it true; it merely indicates insufficient evidence for an effect. This framework is central to frequentist inference and underpins rigorous performance metric comparison in Evaluation-Driven Development.
Key Characteristics of the Null Hypothesis
The null hypothesis (H₀) is the foundational assumption in statistical hypothesis testing, positing no effect or no difference. Understanding its formal properties is critical for designing valid A/B tests and interpreting p-values.
Default Position of No Effect
The null hypothesis is the default or status quo assumption that any observed difference in an experiment is due to random chance, not a systematic effect. It is a precise, testable statement about a population parameter (e.g., 'The mean click-through rate for variant A equals the mean for variant B'). The burden of proof lies on the alternative hypothesis (H₁) to provide sufficient evidence for rejection.
- Purpose: Serves as a skeptical baseline that experimental data must overcome.
- Example: In an A/B test for a new recommendation algorithm, H₀ states: 'The new algorithm does not increase average user engagement time compared to the old algorithm.'
Falsifiable and Precise
A properly formulated null hypothesis must be mathematically falsifiable. It makes a specific claim about a population parameter (like a mean, proportion, or variance) that can be contradicted by sample data. Vague statements cannot be tested.
- Key Property: It is structured for potential rejection, not proof. You can never 'accept' or 'prove' H₀; you can only fail to reject it based on insufficient evidence.
- Statistical Model: It defines the expected distribution of the test statistic under the assumption of no effect (e.g., a t-distribution centered at zero for a difference in means).
Direct Link to P-Value and Significance
The p-value is calculated directly under the assumption that H₀ is true. It represents the probability of observing data as extreme as, or more extreme than, the sample results, assuming the null hypothesis is correct. A small p-value indicates that the observed data is unlikely under H₀, leading to its rejection in favor of H₁.
- Interpretation: A p-value of 0.03 means that, if H₀ were true, there would be a 3% chance of observing an effect at least as large as the one measured. It does not mean there is a 3% chance that H₀ is true.
- Threshold: The pre-defined significance level (alpha, α)—commonly 0.05—is the threshold for rejecting H₀. If p ≤ α, H₀ is rejected.
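The p-value calculation and threshold comparison described above can be sketched for a conversion-rate A/B test. This is a minimal illustration, assuming a pooled two-proportion z-test with a normal approximation. The function name and the conversion counts are hypothetical and chosen only for the example.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: p_a == p_b, using a pooled z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under H0 both groups share one rate, so the variance uses the pooled estimate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a statistic at least this extreme in either tail, assuming H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

ALPHA = 0.05  # significance level, fixed before looking at the data
z, p = two_proportion_p_value(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
decision = "reject H0" if p <= ALPHA else "fail to reject H0"
print(f"z = {z:.2f}, p = {p:.4f} -> {decision}")
```

With these illustrative counts (10.0% versus 12.5% conversion), the p-value falls below 0.05, so H₀ would be rejected at that level.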
Basis for Type I and Type II Errors
Hypothesis testing errors are defined in relation to the truth of H₀.
- Type I Error (False Positive): Incorrectly rejecting a true null hypothesis. The probability of this error is controlled by the significance level (α).
- Type II Error (False Negative): Failing to reject a false null hypothesis. The probability of this error is denoted by beta (β). Statistical power is 1 - β, the probability of correctly rejecting a false H₀.
These error trade-offs are fundamental to experiment design, determining required sample sizes via power analysis.
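As a rough sketch of how α and β translate into a required sample size, the snippet below uses the standard normal-approximation formula for comparing two proportions. The function name, the baseline rate, and the minimum detectable effect are assumptions made for illustration, not values from any particular experiment.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, mde, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided two-proportion z-test.

    p_control: assumed baseline rate; mde: absolute lift we want to detect.
    """
    p_treat = p_control + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls Type I error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

n = sample_size_per_group(p_control=0.10, mde=0.02)
print(f"Users needed per group: {n}")
```

Halving the detectable effect roughly quadruples the required sample, which is why the minimum effect of interest should be agreed before the experiment starts.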
Not Only a Statement of Equality
While often an assertion of equality (e.g., μ₁ = μ₂), H₀ can also be formulated as a statement of 'no worse than' or using an inequality for one-sided tests.
- Two-Sided H₀: μ₁ = μ₂ (Tests for any difference)
- One-Sided H₀: μ₁ ≤ μ₂ (paired with H₁: μ₁ > μ₂; tests specifically whether μ₁ exceeds μ₂)
The choice between one-sided and two-sided tests must be made a priori, based on the research question, as it affects the p-value calculation and interpretation.
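To illustrate why the choice matters, the sketch below computes both p-values from the same z statistic. It assumes a normal test statistic defined as (treatment minus control) divided by its standard error, with the one-sided alternative being "treatment is greater". The value z = 1.80 is hypothetical.

```python
from statistics import NormalDist

def p_values_from_z(z):
    """Return (two_sided, one_sided_greater) p-values for a z statistic."""
    # One-sided: only large positive z counts as evidence against H0.
    one_sided_greater = 1 - NormalDist().cdf(z)
    # Two-sided: large deviations in either direction count.
    two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return two_sided, one_sided_greater

two_sided, one_sided = p_values_from_z(1.80)
print(f"two-sided p = {two_sided:.4f}, one-sided p = {one_sided:.4f}")
```

For this z, the one-sided p-value is below 0.05 while the two-sided p-value is above it, so the same data would lead to different decisions. That is why the direction has to be fixed before the data are seen.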
Operational Role in A/B Testing
In A/B testing frameworks, the null hypothesis is the engine of decision-making. It allows the translation of a business question ('Does this new UI improve conversion?') into a statistical procedure.
- Assignment & Measurement: Users are randomly assigned to control (A) and treatment (B) groups. A metric (e.g., conversion rate) is measured for both.
- Test Execution: A statistical test (e.g., a chi-squared test for proportions, a t-test for means) computes a test statistic and corresponding p-value under H₀.
- Decision Rule: If p-value ≤ α, reject H₀ and conclude a statistically significant difference exists. The experiment then shifts to analyzing the estimated effect size and its confidence interval for practical significance.
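The three steps above can be sketched with a chi-squared test on a 2x2 table of converted versus non-converted users. This is an illustrative implementation using only the standard library, assuming Pearson's statistic without continuity correction. The counts and names are hypothetical.

```python
from math import erfc, sqrt

def chi_squared_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-squared test (no continuity correction) for H0: equal rates."""
    observed = [
        [conv_a, n_a - conv_a],  # control: converted, not converted
        [conv_b, n_b - conv_b],  # treatment: converted, not converted
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if group and outcome were independent (H0).
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

ALPHA = 0.05
chi2, p_value = chi_squared_2x2(conv_a=480, n_a=4000, conv_b=540, n_b=4000)
lift = 540 / 4000 - 480 / 4000  # estimated absolute effect size
verdict = "reject H0" if p_value <= ALPHA else "fail to reject H0"
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, lift = {lift:.3f} -> {verdict}")
```

For a 2x2 table this statistic equals the square of the pooled two-proportion z statistic, so both tests give the same two-sided p-value.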
How Null Hypothesis Testing Works in AI Experiments
The null hypothesis is the foundational concept of statistical hypothesis testing, providing the formal baseline against which experimental results in AI and machine learning are rigorously evaluated.
In an AI experiment, such as an A/B test comparing two model versions, the null hypothesis (H₀) typically states that any observed performance difference is due to random chance. The experiment's goal is to gather sufficient evidence to reject it in favor of an alternative hypothesis (H₁) that a true effect exists, using the p-value and a pre-defined significance level (alpha).
The testing framework involves calculating a test statistic from the observed data (e.g., the difference in mean accuracy between model groups) and determining the probability (p-value) of seeing a result at least this extreme if the null hypothesis were true. If this p-value is at or below the alpha threshold (commonly 0.05), the null is rejected, indicating a statistically significant result. This formal mechanism controls the rate of Type I errors (false positives) and is central to causal inference from experimental data, providing a mathematically rigorous alternative to heuristic performance comparisons.
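One way to make "a result this extreme if the null hypothesis were true" concrete is a permutation test: if the model label has no effect, shuffling the labels should produce differences as large as the observed one fairly often. The sketch below is a minimal example under that assumption. The per-request correctness data (1 = correct, 0 = incorrect) are fabricated for illustration only, and the function name is hypothetical.

```python
import random

def permutation_p_value(control, treatment, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for H0: group labels do not matter."""
    rng = random.Random(seed)
    n_control = len(control)
    observed = abs(sum(treatment) / len(treatment) - sum(control) / n_control)
    pooled = list(control) + list(treatment)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign labels at random, as H0 allows
        perm_control = pooled[:n_control]
        perm_treatment = pooled[n_control:]
        diff = abs(sum(perm_treatment) / len(perm_treatment)
                   - sum(perm_control) / n_control)
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Illustrative data: model A correct on 70 of 100 requests, model B on 85 of 100.
model_a = [1] * 70 + [0] * 30
model_b = [1] * 85 + [0] * 15
p_perm = permutation_p_value(model_a, model_b)
print(f"permutation p-value = {p_perm:.4f}")
```

The permutation approach makes no distributional assumption about the metric, at the cost of extra computation. The shuffled differences form an empirical stand-in for the null distribution.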
Null Hypothesis vs. Alternative Hypothesis
A comparison of the two opposing statements that form the foundation of statistical inference in A/B testing and experimentation.
| Feature | Null Hypothesis (H₀) | Alternative Hypothesis (H₁ or Hₐ) |
|---|---|---|
| Core Definition | A default statement of 'no effect' or 'no difference' between groups or conditions. | A statement proposing a specific effect, difference, or relationship that the experiment aims to find evidence for. |
| Assumed Truth at Start | Yes | No |
| Goal of Statistical Test | To gather evidence against it, with the aim of rejection. | To gather evidence for it, by rejecting the null. |
| Typical Mathematical Form | Equality: e.g., μ₁ = μ₂, p₁ = p₂, θ = 0 | Inequality or difference: e.g., μ₁ ≠ μ₂, p₁ > p₂, θ ≠ 0 |
| Relationship to P-Value | The p-value is calculated assuming H₀ is true. A small p-value indicates the observed data is unlikely under H₀. | The p-value is not directly calculated for H₁. Rejecting H₀ provides indirect support for H₁. |
| Outcome of Test (α=0.05) | "Fail to reject H₀" (p-value > 0.05). Evidence is insufficient to discard the default position. | "Reject H₀ in favor of H₁" (p-value ≤ 0.05). Statistically significant evidence for an effect. |
| Risk of Incorrect Conclusion (Type I Error) | Probability = α (significance level). Falsely rejecting a true null hypothesis (false positive). | |
| Risk of Incorrect Conclusion (Type II Error) | Probability = β. Failing to reject a false null hypothesis (false negative). Power = 1 - β. | |
| Role in Experiment Design | Defines the baseline for calculating test statistics and p-values. Essential for determining sample size and power. | Defines the minimum detectable effect (MDE) used in power analysis to determine the required sample size. |
Frequently Asked Questions
The null hypothesis is a foundational concept in statistical hypothesis testing, forming the default assumption that any observed effect in an experiment is due to random chance. This FAQ addresses its role in A/B testing, machine learning evaluation, and rigorous experimentation.
The null hypothesis (H₀) is a default statistical proposition that there is no effect, no difference, or no relationship between defined groups or variables in an experiment. In the context of A/B testing for AI models, it typically states that the performance metric (e.g., click-through rate, accuracy) for the new model variant (Treatment B) is equal to that of the baseline model (Control A). The experiment's goal is to gather sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis (H₁), which posits a real difference exists. Failing to reject H₀ does not prove it true; it merely indicates insufficient data to confidently claim an effect.
Related Terms
The null hypothesis is a foundational concept in statistical testing. These related terms define the core mechanisms, metrics, and methodologies used to test and potentially reject it in the context of AI and software experiments.
Statistical Significance
A result is deemed statistically significant if the observed data is sufficiently unlikely under the assumption that the null hypothesis is true. This is formally determined by comparing a calculated p-value to a pre-defined significance level (alpha), commonly set at 0.05. Achieving statistical significance provides evidence to reject the null hypothesis, suggesting the observed effect (e.g., a model performance improvement) is real and not due to random chance.
P-Value
The p-value quantifies the strength of evidence against the null hypothesis. It is the probability of observing a test statistic at least as extreme as the one calculated from your sample data, assuming the null hypothesis is true.
- A small p-value (e.g., < 0.05) indicates the observed data is improbable under the null, leading to its rejection.
- A large p-value suggests the data is compatible with the null hypothesis, so it is not rejected. Crucially, the p-value is not the probability that the null hypothesis is true or false.
Type I & Type II Error
Hypothesis testing involves two fundamental error types tied to the null hypothesis.
- Type I Error (False Positive): Incorrectly rejecting a true null hypothesis. The probability of this is controlled by the significance level (alpha).
- Type II Error (False Negative): Failing to reject a false null hypothesis. The probability of this is denoted by beta. The power of a test is (1 - beta), the probability of correctly rejecting a false null. Experiment design involves balancing these risks.
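The claim that alpha controls the Type I error rate can be checked empirically with simulated A/A experiments, where both groups share the same true rate and H₀ is therefore true by construction. The sketch below is illustrative. The true rate, group size, number of experiments, and seed are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

def pooled_z_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: equal conversion rates (pooled z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # no variation at all, so no evidence against H0
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(true_rate=0.10, n=500, experiments=1000,
                        alpha=0.05, seed=42):
    """Share of A/A experiments (H0 true) in which H0 is wrongly rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(experiments):
        # Both groups are drawn from the same rate, so any gap is pure chance.
        conv_a = sum(rng.random() < true_rate for _ in range(n))
        conv_b = sum(rng.random() < true_rate for _ in range(n))
        if pooled_z_p_value(conv_a, n, conv_b, n) <= alpha:
            rejections += 1
    return rejections / experiments

rate = false_positive_rate()
print(f"Observed Type I error rate: {rate:.3f}")
```

The observed rejection rate should land close to alpha, up to simulation noise and the normal approximation. Every one of those rejections is a false positive.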
Alternative Hypothesis
The alternative hypothesis (H₁ or Hₐ) is the statement that directly contradicts the null hypothesis. It represents the effect or difference the experiment is designed to detect.
- In an A/B test comparing Model A and Model B, if the null is 'no difference in accuracy,' the alternative is 'there is a difference in accuracy.'
- It can be one-sided (specifying a direction, e.g., 'Model B is more accurate') or two-sided (specifying any difference). The test itself quantifies evidence against the null; rejecting the null lends indirect support to the alternative.
Statistical Power
Statistical power is the probability that a test will correctly reject a false null hypothesis. It is the test's sensitivity to detect a true effect.
- High power (e.g., 0.8 or 80%) is desirable to avoid Type II Errors.
- Power depends on:
  - Sample Size: Larger samples increase power.
  - Effect Size: Larger true differences are easier to detect.
  - Significance Level (Alpha): A higher alpha increases power but also increases Type I Error risk.
- Calculating required sample size before an experiment ensures adequate power.
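The three drivers of power listed above can be explored with a small calculation. The sketch assumes a two-sided two-proportion z-test under the normal approximation and ignores the negligible chance of rejecting in the wrong direction. The function name and the rates are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_control, p_treat, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    # Standard error of the difference when the true rates are as given.
    se = sqrt((p_control * (1 - p_control) + p_treat * (1 - p_treat))
              / n_per_group)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Chance the observed z clears the critical value in the true direction.
    return NormalDist().cdf(abs(p_treat - p_control) / se - z_crit)

for n in (1000, 2000, 4000, 8000):
    print(f"n = {n:>4} per group -> power = {approx_power(0.10, 0.12, n):.2f}")
```

Running the loop shows power rising with sample size for a fixed 2-point lift. Raising `alpha` or widening the gap between `p_control` and `p_treat` has the same qualitative effect.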
Confidence Interval
A confidence interval provides a range of plausible values for an unknown population parameter (e.g., the true difference in model performance). A 95% confidence interval means that if the experiment were repeated many times, 95% of such intervals would contain the true parameter value.
- Direct Relationship to Null Hypothesis: If a 95% confidence interval for a difference does not include zero, it corresponds to rejecting the null hypothesis of no difference in a two-sided test at the 5% significance level.
- It conveys both the estimated effect size and the precision of the estimate, offering more information than a binary significance test.
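The correspondence between the interval and the test can be sketched as follows. To keep the two exactly consistent, this example assumes a Wald interval and a Wald z-test that share the same unpooled standard error (a pooled-variance test can disagree slightly near the boundary). The function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def diff_ci_and_p(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval and two-sided p-value for p_b - p_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return diff, ci, p_value

for conv_b in (210, 250):  # a small lift and a larger one, both illustrative
    diff, (low, high), p_value = diff_ci_and_p(200, 2000, conv_b, 2000)
    excludes_zero = low > 0 or high < 0
    print(f"diff = {diff:.3f}, 95% CI = ({low:.4f}, {high:.4f}), "
          f"p = {p_value:.4f}, CI excludes 0: {excludes_zero}")
```

In both cases the interval excludes zero exactly when the p-value is below 0.05, and the interval additionally shows how large or small the true difference could plausibly be.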

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.