A chi-squared test is a statistical hypothesis test used to determine if there is a significant association between two categorical variables or if an observed frequency distribution differs from an expected theoretical distribution. It operates by comparing observed counts in contingency tables to expected counts derived under the null hypothesis of independence or specified distribution, calculating a test statistic that follows a chi-squared probability distribution. This non-parametric test is a cornerstone of A/B testing frameworks for analyzing experiment outcomes like user preference between two model interfaces.
Chi-Squared Test

What is a Chi-Squared Test?
A fundamental statistical procedure for analyzing categorical data to assess independence or goodness-of-fit.
The test's utility in Evaluation-Driven Development lies in validating data splits and evaluating model performance across categorical outcomes, such as error type classification. For multivariate tests with more than two variants or outcome categories, the chi-squared test of independence generalizes directly to larger contingency tables. Key assumptions are independent observations and sufficiently large expected cell counts (a common rule of thumb is at least 5 per cell). When expected counts are too small, Fisher's exact test is the usual alternative. The resulting p-value indicates whether to reject the null hypothesis of no association, providing a quantitative check for categorical relationships.
Key Features of Chi-Squared Tests
Chi-squared tests are non-parametric statistical procedures used to analyze categorical data. They assess the independence of variables or the goodness-of-fit between observed and expected frequency distributions.
Tests for Categorical Data
The chi-squared test is fundamentally designed for categorical (nominal or ordinal) variables, not continuous data. It operates on frequency counts in a contingency table. Common applications include:
- Testing if gender is independent of product preference.
- Determining if the distribution of user sign-ups across regions matches expected marketing targets.
- Assessing if a model's error types are associated with specific input data categories.
Goodness-of-Fit Test
The chi-squared goodness-of-fit test evaluates how well an observed frequency distribution matches an expected theoretical distribution.
Process:
- Define a null hypothesis stating the observed data follows the expected distribution.
- Calculate the test statistic (χ²) by summing the squared differences between observed and expected counts, divided by the expected counts.
- Compare the statistic to a chi-squared distribution with the appropriate degrees of freedom.
Example: Testing if a six-sided die is fair by comparing the observed count of each face to the expected count, which is 1/6 of the total number of rolls.
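The die example can be sketched in a few lines of standard-library Python. The roll counts below are hypothetical, the function name is illustrative, and the critical value 11.070 is the tabulated chi-squared value for df = 5 at alpha = 0.05.

```python
def chi_squared_gof(observed, expected_probs):
    """Return (statistic, df) for a goodness-of-fit test with no estimated parameters."""
    n = sum(observed)
    expected = [n * p for p in expected_probs]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return statistic, df


# Hypothetical counts for faces 1..6 from 120 rolls (illustrative data only).
observed_rolls = [25, 17, 15, 23, 24, 16]
fair_probs = [1 / 6] * 6

statistic, df = chi_squared_gof(observed_rolls, fair_probs)

# Tabulated critical value for df = 5 at alpha = 0.05.
CRITICAL_DF5_ALPHA05 = 11.070
reject_fair = statistic > CRITICAL_DF5_ALPHA05
print(f"chi2 = {statistic:.2f}, df = {df}, reject fairness: {reject_fair}")
```

With these counts each face has an expected count of 20, the statistic is 5.0, and the null hypothesis of a fair die is not rejected.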
Test of Independence
The chi-squared test of independence determines if there is a significant association between two categorical variables in a contingency table.
Key Mechanism:
- The null hypothesis states the variables are independent.
- Expected frequencies for each cell are calculated assuming independence: (row total * column total) / grand total.
- A large χ² statistic indicates the observed joint frequencies deviate substantially from what would be expected if the variables were unrelated, leading to rejection of the null hypothesis.
Use Case: In A/B testing, analyzing if the variant (A or B) is independent of a binary outcome like 'conversion' vs. 'no conversion'.
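A minimal sketch of this A/B use case, using only the standard library. The conversion counts are hypothetical, and no continuity correction is applied. For a 2x2 table df = 1, and in that special case the chi-squared tail probability has the closed form erfc(sqrt(χ²/2)), which is used here to obtain the p-value.

```python
import math


def chi_squared_independence(table):
    """Return (statistic, df, expected) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Hypothetical results: rows = variant A, B; columns = [converted, not converted].
ab_table = [[120, 880], [150, 850]]
statistic, df, expected = chi_squared_independence(ab_table)

# Valid only for df = 1: upper-tail probability of the chi-squared distribution.
p_value = math.erfc(math.sqrt(statistic / 2))
print(f"chi2 = {statistic:.3f}, df = {df}, p = {p_value:.4f}")
```

For this table the statistic is about 3.85 and the p-value falls just under 0.05, so the result sits right at the conventional threshold.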
Reliance on Large Sample Sizes
Chi-squared tests are large-sample approximations. They require sufficient expected frequencies in each contingency table cell to be valid. A common rule is that all expected counts should be at least 5. With sparse data (small samples or many categories), the test can be unreliable. In such cases, exact tests like Fisher's Exact Test are used instead. This requirement directly impacts experiment design, necessitating adequate traffic allocation in A/B tests involving categorical outcomes.
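The expected-count rule can be checked before running the test. The sketch below is one way to do it; the function names and both tables are hypothetical, and the threshold of 5 is the rule of thumb mentioned above rather than a hard limit.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def meets_expected_count_rule(table, minimum=5.0):
    """True when every expected cell count is at least `minimum`."""
    return all(e >= minimum for row in expected_counts(table) for e in row)


sparse_table = [[3, 7], [2, 18]]                 # smallest expected count is below 5
well_populated_table = [[120, 880], [150, 850]]  # smallest expected count is 135

print(meets_expected_count_rule(sparse_table))
print(meets_expected_count_rule(well_populated_table))
```

Note that the check applies to expected counts, not observed counts: a cell with few observations can still be acceptable if its row and column totals are large.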
Degrees of Freedom Calculation
The shape of the reference chi-squared distribution is determined by degrees of freedom (df), which depend on the test type:
- Goodness-of-Fit: df = k - 1 - p, where k is the number of categories and p is the number of parameters estimated from the data.
- Test of Independence: For an r x c contingency table, df = (r - 1) * (c - 1).
The degrees of freedom are crucial for finding the critical value or p-value. A higher df results in a flatter, more spread-out chi-squared distribution, requiring a larger test statistic to achieve significance.
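To illustrate how df affects significance, here is a standard-library sketch of the chi-squared upper-tail probability for integer df. It uses the closed forms for df = 1 and df = 2 plus the recurrence that adds one term for every two additional degrees of freedom. The helper names are illustrative, not taken from any library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        tail, k = math.exp(-half), 2                 # closed form for df = 2
    else:
        tail, k = math.erfc(math.sqrt(half)), 1      # closed form for df = 1
    while k < df:
        # Going from k to k + 2 degrees of freedom adds this term to the tail.
        tail += half ** (k / 2) * math.exp(-half) / math.gamma(k / 2 + 1)
        k += 2
    return tail


def gof_df(num_categories, num_estimated_params=0):
    return num_categories - 1 - num_estimated_params


def independence_df(num_rows, num_cols):
    return (num_rows - 1) * (num_cols - 1)


# The same statistic becomes less significant as df grows.
statistic = 7.0
for df in (1, 2, 5, 10):
    print(df, round(chi2_sf(statistic, df), 4))
```

The printed p-values increase with df, which is the numerical counterpart of the distribution spreading out: a statistic of 7.0 is strong evidence at df = 1 but unremarkable at df = 10.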
Interpretation & Limitations
Interpretation: A significant result (p-value < alpha, e.g., 0.05) provides evidence against the null hypothesis but does not measure the strength or direction of the association. Effect size measures like Cramér's V or the phi coefficient must be calculated separately.
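As a rough sketch of the effect size step, Cramér's V can be computed from the statistic as sqrt(χ² / (n * min(r - 1, c - 1))). For a 2x2 table it equals the absolute value of the phi coefficient. The function name and example table are hypothetical.

```python
import math


def cramers_v(table):
    """Cramér's V for an r x c table of observed counts (0 = no association, 1 = perfect)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(row_totals))
        for j in range(len(col_totals))
    )
    smaller_dim = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * smaller_dim))


# Hypothetical A/B table: rows = variant, columns = [converted, not converted].
ab_table = [[120, 880], [150, 850]]
print(f"Cramér's V = {cramers_v(ab_table):.3f}")
```

For this table V is roughly 0.04, a very weak association even though the chi-squared statistic is near the 0.05 significance threshold.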
Key Limitations:
- Association, not Causation: The test identifies relationships but cannot prove cause-and-effect.
- No Directionality: It does not indicate which categories contribute most to the association (post-hoc analysis required).
- Sensitive to Sample Size: With very large samples, trivial associations may become statistically significant, necessitating business context for interpretation.
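The sample-size limitation can be seen directly by scaling a table while keeping its proportions fixed. In the hypothetical example below, a 52% vs. 48% split is tested at two sample sizes; the statistic grows in proportion to the sample size. The p-value uses the df = 1 closed form erfc(sqrt(χ²/2)).

```python
import math


def chi2_statistic(table):
    """Pearson chi-squared statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(row_totals))
        for j in range(len(col_totals))
    )


def scale(table, factor):
    """Multiply every cell by `factor`, leaving all proportions unchanged."""
    return [[cell * factor for cell in row] for row in table]


small = [[52, 48], [48, 52]]   # hypothetical: 200 observations
large = scale(small, 100)      # same proportions, 20,000 observations

stat_small, stat_large = chi2_statistic(small), chi2_statistic(large)
p_small = math.erfc(math.sqrt(stat_small / 2))  # df = 1
p_large = math.erfc(math.sqrt(stat_large / 2))  # df = 1
print(f"n=200:   chi2 = {stat_small:.2f}, p = {p_small:.3f}")
print(f"n=20000: chi2 = {stat_large:.2f}, p = {p_large:.2e}")
```

The same four-point difference is nowhere near significant with 200 observations and overwhelmingly significant with 20,000, which is why an effect size and business context are needed alongside the p-value.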
Chi-Squared Test vs. Other Statistical Tests
A comparison of the Chi-Squared test with other common statistical tests, highlighting their distinct data requirements, null hypotheses, and primary use cases in A/B testing and evaluation.
| Feature / Metric | Chi-Squared Test | T-Test (Independent Samples) | ANOVA (One-Way) | Fisher's Exact Test |
|---|---|---|---|---|
| Primary Use Case | Test for association or goodness-of-fit between categorical variables. | Compare the means of two independent groups. | Compare the means of three or more independent groups. | Test for association in a 2x2 contingency table when sample sizes are very small. |
| Data Type Required | Categorical (counts/frequencies). | Continuous (interval or ratio). | Continuous (interval or ratio). | Categorical (counts/frequencies). |
| Typical Null Hypothesis (H₀) | No association between variables; observed frequencies match expected. | No difference between the means of two groups (μ₁ = μ₂). | No difference between the means of all groups (μ₁ = μ₂ = μ₃...). | No association between the two categorical variables. |
| Number of Groups/Variables | One categorical variable (goodness-of-fit) or two categorical variables (independence; e.g., 2x2, 3x5 table). | Two groups (one categorical variable with 2 levels). | Three or more groups (one categorical variable with k levels). | Two categorical variables forming a 2x2 table. |
| Key Assumptions | | | | |
| Output / Test Statistic | Chi-squared statistic (χ²), degrees of freedom, p-value. | T-statistic, degrees of freedom, p-value. | F-statistic, degrees of freedom (between & within groups), p-value. | Exact p-value (calculated from hypergeometric distribution). |
| Common A/B Testing Application | Comparing conversion rates (success/failure) between two or more variants. | Comparing average session duration or revenue per user between two variants. | Comparing a metric (e.g., task completion time) across three or more UI designs. | Analyzing click-through rates for a new button when total impressions are very low (< 30). |
| Sample Size Consideration | Requires sufficient expected counts; unreliable with very small samples. | Power depends on sample size and effect size; robust with moderate N. | Power depends on sample size, effect size, and number of groups. | Designed specifically for small sample sizes where Chi-squared is invalid. |
Frequently Asked Questions
A chi-squared test is a fundamental statistical hypothesis test used to determine if there is a significant association between categorical variables or if an observed frequency distribution differs from an expected one. It is a cornerstone of evaluation-driven development, particularly within A/B testing frameworks, for validating the independence of experimental groups or the fit of a model to data.
A chi-squared test is a statistical hypothesis test that assesses the independence of categorical variables or the goodness-of-fit of an observed distribution to an expected one. It works by calculating the chi-squared statistic (χ²), which quantifies the discrepancy between observed and expected frequencies. The formula is χ² = Σ [(Oᵢ - Eᵢ)² / Eᵢ], where Oᵢ is the observed frequency and Eᵢ is the expected frequency for category i. This calculated statistic is then compared to a critical value from the chi-squared distribution, which depends on the degrees of freedom and the chosen significance level. If the statistic exceeds the critical value, the null hypothesis (independence, or agreement with the expected distribution) is rejected, indicating a statistically significant association or deviation.
Related Terms
The Chi-Squared Test is a foundational tool within statistical hypothesis testing for categorical data. Its proper application and interpretation rely on understanding these closely related concepts and methodologies.
Statistical Significance
A determination that an observed effect in sample data is unlikely to have occurred by random chance alone. In the context of a Chi-Squared Test, a result is deemed statistically significant when the calculated p-value falls below a pre-defined threshold (alpha, typically 0.05), leading to the rejection of the null hypothesis of independence between variables.
P-Value
The probability, under the assumption of the null hypothesis, of obtaining a test statistic at least as extreme as the one actually observed. For a Chi-Squared Test, a low p-value (e.g., < 0.05) provides evidence against the null hypothesis: an association this strong would be unlikely to arise from random sampling variation alone if the variables were truly independent.
Null Hypothesis
The default statistical proposition that there is no effect or no association. In a Chi-Squared Test of independence, the null hypothesis states that the two categorical variables are independent—the distribution of one variable is the same across all categories of the other. The test evaluates whether the observed data provides sufficient evidence to reject this hypothesis.
Contingency Table
A matrix that displays the frequency distribution of two categorical variables, showing how data points are divided across their combined categories. Also known as a cross-tabulation, it is the fundamental input for a Chi-Squared Test of independence.
- Rows represent categories of one variable.
- Columns represent categories of the other variable.
- Cells contain the observed counts, which are compared to expected counts under the null hypothesis.
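A contingency table is usually built by counting paired observations. The sketch below does this with the standard library; the function name, labels, and data are hypothetical, and rows and columns are ordered alphabetically for reproducibility.

```python
from collections import Counter


def contingency_table(pairs):
    """Cross-tabulate a list of (row_category, column_category) pairs into observed counts."""
    counts = Counter(pairs)
    row_labels = sorted({row for row, _ in pairs})
    col_labels = sorted({col for _, col in pairs})
    # Counter returns 0 for combinations that never occur.
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


# Hypothetical raw observations: (variant, outcome) for each user.
observations = (
    [("A", "yes")] * 3 + [("A", "no")] * 7 + [("B", "yes")] * 5 + [("B", "no")] * 5
)

row_labels, col_labels, table = contingency_table(observations)
print(row_labels, col_labels, table)
```

Here the rows are variants A and B, the columns are the outcomes "no" and "yes", and the resulting table of observed counts is [[7, 3], [5, 5]].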
Degrees of Freedom
A parameter intrinsic to the Chi-Squared distribution, calculated from the dimensions of the contingency table. It represents the number of values in the final calculation of a statistic that are free to vary. For a Chi-Squared Test on an r x c table, degrees of freedom = (r - 1) * (c - 1). This value is crucial for determining the correct critical value from the Chi-Squared distribution to assess significance.
Fisher's Exact Test
An alternative to the Chi-Squared Test used for analyzing contingency tables, particularly when sample sizes are small or expected cell counts are below 5. Unlike the Chi-Squared Test, which relies on an asymptotic approximation, Fisher's Exact Test calculates the exact probability of observing the given distribution under the null hypothesis, making it more accurate for small datasets.
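For a 2x2 table the exact calculation is short enough to sketch with the standard library. With the row and column totals held fixed, the top-left cell follows a hypergeometric distribution. The two-sided p-value below sums the probabilities of all tables that are no more likely than the observed one, which is one common convention for two-sided exact tests. The function name and example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so floating-point ties count as equally likely.
    return sum(
        prob(x) for x in range(lowest, highest + 1) if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical sparse table where several expected counts fall below 5.
sparse_table = [[8, 2], [1, 5]]
p_value = fisher_exact_two_sided(sparse_table)
print(f"Fisher exact p = {p_value:.4f}")
```

Because the result is a sum of exact probabilities rather than a large-sample approximation, it remains valid for tables this small.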

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.