Propensity score matching is a statistical technique for estimating causal effects from observational data by simulating the conditions of a randomized experiment. It reduces selection bias by matching units (e.g., users, patients) that received a 'treatment' (like a new AI model) with comparable control units that did not, based on their estimated probability—or propensity score—of receiving that treatment given their observed characteristics. This creates balanced comparison groups for a more reliable estimate of the Average Treatment Effect.
Propensity Score Matching

What is Propensity Score Matching?
Propensity score matching is a quasi-experimental method used in causal inference to reduce selection bias by matching treated and untreated units with similar probabilities of receiving the treatment based on observed covariates.
In A/B testing frameworks, PSM is used for post-hoc analysis to validate results or to analyze non-randomized data, such as when users self-select into groups. The process involves estimating propensity scores (often via logistic regression), applying a matching algorithm (e.g., nearest neighbor), and checking covariate balance. Its validity depends on the ignorability assumption: that all confounding variables are observed and included in the propensity model. Where that assumption is plausible, PSM is a useful tool in evaluation-driven development, for example when CTOs need to assess the impact of a model that was not rolled out at random.
Key Characteristics of Propensity Score Matching
Bias Reduction via Balancing
The core function of propensity score matching is to reduce selection bias by creating a balanced comparison group. It does this by matching each treated unit with one or more control units that have a similar propensity score—the estimated probability of receiving the treatment given their observed covariates (e.g., age, income, prior behavior).
- Goal: Make the treatment and control groups statistically comparable on observed variables, mimicking randomization.
- Result: Differences in outcomes between the matched groups can be more credibly attributed to the treatment effect, not pre-existing differences.
The Propensity Score (e(X))
The propensity score, denoted as e(X), is a single scalar summary of all observed pre-treatment covariates (X). It is defined as the conditional probability of assignment to a treatment given the observed covariates: e(X) = Pr(T=1 | X).
- Estimation: Typically estimated using a logistic regression or a machine learning classifier (e.g., gradient boosting) where the treatment assignment is the dependent variable and covariates are predictors.
- Role: According to the Rosenbaum-Rubin Theorem, if treatment assignment is strongly ignorable given X, then it is also strongly ignorable given the propensity score e(X). This allows for matching on a single dimension.
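The estimation step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production estimator: the data, the helper names (`sigmoid`, `fit_propensity_model`, `propensity_scores`), the learning rate, and the epoch count are all hypothetical, and a real analysis would more likely use an established statistics library.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_propensity_model(X, t, lr=0.1, epochs=2000):
    """Fit a logistic regression of treatment on covariates by batch gradient descent.

    Returns (bias, weights). lr and epochs are illustrative defaults, not tuned values.
    """
    n, k = len(X), len(X[0])
    bias, weights = 0.0, [0.0] * k
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * k
        for row, ti in zip(X, t):
            # Prediction error for this unit: predicted probability minus actual assignment.
            err = sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - ti
            grad_b += err
            for j in range(k):
                grad_w[j] += err * row[j]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights

def propensity_scores(X, bias, weights):
    """e(X) = Pr(T=1 | X) under the fitted model."""
    return [sigmoid(bias + sum(w * x for w, x in zip(weights, row))) for row in X]

# Hypothetical data: one standardized covariate (e.g., prior usage) and a treatment flag.
X = [[-2.0], [-1.5], [-1.0], [-0.5], [0.0], [0.5],
     [1.0], [1.5], [2.0], [-0.2], [0.3], [0.8]]
t = [0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1]

bias, weights = fit_propensity_model(X, t)
scores = propensity_scores(X, bias, weights)
```

Each entry of `scores` is the estimated probability that the corresponding unit received the treatment; these values are what the matching step operates on.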
Common Matching Algorithms
Once propensity scores are estimated, units are paired using specific algorithms. The choice affects the quality and variance of the causal estimate.
- Nearest Neighbor Matching: Each treated unit is matched with the control unit whose propensity score is closest. Can be performed with or without replacement.
- Caliper Matching: A tolerance (the caliper) is set, e.g., 0.2 standard deviations of the logit of the propensity score, a commonly cited rule of thumb. A match is made only if the score difference falls within the caliper, which improves match quality at the cost of leaving some treated units unmatched.
- Stratification/Subclassification: Units are divided into strata (e.g., quintiles) based on their propensity score. The treatment effect is estimated within each stratum and then averaged.
- Optimal Matching: Minimizes the total absolute distance across all matches, often producing more balanced samples than greedy nearest-neighbor.
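As a rough sketch of the first two algorithms, the function below performs greedy 1:1 nearest-neighbor matching without replacement, with an optional caliper. The function name and the scores are hypothetical, and the caliper here is an absolute distance on the raw score for simplicity, which is an assumption rather than a recommended setting.

```python
def match_nearest_neighbor(treated_scores, control_scores, caliper=None):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Returns a list of (treated_index, control_index) pairs. Treated units whose
    closest available control lies outside the caliper stay unmatched.
    """
    available = set(range(len(control_scores)))
    pairs = []
    for i, score in enumerate(treated_scores):
        if not available:
            break
        # sorted() makes tie-breaking deterministic (lowest control index wins).
        j = min(sorted(available), key=lambda c: abs(control_scores[c] - score))
        if caliper is None or abs(control_scores[j] - score) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs

# Hypothetical propensity scores.
treated_scores = [0.80, 0.55, 0.30]
control_scores = [0.25, 0.52, 0.62, 0.10]

pairs_no_caliper = match_nearest_neighbor(treated_scores, control_scores)
pairs_with_caliper = match_nearest_neighbor(treated_scores, control_scores, caliper=0.1)
```

With the caliper set, the first treated unit (score 0.80) is dropped because its closest control is 0.18 away. Greedy matching also depends on the order in which treated units are processed, which is the weakness optimal matching is meant to address.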
Assumption of Strong Ignorability
Propensity score matching relies on the critical Strong Ignorability or Unconfoundedness assumption. This has two parts:
- Conditional Independence: The potential outcomes (Y(1), Y(0)) are independent of the treatment assignment (T) given the observed covariates X. Formally: (Y(1), Y(0)) ⟂ T | X.
- Positivity/Overlap: For all possible values of X, there is a positive probability of receiving either treatment or control. Formally: 0 < Pr(T=1 | X) < 1.
- Implication: This assumption means there are no unobserved confounders. Violation of this assumption (i.e., hidden bias) invalidates the causal conclusions from PSM.
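Conditional independence cannot be tested from the data, but overlap can at least be inspected. The sketch below uses a simple min/max common-support rule, which is one heuristic among several; the function names and scores are hypothetical.

```python
def common_support(treated_scores, control_scores):
    """Score range covered by both groups (simple min/max rule)."""
    low = max(min(treated_scores), min(control_scores))
    high = min(max(treated_scores), max(control_scores))
    return low, high

def within_support(scores, low, high):
    """Flag each unit as inside (True) or outside (False) the common range."""
    return [low <= s <= high for s in scores]

# Hypothetical propensity scores.
treated_scores = [0.35, 0.60, 0.92]
control_scores = [0.05, 0.40, 0.70]

low, high = common_support(treated_scores, control_scores)
treated_ok = within_support(treated_scores, low, high)
control_ok = within_support(control_scores, low, high)
```

Units flagged `False` have no comparable counterpart in the other group. Dropping them restricts the population the estimate applies to, so the restriction should be reported alongside the result.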
Post-Matching Diagnostics
After matching, analysts must check if balance was achieved. This is a crucial validation step.
- Standardized Mean Difference (SMD): The primary metric. For each covariate, calculate the difference in means between treated and control groups, divided by the pooled standard deviation. An absolute SMD below 0.1 is typically considered good balance.
- Variance Ratios: The ratio of variances for each covariate between groups should be close to 1.
- Visual Checks: Examine plots like love plots (forest plots of SMDs before/after matching) and propensity score distribution histograms to assess overlap improvement.
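The two numeric diagnostics can be computed directly. This sketch assumes the pooled standard deviation is the square root of the average of the two group variances, a common convention for balance checks; the covariate values are hypothetical.

```python
import math
from statistics import mean, variance

def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

def variance_ratio(treated, control):
    """Ratio of sample variances; values near 1 suggest similar spread."""
    return variance(treated) / variance(control)

# Hypothetical ages for one covariate.
treated_age = [30.0, 40.0, 50.0]
control_age_before = [20.0, 30.0, 40.0]  # all controls, before matching
control_age_after = [31.0, 39.0, 50.0]   # matched controls only

smd_before = standardized_mean_difference(treated_age, control_age_before)
smd_after = standardized_mean_difference(treated_age, control_age_after)
vr_after = variance_ratio(treated_age, control_age_after)
```

In this toy case the SMD falls from 1.0 before matching to 0.0 after. In practice the same check is repeated for every covariate in the propensity model.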
Contrast with Randomized Experiments
PSM is a quasi-experimental method used when randomized controlled trials (RCTs) are infeasible, unethical, or too costly.
- RCT Gold Standard: Random assignment ensures groups are balanced on both observed and unobserved covariates on average.
- PSM Limitation: Only balances observed covariates. It cannot adjust for unobserved confounders, which remains its fundamental weakness.
- Use Case: Commonly applied in observational studies in economics (e.g., evaluating job training programs), healthcare (e.g., drug effectiveness from electronic health records), and marketing (e.g., measuring campaign impact from customer data).
Propensity Score Matching vs. Other Causal Methods
A technical comparison of propensity score matching against other primary methodologies for estimating causal effects from observational data, highlighting core assumptions, implementation complexity, and typical use cases.
| Feature / Dimension | Propensity Score Matching (PSM) | Regression Adjustment | Instrumental Variables (IV) | Difference-in-Differences (DiD) |
|---|---|---|---|---|
| Primary Goal | Reduce selection bias by creating a balanced comparison group | Statistically control for confounding variables | Address unobserved confounding via an external instrument | Control for time-invariant unobserved confounding |
| Key Assumption | Conditional Independence (Ignorability) & Overlap | Correct model specification (linearity, no omitted variables) | Valid instrument: Relevant & Excludable | Parallel trends (absent treatment, both groups' outcomes would have moved in parallel) |
| Handles Unobserved Confounders? | No | No | Yes, given a valid instrument | Partially (time-invariant confounders only) |
| Data Requirements | Rich observed covariates for matching | Rich observed covariates for modeling | A valid instrumental variable | Panel or repeated cross-sectional data |
| Implementation Complexity | Medium (matching algorithm, balance checks) | Low (standard regression) | High (instrument validation, 2SLS) | Medium (pre/post period construction) |
| Output | Estimated Average Treatment Effect on the Treated (ATT) | Conditional Average Treatment Effect (CATE) | Local Average Treatment Effect (LATE) | Average Treatment Effect on the Treated (ATT) |
| Common Use Case | Evaluating a medical treatment using patient records | Estimating price elasticity from sales data | Estimating effect of education on earnings using policy changes | Measuring impact of a new law across regions over time |
| Risk of Model Misspecification | Low to medium (matching is non-parametric, but the propensity model can be misspecified) | High (relies on functional form) | Medium (relies on IV assumptions) | Medium (relies on parallel trends) |
Frequently Asked Questions
A quasi-experimental method for estimating causal effects from observational data by reducing selection bias.
Propensity score matching is a quasi-experimental method used in causal inference to estimate the effect of a treatment, policy, or intervention by reducing selection bias from observed confounding variables. It works by modeling the probability (the propensity score) that a unit (e.g., a user, patient, or customer) would receive the treatment based on its observed covariates. Treated units are then matched with untreated control units that have a similar propensity score, creating a balanced comparison group where the distribution of observed confounders is statistically equivalent. The average treatment effect on the treated is then estimated by comparing outcomes between the matched pairs, approximating the conditions of a randomized controlled trial.
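The final comparison step can be sketched as follows, assuming matching has already produced index pairs. The pairs, outcome values, and function name are hypothetical, and the sketch returns a point estimate only, with no standard errors.

```python
def att_from_pairs(pairs, treated_outcomes, control_outcomes):
    """Mean outcome difference across matched (treated_index, control_index) pairs."""
    if not pairs:
        raise ValueError("no matched pairs to compare")
    diffs = [treated_outcomes[i] - control_outcomes[j] for i, j in pairs]
    return sum(diffs) / len(diffs)

# Hypothetical matched pairs and outcomes (e.g., sessions per week).
pairs = [(0, 2), (1, 1), (2, 0)]
treated_outcomes = [12.0, 10.0, 9.0]
control_outcomes = [7.0, 8.0, 9.0, 5.0]

att = att_from_pairs(pairs, treated_outcomes, control_outcomes)
```

The fourth control unit is never matched, so its outcome does not enter the estimate.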
Related Terms
Propensity score matching is a core technique within causal inference, a field dedicated to estimating cause-and-effect from observational data. These related terms define the broader methodological and statistical landscape.
Causal Inference
Causal inference is the process of drawing conclusions about cause-and-effect relationships from data. Unlike correlation, it seeks to estimate the impact of an intervention (treatment). Core methods include:
- Randomized Controlled Trials: The gold standard, where random assignment eliminates confounding.
- Quasi-Experimental Methods: Used when randomization isn't possible (e.g., propensity score matching, difference-in-differences).
- Structural Causal Models: A framework using directed acyclic graphs to encode assumptions about data-generating processes.
Average Treatment Effect
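The Average Treatment Effect is a primary target of causal inference. It is the average difference, across a population, between each unit's outcome under treatment and its outcome under control. Because only one of those outcomes is observed for any unit, it has to be estimated by comparing groups.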
- ATE: The effect for the entire population.
- ATT: Average Treatment Effect on the Treated (the effect for those who actually received the treatment).
- ATC: Average Treatment Effect on the Controls.
Propensity score matching is often used to estimate the ATT, answering: 'What was the effect for those who received the treatment?'
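The difference between the three estimands is easiest to see on a toy table where both potential outcomes are known for every unit. That never holds in real data, so the numbers below are purely hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical units: y0 = outcome without treatment, y1 = outcome with treatment.
units = [
    {"y0": 5.0, "y1": 9.0, "treated": True},
    {"y0": 6.0, "y1": 9.0, "treated": True},
    {"y0": 4.0, "y1": 5.0, "treated": False},
    {"y0": 7.0, "y1": 8.0, "treated": False},
]

# Average the unit-level effect (y1 - y0) over different subpopulations.
ate = mean([u["y1"] - u["y0"] for u in units])
att = mean([u["y1"] - u["y0"] for u in units if u["treated"]])
atc = mean([u["y1"] - u["y0"] for u in units if not u["treated"]])
```

Here the ATE is 2.25, the ATT is 3.5, and the ATC is 1.0. The three diverge because, in this made-up example, the units that took the treatment are also the ones that benefit most from it.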
Selection Bias
Selection bias is the systematic error that occurs when the treated and untreated groups differ in ways that affect the outcome, independent of the treatment itself. This is the fundamental problem propensity score matching aims to solve.
- Sources: Self-selection, non-random program enrollment, or confounding variables.
- Consequence: Observed correlation does not equal causation. For example, comparing outcomes of patients who chose a drug versus those who didn't may reflect underlying health differences, not just drug efficacy.
Confounding Variables
A confounding variable is a factor that influences both the treatment assignment and the outcome, creating a spurious association. It is the primary driver of selection bias.
- Example: In studying a training program's effect on salary, prior education confounds the analysis if more educated individuals are both more likely to take the program and earn higher salaries regardless.
- Role in PSM: Propensity score matching attempts to balance observed confounders (e.g., age, education, income) between the treated and matched control units, simulating randomization.
Stratified Sampling
Stratified sampling is a probability sampling technique where a population is divided into homogeneous subgroups (strata) based on key characteristics before sampling. It ensures all subgroups are adequately represented.
- Relation to PSM: Propensity score matching can be viewed as a form of post-hoc stratification. Instead of pre-defining strata, units are grouped by their estimated propensity score (e.g., 0.0-0.1, 0.1-0.2), and matching occurs within these 'score strata' to improve balance.
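A minimal sketch of grouping by score strata is shown below, assuming equal-width bins on the propensity score and weighting each stratum by its size. The data and function names are hypothetical, and strata that lack either treated or control units are skipped.

```python
def mean(values):
    return sum(values) / len(values)

def naive_difference(treated, outcomes):
    """Unadjusted difference in mean outcomes between treated and control units."""
    y1 = [y for ti, y in zip(treated, outcomes) if ti == 1]
    y0 = [y for ti, y in zip(treated, outcomes) if ti == 0]
    return mean(y1) - mean(y0)

def stratified_effect(scores, treated, outcomes, n_strata=5):
    """Size-weighted average of within-stratum differences in mean outcomes."""
    strata = {}
    for s, ti, y in zip(scores, treated, outcomes):
        idx = min(int(s * n_strata), n_strata - 1)  # equal-width bins on [0, 1]
        strata.setdefault(idx, {0: [], 1: []})[ti].append(y)
    total, weighted = 0, 0.0
    for groups in strata.values():
        if groups[0] and groups[1]:  # need both groups to compare
            size = len(groups[0]) + len(groups[1])
            weighted += size * (mean(groups[1]) - mean(groups[0]))
            total += size
    if total == 0:
        raise ValueError("no stratum contains both treated and control units")
    return weighted / total

# Hypothetical data: high-score units are more often treated and have higher outcomes.
scores = [0.2, 0.3, 0.4, 0.1, 0.6, 0.7, 0.8, 0.9]
treated = [1, 0, 0, 0, 1, 1, 1, 0]
outcomes = [5.0, 3.0, 3.0, 3.0, 10.0, 10.0, 10.0, 7.0]

naive = naive_difference(treated, outcomes)
stratified = stratified_effect(scores, treated, outcomes, n_strata=2)
```

The naive comparison gives 4.75, while comparing within score strata gives 2.5. The gap is the part of the naive difference that comes from treated units having higher scores, and higher baseline outcomes, to begin with.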
Instrumental Variables
Instrumental Variables is an alternative causal inference method used when unobserved confounding is suspected. It relies on finding a variable (the instrument) that:
- Correlates with the treatment assignment.
- Affects the outcome only through its effect on the treatment (exclusion restriction).
- Comparison to PSM: While PSM addresses observed confounding, IV methods aim to handle unobserved confounding. However, finding a valid instrument is often challenging. The methods are complementary tools in the causal inference toolkit.
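For contrast with matching, the simplest IV estimator for a binary instrument is the Wald ratio: the instrument's effect on the outcome divided by its effect on treatment uptake. The sketch below assumes a valid instrument and uses hypothetical data in which the outcome is built as 6 + 4 × treatment.

```python
def mean(values):
    return sum(values) / len(values)

def wald_estimate(z, t, y):
    """Wald IV estimate for a binary instrument z, treatment t, and outcome y."""
    y_gap = (mean([yi for zi, yi in zip(z, y) if zi == 1])
             - mean([yi for zi, yi in zip(z, y) if zi == 0]))
    t_gap = (mean([ti for zi, ti in zip(z, t) if zi == 1])
             - mean([ti for zi, ti in zip(z, t) if zi == 0]))
    if t_gap == 0:
        raise ValueError("instrument does not shift treatment uptake")
    return y_gap / t_gap

# Hypothetical data: z is the instrument (e.g., a randomized encouragement).
z = [1, 1, 1, 1, 0, 0, 0, 0]
t = [1, 1, 1, 0, 1, 0, 0, 0]
y = [10.0, 10.0, 10.0, 6.0, 10.0, 6.0, 6.0, 6.0]

iv_effect = wald_estimate(z, t, y)
```

The estimate recovers the effect of 4 used to build the data. With real data the result is only as credible as the exclusion restriction, which cannot be verified from the data alone.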

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.