Inferensys

Glossary

Propensity Score Matching

Propensity score matching is a quasi-experimental method in causal inference that reduces selection bias by matching treated and untreated units with similar probabilities of receiving a treatment based on observed covariates.
CAUSAL INFERENCE

What is Propensity Score Matching?

Propensity score matching is a statistical technique for estimating causal effects from observational data by approximating the conditions of a randomized experiment. It reduces selection bias by matching units (e.g., users, patients) that received a 'treatment' (such as a new AI model) with comparable control units that did not. Units are matched on their propensity score: the estimated probability of receiving the treatment given their observed characteristics. The resulting comparison groups are balanced on observed covariates, which supports a more reliable estimate of the treatment effect, most commonly the Average Treatment Effect on the Treated (ATT).

In A/B testing frameworks, PSM is used for post-hoc analysis to validate results or to analyze non-randomized data, such as when users self-select into groups. The process involves estimating propensity scores (often via logistic regression), applying a matching algorithm (e.g., nearest neighbor), and checking covariate balance. The method is powerful, but its validity depends on the ignorability assumption: all confounding variables must be observed and included in the propensity model. Where that assumption is plausible, PSM is a useful tool in evaluation-driven development for CTOs assessing model impact.

CAUSAL INFERENCE METHOD

Key Characteristics of Propensity Score Matching

01

Bias Reduction via Balancing

The core function of propensity score matching is to reduce selection bias by creating a balanced comparison group. It does this by matching each treated unit with one or more control units that have a similar propensity score—the estimated probability of receiving the treatment given their observed covariates (e.g., age, income, prior behavior).

  • Goal: Make the treatment and control groups statistically comparable on observed variables, mimicking randomization.
  • Result: Differences in outcomes between the matched groups can be more credibly attributed to the treatment effect, not pre-existing differences.
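
To see why balancing matters, the following is a minimal simulation sketch, not taken from any real dataset. It assumes a single observed confounder `x` that drives both treatment uptake and the outcome, with a true treatment effect of 1.0. The coefficients, sample size, and seed are illustrative assumptions.

```python
import math
import random

def simulate(n=5000, true_effect=1.0, seed=0):
    """Simulate one confounder that drives both treatment uptake and outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                   # observed confounder
        p = 1.0 / (1.0 + math.exp(-2.0 * x))      # true propensity score
        t = 1 if rng.random() < p else 0          # self-selection into treatment
        y = true_effect * t + 3.0 * x + rng.gauss(0.0, 1.0)
        rows.append((x, t, y))
    return rows

def mean(values):
    return sum(values) / len(values)

rows = simulate()
treated = [r for r in rows if r[1] == 1]
control = [r for r in rows if r[1] == 0]

# Unadjusted comparison: mixes the treatment effect with the confounder gap.
naive_diff = mean([y for _, _, y in treated]) - mean([y for _, _, y in control])
covariate_gap = mean([x for x, _, _ in treated]) - mean([x for x, _, _ in control])

print(f"naive difference in means: {naive_diff:.2f} (true effect is 1.00)")
print(f"gap in confounder means:   {covariate_gap:.2f}")
```

Because treated units have systematically higher `x`, the unadjusted difference in means overstates the effect. Matching aims to close the covariate gap before outcomes are compared.
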
02

The Propensity Score (e(X))

The propensity score, denoted as e(X), is a single scalar summary of all observed pre-treatment covariates (X). It is defined as the conditional probability of assignment to a treatment given the observed covariates: e(X) = Pr(T=1 | X).

  • Estimation: Typically estimated using a logistic regression or a machine learning classifier (e.g., gradient boosting) where the treatment assignment is the dependent variable and covariates are predictors.
  • Role: According to the Rosenbaum-Rubin Theorem, if treatment assignment is strongly ignorable given X, then it is also strongly ignorable given the propensity score e(X). This allows for matching on a single dimension.
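
As an illustration of the estimation step, here is a small sketch that fits a logistic regression for Pr(T=1 | X) by gradient ascent using only the Python standard library. The function names, learning rate, epoch count, and the synthetic two-covariate data are assumptions made for this example. In practice a statistics or machine learning library would normally be used.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, t, lr=0.5, epochs=300):
    """Fit Pr(T=1 | X) by batch gradient ascent on the mean log-likelihood."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)                       # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, ti in zip(X, t):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = ti - p                      # gradient of the log-likelihood
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def propensity_scores(X, w):
    """Return e(X) = Pr(T=1 | X) for every row of X."""
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]

# Synthetic data (assumption): two covariates influence treatment uptake.
rng = random.Random(42)
X, t = [], []
for _ in range(1000):
    x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
    p_true = sigmoid(-0.5 + 1.0 * x1 - 0.5 * x2)
    X.append([x1, x2])
    t.append(1 if rng.random() < p_true else 0)

w = fit_logistic(X, t)
e_hat = propensity_scores(X, w)
print("fitted coefficients (intercept, x1, x2):", [round(v, 2) for v in w])
```

Each entry of `e_hat` is the single scalar that later matching steps operate on, in place of the full covariate vector.
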
03

Common Matching Algorithms

Once propensity scores are estimated, units are paired using specific algorithms. The choice affects the quality and variance of the causal estimate.

  • Nearest Neighbor Matching: Each treated unit is matched with the control unit whose propensity score is closest. Can be performed with or without replacement.
  • Caliper Matching: A tolerance (the caliper) is set, commonly 0.2 standard deviations of the logit of the propensity score. A match is made only if the score difference falls within the caliper, which improves match quality at the cost of leaving some treated units unmatched.
  • Stratification/Subclassification: Units are divided into strata (e.g., quintiles) based on their propensity score. The treatment effect is estimated within each stratum and then averaged.
  • Optimal Matching: Minimizes the total absolute distance across all matches, often producing more balanced samples than greedy nearest-neighbor.
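
A minimal sketch of greedy 1:1 nearest-neighbor matching with an optional caliper follows. The function name `match_nearest`, the dictionary-based inputs, and the toy scores are hypothetical. For simplicity the caliper here is expressed in raw propensity score units rather than standard deviations, and treated units are processed from highest score to lowest, which is one possible ordering heuristic among several.

```python
def match_nearest(treated, control, caliper=None, replace=False):
    """Greedy 1:1 nearest-neighbor matching on propensity scores.

    treated, control: dicts mapping a unit id to its propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)                 # controls still eligible
    pairs = []
    # Highest-scoring treated units usually have the fewest candidates,
    # so they are matched first.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if caliper is not None and abs(available[c_id] - t_score) > caliper:
            continue                          # no acceptable match: drop unit
        pairs.append((t_id, c_id))
        if not replace:
            del available[c_id]               # each control used at most once
    return pairs

treated = {"t1": 0.80, "t2": 0.55, "t3": 0.30}
control = {"c1": 0.78, "c2": 0.50, "c3": 0.52, "c4": 0.05}

print(match_nearest(treated, control))               # every treated unit matched
print(match_nearest(treated, control, caliper=0.1))  # t3 left unmatched
```

With the caliper, `t3` is dropped because its closest control is 0.20 away. This shows the trade-off: tighter calipers give closer matches but a smaller matched sample.
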
04

Assumption of Strong Ignorability

Propensity score matching relies on the critical Strong Ignorability or Unconfoundedness assumption. This has two parts:

  1. Conditional Independence: The potential outcomes (Y(1), Y(0)) are independent of the treatment assignment (T) given the observed covariates X. Formally: (Y(1), Y(0)) ⟂ T | X.
  2. Positivity/Overlap: For all possible values of X, there is a positive probability of receiving either treatment or control. Formally: 0 < Pr(T=1 | X) < 1.
  • Implication: Conditional independence means there are no unobserved confounders. Violating it (i.e., hidden bias) biases the causal estimate from PSM, and the assumption cannot be verified from the observed data alone.
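
The overlap condition, unlike conditional independence, can be inspected empirically. The sketch below uses a simple min/max common-support rule, which is one heuristic among several. The helper names and the toy scores are assumptions for illustration.

```python
def common_support(scores, treatment):
    """Return the (low, high) propensity score range covered by both groups."""
    treated = [s for s, t in zip(scores, treatment) if t == 1]
    control = [s for s, t in zip(scores, treatment) if t == 0]
    low = max(min(treated), min(control))
    high = min(max(treated), max(control))
    return low, high

def trim_to_support(scores, treatment):
    """Return indices of units whose score lies inside the common support."""
    low, high = common_support(scores, treatment)
    return [i for i, s in enumerate(scores) if low <= s <= high]

scores    = [0.05, 0.20, 0.35, 0.60, 0.40, 0.65, 0.90, 0.97]
treatment = [0,    0,    0,    0,    1,    1,    1,    1]

print(common_support(scores, treatment))   # score range shared by both groups
print(trim_to_support(scores, treatment))  # units kept for matching
```

Units outside the shared range have no comparable counterpart in the other group, so any effect estimated for them would rest on extrapolation rather than on matched comparisons.
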
05

Post-Matching Diagnostics

After matching, analysts must check if balance was achieved. This is a crucial validation step.

  • Standardized Mean Difference (SMD): The primary metric. For each covariate, calculate the difference in means between treated and control groups, divided by the pooled standard deviation. An SMD below 0.1 is typically considered good balance.
  • Variance Ratios: The ratio of variances for each covariate between groups should be close to 1.
  • Visual Checks: Examine plots like love plots (forest plots of SMDs before/after matching) and propensity score distribution histograms to assess overlap improvement.
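
The two numeric diagnostics above can be computed as follows. This is a sketch on made-up ages. The pooled standard deviation is taken as the square root of the average of the two group variances, which is one common convention and is an assumption here.

```python
import math
from statistics import mean, variance

def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2.0)
    return (mean(treated) - mean(control)) / pooled_sd

def variance_ratio(treated, control):
    """Ratio of sample variances; values near 1 indicate similar spread."""
    return variance(treated) / variance(control)

# Hypothetical covariate (age) before and after matching.
age_treated = [30, 34, 38, 42, 46]
age_control_before = [20, 24, 28, 32, 36]
age_control_after = [31, 33, 37, 43, 45]

smd_before = standardized_mean_difference(age_treated, age_control_before)
smd_after = standardized_mean_difference(age_treated, age_control_after)
vr_after = variance_ratio(age_treated, age_control_after)

print(f"SMD before matching: {smd_before:.2f}")
print(f"SMD after matching:  {smd_after:.2f}")
print(f"variance ratio after matching: {vr_after:.2f}")
```

In this toy case the SMD falls from well above 0.1 to below it, and the variance ratio stays close to 1, which is the pattern a successful match should show for every covariate.
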
06

Contrast with Randomized Experiments

PSM is a quasi-experimental method used when randomized controlled trials (RCTs) are infeasible, unethical, or too costly.

  • RCT Gold Standard: Random assignment ensures groups are balanced on both observed and unobserved covariates on average.
  • PSM Limitation: Only balances observed covariates. It cannot adjust for unobserved confounders, which remains its fundamental weakness.
  • Use Case: Commonly applied in observational studies in economics (e.g., evaluating job training programs), healthcare (e.g., drug effectiveness from electronic health records), and marketing (e.g., measuring campaign impact from customer data).
METHOD COMPARISON

Propensity Score Matching vs. Other Causal Methods

A technical comparison of propensity score matching against other primary methodologies for estimating causal effects from observational data, highlighting core assumptions, implementation complexity, and typical use cases.

| Feature / Dimension | Propensity Score Matching (PSM) | Regression Adjustment | Instrumental Variables (IV) | Difference-in-Differences (DiD) |
| --- | --- | --- | --- | --- |
| Primary Goal | Reduce selection bias by creating a balanced comparison group | Statistically control for confounding variables | Address unobserved confounding via an external instrument | Control for time-invariant unobserved confounding |
| Key Assumption | Conditional Independence (Ignorability) & Overlap | Correct model specification (linearity, no omitted variables) | Valid instrument: Relevant & Excludable | Parallel trends (groups would have evolved alike without treatment) |
| Handles Unobserved Confounders? | No (observed covariates only) | No (observed covariates only) | Yes, given a valid instrument | Partially (time-invariant confounders only) |
| Data Requirements | Rich observed covariates for matching | Rich observed covariates for modeling | A valid instrumental variable | Panel or repeated cross-sectional data |
| Implementation Complexity | Medium (matching algorithm, balance checks) | Low (standard regression) | High (instrument validation, 2SLS) | Medium (pre/post period construction) |
| Output | Estimated Average Treatment Effect on the Treated (ATT) | Conditional Average Treatment Effect (CATE) | Local Average Treatment Effect (LATE) | Average Treatment Effect on the Treated (ATT) |
| Common Use Case | Evaluating a medical treatment using patient records | Estimating price elasticity from sales data | Estimating effect of education on earnings using policy changes | Measuring impact of a new law across regions over time |
| Risk of Model Misspecification | Low to medium (matching is non-parametric, but the propensity model can be misspecified) | High (relies on functional form) | Medium (relies on IV assumptions) | Medium (relies on parallel trends) |

PROPENSITY SCORE MATCHING

Frequently Asked Questions

A quasi-experimental method for estimating causal effects from observational data by reducing selection bias.

Propensity score matching is a quasi-experimental method used in causal inference to estimate the effect of a treatment, policy, or intervention by reducing selection bias from observed confounding variables. It works by modeling the probability (the propensity score) that a unit (e.g., a user, patient, or customer) would receive the treatment based on its observed covariates. Treated units are then matched with untreated control units that have a similar propensity score, creating a balanced comparison group where the distribution of observed confounders is statistically equivalent. The average treatment effect on the treated is then estimated by comparing outcomes between the matched pairs, approximating the conditions of a randomized controlled trial.
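
The final step, comparing outcomes within matched pairs, can be sketched as below. The pair list and outcome values are hypothetical. The paired standard error shown treats the matches as fixed and ignores uncertainty from estimating the propensity scores, so it should be read as a rough approximation only.

```python
import math
from statistics import mean, stdev

def att_from_pairs(pairs, outcome):
    """Estimate the ATT as the mean within-pair outcome difference.

    pairs: list of (treated_id, control_id) tuples from a 1:1 match.
    outcome: dict mapping each unit id to its observed outcome.
    """
    diffs = [outcome[t_id] - outcome[c_id] for t_id, c_id in pairs]
    att = mean(diffs)
    # Naive paired standard error; undefined with fewer than two pairs.
    se = stdev(diffs) / math.sqrt(len(diffs)) if len(diffs) > 1 else float("nan")
    return att, se

pairs = [("t1", "c1"), ("t2", "c3"), ("t3", "c2")]
outcome = {"t1": 12.0, "t2": 9.0, "t3": 7.0, "c1": 10.0, "c2": 6.5, "c3": 8.0}

att, se = att_from_pairs(pairs, outcome)
print(f"estimated ATT: {att:.2f} (naive SE {se:.2f})")
```
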

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.