Recall (Sensitivity) is a classification performance metric that calculates the proportion of actual positive instances correctly identified by a model. Formally, it is defined as True Positives / (True Positives + False Negatives). A high recall score indicates the model is effective at finding most of the relevant cases, minimizing false negatives. This metric is critical in domains like medical diagnosis or fraud detection, where missing a positive case (a disease or fraudulent transaction) is costlier than a false alarm.
Recall (Sensitivity)

What is Recall (Sensitivity)?
Recall, also known as sensitivity or true positive rate, is a fundamental classification metric that measures a model's ability to identify all relevant instances of a positive class.
Recall exists in a fundamental trade-off with precision, which measures the model's exactness when it makes a positive prediction. This trade-off is visualized using a Precision-Recall curve. To balance both concerns, practitioners often use the F1 Score, the harmonic mean of precision and recall. Evaluating recall is essential within Evaluation-Driven Development to ensure models meet specific business requirements for completeness, especially when dealing with imbalanced datasets where the positive class is rare.
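
As a minimal sketch of the definition above, recall can be computed directly from paired true and predicted labels. The function name `recall_score`, the sample labels, and the choice to return 0.0 when there are no actual positives are illustrative assumptions, not part of any particular library.

```python
def recall_score(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN), computed over the actual positives only."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        # No actual positives: recall is undefined. Returning 0.0 is a
        # convention assumed for this sketch.
        return 0.0
    return tp / (tp + fn)


# Hypothetical labels: 4 actual positives, of which the model finds 3.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

recall = recall_score(y_true, y_pred)
print(recall)  # 0.75
```

The false positive at the fifth position does not affect the result, because recall only looks at rows where the true label is positive.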
Recall vs. Precision: Key Differences
A direct comparison of two fundamental classification metrics, highlighting their distinct purposes, mathematical definitions, and trade-offs in model evaluation.
| Feature | Recall (Sensitivity) | Precision |
|---|---|---|
| Core Definition | Proportion of actual positives correctly identified. | Proportion of positive predictions that are correct. |
| Primary Question | Of all the relevant items, how many did the model find? | Of all the items the model flagged, how many are actually relevant? |
| Mathematical Formula | TP / (TP + FN) | TP / (TP + FP) |
| Focus (Confusion Matrix) | False Negatives (FN) | False Positives (FP) |
| Ideal Use Case | When missing a positive case is costly (e.g., disease screening, fraud detection). | When a false alarm is costly (e.g., spam filtering, quality control). |
| Trade-off Relationship | Increasing recall typically decreases precision. | Increasing precision typically decreases recall. |
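| Sensitivity to Class Imbalance | Low; computed only over actual positives, so the score itself does not change with class prevalence. | High; when the positive class is rare, even a small false positive rate can outnumber the true positives. |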
| Alternative Names | Sensitivity, True Positive Rate (TPR), Hit Rate | Positive Predictive Value (PPV) |
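
The two formulas in the table respond differently as negatives become more numerous. The sketch below assumes a hypothetical classifier that always finds 90 of 100 positives and wrongly flags 5% of negatives, then grows the negative class; all counts are made up for illustration.

```python
def precision_and_recall(tp, fp, fn):
    """Apply the two formulas from the comparison table."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Assumed behaviour: 90 of 100 positives found, 5% of negatives flagged.
tp, fn = 90, 10

results = {}
for n_negatives in (100, 1_000, 10_000):
    fp = n_negatives * 5 // 100  # 5, 50, 500 false positives
    results[n_negatives] = precision_and_recall(tp, fp, fn)
    precision, recall = results[n_negatives]
    print(f"negatives={n_negatives:>6}  precision={precision:.3f}  recall={recall:.3f}")
```

Under these assumptions recall stays at 0.9 in every scenario, while precision falls as the negative class grows, since false positives are drawn from the negatives.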
Key Applications and Use Cases
Recall is a critical metric for evaluating a model's ability to identify all relevant instances of a target class. Its importance is paramount in domains where missing a positive case is more costly than a false alarm.
Medical Diagnostics & Disease Screening
Recall is the primary optimization target in life-critical diagnostic systems. A high-recall model minimizes false negatives, meaning fewer missed cases of disease.
- Example: A model screening for malignant tumors in radiology scans must prioritize finding all potential cancers, even at the cost of some false positives that can be ruled out by further tests.
- Trade-off: Optimizing for recall often involves lowering the classification threshold, accepting a higher rate of false positives to capture nearly all true positives.
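
The effect of lowering the classification threshold can be sketched on a handful of scores. The labels, scores, and the helper name `recall_and_false_positives` are hypothetical, chosen only to make the trade-off visible.

```python
def recall_and_false_positives(y_true, scores, threshold):
    """Predict positive when score >= threshold; return (recall, FP count)."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    return tp / (tp + fn), fp


# Hypothetical model scores for 4 positive and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]

default = recall_and_false_positives(y_true, scores, threshold=0.5)
lowered = recall_and_false_positives(y_true, scores, threshold=0.25)
print(default)  # (0.5, 1)
print(lowered)  # (1.0, 3)
```

On this toy data, dropping the threshold from 0.5 to 0.25 recovers the two missed positives at the price of two additional false alarms.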
Information Retrieval & Search Systems
In search and retrieval-augmented generation (RAG) pipelines, recall measures the system's ability to retrieve all relevant documents from a corpus for a given query.
- Core Function: A high-recall retrieval system ensures the foundational context provided to a language model is comprehensive, reducing the risk of answer omission or hallucination due to missing data.
- Evaluation: Recall@k (e.g., Recall@10) is a standard metric, measuring the proportion of relevant documents found within the top k retrieved results.
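
Recall@k can be sketched in a few lines. The document IDs and the function name `recall_at_k` are hypothetical, and the sketch assumes the full set of relevant documents for the query is known in advance.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        # No relevant documents exist for this query; 0.0 is an assumed convention.
        return 0.0
    top_k = retrieved_ids[:k]
    hits = len(relevant.intersection(top_k))
    return hits / len(relevant)


# Hypothetical ranked retrieval output and ground-truth relevant set.
retrieved = ["d7", "d2", "d9", "d4", "d1", "d5"]
relevant = {"d2", "d4", "d5", "d8"}

print(recall_at_k(retrieved, relevant, k=3))  # 0.25
print(recall_at_k(retrieved, relevant, k=5))  # 0.5
print(recall_at_k(retrieved, relevant, k=6))  # 0.75
```

Document `d8` is never retrieved, so Recall@k cannot reach 1.0 for this query no matter how large k is within the returned list.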
Fraud Detection & Cybersecurity
Security systems are tuned for high recall to flag all potential threats, as the cost of missing a single fraudulent transaction or intrusion can be catastrophic.
- Application: Anomaly detection models in network security or financial transaction monitoring are designed to be highly sensitive to suspicious patterns.
- Operational Reality: The high volume of alerts generated (false positives) is then triaged by secondary systems or human analysts, a workflow justified by the critical need for high recall.
Legal e-Discovery & Document Review
In legal proceedings, models used for electronic discovery must achieve near-perfect recall to ensure no exculpatory or inculpatory evidence is overlooked.
- Process: Machine learning classifiers sift through millions of documents to identify those relevant to a case. A missed document (false negative) could constitute a failure to produce evidence.
- Standard: Legal teams often require recall levels exceeding 95% before deeming an AI-assisted review process defensible in court.
Manufacturing Defect Inspection
Automated visual inspection systems on production lines are calibrated for high recall to prevent defective products from reaching customers.
- Quality Control: A model inspecting circuit boards or pharmaceutical packaging must catch all units with critical flaws, even if it means occasionally rejecting a functional item.
- Cost Analysis: The financial and reputational cost of a defective product in the field typically far outweighs the cost of scrapping or re-checking a small percentage of false positives.
The Precision-Recall Trade-off & F1 Score
Recall cannot be evaluated in isolation; it exists in a fundamental trade-off with precision. Improving recall usually reduces precision, as the model casts a wider net.
- Balancing Metric: The F1 Score is the harmonic mean of precision and recall, providing a single metric to optimize when both false negatives and false positives matter and neither should dominate.
- Strategic Choice: The optimal point on the Precision-Recall curve is determined by the specific business cost function: is missing a positive case (low recall) or acting on a false alarm (low precision) more expensive?
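
One way to make the strategic choice concrete is to assign a cost to each error type and pick the threshold with the lowest total. The sketch below assumes a simple linear cost model (cost per false negative times FN, plus cost per false positive times FP); the cost values, scores, and function names are hypothetical.

```python
def error_counts(y_true, scores, threshold):
    """Return (false negatives, false positives) at a given threshold."""
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    return fn, fp


def cheapest_threshold(y_true, scores, thresholds, cost_fn, cost_fp):
    """Pick the candidate threshold that minimizes cost_fn*FN + cost_fp*FP."""
    best = None
    for threshold in thresholds:
        fn, fp = error_counts(y_true, scores, threshold)
        cost = cost_fn * fn + cost_fp * fp
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]
candidates = [0.25, 0.5, 0.75]

# Missing a positive is 10x worse than a false alarm: favours high recall.
miss_costly = cheapest_threshold(y_true, scores, candidates, cost_fn=10, cost_fp=1)
# A false alarm is 10x worse than a miss: favours high precision.
alarm_costly = cheapest_threshold(y_true, scores, candidates, cost_fn=1, cost_fp=10)

print(miss_costly)   # (0.25, 3)
print(alarm_costly)  # (0.75, 3)
```

The same model and data lead to opposite operating points once the cost ratio is reversed, which is the sense in which the business cost function determines where to sit on the curve.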
Frequently Asked Questions
Recall, also known as sensitivity or true positive rate, is a fundamental classification metric for evaluating a model's ability to identify all relevant positive instances. These questions address its calculation, trade-offs, and application in real-world machine learning systems.
Recall, also known as sensitivity or the true positive rate (TPR), is a classification performance metric that measures the proportion of actual positive instances that a model correctly identifies. It is calculated as the number of true positives (TP) divided by the sum of true positives and false negatives (FN): Recall = TP / (TP + FN). This formula quantifies a model's ability to find all relevant cases, making it critical in domains where missing a positive instance is costly, such as medical diagnosis or fraud detection. A recall of 1.0 (or 100%) indicates the model found every single positive case in the dataset, while a recall of 0.0 means it found none.
Related Terms
Recall is a core component of a broader ecosystem of classification and evaluation metrics. Understanding its relationship to these complementary measures is essential for comprehensive model assessment.
Precision
Precision measures the exactness of a model's positive predictions. It is calculated as the ratio of true positives to all predicted positives (true positives + false positives). While recall asks "Of all the actual positives, how many did we find?", precision asks "Of all the items we labeled positive, how many are actually correct?"
- Key Trade-off: In many real-world scenarios (e.g., spam detection, medical screening), there is a direct trade-off between precision and recall. Increasing the classification threshold typically raises precision but lowers recall, and vice-versa.
- Use Case: High precision is critical when the cost of a false positive is high, such as in fraud detection where incorrectly flagging a legitimate transaction damages customer trust.
F1 Score
The F1 Score is the harmonic mean of precision and recall, providing a single metric that balances both concerns. It is especially useful for evaluating performance on imbalanced datasets where one class significantly outnumbers the other.
- Calculation: F1 = 2 * (Precision * Recall) / (Precision + Recall).
- Interpretation: A high F1 score indicates that the model has both good precision and good recall. It is more informative than accuracy when the class distribution is skewed.
- Application: Commonly used in information retrieval, document classification, and medical diagnostics where both false positives and false negatives carry significant cost.
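
A short sketch of the F1 calculation shows why the harmonic mean is used: it is pulled toward the weaker of the two values. The input precision and recall values are made up, and returning 0.0 when both are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """F1 = 2 * (Precision * Recall) / (Precision + Recall)."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both inputs are zero
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.8, 0.8)
lopsided = f1_score(1.0, 0.1)      # perfect precision, poor recall
arithmetic_mean = (1.0 + 0.1) / 2  # what a plain average would report

print(round(balanced, 3))         # 0.8
print(round(lopsided, 3))         # 0.182
print(round(arithmetic_mean, 3))  # 0.55
```

A model with perfect precision but 10% recall scores about 0.18 on F1, far below the 0.55 a simple average would suggest.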
Specificity (True Negative Rate)
Specificity, or the True Negative Rate, is the counterpart to recall for the negative class. It measures a model's ability to correctly identify negative instances. It is calculated as the ratio of true negatives to all actual negatives (true negatives + false positives).
- Formula: Specificity = TN / (TN + FP).
- Relationship to Recall: Recall (Sensitivity) focuses on the positive class; Specificity focuses on the negative class. Together, they provide a complete picture of a binary classifier's performance for each class.
- Critical Use: High specificity is paramount in tests where incorrectly labeling a healthy person as sick (a false positive) causes undue stress and further costly testing.
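
Sensitivity and specificity can be computed side by side from the four confusion-matrix counts. The screening numbers below are hypothetical.

```python
def sensitivity_and_specificity(tp, fp, tn, fn):
    """Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical screening of 1,000 people: 50 sick, 950 healthy.
tp, fn = 45, 5    # sick people correctly flagged / missed
tn, fp = 900, 50  # healthy people correctly cleared / wrongly flagged

sens, spec = sensitivity_and_specificity(tp, fp, tn, fn)
print(round(sens, 3))  # 0.9
print(round(spec, 3))  # 0.947
```

Each metric uses only one row of actual outcomes: sensitivity is computed over the 50 sick people, specificity over the 950 healthy ones.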
Confusion Matrix
A Confusion Matrix is the foundational table from which recall, precision, and other classification metrics are derived. It is a 2x2 (for binary classification) grid that summarizes the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
- Visual Foundation: All primary classification metrics are calculated directly from these four values. Recall = TP / (TP + FN).
- Diagnostic Tool: It allows for immediate visual diagnosis of a model's error patterns. A model with low recall will have a high count in the False Negative cell.
- Extension: For multi-class problems, the confusion matrix expands to an N x N table, enabling per-class calculation of recall (often called sensitivity for each class).
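
Per-class recall for a multi-class problem can be read off an N x N confusion matrix. The sketch assumes rows are actual classes and columns are predicted classes (conventions differ between tools), and the class names and counts are invented.

```python
def per_class_recall(matrix, labels):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    recalls = {}
    for i, label in enumerate(labels):
        actual_total = sum(matrix[i])  # row sum = TP + FN for this class
        recalls[label] = matrix[i][i] / actual_total if actual_total else 0.0
    return recalls


labels = ["cat", "dog", "bird"]
matrix = [
    [40, 8, 2],  # 50 actual cats
    [5, 30, 5],  # 40 actual dogs
    [1, 1, 8],   # 10 actual birds
]

recalls = per_class_recall(matrix, labels)
print(recalls)  # {'cat': 0.8, 'dog': 0.75, 'bird': 0.8}
```

The diagonal entry is the true positive count for each class, and every other entry in the same row is a false negative for that class.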
Precision-Recall Curve
A Precision-Recall (PR) Curve is a graphical plot that illustrates the trade-off between precision and recall for a binary classifier at different probability thresholds. It is particularly informative for imbalanced datasets where the positive class is rare.
- Interpretation: The curve shows how precision typically drops as recall is increased. The area under the PR curve (AUC-PR) is a single-number summary; a higher area indicates better overall performance across thresholds.
- Comparison to ROC: While the ROC curve plots sensitivity vs. (1 - specificity), the PR curve is often more telling when the class distribution is highly skewed, as it focuses directly on the performance concerning the positive (minority) class of interest.
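
The points of a PR curve can be generated by sweeping the threshold over the observed scores. This is a minimal sketch on hypothetical data; it assumes at least one positive label and is not meant to match any specific library's output format.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score, highest first."""
    total_positives = sum(y_true)  # assumes labels are 0/1 with at least one 1
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        # tp + fp >= 1 here because the threshold is itself one of the scores.
        points.append((threshold, tp / (tp + fp), tp / total_positives))
    return points


# Hypothetical scores, already sorted from most to least confident.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.85, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]

points = precision_recall_points(y_true, scores)
for threshold, precision, recall in points:
    print(f"threshold={threshold:.2f}  precision={precision:.3f}  recall={recall:.3f}")
```

On this data recall never decreases as the threshold is lowered, while precision moves up and down but ends at 0.5 once every sample is predicted positive.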
Sensitivity Analysis
Sensitivity Analysis in the context of model evaluation refers to systematically testing how a model's performance metrics, like recall, change in response to variations in input data, model parameters, or classification thresholds.
- Purpose: It assesses the robustness and stability of a model. For instance, analysts may measure how recall degrades when input data contains slight noise or when the model is applied to a slightly different population.
- Engineering Practice: This goes beyond calculating a static metric. It involves probing the model's behavior under different conditions to understand its operational boundaries and failure modes, which is a cornerstone of rigorous Evaluation-Driven Development.
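
One simple form of this analysis is to perturb a model's scores with increasing noise and watch how recall responds. The sketch below is an illustration under stated assumptions: Gaussian noise added to hypothetical scores, a fixed 0.5 threshold, and a seeded random generator for repeatability. A real study would more likely perturb the model inputs rather than its outputs.

```python
import random


def recall_at(y_true, scores, threshold=0.5):
    """Recall when predicting positive for score >= threshold."""
    positives = sum(1 for t in y_true if t == 1)
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    return tp / positives if positives else 0.0


def recall_under_noise(y_true, scores, noise_levels, trials=200, seed=0):
    """Average recall over repeated trials for each noise standard deviation."""
    rng = random.Random(seed)
    report = {}
    for sigma in noise_levels:
        total = 0.0
        for _ in range(trials):
            noisy = [s + rng.gauss(0.0, sigma) for s in scores]
            total += recall_at(y_true, noisy)
        report[sigma] = total / trials
    return report


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.45, 0.3, 0.2, 0.1]

report = recall_under_noise(y_true, scores, noise_levels=[0.0, 0.05, 0.2])
for sigma, mean_recall in report.items():
    print(f"noise sigma={sigma:.2f}  mean recall={mean_recall:.3f}")
```

With zero noise the baseline recall is reproduced exactly; the positives scored just above the threshold (0.55 and 0.6) are the ones most likely to be lost as noise increases.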

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.