Recall at K (R@K) is an information retrieval metric that calculates the proportion of all relevant documents for a query that are successfully found within the top K retrieved results. It is defined as (Relevant Documents Retrieved in Top K) / (Total Relevant Documents in Corpus). Unlike Precision at K (P@K), which focuses on the purity of the top results, R@K measures the system's coverage, making it critical for applications where missing a relevant item is costly, such as in legal e-discovery or comprehensive research.

What is Recall at K (R@K)?
Recall at K (R@K) is a core information retrieval metric used to evaluate the completeness of a search system by measuring its ability to retrieve all relevant items.
In Retrieval-Augmented Generation (RAG) evaluation, R@K assesses the retrieval component's effectiveness at surfacing necessary context for the language model. A high R@K indicates the retrieved set contains most of the information needed for an accurate answer, directly influencing downstream metrics like Answer Faithfulness and Grounding Score. It is often analyzed alongside Hit Rate and Mean Average Precision (MAP) to provide a complete picture of retrieval quality, balancing recall with result ranking.
Key Characteristics of R@K
Recall at K (R@K) is a fundamental metric for assessing the completeness of a retrieval system. It measures the system's ability to find all relevant information, not just the most relevant at the top of the list.
Definition & Core Formula
Recall at K (R@K) calculates the proportion of all relevant documents for a query that are successfully retrieved within the top K results. It is defined as:
R@K = (Number of Relevant Documents in Top K) / (Total Number of Relevant Documents in Corpus)
- A score of 1.0 indicates all relevant documents were found in the top K. This is only reachable when the query has at most K relevant documents.
- A score of 0.0 means none were found.
- It is query-specific and is typically averaged across a test set of queries.
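The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `recall_at_k`, the document IDs, and the choice to return 0.0 for a query with no relevant documents are all assumptions made for the example.

```python
from typing import Hashable, Sequence, Set


def recall_at_k(ranked_ids: Sequence[Hashable], relevant_ids: Set[Hashable], k: int) -> float:
    """Fraction of the query's relevant documents that appear in the top k results."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not relevant_ids:
        # Recall is undefined without relevant documents; 0.0 is a convention assumed here.
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Hypothetical query: 4 relevant documents exist, 2 of them are ranked in the top 5.
ranking = ["d7", "d2", "d9", "d4", "d1", "d3"]
relevant = {"d2", "d4", "d3", "d8"}
print(recall_at_k(ranking, relevant, k=5))  # 0.5
```

Note that "d8" is never retrieved, so no value of K brings this ranking to a score of 1.0.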
Trade-off with Precision
R@K is intrinsically linked to Precision at K (P@K). They represent a classic retrieval trade-off:
- High R@K, Low P@K: The system is casting a wide net, retrieving many documents to ensure it catches all relevant ones, but many results are irrelevant.
- Low R@K, High P@K: The system is highly selective, returning mostly relevant documents, but it misses many relevant ones.
In RAG systems, optimizing for R@K is often prioritized to ensure the LLM has access to all necessary context, even at the cost of some irrelevant noise that the model then has to filter out.
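The trade-off is easiest to see by computing both metrics on one ranking at several cut-offs. The sketch below uses a hypothetical 10-result ranking and assumes binary relevance, no duplicate results, and a non-empty set of relevant documents.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Return (P@K, R@K) for one query under binary relevance."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k, hits / len(relevant_ids)


# Hypothetical ranking of 10 results; the query has 4 relevant documents (a, b, c, d).
ranking = ["a", "b", "x1", "c", "x2", "x3", "x4", "d", "x5", "x6"]
relevant = {"a", "b", "c", "d"}

for k in (2, 4, 8):
    p, r = precision_recall_at_k(ranking, relevant, k)
    print(f"K={k}: P@K={p:.2f}, R@K={r:.2f}")
# K=2: P@K=1.00, R@K=0.50
# K=4: P@K=0.75, R@K=0.75
# K=8: P@K=0.50, R@K=1.00
```

Widening the cut-off from 2 to 8 doubles recall while halving precision, which is the wide-net behavior described above.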
The Role of K (Cut-off Rank)
The value of K is a critical hyperparameter that determines the scope of evaluation:
- Small K (e.g., 1, 3, 5): Measures the system's ability to place the most critical, highly relevant documents at the very top. Common for user-facing search where only the first page of results matters.
- Large K (e.g., 10, 50, 100): Assesses the system's overall recall capability, which is crucial for RAG. Many RAG pipelines pass roughly 5-20 passages to the LLM, so R@10 or R@20 is a common benchmark for checking that sufficient grounding material is retrieved.
Plotting R@K as K increases creates a recall curve, showing how quickly the system accumulates relevant documents.
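A recall curve can be computed in a single pass over the ranking. The following is a minimal sketch with hypothetical document IDs; it assumes the query has at least one relevant document and leaves plotting to whatever charting tool is in use.

```python
def recall_curve(ranked_ids, relevant_ids, max_k):
    """Return [R@1, R@2, ..., R@max_k] computed in one pass over the ranking."""
    curve, hits = [], 0
    for rank in range(1, max_k + 1):
        # Positions beyond the end of the ranking contribute no new hits.
        if rank <= len(ranked_ids) and ranked_ids[rank - 1] in relevant_ids:
            hits += 1
        curve.append(hits / len(relevant_ids))
    return curve


ranking = ["d3", "d9", "d1", "d7", "d4", "d2"]
relevant = {"d1", "d2", "d3", "d4"}
print(recall_curve(ranking, relevant, max_k=6))  # [0.25, 0.25, 0.5, 0.5, 0.75, 1.0]
```

The curve never decreases as K grows, and a steep early rise indicates that relevant documents are concentrated near the top of the ranking.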
Application in RAG Evaluation
In Retrieval-Augmented Generation, R@K is a leading indicator of answer quality and faithfulness.
- High R@K suggests the generator has access to the facts needed to produce a correct, grounded answer.
- Low R@K directly limits answer quality, as missing context leads to hallucinations or incomplete responses.
It is often evaluated alongside Answer Faithfulness and Answer Relevance. A drop in those metrics with high R@K indicates the generator is failing to properly utilize the retrieved context.
Limitations and Considerations
While essential, R@K has key limitations:
- Requires a Labeled Corpus: You must know the total number of relevant documents per query, which can be expensive to annotate.
- Binary Relevance: Standard R@K treats relevance as a binary (yes/no). It does not account for the graded relevance of documents (e.g., highly vs. partially relevant), a nuance captured by metrics like NDCG@K.
- Ignores Rank Order: It does not reward systems for placing the most relevant document first. [Relevant, Irrelevant, Relevant] and [Irrelevant, Relevant, Relevant] both yield the same R@3 score, despite the first being more user-friendly.
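The rank-order limitation can be checked directly on the two rankings mentioned above. In this sketch each ranking is a list of per-position relevance flags, and the query is assumed to have exactly two relevant documents. The helper `first_relevant_rank` is a hypothetical name used only to show what R@3 does not capture.

```python
def recall_from_flags(flags, total_relevant, k):
    """R@K when the ranking is given as per-position relevance flags."""
    return sum(flags[:k]) / total_relevant


def first_relevant_rank(flags):
    """1-based rank of the first relevant result, or None if there is none."""
    for rank, is_relevant in enumerate(flags, start=1):
        if is_relevant:
            return rank
    return None


ranking_a = [True, False, True]   # [Relevant, Irrelevant, Relevant]
ranking_b = [False, True, True]   # [Irrelevant, Relevant, Relevant]
total_relevant = 2                # assumed for this example

print(recall_from_flags(ranking_a, total_relevant, 3))  # 1.0
print(recall_from_flags(ranking_b, total_relevant, 3))  # 1.0
print(first_relevant_rank(ranking_a), first_relevant_rank(ranking_b))  # 1 2
```

Both rankings score identically on R@3 even though only the first puts a relevant document at rank 1.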
Related Metrics & Context
R@K is one piece of a comprehensive retrieval evaluation suite:
- Precision at K (P@K): The companion metric for measuring result purity.
- Mean Average Precision (MAP): Averages precision values at the ranks of each relevant document, combining recall and rank-sensitivity.
- Hit Rate @K: A simpler, binary metric: for what fraction of queries was at least one relevant document found in the top K?
- F1 Score @K: The harmonic mean of P@K and R@K, providing a single balanced score when both dimensions are equally important.
Choosing the right K and combining these metrics provides a complete picture of retrieval health.
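As an illustration of how P@K and R@K combine, F1 @K can be sketched as follows. The function name and the example data are hypothetical, and returning 0.0 when there are no hits is a convention assumed here to avoid dividing by zero.

```python
def f1_at_k(ranked_ids, relevant_ids, k):
    """Harmonic mean of P@K and R@K for one query under binary relevance."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids)
    if precision + recall == 0:
        return 0.0  # assumed convention when nothing relevant is retrieved
    return 2 * precision * recall / (precision + recall)


ranking = ["a", "x", "b", "y", "z"]
relevant = {"a", "b", "c", "d"}
# P@5 = 2/5 = 0.4 and R@5 = 2/4 = 0.5
print(round(f1_at_k(ranking, relevant, k=5), 3))  # 0.444
```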
R@K vs. Precision at K (P@K)
A comparison of two core information retrieval metrics used to evaluate the quality of ranked results, particularly in RAG systems.
| Metric / Feature | Recall at K (R@K) | Precision at K (P@K) |
|---|---|---|
| Primary Focus | Completeness of retrieval | Purity of the top results |
| Core Question Answered | Did we find all relevant items? | Are the top results good? |
| Formula | Relevant items in top K / Total relevant items | Relevant items in top K / K |
| Range of Values | 0 to 1 | 0 to 1 |
| Sensitivity to K | Increases or stays the same as K increases | Generally decreases as K increases |
| Use Case Priority | High recall is critical when missing relevant documents is costly (e.g., legal discovery, systematic reviews). | High precision is critical when user attention is limited and top results must be highly relevant (e.g., web search, chatbot responses). |
| Trade-off Relationship | Optimizing for R@K often requires retrieving more items (increasing K), which can hurt P@K. | Optimizing for P@K favors highly confident results, which can miss relevant items, hurting R@K. |
| Interpretation of Score = 1 | All relevant documents for the query are contained within the top K results. | Every single one of the top K retrieved documents is relevant to the query. |
| Interpretation of Score = 0 | No relevant documents are found in the top K results. | None of the top K retrieved documents are relevant. |
| Typical Reporting | Often reported for multiple K values (e.g., R@5, R@10, R@100) to show recall progression. | Often reported for a specific, user-facing K (e.g., P@1, P@5, P@10) to reflect immediate user experience. |
Application in RAG Systems
Recall at K (R@K) is a fundamental metric for assessing the retrieval component of a Retrieval-Augmented Generation system. It quantifies the system's ability to find all relevant information, which is critical for ensuring the language model has sufficient context to generate a complete and accurate answer.
Core Retrieval Performance
Recall at K (R@K) directly measures the coverage of a RAG system's retriever. For a given query, it calculates the fraction of all relevant documents in the knowledge base that are successfully fetched and placed within the top K results returned to the generator.
- High R@K indicates the retriever is effective at surfacing most relevant context, giving the LLM a high chance of having the facts needed for a faithful answer.
- Low R@K signals a major risk of answer incompleteness or hallucination, as critical information is missing from the context window.
Trade-off with Precision at K
In practice, optimizing a RAG system involves balancing Recall at K (R@K) against Precision at K (P@K).
- R@K focuses on completeness: "Did we find all the relevant chunks?"
- P@K focuses on purity: "Are the top K results all relevant?"
A retriever configured for maximum recall (e.g., by fetching a large K) may include irrelevant documents, increasing context noise. Engineers must find the K value and retrieval strategy that provide sufficient recall for answer quality without overwhelming the LLM with excessive, irrelevant context that hurts efficiency and cost.
Setting the K Parameter
The choice of K is a critical engineering decision with direct implications for system performance and cost.
- Small K (e.g., 3-5): Evaluates the retriever's ability to be highly precise at the top ranks. Reaching high recall at such a small K usually requires a very high-quality embedding model or re-ranker.
- Large K (e.g., 10-20): Assesses the retriever's broad coverage. This is common in a two-stage retrieve-then-rerank architecture, where a fast, high-recall first stage fetches many candidates (e.g., K=20) and a slower, high-precision cross-encoder re-ranks them down to a smaller final set.
Connection to Answer Faithfulness
Recall at K is a leading indicator for the downstream metric of Answer Faithfulness. If the retriever fails to recall a key document (low R@K), the language model lacks the necessary grounding and is forced to generate a response based on its parametric knowledge, drastically increasing the probability of an unsupported hallucination.
Monitoring R@K in production, alongside faithfulness scores, helps isolate failures to the retrieval stage versus the generation stage, enabling more targeted improvements.
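One way to act on this is a simple triage rule that looks at both scores for a logged query. The sketch below is only illustrative: the function name and both threshold values are assumptions chosen for the example, not recommended settings, and real thresholds would need to be calibrated on labeled data.

```python
def triage_failure(recall_at_k, faithfulness, recall_floor=0.8, faithfulness_floor=0.9):
    """Attribute a poor response to the retrieval or generation stage.

    Both floors are hypothetical defaults for illustration only.
    """
    if recall_at_k < recall_floor:
        # Needed context never reached the generator, so look at retrieval first.
        return "retrieval"
    if faithfulness < faithfulness_floor:
        # Context was retrieved but the answer is not supported by it.
        return "generation"
    return "ok"


print(triage_failure(recall_at_k=0.4, faithfulness=0.95))  # retrieval
print(triage_failure(recall_at_k=0.9, faithfulness=0.60))  # generation
```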
Benchmarking & Evaluation
To calculate R@K, a labeled evaluation set with query-document relevance judgments is required. The standard calculation is:
R@K = (Number of relevant documents retrieved in top K) / (Total number of relevant documents for the query)
- It is typically reported as an average across a diverse set of test queries.
- It is used alongside NDCG@K and MAP to provide a holistic view of retrieval quality, as R@K treats all relevant documents equally regardless of their rank within the top K.
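Averaging over a test set can be sketched as below. The dictionary layout (`qrels` mapping each query ID to its set of relevant document IDs, `run` mapping each query ID to its ranked results) is an assumption for this example, as is the choice to skip queries that have no relevant documents.

```python
from statistics import mean


def mean_recall_at_k(run, qrels, k):
    """Average R@K over all queries that have at least one relevant document."""
    scores = []
    for query_id, relevant_ids in qrels.items():
        if not relevant_ids:
            continue  # recall is undefined for this query, so it is skipped
        top_k = run.get(query_id, [])[:k]  # a missing query counts as an empty ranking
        scores.append(len(set(top_k) & relevant_ids) / len(relevant_ids))
    return mean(scores) if scores else 0.0


# Hypothetical relevance judgments and system output for three queries.
qrels = {"q1": {"d1", "d2"}, "q2": {"d5"}, "q3": {"d7", "d8", "d9", "d10"}}
run = {"q1": ["d1", "d3", "d2"], "q2": ["d4", "d6", "d5"], "q3": ["d7", "d1", "d8"]}

# Per-query R@2 is 0.5, 0.0 and 0.25.
print(mean_recall_at_k(run, qrels, k=2))  # 0.25
```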
Improving Recall at K
Several technical strategies can directly improve a RAG system's R@K metric:
- Hybrid Search: Combining dense vector search (for semantic recall) with sparse keyword search (for exact term recall).
- Query Expansion: Using the LLM to generate multiple related queries or hypothetical answers to broaden the search.
- Chunking Optimization: Experimenting with overlapping text chunks and varying chunk sizes to prevent relevant information from being split across boundaries.
- Embedding Model Fine-tuning: Adapting the retriever's embedding model on domain-specific data to better align query and document representations.
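To make the hybrid search idea concrete, the sketch below merges a dense and a sparse ranking with reciprocal rank fusion (RRF). RRF is one common fusion method chosen here for illustration; the text above does not prescribe a specific one. The constant `c=60` is a conventional default, and the rankings and relevance labels are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, c=60):
    """Merge several ranked lists by summing 1 / (c + rank) for each document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Sort by descending score; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def recall_at_k(ranked_ids, relevant_ids, k):
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


dense = ["d1", "d2", "d3"]    # hypothetical dense (vector) results
sparse = ["d4", "d1", "d5"]   # hypothetical sparse (keyword) results
relevant = {"d2", "d4"}

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['d1', 'd4', 'd2', 'd3', 'd5']
print(recall_at_k(dense, relevant, 3), recall_at_k(sparse, relevant, 3))  # 0.5 0.5
print(recall_at_k(fused, relevant, 3))  # 1.0
```

In this toy case each retriever finds only one of the two relevant documents, while the fused top 3 contains both.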
Frequently Asked Questions
Essential questions and answers about Recall at K (R@K), a core metric for evaluating the completeness of document retrieval in systems like Retrieval-Augmented Generation (RAG).
What is Recall at K (R@K)?
Recall at K (R@K) is an information retrieval metric that measures the proportion of all relevant documents for a query that are successfully retrieved within the top K results returned by a system. It is defined as:
Recall@K = (Number of relevant documents retrieved in the top K) / (Total number of relevant documents in the corpus)
For example, if there are 10 relevant documents in total for a query and your system retrieves 7 of them within the top 20 results, then Recall@20 = 0.7 (or 70%). A perfect score of 1.0 indicates that all relevant documents were found within the top K, making it a critical measure of retrieval completeness.
Related Terms
Recall at K (R@K) is one of several core metrics used to evaluate the performance of retrieval systems. The following terms define complementary and related measures for assessing retrieval quality, answer generation, and overall system effectiveness.
Precision at K (P@K)
Precision at K (P@K) measures the exactness of a retrieval system by calculating the proportion of relevant documents among the top K retrieved results for a single query. While R@K focuses on finding all relevant items, P@K focuses on the purity of the retrieved list.
- Formula: P@K = (Number of relevant docs in top K) / K
- Interpretation: A high P@K indicates the system returns few irrelevant documents within the top results.
- Trade-off with R@K: Optimizing for high P@K can sometimes lower R@K, as the system becomes more conservative. The choice of K significantly impacts the metric's value.
Mean Average Precision (MAP)
Mean Average Precision (MAP) provides a single-figure measure of quality across multiple queries by averaging the Average Precision (AP) scores for each query. AP itself computes a precision value at each rank position where a relevant document is found, rewarding systems that rank relevant documents higher.
- Use Case: The standard metric for evaluating ranked retrieval in tasks like document search and recommendation.
- Advantage over P@K/R@K: Accounts for the rank order of all relevant documents, not just a binary count within a cutoff.
- Calculation: For each query, calculate AP (the average of P@K values at each relevant document's rank), then average AP across all queries.
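The calculation described above can be sketched as follows. This version divides by the total number of relevant documents for the query, so relevant documents that are never retrieved lower the score. The data layout and names are assumptions for the example.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values taken at each relevant document's rank."""
    if not relevant_ids:
        return 0.0  # assumed convention for queries without relevant documents
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(run, qrels):
    """Mean of per-query AP; a query missing from `run` scores 0."""
    aps = [average_precision(run.get(q, []), rel) for q, rel in qrels.items()]
    return sum(aps) / len(aps) if aps else 0.0


qrels = {"q1": {"d1", "d2"}, "q2": {"d5"}}
run = {"q1": ["d1", "x", "d2"], "q2": ["x", "d5"]}
# AP(q1) = (1/1 + 2/3) / 2 and AP(q2) = (1/2) / 1
print(round(mean_average_precision(run, qrels), 3))  # 0.667
```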
Normalized Discounted Cumulative Gain (NDCG)
Normalized Discounted Cumulative Gain (NDCG) evaluates ranked lists where items have graded relevance (e.g., scores of 0, 1, 2, 3), not just binary relevance. It applies a logarithmic discount to relevance scores based on their rank, emphasizing top positions.
- Core Concept: Gain is the relevance score of an item. Cumulative Gain sums these scores. Discounted Cumulative Gain (DCG) reduces the gain of items lower in the list. NDCG normalizes DCG by the ideal DCG, producing a score between 0 and 1.
- Application: Essential for evaluating web search, recommendation engines, and any system where result usefulness is not binary.
- Difference from R@K: NDCG handles multi-level relevance and rank sensitivity, whereas R@K uses binary relevance and a fixed cutoff.
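A minimal NDCG@K sketch is shown below, assuming linear gain (the relevance score itself) and a log2 discount. One simplification to note: the ideal ordering is built from the gains of the retrieved list only, whereas a full evaluation would build it from all judged documents for the query.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the same gains sorted in the ideal order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance (0-3) of the results in ranked order.
gains = [3, 2, 3, 0, 1]
print(ndcg_at_k(gains, k=5))
print(ndcg_at_k(sorted(gains, reverse=True), k=5))  # 1.0 for an ideally ordered list
```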
Hit Rate
Hit Rate is a binary, query-level metric that measures the proportion of queries for which at least one relevant document is found within the top K retrieved results. It is closely related to R@K but provides a coarser, system-level success measure.
- Formula: Hit Rate = (Number of queries with ≥1 relevant doc in top K) / (Total number of queries)
- Interpretation: Answers the question: "For what percentage of user queries did we find something useful?"
- Utility: A critical baseline metric for production RAG systems; a low Hit Rate indicates fundamental retrieval failure for many queries, regardless of answer generation quality.
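The Hit Rate formula can be sketched in the same style. The `run` and `qrels` dictionaries are hypothetical test data, with `qrels` mapping each query to its set of relevant document IDs.

```python
def hit_rate_at_k(run, qrels, k):
    """Fraction of queries with at least one relevant document in the top k."""
    if not qrels:
        return 0.0
    hits = sum(
        1 for query_id, relevant_ids in qrels.items()
        if set(run.get(query_id, [])[:k]) & relevant_ids
    )
    return hits / len(qrels)


qrels = {"q1": {"d1", "d2"}, "q2": {"d5"}, "q3": {"d7", "d8", "d9", "d10"}}
run = {"q1": ["d1", "d3", "d2"], "q2": ["d4", "d6", "d5"], "q3": ["d7", "d1", "d8"]}

print(hit_rate_at_k(run, qrels, k=2))  # 2 of 3 queries have a hit in the top 2
print(hit_rate_at_k(run, qrels, k=3))  # 1.0
```

Query q3 counts as a full hit at K=3 even though only two of its four relevant documents are retrieved, which is why Hit Rate is the coarser of the two measures.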
Context Relevance
Context Relevance is a quality metric for the text passages (the "context") retrieved and passed to a Large Language Model (LLM) in a RAG pipeline. It assesses whether the provided context is pertinent, non-redundant, and contains sufficient information to answer the query.
- Evaluation Focus: Measures the utility of the retrieved documents themselves, independent of the LLM's final answer. It is a direct upstream determinant of Answer Faithfulness.
- Calculation: Often evaluated using an LLM-as-a-judge, prompted to score the relevance of the context to the query on a Likert scale (e.g., 1-5).
- Relationship to R@K: While R@K measures the coverage of relevant documents, Context Relevance measures the quality and conciseness of the specific subset of documents selected for the LLM's context window.
Answer Faithfulness
Answer Faithfulness (or Factual Consistency) evaluates the extent to which a generated answer is factually consistent with and logically entailed by the provided source context. It directly measures hallucination within the bounds of the given sources.
- Core Question: "Is every claim in this answer supported by the provided context?"
- Measurement: Typically calculated by using an LLM or a natural language inference (NLI) model to check if each atomic statement in the answer can be inferred from the context. The score is the proportion of supported statements.
- Critical Dependency: Faithfulness is scored against the retrieved context, so answers that are both faithful and complete depend on high Recall at K (R@K) and high Context Relevance. If the necessary facts are not retrieved (low R@K), the model must either fall back on its parametric knowledge, which risks hallucination, or provide an incomplete answer.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.