Inferensys

Recall at K (R@K)

Recall at K (R@K) is an information retrieval metric that calculates the proportion of all relevant documents for a query that are found within the top K retrieved results.
RAG EVALUATION METRIC

What is Recall at K (R@K)?

Recall at K (R@K) is a core information retrieval metric used to evaluate the completeness of a search system by measuring its ability to retrieve all relevant items.

Recall at K (R@K) is an information retrieval metric that calculates the proportion of all relevant documents for a query that are successfully found within the top K retrieved results. It is defined as (Relevant Documents Retrieved in Top K) / (Total Relevant Documents in Corpus). Unlike Precision at K (P@K), which focuses on the purity of the top results, R@K measures the system's coverage, making it critical for applications where missing a relevant item is costly, such as in legal e-discovery or comprehensive research.

In Retrieval-Augmented Generation (RAG) evaluation, R@K assesses the retrieval component's effectiveness at surfacing necessary context for the language model. A high R@K indicates the retrieved set contains most of the information needed for an accurate answer, directly influencing downstream metrics like Answer Faithfulness and Grounding Score. It is often analyzed alongside Hit Rate and Mean Average Precision (MAP) to provide a complete picture of retrieval quality, balancing recall with result ranking.

RAG EVALUATION METRICS

Key Characteristics of R@K

Recall at K (R@K) is a fundamental metric for assessing the completeness of a retrieval system. It measures the system's ability to find all relevant information, not just the most relevant items at the top of the list.

01

Definition & Core Formula

Recall at K (R@K) calculates the proportion of all relevant documents for a query that are successfully retrieved within the top K results. It is defined as:

R@K = (Number of Relevant Documents in Top K) / (Total Number of Relevant Documents in Corpus)

  • A score of 1.0 indicates all relevant documents were found in the top K.
  • A score of 0.0 means none were found.
  • It is query-specific and is typically averaged across a test set of queries.
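The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `recall_at_k` and the document IDs are hypothetical, and the ranking is assumed to contain no duplicate IDs.

```python
from typing import Hashable, Iterable, Sequence


def recall_at_k(
    ranked_ids: Sequence[Hashable],
    relevant_ids: Iterable[Hashable],
    k: int,
) -> float:
    """Fraction of the relevant documents that appear in the top k of a ranking."""
    if k < 0:
        raise ValueError("k must be non-negative")
    relevant = set(relevant_ids)
    if not relevant:
        # With zero relevant documents the denominator is 0, so recall is undefined.
        raise ValueError("recall is undefined for a query with no relevant documents")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Toy data (hypothetical IDs): 4 relevant documents, 2 of them ranked in the top 5.
ranking = ["d7", "d2", "d9", "d4", "d1", "d3", "d8"]
relevant = {"d2", "d3", "d4", "d5"}

print(recall_at_k(ranking, relevant, k=5))  # 0.5
```

Note that "d5" is never retrieved, so this ranking cannot reach a score of 1.0 at any K.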
02

Trade-off with Precision

R@K is intrinsically linked to Precision at K (P@K). They represent a classic retrieval trade-off:

  • High R@K, Low P@K: The system is casting a wide net, retrieving many documents to ensure it catches all relevant ones, but many results are irrelevant.
  • Low R@K, High P@K: The system is highly selective, returning mostly relevant documents, but it misses many relevant ones.

In RAG systems, R@K is often prioritized so that the LLM has access to all necessary context, even at the cost of some irrelevant passages in the context that the model must then ignore.
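The two regimes can be reproduced on a single toy ranking by changing only K. The sketch below assumes binary relevance, unique document IDs, and a ranking with at least K entries; the helper name and the data are made up for illustration.

```python
def precision_and_recall_at_k(ranked_ids, relevant_ids, k):
    """Return (P@K, R@K) for one query, using binary relevance."""
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    # P@K divides by K, R@K divides by the number of relevant documents.
    return hits / k, hits / len(relevant)


# Hypothetical ranking of 10 documents; "a", "b", "c", "d" are the relevant ones.
ranking = ["a", "x", "b", "y", "z", "c", "w", "v", "d", "u"]
relevant = {"a", "b", "c", "d"}

for k in (1, 10):
    p, r = precision_and_recall_at_k(ranking, relevant, k)
    print(f"K={k:>2}  P@K={p:.2f}  R@K={r:.2f}")
# K= 1  P@K=1.00  R@K=0.25
# K=10  P@K=0.40  R@K=1.00
```

At K=1 the result list is pure but misses three of the four relevant documents; at K=10 every relevant document is present, but six of the ten results are noise.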

03

The Role of K (Cut-off Rank)

The value of K is a critical hyperparameter that determines the scope of evaluation:

  • Small K (e.g., 1, 3, 5): Measures the system's ability to place the most critical, highly relevant documents at the very top. Common for user-facing search where only the first page of results matters.
  • Large K (e.g., 10, 50, 100): Assesses the system's overall recall capability, which is crucial for RAG. Pipelines commonly pass on the order of 5-20 passages to the LLM, so R@10 or R@20 is a common benchmark for checking that sufficient grounding material is retrieved.

Plotting R@K as K increases creates a recall curve, showing how quickly the system accumulates relevant documents.
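A recall curve for one query can be computed in a single pass over the ranking. The following is a small sketch with hypothetical names and data; it returns the values to plot rather than drawing the chart, and the curve is shorter than `max_k` if the ranking itself is shorter.

```python
def recall_curve(ranked_ids, relevant_ids, max_k):
    """Return [R@1, R@2, ..., R@max_k] for a single query."""
    relevant = set(relevant_ids)
    curve, hits = [], 0
    for doc_id in ranked_ids[:max_k]:
        if doc_id in relevant:
            hits += 1
        # The hit count never drops, so the curve never decreases.
        curve.append(hits / len(relevant))
    return curve


# Hypothetical data: 4 relevant documents, one of which ("q") is never retrieved.
ranking = ["a", "x", "b", "y", "c"]
relevant = {"a", "b", "c", "q"}

print(recall_curve(ranking, relevant, max_k=5))  # [0.25, 0.25, 0.5, 0.5, 0.75]
```

A curve that rises steeply at small K means the relevant documents are concentrated near the top; a curve that plateaus below 1.0, as here, means some relevant documents are not retrieved at all within the evaluated depth.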

04

Application in RAG Evaluation

In Retrieval-Augmented Generation, R@K is a leading indicator of answer quality and faithfulness.

  • High R@K suggests the generator has access to the facts needed to produce a correct, grounded answer.
  • Low R@K directly limits answer quality, as missing context leads to hallucinations or incomplete responses.

It is often evaluated alongside Answer Faithfulness and Answer Relevance. A drop in those metrics with high R@K indicates the generator is failing to properly utilize the retrieved context.

05

Limitations and Considerations

While essential, R@K has key limitations:

  • Requires a Labeled Corpus: You must know the total number of relevant documents per query, which can be expensive to annotate.
  • Binary Relevance: Standard R@K treats relevance as a binary (yes/no). It does not account for the graded relevance of documents (e.g., highly vs. partially relevant), a nuance captured by metrics like NDCG@K.
  • Ignores Rank Order: It does not reward systems for placing the most relevant document first. [Relevant, Irrelevant, Relevant] and [Irrelevant, Relevant, Relevant] both yield the same R@3 score, despite the first being more user-friendly.
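The rank-order limitation can be checked directly on the two example rankings. The sketch below assumes those three results contain both of the query's relevant documents (so the total is 2), and it uses average precision as one example of a rank-sensitive alternative; the function names are illustrative.

```python
def recall_at_k(flags, total_relevant, k):
    """flags[i] is True if the result at rank i+1 is relevant."""
    return sum(flags[:k]) / total_relevant


def average_precision(flags, total_relevant):
    """Mean of the precision values measured at the rank of each relevant result."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


ranking_a = [True, False, True]   # [Relevant, Irrelevant, Relevant]
ranking_b = [False, True, True]   # [Irrelevant, Relevant, Relevant]

print(recall_at_k(ranking_a, 2, k=3), recall_at_k(ranking_b, 2, k=3))  # 1.0 1.0
print(round(average_precision(ranking_a, 2), 3))  # 0.833
print(round(average_precision(ranking_b, 2), 3))  # 0.583
```

R@3 cannot tell the two rankings apart, while average precision rewards the one that places a relevant document first.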
06

Related Metrics & Context

R@K is one piece of a comprehensive retrieval evaluation suite:

  • Precision at K (P@K): The companion metric for measuring result purity.
  • Mean Average Precision (MAP): For each query, averages the precision values at the rank of each relevant document, then takes the mean across queries, combining recall and rank-sensitivity.
  • Hit Rate @K: A simpler, binary metric: for what fraction of queries was at least one relevant document found in the top K?
  • F1 Score @K: The harmonic mean of P@K and R@K, providing a single balanced score when both dimensions are equally important.

Choosing the right K and combining these metrics provides a complete picture of retrieval health.
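Two of the companion metrics listed above, Hit Rate @K and F1 Score @K, are simple enough to sketch alongside R@K. The function names and the two-query dataset below are hypothetical, and rankings are assumed to have unique IDs and at least K entries.

```python
def hit_rate_at_k(rankings, relevant_sets, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(rankings, relevant_sets)
        if set(ranked[:k]) & set(relevant)
    )
    return hits / len(rankings)


def f1_at_k(ranked_ids, relevant_ids, k):
    """Harmonic mean of P@K and R@K for a single query."""
    relevant = set(relevant_ids)
    hits = len(set(ranked_ids[:k]) & relevant)
    if hits == 0:
        return 0.0  # both precision and recall are 0
    precision, recall = hits / k, hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: query 1 finds one of two relevant docs, query 2 finds none.
rankings = [["a", "x", "y"], ["p", "q", "r"]]
relevant_sets = [{"a", "b"}, {"z"}]

print(hit_rate_at_k(rankings, relevant_sets, k=3))             # 0.5
print(round(f1_at_k(rankings[0], relevant_sets[0], k=3), 3))   # 0.4
```

Hit Rate only asks whether anything relevant surfaced, so it saturates quickly; R@K keeps distinguishing systems after the first hit.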

RETRIEVAL METRIC COMPARISON

R@K vs. Precision at K (P@K)

A comparison of two core information retrieval metrics used to evaluate the quality of ranked results, particularly in RAG systems.

| Metric / Feature | Recall at K (R@K) | Precision at K (P@K) |
| --- | --- | --- |
| Primary Focus | Completeness of retrieval | Purity of the top results |
| Core Question Answered | Did we find all relevant items? | Are the top results good? |
| Formula | Relevant items in top K / Total relevant items | Relevant items in top K / K |
| Range of Values | 0 to 1 | 0 to 1 |
| Sensitivity to K | Increases or stays the same as K increases | Generally decreases as K increases |
| Use Case Priority | High recall is critical when missing relevant documents is costly (e.g., legal discovery, systematic reviews). | High precision is critical when user attention is limited and top results must be highly relevant (e.g., web search, chatbot responses). |
| Trade-off Relationship | Optimizing for R@K often requires retrieving more items (increasing K), which can hurt P@K. | Optimizing for P@K favors highly confident results, which can miss relevant items, hurting R@K. |
| Interpretation of Score = 1 | All relevant documents for the query are contained within the top K results. | Every single one of the top K retrieved documents is relevant to the query. |
| Interpretation of Score = 0 | No relevant documents are found in the top K results. | None of the top K retrieved documents are relevant. |
| Typical Reporting | Often reported for multiple K values (e.g., R@5, R@10, R@100) to show recall progression. | Often reported for a specific, user-facing K (e.g., P@1, P@5, P@10) to reflect immediate user experience. |

EVALUATION METRIC

Application in RAG Systems

Recall at K (R@K) is a fundamental metric for assessing the retrieval component of a Retrieval-Augmented Generation system. It quantifies the system's ability to find all relevant information, which is critical for ensuring the language model has sufficient context to generate a complete and accurate answer.

01

Core Retrieval Performance

Recall at K (R@K) directly measures the coverage of a RAG system's retriever. For a given query, it calculates the fraction of all relevant documents in the knowledge base that are successfully fetched and placed within the top K results returned to the generator.

  • High R@K indicates the retriever is effective at surfacing most relevant context, giving the LLM a high chance of having the facts needed for a faithful answer.
  • Low R@K signals a major risk of answer incompleteness or hallucination, as critical information is missing from the context window.
02

Trade-off with Precision at K

In practice, optimizing a RAG system involves balancing Recall at K (R@K) against Precision at K (P@K).

  • R@K focuses on completeness: "Did we find all the relevant chunks?"
  • P@K focuses on purity: "Are the top K results all relevant?"

A retriever configured for maximum recall (e.g., by fetching a large K) may include irrelevant documents, increasing context noise. Engineers must find the K value and retrieval strategy that provides sufficient recall for answer quality without overwhelming the LLM with excessive, irrelevant context that hurts efficiency and cost.
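One way to approach this choice is to pick the smallest K whose mean recall on a labeled test set reaches a target. The sketch below is an assumption about how such a search could look, not a prescribed procedure: the target values, candidate K values, and data are all hypothetical.

```python
def mean_recall_at_k(rankings, relevant_sets, k):
    """Average R@K over a set of queries (each with at least one relevant doc)."""
    scores = [
        len(set(ranked[:k]) & set(relevant)) / len(relevant)
        for ranked, relevant in zip(rankings, relevant_sets)
    ]
    return sum(scores) / len(scores)


def smallest_k_for_target(rankings, relevant_sets, target, candidate_ks):
    """Return the smallest candidate K whose mean recall meets the target, else None."""
    for k in sorted(candidate_ks):
        if mean_recall_at_k(rankings, relevant_sets, k) >= target:
            return k
    return None


# Hypothetical two-query test set.
rankings = [["a", "x", "b", "y", "c"], ["p", "q", "z", "r", "s"]]
relevant_sets = [{"a", "b", "c"}, {"q", "r"}]

print(smallest_k_for_target(rankings, relevant_sets, target=0.9, candidate_ks=(1, 3, 5)))  # 5
print(smallest_k_for_target(rankings, relevant_sets, target=0.5, candidate_ks=(1, 3, 5)))  # 3
```

Choosing the smallest K that meets the target keeps the context passed to the LLM as short as the recall requirement allows.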

03

Setting the K Parameter

The choice of K is a critical engineering decision with direct implications for system performance and cost.

  • Small K (e.g., 3-5): Evaluates the retriever's ability to place relevant documents at the very top ranks. Reaching high recall at such a small K often requires a very high-quality embedding model or re-ranker.
  • Large K (e.g., 10-20): Assesses the retriever's broad coverage. This is common in a two-stage retrieve-then-rerank architecture, where a fast, high-recall first stage fetches many candidates (e.g., K=20) and a slower, high-precision cross-encoder re-ranks them down to a smaller final set.
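The two-stage pattern can be sketched with plain Python callables. Everything here is a stand-in chosen for illustration: token overlap plays the role of the fast first stage and Jaccard similarity plays the role of the cross-encoder; a real system would plug in its own retriever and re-ranking model.

```python
def retrieve_then_rerank(query, docs, first_stage_score, rerank_score,
                         first_k=20, final_k=5):
    """docs maps doc_id -> text; both scorers take (query, text) and return a number."""
    # Stage 1: cheap scorer over the whole collection, keep the top first_k candidates.
    candidates = sorted(
        docs, key=lambda d: first_stage_score(query, docs[d]), reverse=True
    )[:first_k]
    # Stage 2: costlier scorer over the candidates only, keep the top final_k.
    return sorted(
        candidates, key=lambda d: rerank_score(query, docs[d]), reverse=True
    )[:final_k]


def token_overlap(query, text):
    """Toy first stage: number of shared whitespace-separated tokens."""
    return len(set(query.split()) & set(text.split()))


def jaccard(query, text):
    """Toy re-ranker: shared tokens divided by all distinct tokens."""
    q, t = set(query.split()), set(text.split())
    return len(q & t) / len(q | t)


# Hypothetical four-document corpus.
docs = {
    "d1": "recall at k measures coverage of relevant documents",
    "d2": "precision at k measures purity",
    "d3": "recipes for sourdough bread",
    "d4": "recall measures coverage",
}

top = retrieve_then_rerank("recall measures coverage", docs,
                           token_overlap, jaccard, first_k=3, final_k=2)
print(top)  # ['d4', 'd1']
```

Because the second stage only reorders what the first stage returned, the recall of the final set can never exceed the first stage's R@K at `first_k`, which is why the first stage is tuned for recall.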
04

Connection to Answer Faithfulness

Recall at K is a leading indicator for the downstream metric of Answer Faithfulness. If the retriever fails to recall a key document (low R@K), the language model lacks the necessary grounding and is forced to generate a response based on its parametric knowledge, drastically increasing the probability of an unsupported hallucination.

Monitoring R@K in production, alongside faithfulness scores, helps isolate failures to the retrieval stage versus the generation stage, enabling more targeted improvements.
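The isolation logic described here can be expressed as a small triage rule. This is only a sketch: the function name, the labels, and both 0.8 thresholds are hypothetical and would need to be calibrated for a given system and faithfulness metric.

```python
def triage(recall_at_k, faithfulness,
           recall_threshold=0.8, faithfulness_threshold=0.8):
    """Point at the pipeline stage most likely responsible for a bad answer.

    Thresholds are illustrative placeholders, not recommended values.
    """
    if recall_at_k < recall_threshold:
        # Needed context was never retrieved, so the generator had no chance.
        return "retrieval"
    if faithfulness < faithfulness_threshold:
        # Context was present but the answer is not grounded in it.
        return "generation"
    return "ok"


print(triage(recall_at_k=0.4, faithfulness=0.5))   # retrieval
print(triage(recall_at_k=0.95, faithfulness=0.5))  # generation
print(triage(recall_at_k=0.95, faithfulness=0.9))  # ok
```

Checking recall first reflects the ordering of the pipeline: a generation problem can only be diagnosed reliably once retrieval is known to have supplied the needed context.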

05

Benchmarking & Evaluation

To calculate R@K, a labeled evaluation set with query-document relevance judgments is required. The standard calculation is:

R@K = (Number of relevant documents retrieved in top K) / (Total number of relevant documents for the query)

  • It is typically reported as an average across a diverse set of test queries.
  • It is used alongside NDCG@K and MAP to provide a holistic view of retrieval quality, as R@K treats all relevant documents equally regardless of their rank within the top K.
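A minimal evaluation loop over a labeled set might look like the following. The `run` and `qrels` dictionaries are hypothetical data structures (query ID to ranked list, query ID to set of relevant IDs), and two conventions are assumed: queries with no relevant documents are skipped, and queries missing from the run count as empty rankings.

```python
def evaluate_recall(run, qrels, ks=(1, 3, 5)):
    """Average R@K over all queries in qrels that have at least one relevant doc."""
    totals = {k: 0.0 for k in ks}
    n_queries = 0
    for query_id, relevant in qrels.items():
        if not relevant:
            continue  # recall is undefined without relevant documents
        ranked = run.get(query_id, [])  # no results means recall 0 for this query
        n_queries += 1
        for k in ks:
            totals[k] += len(set(ranked[:k]) & relevant) / len(relevant)
    if n_queries == 0:
        raise ValueError("no queries with relevance judgments")
    return {k: totals[k] / n_queries for k in ks}


# Hypothetical judgments and system output.
qrels = {"q1": {"a", "b"}, "q2": {"c"}, "q3": set()}
run = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"]}

print(evaluate_recall(run, qrels, ks=(1, 3)))  # {1: 0.25, 3: 1.0}
```

Averaging per query, rather than pooling all documents, gives each query equal weight regardless of how many relevant documents it has.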
06

Improving Recall at K

Several technical strategies can directly improve a RAG system's R@K metric:

  • Hybrid Search: Combining dense vector search (for semantic recall) with sparse keyword search (for exact term recall).
  • Query Expansion: Using the LLM to generate multiple related queries or hypothetical answers to broaden the search.
  • Chunking Optimization: Experimenting with overlapping text chunks and varying chunk sizes to prevent relevant information from being split across boundaries.
  • Embedding Model Fine-tuning: Adapting the retriever's embedding model on domain-specific data to better align query and document representations.
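As an illustration of the hybrid search strategy, the sketch below merges a dense ranking and a sparse ranking with Reciprocal Rank Fusion (RRF). RRF is one common fusion method chosen here for simplicity, not something the strategies above prescribe; the constant 60 is a widely used default, and the rankings and relevance labels are toy data.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k_rrf=60, top_k=None):
    """Fuse several ranked lists: each doc scores the sum of 1 / (k_rrf + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k_rrf + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused if top_k is None else fused[:top_k]


def recall_at_k(ranked_ids, relevant_ids, k):
    relevant = set(relevant_ids)
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


# Hypothetical outputs of a dense retriever and a sparse (keyword) retriever.
dense = ["a", "b", "c"]
sparse = ["d", "e", "a"]
relevant = {"a", "d"}

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['a', 'd', 'b', 'e', 'c']
print(recall_at_k(dense, relevant, 2), recall_at_k(sparse, relevant, 2))  # 0.5 0.5
print(recall_at_k(fused, relevant, 2))  # 1.0
```

In this toy case each retriever alone finds one of the two relevant documents in its top 2, while the fused list finds both. Real gains depend on how complementary the two retrievers are.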
RAG EVALUATION METRICS

Frequently Asked Questions

Essential questions and answers about Recall at K (R@K), a core metric for evaluating the completeness of document retrieval in systems like Retrieval-Augmented Generation (RAG).

Recall at K (R@K) is an information retrieval metric that measures the proportion of all relevant documents for a query that are successfully retrieved within the top K results returned by a system. It is defined as:

Recall@K = (Number of relevant documents retrieved in the top K) / (Total number of relevant documents in the corpus)

For example, if there are 10 relevant documents in total for a query and your system retrieves 7 of them within the top 20 results, then Recall@20 = 0.7 (or 70%). A perfect score of 1.0 indicates that all relevant documents were found within the top K, making it a critical measure of retrieval completeness.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. For more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.