Inferensys

Dense Retrieval Metrics

Dense Retrieval Metrics are evaluation measures specifically applied to retrieval systems that use dense vector embeddings (e.g., from bi-encoders) to find semantically similar passages.
RAG EVALUATION METRICS

What are Dense Retrieval Metrics?

Dense Retrieval Metrics are quantitative measures used to evaluate the performance of semantic search systems that rely on dense vector embeddings. Unlike traditional keyword-based (sparse) retrieval, dense retrieval uses neural networks to map queries and documents into a shared high-dimensional space where semantic similarity is measured by cosine similarity or dot product. Core metrics for these systems include Recall@K (R@K), which measures the proportion of relevant documents found within the top K results, and Mean Reciprocal Rank (MRR), which averages the reciprocal rank of the first relevant result across multiple queries.
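To make the shared embedding space concrete, here is a minimal sketch of ranking passages by cosine similarity. The three-dimensional vectors and the document ids are invented for illustration; a real bi-encoder would produce vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return document ids ordered from most to least similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored]

# Hypothetical toy embeddings, not output from any real model.
query = [1.0, 0.0, 1.0]
docs = {
    "d1": [0.9, 0.1, 0.8],
    "d2": [0.0, 1.0, 0.0],
    "d3": [0.5, 0.5, 0.5],
}
ranking = rank_by_similarity(query, docs)
print(ranking)
```

The metrics described in this entry all take a ranked list like `ranking` as input and compare it against relevance judgments.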

Dense retrievers are commonly evaluated on benchmarks such as BEIR and MTEB, which test generalization across diverse domains. Other key metrics include Normalized Discounted Cumulative Gain (NDCG), which accounts for graded relevance and result ranking, and Hit Rate, a binary measure of whether any relevant document appears in the top K. Together, these metrics are the foundation for assessing the embedding model's quality and the overall retrieval effectiveness of a Retrieval-Augmented Generation (RAG) pipeline before answer generation.

Core Dense Retrieval Metrics

These metrics specifically evaluate the performance of retrieval systems that use dense vector embeddings (e.g., from bi-encoders like Sentence-BERT) to find semantically similar passages for RAG pipelines.

01

Recall at K (R@K)

Recall at K measures the proportion of all relevant documents for a query that are successfully retrieved within the top K results. It directly assesses a dense retriever's ability to find the complete set of necessary context.

  • Formula: (Relevant docs in top K) / (Total relevant docs in corpus)
  • Use Case: Critical for ensuring high-coverage retrieval in RAG, where missing a key document can lead to incomplete or incorrect answers.
  • Example: If there are 5 relevant documents total and the top 10 results contain 4 of them, R@10 = 0.8 (80%).
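
The Recall@K formula can be sketched in a few lines. The function name `recall_at_k` and the document ids are illustrative assumptions; the numbers reproduce the 4-out-of-5 example above.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

# Hypothetical ranked list and relevance judgments for one query.
retrieved = ["d7", "d2", "d9", "d4", "d1", "d8", "d3", "d6", "d5", "d0"]
relevant = {"d2", "d3", "d4", "d5", "d42"}  # d42 is relevant but never retrieved

r10 = recall_at_k(retrieved, relevant, 10)
r5 = recall_at_k(retrieved, relevant, 5)
print(r10, r5)
```

Comparing `r5` with `r10` shows how much coverage is gained by passing a longer candidate list downstream.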
02

Mean Reciprocal Rank (MRR)

Mean Reciprocal Rank evaluates the rank of the first relevant document across multiple queries. It is the average of the reciprocal of the rank at which the first relevant item appears.

  • Formula: MRR = (1 / |Q|) * Σ (1 / rank_i) for each query i.
  • Interpretation: Higher scores indicate the first relevant result appears earlier in the list. A perfect MRR of 1.0 means the first result is always relevant.
  • Application: Particularly important for user-facing systems where the quality of the very first result drives user satisfaction. When the first relevant passage reliably ranks near the top, fewer passages need to be passed to the LLM, which shortens the prompt and can reduce latency.
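
A minimal sketch of the MRR formula follows. The helper names and the three toy queries are assumptions for illustration; a query with no relevant result contributes 0, which is a common convention rather than something this entry specifies.

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_id_set) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Hypothetical results for three queries.
runs = [
    (["a", "b", "c"], {"a"}),  # first relevant at rank 1 -> 1.0
    (["a", "b", "c"], {"c"}),  # first relevant at rank 3 -> 1/3
    (["a", "b", "c"], {"z"}),  # nothing relevant retrieved -> 0.0
]
mrr = mean_reciprocal_rank(runs)
print(round(mrr, 4))
```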
03

Normalized Discounted Cumulative Gain (NDCG)

NDCG evaluates a ranked list by accounting for the graded relevance of each item (e.g., highly relevant, somewhat relevant, not relevant) and penalizing relevant items that appear lower in the list.

  • Core Principle: It uses a logarithmic discount factor, meaning the utility of a relevant document diminishes the further down the list it appears.
  • Graded Relevance: Unlike binary metrics (relevant/not), NDCG works with scores like 0, 1, 2, 3, making it suitable for evaluating retrievers where some passages are more critical than others.
  • Normalization: Scores are normalized against an ideal ranking (IDCG), producing a value between 0 and 1.
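
The discount and normalization steps can be sketched as below. This version assumes the linear-gain form of DCG (relevance divided by log2 of rank + 1); an exponential-gain variant is also widely used, and the function names and toy grades are illustrative.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, all_gains, k):
    """ranked_gains: grades of retrieved docs in ranked order.
    all_gains: grades of every judged doc for the query (used for the ideal ranking)."""
    idcg = dcg_at_k(sorted(all_gains, reverse=True), k)
    if idcg == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(ranked_gains, k) / idcg

# Hypothetical grades: the retriever put a grade-0 passage at rank 2.
judged = [3, 2, 0]
score = ndcg_at_k([3, 0, 2], judged, 3)
perfect = ndcg_at_k([3, 2, 0], judged, 3)
print(round(score, 4), perfect)
```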
04

Hit Rate

Hit Rate is a simple, binary metric measuring the percentage of queries for which at least one relevant document is found within the top K retrieved results.

  • Calculation: (Number of queries with a hit) / (Total number of queries).
  • Utility: Provides a high-level success rate for the retriever. A low Hit Rate indicates the retrieval system is fundamentally failing to find useful context for many queries, which will cascade into poor RAG performance.
  • Threshold Sensitivity: The choice of K (e.g., Hit Rate @ 5 vs. @ 10) significantly impacts the score and should align with the number of passages typically passed to the LLM.
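
The calculation and its sensitivity to K can be sketched as follows, using an assumed `hit_rate_at_k` helper and four invented queries.

```python
def hit_rate_at_k(runs, k):
    """Share of queries with at least one relevant document in the top k."""
    if not runs:
        raise ValueError("need at least one query")
    hits = sum(
        1 for retrieved, relevant in runs
        if any(doc_id in relevant for doc_id in retrieved[:k])
    )
    return hits / len(runs)

# Hypothetical (retrieved_ids, relevant_id_set) pairs, one per query.
runs = [
    (["a", "b", "c"], {"a"}),
    (["a", "b", "c"], {"c"}),
    (["a", "b", "c"], {"z"}),
    (["x", "y", "z"], {"y", "q"}),
]
hit_at_1 = hit_rate_at_k(runs, 1)
hit_at_3 = hit_rate_at_k(runs, 3)
print(hit_at_1, hit_at_3)
```

The same retriever scores very differently at K=1 and K=3 here, which is why the cutoff should match the number of passages actually sent to the LLM.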
05

Semantic Similarity vs. Traditional Overlap

Dense retrieval evaluation sometimes supplements rank-based metrics with semantic similarity scores, in place of lexical overlap metrics like BLEU or ROUGE.

  • Key Metric: BERTScore, originally proposed for evaluating generated text, computes similarity between two texts (here, a retrieved passage and a query or ground-truth passage) using contextual embeddings from models like BERT. On text generation tasks it has been reported to correlate better with human judgment than n-gram overlap metrics.
  • Advantage: Captures paraphrasing and conceptual similarity that keyword matching (e.g., TF-IDF) would miss.
  • Contrast: Traditional Precision@K and Recall@K still apply but require a binary relevance judgment, which can be derived from a semantic similarity threshold (e.g., treating a BERTScore > 0.8 as relevant). Any such threshold should be calibrated against human labels, because score ranges vary by model.
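
As a rough sketch of the thresholding idea, the snippet below converts similarity scores into binary relevance labels. The scores are invented placeholders standing in for the output of a semantic scorer, and the 0.8 cutoff simply mirrors the example above rather than a recommended value.

```python
def binarize(scores, threshold):
    """Label a passage relevant (1) when its similarity score exceeds the threshold."""
    return [1 if score > threshold else 0 for score in scores]

# Hypothetical similarity scores for the top 5 passages, in ranked order.
similarity_scores = [0.91, 0.85, 0.79, 0.83, 0.60]

labels = binarize(similarity_scores, threshold=0.8)
precision_at_5 = sum(labels) / len(labels)
print(labels, precision_at_5)
```

Labels produced this way inherit the scorer's errors, so spot-checking a sample against human judgments is a sensible safeguard.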
06

Reranking Effectiveness

This measures the improvement in retrieval quality gained by applying a secondary, more computationally expensive cross-encoder model to rerank the initial candidate set from the dense retriever.

  • Typical Flow: A fast bi-encoder retrieves 100 candidates (high recall), then a precise cross-encoder reranks them to produce the final top 10 (high precision).
  • Measurement: The lift in metrics like NDCG@10 or MAP after reranking versus the initial dense retrieval results.
  • Engineering Trade-off: Evaluates whether the latency and compute cost of the reranker are justified by the quality gain for the downstream generation task.
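
Measuring the lift amounts to scoring the same query before and after reranking. The sketch below assumes linear-gain NDCG and uses made-up orderings and grades; K is set to 3 only to keep the example small.

```python
import math

def ndcg_at_k(ranked_ids, gains, k):
    """gains: dict of doc_id -> graded relevance; unjudged ids count as 0."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))
    ideal = dcg(sorted(gains.values(), reverse=True))
    if ideal == 0:
        return 0.0
    return dcg([gains.get(doc_id, 0) for doc_id in ranked_ids]) / ideal

# Hypothetical judgments and orderings for one query.
gains = {"d1": 3, "d2": 2, "d3": 1}
dense_order = ["d9", "d3", "d1", "d2", "d8"]     # bi-encoder output
reranked_order = ["d1", "d2", "d9", "d3", "d8"]  # after the cross-encoder

before = ndcg_at_k(dense_order, gains, 3)
after = ndcg_at_k(reranked_order, gains, 3)
lift = after - before
print(round(before, 4), round(after, 4), round(lift, 4))
```

In practice the lift would be averaged over a full query set and weighed against the reranker's added latency.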
EVALUATION METRICS

Dense vs. Sparse Retrieval Metrics Comparison

A comparison of key evaluation metrics and their applicability to dense (semantic) and sparse (lexical) retrieval systems.

| Metric / Characteristic | Dense Retrieval (e.g., DPR, Sentence-BERT) | Sparse Retrieval (e.g., BM25, TF-IDF) |
| --- | --- | --- |
| Primary Evaluation Paradigm | Semantic / Meaning-Based | Lexical / Keyword-Based |
| Ranking Metric Commonly Emphasized | Normalized Discounted Cumulative Gain (NDCG) | Mean Average Precision (MAP) |
| Strength in Measuring | Graded relevance of semantically similar passages | Binary relevance of keyword-matching documents |
| Typical Recall@K Performance | Higher for semantic queries (e.g., "consequences of inflation") | Higher for exact term lookup (e.g., "Python list comprehension syntax") |
| Query Understanding Dependency | High (performance hinges on embedding quality & query semantics) | Low (performance based on term overlap; insensitive to word order, sensitive to word choice) |
| Requires Graded Relevance Judgments | Highly Beneficial (for metrics like NDCG) | Beneficial, but binary judgments suffice for P@K, R@K |
| Sensitivity to Synonymy & Paraphrasing | Robust (handles well) | Fragile (often fails) |
| Sensitivity to Polysemy (Multiple Meanings) | Can be misled by ambiguous embeddings | Cannot disambiguate; matches surface-level terms regardless of sense |

DENSE RETRIEVAL METRICS

How to Evaluate Dense Retrieval Systems

Evaluating a dense retrieval system requires metrics that assess the quality of its ranked list of semantically relevant passages. Core information retrieval (IR) metrics like Precision at K (P@K), Recall at K (R@K), and Mean Average Precision (MAP) quantify the system's ability to place relevant documents high in the ranking. Normalized Discounted Cumulative Gain (NDCG) is particularly important as it accounts for graded relevance, where some passages are more pertinent than others, aligning with the nuanced matches produced by semantic search.
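
Of the metrics named above, MAP is the least obvious to compute by hand. The sketch below assumes binary relevance judgments and normalizes by the total number of relevant documents; the helper names and toy data are illustrative.

```python
def average_precision(retrieved, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """runs: list of (retrieved_ids, relevant_id_set) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical results for two queries.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["a", "b", "c", "d"], {"b"}),       # AP = 1/2
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```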

For end-to-end Retrieval-Augmented Generation (RAG) evaluation, retrieval quality is a critical upstream component. Frameworks like RAGAS decompose final answer quality into metrics like Context Relevance, which directly measures the utility of retrieved passages. Reranking Effectiveness is often measured by the lift in NDCG or MAP after a cross-encoder model re-scores an initial dense retrieval candidate set, providing a clear benchmark for the two-stage retrieval architecture common in production systems.

IMPLEMENTATION FRAMEWORKS & TOOLS

Dense Retrieval Metrics

These metrics assess the quality of the semantic search performed before generation in a RAG pipeline.

01

Recall at K (R@K)

Recall at K measures the proportion of all relevant documents for a query that are successfully retrieved within the top K results. It is the primary metric for assessing the comprehensiveness of a dense retriever.

  • Formula: (Number of relevant docs in top K) / (Total number of relevant docs in corpus).
  • Use Case: Critical for ensuring the generator has access to necessary information. A low R@K indicates the retriever is missing key context, leading to incomplete or incorrect answers.
  • Typical K Values: R@5, R@10, or R@100, depending on how many documents are passed to the reranker or generator.
02

Mean Reciprocal Rank (MRR)

Mean Reciprocal Rank evaluates the rank of the first relevant document. It is the average of the reciprocal of the rank at which the first relevant document appears across multiple queries. It emphasizes the system's ability to place a useful result high in the list.

  • Formula: For a set of queries Q, MRR = (1/|Q|) * Σ (1 / rank_i), where rank_i is the position of the first relevant document for the i-th query.
  • Interpretation: A perfect MRR of 1.0 means the first retrieved document is relevant for every query. It is sensitive to the position of the first hit, making it crucial for user-facing systems where the top result is paramount.
03

Normalized Discounted Cumulative Gain (NDCG)

NDCG evaluates a ranked list using graded relevance (e.g., scores of 0, 1, 2, 3), where higher-ranked relevant items contribute more to the score. It is the standard metric for evaluating ranking quality when relevance is not binary.

  • Core Concept: It combines Cumulative Gain (sum of relevance scores), Discounting (reducing weight for lower ranks), and Normalization (comparing to an ideal ranking).
  • Application in Dense Retrieval: Used to evaluate retrievers when documents have varying degrees of usefulness. A retriever that places a highly relevant (score=3) document at rank 1 scores much higher than one that places it at rank 10.
  • NDCG@K (e.g., NDCG@10) is commonly reported.
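
Written out, the three components combine as follows. This is the linear-gain formulation, stated here as an assumption because some implementations use an exponential gain of 2^{rel_i} - 1 instead; rel_i denotes the graded relevance of the document at rank i.

```latex
\mathrm{DCG@}K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i + 1)}, \qquad
\mathrm{NDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}
```

Here IDCG@K is the DCG@K of the ideal ordering, with documents sorted by decreasing relevance. Under this formulation, a document with score 3 contributes 3 / log2(2) = 3 at rank 1 but only 3 / log2(11) ≈ 0.87 at rank 10.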
04

Hit Rate

Hit Rate is a simple, binary metric: for a given cutoff K, it measures the percentage of queries for which at least one relevant document is found in the top K results.

  • Calculation: (Number of queries with a relevant doc in top K) / (Total number of queries).
  • Utility: Provides a high-level success rate. A Hit Rate@5 of 0.95 means for 95% of queries, the system found something useful in the top 5. It is often the first metric checked to see if the retriever is functioning at a basic level.
  • Difference from Recall: Recall measures coverage of all relevant documents per query, while Hit Rate only cares if any relevant document is found.
05

Benchmark Suites: BEIR & MTEB

Standardized benchmarks are essential for comparing dense retrievers. Two primary suites are:

  • BEIR (Benchmarking-IR): A heterogeneous benchmark containing 18 datasets across 9 tasks (e.g., fact-checking, question answering, citation prediction). It evaluates zero-shot retrieval performance, showing how well a model generalizes to new domains without task-specific fine-tuning.
  • MTEB (Massive Text Embedding Benchmark): Evaluates text embeddings across 8 tasks (including retrieval) on 58 datasets. Its Retrieval category includes benchmarks like SciFact and NFCorpus.

Practice: Dense retrievers like Sentence-BERT, OpenAI embeddings, and Cohere models are routinely evaluated on these benchmarks using metrics like NDCG@10 and Recall@100 to establish leaderboards and performance baselines.

Frequently Asked Questions

Dense retrieval metrics evaluate systems that use neural embeddings to find semantically similar text. This FAQ covers the core metrics, their calculations, and their role in benchmarking production RAG pipelines.

What is Precision at K (P@K)?

Precision at K (P@K) is an information retrieval metric that calculates the proportion of relevant documents among the top K retrieved results for a single query. It is defined as P@K = (Number of relevant documents in top K) / K. For example, if 3 out of the top 5 retrieved passages are relevant, P@5 = 0.6. This metric is critical for evaluating the immediate quality of a dense retriever's output, as it directly measures noise in the context window provided to a downstream language model. It is computed per query and usually averaged across a test set. In dense retrieval, relevance is typically judged by human annotators or against a curated ground-truth set of query-document pairs.
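
The P@K definition translates directly into code. The function name and passage ids below are illustrative, and the data reproduces the 3-out-of-5 example.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved ids that are relevant (always divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

# Hypothetical ranked passages and ground-truth relevant set for one query.
retrieved = ["p1", "p2", "p3", "p4", "p5"]
relevant = {"p1", "p3", "p4"}

p5 = precision_at_k(retrieved, relevant, 5)
print(p5)
```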

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on turning complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.