Normalized Discounted Cumulative Gain (NDCG) is a ranking quality metric that evaluates a list of retrieved items by comparing their graded relevance scores against an ideal ordering. It extends Discounted Cumulative Gain (DCG) by applying a logarithmic discount to the relevance of items based on their rank, penalizing relevant items that appear lower in the list. The final NDCG score is obtained by normalizing the DCG by the Ideal DCG (IDCG), which is the maximum possible DCG for the perfect ranking, resulting in a score between 0 and 1.
Normalized Discounted Cumulative Gain (NDCG)

What is Normalized Discounted Cumulative Gain (NDCG)?
Normalized Discounted Cumulative Gain (NDCG) is a metric for evaluating the quality of a ranked list of items that accounts for the graded relevance of items and their positions in the list.
In Retrieval-Augmented Generation (RAG) evaluation, NDCG is crucial for assessing dense retrieval systems where documents have varying degrees of relevance (e.g., highly relevant, partially relevant, irrelevant). Unlike binary metrics such as Precision@K, NDCG accounts for this graded relevance and the rank position, making it more informative for tasks where the order of top results critically impacts downstream answer generation quality. It is a core metric in information retrieval benchmarking and is often reported alongside Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR).
Key Characteristics of NDCG
Normalized Discounted Cumulative Gain (NDCG) is a core metric for evaluating ranked lists in information retrieval and RAG systems. Unlike binary metrics, it accounts for graded relevance and the critical importance of ranking order.
Graded Relevance
NDCG moves beyond binary relevance (relevant/not relevant) to evaluate graded relevance scores. Each item in a ranked list is assigned a relevance grade (e.g., 0=irrelevant, 1=somewhat relevant, 2=highly relevant, 3=perfectly relevant). The metric's Cumulative Gain (CG) is the sum of these relevance scores for the top K results, directly rewarding systems that retrieve highly relevant items.
- Example: A search returning documents with relevance scores [3, 2, 1, 0] has a CG@4 = 6.
- This is fundamental for RAG evaluation, where some context passages are critically useful (high grade) while others are merely tangential (low grade).
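The CG@4 example above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `cumulative_gain` is illustrative rather than taken from any library, and the relevance list is assumed to be in ranked order.

```python
from typing import Sequence


def cumulative_gain(relevances: Sequence[float], k: int) -> float:
    """Sum the graded relevance scores of the top-k results (no rank discount)."""
    return float(sum(relevances[:k]))


# Relevance grades of the returned documents, in ranked order.
print(cumulative_gain([3, 2, 1, 0], k=4))  # 6.0

# CG ignores ordering: the reversed list scores the same,
# which is the gap that positional discounting addresses.
print(cumulative_gain([0, 1, 2, 3], k=4))  # 6.0
```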
Positional Discounting
A core innovation of NDCG is its logarithmic discount function, which reduces the contribution of a relevant item based on its rank position. This reflects the user experience principle that items appearing lower in a list are less likely to be seen and used.
- The Discounted Cumulative Gain (DCG) formula applies a discount: rel_i / log2(i + 1), where i is the rank position.
- Impact: A highly relevant document (score=3) at position 1 contributes 3.0 to DCG. The same document at position 4 contributes only 3 / log2(5) ≈ 1.29.
- This forces evaluation to penalize systems that bury critical context deep in the retrieved list, which is detrimental for RAG generation quality.
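The discount described above can be sketched as follows. The helper name `dcg_at_k` is an assumption for illustration, and the sketch uses the linear gain (the raw relevance grade) that this section describes.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Sum rel_i / log2(i + 1) over 1-based rank positions i = 1..k."""
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


# A grade-3 document at rank 1 keeps its full gain.
print(round(dcg_at_k([3], k=1), 2))           # 3.0

# The same document at rank 4 is discounted by log2(5).
print(round(dcg_at_k([0, 0, 0, 3], k=4), 2))  # 1.29
```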
Ideal Normalization
NDCG is normalized by the Ideal DCG (IDCG), which is the maximum possible DCG achievable for a given set of relevance judgments. This creates a score between 0.0 and 1.0, allowing for comparison across different queries with varying numbers of relevant items.
- Calculation: NDCG@K = DCG@K / IDCG@K.
- IDCG is computed by sorting all items by their true relevance score in descending order and calculating the DCG of that ideal ranking.
- A score of 1.0 represents a perfect ranking. A score of 0.7 indicates the system achieves 70% of the possible gain from an ideal ordering.
- This normalization is essential for reporting a single, interpretable metric across an entire evaluation benchmark.
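The normalization step can be sketched as below. Two assumptions are made for illustration: the input list holds the grades of all judged items for the query in the system's ranked order, and a query with no relevant items (IDCG of 0) is scored 0.0, which is a common convention rather than part of the definition. Function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with a log2(i + 1) discount."""
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    if idcg == 0:
        return 0.0  # assumed convention when no item is relevant
    return dcg_at_k(relevances, k) / idcg


print(ndcg_at_k([3, 2, 1, 0], k=4))            # 1.0 (already ideal)
print(round(ndcg_at_k([0, 1, 2, 3], k=4), 3))  # 0.614 (worst ordering)
```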
Rank-Aware Evaluation
Unlike Precision@K or Recall@K, which treat all top-K items equally, NDCG is explicitly rank-aware. It provides a fine-grained measure of ranking quality, making it the metric of choice when the order of results is critical to the downstream task.
- Use Case in RAG: Passages at the top of the retrieved list typically serve as the primary context for an LLM. NDCG rewards rankings that place the most relevant passage first, because rank 1 is the only position that carries no discount.
- Comparison: Two systems may have identical Precision@5, but the system that places the more relevant document in position 1 will have a higher NDCG@5.
- This makes NDCG indispensable for evaluating retrievers and rerankers within a RAG pipeline, where ordering dictates context quality.
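The Precision@5 comparison above can be illustrated with two hypothetical rankings that contain the same relevant documents in different positions. The relevance lists and helper names are assumptions made for this sketch, and any grade above 0 is treated as relevant when computing precision.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0


def precision_at_k(relevances: Sequence[float], k: int) -> float:
    """Fraction of the top-k items with a non-zero relevance grade."""
    return sum(1 for rel in relevances[:k] if rel > 0) / k


system_a = [3, 0, 1, 0, 0]  # best document ranked first
system_b = [1, 0, 3, 0, 0]  # best document ranked third

print(precision_at_k(system_a, 5), precision_at_k(system_b, 5))  # 0.4 0.4
print(round(ndcg_at_k(system_a, 5), 2), round(ndcg_at_k(system_b, 5), 2))  # 0.96 0.69
```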
Standardized Benchmarking
NDCG is the de facto standard for evaluating web search engines and recommendation systems, and by extension, retrieval for RAG. Major academic and industry benchmarks (e.g., BEIR and the TREC tracks) report NDCG, most often NDCG@10, as a primary metric, enabling direct comparison between different retrieval architectures. MS MARCO passage ranking results are usually headlined by MRR@10, with NDCG commonly reported alongside it.
- It allows comparison between sparse retrievers (BM25), dense retrievers (DPR, Contriever), and cross-encoders used for reranking.
- Standard cut-offs like NDCG@5, NDCG@10, and NDCG@100 are used to measure performance at different depths of the ranking.
- When a paper or system claims state-of-the-art retrieval performance, it is typically validated by an improvement in NDCG on established benchmarks.
Relation to Other Metrics
NDCG exists within a family of ranking metrics, each with specific strengths.
- vs. Mean Average Precision (MAP): MAP also considers rank order but assumes binary relevance. NDCG is preferred for graded relevance scenarios common in RAG.
- vs. Mean Reciprocal Rank (MRR): MRR only considers the rank of the first relevant item. NDCG provides a more comprehensive view of the entire ranking.
- vs. Precision/Recall at K: These metrics are rank-agnostic within the top K. NDCG incorporates rank sensitivity.
- Practical Integration: In a full RAG evaluation suite, NDCG assesses retrieval quality, while metrics like Answer Faithfulness and Answer Relevance assess the final generation. A high NDCG suggests the generator is receiving well-ordered context, although it does not by itself guarantee a good final answer.
NDCG vs. Other Ranking Metrics
A comparison of key characteristics between NDCG and other common metrics used to evaluate ranked retrieval results, highlighting their suitability for different evaluation scenarios.
| Metric / Feature | NDCG | Precision/Recall at K | Mean Average Precision (MAP) | Mean Reciprocal Rank (MRR) |
|---|---|---|---|---|
| Core Evaluation Focus | Graded relevance & ranking quality | Binary relevance at a cutoff | Binary relevance & rank of all relevant items | Rank of the first relevant item |
| Handles Graded Relevance | | | | |
| Position-Sensitive Discounting | | | | |
| Requires Complete Relevance Judgments | | | | |
| Interpretation Range | 0.0 to 1.0 | 0.0 to 1.0 | 0.0 to 1.0 | 0.0 to 1.0 |
| Ideal Use Case | Evaluating web search, recommendation systems | Quick sanity checks, top-K performance | Benchmarking academic retrieval systems | Question answering, tasks with a single correct answer |
| Primary Weakness | Requires defining a gain function and discount | Ignores relevance beyond cutoff K | Assumes all relevant items are known | Ignores performance after first relevant item |
| Common in RAG Evaluation | | | | |
Frequently Asked Questions
Normalized Discounted Cumulative Gain (NDCG) is a core metric for evaluating ranked lists where items have graded relevance. These questions address its mechanics, calculation, and application in modern AI systems.
Normalized Discounted Cumulative Gain (NDCG) is an information retrieval metric that evaluates the quality of a ranked list by accounting for both the graded relevance of items and their positional rank, with higher scores given to relevant items placed at the top of the list. It works by comparing the Discounted Cumulative Gain (DCG) of a predicted ranking to the Ideal DCG (IDCG) of a perfectly ordered list. The core principle is that the utility of a relevant document is discounted logarithmically based on its rank, reflecting the user's diminishing attention. The final NDCG score is the ratio DCG/IDCG, resulting in a normalized value between 0 and 1, where 1 represents a perfect ranking.
Key Mechanism: For a ranked list, the gain from each item is its relevance score (e.g., 0, 1, 2, 3). This gain is then divided by a log discount factor (e.g., log2(rank + 1)). The sum of these discounted gains is the DCG. NDCG is this DCG divided by the maximum possible DCG (the IDCG), which is obtained by sorting all items by their true relevance in descending order.
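Written out as equations, the mechanism above looks like the following. The worked numbers use a hypothetical ranking with relevance grades [2, 3, 0, 1], chosen only for illustration; its ideal ordering is [3, 2, 1, 0].

```latex
\begin{aligned}
\mathrm{DCG@}K  &= \sum_{i=1}^{K} \frac{rel_i}{\log_2(i + 1)}, \qquad
\mathrm{NDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K} \\[6pt]
\mathrm{DCG@}4  &= \frac{2}{\log_2 2} + \frac{3}{\log_2 3} + \frac{0}{\log_2 4} + \frac{1}{\log_2 5}
                 \approx 2 + 1.893 + 0 + 0.431 = 4.324 \\[6pt]
\mathrm{IDCG@}4 &= \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{1}{\log_2 4} + \frac{0}{\log_2 5}
                 \approx 3 + 1.262 + 0.5 + 0 = 4.762 \\[6pt]
\mathrm{NDCG@}4 &\approx \frac{4.324}{4.762} \approx 0.908
\end{aligned}
```

Swapping the top two documents into their ideal order would close the remaining gap and bring the score to 1.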
Related Terms
NDCG is a core metric for evaluating ranked retrieval quality. It operates within a broader ecosystem of quantitative measures for assessing search, ranking, and generation systems.
Mean Average Precision (MAP)
Mean Average Precision (MAP) calculates the mean of the Average Precision scores across a set of queries, providing a single-figure measure of quality for a ranking system. Unlike NDCG, MAP assumes binary relevance (relevant/not relevant).
- Core Calculation: For each query, Average Precision is the average of precision values at each rank where a relevant document is found. MAP is the mean of these values across all queries.
- Use Case: Best for tasks where the primary goal is to retrieve all relevant items, and their order among themselves is less critical than simply getting them to the top.
- Comparison to NDCG: MAP does not account for graded relevance levels or the specific utility decay of lower ranks as explicitly as NDCG's discount function.
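The Average Precision calculation described above can be sketched as follows. Function names are illustrative. By default the sketch assumes every relevant document appears somewhere in the ranked list; the optional `total_relevant` argument (an assumption of this sketch) covers the case where some relevant documents were never retrieved.

```python
from typing import Optional, Sequence


def average_precision(
    binary_relevances: Sequence[int],
    total_relevant: Optional[int] = None,
) -> float:
    """Average the precision values at each rank that holds a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(binary_relevances, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    denominator = total_relevant if total_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(queries: Sequence[Sequence[int]]) -> float:
    """Mean of the per-query Average Precision scores."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Query 1: relevant items at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
# Query 2: relevant item at rank 2         -> AP = 1/2
print(round(mean_average_precision([[1, 0, 1, 0], [0, 1]]), 3))  # 0.667
```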
Precision at K (P@K) & Recall at K (R@K)
Precision at K (P@K) and Recall at K (R@K) are fundamental cut-off metrics for evaluating retrieval systems at a specific depth K.
- P@K: Measures the proportion of relevant documents among the top K retrieved results. It answers: "Of the first K results shown, how many were good?"
- R@K: Measures the proportion of all relevant documents for a query that were found within the top K results. It answers: "How much of the total relevant content did I find in the first K results?"
- Relationship to NDCG: These are simpler, position-bound metrics. NDCG provides a more nuanced single score by aggregating performance across all ranks, weighting by position and relevance grade.
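Both cut-off metrics can be computed from a ranked list of document IDs and a set of relevant IDs. The sketch below uses hypothetical document IDs and function names. It divides precision by K even when fewer than K results are returned, which is one common convention.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0  # assumed convention when the query has no relevant docs
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d4", "d9"}  # d9 was never retrieved

print(precision_at_k(retrieved, relevant, k=5))         # 0.4
print(round(recall_at_k(retrieved, relevant, k=5), 3))  # 0.667
```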
Mean Reciprocal Rank (MRR)
Mean Reciprocal Rank (MRR) is a metric focused on the rank of the first relevant item. It is the average of the reciprocal of the rank of the first correct answer across multiple queries.
- Calculation: For each query, compute 1 / rank_of_first_relevant_result. MRR is the average of these values. If no relevant item is found, the reciprocal is 0.
- Utility: Highly interpretable and critical for use cases where the user's goal is to find a single correct answer quickly (e.g., question answering, voice assistants).
- Contrast with NDCG: MRR is insensitive to the quality of results after the first relevant one. NDCG evaluates the entire ranked list, making it more suitable for exploratory search or recommendation.
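The MRR calculation can be sketched in a few lines. Function names are illustrative, and each query is assumed to be represented as a list of binary relevance labels in ranked order.

```python
from typing import Sequence


def reciprocal_rank(binary_relevances: Sequence[int]) -> float:
    """Return 1 / rank of the first relevant item, or 0.0 if there is none."""
    for rank, rel in enumerate(binary_relevances, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: Sequence[Sequence[int]]) -> float:
    """Average the reciprocal ranks across all queries."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


queries = [
    [0, 1, 0],  # first relevant item at rank 2 -> 0.5
    [1, 0, 0],  # first relevant item at rank 1 -> 1.0
    [0, 0, 0],  # no relevant item              -> 0.0
]
print(mean_reciprocal_rank(queries))  # 0.5
```

Everything after the first relevant hit is ignored, so `[1, 0, 0]` and `[1, 1, 1]` receive the same score.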
Semantic Similarity Metrics
Metrics like BERTScore and Semantic Similarity evaluate the meaning-based alignment between text, often used to assess generated answers against references.
- BERTScore: Computes similarity using contextual embeddings from models like BERT. It matches words in candidate and reference texts based on cosine similarity in embedding space, producing precision, recall, and F1 measures.
- General Semantic Similarity: Often computed using sentence embeddings (e.g., from Sentence-BERT) to gauge the conceptual relatedness of two passages, such as a generated answer and a ground truth.
- Role in Evaluation: While NDCG evaluates retrieval ranking, these metrics often evaluate the final generation quality in RAG pipelines, assessing answer quality independently of the retrieval process.
RAGAS Framework Metrics
RAGAS (Retrieval-Augmented Generation Assessment) is an open-source framework for reference-free evaluation of RAG pipelines. It defines several metrics that complement retrieval-focused metrics like NDCG.
- Faithfulness: Measures factual consistency between the generated answer and the retrieved context. A low score indicates hallucination.
- Answer Relevance: Assesses how directly the generated answer addresses the original query, independent of the context.
- Context Relevance: Evaluates the pertinence of the retrieved context to the query. This is closely related to the precision of the retrieval stage.
- Synthetic Data: RAGAS often uses an LLM to generate synthetic test cases and evaluations, enabling assessment without human-written references.
Reranking Effectiveness
Reranking Effectiveness quantifies the improvement in retrieval quality achieved by applying a secondary, more computationally intensive model to an initial set of candidate documents.
- Two-Stage Process: A fast retriever (e.g., a dense vector search) fetches a broad set of candidates (e.g., top 100). A slow but accurate reranker (e.g., a cross-encoder) rescores this set to produce the final top-K results.
- Measurement: The lift in metrics like NDCG@K or MAP after reranking, compared to the initial retrieval, defines effectiveness. For example, an initial retrieval might have NDCG@10 of 0.65, which improves to 0.82 after reranking.
- System Design Goal: The core engineering trade-off is between the improved accuracy of the final ranked list and the added latency and compute cost of the reranking stage.
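Measuring the lift amounts to scoring the same query before and after reranking against the same relevance judgments. The sketch below assumes a hypothetical `qrels` dictionary mapping document IDs to graded relevance, treats unjudged documents as grade 0, and uses made-up rankings. It does not reproduce the 0.65 and 0.82 figures quoted above.

```python
import math
from typing import Dict, Sequence


def ndcg_at_k(ranked_ids: Sequence[str], qrels: Dict[str, int], k: int) -> float:
    """NDCG@k for a ranked list of doc IDs given graded judgments."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]  # unjudged -> 0
    dcg = sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


qrels = {"a": 3, "b": 2, "c": 1}         # hypothetical judgments
initial = ["c", "x", "a", "b", "y"]      # first-stage retriever order
reranked = ["a", "b", "c", "x", "y"]     # order after the reranker

before = ndcg_at_k(initial, qrels, k=5)
after = ndcg_at_k(reranked, qrels, k=5)
print(f"NDCG@5 before: {before:.3f}")    # NDCG@5 before: 0.706
print(f"NDCG@5 after:  {after:.3f}")     # NDCG@5 after:  1.000
print(f"lift:          {after - before:+.3f}")
```

In practice the lift would be averaged over a full query set and weighed against the latency the reranker adds.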

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.