Inferensys

Normalized Discounted Cumulative Gain (NDCG)

Normalized Discounted Cumulative Gain (NDCG) is a ranking evaluation metric that measures the quality of a ranked list by accounting for both the graded relevance of items and their positions.
RAG EVALUATION METRICS

What is Normalized Discounted Cumulative Gain (NDCG)?

Normalized Discounted Cumulative Gain (NDCG) is a metric for evaluating the quality of a ranked list of items that accounts for the graded relevance of items and their positions in the list.

Normalized Discounted Cumulative Gain (NDCG) is a ranking quality metric that evaluates a list of retrieved items by comparing their graded relevance scores against an ideal ordering. It extends Discounted Cumulative Gain (DCG) by applying a logarithmic discount to the relevance of items based on their rank, penalizing relevant items that appear lower in the list. The final NDCG score is obtained by normalizing the DCG by the Ideal DCG (IDCG), which is the maximum possible DCG for the perfect ranking, resulting in a score between 0 and 1.
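
In symbols, the definition can be sketched as follows. This assumes the raw relevance grade as the gain and a base-2 logarithmic discount; other conventions exist (some implementations use an exponential gain such as 2^rel - 1), so treat it as one common form rather than the only one.

```latex
\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}
```

Here rel_i is the graded relevance of the item at rank i, and IDCG@K is the DCG@K of the same items sorted by relevance in descending order.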

In Retrieval-Augmented Generation (RAG) evaluation, NDCG is well suited to assessing retrieval systems where documents have varying degrees of relevance (e.g., highly relevant, partially relevant, irrelevant). Unlike binary metrics such as Precision@K, NDCG accounts for both graded relevance and rank position, making it more informative for tasks where the order of top results affects downstream answer generation quality. It is a core metric in information retrieval benchmarking and is often reported alongside Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR).

Key Characteristics of NDCG

Normalized Discounted Cumulative Gain (NDCG) is a core metric for evaluating ranked lists in information retrieval and RAG systems. Unlike binary metrics, it accounts for graded relevance and the critical importance of ranking order.

01

Graded Relevance

NDCG moves beyond binary relevance (relevant/not relevant) to evaluate graded relevance scores. Each item in a ranked list is assigned a relevance grade (e.g., 0=irrelevant, 1=somewhat relevant, 2=highly relevant, 3=perfectly relevant). The metric's Cumulative Gain (CG) is the sum of these relevance scores for the top K results, directly rewarding systems that retrieve highly relevant items.

  • Example: A search returning documents with relevance scores [3, 2, 1, 0] has a CG@4 = 6.
  • This is fundamental for RAG evaluation, where some context passages are critically useful (high grade) while others are merely tangential (low grade).
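
The CG@4 example above can be reproduced with a minimal sketch. The function name `cumulative_gain` is illustrative, not taken from any library. The second call shows that plain CG ignores order, which is the gap that discounting fills.

```python
def cumulative_gain(relevances, k):
    """Sum the graded relevance of the top-k items; rank order is ignored."""
    return sum(relevances[:k])

# Relevance grades in ranked order (0 = irrelevant ... 3 = perfectly relevant).
ranked = [3, 2, 1, 0]
reversed_ranking = [0, 1, 2, 3]

print(cumulative_gain(ranked, 4))            # 6
print(cumulative_gain(reversed_ranking, 4))  # 6, same items in the worst order
```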
02

Positional Discounting

A core innovation of NDCG is its logarithmic discount function, which reduces the contribution of a relevant item based on its rank position. This reflects the user experience principle that items appearing lower in a list are less likely to be seen and used.

  • The Discounted Cumulative Gain (DCG) formula applies a discount: rel_i / log2(i + 1), where i is the rank position.
  • Impact: A highly relevant document (score=3) at position 1 contributes 3.0 to DCG. The same document at position 4 contributes only 3 / log2(5) ≈ 1.29.
  • This forces evaluation to penalize systems that bury critical context deep in the retrieved list, which is detrimental for RAG generation quality.
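
A small sketch of the discount described above, assuming 1-based ranks and the raw grade as the gain. The helper name `dcg_at_k` is hypothetical. It reproduces the two contributions from the Impact bullet.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted Cumulative Gain using rel_i / log2(i + 1).

    Ranks are 1-based, so the item at list index 0 has rank 1.
    """
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

top = dcg_at_k([3, 0, 0, 0], 4)     # grade-3 document at position 1
buried = dcg_at_k([0, 0, 0, 3], 4)  # same document at position 4

print(round(top, 2), round(buried, 2))  # 3.0 1.29
```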
03

Ideal Normalization

NDCG is normalized by the Ideal DCG (IDCG), which is the maximum possible DCG achievable for a given set of relevance judgments. This creates a score between 0.0 and 1.0, allowing for comparison across different queries with varying numbers of relevant items.

  • Calculation: NDCG@K = DCG@K / IDCG@K.
  • IDCG is computed by sorting all items by their true relevance score in descending order and calculating the DCG of that ideal ranking.
  • A score of 1.0 represents a perfect ranking. A score of 0.7 indicates the system achieves 70% of the possible gain from an ideal ordering.
  • This normalization is essential for reporting a single, interpretable metric across an entire evaluation benchmark.
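
The calculation above might be sketched as follows. The function names and the `judged_rels` parameter (all relevance grades known for the query) are assumptions made for illustration. Returning 0.0 when no judged item is relevant is one common convention, not a universal rule.

```python
import math

def dcg_at_k(relevances, k):
    """DCG@K with the rel_i / log2(i + 1) discount (1-based ranks)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_rels, judged_rels, k):
    """NDCG@K = DCG@K / IDCG@K.

    ranked_rels: grades of the retrieved items, in system order.
    judged_rels: grades of all judged items for the query (any order).
    """
    ideal = sorted(judged_rels, reverse=True)  # ideal ordering
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg > 0 else 0.0

judged = [3, 2, 1, 0]
print(round(ndcg_at_k([3, 2, 1, 0], judged, 4), 3))  # 1.0
print(round(ndcg_at_k([2, 3, 0, 1], judged, 4), 3))  # 0.908
```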
04

Rank-Aware Evaluation

Unlike Precision@K or Recall@K, which treat all top-K items equally, NDCG is explicitly rank-aware. It provides a fine-grained measure of ranking quality, making it the metric of choice when the order of results is critical to the downstream task.

  • Use Case in RAG: The top-ranked passages typically supply the primary context for an LLM. NDCG rewards rankings that place the most relevant passage first and penalizes those that push it down the list.
  • Comparison: Two systems may have identical Precision@5, but the one that places its most relevant document at position 1 rather than position 5 will score a higher NDCG@5.
  • This makes NDCG indispensable for evaluating retrievers and rerankers within a RAG pipeline, where ordering dictates context quality.
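
To illustrate the comparison above, here is a hedged sketch with two made-up rankings of the same five documents. The relevance grades and helper names are hypothetical, and the ideal ordering is computed from the retrieved list itself, which assumes every judged document was retrieved.

```python
import math

def precision_at_k(relevances, k):
    """Fraction of the top-k items with any positive relevance grade."""
    return sum(1 for rel in relevances[:k] if rel > 0) / k

def dcg_at_k(relevances, k):
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    # Assumption: the ideal ranking is the same list sorted by grade.
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

system_a = [3, 0, 1, 0, 0]  # best document ranked first
system_b = [0, 0, 1, 0, 3]  # best document ranked last

print(precision_at_k(system_a, 5), precision_at_k(system_b, 5))  # 0.4 0.4
print(round(ndcg_at_k(system_a, 5), 3), round(ndcg_at_k(system_b, 5), 3))  # 0.964 0.457
```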
05

Standardized Benchmarking

NDCG is a standard metric for evaluating web search engines and recommendation systems, and by extension, retrieval for RAG. Major academic and industry benchmarks report it: BEIR and the TREC Deep Learning track use NDCG@10 as a primary metric, while the MS MARCO passage-ranking leaderboard uses MRR@10 as its official metric. Reporting on shared benchmarks enables direct comparison between different retrieval architectures.

  • It allows comparison between sparse retrievers (BM25), dense retrievers (DPR, Contriever), and cross-encoders used for reranking.
  • Standard cut-offs like NDCG@5, NDCG@10, and NDCG@100 are used to measure performance at different depths of the ranking.
  • Papers and systems that claim state-of-the-art retrieval performance often support the claim with an NDCG improvement on established benchmarks.
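
Benchmark scores are usually the mean of per-query NDCG at a fixed cut-off. The sketch below shows that aggregation on two hypothetical queries. The query IDs, grades, and the `mean_ndcg` helper are invented for illustration and do not mirror any benchmark's tooling.

```python
import math
from statistics import mean

def dcg_at_k(relevances, k):
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    # Assumption: all judged documents appear somewhere in the ranked list.
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

def mean_ndcg(runs, cutoffs=(5, 10)):
    """Average NDCG@K over all queries, for each cut-off K."""
    return {k: mean(ndcg_at_k(rels, k) for rels in runs.values())
            for k in cutoffs}

# Hypothetical relevance grades of the top-10 results for two queries.
runs = {
    "q1": [3, 2, 0, 1, 0, 0, 0, 0, 0, 0],
    "q2": [0, 0, 0, 0, 0, 0, 3, 0, 0, 0],  # only relevant hit is at rank 7
}
scores = mean_ndcg(runs)
print({k: round(v, 3) for k, v in scores.items()})
```

In this toy data the second query contributes nothing at K=5 but does at K=10, which is why the cut-off should always be reported alongside the score.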
06

Relation to Other Metrics

NDCG exists within a family of ranking metrics, each with specific strengths.

  • vs. Mean Average Precision (MAP): MAP also considers rank order but assumes binary relevance. NDCG is preferred for graded relevance scenarios common in RAG.
  • vs. Mean Reciprocal Rank (MRR): MRR only considers the rank of the first relevant item. NDCG provides a more comprehensive view of the entire ranking.
  • vs. Precision/Recall at K: These metrics are rank-agnostic within the top K. NDCG incorporates rank sensitivity.
  • Practical Integration: In a full RAG evaluation suite, NDCG assesses retrieval quality, while metrics like Answer Faithfulness and Answer Relevance assess the final generation. High NDCG is a useful leading indicator of answer quality, though it does not guarantee it, since the generator can still misuse well-ranked context.
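
The contrast with MAP and MRR can be seen in a short sketch. It uses two hypothetical rankings that look identical once grades are collapsed to binary, so reciprocal rank and average precision cannot tell them apart, while NDCG can. All helper names are illustrative, and average precision here divides by the number of relevant items in the list, which assumes every relevant item was retrieved.

```python
import math

def reciprocal_rank(relevances):
    """1 / rank of the first item with positive relevance (0.0 if none)."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def average_precision(relevances):
    """Binary AP: mean of precision values at each relevant position."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def dcg(relevances):
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0

ranking_a = [1, 3, 0, 2]  # weakly relevant document ranked first
ranking_b = [3, 1, 0, 2]  # perfectly relevant document ranked first

for name, rels in (("A", ranking_a), ("B", ranking_b)):
    print(name, reciprocal_rank(rels), round(average_precision(rels), 3),
          round(ndcg(rels), 3))
```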
INFORMATION RETRIEVAL COMPARISON

NDCG vs. Other Ranking Metrics

A comparison of key characteristics between NDCG and other common metrics used to evaluate ranked retrieval results, highlighting their suitability for different evaluation scenarios.

| Metric / Feature | NDCG | Precision/Recall at K | Mean Average Precision (MAP) | Mean Reciprocal Rank (MRR) |
| --- | --- | --- | --- | --- |
| Core Evaluation Focus | Graded relevance & ranking quality | Binary relevance at a cutoff | Binary relevance & rank of all relevant items | Rank of the first relevant item |
| Handles Graded Relevance | Yes | No | No | No |
| Position-Sensitive Discounting | | | | |
| Requires Complete Relevance Judgments | | | | |
| Interpretation Range | 0.0 to 1.0 | 0.0 to 1.0 | 0.0 to 1.0 | 0.0 to 1.0 |
| Ideal Use Case | Evaluating web search, recommendation systems | Quick sanity checks, top-K performance | Benchmarking academic retrieval systems | Question answering, tasks with a single correct answer |
| Primary Weakness | Requires defining a gain function and discount | Ignores relevance beyond cutoff K | Assumes all relevant items are known | Ignores performance after first relevant item |
| Common in RAG Evaluation | | | | |

NDCG

Frequently Asked Questions

Normalized Discounted Cumulative Gain (NDCG) is a core metric for evaluating ranked lists where items have graded relevance. These questions address its mechanics, calculation, and application in modern AI systems.

Normalized Discounted Cumulative Gain (NDCG) is an information retrieval metric that evaluates the quality of a ranked list by accounting for both the graded relevance of items and their positional rank, with higher scores given to relevant items placed at the top of the list. It works by comparing the Discounted Cumulative Gain (DCG) of a predicted ranking to the Ideal DCG (IDCG) of a perfectly ordered list. The core principle is that the utility of a relevant document is discounted logarithmically based on its rank, reflecting the user's diminishing attention. The final NDCG score is the ratio DCG/IDCG, resulting in a normalized value between 0 and 1, where 1 represents a perfect ranking.

Key Mechanism: For a ranked list, the gain from each item is its relevance score (e.g., 0, 1, 2, 3). This gain is then divided by a log discount factor (e.g., log2(rank + 1)). The sum of these discounted gains is the DCG. NDCG is this DCG divided by the maximum possible DCG (the IDCG), which is obtained by sorting all items by their true relevance in descending order.
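
As a worked instance of the mechanism, take a hypothetical ranked list with relevance grades [3, 2, 0, 1], whose ideal ordering is [3, 2, 1, 0]. The numbers are rounded to three decimals.

```latex
\begin{aligned}
\mathrm{DCG@4}  &= \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{0}{\log_2 4} + \frac{1}{\log_2 5}
                 \approx 3 + 1.262 + 0 + 0.431 = 4.693 \\
\mathrm{IDCG@4} &= \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{1}{\log_2 4} + \frac{0}{\log_2 5}
                 \approx 3 + 1.262 + 0.5 + 0 = 4.762 \\
\mathrm{NDCG@4} &= \frac{\mathrm{DCG@4}}{\mathrm{IDCG@4}} \approx \frac{4.693}{4.762} \approx 0.985
\end{aligned}
```

Swapping the last two items costs only about 1.5% of the ideal score because the logarithmic discount is already steep at ranks 3 and 4.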

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.