Inferensys

Glossary

Source Citation Recall

Source Citation Recall is an evaluation metric that measures the proportion of factual statements in a generated answer that are correctly attributed to their originating source documents.
RAG EVALUATION METRIC

What is Source Citation Recall?

Source Citation Recall is a critical metric for evaluating the attribution integrity of Retrieval-Augmented Generation (RAG) systems.

Source Citation Recall is a retrieval-augmented generation (RAG) evaluation metric that measures the proportion of factual statements or claims in a generated answer that are correctly attributed to their originating source documents. It is a recall-oriented measure focused on ensuring all information presented as fact is traceable, directly addressing the risk of unattributed hallucinations. A high score indicates the model successfully cites sources for most of its claims, a key requirement for verifiable AI in enterprise applications.

This metric is calculated by comparing the atomic facts in a generated answer against a ground truth mapping of those facts to source passages. It complements Source Citation Precision, which measures if citations are accurate. Together, they form a complete picture of a system's attribution quality. High Source Citation Recall is essential for building trust in domains like legal analysis, healthcare, and finance, where auditability and factual grounding are non-negotiable.

RAG EVALUATION METRICS

Key Characteristics of Source Citation Recall

Source Citation Recall is a critical metric for assessing the attribution integrity of Retrieval-Augmented Generation (RAG) systems. It quantifies how thoroughly a generated answer credits the source documents that contain its factual claims.

01

Core Definition & Formula

Source Citation Recall measures the proportion of source statements or facts used in a generated answer that are correctly attributed to their originating documents. It is calculated as:

(Number of Correctly Attributed Facts) / (Total Number of Verifiable Facts in the Answer)

  • A fact is a discrete, atomic piece of information derived from the source context.
  • Attribution is typically a direct citation (e.g., [Doc1]) or an indirect reference that can be unambiguously traced to a specific source passage.
  • A score of 1.0 indicates perfect attribution; a score of 0.0 indicates no facts are correctly cited, even if the answer is factually correct.
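The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Fact` record, its field names, and the choice to return 0.0 when an answer has no verifiable facts are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fact:
    """One atomic fact from the answer (hypothetical record layout)."""
    text: str
    gold_source: Optional[str]   # annotated originating document, None if not verifiable
    cited_source: Optional[str]  # document the answer actually cited, None if uncited


def source_citation_recall(facts: List[Fact]) -> float:
    """(correctly attributed facts) / (total verifiable facts in the answer)."""
    verifiable = [f for f in facts if f.gold_source is not None]
    if not verifiable:
        return 0.0  # assumed convention for answers with nothing to attribute
    correct = sum(1 for f in verifiable if f.cited_source == f.gold_source)
    return correct / len(verifiable)


facts = [
    Fact("The project launched in 2023", "Doc1", "Doc1"),        # cited correctly
    Fact("It used a transformer architecture", "Doc2", None),    # left uncited
    Fact("The model was trained on public data", "Doc2", "Doc1"),  # cited to the wrong source
    Fact("The launch was announced in March", "Doc3", "Doc3"),   # cited correctly
]

print(source_citation_recall(facts))  # 2 of 4 verifiable facts are correctly attributed
```

Note that the third fact is factually supported by the sources but still lowers the score, because its citation does not point to the originating document.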
02

Contrast with Source Citation Precision

While Source Citation Recall measures attribution completeness, Source Citation Precision measures attribution accuracy. They are complementary metrics that form a complete picture of citation quality.

  • Recall (Completeness): "Of all the facts in the answer, how many were cited?" High recall minimizes unattributed claims.
  • Precision (Accuracy): "Of all the citations provided, how many were correct?" High precision minimizes incorrect or hallucinated citations.

An ideal RAG system optimizes for both. A system can have high recall but low precision (cites everything, but many citations are wrong) or high precision but low recall (citations are accurate but many facts are left uncited).
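The two failure patterns described above can be made concrete with a small sketch. It assumes each answer fact has exactly one annotated gold source and may carry any number of citations; the function name and the data layout are hypothetical.

```python
from typing import List, Tuple


def citation_scores(facts: List[Tuple[str, List[str]]]) -> Tuple[float, float]:
    """facts: one (gold_source, cited_sources) pair per answer fact.

    Returns (recall, precision):
      recall    = facts whose citations include the gold source / total facts
      precision = citations that point to the gold source / total citations
    """
    total_facts = len(facts)
    total_citations = sum(len(cited) for _, cited in facts)
    facts_cited_correctly = sum(1 for gold, cited in facts if gold in cited)
    correct_citations = sum(1 for gold, cited in facts for c in cited if c == gold)
    recall = facts_cited_correctly / total_facts if total_facts else 0.0
    precision = correct_citations / total_citations if total_citations else 0.0
    return recall, precision


# "Cites everything": every fact is tagged with all three documents.
cite_everything = [
    ("Doc1", ["Doc1", "Doc2", "Doc3"]),
    ("Doc2", ["Doc1", "Doc2", "Doc3"]),
    ("Doc3", ["Doc1", "Doc2", "Doc3"]),
]

# "Cautious": the only citation given is correct, but two facts are uncited.
cautious = [
    ("Doc1", ["Doc1"]),
    ("Doc2", []),
    ("Doc3", []),
]

print(citation_scores(cite_everything))  # high recall, low precision
print(citation_scores(cautious))         # low recall, high precision
```

Reporting the two numbers side by side makes it harder for a system to look good by over-citing or by citing only its safest claims.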

03

Relationship to Answer Faithfulness

Source Citation Recall applies a stricter, more granular test than the broader Answer Faithfulness metric.

  • Answer Faithfulness asks: "Is the entire answer supported by the provided context?" It's a holistic, often binary or graded judgment.
  • Source Citation Recall asks: "For each specific factual claim within the answer, can you point to its exact source?" It requires traceability.

A high Source Citation Recall score is strong evidence for high Answer Faithfulness, as it provides an audit trail. However, it is possible for an answer to be faithful (all info is in the context) but have low citation recall if the model fails to explicitly link claims to sources.

04

Operationalization & Evaluation

Measuring Source Citation Recall requires structured evaluation, often involving:

  • Atomic Fact Decomposition: Breaking the generated answer into individual, verifiable propositions (e.g., "The project launched in 2023," "It used a transformer architecture").
  • Source Provenance Mapping: For each atomic fact, determining if it exists in the retrieved source documents and identifying the specific passage.
  • Automated Methods: Using Named Entity Recognition (NER) and Semantic Similarity (e.g., Sentence-BERT) to align answer claims with source sentences. Frameworks like RAGAS and TruLens implement related automated checks, such as faithfulness and groundedness scoring, that approximate parts of this process.
  • Human-in-the-Loop: For high-stakes applications, human annotators perform the final mapping to establish a ground-truth benchmark for automated scorers.
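As a rough illustration of the first two steps, the sketch below splits an answer into claims and extracts inline citation markers. It makes strong simplifying assumptions: each sentence is treated as one atomic claim, citations use the `[DocN]` form and sit inside the sentence they support, and the helper name `split_claims` is hypothetical. Production pipelines typically use an LLM or a dedicated parser for decomposition.

```python
import re
from typing import List, Tuple

# Assumed citation format: [Doc1], [Doc2], ...
CITATION = re.compile(r"\[(Doc\d+)\]")


def split_claims(answer: str) -> List[Tuple[str, List[str]]]:
    """Naively treat each sentence as one claim; return (claim_text, cited_ids)."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    claims = []
    for sentence in sentences:
        if not sentence:
            continue
        cited = CITATION.findall(sentence)
        # Remove the markers, then tidy the space left before punctuation.
        text = CITATION.sub("", sentence)
        text = re.sub(r"\s+([.,;!?])", r"\1", text).strip()
        claims.append((text, cited))
    return claims


answer = "The project launched in 2023 [Doc1]. It used a transformer architecture."
claims = split_claims(answer)
uncited = [text for text, cited in claims if not cited]

print(claims)
print(uncited)  # claims that still need a source before recall can reach 1.0
```

The list of uncited claims is often the most useful output in practice, since it tells an annotator exactly which statements to trace back to the retrieved passages.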
05

Impact on Hallucination Mitigation

High Source Citation Recall is a primary engineering defense against model hallucination in RAG systems. It enforces a discipline of provenance.

  • Attribution Pressure: Designing the system to require citations for key claims forces the generator to ground its output.
  • Debugging Aid: When a hallucination occurs, a missing or incorrect citation for that fact immediately flags it as unsupported, speeding up root-cause analysis (e.g., was the fact missing from retrieval, or did the generator ignore the source?).
  • User Trust: Answers with clear, comprehensive citations allow users to verify information, increasing confidence in the system's outputs, especially in domains like legal, medical, and financial analysis.
06

Common Failure Modes & Challenges

Achieving high Source Citation Recall presents specific engineering challenges:

  • Fact Fusion: The generator synthesizes information from multiple sources into a single sentence. Should this be one citation or several? Recall scoring must handle multi-source facts.
  • Paraphrasing & Summarization: The answer paraphrases a source. Automated metrics must recognize semantic equivalence beyond lexical overlap.
  • Implicit Knowledge: The generator adds universally true or common-sense facts not explicitly in the sources (e.g., "Paris is the capital of France"). Evaluators must decide if this requires a citation.
  • Noise in Retrieval: If retrieved documents contain conflicting or subtly incorrect data, the generator may correctly cite a source that is itself wrong. Citation recall and precision both score that citation as correct, so the quality of the sources has to be checked separately.
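For the fact-fusion and implicit-knowledge cases, the scoring policy has to be chosen explicitly. The sketch below shows two possible policies, strict (every gold source must be cited) and partial credit (fraction of gold sources cited), and excludes facts with no gold source from the denominator. Both policies and the function name are assumptions for illustration, not an established standard.

```python
from typing import List, Set, Tuple


def multi_source_recall(
    facts: List[Tuple[Set[str], Set[str]]],
    partial_credit: bool = False,
) -> float:
    """facts: one (gold_sources, cited_sources) pair per answer fact.

    Facts with an empty gold set (e.g. common-sense statements) are skipped,
    which is one possible answer to the "implicit knowledge" question.
    """
    scored = [(gold, cited) for gold, cited in facts if gold]
    if not scored:
        return 0.0
    total = 0.0
    for gold, cited in scored:
        covered = len(gold & cited)
        if partial_credit:
            total += covered / len(gold)                   # share of gold sources cited
        else:
            total += 1.0 if covered == len(gold) else 0.0  # all-or-nothing
    return total / len(scored)


facts = [
    ({"Doc1", "Doc2"}, {"Doc1"}),  # fused fact, only one of two sources cited
    ({"Doc3"}, {"Doc3"}),          # single-source fact, cited correctly
    (set(), set()),                # implicit knowledge, not scored
]

print(multi_source_recall(facts))                       # strict policy
print(multi_source_recall(facts, partial_credit=True))  # partial-credit policy
```

Whichever policy is used, it should be fixed before evaluation and reported with the score, since the same answer can receive noticeably different values under each.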
RAG EVALUATION METRICS COMPARISON

Source Citation Recall vs. Related Metrics

A comparison of Source Citation Recall with other key metrics used to evaluate the attribution, factual grounding, and overall quality of Retrieval-Augmented Generation (RAG) system outputs.

Metrics compared: Source Citation Recall, Source Citation Precision, Answer Faithfulness, Grounding Score

Primary Focus

  • Source Citation Recall: Completeness of attribution for all facts in the answer
  • Source Citation Precision: Accuracy of the citations that are provided
  • Answer Faithfulness: Factual consistency of the answer with the source context
  • Grounding Score: Overall strength of evidential support for the answer

Core Question

  • Source Citation Recall: What proportion of the answer's factual statements are cited?
  • Source Citation Precision: What proportion of the provided citations are correct?
  • Answer Faithfulness: Does the answer contain any unsupported or contradictory claims?
  • Grounding Score: How well is the answer substantiated by the source materials?

Measurement Target

  • Source Citation Recall: Statements/Facts in the generated answer
  • Source Citation Precision: Citation tokens/pointers in the generated answer
  • Answer Faithfulness: Entire generated answer
  • Grounding Score: Relationship between answer and source context

Relationship to Hallucination

  • Source Citation Recall: Directly measures omission of citations, a prerequisite for detecting source-less hallucinations
  • Source Citation Precision: Measures citation errors, which can mask or create hallucinations
  • Answer Faithfulness: Directly quantifies the rate of unsupported claims (hallucinations)
  • Grounding Score: A composite often incorporating faithfulness; low score indicates high hallucination risk

Typical Calculation

  • Source Citation Recall: (Number of answer facts with correct citation) / (Total number of answer facts)
  • Source Citation Precision: (Number of correct citations) / (Total number of citations provided)
  • Answer Faithfulness: 1 - (Number of unsupported or contradictory claims / Total claims)
  • Grounding Score: Aggregate of semantic similarity and citation-based checks between answer sentences and source passages

Key Weakness

  • Source Citation Recall: Does not assess citation accuracy; a fact can be cited to the wrong source
  • Source Citation Precision: Does not penalize missing citations; answer can be uncited but precise
  • Answer Faithfulness: Does not measure relevance of the answer to the original query
  • Grounding Score: Can be a black-box aggregate, making root-cause analysis difficult

Use Case in Debugging

  • Source Citation Recall: Identifies answers that are poorly sourced or where the model is 'making up' facts without attribution.
  • Source Citation Precision: Identifies errors in the citation mechanism or retrieval of incorrect passages.
  • Answer Faithfulness: Flags answers that contradict or invent information relative to the provided context.
  • Grounding Score: Provides a single score for overall grounding quality, useful for high-level monitoring.

Dependency on Inputs

  • Source Citation Recall: Requires annotated ground truth of which answer facts map to which source documents.
  • Source Citation Precision: Requires annotated ground truth of correct source-document mappings for citations.
  • Answer Faithfulness: Can be evaluated reference-free using NLI models between answer and context.
  • Grounding Score: Often uses a combination of model-based metrics (NLI, similarity) and rule-based checks.

RAG EVALUATION METRICS

Frameworks and Tools for Measuring Source Citation Recall

Measuring Source Citation Recall requires specialized frameworks that automate the comparison of generated text against source documents. These tools provide the quantitative rigor needed for Evaluation-Driven Development.

06

Custom Pipeline with Embedding Similarity

A custom evaluation pipeline can be built using sentence embeddings to compute recall. This method involves breaking the answer and the source documents into atomic statements and checking whether each answer statement is semantically supported by the source it cites.

  • Steps:
    1. Decompose the gold-standard source documents and the generated answer into atomic factual statements.
    2. Generate embeddings for each statement using a model like Sentence-BERT.
    3. For each answer statement, check whether the source it cites contains a semantically similar statement (cosine similarity above a threshold).
  • Metric: Source Citation Recall = (Answer Statements Matched to Their Cited Source) / (Total Answer Statements).
  • Advantage: For a fixed embedding model and threshold, the score is deterministic and avoids the run-to-run variability of a judge LLM.
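A minimal sketch of such a pipeline follows. To stay self-contained it replaces the sentence encoder with a toy bag-of-words vector, so `embed` here is only a stand-in for a model like Sentence-BERT, and the 0.6 threshold is arbitrary. The sketch scores from the answer side, counting an answer statement as matched only when the document it cites contains a similar statement, so that the ratio lines up with the formula under Core Definition & Formula; that direction, the function names, and the data layout are assumptions of this example.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def embedding_citation_recall(
    claims: List[Tuple[str, Optional[str]]],
    sources: Dict[str, List[str]],
    threshold: float = 0.6,
) -> float:
    """claims: (answer_statement, cited_doc_id or None); sources: doc_id -> statements."""
    if not claims:
        return 0.0
    matched = 0
    for text, cited_doc in claims:
        if cited_doc is None:
            continue  # uncited statements can never match
        claim_vec = embed(text)
        candidates = sources.get(cited_doc, [])
        if any(cosine(claim_vec, embed(s)) >= threshold for s in candidates):
            matched += 1
    return matched / len(claims)


sources = {
    "Doc1": ["The project launched in 2023 after a pilot"],
    "Doc2": ["The system used a transformer architecture"],
}
claims = [
    ("The project launched in 2023", "Doc1"),        # supported by the cited document
    ("It used a transformer architecture", "Doc1"),  # supported, but by Doc2, not Doc1
    ("The team had 40 engineers", None),             # uncited
]

print(embedding_citation_recall(claims, sources))  # only the first claim is matched
```

Swapping in a real encoder only requires replacing `embed` and `cosine`; the threshold then needs to be recalibrated against a small human-labeled sample, since similarity scales differ between models.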
SOURCE CITATION RECALL

Frequently Asked Questions

Source Citation Recall is a critical metric for evaluating the attribution integrity of Retrieval-Augmented Generation (RAG) systems. These questions address its definition, calculation, and role in ensuring verifiable AI outputs.

Source Citation Recall is a metric that measures the proportion of source statements or facts used in a generated answer that are correctly attributed to their originating documents. It quantifies an AI system's ability to provide traceable provenance for its claims, which is foundational for trust and auditability in enterprise RAG applications.

Unlike Answer Faithfulness, which checks if an answer is consistent with the source context, Source Citation Recall specifically evaluates the attribution mechanism itself. A high score indicates that the system's citations are comprehensive and accurately map generated content back to the specific passages that support it, minimizing unattributed synthesis.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. For more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.