Source Citation Recall is a retrieval-augmented generation (RAG) evaluation metric that measures the proportion of factual statements or claims in a generated answer that are correctly attributed to their originating source documents. It is a recall-oriented measure focused on ensuring all information presented as fact is traceable, directly addressing the risk of unattributed hallucinations. A high score indicates the model successfully cites sources for most of its claims, a key requirement for verifiable AI in enterprise applications.

What is Source Citation Recall?
Source Citation Recall is a critical metric for evaluating the attribution integrity of Retrieval-Augmented Generation (RAG) systems.
This metric is calculated by comparing the atomic facts in a generated answer against a ground-truth mapping of those facts to source passages. It complements Source Citation Precision, which measures whether the citations that are provided are accurate. Together, they form a complete picture of a system's attribution quality. High Source Citation Recall is essential for building trust in domains like legal analysis, healthcare, and finance, where auditability and factual grounding are non-negotiable.
Key Characteristics of Source Citation Recall
Source Citation Recall quantifies how thoroughly a generated answer credits the source documents that contain its factual claims.
Core Definition & Formula
Source Citation Recall measures the proportion of source statements or facts used in a generated answer that are correctly attributed to their originating documents. It is calculated as:
(Number of Correctly Attributed Facts) / (Total Number of Verifiable Facts in the Answer)
- A fact is a discrete, atomic piece of information derived from the source context.
- Attribution is typically a direct citation (e.g., [Doc1]) or an indirect reference that can be unambiguously traced to a specific source passage.
- A score of 1.0 indicates perfect attribution; a score of 0.0 indicates no facts are correctly cited, even if the answer is factually correct.
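As a minimal sketch of the formula above, assuming each atomic fact has already been annotated with the source the answer cites and the source it truly comes from, the score reduces to a simple ratio. The `Fact` class, its field names, and the choice to return 0.0 when no verifiable facts exist are illustrative assumptions, not part of any standard library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fact:
    text: str
    cited_source: Optional[str]  # source ID the answer cites, or None if uncited
    true_source: Optional[str]   # annotated originating source, or None if not verifiable


def source_citation_recall(facts: List[Fact]) -> float:
    """Correctly attributed facts divided by verifiable facts in the answer."""
    verifiable = [f for f in facts if f.true_source is not None]
    if not verifiable:
        return 0.0  # assumed convention for answers with no verifiable facts
    correct = sum(1 for f in verifiable if f.cited_source == f.true_source)
    return correct / len(verifiable)


facts = [
    Fact("The project launched in 2023.", "Doc1", "Doc1"),       # correct citation
    Fact("It used a transformer architecture.", "Doc2", "Doc2"), # correct citation
    Fact("It was trained on public data.", None, "Doc2"),        # uncited
    Fact("The team published a report.", "Doc3", "Doc3"),        # correct citation
]
print(source_citation_recall(facts))  # 0.75
```

Three of the four verifiable facts carry a correct citation, so the score is 0.75.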
Contrast with Source Citation Precision
While Source Citation Recall measures attribution completeness, Source Citation Precision measures attribution accuracy. They are complementary metrics that form a complete picture of citation quality.
- Recall (Completeness): "Of all the facts in the answer, how many were cited?" High recall minimizes unattributed claims.
- Precision (Accuracy): "Of all the citations provided, how many were correct?" High precision minimizes incorrect or hallucinated citations.
An ideal RAG system optimizes for both. A system can have high recall but low precision (cites everything, but many citations are wrong) or high precision but low recall (citations are accurate but many facts are left uncited).
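The two failure patterns can be reproduced with a small sketch. It assumes each fact maps to the set of sources it cites and the set of sources that actually support it, and it counts a fact as cited when at least one of its citations is correct. The function name, the dictionary layout, and that counting rule are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


def citation_precision_recall(
    cited: Dict[str, Set[str]], gold: Dict[str, Set[str]]
) -> Tuple[float, float]:
    """Return (precision, recall) for citations against a gold fact-to-source mapping."""
    total_citations = sum(len(sources) for sources in cited.values())
    correct_citations = sum(
        len(sources & gold.get(fact, set())) for fact, sources in cited.items()
    )
    cited_facts = sum(1 for fact, sources in gold.items() if cited.get(fact, set()) & sources)
    precision = correct_citations / total_citations if total_citations else 0.0
    recall = cited_facts / len(gold) if gold else 0.0
    return precision, recall


gold = {"f1": {"Doc1"}, "f2": {"Doc2"}, "f3": {"Doc3"}, "f4": {"Doc1"}}

# Cites every document for every fact: all facts covered, most citations wrong.
cite_everything = {fact: {"Doc1", "Doc2", "Doc3", "Doc4"} for fact in gold}
print(citation_precision_recall(cite_everything, gold))  # (0.25, 1.0)

# Cites only one fact, correctly: citations are accurate, most facts uncited.
cite_sparingly = {"f1": {"Doc1"}}
print(citation_precision_recall(cite_sparingly, gold))   # (1.0, 0.25)
```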
Relationship to Answer Faithfulness
Source Citation Recall is a stricter, more granular subset of the broader Answer Faithfulness metric.
- Answer Faithfulness asks: "Is the entire answer supported by the provided context?" It's a holistic, often binary or graded judgment.
- Source Citation Recall asks: "For each specific factual claim within the answer, can you point to its exact source?" It requires traceability.
A high Source Citation Recall score is strong evidence for high Answer Faithfulness, as it provides an audit trail. However, it is possible for an answer to be faithful (all info is in the context) but have low citation recall if the model fails to explicitly link claims to sources.
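The gap between the two metrics can be shown with a short sketch. It assumes every claim has two hypothetical annotations, whether the context supports it and whether it carries a correct citation, and it scores faithfulness as one minus the share of unsupported claims. The `Claim` class and both function names are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    text: str
    supported_by_context: bool  # hypothetical annotation: context entails the claim
    correctly_cited: bool       # hypothetical annotation: claim cites its true source


def answer_faithfulness(claims: List[Claim]) -> float:
    if not claims:
        raise ValueError("at least one claim is required")
    unsupported = sum(1 for c in claims if not c.supported_by_context)
    return 1.0 - unsupported / len(claims)


def citation_recall(claims: List[Claim]) -> float:
    if not claims:
        raise ValueError("at least one claim is required")
    return sum(1 for c in claims if c.correctly_cited) / len(claims)


# Every claim is in the context, but only the first is linked to its source.
claims = [
    Claim("The project launched in 2023.", True, True),
    Claim("It used a transformer architecture.", True, False),
    Claim("It was trained on public data.", True, False),
    Claim("The team published a report.", True, False),
]
print(answer_faithfulness(claims))  # 1.0
print(citation_recall(claims))      # 0.25
```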
Operationalization & Evaluation
Measuring Source Citation Recall requires structured evaluation, often involving:
- Atomic Fact Decomposition: Breaking the generated answer into individual, verifiable propositions (e.g., "The project launched in 2023," "It used a transformer architecture").
- Source Provenance Mapping: For each atomic fact, determining if it exists in the retrieved source documents and identifying the specific passage.
- Automated Methods: Using Named Entity Recognition (NER) and Semantic Similarity (e.g., Sentence-BERT) to align answer claims with source sentences. Frameworks like RAGAS and TruLens implement automated approximations.
- Human-in-the-Loop: For high-stakes applications, human annotators perform the final mapping to establish a ground-truth benchmark for automated scorers.
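A rough sketch of the first two steps follows, under strong simplifying assumptions: each sentence is treated as one atomic fact, citations use the inline `[DocN]` marker format, and a human-annotated `gold` dictionary maps each fact to its true source. Real decomposition usually needs an LLM or a dedicated parser; the function names here are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple

CITATION = re.compile(r"\[(Doc\d+)\]")  # assumed inline marker format


def split_cited_sentences(answer: str) -> List[Tuple[str, Set[str]]]:
    """Split an answer into (sentence text, cited source IDs) pairs."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = set(CITATION.findall(sentence))
        text = CITATION.sub("", sentence)                  # drop the markers
        text = re.sub(r"\s+([.!?])", r"\1", text).strip()  # tidy space before punctuation
        text = re.sub(r"\s+", " ", text)
        results.append((text, cited))
    return results


def recall_against_gold(answer: str, gold: Dict[str, str]) -> float:
    """Share of annotated facts whose cited sources include the true source."""
    facts = [(t, c) for t, c in split_cited_sentences(answer) if t in gold]
    if not facts:
        return 0.0
    return sum(1 for t, c in facts if gold[t] in c) / len(facts)


answer = (
    "The project launched in 2023 [Doc1]. "
    "It used a transformer architecture [Doc3]. "
    "It was trained on public data."
)
gold = {
    "The project launched in 2023.": "Doc1",
    "It used a transformer architecture.": "Doc2",  # answer cites the wrong document
    "It was trained on public data.": "Doc2",       # answer gives no citation
}
print(recall_against_gold(answer, gold))  # one of three facts is correctly cited
```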
Impact on Hallucination Mitigation
High Source Citation Recall is a primary engineering defense against model hallucination in RAG systems. It enforces a discipline of provenance.
- Attribution Pressure: Designing the system to require citations for key claims forces the generator to ground its output.
- Debugging Aid: When a hallucination occurs, the missing or incorrect citation for that fact immediately flags it as unsupported, speeding up root-cause analysis (e.g., was the fact missing from retrieval, or did the generator ignore the source?).
- User Trust: Answers with clear, comprehensive citations allow users to verify information, increasing confidence in the system's outputs, especially in domains like legal, medical, and financial analysis.
Common Failure Modes & Challenges
Achieving high Source Citation Recall presents specific engineering challenges:
- Fact Fusion: The generator synthesizes information from multiple sources into a single sentence. Should this be one citation or several? Recall scoring must handle multi-source facts.
- Paraphrasing & Summarization: The answer paraphrases a source. Automated metrics must recognize semantic equivalence beyond lexical overlap.
- Implicit Knowledge: The generator adds universally true or common-sense facts not explicitly in the sources (e.g., "Paris is the capital of France"). Evaluators must decide if this requires a citation.
- Noise in Retrieval: If retrieved documents contain conflicting or subtly incorrect data, the generator may correctly cite a source that is itself wrong. Citation recall and precision verify attribution, not the truth of the source, so they should be paired with checks on source quality.
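For the Fact Fusion case, one possible way to score a fact that draws on several sources is to pick an explicit policy. The sketch below shows three hypothetical options (require all supporting sources, accept any one, or give partial credit); none of them is a standard, and the function names are illustrative.

```python
from typing import Set


def is_attributed(cited: Set[str], required: Set[str], policy: str = "all") -> bool:
    """Decide whether a multi-source fact counts as attributed under a policy."""
    if policy == "all":   # strict: every supporting source must be cited
        return bool(required) and required <= cited
    if policy == "any":   # lenient: one supporting source is enough
        return bool(required & cited)
    raise ValueError(f"unknown policy: {policy}")


def partial_credit(cited: Set[str], required: Set[str]) -> float:
    """Fraction of the supporting sources that the answer actually cites."""
    return len(required & cited) / len(required) if required else 0.0


# A fused sentence supported by Doc1 and Doc2, but citing only Doc1.
required = {"Doc1", "Doc2"}
cited = {"Doc1"}
print(is_attributed(cited, required, "all"))  # False
print(is_attributed(cited, required, "any"))  # True
print(partial_credit(cited, required))        # 0.5
```

Whichever policy is chosen, applying it consistently across evaluation runs keeps scores comparable.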
Source Citation Recall vs. Related Metrics
A comparison of Source Citation Recall with other key metrics used to evaluate the attribution, factual grounding, and overall quality of Retrieval-Augmented Generation (RAG) system outputs.
| Metric | Source Citation Recall | Source Citation Precision | Answer Faithfulness | Grounding Score |
|---|---|---|---|---|
| Primary Focus | Completeness of attribution for all facts in the answer | Accuracy of the citations that are provided | Factual consistency of the answer with the source context | Overall strength of evidential support for the answer |
| Core Question | What proportion of the answer's factual statements are cited? | What proportion of the provided citations are correct? | Does the answer contain any unsupported or contradictory claims? | How well is the answer substantiated by the source materials? |
| Measurement Target | Statements/Facts in the generated answer | Citation tokens/pointers in the generated answer | Entire generated answer | Relationship between answer and source context |
| Relationship to Hallucination | Directly measures omission of citations, a prerequisite for detecting source-less hallucinations | Measures citation errors, which can mask or create hallucinations | Directly quantifies the rate of unsupported claims (hallucinations) | A composite often incorporating faithfulness; low score indicates high hallucination risk |
| Typical Calculation | (Number of answer facts with correct citation) / (Total number of answer facts) | (Number of correct citations) / (Total number of citations provided) | 1 - (Number of unsupported or contradictory claims / Total claims) | Aggregate of semantic similarity and citation-based checks between answer sentences and source passages |
| Key Weakness | Does not assess citation accuracy; a fact can be cited to the wrong source | Does not penalize missing citations; answer can be uncited but precise | Does not measure relevance of the answer to the original query | Can be a black-box aggregate, making root-cause analysis difficult |
| Use Case in Debugging | Identifies answers that are poorly sourced or where the model is 'making up' facts without attribution. | Identifies errors in the citation mechanism or retrieval of incorrect passages. | Flags answers that contradict or invent information relative to the provided context. | Provides a single score for overall grounding quality, useful for high-level monitoring. |
| Dependency on Inputs | Requires annotated ground truth of which answer facts map to which source documents. | Requires annotated ground truth of correct source-document mappings for citations. | Can be evaluated reference-free using NLI models between answer and context. | Often uses a combination of model-based metrics (NLI, similarity) and rule-based checks. |
Frameworks and Tools for Measuring Source Citation Recall
Measuring Source Citation Recall requires specialized frameworks that automate the comparison of generated text against source documents. These tools provide the quantitative rigor needed for Evaluation-Driven Development.
Custom Pipeline with Embedding Similarity
A custom evaluation pipeline can be built using sentence embeddings to compute recall. This method involves breaking the answer and source documents into claims and measuring semantic overlap.
- Steps:
  - Decompose the gold-standard source documents and the generated answer into atomic factual statements.
  - Generate embeddings for each statement using a model like Sentence-BERT.
  - For each source statement, check whether the answer contains a semantically similar statement (cosine similarity above a threshold).
- Metric: Source Citation Recall = (Matched Source Statements) / (Total Source Statements). Note that this variant normalizes by source statements rather than answer facts, so it measures how much of the gold-standard source material the answer covers.
- Advantage: Provides a deterministic score that avoids the run-to-run variability of an LLM judge.
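The steps above can be sketched as follows. To keep the example self-contained, it uses a toy bag-of-words embedding as a stand-in for Sentence-BERT, so it only matches near-lexical overlap and will miss true paraphrases. In practice the `embed` argument would wrap a real sentence-embedding model. The function names and the 0.7 threshold are assumptions.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: token counts (NOT a semantic model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embedding_recall(
    source_statements: List[str],
    answer_statements: List[str],
    embed: Callable[[str], Counter] = toy_embed,
    threshold: float = 0.7,  # assumed cut-off; tune on labeled data
) -> float:
    """Matched source statements divided by total source statements."""
    if not source_statements:
        return 0.0
    answer_vecs = [embed(s) for s in answer_statements]
    matched = sum(
        1
        for src in source_statements
        if any(cosine(embed(src), vec) >= threshold for vec in answer_vecs)
    )
    return matched / len(source_statements)


sources = [
    "The project launched in 2023",
    "It used a transformer architecture",
    "The team had twelve engineers",
]
answer = ["The project was launched in 2023", "It used a transformer architecture"]
print(embedding_recall(sources, answer))  # two of three source statements match
```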
Frequently Asked Questions
These questions address the definition and calculation of Source Citation Recall and its role in ensuring verifiable AI outputs.
Source Citation Recall is a metric that measures the proportion of source statements or facts used in a generated answer that are correctly attributed to their originating documents. It quantifies an AI system's ability to provide traceable provenance for its claims, which is foundational for trust and auditability in enterprise RAG applications.
Unlike Answer Faithfulness, which checks if an answer is consistent with the source context, Source Citation Recall specifically evaluates the attribution mechanism itself. A high score indicates that the system's citations are comprehensive and accurately map generated content back to the specific passages that support it, minimizing unattributed synthesis.
Related Terms
Source Citation Recall is one component of a comprehensive evaluation suite for Retrieval-Augmented Generation systems. These related metrics measure different facets of retrieval quality, answer fidelity, and overall system performance.
Source Citation Precision
The counterpart to recall, Source Citation Precision measures the accuracy of the citations themselves. It calculates the proportion of citations in a generated answer that correctly and accurately reference the source of the stated information.
- High precision, low recall: The answer cites few sources, but those citations are correct. Information may be missing citations.
- Balanced with recall: A system with high scores in both metrics produces answers where most information is cited (high recall) and those citations are accurate (high precision).
Answer Faithfulness
Answer Faithfulness (or Factual Consistency) measures the extent to which a generated answer is factually consistent with and supported by the provided source context. It is a prerequisite for accurate citation.
- Core Question: Does the answer contain any statements that cannot be inferred from the provided sources?
- Relationship to Citation: A faithful answer can be fully supported by citations, making high citation recall possible. An unfaithful answer contains hallucinations, which by definition cannot be correctly cited because no source supports them.
Grounding Score
A Grounding Score evaluates the degree to which a model's generated output is substantiated by specific, attributable information from its provided source materials. It is a broader concept that often encompasses both faithfulness and citation quality.
- Implementation: Can be measured by cross-referencing generated claims with source text, often using Natural Language Inference (NLI) models or entailment classifiers.
- Key Difference: While Citation Recall/Precision check for explicit citation markers, grounding scores evaluate the implicit factual support, which is crucial for evaluating answers where citations are not formatted.
Context Relevance
Context Relevance assesses the degree to which the text passages retrieved and provided to the LLM are pertinent and useful for answering the specific query. It is an upstream metric that directly impacts citation quality.
- Cause and Effect: If retrieved context is irrelevant (low context relevance), the LLM cannot generate a well-supported answer, leading to low citation recall and potential hallucinations.
- Measurement: Often scored by having an LLM judge, on a scale, whether a retrieved passage contains necessary information to answer the query.
Retrieval Recall
In the retrieval stage, Retrieval Recall measures the proportion of all relevant documents in a corpus that are successfully retrieved for a given query. This is a foundational metric for the entire RAG pipeline.
- Direct Impact: Low retrieval recall means relevant source documents never reach the LLM, placing a hard ceiling on the potential Source Citation Recall. The LLM cannot cite what it never sees.
- Optimization Target: Improving retrieval recall (via better embedding models, hybrid search, or chunking strategies) is often the first step to improving end-to-end citation performance.
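The ceiling effect can be made concrete with a short sketch. It assumes document IDs are available for the retrieved set and the relevant set, and that each fact an ideal answer should state is annotated with its supporting document. Both function names and the data layout are hypothetical.

```python
from typing import Dict, Set


def retrieval_recall(retrieved: Set[str], relevant: Set[str]) -> float:
    """Share of relevant documents that were actually retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def citation_recall_ceiling(fact_sources: Dict[str, str], retrieved: Set[str]) -> float:
    """Share of facts whose supporting document reached the LLM at all."""
    if not fact_sources:
        return 0.0
    citable = sum(1 for source in fact_sources.values() if source in retrieved)
    return citable / len(fact_sources)


relevant = {"Doc1", "Doc2", "Doc3", "Doc4"}
retrieved = {"Doc1", "Doc2", "Doc9"}  # Doc3 and Doc4 were missed
fact_sources = {"f1": "Doc1", "f2": "Doc3", "f3": "Doc2", "f4": "Doc4"}

print(retrieval_recall(retrieved, relevant))             # 0.5
print(citation_recall_ceiling(fact_sources, retrieved))  # 0.5
```

Here the generator can correctly cite at most half of the facts, regardless of how well it attributes what it sees.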

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.