Glossary

Source Citation Recall

Source Citation Recall is an evaluation metric that measures the proportion of factual statements in a generated answer that are correctly attributed to their originating source documents.

RAG EVALUATION METRIC

What is Source Citation Recall?

Source Citation Recall is a critical metric for evaluating the attribution integrity of Retrieval-Augmented Generation (RAG) systems.

Source Citation Recall is a retrieval-augmented generation (RAG) evaluation metric that measures the proportion of factual statements or claims in a generated answer that are correctly attributed to their originating source documents. It is a recall-oriented measure focused on ensuring all information presented as fact is traceable, directly addressing the risk of unattributed hallucinations. A high score indicates the model successfully cites sources for most of its claims, a key requirement for verifiable AI in enterprise applications.

This metric is calculated by comparing the atomic facts in a generated answer against a ground truth mapping of those facts to source passages. It complements Source Citation Precision, which measures if citations are accurate. Together, they form a complete picture of a system's attribution quality. High Source Citation Recall is essential for building trust in domains like legal analysis, healthcare, and finance, where auditability and factual grounding are non-negotiable.

RAG EVALUATION METRICS

Key Characteristics of Source Citation Recall

Source Citation Recall is a critical metric for assessing the attribution integrity of Retrieval-Augmented Generation (RAG) systems. It quantifies how thoroughly a generated answer credits the source documents that contain its factual claims.

Core Definition & Formula

Source Citation Recall measures the proportion of source statements or facts used in a generated answer that are correctly attributed to their originating documents. It is calculated as:

(Number of Correctly Attributed Facts) / (Total Number of Verifiable Facts in the Answer)

A fact is a discrete, atomic piece of information derived from the source context.
Attribution is typically a direct citation (e.g., [Doc1]) or an indirect reference that can be unambiguously traced to a specific source passage.
A score of 1.0 indicates perfect attribution; a score of 0.0 indicates no facts are correctly cited, even if the answer is factually correct.
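The formula above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Fact` record, its field names, and the sample facts are hypothetical, and it assumes a ground-truth mapping from each fact to its supporting sources is already available. A fact counts as correctly attributed when at least one cited source is among its supporting sources.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    """One atomic fact from the answer (hypothetical record for illustration)."""
    text: str
    cited: set = field(default_factory=set)       # sources the answer cites for this fact
    supporting: set = field(default_factory=set)  # ground-truth sources that contain it


def source_citation_recall(facts):
    """Correctly attributed facts divided by all verifiable facts passed in."""
    if not facts:
        return 0.0  # assumption: an answer with no facts scores 0.0
    correct = sum(1 for f in facts if f.cited & f.supporting)
    return correct / len(facts)


facts = [
    Fact("The project launched in 2023.", {"Doc1"}, {"Doc1"}),          # cited correctly
    Fact("It used a transformer architecture.", set(), {"Doc2"}),       # no citation
    Fact("It was trained on public data.", {"Doc1"}, {"Doc3"}),         # wrong source
    Fact("The team had 12 engineers.", {"Doc2", "Doc3"}, {"Doc3"}),     # one citation is right
]

print(source_citation_recall(facts))  # 2 of 4 facts are correctly attributed: 0.5
```

Note that the third fact may well be true; it still scores zero because its citation points at a source that does not contain it.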

Contrast with Source Citation Precision

While Source Citation Recall measures attribution completeness, Source Citation Precision measures attribution accuracy. They are complementary metrics that form a complete picture of citation quality.

Recall (Completeness): "Of all the facts in the answer, how many were cited?" High recall minimizes unattributed claims.
Precision (Accuracy): "Of all the citations provided, how many were correct?" High precision minimizes incorrect or hallucinated citations.

An ideal RAG system optimizes for both. A system can have high recall but low precision (every fact carries a correct citation, but many additional citations are wrong) or high precision but low recall (the citations given are accurate, but many facts are left uncited).
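The two failure patterns can be made concrete with a small sketch. The data layout (one pair of cited and supporting source sets per fact) and the document IDs are assumptions made for illustration; a citation is treated as correct when the cited source is among the fact's supporting sources.

```python
def citation_precision_recall(facts):
    """facts: list of (cited_sources, supporting_sources) set pairs, one per answer fact."""
    total_citations = sum(len(cited) for cited, _ in facts)
    correct_citations = sum(len(cited & supporting) for cited, supporting in facts)
    covered_facts = sum(1 for cited, supporting in facts if cited & supporting)
    precision = correct_citations / total_citations if total_citations else 0.0
    recall = covered_facts / len(facts) if facts else 0.0
    return precision, recall


# Over-citing: each fact has its correct source plus two spurious ones.
over_citing = [
    ({"Doc1", "Doc2", "Doc3"}, {"Doc1"}),
    ({"Doc1", "Doc2", "Doc3"}, {"Doc2"}),
]

# Under-citing: the only citation given is correct, but one fact has none.
under_citing = [
    ({"Doc1"}, {"Doc1"}),
    (set(), {"Doc2"}),
]

print(citation_precision_recall(over_citing))   # precision 2/6, recall 1.0
print(citation_precision_recall(under_citing))  # precision 1.0, recall 0.5
```

Tracking the two numbers separately shows which behavior to correct: spurious citations in the first case, missing citations in the second.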

Relationship to Answer Faithfulness

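Source Citation Recall is a stricter, more granular companion to the broader Answer Faithfulness metric: it requires not only that each claim is supported by the context, but that the answer explicitly points to the supporting source.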

Answer Faithfulness asks: "Is the entire answer supported by the provided context?" It's a holistic, often binary or graded judgment.
Source Citation Recall asks: "For each specific factual claim within the answer, can you point to its exact source?" It requires traceability.

A high Source Citation Recall score is strong evidence for high Answer Faithfulness, as it provides an audit trail. However, it is possible for an answer to be faithful (all info is in the context) but have low citation recall if the model fails to explicitly link claims to sources.
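The faithful-but-uncited case can be shown with a toy calculation. This sketch assumes each claim has already been labeled with the sources that support it and the sources the answer cites for it; the dictionary keys and the sample claims are hypothetical. Faithfulness follows the common "1 - unsupported / total" form.

```python
def answer_faithfulness(claims):
    """1 - (unsupported claims / total claims); supported means some source backs the claim."""
    unsupported = sum(1 for c in claims if not c["supporting"])
    return 1 - unsupported / len(claims)


def source_citation_recall(claims):
    """Share of claims whose cited sources include at least one supporting source."""
    cited_correctly = sum(1 for c in claims if c["cited"] & c["supporting"])
    return cited_correctly / len(claims)


# Every claim is backed by the context, but only the first one is cited.
claims = [
    {"text": "The project launched in 2023.", "cited": {"Doc1"}, "supporting": {"Doc1"}},
    {"text": "It used a transformer architecture.", "cited": set(), "supporting": {"Doc1"}},
    {"text": "It was funded by two grants.", "cited": set(), "supporting": {"Doc2"}},
]

print(answer_faithfulness(claims))     # 1.0: nothing is unsupported
print(source_citation_recall(claims))  # 1 of 3 claims is cited
```

The gap between the two scores isolates an attribution problem in the generator rather than a grounding problem.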

Operationalization & Evaluation

Measuring Source Citation Recall requires structured evaluation, often involving:

Atomic Fact Decomposition: Breaking the generated answer into individual, verifiable propositions (e.g., "The project launched in 2023," "It used a transformer architecture").
Source Provenance Mapping: For each atomic fact, determining if it exists in the retrieved source documents and identifying the specific passage.
Automated Methods: Using Named Entity Recognition (NER) and Semantic Similarity (e.g., Sentence-BERT) to align answer claims with source sentences. Frameworks like RAGAS and TruLens implement automated approximations.
Human-in-the-Loop: For high-stakes applications, human annotators perform the final mapping to establish a ground-truth benchmark for automated scorers.
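The first two steps above can be sketched with the standard library alone. This is a deliberately naive illustration under stated assumptions: sentence splitting stands in for true atomic fact decomposition, citations are assumed to use the `[DocN]` marker format and to appear before the sentence-ending punctuation, and the `provenance` mapping is a hypothetical stand-in for human annotation.

```python
import re

CITATION = re.compile(r"\[(Doc\d+)\]")  # assumed marker format, e.g. [Doc1]


def extract_claims(answer):
    """Split an answer into (claim_text, cited_sources) pairs, one per sentence."""
    claims = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = set(CITATION.findall(sentence))
        text = CITATION.sub("", sentence)            # drop the markers
        text = re.sub(r"\s+([.!?,;])", r"\1", text)  # remove space left before punctuation
        text = re.sub(r"\s+", " ", text).strip()
        claims.append((text, cited))
    return claims


def citation_recall(claims, provenance):
    """provenance maps claim text to the set of sources that actually support it."""
    if not claims:
        return 0.0
    correct = sum(1 for text, cited in claims if cited & provenance.get(text, set()))
    return correct / len(claims)


answer = (
    "The project launched in 2023 [Doc1]. "
    "It used a transformer architecture. "
    "It was funded by two grants [Doc2][Doc3]."
)
provenance = {  # hypothetical annotator output
    "The project launched in 2023.": {"Doc1"},
    "It used a transformer architecture.": {"Doc2"},
    "It was funded by two grants.": {"Doc3"},
}

claims = extract_claims(answer)
print(claims)
print(citation_recall(claims, provenance))  # 2 of 3 claims are correctly cited
```

In practice the sentence splitter would be replaced by an LLM or parser that yields one proposition per claim, since a single sentence often carries several facts.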

Impact on Hallucination Mitigation

High Source Citation Recall is a primary engineering defense against model hallucination in RAG systems. It enforces a discipline of provenance.

Attribution Pressure: Designing the system to require citations for key claims forces the generator to ground its output.
Debugging Aid: When a hallucination occurs, the missing or incorrect citation on that fact immediately flags it as unsupported, speeding up root-cause analysis (e.g., was the fact missing from retrieval, or did the generator ignore the source?).
User Trust: Answers with clear, comprehensive citations allow users to verify information, increasing confidence in the system's outputs, especially in domains like legal, medical, and financial analysis.

Common Failure Modes & Challenges

Achieving high Source Citation Recall presents specific engineering challenges:

Fact Fusion: The generator synthesizes information from multiple sources into a single sentence. Should this be one citation or several? Recall scoring must handle multi-source facts.
Paraphrasing & Summarization: The answer paraphrases a source. Automated metrics must recognize semantic equivalence beyond lexical overlap.
Implicit Knowledge: The generator adds universally true or common-sense facts not explicitly in the sources (e.g., "Paris is the capital of France"). Evaluators must decide if this requires a citation.
Noise in Retrieval: If retrieved documents contain conflicting or subtly incorrect data, the generator may correctly cite a source that is itself wrong. Citation recall and precision both score this as a success, since the attribution is accurate, so citation metrics need to be paired with checks on source quality.
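For the Fact Fusion case, the scoring policy has to be chosen explicitly. The sketch below shows three possible policies for a single fused fact whose support spans several sources; the policy names and the function itself are hypothetical, intended only to make the trade-off visible.

```python
def fused_fact_credit(cited, required, policy="all"):
    """Credit for one fused fact.

    cited:    sources the answer cites for the fact
    required: sources that each contribute part of the fact (ground truth)
    policy:   "all"     -> credit only if every required source is cited
              "any"     -> credit if at least one required source is cited
              "partial" -> fraction of required sources that are cited
    """
    if not required:
        return 0.0
    if policy == "all":
        return 1.0 if required <= cited else 0.0
    if policy == "any":
        return 1.0 if required & cited else 0.0
    if policy == "partial":
        return len(required & cited) / len(required)
    raise ValueError(f"unknown policy: {policy}")


# A sentence that merges information from Doc1 and Doc2 but cites only Doc1.
cited, required = {"Doc1"}, {"Doc1", "Doc2"}
for policy in ("all", "any", "partial"):
    print(policy, fused_fact_credit(cited, required, policy))
# all 0.0, any 1.0, partial 0.5
```

Whichever policy is used, it should be fixed before evaluation and reported alongside the score, since the same answer can score very differently under each.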

RAG EVALUATION METRICS COMPARISON

Source Citation Recall vs. Related Metrics

A comparison of Source Citation Recall with other key metrics used to evaluate the attribution, factual grounding, and overall quality of Retrieval-Augmented Generation (RAG) system outputs.

Metric	Source Citation Recall	Source Citation Precision	Answer Faithfulness	Grounding Score
Primary Focus	Completeness of attribution for all facts in the answer	Accuracy of the citations that are provided	Factual consistency of the answer with the source context	Overall strength of evidential support for the answer
Core Question	What proportion of the answer's factual statements are cited?	What proportion of the provided citations are correct?	Does the answer contain any unsupported or contradictory claims?	How well is the answer substantiated by the source materials?
Measurement Target	Statements/Facts in the generated answer	Citation tokens/pointers in the generated answer	Entire generated answer	Relationship between answer and source context
Relationship to Hallucination	Directly measures omission of citations, a prerequisite for detecting source-less hallucinations	Measures citation errors, which can mask or create hallucinations	Directly quantifies the rate of unsupported claims (hallucinations)	A composite often incorporating faithfulness; low score indicates high hallucination risk
Typical Calculation	(Number of answer facts with correct citation) / (Total number of answer facts)	(Number of correct citations) / (Total number of citations provided)	1 - (Number of unsupported or contradictory claims / Total claims)	Aggregate of semantic similarity and citation-based checks between answer sentences and source passages
Key Weakness	Does not penalize spurious citations; a correctly cited fact can also carry extra, wrong citations	Does not penalize missing citations; an answer can be largely uncited yet score as precise	Does not measure relevance of the answer to the original query	Can be a black-box aggregate, making root-cause analysis difficult
Use Case in Debugging	Identifies answers that are poorly sourced or where the model is 'making up' facts without attribution.	Identifies errors in the citation mechanism or retrieval of incorrect passages.	Flags answers that contradict or invent information relative to the provided context.	Provides a single score for overall grounding quality, useful for high-level monitoring.
Dependency on Inputs	Requires annotated ground truth of which answer facts map to which source documents.	Requires annotated ground truth of correct source-document mappings for citations.	Can be evaluated reference-free using NLI models between answer and context.	Often uses a combination of model-based metrics (NLI, similarity) and rule-based checks.

RAG EVALUATION METRICS

Frameworks and Tools for Measuring Source Citation Recall

Measuring Source Citation Recall requires specialized frameworks that automate the comparison of generated text against source documents. These tools provide the quantitative rigor needed for Evaluation-Driven Development.

RAGAS (Retrieval-Augmented Generation Assessment)

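RAGAS is an open-source framework for evaluating RAG pipelines, with several metrics that rely on an LLM judge rather than human annotation. Its faithfulness and context recall metrics are related to Source Citation Recall but do not measure it directly: faithfulness checks whether the claims in the generated answer are supported by the retrieved context, and context recall checks whether the claims in a reference answer can be attributed to the retrieved context. Neither inspects the citation markers in the answer.

Key Metric: context_recall measures how much of a reference (ground truth) answer is covered by the retrieved context.
Methodology: Uses LLMs as judges to compare answer statements to source content.
Advantage: Metrics such as faithfulness need no human-written reference answer, enabling scalable evaluation; context_recall does need a reference.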

TruLens

TruLens is an observability and evaluation library that provides a suite of LLM-based feedback functions. Its Groundedness metric evaluates whether an answer is supported by its provided context, which correlates with Source Citation Recall.

Implementation: Uses a separate LLM to judge the attribution of each claim in the answer to specific source passages.
Framework: Integrates with LangChain and LlamaIndex for tracing and evaluation.
Output: Provides a fine-grained score and highlights which statements are well-grounded or ungrounded.
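The per-claim judging pattern described above can be sketched independently of any library. This is not the TruLens API: `groundedness_report`, the `judge` callable, and `keyword_stub_judge` are all hypothetical names. The stub judge is a crude keyword check that stands in for an LLM call so the example runs offline.

```python
import re


def groundedness_report(claims, passages, judge):
    """Ask `judge(claim, passage_text)` about every claim/passage pair.

    Returns (score, report), where score is the share of grounded claims and
    report lists, per claim, the passages the judge accepted as support.
    """
    report = []
    for claim in claims:
        supporting = [pid for pid, text in passages.items() if judge(claim, text)]
        report.append({"claim": claim, "grounded": bool(supporting), "passages": supporting})
    score = sum(r["grounded"] for r in report) / len(report) if report else 0.0
    return score, report


def keyword_stub_judge(claim, passage):
    """Placeholder for an LLM judge: every claim word longer than three
    characters must appear as a substring of the passage."""
    words = [w for w in re.findall(r"[a-z0-9]+", claim.lower()) if len(w) > 3]
    return bool(words) and all(w in passage.lower() for w in words)


passages = {
    "Doc1": "The project launched in 2023 in Berlin.",
    "Doc2": "The system used a transformer architecture.",
}
claims = [
    "The project launched in 2023.",
    "It used a transformer architecture.",
    "It won an award.",
]

score, report = groundedness_report(claims, passages, keyword_stub_judge)
print(score)  # 2 of 3 claims are grounded
for row in report:
    print(row)
```

Swapping the stub for a real LLM call changes only the `judge` argument; the report structure that highlights ungrounded statements stays the same.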

ARES (Automatic RAG Evaluation System)

ARES is a framework that uses synthetic data and LLM judges to evaluate RAG components with minimal human labeling. It assesses retrieval quality and answer quality, where the latter includes faithfulness metrics akin to Source Citation Recall.

Process: Generates synthetic Q&A pairs from a corpus, then uses LLMs to perturb questions and answers to create test suites.
Judges: Employs fine-tuned judge models (like T5) for cost-effective, high-throughput scoring.
Benefit: Reduces the need for expensive human-annotated test sets for reliable evaluation.

LlamaIndex Evaluation Modules

LlamaIndex provides built-in evaluation modules for faithfulness and relevancy. The FaithfulnessEvaluator approximates the grounding side of Source Citation Recall by checking whether the generated answer is supported by the retrieved source nodes; it does not check explicit citation markers.

Mechanism: The evaluator prompts an LLM to judge whether the information in the answer is supported by the source contexts.
Integration: Seamlessly works with LlamaIndex query engines and pipelines.
Result: Returns an evaluation result for the response with a pass/fail judgment and a score.

LangChain Evaluation

LangChain's langchain.evaluation package provides criteria-based evaluators for chain outputs. Using CriteriaEvalChain, you can define custom criteria, such as groundedness or citation accuracy.

String Evaluators: The LabeledCriteriaEvalChain can be configured with criteria such as "Is the response grounded in the provided context?"
QA Evaluation: Paired with QAEvalChain, it can assess answer correctness against ground truth; this does not itself verify citations, so it complements rather than replaces a citation check.
Flexibility: Allows integration of multiple LLM judges (GPT-4, Claude) for scoring.

Custom Pipeline with Embedding Similarity

A custom evaluation pipeline can be built using sentence embeddings to compute recall. This method involves breaking the answer and the source documents it cites into claims and measuring semantic overlap.

Steps:
1. Decompose the generated answer and the source documents it cites into atomic factual statements.
2. Generate embeddings for each statement using a model like Sentence-BERT.
3. For each answer statement, check whether the source it cites contains a semantically similar statement (cosine similarity above a threshold).
Metric: Source Citation Recall = (Matched Answer Statements) / (Total Answer Statements).
Advantage: Provides a repeatable score that avoids judge-LLM variability, although results still depend on the chosen embedding model and threshold.
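The matching step can be sketched without external dependencies. The `embed` function below is a toy bag-of-words vector used only as a stand-in for a Sentence-BERT embedding, and the 0.6 threshold, function names, and sample statements are assumptions. The sketch scores each answer statement against the source statements, which keeps the denominator aligned with the formula in Core Definition & Formula; the two arguments of `matched_fraction` can be swapped if coverage of the source is what you want to measure.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def matched_fraction(queries, candidates, threshold=0.6):
    """Share of query statements with at least one candidate above the threshold."""
    if not queries:
        return 0.0
    candidate_vecs = [embed(c) for c in candidates]
    matched = 0
    for q in queries:
        q_vec = embed(q)
        if any(cosine(q_vec, c_vec) >= threshold for c_vec in candidate_vecs):
            matched += 1
    return matched / len(queries)


answer_statements = [
    "The project launched in 2023.",
    "It used a transformer architecture.",
    "The budget was ten million dollars.",
]
source_statements = [
    "The project launched in 2023 after two years of planning.",
    "The model used a transformer architecture.",
    "Funding details were not disclosed.",
]

print(matched_fraction(answer_statements, source_statements))  # 2 of 3 statements match
```

With real embeddings the threshold should be tuned on a small human-labeled set, because a value that is too low lets loosely related sentences count as support.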

SOURCE CITATION RECALL

Frequently Asked Questions

Source Citation Recall is a critical metric for evaluating the attribution integrity of Retrieval-Augmented Generation (RAG) systems. These questions address its definition, calculation, and role in ensuring verifiable AI outputs.

Source Citation Recall is a metric that measures the proportion of source statements or facts used in a generated answer that are correctly attributed to their originating documents. It quantifies an AI system's ability to provide traceable provenance for its claims, which is foundational for trust and auditability in enterprise RAG applications.

Unlike Answer Faithfulness, which checks if an answer is consistent with the source context, Source Citation Recall specifically evaluates the attribution mechanism itself. A high score indicates that the system's citations are comprehensive and accurately map generated content back to the specific passages that support it, minimizing unattributed synthesis.

RAG EVALUATION METRICS

Related Terms

Source Citation Recall is one component of a comprehensive evaluation suite for Retrieval-Augmented Generation systems. These related metrics measure different facets of retrieval quality, answer fidelity, and overall system performance.

Source Citation Precision

The counterpart to recall, Source Citation Precision measures the accuracy of the citations themselves. It calculates the proportion of citations in a generated answer that correctly and accurately reference the source of the stated information.

High precision, low recall: The answer cites few sources, but those citations are correct. Information may be missing citations.
Balanced with recall: A system with high scores in both metrics produces answers where most information is cited (high recall) and those citations are accurate (high precision).

Answer Faithfulness

Answer Faithfulness (or Factual Consistency) measures the extent to which a generated answer is factually consistent with and supported by the provided source context. It is a prerequisite for accurate citation.

Core Question: Does the answer contain any statements that cannot be inferred from the provided sources?
Relationship to Citation: A faithful answer can be fully supported by citations, making high citation recall possible. An unfaithful answer contains hallucinations, which by definition cannot be correctly cited.

Grounding Score

A Grounding Score evaluates the degree to which a model's generated output is substantiated by specific, attributable information from its provided source materials. It is a broader concept that often encompasses both faithfulness and citation quality.

Implementation: Can be measured by cross-referencing generated claims with source text, often using Natural Language Inference (NLI) models or entailment classifiers.
Key Difference: While Citation Recall/Precision check for explicit citation markers, grounding scores evaluate the implicit factual support, which is crucial for evaluating answers where citations are not formatted.

Context Relevance

Context Relevance assesses the degree to which the text passages retrieved and provided to the LLM are pertinent and useful for answering the specific query. It is an upstream metric that directly impacts citation quality.

Cause and Effect: If retrieved context is irrelevant (low context relevance), the LLM cannot generate a well-supported answer, leading to low citation recall and potential hallucinations.
Measurement: Often scored by having an LLM judge, on a scale, whether a retrieved passage contains necessary information to answer the query.

Retrieval Recall

In the retrieval stage, Retrieval Recall measures the proportion of all relevant documents in a corpus that are successfully retrieved for a given query. This is a foundational metric for the entire RAG pipeline.

Direct Impact: Low retrieval recall means relevant source documents never reach the LLM, placing a hard ceiling on the potential Source Citation Recall. The LLM cannot cite what it never sees.
Optimization Target: Improving retrieval recall (via better embedding models, hybrid search, or chunking strategies) is often the first step to improving end-to-end citation performance.
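The ceiling effect can be illustrated with a small calculation. This sketch assumes the set of relevant documents and the supporting sources for each needed fact are known in advance; the function names and document IDs are hypothetical.

```python
def retrieval_recall(retrieved, relevant):
    """Relevant documents retrieved divided by all relevant documents."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def citation_recall_ceiling(fact_sources, retrieved):
    """Upper bound on citation recall: share of facts with at least one
    supporting source among the retrieved documents."""
    if not fact_sources:
        return 0.0
    retrieved = set(retrieved)
    citable = sum(1 for sources in fact_sources if sources & retrieved)
    return citable / len(fact_sources)


relevant = {"Doc1", "Doc2", "Doc3", "Doc4"}
retrieved = ["Doc1", "Doc3", "Doc9"]

# Supporting sources for each fact a complete answer would need.
fact_sources = [{"Doc1"}, {"Doc2"}, {"Doc3", "Doc4"}, {"Doc4"}]

print(retrieval_recall(retrieved, relevant))             # 2 of 4 relevant docs: 0.5
print(citation_recall_ceiling(fact_sources, retrieved))  # 2 of 4 facts are citable: 0.5
```

Even a generator that cites perfectly cannot exceed this ceiling, which is why retrieval fixes usually come before prompt or citation-format changes.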

RAGAS Framework

RAGAS (Retrieval-Augmented Generation Assessment) is an open-source framework for largely reference-free evaluation of RAG pipelines. It provides automated metrics that correlate with the concepts of citation and grounding.

Faithfulness Metric: Directly measures factual alignment between answer and context.
Answer Relevance Metric: Evaluates if the answer addresses the query.
Context Precision/Recall Metrics: Assess the quality of the retrieved context itself; context recall is scored against a reference answer.
Utility: Allows for quantitative, scalable evaluation of RAG systems, with most metrics not requiring manually written ground-truth answers, enabling rapid iteration.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
