Glossary

BERTScore

BERTScore is an automatic evaluation metric for text generation that computes similarity scores between candidate and reference sentences using contextual embeddings from models like BERT.

RAG EVALUATION METRICS

What is BERTScore?

BERTScore is an automatic evaluation metric for text generation that computes a similarity score between a candidate text and one or more reference texts using contextual embeddings from a pre-trained model like BERT. Unlike traditional metrics such as BLEU or ROUGE that rely on exact lexical overlap, BERTScore leverages the semantic understanding encoded in transformer model embeddings to assess meaning-based similarity. It calculates precision, recall, and an F1 score by greedily matching each token in one text to the most semantically similar token in the other, using cosine similarity on their contextual embeddings.

For Retrieval-Augmented Generation (RAG) evaluation, BERTScore is often used as a proxy signal for answer relevance and grounding by comparing a model's generated output to a ground truth answer or to the provided source context. Its reliance on semantic similarity makes it robust to paraphrasing and varied wording, providing a more nuanced assessment than n-gram matching. However, it is computationally more expensive, and its scores depend on the underlying embedding model, so the same model and layer configuration should be used for every system being compared.

CONTEXTUAL EVALUATION

Key Features of BERTScore

Unlike traditional n-gram metrics, BERTScore captures semantic similarity beyond surface-level token overlap.

Contextual Embedding Similarity

BERTScore leverages contextual embeddings from pre-trained transformer models like BERT, RoBERTa, or XLNet. Instead of matching exact words, it computes the cosine similarity between the embedding vectors of each token in the candidate text and each token in the reference text. This allows it to match synonyms and paraphrases (e.g., 'car' and 'automobile') that n-gram metrics like BLEU or ROUGE would miss, providing a more nuanced measure of semantic equivalence.
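The pairwise comparison described above can be sketched in a few lines. This is a minimal illustration, assuming tiny hand-written 3-dimensional vectors in place of real transformer embeddings; the helper names `cosine` and `similarity_matrix` and the token labels in the comments are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(cand_emb, ref_emb):
    """Rows are candidate tokens, columns are reference tokens."""
    return [[cosine(c, r) for r in ref_emb] for c in cand_emb]

# Hypothetical toy "embeddings"; a real model produces hundreds of dimensions.
cand_emb = [[1.0, 0.0, 0.0],   # e.g. "car"
            [0.0, 1.0, 0.0]]   # e.g. "stopped"
ref_emb = [[0.9, 0.1, 0.0],    # e.g. "automobile"
           [0.0, 0.8, 0.6],    # e.g. "halted"
           [0.0, 0.0, 1.0]]    # e.g. "suddenly"

sim = similarity_matrix(cand_emb, ref_emb)
for row in sim:
    print([round(value, 3) for value in row])
```

In this toy setup the first candidate token is nearly parallel to the first reference token, which is the kind of match an exact-string comparison would not find.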

Precision, Recall, and F1 Calculation

The metric decomposes into three core components:

BERTScore Precision: Measures how much of the candidate's content is reflected in the reference. It is computed by matching each token in the candidate to the most similar token in the reference.
BERTScore Recall: Measures how much of the reference's content is captured by the candidate. It is computed by matching each token in the reference to the most similar token in the candidate.
BERTScore F1: The harmonic mean of precision and recall, providing a single, balanced score. This triad allows developers to diagnose whether a generated text is overly verbose (low precision) or incomplete (low recall).
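Written out, the three components take the following form. The notation is an assumption for this glossary entry: $x$ is the tokenized reference, $\hat{x}$ is the tokenized candidate, and $\mathbf{x}_i$, $\hat{\mathbf{x}}_j$ are their contextual embeddings, normalized to unit length so that the inner product equals cosine similarity.

```latex
R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j
\qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j
\qquad
F_{\mathrm{BERT}} = 2 \, \frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
```

Recall averages over reference tokens and precision averages over candidate tokens, which is why a padded candidate lowers precision while a truncated one lowers recall.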

Importance Weighting (IDF)

BERTScore can apply inverse document frequency (IDF) weighting to the token similarity scores. This emphasizes rare, content-bearing words (e.g., 'eclipse', 'quantum') over common function words (e.g., 'the', 'is'). In the original formulation, the IDF statistics are computed from the reference sentences of the evaluation set, so no external corpus is required. When enabled, this weighting makes the final score more sensitive to the accurate generation of key factual terms, which matters when evaluating technical summaries or RAG answers.
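The effect of the weighting can be seen in a small sketch. It assumes document frequencies are counted over a handful of tokenized reference sentences with plus-one smoothing; the helper names, the sentences, and the best-match similarity values are all hypothetical.

```python
import math
from collections import Counter

def idf_weights(reference_token_lists):
    """Smoothed IDF per token, counted over tokenized reference sentences."""
    num_refs = len(reference_token_lists)
    doc_freq = Counter()
    for tokens in reference_token_lists:
        doc_freq.update(set(tokens))  # count each token once per sentence
    return {tok: math.log((num_refs + 1) / (df + 1)) for tok, df in doc_freq.items()}

def weighted_average(tokens, best_similarities, idf, default_weight):
    """IDF-weighted mean of per-token best-match similarities."""
    weights = [idf.get(tok, default_weight) for tok in tokens]
    total = sum(weights)
    if total == 0:  # every token is maximally common: fall back to a plain mean
        return sum(best_similarities) / len(best_similarities)
    return sum(w * s for w, s in zip(weights, best_similarities)) / total

references = [
    ["the", "eclipse", "is", "total"],
    ["the", "orbit", "is", "stable"],
    ["the", "moon", "is", "bright"],
]
idf = idf_weights(references)
default_weight = math.log(len(references) + 1)  # weight for tokens absent from the references

tokens = ["the", "eclipse", "is", "total"]
best_similarities = [1.0, 0.5, 1.0, 0.5]  # hypothetical best-match scores per token

plain = sum(best_similarities) / len(best_similarities)
weighted = weighted_average(tokens, best_similarities, idf, default_weight)
print(plain, weighted)
```

Here 'the' and 'is' occur in every reference and receive zero weight, so the weighted score reflects only how well the content words were matched.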

Model and Layer Selection

The score is not monolithic; its behavior depends on the chosen underlying model and the specific transformer layer from which embeddings are extracted.

Model Choice: Using roberta-large typically yields more robust results than bert-base-uncased due to its training methodology and larger size.
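Layer Choice: Similarities computed from intermediate layers often align better with human judgment than those from the final output layer, whose representations are more specialized to the pre-training objective. The official implementation ships a tuned default layer for each supported model (for example, layer 9 of 12 for bert-base-uncased and layer 17 of 24 for roberta-large). This configurability allows engineers to tailor the metric to their specific domain.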
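As a rough sketch of how these choices are passed in practice, the snippet below targets the open-source `bert-score` Python package, whose `score` function accepts `model_type`, `num_layers`, and `idf` arguments. The helpers `bertscore_kwargs` and `run_bertscore` are hypothetical wrappers written for this example, and the package is imported lazily because it needs to be installed and downloads model weights on first use.

```python
def bertscore_kwargs(model_type="roberta-large", num_layers=None, use_idf=False):
    """Collect keyword arguments for bert_score.score.

    Leaving num_layers as None keeps the package's default layer for the model.
    """
    kwargs = {"model_type": model_type, "idf": use_idf}
    if num_layers is not None:
        kwargs["num_layers"] = num_layers
    return kwargs

def run_bertscore(candidates, references, **kwargs):
    """Return mean precision, recall, and F1 over the evaluation set."""
    # Lazy import: requires `pip install bert-score` and network access for weights.
    from bert_score import score
    precision, recall, f1 = score(candidates, references, **kwargs)
    return precision.mean().item(), recall.mean().item(), f1.mean().item()

# Example configuration: a smaller model with an explicitly chosen layer.
config = bertscore_kwargs(model_type="bert-base-uncased", num_layers=9)
print(config)

# Usage (not executed here because it downloads a model):
# p, r, f = run_bertscore(["the car stopped"], ["the automobile halted"], **config)
```

Recording the exact configuration alongside reported scores matters, since numbers produced with different models or layers are not comparable.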

Human Correlation and Robustness

BERTScore was designed to correlate more closely with human judgments than n-gram metrics. Published evaluations on machine translation and summarization benchmarks generally show it aligning better than BLEU and ROUGE with human ratings of fluency and adequacy. It is also more robust to synonyms and paraphrasing, reducing the penalty for valid linguistic variation. However, it is not perfect: it can be fooled by antonyms with similar contextual embeddings, and it may not fully capture high-level discourse structure.

Practical Computation and Baseline Rescaling

In practice, raw BERTScore values are often rescaled against an empirical baseline to improve interpretability. Because contextual embeddings of unrelated sentences still have fairly high cosine similarity, raw scores tend to cluster in a narrow band near the top of the range. The official implementation estimates a baseline b for each model and language by averaging the BERTScore of randomly paired, unrelated sentences, and then applies the linear rescaling (score - b) / (1 - b), so that an unrelated pair scores near 0 while an exact match still scores 1.0. Rescaling changes only the numeric range, not the ranking of systems. Computationally, BERTScore requires running inference through a transformer model for both candidate and reference texts, which is more expensive than n-gram counting but is easily batched for evaluation sets.
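A minimal sketch of a linear baseline rescaling follows. The baseline value 0.83 is hypothetical and chosen only for illustration; in a real setup it would be estimated for the specific model, layer, and language in use.

```python
def rescale_with_baseline(raw_score: float, baseline: float) -> float:
    """Map the baseline to 0 and a perfect score to 1 with a linear transform."""
    return (raw_score - baseline) / (1.0 - baseline)

baseline = 0.83  # hypothetical average score of unrelated sentence pairs

for raw in (0.83, 0.915, 1.0):
    print(raw, round(rescale_with_baseline(raw, baseline), 3))
```

A raw score halfway between the baseline and 1.0 lands at 0.5 after rescaling, which spreads the narrow raw band across a more readable range without reordering any scores.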

AUTOMATED TEXT EVALUATION

BERTScore vs. Traditional Metrics

A comparison of BERTScore's contextual embedding-based approach against traditional n-gram and token-matching metrics for evaluating generated text in tasks like summarization, translation, and RAG.

Metric / Feature	BERTScore	ROUGE / BLEU	Exact Match / F1
Underlying Mechanism	Contextual embeddings from models like BERT	N-gram (word sequence) overlap	Exact string or token set matching
Semantic Understanding	Yes	No	No
Handles Synonyms & Paraphrasing	Yes	No	No
Sensitivity to Word Order	Moderate (via attention)	High (exact sequence match)	None (F1) / Absolute (EM)
Reference Requirements	Single or multiple references	Typically multiple references	Single reference common
Output Granularity	Precision, Recall, F1 (token-level similarity)	Precision, Recall, F1 (n-gram counts)	Single score (EM) or Precision/Recall/F1
Common Use Cases	Text generation, summarization, RAG answer evaluation	Summarization (ROUGE), MT (BLEU)	Question Answering, Classification
Computational Cost	High (requires forward pass of BERT model)	Low (string operations)	Very low (string/token comparison)
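To make the right-hand column of the comparison concrete, here is a small sketch of Exact Match and token-level F1 applied to a paraphrase. The two sentences are hypothetical examples, and tokenization is plain lowercase whitespace splitting for simplicity.

```python
from collections import Counter

def exact_match(candidate: str, reference: str) -> bool:
    """True only when the normalized strings are identical."""
    return candidate.strip().lower() == reference.strip().lower()

def token_f1(candidate: str, reference: str) -> float:
    """Bag-of-tokens F1: word order is ignored, surface forms must match."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the car stopped quickly"
paraphrase = "the automobile halted fast"

print(exact_match(paraphrase, reference))  # False
print(token_f1(paraphrase, reference))     # 0.25
```

Only the word 'the' overlaps, so both string-based metrics score the paraphrase poorly even though its meaning is close to the reference. That gap is what an embedding-based metric is meant to close.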

APPLICATION DOMAINS

Where BERTScore is Used

BERTScore is a versatile metric for evaluating text generation quality. Its primary applications span domains where semantic similarity is more critical than exact word matching.

Machine Translation Evaluation

BERTScore is a robust alternative to BLEU and ROUGE for evaluating machine translation outputs. It excels where translations are semantically correct but use different synonyms or sentence structures than the reference. It correlates better with human judgment, especially for languages with rich morphology or flexible word order, by using contextual embeddings to capture meaning beyond n-gram overlap.

Text Summarization Assessment

In automatic text summarization, BERTScore measures how well a generated summary captures the key semantic content of the source document or reference summaries. It is more tolerant of paraphrasing than ROUGE, making it suitable for evaluating abstractive summarization models that rephrase content. It gives a signal of informativeness by comparing the semantic gist of the summary to the source, but high similarity does not guarantee factual consistency, so separate factual checks are still needed.

Dialogue Response Generation

For chatbots and conversational AI, BERTScore evaluates the appropriateness and relevance of generated responses against human references. It is used to benchmark models in tasks like the ConvAI2 challenge or for evaluating retrieval-augmented generation (RAG) outputs in customer service agents. The metric's ability to handle diverse, contextually appropriate paraphrases is critical in open-domain dialogue.

Data-to-Text and Code Generation

BERTScore is applied to evaluate systems that generate descriptive text from structured data (e.g., weather reports from tables) or natural language generation (NLG) from knowledge graphs. It is also used in code generation to estimate the similarity between generated and reference code snippets by embedding code as text, though this measures textual rather than functional similarity, and specialized metrics like CodeBLEU may be more precise for syntactic correctness.

RAG Pipeline Evaluation

Within Retrieval-Augmented Generation (RAG) systems, BERTScore can serve as one signal for answer relevance and grounding. It can compute the similarity between a generated answer and the retrieved source context as a rough indicator of grounding, or between the answer and a reference golden answer. Because semantic similarity is not the same as factual support, it is usually combined with dedicated faithfulness metrics, such as those provided by frameworks like RAGAS, as part of a holistic evaluation suite.

Model Fine-Tuning and Hyperparameter Search

During model development, BERTScore serves as an automatic evaluation metric for validation sets, guiding hyperparameter tuning and checkpoint selection. It provides a faster, automated proxy for human evaluation in iterative training cycles for text generation models like T5 or GPT-style architectures. Its high correlation with human judgment makes it a cost-effective quality signal during experimentation.
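Checkpoint selection on a validation set can be sketched as follows. The checkpoint names and per-example F1 values are hypothetical; in a real run they would come from scoring each checkpoint's generations against the validation references.

```python
from statistics import mean

# Hypothetical per-example BERTScore F1 values for three saved checkpoints.
val_f1_by_checkpoint = {
    "step-1000": [0.61, 0.58, 0.65],
    "step-2000": [0.70, 0.66, 0.71],
    "step-3000": [0.68, 0.69, 0.67],
}

def best_checkpoint(scores_by_checkpoint):
    """Return the checkpoint name with the highest mean validation F1."""
    return max(scores_by_checkpoint, key=lambda name: mean(scores_by_checkpoint[name]))

selected = best_checkpoint(val_f1_by_checkpoint)
print(selected)  # step-2000
```

Differences this small are often within noise, so it is worth confirming the selected checkpoint with a human spot check before relying on it.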

BERTSCORE

Frequently Asked Questions

BERTScore is an automatic evaluation metric for text generation that computes a similarity score between a candidate (generated) text and one or more reference texts using contextual embeddings from a pre-trained model like BERT. It works by:

Generating Embeddings: Feeding the candidate and reference sentences through a model like BERT to obtain contextual token embeddings.
Computing Pairwise Similarity: Calculating the cosine similarity between each token embedding in the candidate and each token embedding in the reference.
Greedy Matching: For each token in the candidate, finding the most similar token in the reference (and vice-versa) using a greedy matching algorithm based on the similarity matrix.
Averaging Scores: Computing precision (the average best-match similarity over candidate tokens), recall (the average best-match similarity over reference tokens), and the F1 score (their harmonic mean), which is usually reported as the final BERTScore. This process captures semantic similarity that n-gram overlap metrics like BLEU or ROUGE miss.
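The four steps can be put together in a short sketch. It assumes the embeddings have already been produced by a model and uses hypothetical 2-dimensional vectors; the function names are illustrative and IDF weighting is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bertscore_from_embeddings(cand_emb, ref_emb):
    """Greedy-match token embeddings and return (precision, recall, f1)."""
    # Pairwise similarity: rows are candidate tokens, columns are reference tokens.
    sim = [[cosine(c, r) for r in ref_emb] for c in cand_emb]
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(row) for row in sim) / len(cand_emb)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(max(row[j] for row in sim) for j in range(len(ref_emb))) / len(ref_emb)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical embeddings: the candidate covers two of the three reference tokens well.
cand_emb = [[1.0, 0.0], [0.0, 1.0]]
ref_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f = bertscore_from_embeddings(cand_emb, ref_emb)
print(round(p, 4), round(r, 4), round(f, 4))
```

Precision is perfect because every candidate token has an exact counterpart, while recall is lower because the third reference token is only partially matched.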

RAG EVALUATION METRICS

Related Terms

BERTScore is one of several metrics used to evaluate the quality of text generation. These related terms define other critical measures for assessing retrieval and generation components in RAG systems.

ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation. It operates by comparing overlapping units like n-grams, word sequences, and word pairs between a candidate text and one or more reference texts.

ROUGE-N: Measures n-gram overlap (e.g., ROUGE-1 for unigrams, ROUGE-2 for bigrams).
ROUGE-L: Uses the longest common subsequence to assess fluency and word order.
Key Difference from BERTScore: ROUGE is a surface-level lexical overlap metric, while BERTScore uses contextual embeddings to capture semantic similarity.
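For comparison, the lexical overlap that ROUGE relies on can be sketched directly. This is a simplified illustration with whitespace tokenization and a single reference, reporting recall only; the sentences are hypothetical.

```python
from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also appear in the candidate."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0

def lcs_length(a, b) -> int:
    """Length of the longest common subsequence, the basis of ROUGE-L."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"

print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
print(lcs_length(candidate.split(), reference.split()))  # 5
```

A single substituted word removes two of the five reference bigrams, which shows how quickly higher-order overlap drops under small wording changes.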

BLEU

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of machine-translated text. It calculates a modified n-gram precision score by comparing a candidate translation against one or more human reference translations.

Core Mechanism: Counts matching n-grams, but applies a brevity penalty to penalize overly short outputs.
Primary Use Case: The long-standing standard for machine translation benchmarking.
Key Difference from BERTScore: BLEU is based on exact token matching and is known to correlate poorly with human judgment for tasks requiring semantic understanding, unlike BERTScore's embedding-based approach.
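The two ingredients named above, clipped n-gram precision and the brevity penalty, can be sketched as follows. This is a simplified single-reference, sentence-level version using up to bigrams, not a replacement for a standard BLEU implementation; the sentences are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(cand, ref, n):
    """Clip each candidate n-gram count at its count in the reference."""
    cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def brevity_penalty(cand_len, ref_len):
    """Penalize candidates shorter than the reference."""
    return 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)

def sentence_bleu(cand, ref, max_n=2):
    """Geometric mean of modified precisions times the brevity penalty."""
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(cand), len(ref)) * math.exp(log_avg)

reference = "the cat sat on the mat".split()
degenerate = "the the the the".split()

print(modified_precision(degenerate, reference, 1))  # 0.5
print(sentence_bleu(reference, reference))           # 1.0
```

Clipping stops the degenerate candidate from earning full credit for repeating a common word, and the brevity penalty covers the opposite failure of outputs that are too short.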

Semantic Similarity

Semantic Similarity is a general concept quantifying the likeness in meaning between two texts. BERTScore is a specific, popular implementation of this concept.

Implementation Models: Often calculated using embeddings from sentence-transformers like Sentence-BERT (SBERT) or other pre-trained models.
Mechanism: Encodes sentences into dense vector embeddings and computes a similarity measure (e.g., cosine similarity) between them.
Relation to BERTScore: BERTScore refines this by computing a precision, recall, and F1 match using pairwise token embeddings from BERT, rather than a single sentence-level vector.

Answer Faithfulness

Answer Faithfulness is a metric that evaluates whether a generated answer is factually consistent with and entirely supported by the provided source context. It is critical for Retrieval-Augmented Generation (RAG) evaluation.

Core Question: "Does the answer contain any statements not substantiated by the source?"
Contrast with BERTScore: BERTScore measures semantic similarity to a reference. Answer Faithfulness measures factual grounding to a source, regardless of a reference answer. A text can be semantically similar (high BERTScore) but unfaithful if it introduces unsourced facts.

RAGAS

RAGAS (Retrieval-Augmented Generation Assessment) is an open-source framework for reference-free evaluation of RAG pipelines. It provides metrics that do not require human-written ground truth answers.

Key Metrics it Measures: Faithfulness, Answer Relevance, Context Precision, and Context Recall.
Relation to BERTScore: RAGAS offers a broader suite for holistic RAG evaluation. BERTScore could be used within a RAGAS-like framework to compare a generated answer to a reference if one is available, but RAGAS focuses on reference-less, aspect-based metrics.

METEOR

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is an automatic metric for machine translation evaluation that addresses known weaknesses in BLEU.

Core Improvements: It incorporates synonymy (via WordNet) and stemming to match words beyond exact forms, and includes a penalty for poor word order.
Alignment: Combines unigram precision and recall in a recall-weighted harmonic mean, then applies a chunk-based fragmentation penalty.
Comparison to BERTScore: Like BERTScore, METEOR moves beyond exact n-gram matching. However, METEOR uses linguistic modules (synonym lists, stems), while BERTScore uses contextual embeddings from a deep neural network to understand semantic similarity.
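The scoring step can be sketched with the constants from the original METEOR formulation (recall weighted nine to one, and a penalty of 0.5 times the cubed chunk ratio). Later versions tune these parameters, and the alignment itself (exact, stem, and synonym matching) is assumed to have been done already; the function name and inputs are hypothetical.

```python
def meteor_2005(matches: int, cand_len: int, ref_len: int, chunks: int) -> float:
    """Score from alignment statistics using the original METEOR constants."""
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    # Recall-weighted harmonic mean of unigram precision and recall.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fewer, longer contiguous chunks mean better word order and a smaller penalty.
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# All six words matched in one contiguous chunk versus six scattered chunks.
in_order = meteor_2005(matches=6, cand_len=6, ref_len=6, chunks=1)
scrambled = meteor_2005(matches=6, cand_len=6, ref_len=6, chunks=6)
print(round(in_order, 4), round(scrambled, 4))
```

With identical unigram matches, the scrambled candidate loses half its score to the fragmentation penalty, which is how METEOR accounts for word order.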

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
