Inferensys

BERTScore

BERTScore is an automatic evaluation metric for text generation that computes similarity scores between candidate and reference sentences using contextual embeddings from models like BERT.
RAG EVALUATION METRICS

What is BERTScore?

BERTScore is an automatic evaluation metric for text generation that computes a similarity score between a candidate text and one or more reference texts using contextual embeddings from a pre-trained model like BERT. Unlike traditional metrics such as BLEU or ROUGE that rely on exact lexical overlap, BERTScore leverages the semantic understanding encoded in transformer model embeddings to assess meaning-based similarity. It calculates precision, recall, and an F1 score by greedy matching on the cosine similarity of contextual embeddings: each candidate token is matched to its most similar reference token for precision, and each reference token to its most similar candidate token for recall.

For Retrieval-Augmented Generation (RAG) evaluation, BERTScore is often used as a proxy for answer faithfulness and answer relevance by comparing a model's generated output to a ground truth answer or to the provided source context. Its reliance on semantic similarity makes it robust to paraphrasing and varied wording, providing a more nuanced assessment than n-gram matching. However, it is computationally more expensive, and its scores depend on the underlying embedding model, so the model and configuration must be held fixed for scores to be comparable across systems.

CONTEXTUAL EVALUATION

Key Features of BERTScore

Unlike traditional metrics, BERTScore captures semantic similarity beyond surface-level token overlap.

01

Contextual Embedding Similarity

BERTScore leverages contextual embeddings from pre-trained transformer models like BERT, RoBERTa, or XLNet. Instead of matching exact words, it computes the cosine similarity between the embedding vectors of each token in the candidate text and each token in the reference text. This allows it to match synonyms and paraphrases (e.g., 'car' and 'automobile') that n-gram metrics like BLEU or ROUGE would miss, providing a more nuanced measure of semantic equivalence.
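
As a minimal sketch of the pairwise comparison described above, the Python below builds a cosine similarity matrix between two sets of token vectors. The three-dimensional vectors and the words they stand in for are made-up assumptions for illustration; real contextual embeddings come from a transformer and have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(cand_emb, ref_emb):
    """Rows index candidate tokens, columns index reference tokens."""
    return [[cosine(c, r) for r in ref_emb] for c in cand_emb]

# Made-up 3-d vectors standing in for contextual token embeddings.
cand_emb = [[1.0, 0.0, 0.0],   # stands in for "car"
            [0.0, 1.0, 0.0]]   # stands in for "fast"
ref_emb = [[0.9, 0.1, 0.0],    # stands in for "automobile"
           [0.0, 0.0, 1.0]]    # stands in for "red"

sim = similarity_matrix(cand_emb, ref_emb)
for row in sim:
    print([round(x, 3) for x in row])
```

In this toy setup the "car" row scores highest against the "automobile" column, which is the kind of synonym match that exact n-gram overlap cannot register.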

02

Precision, Recall, and F1 Calculation

The metric decomposes into three core components:

  • BERTScore Precision: Measures how much of the candidate's content is reflected in the reference. It is computed by matching each token in the candidate to the most similar token in the reference.
  • BERTScore Recall: Measures how much of the reference's content is captured by the candidate. It is computed by matching each token in the reference to the most similar token in the candidate.
  • BERTScore F1: The harmonic mean of precision and recall, providing a single, balanced score. This triad allows developers to diagnose whether a generated text is overly verbose (low precision) or incomplete (low recall).
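
The three components above can be sketched from a similarity matrix alone. This is an illustrative simplification, assuming unweighted averaging and a hypothetical hand-written matrix (rows are candidate tokens, columns are reference tokens) rather than real model output.

```python
def bertscore_from_similarity(sim):
    """Greedy-match P, R, F1 from a candidate-by-reference similarity matrix."""
    # Precision: each candidate token (row) takes its best reference match.
    precision = sum(max(row) for row in sim) / len(sim)
    # Recall: each reference token (column) takes its best candidate match.
    columns = list(zip(*sim))
    recall = sum(max(col) for col in columns) / len(columns)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical similarities: 3 candidate tokens (rows) x 2 reference tokens (columns).
sim = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.3, 0.3],  # an extra candidate token with no good match
]
p, r, f1 = bertscore_from_similarity(sim)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

The unmatched third candidate token pulls precision down to about 0.667 while recall stays at 0.85, which is the "overly verbose" pattern described above.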

03

Importance Weighting (IDF)

BERTScore can apply inverse document frequency (IDF) weighting to the token similarity scores. This emphasizes the importance of rare, content-bearing words (e.g., 'eclipse', 'quantum') over common function words (e.g., 'the', 'is'). In the original formulation, the IDF statistics are computed from the reference sentences of the evaluation corpus, so no external corpus is required. When enabled, this feature makes the final score more sensitive to the accurate generation of key factual terms, which matters when evaluating technical summaries or answer faithfulness in RAG systems.
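
A rough sketch of how IDF weighting changes an average is shown below. It assumes whitespace-style tokens, a plus-one smoothed IDF, and hypothetical best-match similarities; the helper names (`idf_weights`, `weighted_average`) are illustrative and not part of any library.

```python
import math
from collections import Counter

def idf_weights(reference_corpus):
    """IDF per token from tokenized reference sentences, with plus-one smoothing."""
    n_docs = len(reference_corpus)
    doc_freq = Counter()
    for tokens in reference_corpus:
        doc_freq.update(set(tokens))  # count each token once per sentence
    return {tok: math.log((n_docs + 1) / (df + 1)) for tok, df in doc_freq.items()}

def weighted_average(best_matches, tokens, idf, default_idf):
    """IDF-weighted mean of each token's best-match similarity.

    Assumes at least one token has a nonzero weight.
    """
    weights = [idf.get(tok, default_idf) for tok in tokens]
    return sum(w * s for w, s in zip(weights, best_matches)) / sum(weights)

refs = [["the", "eclipse", "is", "total"],
        ["the", "moon", "is", "bright"],
        ["the", "sky", "is", "dark"]]
idf = idf_weights(refs)
default_idf = math.log(len(refs) + 1)  # weight for tokens never seen in refs

cand_tokens = ["the", "eclipse"]
best_sims = [0.95, 0.40]  # hypothetical best-match similarity per candidate token
plain = sum(best_sims) / len(best_sims)
weighted = weighted_average(best_sims, cand_tokens, idf, default_idf)
print(round(plain, 3), round(weighted, 3))
```

Here 'the' appears in every reference sentence and receives zero weight, so the weighted score (0.4) reflects only the poorly matched content word 'eclipse', while the unweighted mean is 0.675.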

04

Model and Layer Selection

The score is not monolithic; its behavior depends on the chosen underlying model and the specific transformer layer from which embeddings are extracted.

  • Model Choice: Using roberta-large typically yields more robust results than bert-base-uncased due to its training methodology and larger size.
  • Layer Choice: Similarities computed from intermediate layers often align better with human judgment than those from the final output layer, plausibly because the last layers are more specialized to the pre-training objective. The reference implementation, for example, defaults to layer 17 of the 24-layer roberta-large. This configurability allows engineers to tailor the metric to their specific domain.

05

Human Correlation and Robustness

BERTScore was designed to correlate more closely with human judgments than n-gram metrics. Studies on machine translation and summarization benchmarks generally show it aligning better than BLEU and ROUGE with human ratings of fluency and adequacy. It is also more robust to synonyms and paraphrasing, reducing the penalty for valid linguistic variation. However, it is not perfect: it can be fooled by antonyms, which often have similar contextual embeddings, and it may not fully capture high-level discourse structure.

06

Practical Computation and Baseline Rescaling

In practice, raw BERTScore values are often rescaled against a baseline to improve interpretability. Because contextual embeddings of even unrelated sentences have fairly high cosine similarity, raw scores cluster in a narrow band near the top of the range. The official implementation estimates a baseline b as the average score of randomly paired, unrelated sentences and applies the linear rescaling (score - b) / (1 - b), so that unrelated text scores near 0 while an exact copy still scores 1.0. Rescaling spreads scores over a more readable range without changing how systems rank; the baseline itself is specific to the model and layer. Computationally, BERTScore requires running inference through a transformer model for both candidate and reference texts, which is more expensive than n-gram counting but is easily batched for evaluation sets.
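
The sketch below illustrates the rescaling arithmetic and one way to call the open-source `bert-score` Python package. The baseline value 0.83 is a made-up assumption, and `score_with_library` is a hypothetical wrapper that is defined but not called here, since the package downloads model weights on first use.

```python
def rescale_with_baseline(raw_score, baseline):
    """Linearly rescale so the baseline maps to 0 and a perfect score stays at 1."""
    return (raw_score - baseline) / (1.0 - baseline)

# Hypothetical baseline value; real baselines are specific to a model and layer.
baseline = 0.83
for raw in (0.83, 0.90, 1.00):
    print(raw, "->", round(rescale_with_baseline(raw, baseline), 3))

def score_with_library(candidates, references):
    """Sketch of calling the bert-score package (pip install bert-score).

    The import is deferred so this file runs without the package installed.
    """
    from bert_score import score
    # model_type and num_layers are optional arguments; lang="en" picks defaults.
    precision, recall, f1 = score(
        candidates,
        references,
        lang="en",
        rescale_with_baseline=True,
    )
    # Each result is a tensor with one value per candidate-reference pair.
    return precision.mean().item(), recall.mean().item(), f1.mean().item()
```

With this assumed baseline, a raw score of 0.90 maps to roughly 0.41, which makes the gap between a mediocre and a near-perfect output easier to read.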

AUTOMATED TEXT EVALUATION

BERTScore vs. Traditional Metrics

A comparison of BERTScore's contextual embedding-based approach against traditional n-gram and token-matching metrics for evaluating generated text in tasks like summarization, translation, and RAG.

| Metric / Feature | BERTScore | ROUGE / BLEU | Exact Match / F1 |
| --- | --- | --- | --- |
| Underlying Mechanism | Contextual embeddings from models like BERT | N-gram (word sequence) overlap | Exact string or token set matching |
| Semantic Understanding | Yes | No | No |
| Handles Synonyms & Paraphrasing | Yes | No | No |
| Sensitivity to Word Order | Moderate (via attention) | High (exact sequence match) | None (F1) / Absolute (EM) |
| Reference Requirements | Single or multiple references | Typically multiple references | Single reference common |
| Output Granularity | Precision, Recall, F1 (token-level similarity) | Precision, Recall, F1 (n-gram counts) | Single score (EM) or Precision/Recall/F1 |
| Common Use Cases | Text generation, summarization, RAG answer evaluation | Summarization (ROUGE), MT (BLEU) | Question Answering, Classification |
| Computational Cost | High (requires forward pass of BERT model) | Low (string operations) | Very low (string/token comparison) |
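
To make the lexical side of this comparison concrete, here is a small sketch of Exact Match and token-level F1 applied to a paraphrase. The normalization (lowercasing and whitespace splitting) is a simplifying assumption; real evaluation scripts usually also strip punctuation and articles.

```python
from collections import Counter

def exact_match(candidate, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(candidate.strip().lower() == reference.strip().lower())

def token_f1(candidate, reference):
    """Bag-of-tokens F1 over whitespace tokens (word order is ignored)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the car is fast"
paraphrase = "the automobile is quick"
print(exact_match(paraphrase, reference))
print(token_f1(paraphrase, reference))
```

The paraphrase gets an Exact Match of 0.0 and a token F1 of 0.5, with all of the credit coming from the function words 'the' and 'is'. An embedding-based metric is intended to give credit for 'automobile' and 'quick' as well.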

APPLICATION DOMAINS

Where BERTScore is Used

BERTScore is a versatile metric for evaluating text generation quality. Its primary applications span domains where semantic similarity is more critical than exact word matching.

01

Machine Translation Evaluation

BERTScore is a robust alternative to n-gram metrics such as BLEU for evaluating machine translation outputs. It excels where translations are semantically correct but use different synonyms or sentence structures than the reference. By using contextual embeddings to capture meaning beyond n-gram overlap, it tends to correlate better with human judgment, especially for languages with rich morphology or flexible word order.

02

Text Summarization Assessment

In automatic text summarization, BERTScore measures how well a generated summary captures the key semantic content of the source document or reference summaries. It penalizes paraphrasing less than ROUGE does, making it suitable for evaluating abstractive summarization models that rephrase content. Comparing the semantic gist of the summary to the source gives a useful signal of informativeness, but high similarity does not by itself guarantee factual consistency.

03

Dialogue Response Generation

For chatbots and conversational AI, BERTScore evaluates the appropriateness and relevance of generated responses against human references. It is used to benchmark models in tasks like the ConvAI2 challenge or for evaluating retrieval-augmented generation (RAG) outputs in customer service agents. The metric's ability to handle diverse, contextually appropriate paraphrases is critical in open-domain dialogue.

04

Data-to-Text and Code Generation

BERTScore is applied to evaluate systems that generate descriptive text from structured data (e.g., weather reports from tables) or natural language generation (NLG) from knowledge graphs. It is also sometimes applied to code generation by embedding code as text and comparing generated and reference snippets. This measures similarity of token embeddings rather than functional equivalence, so specialized metrics like CodeBLEU, which also compares syntax trees and data flow, may be more precise for code.

05

RAG Pipeline Evaluation

Within Retrieval-Augmented Generation (RAG) systems, BERTScore is a component for measuring answer faithfulness and answer relevance. It can compute the similarity between a generated answer and the retrieved source context to gauge grounding, or between the answer and a reference golden answer. Frameworks like RAGAS may use BERTScore-derived measures as part of a holistic evaluation suite.
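
The two comparisons described above can be organized as in the sketch below. Everything here is an illustrative assumption: `evaluate_rag_answer` is a hypothetical helper, and `toy_overlap_scorer` is a crude word-overlap stand-in so the example runs without a model. In practice the scorer would wrap a BERTScore computation, and precision is often the more natural component when the answer is much shorter than the context.

```python
from typing import Callable, Dict

Scorer = Callable[[str, str], float]  # (candidate, reference) -> similarity

def evaluate_rag_answer(answer: str, context: str, gold: str,
                        scorer: Scorer) -> Dict[str, float]:
    """Score one answer against its retrieved context and a gold answer."""
    return {
        "grounding": scorer(answer, context),  # answer vs. retrieved context
        "correctness": scorer(answer, gold),   # answer vs. reference answer
    }

def toy_overlap_scorer(candidate: str, reference: str) -> float:
    """Stand-in scorer: fraction of candidate words found in the reference."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(w in ref for w in cand) / len(cand) if cand else 0.0

result = evaluate_rag_answer(
    answer="paris is the capital",
    context="paris is the capital of france and its largest city",
    gold="the capital of france is paris",
    scorer=toy_overlap_scorer,
)
print(result)
```

Keeping the scorer as a parameter means the same harness can be rerun with a different embedding model or metric without touching the evaluation loop.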

06

Model Fine-Tuning and Hyperparameter Search

During model development, BERTScore serves as an automatic evaluation metric for validation sets, guiding hyperparameter tuning and checkpoint selection. It provides a faster, automated proxy for human evaluation in iterative training cycles for text generation models like T5 or GPT-style architectures. Its high correlation with human judgment makes it a cost-effective quality signal during experimentation.

BERTSCORE

Frequently Asked Questions

BERTScore is an automatic evaluation metric for text generation that computes a similarity score between a candidate (generated) text and one or more reference texts using contextual embeddings from a pre-trained model like BERT. It works by:

  1. Generating Embeddings: Feeding the candidate and reference sentences through a model like BERT to obtain contextual token embeddings.
  2. Computing Pairwise Similarity: Calculating the cosine similarity between each token embedding in the candidate and each token embedding in the reference.
  3. Greedy Matching: For each token in the candidate, finding the most similar token in the reference (and vice-versa) using a greedy matching algorithm based on the similarity matrix.
  4. Averaging Scores: Computing precision (the average best-match similarity over candidate tokens), recall (the average best-match similarity over reference tokens), and the F1 score (their harmonic mean), which is the value most commonly reported as BERTScore. This process captures semantic similarity far better than n-gram overlap metrics like BLEU or ROUGE.
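
The four steps above can be summarized in formulas. The notation is an assumption introduced here for illustration: $\mathbf{x}_i$ are the reference token embeddings, $\hat{\mathbf{x}}_j$ are the candidate token embeddings, and all vectors are normalized to unit length so that the inner product equals cosine similarity. IDF weighting is omitted.

```latex
\begin{align}
R_{\mathrm{BERT}} &= \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j \\
P_{\mathrm{BERT}} &= \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j \\
F_{\mathrm{BERT}} &= \frac{2 \, P_{\mathrm{BERT}} \, R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
\end{align}
```

Recall sums over reference tokens and precision over candidate tokens, which is why the two can differ when one text is longer or contains unmatched content.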

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.