BERTScore is an automatic evaluation metric for text generation that computes a similarity score between a candidate text and one or more reference texts using contextual embeddings from a pre-trained model like BERT. Unlike traditional metrics such as BLEU or ROUGE that rely on exact lexical overlap, BERTScore leverages the semantic understanding encoded in transformer model embeddings to assess meaning-based similarity. It calculates precision, recall, and an F1 score by matching each token in the candidate to the most semantically similar token in the reference, using cosine similarity on their contextual embeddings.
BERTScore

What is BERTScore?
BERTScore is an automatic evaluation metric for text generation that computes similarity scores between candidate and reference sentences using contextual embeddings from models like BERT.
For Retrieval-Augmented Generation (RAG) evaluation, BERTScore is often used as a proxy for answer faithfulness and answer relevance by comparing a model's generated output to a ground truth answer or to the provided source context. Its reliance on semantic similarity makes it robust to paraphrasing and varied wording, providing a more nuanced assessment than n-gram matching. However, it is computationally more expensive, and its scores depend on the underlying embedding model, so the same model and configuration should be used throughout a benchmark for results to be comparable.
Key Features of BERTScore
Unlike traditional metrics, BERTScore captures semantic similarity beyond surface-level token overlap. The features below explain how.
Contextual Embedding Similarity
BERTScore leverages contextual embeddings from pre-trained transformer models like BERT, RoBERTa, or XLNet. Instead of matching exact words, it computes the cosine similarity between the embedding vectors of each token in the candidate text and each token in the reference text. This allows it to match synonyms and paraphrases (e.g., 'car' and 'automobile') that n-gram metrics like BLEU or ROUGE would miss, providing a more nuanced measure of semantic equivalence.
Precision, Recall, and F1 Calculation
The metric decomposes into three core components:
- BERTScore Precision: Measures how much of the candidate's content is reflected in the reference. It is computed by matching each token in the candidate to the most similar token in the reference.
- BERTScore Recall: Measures how much of the reference's content is captured by the candidate. It is computed by matching each token in the reference to the most similar token in the candidate.
- BERTScore F1: The harmonic mean of precision and recall, providing a single, balanced score.

This triad allows developers to diagnose whether a generated text is overly verbose (low precision) or incomplete (low recall).
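The three components can be written compactly. The notation below is introduced here for illustration: x denotes the sequence of reference token embeddings and x̂ the candidate token embeddings, both assumed to be normalized to unit length so that an inner product equals cosine similarity.

```latex
R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top} \hat{x}_j
\qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top} \hat{x}_j
\qquad
F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
```

Recall averages over reference tokens and precision over candidate tokens, which is why a long candidate that covers the reference but adds extra material tends to keep recall high while precision drops.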
Importance Weighting (IDF)
BERTScore can optionally apply inverse document frequency (IDF) weighting, so that each token's best-match similarity contributes to the score in proportion to its IDF. This emphasizes rare, content-bearing words (e.g., 'eclipse', 'quantum') over common function words (e.g., 'the', 'is'). In the original formulation, the IDF statistics are computed from the reference sentences of the evaluation set, although a larger corpus can be substituted. When enabled, this weighting makes the final score more sensitive to whether key content terms are generated accurately, which matters when evaluating technical summaries or answer faithfulness in RAG systems.
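The effect of IDF weighting can be illustrated with a minimal sketch. It assumes plus-one smoothed IDF computed over a small tokenized reference corpus and a precomputed similarity matrix; the function names, the toy corpus, and the similarity values are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

import numpy as np


def idf_weights(reference_corpus):
    """Plus-one smoothed IDF for every token seen in the tokenized references."""
    num_docs = len(reference_corpus)
    doc_freq = Counter()
    for tokens in reference_corpus:
        doc_freq.update(set(tokens))  # count each token once per sentence
    return {tok: math.log((num_docs + 1) / (df + 1)) for tok, df in doc_freq.items()}


def weighted_recall(sim, ref_tokens, idf, default_idf):
    """IDF-weighted recall; sim has shape (len(ref_tokens), num_candidate_tokens)."""
    weights = np.array([idf.get(tok, default_idf) for tok in ref_tokens])
    best_match = sim.max(axis=1)  # best candidate match for each reference token
    return float((weights * best_match).sum() / weights.sum())


# Toy corpus: 'the' and 'is' occur in every sentence, so their IDF is 0.
corpus = [
    ["the", "eclipse", "is", "total"],
    ["the", "moon", "is", "full"],
    ["the", "sun", "is", "bright"],
]
idf = idf_weights(corpus)

# Made-up similarities for the reference tokens ['the', 'eclipse'] (rows)
# against two candidate tokens (columns).
sim = np.array([[1.0, 0.2], [0.3, 0.5]])
recall = weighted_recall(sim, ["the", "eclipse"], idf, default_idf=math.log(len(corpus) + 1))
```

In this toy case the unweighted recall would be 0.75, but the weighted value is determined entirely by how well 'eclipse' is matched, because 'the' carries zero weight.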
Model and Layer Selection
The score is not monolithic; its behavior depends on the chosen underlying model and the specific transformer layer from which embeddings are extracted.
- Model Choice: Using `roberta-large` typically yields more robust results than `bert-base-uncased` due to its training methodology and larger size.
- Layer Choice: Similarities computed from intermediate layers often align better with human judgment than those from the final output layer, as they capture richer contextual information. The best layer differs from model to model, so it is usually selected by tuning against human-rated data.

This configurability allows engineers to tailor the metric to their specific domain.
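As a hedged illustration of how these choices might be passed to a scorer, the sketch below assumes the open-source `bert-score` Python package and its `score` function with the `model_type`, `num_layers`, and `idf` keyword arguments. The helper names are hypothetical, and the layer index 17 is only an example value; check the package defaults for the model in use.

```python
def bertscore_kwargs(model_type="roberta-large", num_layers=17, use_idf=False):
    """Collect keyword arguments for bert_score.score (values are illustrative)."""
    return {
        "model_type": model_type,  # which pre-trained model supplies the embeddings
        "num_layers": num_layers,  # which layer's representations are compared
        "idf": use_idf,            # whether to apply IDF importance weighting
    }


def run_bertscore(candidates, references, **kwargs):
    """Score candidates against references; the model is downloaded on first use."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from bert_score import score

    precision, recall, f1 = score(candidates, references, **kwargs)
    return precision.tolist(), recall.tolist(), f1.tolist()


config = bertscore_kwargs(model_type="bert-base-uncased", num_layers=9)
```

Calling `run_bertscore(["the car is fast"], ["the automobile is quick"], **config)` would then return one precision, recall, and F1 value per sentence pair. Scores produced with different models or layers are not directly comparable.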
Human Correlation and Robustness
BERTScore was designed to correlate more closely with human judgments than n-gram metrics do. Studies on machine translation and summarization benchmarks generally show it outperforming BLEU and ROUGE in agreement with human ratings of fluency and adequacy, although the margin varies by task and dataset. It is also more robust to synonyms and paraphrasing, reducing the penalty for valid linguistic variation. However, it is not perfect: antonyms can receive high similarity because they occur in similar contexts, and token-level matching does not fully capture high-level discourse structure.
Practical Computation and Baseline Rescaling
In practice, raw BERTScore values tend to cluster in a narrow, model-dependent range near the top of the scale, which makes them hard to interpret. The official implementation therefore offers optional baseline rescaling: a baseline score b is estimated empirically by averaging BERTScore over pairs of unrelated sentences, and each raw score x is linearly rescaled to (x - b) / (1 - b). After rescaling, unrelated text scores near 0 while an exact copy of the reference still scores 1.0. Because the transformation is linear, it changes the readability of the scores but not the ranking of systems. Computationally, BERTScore requires running inference through a transformer model for both candidate and reference texts, which is more expensive than n-gram counting but is easily batched for evaluation sets.
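The rescaling step itself is a one-line linear map. The sketch below assumes the form (x - b) / (1 - b), where b is a baseline score estimated from unrelated sentence pairs; the function name and the numeric values are made up for illustration.

```python
def rescale_with_baseline(raw_score, baseline):
    """Map a raw score so the baseline lands at 0 and a perfect match stays at 1."""
    return (raw_score - baseline) / (1.0 - baseline)


# Hypothetical numbers: a raw F1 of 0.92 with an assumed baseline of 0.85.
rescaled = rescale_with_baseline(0.92, 0.85)
```

A raw score that looks high in isolation can land well below 0.5 once the baseline is subtracted, which is the point of the rescaling: differences between systems become easier to read.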
BERTScore vs. Traditional Metrics
A comparison of BERTScore's contextual embedding-based approach against traditional n-gram and token-matching metrics for evaluating generated text in tasks like summarization, translation, and RAG.
| Metric / Feature | BERTScore | ROUGE / BLEU | Exact Match / F1 |
|---|---|---|---|
| Underlying Mechanism | Contextual embeddings from models like BERT | N-gram (word sequence) overlap | Exact string or token set matching |
| Semantic Understanding | Yes | No | No |
| Handles Synonyms & Paraphrasing | Yes | No | No |
| Sensitivity to Word Order | Moderate (via contextual embeddings) | High (exact sequence match) | None (F1) / Absolute (EM) |
| Reference Requirements | Single or multiple references | Typically multiple references | Single reference common |
| Output Granularity | Precision, Recall, F1 (token-level similarity) | Precision, Recall, F1 (n-gram counts) | Single score (EM) or Precision/Recall/F1 |
| Common Use Cases | Text generation, summarization, RAG answer evaluation | Summarization (ROUGE), MT (BLEU) | Question Answering, Classification |
| Computational Cost | High (requires forward pass of BERT model) | Low (string operations) | Very low (string/token comparison) |
Where BERTScore is Used
BERTScore is a versatile metric for evaluating text generation quality. Its primary applications span domains where semantic similarity is more critical than exact word matching.
Machine Translation Evaluation
BERTScore is a robust alternative to n-gram metrics such as BLEU for evaluating machine translation outputs. It excels where translations are semantically correct but use different synonyms or sentence structures than the reference. It correlates better with human judgment, especially for languages with rich morphology or flexible word order, by using contextual embeddings to capture meaning beyond n-gram overlap.
Text Summarization Assessment
In automatic text summarization, BERTScore measures how well a generated summary captures the key semantic content of the source document or reference summaries. It penalizes paraphrasing less than ROUGE does, making it suitable for evaluating abstractive summarization models that rephrase content. It gives a signal of informativeness by comparing the semantic gist of the summary to the source, although high semantic similarity does not by itself guarantee factual consistency.
Dialogue Response Generation
For chatbots and conversational AI, BERTScore evaluates the appropriateness and relevance of generated responses against human references. It is used to benchmark models in tasks like the ConvAI2 challenge or for evaluating retrieval-augmented generation (RAG) outputs in customer service agents. The metric's ability to handle diverse, contextually appropriate paraphrases is critical in open-domain dialogue.
Data-to-Text and Code Generation
BERTScore is applied to evaluate systems that generate descriptive text from structured data (e.g., weather reports from tables) or natural language generation (NLG) from knowledge graphs. It is also used in code generation to estimate the similarity between generated and reference code snippets by embedding code as text. It cannot verify that the code behaves correctly, and specialized metrics like CodeBLEU may be more precise for syntactic correctness.
RAG Pipeline Evaluation
Within Retrieval-Augmented Generation (RAG) systems, BERTScore is a component for measuring answer faithfulness and answer relevance. It can compute the similarity between a generated answer and the retrieved source context to gauge grounding, or between the answer and a reference golden answer. Frameworks like RAGAS may use BERTScore-derived measures as part of a holistic evaluation suite.
Model Fine-Tuning and Hyperparameter Search
During model development, BERTScore serves as an automatic evaluation metric for validation sets, guiding hyperparameter tuning and checkpoint selection. It provides a faster, automated proxy for human evaluation in iterative training cycles for text generation models like T5 or GPT-style architectures. Its high correlation with human judgment makes it a cost-effective quality signal during experimentation.
Frequently Asked Questions
BERTScore is an automatic evaluation metric for text generation that computes a similarity score between a candidate (generated) text and one or more reference texts using contextual embeddings from a pre-trained model like BERT. It works by:
- Generating Embeddings: Feeding the candidate and reference sentences through a model like BERT to obtain contextual token embeddings.
- Computing Pairwise Similarity: Calculating the cosine similarity between each token embedding in the candidate and each token embedding in the reference.
- Greedy Matching: For each token in the candidate, selecting the most similar token in the reference (and vice versa) from the similarity matrix. The matching is greedy in that each token is paired independently with its best match, so one token can be matched more than once.
- Averaging Scores: Computing precision (the average best-match similarity over candidate tokens), recall (the average best-match similarity over reference tokens), and the F1 score (their harmonic mean), which is typically reported as the final BERTScore.

This process captures semantic similarity more reliably than n-gram overlap metrics like BLEU or ROUGE.
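The matching and averaging steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the official implementation: it assumes the token embeddings have already been produced by some model, and the toy two-dimensional vectors and the function name are hypothetical.

```python
import numpy as np


def greedy_bertscore(cand_emb, ref_emb):
    """Return (precision, recall, f1) from token embedding matrices.

    cand_emb: array of shape (num_candidate_tokens, dim)
    ref_emb:  array of shape (num_reference_tokens, dim)
    """
    # Normalize rows so that a dot product equals cosine similarity.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)

    # Pairwise cosine similarities, shape (num_candidate_tokens, num_reference_tokens).
    sim = cand @ ref.T

    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)


# Toy embeddings standing in for real model output.
candidate = np.array([[1.0, 0.0], [0.0, 1.0]])
reference = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
p, r, f = greedy_bertscore(candidate, reference)
```

Here every candidate token has an exact counterpart in the reference, so precision is 1.0, while the reference token `[1.0, 1.0]` is only partially matched, which pulls recall below 1.0.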
Related Terms
BERTScore is one of several metrics used to evaluate the quality of text generation. These related terms define other critical measures for assessing retrieval and generation components in RAG systems.
ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation. It operates by comparing overlapping units like n-grams, word sequences, and word pairs between a candidate text and one or more reference texts.
- ROUGE-N: Measures n-gram overlap (e.g., ROUGE-1 for unigrams, ROUGE-2 for bigrams).
- ROUGE-L: Uses the longest common subsequence to assess fluency and word order.
- Key Difference from BERTScore: ROUGE is a surface-level lexical overlap metric, while BERTScore uses contextual embeddings to capture semantic similarity.
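To make the contrast concrete, ROUGE-N recall can be sketched as clipped n-gram overlap divided by the number of reference n-grams. The function names and the whitespace tokenization below are simplifying assumptions; production ROUGE implementations add stemming and other preprocessing.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand_counts = ngram_counts(candidate, n)
    ref_counts = ngram_counts(reference, n)
    overlap = sum((cand_counts & ref_counts).values())  # clipped matching counts
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


reference = "the car is fast".split()
candidate = "the automobile is fast".split()
rouge_1 = rouge_n_recall(candidate, reference, n=1)
rouge_2 = rouge_n_recall(candidate, reference, n=2)
```

The synonym swap from 'car' to 'automobile' costs one of four unigrams and two of three bigrams, even though the meaning is unchanged. This is the kind of penalty an embedding-based metric is meant to soften.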
BLEU
BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of machine-translated text. It calculates a modified n-gram precision score by comparing a candidate translation against one or more human reference translations.
- Core Mechanism: Counts matching n-grams, but applies a brevity penalty to penalize overly short outputs.
- Primary Use Case: The long-standing standard for machine translation benchmarking.
- Key Difference from BERTScore: BLEU is based on exact token matching and is known to correlate poorly with human judgment for tasks requiring semantic understanding, unlike BERTScore's embedding-based approach.
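The two ingredients named above, modified n-gram precision and the brevity penalty, can be sketched as follows. This is a simplified illustration with hypothetical function names. Full BLEU combines the precisions for several n-gram orders through a geometric mean and is computed at corpus level.

```python
import math
from collections import Counter


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = counts(candidate)
    max_ref_counts = Counter()
    for ref in references:
        max_ref_counts |= counts(ref)  # keep the maximum count seen in any reference
    clipped = sum(min(c, max_ref_counts[gram]) for gram, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(candidate_len, reference_len):
    """Penalize candidates that are shorter than the reference."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


reference = "the cat is on the mat".split()
degenerate = "the the the the the the the".split()
unigram_precision = modified_precision(degenerate, [reference], n=1)
short_penalty = brevity_penalty(3, 6)
```

Clipping stops the degenerate candidate from being rewarded for repeating 'the': only two of its seven unigrams count, since 'the' appears twice in the reference.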
Semantic Similarity
Semantic Similarity is a general concept quantifying the likeness in meaning between two texts. BERTScore is a specific, popular implementation of this concept.
- Implementation Models: Often calculated using embeddings from sentence-transformers like Sentence-BERT (SBERT) or other pre-trained models.
- Mechanism: Encodes sentences into dense vector embeddings and computes a similarity measure (e.g., cosine similarity) between them.
- Relation to BERTScore: BERTScore refines this by computing a precision, recall, and F1 match using pairwise token embeddings from BERT, rather than a single sentence-level vector.
Answer Faithfulness
Answer Faithfulness is a metric that evaluates whether a generated answer is factually consistent with and entirely supported by the provided source context. It is critical for Retrieval-Augmented Generation (RAG) evaluation.
- Core Question: "Does the answer contain any statements not substantiated by the source?"
- Contrast with BERTScore: BERTScore measures semantic similarity to a reference. Answer Faithfulness measures factual grounding to a source, regardless of a reference answer. A text can be semantically similar (high BERTScore) but unfaithful if it introduces unsourced facts.
METEOR
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is an automatic metric for machine translation evaluation that addresses known weaknesses in BLEU.
- Core Improvements: It incorporates synonymy (via WordNet) and stemming to match words beyond exact forms, and includes a penalty for poor word order.
- Scoring: Combines unigram precision and recall in a harmonic mean weighted toward recall, then applies a chunk-based fragmentation penalty.
- Comparison to BERTScore: Like BERTScore, METEOR moves beyond exact n-gram matching. However, METEOR uses linguistic modules (synonym lists, stems), while BERTScore uses contextual embeddings from a deep neural network to understand semantic similarity.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.