BERTScore is an automatic evaluation metric for assessing the quality of generated text, such as machine translation or summarization outputs. It calculates a similarity score by comparing a candidate sentence to one or more reference sentences using contextual embeddings from a pre-trained language model like BERT. Unlike traditional metrics that rely on exact n-gram matches, BERTScore leverages the model's deep semantic understanding to match words based on their meaning in context, providing a more nuanced assessment of semantic equivalence.

What is BERTScore?
BERTScore is an automatic evaluation metric for text generation that computes similarity scores by matching words in candidate and reference sentences using contextual embeddings from models like BERT.
The metric computes precision, recall, and an F1 score by matching each token embedding in one text to its most similar token embedding in the other, using cosine similarity. This makes it more robust to synonym substitution and paraphrasing, which are common weaknesses of metrics like BLEU and ROUGE. On the tasks where it has been studied, BERTScore tends to correlate more closely with human judgments of text quality, especially where meaning preservation is critical. It is widely used in evaluation-driven development for generative AI systems.
Key Features of BERTScore
Contextual Embedding Matching
Unlike traditional metrics that rely on exact n-gram overlap, BERTScore computes similarity using contextual embeddings from a pre-trained model like BERT. Each token (typically a word piece) in the candidate and reference sentences is represented as a high-dimensional vector that captures its meaning within the full sentence context. Precision is calculated by finding, for each candidate token, the maximum cosine similarity to any reference token; recall does the same in the other direction, from each reference token to the candidate. This allows synonyms and paraphrases with similar contextual meaning to match, providing a more semantically aware evaluation.
- Core Mechanism: Computes pairwise cosine similarity between token vectors.
- Model Agnostic: Can use embeddings from BERT, RoBERTa, or other transformer models.
- Semantic Flexibility: Recognizes that 'automobile' and 'car' should be considered a match.
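The matching step described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the three-dimensional vectors are hypothetical stand-ins for real contextual embeddings, and the helper names `cosine` and `similarity_matrix` are chosen here for clarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(cand_vecs, ref_vecs):
    """Rows are candidate tokens, columns are reference tokens."""
    return [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]

# Hypothetical 3-d "embeddings"; a real encoder produces hundreds of dimensions.
candidate = {"the": [1.0, 0.0, 0.0], "automobile": [0.0, 0.9, 0.1]}
reference = {"the": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}

sim = similarity_matrix(list(candidate.values()), list(reference.values()))

# Greedy matching: each candidate token keeps its best-scoring reference token.
ref_tokens = list(reference)
best_match = {}
for token, row in zip(candidate, sim):
    j = max(range(len(row)), key=row.__getitem__)
    best_match[token] = (ref_tokens[j], row[j])
```

With these toy vectors, 'automobile' is aligned to 'car' with a similarity close to 1 even though the strings share no characters, which is the behavior an exact-match metric cannot reproduce.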
Precision, Recall, and F1 Decomposition
BERTScore decomposes the overall similarity into three interpretable components, mirroring classic information retrieval metrics:
- BERT-Precision: Measures how much of the generated (candidate) text is relevant or contained within the reference. It is the average similarity of each candidate token to its most similar reference token.
- BERT-Recall: Measures how much of the reference text is captured by the generation. It is the average similarity of each reference token to its most similar candidate token.
- BERT-F1: The harmonic mean of BERT-Precision and BERT-Recall, providing a single balanced score.
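This decomposition allows developers to diagnose specific failure modes. For example, a model with high precision but low recall is generating text that stays close to the reference but omits part of its content. BERTScore measures similarity to the reference, not factual correctness.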
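As a rough sketch of the decomposition above, the three scores can be read directly off a candidate-by-reference similarity matrix. The matrix values below are hypothetical, the function name `bertscore_from_matrix` is illustrative, and IDF weighting is omitted.

```python
def bertscore_from_matrix(sim):
    """sim[i][j] is the similarity of candidate token i to reference token j."""
    n_cand, n_ref = len(sim), len(sim[0])
    # Precision: average best match for each candidate token (row-wise max).
    precision = sum(max(row) for row in sim) / n_cand
    # Recall: average best match for each reference token (column-wise max).
    recall = sum(max(row[j] for row in sim) for j in range(n_ref)) / n_ref
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

# Hypothetical similarities: 2 candidate tokens (rows) vs. 3 reference tokens (columns).
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]
p, r, f1 = bertscore_from_matrix(sim)
```

Here the third reference token has no good counterpart in the candidate, so recall (0.70) falls below precision (0.85), which is the "incomplete output" pattern the decomposition is meant to expose.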
Importance Weighting via IDF
To prevent common but uninformative words (e.g., 'the', 'is') from dominating the similarity score, BERTScore can optionally apply Inverse Document Frequency (IDF) weighting. Tokens that are rare across a corpus are assigned higher importance. Instead of a plain average, recall becomes a weighted average in which each reference token's best-match similarity is weighted by that token's IDF; precision does the same using the IDF of each candidate token.
- Corpus-Dependent: In the original formulation, IDF statistics are computed from the reference sentences of the test set, so the weights depend on the evaluation corpus and scores are not directly comparable across corpora.
- Practical Impact: The word 'transformer' in a machine learning context receives a higher weight than the word 'model', leading to scores that better reflect informative content match.
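The effect of IDF weighting can be illustrated with a small sketch. The plus-one smoothing used here is one common choice and is an assumption of this example, as are the tiny corpus, the best-match similarities, and the function names.

```python
import math
from collections import Counter

def idf_weights(reference_corpus):
    """Plus-one smoothed IDF over a list of tokenized reference sentences."""
    m = len(reference_corpus)
    doc_freq = Counter()
    for sentence in reference_corpus:
        doc_freq.update(set(sentence))  # count each token once per sentence
    return {tok: math.log((m + 1) / (df + 1)) for tok, df in doc_freq.items()}

def weighted_recall(ref_tokens, best_sims, idf):
    """IDF-weighted average of each reference token's best-match similarity."""
    weights = [idf[tok] for tok in ref_tokens]
    return sum(w * s for w, s in zip(weights, best_sims)) / sum(weights)

refs = [
    ["the", "transformer", "is", "fast"],
    ["the", "model", "is", "small"],
    ["the", "model", "is", "accurate"],
]
idf = idf_weights(refs)

# Hypothetical best-match similarities for the tokens of refs[0]:
# the candidate matched 'the', 'is', 'fast' well but missed 'transformer'.
best_sims = [0.95, 0.40, 0.95, 0.90]
plain = sum(best_sims) / len(best_sims)
weighted = weighted_recall(refs[0], best_sims, idf)
```

Because 'the' and 'is' appear in every reference, their weight is zero in this toy corpus, so the missed rare token 'transformer' pulls the weighted recall well below the unweighted average. A real implementation would also need to handle the case where all weights are zero.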
Human Correlation and Robustness
BERTScore was designed to correlate more highly with human judgment than n-gram-based metrics like BLEU or ROUGE. Empirical studies across machine translation, text summarization, and image captioning tasks show it achieves superior human correlation. Its robustness stems from several factors:
- Handles Paraphrases: Effectively matches different phrasings that convey the same meaning.
- Resilient to Word Order Changes: Because matching is based on token similarity rather than contiguous sequences, it is less penalized by grammatical reordering.
- Model-Based Calibration: The scores from larger, more capable embedding models (e.g., RoBERTa-large) typically show higher correlation with human ratings.
It is particularly effective for evaluating modern generative models where fluency and semantic correctness are paramount.
Computational Considerations and Baseline Rescaling
A key implementation detail is baseline rescaling. Raw BERTScore values are cosine similarities, so in principle they lie between -1 and 1, but in practice they cluster in a narrow band near the top of that range, which makes differences between outputs hard to read. To improve interpretability, scores can be rescaled against an empirical baseline. Rescaling is a linear transformation, so it does not change how outputs are ranked.
- Typical Baselines: In the original implementation, the baseline is the average BERTScore of randomly paired, unrelated sentences drawn from a large monolingual corpus, computed separately for each embedding model and language.
- Rescaling Formula:
(score - baseline_score) / (1 - baseline_score), which usually places the rescaled score between 0 and 1, although a candidate that scores below the baseline receives a negative value.
- Computational Cost: Generating embeddings for the candidate and reference texts is the primary cost, making BERTScore more expensive than n-gram metrics. Batch processing is recommended for efficiency.
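The rescaling formula can be sketched as a one-line function. The baseline value of 0.83 below is hypothetical; real baselines depend on the embedding model, layer, and language.

```python
def rescale(score, baseline):
    """Linearly map the baseline to 0 while keeping a perfect score at 1."""
    return (score - baseline) / (1.0 - baseline)

baseline = 0.83  # hypothetical baseline for some model and language
raw_scores = [0.83, 0.92, 1.0, 0.80]
rescaled = [rescale(s, baseline) for s in raw_scores]
```

A raw score equal to the baseline maps to 0, a perfect raw score stays at 1, and the last raw score (0.80) comes out negative, which shows that the rescaled value is not strictly bounded below.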
Common Use Cases and Limitations
Primary Use Cases:
- Evaluating machine translation quality, especially for languages with flexible word order.
- Assessing text summarization systems for content selection and meaning preservation relative to reference summaries.
- Benchmarking image captioning or data-to-text generation models.
- Fine-tuning language models using BERTScore as a reward signal in reinforcement learning.
Key Limitations to Consider:
- Reference Dependence: Still requires one or more high-quality human references, like all reference-based metrics.
- Embedding Artifacts: Scores can be influenced by quirks of the chosen pre-trained embedding model.
- No Explicit Fact Checking: Measures semantic similarity, not factual veracity. A candidate sentence that is semantically similar to a reference but contains a factual error may still receive a high score.
- Computational Overhead: Not suitable for real-time, latency-critical evaluation during inference.
BERTScore vs. Traditional Metrics
A technical comparison of BERTScore's contextual embedding-based approach against traditional n-gram and token-matching metrics for evaluating text generation.
| Evaluation Dimension | BERTScore | BLEU / ROUGE / METEOR | Human Evaluation (Gold Standard) |
|---|---|---|---|
| Underlying Mechanism | Contextual embeddings from models like BERT | N-gram or word sequence overlap | Human judgment based on semantics, fluency, and coherence |
| Semantic Understanding | Yes | No (surface overlap only) | Yes |
| Handles Synonyms & Paraphrases | Yes | Largely no (METEOR adds limited stem and synonym matching) | Yes |
| Word Order Sensitivity | Low to moderate (order affects scores only through the contextual embeddings) | High (exact sequence match) | High |
| Correlation with Human Judgment | High (Pearson around 0.9 reported on some tasks) | Moderate to Low (Pearson around 0.3-0.5) | Reference standard by definition |
| Reference Requirements | Single or multiple references | Typically multiple references | Human annotators |
| Computational Cost | High (requires a forward pass through BERT) | Very Low (string operations) | Extremely High (time, cost, scalability) |
| Common Use Cases | Summarization, dialogue, machine translation evaluation | Machine translation (BLEU), summarization (ROUGE) evaluation | Final validation, creating benchmark datasets |
| Output Granularity | Token-level and sentence-level similarity scores | Sentence or corpus-level aggregate score | Ordinal rankings or Likert-scale ratings |
Common Use Cases for BERTScore
BERTScore is a versatile metric that leverages contextual embeddings to evaluate text generation quality. Its primary applications span from benchmarking foundational models to fine-tuning production pipelines.
Machine Translation Evaluation
BERTScore provides a more nuanced evaluation of machine translation outputs than traditional n-gram metrics like BLEU. By using contextual embeddings, it captures semantic equivalence even when the candidate translation uses different words or sentence structure than the reference.
- Key Advantage: It correlates better with human judgment on meaning preservation, especially for languages with flexible word order.
- Example: A translation from English to German that uses a synonym not present in the reference sentence would receive a low BLEU score but could achieve a high BERTScore if the meaning is preserved.
- Limitation: It requires high-quality reference translations, making it unsuitable for settings where no reliable reference is available.
Text Summarization Assessment
In text summarization, the goal is to condense information while preserving core meaning. BERTScore evaluates the semantic content overlap between a generated summary and reference summaries.
- Superior to ROUGE: While ROUGE measures lexical overlap, BERTScore assesses if the summary captures the same concepts and entities, even with paraphrasing.
- Use in Training: It can be used as a reward signal for reinforcement learning-based summarization models, directly optimizing for semantic fidelity.
- Practical Consideration: It is often used alongside ROUGE and human evaluation to get a comprehensive view of summary quality (informativeness vs. fluency).
Dialogue Response Generation
Evaluating chatbot or dialogue system responses is challenging due to the many possible valid replies. BERTScore compares a generated response to multiple acceptable reference responses.
- Handles Diversity: A good response may be semantically correct but lexically distinct from any single reference. BERTScore's embedding-based matching can identify this semantic alignment.
- Contextual Understanding: It can be applied to multi-turn dialogue by concatenating the conversation history with the response, allowing the metric to evaluate contextual appropriateness.
- Industry Application: Used to A/B test different dialogue model architectures by comparing their average BERTScore against a set of gold-standard test conversations.
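A minimal sketch of the multi-reference comparison described above, assuming per-reference F1 scores have already been computed. Keeping the best-scoring reference for each response is one common convention and is an assumption here; all scores and function names are hypothetical.

```python
from statistics import mean

def score_against_references(f1_per_reference):
    """Keep the best score among all acceptable reference responses."""
    return max(f1_per_reference)

def system_average(per_response_scores):
    """Average the best-reference score over every test response."""
    return mean(score_against_references(s) for s in per_response_scores)

# Hypothetical F1 scores: one inner list per test prompt, one value per reference.
system_a = [[0.62, 0.81, 0.40], [0.55, 0.58]]
system_b = [[0.70, 0.66, 0.45], [0.52, 0.61]]

avg_a = system_average(system_a)
avg_b = system_average(system_b)
better = "A" if avg_a > avg_b else "B"
```

Taking the maximum rewards a response for matching any one valid reply rather than penalizing it for differing from the others. For a real A/B test, the difference in averages should also be checked for statistical significance.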
Model Fine-Tuning & Hyperparameter Search
BERTScore serves as an automatic evaluation metric during the model development lifecycle, guiding decisions without constant human intervention.
- Validation Metric: During fine-tuning of a text generation model (e.g., a T5 or GPT model), BERTScore on a held-out validation set can be used for early stopping or checkpoint selection, so that the chosen checkpoint reflects semantic similarity rather than n-gram overlap alone.
- Hyperparameter Optimization: It can be the objective function in a search for optimal learning rates, batch sizes, or architectural variants, directly optimizing for semantic output quality.
- Pipeline Integration: In Evaluation-Driven Development, BERTScore provides a quantitative gate before a model progresses from staging to canary deployment.
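Early stopping on a validation metric, as mentioned above, might look like the following sketch. The per-epoch scores are hypothetical validation BERTScore F1 values, and `patience` and `min_delta` are illustrative parameters rather than settings from any particular framework.

```python
def best_checkpoint(val_scores, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation scores."""
    best_epoch, best_score, stale = 0, float("-inf"), 0
    for epoch, score in enumerate(val_scores):
        if score > best_score + min_delta:
            # New best: remember it and reset the patience counter.
            best_epoch, best_score, stale = epoch, score, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_scores) - 1

# Hypothetical validation BERTScore F1 after each epoch.
val_f1 = [0.861, 0.874, 0.880, 0.879, 0.878, 0.882]
best_epoch, stop_epoch = best_checkpoint(val_f1, patience=2)
```

With a patience of 2, training stops at epoch 4 and the epoch 2 checkpoint is kept, even though epoch 5 would have scored slightly higher. That trade-off between compute and a possible late improvement is what the patience setting controls.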
Data-to-Text Generation
This task involves generating fluent text from structured data (e.g., sports statistics, weather data). BERTScore evaluates how closely the generated text matches a reference text written from the same data, which serves as a proxy for how well it conveys the source content.
- Content Alignment: It measures alignment between the information in the generated text and the information in the reference text derived from the same data.
- Relation to RAG: This use case is loosely analogous to comparing the output of a Retrieval-Augmented Generation (RAG) system against its retrieved source documents, although BERTScore measures semantic similarity rather than verifying facts.
- Challenge: Requires precise references that contain all and only the information present in the source data, making dataset construction crucial.
Benchmarking Foundation Models
When comparing large language models (LLMs) on standard tasks, BERTScore offers a consistent, automated metric that complements human evaluation.
- Standardized Comparison: Used in leaderboards for tasks like summarization (CNN/Daily Mail) and translation (WMT) to provide a reproducible semantic score.
- Cost-Effective Scaling: It automates a significant portion of evaluation at scale, where human evaluation of every model output is prohibitively expensive.
- Research Insight: Analyzing where BERTScore and n-gram metrics diverge can reveal specific model strengths (e.g., paraphrasing ability) and weaknesses (e.g., hallucination).
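One way to look for the divergences mentioned above is to compare each example's BERTScore with a simple lexical overlap score. In this sketch, unigram F1 is a simplified stand-in for an n-gram metric (it is not ROUGE or BLEU), and the BERTScore values, the example sentences, and the 0.5 threshold are all hypothetical.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Whitespace-token overlap F1; a simplified stand-in for n-gram metrics."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# (candidate, reference, hypothetical precomputed BERTScore F1)
examples = [
    ("the car stopped suddenly", "the automobile halted abruptly", 0.91),
    ("the cat sat on the mat", "the cat sat on the mat", 1.00),
]

# Flag cases where the embedding-based score is far above the lexical score.
divergent = [(c, r) for c, r, bs in examples if bs - unigram_f1(c, r) > 0.5]
```

The first pair is flagged because it is a paraphrase with almost no shared words. Reviewing flagged pairs by hand helps separate genuine paraphrases from cases where the embedding-based score is too generous.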
Frequently Asked Questions
BERTScore is an automatic evaluation metric for text generation that uses contextual embeddings from models like BERT to compute similarity scores between candidate and reference sentences. This FAQ addresses common technical questions about its implementation, interpretation, and use cases.
BERTScore computes similarity by matching tokens in a candidate sentence to tokens in a reference sentence using contextual embeddings from a pre-trained model like BERT. It first generates a contextual embedding for each token in both sentences using a model such as bert-base-uncased. It then computes a pairwise cosine similarity matrix between all candidate and reference tokens. Finally, it performs greedy matching: each candidate token is aligned to its most similar reference token (for precision), and each reference token is aligned to its most similar candidate token (for recall). These similarities are averaged into precision and recall, and their harmonic mean gives the F1 score. Unlike n-gram metrics such as BLEU, it relies on contextual representations, which tends to align it more closely with human judgment for tasks like machine translation, summarization, and dialogue generation.
Related Terms
BERTScore is part of a broader ecosystem of metrics for evaluating generative and discriminative AI models. These related concepts provide context for its specific role in assessing semantic similarity.
Embedding-Based Metrics
This is the broader category to which BERTScore belongs. These metrics use dense vector representations (embeddings) from pre-trained models to evaluate text similarity, moving beyond exact token matching.
- Core Principle: Sentences with similar meanings should have similar embedding vectors in a high-dimensional space.
- Common Similarity Measures: Cosine similarity or pairwise distance (e.g., Euclidean) between sentence or token embeddings.
- Examples: Besides BERTScore, other variants use embeddings from models like GloVe, InferSent, or Sentence-BERT. BERTScore is distinguished by its use of token-level contextual embeddings and precision, recall, and F1 formulation.
Semantic Textual Similarity (STS)
Semantic Textual Similarity is the general task of quantifying how semantically similar two pieces of text are, typically scored on a scale (e.g., 0 to 5). BERTScore addresses a closely related problem: it automatically scores how similar a candidate is to a reference, without human raters.
- Benchmarks: Tasks like the STS Benchmark provide human-annotated datasets for evaluating model performance on STS.
- Human Correlation: Metrics like BERTScore are validated by their correlation (e.g., Pearson, Spearman, or Kendall) with human ratings. The original BERTScore evaluation used machine translation and image captioning data rather than STS benchmarks.
- Application: Used to evaluate paraphrase generation, summarization, and dialogue response quality, where meaning preservation is key.
Model-Based Evaluation
BERTScore exemplifies model-based evaluation, where a separate ML model (the evaluator) is used to assess the output of a primary ML model (the generator). This creates a paradigm shift from rule-based metrics.
- Evaluator Model: In BERTScore, a pre-trained BERT model serves as the frozen evaluator, providing contextual embeddings.
- Advantages: Can capture complex, learned notions of language quality and semantics.
- Risks: Inherits the biases and limitations of the evaluator model. Performance is tied to the quality and relevance of the pre-trained embeddings used.
- Trend: Part of a larger movement towards learned metrics like BARTScore, which uses a sequence-to-sequence model to score generation quality.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.