BERTScore
A metric for evaluating text generation that compares outputs to references at the token level using contextual embeddings, rather than requiring exact word overlap.
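For each token in the candidate text, BERTScore finds the most similar token in the reference by cosine similarity of their contextual embeddings, and does the same in the other direction. Averaging the best matches over candidate tokens gives precision, averaging over reference tokens gives recall, and F1 is their harmonic mean.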
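The matching step described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function names `cosine` and `greedy_match_scores` are hypothetical, the 2-D vectors are made-up stand-ins for real contextual embeddings, and the optional IDF weighting and baseline rescaling of full BERTScore are left out.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(cand_emb, ref_emb):
    """Return (precision, recall, f1) from two lists of token vectors.

    Hypothetical helper: assumes both lists are non-empty, contain no
    zero vectors, and that precision + recall is not zero.
    """
    # sim[i][j] = similarity of candidate token i and reference token j
    sim = [[cosine(c, r) for r in ref_emb] for c in cand_emb]
    # Precision: each candidate token takes its best match in the reference
    precision = sum(max(row) for row in sim) / len(cand_emb)
    # Recall: each reference token takes its best match in the candidate
    recall = sum(max(col) for col in zip(*sim)) / len(ref_emb)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy vectors (made up); a real run would use embeddings from a BERT-style model
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
p, r, f = greedy_match_scores(cand, ref)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case every candidate token has a perfect match, so precision is 1.0, while the reference token `[1.0, 1.0]` is only partly covered, which pulls recall below 1.0.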
Why it exists: Traditional metrics like BLEU, which counts exact n-gram overlaps, penalize valid paraphrases. “The cat sat on the mat” vs “A feline rested on the rug” scores low on BLEU but high on BERTScore, because the contextual embeddings capture the semantic similarity.
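To see why exact overlap is so harsh on the paraphrase above, here is a rough sketch of the clipped n-gram precision that BLEU builds on. It is not full BLEU: there is no brevity penalty and no geometric mean across n-gram orders. The helper name `ngram_precision`, the whitespace tokenization, the lowercasing, and the choice of which sentence is the reference are all assumptions made for illustration.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate against one reference.

    Hypothetical helper: tokenizes by lowercasing and splitting on whitespace.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    cand_ngrams = Counter(
        tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1)
    )
    ref_ngrams = Counter(
        tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
    )
    # Counter intersection keeps the minimum count per n-gram (clipping)
    overlap = sum((cand_ngrams & ref_ngrams).values())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0

reference = "The cat sat on the mat"      # assumed to be the reference
candidate = "A feline rested on the rug"  # assumed to be the model output

precisions = {n: ngram_precision(candidate, reference, n) for n in (1, 2, 3)}
print(precisions)
```

Only “on” and “the” match as unigrams (2 of 6), only “on the” matches as a bigram (1 of 5), and no trigram matches at all, so an unsmoothed BLEU score that multiplies these precisions together would be zero even though the two sentences say nearly the same thing.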
Limitation: Poor at detecting numerical divergence. “$100” and “$10,000” have similar token embeddings, so an output with the wrong amount can still score high.
See also
- Cosine Similarity — the underlying comparison operation
- Embeddings — the token representations being matched
- Prompt Evaluation — where BERTScore is used as an eval metric