BERTScore

A metric for evaluating text generation that compares outputs to references at the token level using contextual embeddings, rather than requiring exact word overlap.

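For each token in the candidate text, BERTScore finds the most similar token in the reference by cosine similarity of their contextual embeddings, and does the same for each reference token against the candidate. The candidate-to-reference matches are averaged into precision, the reference-to-candidate matches into recall, and the two are combined into F1.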
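The matching step can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the two-dimensional vectors are made-up stand-ins for contextual embeddings, `greedy_match_scores` is a hypothetical helper name, similarity is assumed to be cosine similarity, and optional refinements such as IDF weighting are left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(cand_vecs, ref_vecs):
    """Hypothetical helper: greedy best-match precision, recall, and F1."""
    # Precision: each candidate token is paired with its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is paired with its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy vectors (assumed for illustration; real embeddings come from a BERT-style model).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f = greedy_match_scores(candidate, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Both candidate tokens have an exact counterpart in the reference, so precision is perfect, while the third reference token has only a partial match, which pulls recall (and therefore F1) below 1.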

Why it exists: Traditional metrics like BLEU, which counts exact n-gram overlaps, penalize valid paraphrases. “The cat sat on the mat” vs “A feline rested on the rug” scores low on BLEU but high on BERTScore, because contextual embeddings capture the semantic equivalence.
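To see why exact matching struggles here, the sketch below computes clipped n-gram precision, the building block of BLEU, for the two example sentences. It is not full BLEU (no brevity penalty or geometric mean), `ngram_precision` is a hypothetical helper, tokenization is a simple lowercase whitespace split, and treating the first sentence as the reference is an assumption.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    """Hypothetical helper: clipped n-gram precision of candidate against reference."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    # Each candidate n-gram is credited at most as often as it appears in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

reference = "The cat sat on the mat"      # assumed to be the reference
candidate = "A feline rested on the rug"  # assumed to be the candidate

unigram = ngram_precision(candidate, reference, 1)
bigram = ngram_precision(candidate, reference, 2)
trigram = ngram_precision(candidate, reference, 3)
print(unigram, bigram, trigram)
```

Only the function words “on” and “the” overlap, and no trigram matches at all, so an n-gram metric sees almost nothing in common between two sentences that mean the same thing.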

Limitation: BERTScore is poor at detecting numerical divergence, because “$100” and “$10,000” have similar token embeddings, so a candidate with the wrong amount can still score high.
