Table 1 A brief overview of intrinsic metrics for LLMs

From: Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI

| Name | Focus | Measure | Model |
| --- | --- | --- | --- |
| BLEU71,72 | General | Calculates precision from the number of overlapping n-grams (runs of n consecutive words) between the reference and generated text. | FLAN73, BART74, DialoGPT75, GPT-358 |
| ROUGE5,72 | General | Calculates F1-score from the number of overlapping n-grams between the reference and generated text. | BART74, T5, GPT-276, BiomedGPT77 |
| Perplexity72,78 | General | Likelihood of the model generating the reference text. | LIMA39, BART74, Meena9 |
| BERTScore20,72 | General | Builds a similarity matrix between the reference and generated text and computes the weighted sum of the maximum similarities in the matrix. | BERT, RoBERTa20 |
| METEOR72,79 | General | Calculates F1-score (weighted toward recall) from the number of matched words, considering synonyms, between the reference and generated text. | GPT-3.569 |
| Precision18,72,80 | General | Number of correctly generated relevant words divided by the total number of generated words. | BioGPT81, ChatDoctor82, medAlpaca28 |
| Recall18,80 | General | Number of correctly generated relevant words divided by the total number of relevant words in the reference. | BioGPT81, ChatDoctor82, medAlpaca28 |
| F1-Score5,83 | General | Harmonic mean of precision and recall. | BioGPT81, ChatDoctor82, medAlpaca28 |
| TER84 | General | Minimum number of edits required to transform the generated text into the reference text. | GPT-485 |
| MoverScore86 | General | Like BERTScore, builds a similarity matrix, but allows many-to-one word alignments. | GPT-3.569 |
| NIST84 | General | Similar to BLEU, but gives higher weight to more informative overlapping n-grams. | BART87, GPT-287 |
| Dialog Accuracy88,89,90,91,92 | Dialog | Percentage of dialogues that end in a successful diagnosis. | Refuel88,89,91, KR_DS90 |
| Match-rate88,89,90,91,92 | Dialog | Ability of the chatbot to accurately inquire about relevant symptoms. | Refuel88,89, KR_DS90 |
| Average Request Turn88,89,90,91,92 | Dialog | Average number of turns (interactions) between the user and the chatbot. | Refuel88,89, KR_DS90 |

METEOR, Metric for Evaluation of Translation with Explicit ORdering; TER, translation edit rate.
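The n-gram overlap family in the table (BLEU-style precision, ROUGE-style F1, and word-level precision/recall/F1) can be sketched with a toy computation. This is an illustration of the shared idea only, not a faithful BLEU or ROUGE implementation (real BLEU combines clipped counts over several n-gram orders with a brevity penalty, and ROUGE has multiple variants); the `overlap_scores` helper and the example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams (runs of n consecutive tokens)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(reference, generated, n=1):
    """Toy n-gram precision, recall, and F1 between two token lists.

    Precision: matched n-grams / n-grams in the generated text.
    Recall:    matched n-grams / n-grams in the reference text.
    Matches are clipped via multiset intersection, so a word repeated
    in the output cannot be credited more often than it appears in the
    reference.
    """
    ref, gen = ngrams(reference, n), ngrams(generated, n)
    matched = sum((ref & gen).values())  # clipped n-gram matches
    precision = matched / max(sum(gen.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Hypothetical reference/generated pair for illustration.
ref = "the patient reports mild chest pain".split()
gen = "patient reports chest pain".split()
p, r, f1 = overlap_scores(ref, gen, n=1)
```

Here every generated unigram also occurs in the reference, so precision is 1.0, while recall is 4/6 because "the" and "mild" are missed; F1 balances the two, as in the Precision, Recall, and F1-Score rows above.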