Table 1 A brief overview of intrinsic metrics for LLMs
From: Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI
Name | Focus | Measure | Model |
---|---|---|---|
BLEU | General | Calculates precision based on the number of shared n-grams (n consecutive words) between the reference and generated text. | |
ROUGE | General | Calculates the F1-score based on the number of shared n-grams between the reference and generated text. | |
Perplexity | General | Measures the likelihood of the model generating the reference text. | |
BERTScore | General | Builds a similarity matrix between the reference and generated text and computes the weighted sum of the maximum similarities in the matrix. | BERT, RoBERTa [20] |
METEOR | General | Calculates an F1-score (with more weight on recall) based on the number of matched words, considering synonyms, in the reference and generated text. | GPT-3.5 [69] |
Precision | General | Divides the number of correctly generated relevant words by the total number of generated words. | |
Recall | General | Divides the number of correctly generated relevant words by the total number of relevant words. | |
F1-score | General | The harmonic mean of precision and recall. | |
TER [84] | General | Computed from the minimum number of edits required to transform the generated text into the reference text. | GPT-4 [85] |
MoverScore [86] | General | Like BERTScore, builds a similarity matrix, but allows many-to-one word relationships. | GPT-3.5 [69] |
NIST [84] | General | Similar to BLEU, but gives higher weight to more informative shared n-grams. | |
Diagnosis success rate | Dialog | The percentage of dialogues that end in a successful diagnosis. | |
Symptom recall | Dialog | The chatbot's ability to accurately inquire about relevant symptoms. | |
Average turns | Dialog | The average number of turns (interactions) between the user and the chatbot. | |
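To make the overlap arithmetic behind the first rows concrete, here is a minimal sketch of clipped n-gram precision (the BLEU-style quantity), the symmetric recall, and their harmonic-mean F1. The function names and the example sentences are our own illustration, not code from the paper or from any metric library.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n consecutive-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(reference, generated, n=1):
    """Shared n-grams (clipped to reference counts) / n-grams generated."""
    ref_counts = Counter(ngrams(reference, n))
    gen_counts = Counter(ngrams(generated, n))
    overlap = sum(min(count, ref_counts[g]) for g, count in gen_counts.items())
    total = sum(gen_counts.values())
    return overlap / total if total else 0.0

def ngram_recall(reference, generated, n=1):
    """Shared n-grams / n-grams in the reference (roles swapped)."""
    return ngram_precision(generated, reference, n)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Hypothetical reference/generated pair for illustration.
reference = "the patient reports a mild headache".split()
generated = "the patient has a headache".split()

p = ngram_precision(reference, generated)  # 4 shared unigrams / 5 generated = 0.8
r = ngram_recall(reference, generated)     # 4 shared unigrams / 6 in reference
print(p, r, f1(p, r))
```

Production metrics add refinements this sketch omits: BLEU combines several n-gram orders with a brevity penalty, ROUGE is conventionally recall-oriented, and METEOR adds synonym matching, so a library implementation should be preferred for reported results.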