Table 2 Overview of evaluation metrics for diagnostic tasks
From: Large language models for disease diagnosis: a scoping review
| Type | Evaluation metric | Purpose | Scenario | Representative task |
|---|---|---|---|---|
| Automated evaluation | Accuracy [216] | The ratio of correct predictions to the total number of predictions | G | |
| | Precision [55] | The ratio of true positives to the total number of positive predictions | G | |
| | Recall [55] | The ratio of true positives to the total number of actual positive cases | G | |
| | F1 [133] | The harmonic mean of precision and recall | G | |
| | AUC [224] | The area under the receiver operating characteristic curve | G | |
| | AUPR [228] | The area under the precision-recall curve | G | |
| | Top-k accuracy [140] | The ratio of instances whose true label appears in the top k predictions to the total number of instances | G | |
| | Top-k precision [60] | The ratio of true positives to the total number of positive predictions within the top k predictions | G | |
| | Top-k recall [231] | The ratio of true positives within the top k predictions to the total number of actual positive cases | G | |
| | Mean squared error [142] | The average of the squared differences between predicted and actual values | G | |
| | Mean absolute error [141] | The average of the absolute differences between predicted and actual values | G | |
| | Cohen's κ [232] | Measures the agreement between predicted and actual scores | G | DD [232] |
| | BLEU [115] | Calculates precision by matching n-grams between reference and generated text | T | |
| | ROUGE [187] | Calculates an F1-score by matching n-grams between reference and generated text | T | |
| | CIDEr [102] | Evaluates n-gram similarity, emphasizing alignment across multiple reference texts | T | |
| | BERTScore [81] | Measures similarity by comparing embeddings of reference and generated text | T | |
| | METEOR [234] | Evaluates text similarity, considering precision, recall, word order, and synonym matches | T | |
| Human evaluation | Necessity [187] | Whether the response or prediction helps advance the diagnosis | T | CD [187] |
| | Acceptance [239] | The degree to which the response is accepted without any revision | T | |
| | Reliability [176] | The trustworthiness of the evidence in the response or prediction | T | |
| | Explainability [88] | Whether the response or prediction is explainable | T | |
| Human or LLM evaluation | Correctness [242] | Whether the response or prediction is medically correct | T | |
| | Consistency [99] | Whether the response or prediction is consistent with the ground truth or input | T | |
| | Clarity [80] | Whether the response or prediction is clearly articulated | T | |
| | Professionality [176] | The soundness of the evidence in light of domain knowledge | T | |
| | Completeness [187] | Whether the response or prediction is sufficient and comprehensive | T | |
| | Satisfaction [245] | Whether the response or prediction is satisfactory | T | |
| | Hallucination [99] | Whether the response contains information inconsistent with, or absent from, the preceding context | T | |
| | Relevance [80] | Whether the response or prediction is relevant to the context | T | |
| | Coherence [247] | The logical consistency of the response with the dialog history | T | |
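To make the ratio definitions above concrete, here is a minimal sketch of the core classification metrics (precision, recall, F1, and top-k accuracy). It assumes binary labels encoded as 0/1 for the first three metrics, and per-instance ranked label lists for top-k accuracy; the function names and data layout are illustrative, not from the review.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and their harmonic mean (F1) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def top_k_accuracy(y_true, ranked_preds, k):
    """Fraction of instances whose true label appears among the top k
    ranked predictions (e.g., the k most likely diagnoses)."""
    hits = sum(1 for t, preds in zip(y_true, ranked_preds) if t in preds[:k])
    return hits / len(y_true)


# Toy example: 4 cases, 2 of the 3 positive predictions are correct.
p, r, f1 = precision_recall_f1([1, 0, 1, 1], [1, 1, 0, 1])

# Toy example: the true diagnosis is in the top 2 for 1 of 2 cases.
acc2 = top_k_accuracy(
    ["flu", "covid"],
    [["cold", "flu", "covid"], ["flu", "cold", "covid"]],
    k=2,
)
```

For diagnostic tasks that return a ranked differential, top-k variants are often more informative than plain accuracy, since a clinically useful system may list the correct diagnosis second or third rather than first.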