Table 2 Overview of evaluation metrics for diagnostic tasks

From: Large language models for disease diagnosis: a scoping review

| Type | Evaluation metric | Purpose | Scenario | Representative tasks |
| --- | --- | --- | --- | --- |
| Automated evaluation | Accuracy [216] | The ratio of correct predictions to total predictions | G | DD [154], DDx [217], CD [218], RP [219], DRG [105], MHD [220] |
| | Precision [55] | The ratio of true positives to all positive predictions | G | DD [55], CD [221], MIC [44], RP [219], DRG [105] |
| | Recall [55] | The ratio of true positives to all actual positive cases | G | DD [55], CD [221], RP [219], DRG [105] |
| | F1 [133] | The harmonic mean of precision and recall | G | DD [55], DDx [222], CD [221], MIC [223], RP [219], DRG [105] |
| | AUC [224] | The area under the receiver operating characteristic (ROC) curve | G | DD [59], CD [225], MIC [226], RP [219], DRG [105], MHD [227] |
| | AUPR [228] | The area under the precision-recall curve | G | DD [229], MIC [228], RP [230], DRG [229] |
| | Top-k accuracy [140] | The fraction of instances whose true label appears among the top k predictions | G | DD [140], DDx [168] |
| | Top-k precision [60] | The ratio of true positives to all positive predictions within the top k predictions | G | DD [140], DDx [222] |
| | Top-k recall [231] | The ratio of true positives within the top k predictions to all actual positive cases | G | DD [140], DDx [222] |
| | Mean squared error [142] | The average of the squared differences between predicted and actual values | G | DD [142], RP [141] |
| | Mean absolute error [141] | The average of the absolute differences between predicted and actual values | G | DD [142], RP [141] |
| | Cohen's κ [232] | The agreement between predicted and actual scores, corrected for chance | G | DD [232] |
| | BLEU [115] | N-gram precision between generated and reference text | T | DD [233], CD [234], MIC [235], DRG [115] |
| | ROUGE [187] | N-gram F1-score between generated and reference text | T | DD [233], CD [187], MIC [235], DRG [115] |
| | CIDEr [102] | N-gram similarity, emphasizing consensus across multiple reference texts | T | CD [102], MIC [236], DRG [237] |
| | BERTScore [81] | Similarity between the embeddings of generated and reference text | T | DD [238], DDx [143], CD [187], DRG [87] |
| | METEOR [234] | Text similarity based on precision, recall, word order, and synonym matches | T | DDx [143], CD [234], MIC [236], DRG [115] |
| Human evaluation | Necessity [187] | Whether the response or prediction helps advance the diagnosis | T | CD [187] |
| | Acceptance [239] | The degree to which the response is accepted without any revision | T | DD [54], CD [240] |
| | Reliability [176] | The trustworthiness of the evidence in the response or prediction | T | DD [144], CD [176] |
| | Explainability [88] | Whether the response or prediction is explainable | T | DDx [241], CD [218] |
| Human or LLM evaluation | Correctness [242] | Whether the response or prediction is medically correct | T | DD [134], DDx [217], CD [187], DRG [243], MHD [176] |
| | Consistency [99] | Whether the response or prediction is consistent with the ground truth or input | T | DD [108], DDx [241], CD [99], MHD [176] |
| | Clarity [80] | Whether the response or prediction is clearly expressed | T | DD [149], CD [244] |
| | Professionality [176] | The rationality of the evidence with respect to domain knowledge | T | CD [149], MHD [176] |
| | Completeness [187] | Whether the response or prediction is sufficient and comprehensive | T | DDx [143], CD [218], DRG [243] |
| | Satisfaction [245] | Whether the response or prediction is satisfying | T | CD [240], DRG [237] |
| | Hallucination [99] | Whether the response contains information inconsistent with or absent from the preceding context | T | DDx [222], CD [218], DRG [246] |
| | Relevance [80] | Whether the response or prediction is relevant to the context | T | CD [80], DRG [246] |
| | Coherence [247] | Logical consistency with the dialog history | T | CD [100], DRG [190] |

1. Since diagnostic tasks may include explanations alongside the predicted diagnosis, existing studies also evaluated these explanatory descriptions. We categorized the metrics by application scenario: G denotes metrics that require a ground-truth diagnosis for evaluation, while T denotes metrics applicable to textual descriptions (e.g., generated explanations). Note that we present only a selection of representative diagnostic tasks from the included papers: disease diagnosis (DD), differential diagnosis (DDx), conversational diagnosis (CD), medical image classification (MIC), risk prediction (RP), mental health disorder detection (MHD), and diagnostic report generation (DRG). Illustrative computation sketches for the automated metrics follow this note.
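To make the ground-truth-based (G) metrics concrete, here is a minimal sketch using scikit-learn; the review does not prescribe any particular implementation, and the labels and scores below are toy values invented purely for illustration.

```python
# Minimal sketch of the ground-truth (G) metrics using scikit-learn.
# All data values are toy examples, not from the review.
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, top_k_accuracy_score,
    mean_squared_error, mean_absolute_error, cohen_kappa_score,
)

y_true = np.array([1, 0, 1, 1, 0, 1])               # ground-truth diagnoses (binary)
y_pred = np.array([1, 0, 0, 1, 0, 1])               # hard predictions
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.3, 0.7])  # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))            # area under the ROC curve
print("AUPR     :", average_precision_score(y_true, y_score))  # area under the PR curve
print("Cohen's k:", cohen_kappa_score(y_true, y_pred))         # chance-corrected agreement

# Top-k accuracy in a multi-class setting: a case counts as correct if
# the true label appears among the k highest-scoring classes.
y_true_mc = np.array([2, 0, 1])
y_score_mc = np.array([[0.1, 0.3, 0.6],
                       [0.5, 0.4, 0.1],
                       [0.2, 0.5, 0.3]])
print("Top-2 accuracy:", top_k_accuracy_score(y_true_mc, y_score_mc, k=2))

# MSE/MAE apply to continuous targets (e.g., risk scores); here we
# simply reuse the toy probabilities against the binary labels.
print("MSE:", mean_squared_error(y_true, y_score))
print("MAE:", mean_absolute_error(y_true, y_score))
```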
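scikit-learn covers top-k accuracy directly, but not top-k precision or top-k recall over a ranked differential-diagnosis list. The hand-rolled sketch below follows the definitions given in the table; the function names and example diagnoses are my own illustrations, not from the reviewed papers.

```python
# Hand-rolled top-k precision/recall for a ranked list of candidate
# diagnoses, following the definitions in the table above.
def top_k_precision(ranked_preds, true_labels, k):
    """Fraction of the top-k predictions that are true diagnoses."""
    top_k = ranked_preds[:k]
    return sum(p in true_labels for p in top_k) / k

def top_k_recall(ranked_preds, true_labels, k):
    """Fraction of the true diagnoses recovered within the top-k predictions."""
    top_k = ranked_preds[:k]
    return sum(t in top_k for t in true_labels) / len(true_labels)

ranked = ["pneumonia", "bronchitis", "asthma", "influenza"]  # model's ranked DDx
truth = {"pneumonia", "influenza"}                            # ground-truth diagnoses
print(top_k_precision(ranked, truth, k=3))  # 1/3: one of the top 3 is correct
print(top_k_recall(ranked, truth, k=3))     # 1/2: one of the two true labels is in the top 3
```

With a single true diagnosis per case, top-k recall reduces to top-k accuracy, which is why the two are often reported interchangeably for DDx tasks.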
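For the text-based (T) overlap metrics, a minimal sketch assuming the nltk and rouge-score packages (e.g., `pip install nltk rouge-score`); the reference and generated strings are toy examples.

```python
# Minimal sketch of BLEU and ROUGE on a single reference/hypothesis pair.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the patient likely has community acquired pneumonia"
generated = "the patient probably has community acquired pneumonia"

# BLEU: n-gram precision of the generated text against the reference;
# smoothing avoids zero scores on short sentences.
bleu = sentence_bleu(
    [reference.split()], generated.split(),
    smoothing_function=SmoothingFunction().method1,
)
print("BLEU      :", bleu)

# ROUGE: n-gram overlap F1; rouge1 = unigram overlap, rougeL = longest
# common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)
print("ROUGE-1 F1:", scores["rouge1"].fmeasure)
print("ROUGE-L F1:", scores["rougeL"].fmeasure)
```

BERTScore and METEOR can be computed analogously, e.g., with the bert-score package and `nltk.translate.meteor_score` respectively, and CIDEr implementations are commonly borrowed from image-captioning toolkits such as pycocoevalcap; the choice of toolkit is an implementation detail the review leaves open.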