Table 2 Overview of evaluation metrics for diagnostic tasks
From: Large language models for disease diagnosis: a scoping review
| Type | Evaluation metric | Purpose | Scenario | Representative task |
|---|---|---|---|---|
| Automated evaluation | Accuracy [216] | The ratio of correct predictions to the total number of predictions | G | |
| | Precision [55] | The ratio of true positives to the total number of positive predictions | G | |
| | Recall [55] | The ratio of true positives to the total number of actual positive cases | G | |
| | F1 [133] | The harmonic mean of precision and recall | G | |
| | AUC [224] | The area under the receiver operating characteristic curve | G | |
| | AUPR [228] | The area under the precision-recall curve | G | |
| | Top-k accuracy [140] | The ratio of instances whose true label appears in the top k predictions to the total number of instances | G | |
| | Top-k precision [60] | The ratio of true positives to the total number of positive predictions within the top k predictions | G | |
| | Top-k recall [231] | The ratio of true positives within the top k predictions to the total number of actual positive cases | G | |
| | Mean squared error [142] | The average of the squared differences between predicted and actual values | G | |
| | Mean absolute error [141] | The average of the absolute differences between predicted and actual values | G | |
| | Cohen's κ [232] | Measures the agreement between predicted and actual scores | G | DD [232] |
| | BLEU [115] | Calculates precision by matching n-grams between reference and generated text | T | |
| | ROUGE [187] | Calculates an F1-score by matching n-grams between reference and generated text | T | |
| | CIDEr [102] | Evaluates n-gram similarity, emphasizing alignment across multiple reference texts | T | |
| | BERTScore [81] | Measures similarity by comparing embeddings of reference and generated text | T | |
| | METEOR [234] | Evaluates text similarity, considering precision, recall, word order, and synonym matches | T | |
| Human evaluation | Necessity [187] | Whether the response or prediction helps advance the diagnosis | T | CD [187] |
| | Acceptance [239] | The degree to which the response is accepted without any revision | T | |
| | Reliability [176] | The trustworthiness of the evidence in the response or prediction | T | |
| | Explainability [88] | Whether the response or prediction is explainable | T | |
| Human or LLM evaluation | Correctness [242] | Whether the response or prediction is medically correct | T | |
| | Consistency [99] | Whether the response or prediction is consistent with the ground truth or input | T | |
| | Clarity [80] | Whether the response or prediction is clearly articulated | T | |
| | Professionality [176] | The soundness of the evidence in light of domain knowledge | T | |
| | Completeness [187] | Whether the response or prediction is sufficient and comprehensive | T | |
| | Satisfaction [245] | Whether the response or prediction is satisfactory | T | |
| | Hallucination [99] | Whether the response contains information inconsistent with, or absent from, the preceding context | T | |
| | Relevance [80] | Whether the response or prediction is relevant to the context | T | |
| | Coherence [247] | The logical consistency of the response with the dialog history | T | |
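To make the ratio definitions above concrete, here is a minimal sketch of the core classification metrics (precision, recall, F1, and top-k accuracy). It assumes binary labels encoded as 0/1 for the first three metrics, and per-instance ranked label lists for top-k accuracy; the function names and data layout are illustrative, not from the review.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and their harmonic mean (F1) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def top_k_accuracy(y_true, ranked_preds, k):
    """Fraction of instances whose true label appears among the top k
    ranked predictions (e.g., the k most likely diagnoses)."""
    hits = sum(1 for t, preds in zip(y_true, ranked_preds) if t in preds[:k])
    return hits / len(y_true)


# Toy example: 4 cases, 2 of the 3 positive predictions are correct.
p, r, f1 = precision_recall_f1([1, 0, 1, 1], [1, 1, 0, 1])

# Toy example: the true diagnosis is in the top 2 for 1 of 2 cases.
acc2 = top_k_accuracy(
    ["flu", "covid"],
    [["cold", "flu", "covid"], ["flu", "cold", "covid"]],
    k=2,
)
```

For diagnostic tasks that return a ranked differential, top-k variants are often more informative than plain accuracy, since a clinically useful system may list the correct diagnosis second or third rather than first.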