Table 1 Comparison of running time and cost for rationale assessment among text-similarity metrics, LLM-w-Rationale, and expert evaluation on the MedThink-Bench dataset
| Evaluation | Time (minutes) | Cost ($) |
|---|---|---|
| Text-similarity metric | 9.0 ± 0.6 | 0 |
| LLM-w-Rationale (GPT-4o-mini) | 51.8 ± 5.6 | 0.8 ± 0.1 |
| LLM-w-Rationale (HuatuoGPT-o1-70B) | 310.7 ± 14.1 | 0 |
| LLM-w-Rationale (MedGemma-27B) | 74.8 ± 8.9 | 0 |
| Expert evaluation | 3708.3 ± 175.3 | NA |