Table 1 Comparison of running time and cost for rationale assessment among text-similarity metrics, LLM-w-Rationale, and expert evaluation on the MedThink-Bench dataset

From: Automating expert-level medical reasoning evaluation of large language models

Evaluation                            Time (minutes)    Cost ($)
Text-similarity metrics               9.0 ± 0.6         0
LLM-w-Rationale (GPT-4o-mini)         51.8 ± 5.6        0.8 ± 0.1
LLM-w-Rationale (HuatuoGPT-o1-70B)    310.7 ± 14.1      0
LLM-w-Rationale (MedGemma-27B)        74.8 ± 8.9        0
Expert evaluation                     3708.3 ± 175.3    NA

  1. Running time for the text-similarity metrics is the cumulative time across all evaluated metrics. Costs cover API service fees only; GPU running costs are excluded because the experiments ran on local GPUs on a Linux server and incurred no additional hardware operating expenses. For a sense of the relative scale of the running times, see the sketch following these notes.
  2. NA, not applicable.
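
To make the magnitude of the differences concrete, the following minimal Python sketch computes how many times faster each automated method is than expert evaluation, using only the mean running times reported in Table 1. The dictionary and variable names are illustrative; the reported ± standard deviations are ignored, so these are point estimates of the speedup, not intervals.

```python
# Minimal sketch: relative speed of each automated rationale-assessment
# method versus expert evaluation, using the mean running times (minutes)
# from Table 1. Means only; the reported ± standard deviations are ignored.

mean_minutes = {
    "Text-similarity metrics": 9.0,
    "LLM-w-Rationale (GPT-4o-mini)": 51.8,
    "LLM-w-Rationale (HuatuoGPT-o1-70B)": 310.7,
    "LLM-w-Rationale (MedGemma-27B)": 74.8,
}
expert_minutes = 3708.3  # mean expert evaluation time from Table 1

for method, minutes in mean_minutes.items():
    speedup = expert_minutes / minutes
    print(f"{method}: {speedup:.0f}x faster than expert evaluation")
```

On the Table 1 means, this gives roughly 412x for the text-similarity metrics, 72x for GPT-4o-mini, 12x for HuatuoGPT-o1-70B, and 50x for MedGemma-27B.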