Fig. 4: Open-source model performance evaluation and low-rank adaptation (LoRA) fine-tuning results.

a Systematic evaluation of open-source models spanning 4B to 671B parameters, using micro-averaged metrics (macro-averaged results in Supplementary Fig. 4). Point sizes are proportional to model parameter counts. The quadrant chart shows the correlation between model size and performance, with QwQ 32B achieving the best balance of performance (F1 = 0.602) and efficiency while surpassing the human expert benchmark (F1 = 0.575, orange dashed line). Reasoning models (blue) and general-purpose models (green) are color-coded. b LoRA fine-tuning convergence curves for QwQ 32B, showing the original loss (blue) and smoothed loss (orange) trajectories across training epochs and demonstrating stable convergence within a reasonable computational budget. c Performance comparison before (blue) and after (red) LoRA fine-tuning on F1 score, recall, and precision for both validation datasets, using micro-averaged metrics with 95% confidence intervals from five repeated inference runs and patient-level bootstrap paired testing (macro-averaged results in Supplementary Fig. 5; detailed statistics, effect sizes, and CIs in Supplementary Tables 5 and 6). Box plots show the distribution of results; no statistically significant improvements were observed (all p > 0.05). Orange dashed lines represent the human expert benchmarks for each metric at the respective centers. Figure created with the Python matplotlib library; final composition assembled in Canva (www.canva.com).
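
The micro- versus macro-averaging distinction invoked in panels a and c can be illustrated with a minimal scikit-learn sketch. The label arrays below are hypothetical placeholders, not the study's data, and the study's actual label set and task structure may differ.

```python
# Minimal sketch of micro- vs macro-averaged F1 (cf. panels a and c).
# y_true/y_pred are hypothetical multi-class labels, not the study's data.
from sklearn.metrics import f1_score
import numpy as np

y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

# Micro-averaging pools all individual decisions before computing the
# metric, so frequent classes dominate; macro-averaging computes the
# metric per class and then averages, weighting rare classes equally.
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```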
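The LoRA fine-tuning of QwQ 32B referenced in panel b could be set up along the following lines with the Hugging Face peft library. This is a sketch under assumed settings: the rank, alpha, dropout, and target modules shown are illustrative defaults, not the hyperparameters reported by the authors.

```python
# Sketch of a LoRA setup for QwQ 32B (cf. panel b). All hyperparameters
# here are assumptions for illustration, not the paper's configuration.
# Loading a 32B model requires substantial GPU memory.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B")

lora_config = LoraConfig(
    r=16,                                 # assumed low-rank dimension
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed dropout on LoRA layers
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    bias="none",
    task_type="CAUSAL_LM",
)

# Wrap the base model so only the small LoRA adapter matrices are trained.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```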
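The caption does not state how the smoothed loss curve in panel b was produced; a common choice, assumed here, is an exponential moving average as used by TensorBoard's smoothing slider. The synthetic `raw_losses` series below is placeholder data.

```python
# Sketch of loss-curve smoothing (cf. panel b), assuming an exponential
# moving average; raw_losses is synthetic placeholder data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
steps = np.arange(500)
raw_losses = 2.0 * np.exp(-steps / 150) + 0.3 + rng.normal(0, 0.08, steps.size)

def ema_smooth(values, weight=0.9):
    """Exponential moving average of a 1-D sequence."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return np.array(smoothed)

plt.plot(steps, raw_losses, color="tab:blue", alpha=0.4, label="original loss")
plt.plot(steps, ema_smooth(raw_losses), color="tab:orange", label="smoothed loss")
plt.xlabel("training step")
plt.ylabel("loss")
plt.legend()
plt.show()
```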
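The patient-level bootstrap paired testing behind the confidence intervals in panel c can be sketched as follows. The exact resampling scheme is not specified in the caption; this version assumes patients are resampled with replacement and paired per-patient scores are compared before and after fine-tuning. All data are placeholders.

```python
# Sketch of a patient-level paired bootstrap (cf. panel c). The resampling
# scheme is assumed, and the per-patient scores are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(42)
n_patients = 100
score_before = rng.uniform(0.4, 0.7, n_patients)                  # hypothetical per-patient F1, base model
score_after = score_before + rng.normal(0.01, 0.05, n_patients)   # hypothetical scores after LoRA

n_boot = 10_000
diffs = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n_patients, n_patients)  # resample patients with replacement
    diffs[b] = score_after[idx].mean() - score_before[idx].mean()

ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
# Approximate two-sided bootstrap p-value: twice the smaller tail
# probability of the resampled differences relative to zero.
p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
print(f"mean diff = {diffs.mean():.4f}, "
      f"95% CI = [{ci_low:.4f}, {ci_high:.4f}], p ≈ {p:.3f}")
```

A 95% CI for the difference that spans zero (equivalently, p > 0.05) is consistent with the caption's report of no statistically significant improvement.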