Fig. 4: Comparison of performance between Fu-LLM (finetune_qwen2_7b) and five other popular public LLMs (DeepSeek-v3 (2024_12_26), GPT-3.5-turbo (2025_01_25), GPT-4o (2024_11_20), Claude 3.5-sonnet (2024_10_22) and Gemini-2.0-pro (2025_02_05)) on the study dataset.

GPT generative pretrained transformer; NPV negative predictive value; PPV positive predictive value. Within-group differences in overall agreement, sensitivity and specificity between Fu-LLM and the five other public LLMs were assessed using Cochran’s Q test. NPV and PPV of Fu-LLM were compared with those of the five other public LLMs using the χ² test. All statistical tests were two-sided, with significance set at p < 0.05. p values were adjusted using the Bonferroni correction. Source data are provided as a Source data file.
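
As a rough illustration of the procedure named in this legend, the Python sketch below computes Cochran's Q for paired binary outcomes (rows are cases, columns are models, 1 means the model's output agreed with the reference) and applies a Bonferroni adjustment to the resulting p value. The function names, the toy matrix, and the number of comparisons are hypothetical and are not taken from the study, whose software and exact correction scheme are not stated here.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, sf = 1.0, math.exp(-y)              # closed form for df = 2
    else:
        a, sf = 0.5, math.erfc(math.sqrt(y))   # closed form for df = 1
    # Step the shape parameter up to df / 2 with the incomplete-gamma recurrence.
    while a < df / 2.0:
        sf += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return sf


def cochran_q(outcomes):
    """Cochran's Q test for a cases-by-models matrix of 0/1 outcomes.

    Returns (Q, df, p). If no case discriminates between models,
    Q is undefined; this sketch reports Q = 0 and p = 1 in that situation.
    """
    k = len(outcomes[0])                        # number of models
    row_totals = [sum(row) for row in outcomes]
    col_totals = [sum(col) for col in zip(*outcomes)]
    n_total = sum(row_totals)
    denom = k * n_total - sum(r * r for r in row_totals)
    if denom == 0:
        return 0.0, k - 1, 1.0
    numer = (k - 1) * (k * sum(c * c for c in col_totals) - n_total ** 2)
    q = numer / denom
    return q, k - 1, chi2_sf(q, k - 1)


def bonferroni(p, n_comparisons):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * n_comparisons)


# Toy data (hypothetical): 4 cases scored for 3 models.
toy = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0]]
q_stat, dof, p_raw = cochran_q(toy)
p_adj = bonferroni(p_raw, n_comparisons=5)      # 5 is an assumed count
```

In this sketch the adjusted p value would be compared against the 0.05 threshold given in the legend; the χ² comparison of NPV and PPV is a separate test and is not covered here.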