Table 1 Model performance comparison across tasks and evaluation methods

From: Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis

| User setting | Model | Triage level: Exact match | Triage level: Range | Specialty: Matched | Specialty: At least one | Diagnosis: Matched | Diagnosis: At least one | Average |
|---|---|---|---|---|---|---|---|---|
| General User | RAG-Assisted LLM | 64.10 | 78.20 | 77.12 | 86.35 | 69.43 | 80.85 | 76.01 |
| General User | Claude 3.5 Sonnet | 62.20 | 82.80 | 78.26 | 88.05 | 70.22 | 82.00 | 77.26 |
| General User | Claude 3 Sonnet | 58.35 | 74.40 | 78.10 | 87.70 | 70.17 | 81.55 | 75.05 |
| General User | Claude 3 Haiku | 57.70 | 71.80 | 77.86 | 87.10 | 67.39 | 79.60 | 73.58 |
| Clinical User | RAG-Assisted LLM | 65.75 | 77.15 | 77.28 | 86.45 | 69.77 | 81.70 | 76.35 |
| Clinical User | Claude 3.5 Sonnet | 64.40 | 82.40 | 78.86 | 88.55 | 70.26 | 82.10 | 77.76 |
| Clinical User | Claude 3 Sonnet | 61.65 | 74.55 | 77.72 | 87.15 | 70.51 | 82.05 | 75.61 |
| Clinical User | Claude 3 Haiku | 59.00 | 66.15 | 78.02 | 87.05 | 67.46 | 79.30 | 72.83 |

1. Performance is presented as accuracy (%) for all tasks and evaluation methods. A bold value indicates the best-performing model and an underlined value indicates the second-best-performing model, determined separately within each user setting (general or clinical user) and within each evaluation method (exact match/matched or range/at least one) for each prediction task (triage level, specialty, or diagnosis).
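
The best and second-best values described in the note above follow directly from the reported accuracies. As a minimal sketch (assuming the table is loaded into a pandas DataFrame with illustrative column names; the Average column is omitted because the note ties the highlighting to the individual prediction tasks), the per-setting ranking could be recovered like this:

```python
import pandas as pd

# Task columns of Table 1, one row per (user setting, model) pair.
rows = [
    ("General User",  "RAG-Assisted LLM",  64.10, 78.20, 77.12, 86.35, 69.43, 80.85),
    ("General User",  "Claude 3.5 Sonnet", 62.20, 82.80, 78.26, 88.05, 70.22, 82.00),
    ("General User",  "Claude 3 Sonnet",   58.35, 74.40, 78.10, 87.70, 70.17, 81.55),
    ("General User",  "Claude 3 Haiku",    57.70, 71.80, 77.86, 87.10, 67.39, 79.60),
    ("Clinical User", "RAG-Assisted LLM",  65.75, 77.15, 77.28, 86.45, 69.77, 81.70),
    ("Clinical User", "Claude 3.5 Sonnet", 64.40, 82.40, 78.86, 88.55, 70.26, 82.10),
    ("Clinical User", "Claude 3 Sonnet",   61.65, 74.55, 77.72, 87.15, 70.51, 82.05),
    ("Clinical User", "Claude 3 Haiku",    59.00, 66.15, 78.02, 87.05, 67.46, 79.30),
]
cols = ["user_setting", "model",
        "triage_exact_match", "triage_range",
        "specialty_matched", "specialty_at_least_one",
        "diagnosis_matched", "diagnosis_at_least_one"]
df = pd.DataFrame(rows, columns=cols)

# Rank models within each user setting, separately for every task/evaluation-method
# column: rank 1 corresponds to the bold (best) value, rank 2 to the underlined
# (second-best) value in the published table.
metric_cols = cols[2:]
ranks = df.groupby("user_setting")[metric_cols].rank(ascending=False, method="min")

for col in metric_cols:
    for rank, label in ((1, "best"), (2, "second best")):
        for _, row in df.loc[ranks[col] == rank].iterrows():
            print(f"{row['user_setting']:13s} | {col:24s} | {label:11s} | "
                  f"{row['model']} ({row[col]:.2f})")
```

For example, under this ranking the General User / triage-level exact-match column would report the RAG-Assisted LLM (64.10) as best and Claude 3.5 Sonnet (62.20) as second best, consistent with the values in the table.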