Table 5 Clinician validation of LLM-generated diagnoses: inter-rater agreement

From: Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis

| Clinician | Inter-Rater Accuracy [%] (Claude 3.5 Sonnet) | Inter-Rater Accuracy [%] (RAG-assisted LLM) | Inter-Rater Union Accuracy [%] (Claude 3.5 Sonnet) | Inter-Rater Union Accuracy [%] (RAG-assisted LLM) | Inter-Rater Intersection Accuracy [%] (Claude 3.5 Sonnet) | Inter-Rater Intersection Accuracy [%] (RAG-assisted LLM) |
|---|---|---|---|---|---|---|
| Clinician 1 | 90.43 | 90.11 | 95.22 | 95.37 | 75.44 | 73.84 |
| Clinician 2 | 80.22 | 79.11 | | | | |
| Clinician 3 | 76.51 | 77.78 | 96.03 | 94.44 | 66.03 | 65.87 |
| Clinician 4 | 85.56 | 82.52 | | | | |
| Average | 83.18 | 82.38 | 95.62 | 94.91 | 70.74 | 69.86 |

  1. Average inter-rater agreement accuracy [%] for Claude 3.5 Sonnet and the RAG-assisted LLM compared against clinician evaluations. Inter-Rater Accuracy is reported per clinician, while Inter-Rater Union and Intersection Accuracy are reported per clinician pair: a single Union value and a single Intersection value for Clinicians 1 and 2, and likewise for Clinicians 3 and 4.
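
The averaging convention described in the footnote can be checked with a short sketch. The script below is not part of the paper; it simply takes the per-clinician and per-pair values from the table above and averages each model's column to reproduce the "Average" row.

```python
# Sketch (not from the paper): reproduce the "Average" row of Table 5
# by averaging each model's column across clinicians or clinician pairs.

# Per-clinician Inter-Rater Accuracy [%]: (Claude 3.5 Sonnet, RAG-assisted LLM)
inter_rater = {
    "Clinician 1": (90.43, 90.11),
    "Clinician 2": (80.22, 79.11),
    "Clinician 3": (76.51, 77.78),
    "Clinician 4": (85.56, 82.52),
}

# Union and Intersection Accuracy [%]: one value per clinician pair
union = {"Clinicians 1+2": (95.22, 95.37), "Clinicians 3+4": (96.03, 94.44)}
intersection = {"Clinicians 1+2": (75.44, 73.84), "Clinicians 3+4": (66.03, 65.87)}

def column_means(rows):
    """Average each model's column across the given rows."""
    claude, rag = zip(*rows)
    return (sum(claude) / len(claude), sum(rag) / len(rag))

print("Inter-Rater Accuracy:  ", column_means(inter_rater.values()))    # ~ (83.18, 82.38)
print("Union Accuracy:        ", column_means(union.values()))          # ~ (95.62, 94.91)
print("Intersection Accuracy: ", column_means(intersection.values()))   # ~ (70.74, 69.86)
```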