Table 5: Clinician validation of LLM-generated diagnoses: inter-rater agreement. Union and intersection accuracies are reported jointly per clinician pair (Clinicians 1–2 and 3–4).

| Clinician | Inter-Rater Accuracy [%], Claude 3.5 Sonnet | Inter-Rater Accuracy [%], RAG-assisted LLM | Union Accuracy [%], Claude 3.5 Sonnet | Union Accuracy [%], RAG-assisted LLM | Intersection Accuracy [%], Claude 3.5 Sonnet | Intersection Accuracy [%], RAG-assisted LLM |
|---|---|---|---|---|---|---|
| Clinician 1 | 90.43 | 90.11 | 95.22 | 95.37 | 75.44 | 73.84 |
| Clinician 2 | 80.22 | 79.11 | | | | |
| Clinician 3 | 76.51 | 77.78 | 96.03 | 94.44 | 66.03 | 65.87 |
| Clinician 4 | 85.56 | 82.52 | | | | |
| Average | 83.18 | 82.38 | 95.62 | 94.91 | 70.74 | 69.86 |
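As a sanity check on the table above, the "Average" row can be recomputed from the per-clinician values. This is a minimal sketch, not part of the original analysis; the variable names are illustrative.

```python
# Per-clinician inter-rater accuracy values from Table 5.
claude_accuracy = [90.43, 80.22, 76.51, 85.56]   # Claude 3.5 Sonnet
rag_accuracy = [90.11, 79.11, 77.78, 82.52]      # RAG-assisted LLM

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

# These reproduce the reported averages of 83.18 and 82.38.
print(round(mean(claude_accuracy), 2))
print(round(mean(rag_accuracy), 2))
```

The union and intersection averages can be checked the same way by averaging the two pairwise values in each of those columns.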