Table 2 Two emergency room physicians' validations of GPT-4-generated USMLE questions and inter-rater agreement rates.

From: Evaluating prompt engineering on GPT-3.5’s performance in USMLE-style medical calculations and clinical scenarios generated by GPT-4

| Review category | Human 1 agreement | Human 2 agreement | Cohen's kappa |
|---|---|---|---|
| Appropriateness of question | 90% | 90% | 0.78 |
| Appropriateness of type | 100% | 100% | 1.00 |
| Appropriateness of subtype | 98% | 98% | 1.00 |
| Agreement on difficulty level | 58% (medium: 37.9%, hard: 37.9%, easy: 24.1%) | 44% (medium: 36.4%, easy: 36.4%, hard: 27.3%) | 0.41 |
| Appropriateness of clinical field | 90% | 96% | 1.00 |
| Correctness of answer | 76% | 76% | 0.78 |
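For reference, Cohen's kappa corrects the raw percent agreement for the agreement two raters would reach by chance given their marginal label frequencies; by the common Landis–Koch benchmarks, the 0.41 for difficulty level reflects only moderate agreement, while 0.78 is substantial. A minimal sketch of the computation in Python is below; the rating vectors are hypothetical and illustrative, not the study's item-level data.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the agreement expected by chance from each
    rater's marginal label frequencies."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical difficulty ratings for 10 questions (illustrative only;
# not the study's underlying data).
human_1 = ["easy", "easy", "medium", "medium", "hard",
           "hard", "easy", "medium", "hard", "easy"]
human_2 = ["easy", "easy", "medium", "hard", "hard",
           "hard", "medium", "medium", "hard", "easy"]
print(f"kappa = {cohen_kappa(human_1, human_2):.2f}")  # kappa = 0.70
```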