Table 2. Two emergency room physicians' validation of GPT-4-generated USMLE questions and inter-rater agreement rates.
| Review category | Human 1 agreement | Human 2 agreement | Cohen's kappa |
| --- | --- | --- | --- |
| Appropriateness of question | 90% | 90% | 0.78 |
| Appropriateness of type | 100% | 100% | 1.00 |
| Appropriateness of subtype | 98% | 98% | 1.00 |
| Agreement on difficulty level | 58% (medium: 37.9%, hard: 37.9%, easy: 24.1%) | 44% (medium: 36.4%, easy: 36.4%, hard: 27.3%) | 0.41 |
| Appropriateness of clinical field | 90% | 96% | 1.00 |
| Correctness of answer | 76% | 76% | 0.78 |
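For reference, the inter-rater statistic reported in the rightmost column is Cohen's kappa in its standard form (this is the conventional definition, not a formula taken from the source):

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

where \(p_o\) is the observed proportion of agreement between the two raters and \(p_e\) is the agreement expected by chance given each rater's marginal label frequencies. By common convention, values around 0.41-0.60 (e.g., the 0.41 for difficulty level) indicate moderate agreement, while values above 0.80 indicate near-perfect agreement.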