Table 2. Two emergency room physicians' validation of GPT-4-generated USMLE questions and inter-rater agreement rates.
| Review category | Human 1 agreement | Human 2 agreement | Cohen's kappa |
| --- | --- | --- | --- |
| Appropriateness of question | 90% | 90% | 0.78 |
| Appropriateness of type | 100% | 100% | 1.00 |
| Appropriateness of subtype | 98% | 98% | 1.00 |
| Agreement on difficulty level | 58% (medium: 37.9%, hard: 37.9%, easy: 24.1%) | 44% (medium: 36.4%, easy: 36.4%, hard: 27.3%) | 0.41 |
| Appropriateness of clinical field | 90% | 96% | 1.00 |
| Correctness of answer | 76% | 76% | 0.78 |
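For reference, the inter-rater statistic reported in the rightmost column is Cohen's kappa in its standard form (this is the conventional definition, not a formula taken from the source):

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

where \(p_o\) is the observed proportion of agreement between the two raters and \(p_e\) is the agreement expected by chance given each rater's marginal label frequencies. By common convention, values around 0.41-0.60 (e.g., the 0.41 for difficulty level) indicate moderate agreement, while values above 0.80 indicate near-perfect agreement.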