Fig. 1 | Scientific Reports

Fig. 1

From: The pitfalls of multiple-choice questions in generative AI and medical education

LLM performance on FreeMedQA. Performance of gpt-4o-2024-08-06, gpt-3.5-turbo-0125, and llama3-70B-chat on FreeMedQA (n=10,278 for both MC and FR), and of medical students on sample forms from FreeMedQA (n=175). All three models performed worse in the free-response (FR) format than in the multiple-choice (MC) format, with a 39.43% average drop in performance. Medical students showed a 22.29% decline in performance in the transition from multiple-choice to free-response. For the AI models, error bars represent the standard deviation across five independent experimental runs. For the medical students, error bars represent the standard deviation of their scores about their mean score.
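
The error bars and the drop in performance described in the caption can be illustrated with a minimal sketch. It makes several assumptions: the accuracies are placeholder numbers and not values from the paper, the function names `summarize_runs` and `performance_drop` are hypothetical, and the sample standard deviation is used, although the caption does not say whether the sample or population form was applied. The caption also does not state whether the reported drop is absolute (percentage points) or relative, so the sketch computes both.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies."""
    return mean(accuracies), stdev(accuracies)


def performance_drop(mc_acc, fr_acc):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = mc_acc - fr_acc
    relative = 100.0 * absolute / mc_acc
    return absolute, relative


# Placeholder accuracies (percent) for five hypothetical runs.
# These are NOT values from the paper.
mc_runs = [80.0, 81.0, 79.0, 80.5, 79.5]
fr_runs = [40.0, 41.0, 39.0, 40.5, 39.5]

mc_mean, mc_sd = summarize_runs(mc_runs)  # bar height and error bar for MC
fr_mean, fr_sd = summarize_runs(fr_runs)  # bar height and error bar for FR
abs_drop, rel_drop = performance_drop(mc_mean, fr_mean)

print(f"MC: {mc_mean:.2f} +/- {mc_sd:.2f}")
print(f"FR: {fr_mean:.2f} +/- {fr_sd:.2f}")
print(f"Absolute drop: {abs_drop:.2f} points, relative drop: {rel_drop:.2f}%")
```

With these placeholder inputs, the absolute and relative drops differ substantially, which is why the distinction matters when reading a figure like this one.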
