Abstract
The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and Llama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 × 10⁻⁵), which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70 percentage points above random chance (p = 0.002), with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs the free-response performance was near zero. Our results highlight how medical MCQ benchmarks overestimate the capabilities of LLMs in medicine and, more broadly, the potential for improving both human and machine assessments using LLM-evaluated free-response questions.
Introduction
In recent years, large language models have become more prevalent in medical research1,2,3,4,5. These models have purportedly demonstrated high performance across a variety of fields in medicine, and easily passed formal assessments of medical knowledge such as the United States Medical Licensing Exam2,3,4. One of the more prominent benchmarks used to report the performance of LLMs is the MultiMedQA, which encompasses questions from many fields and stages of medical training2,6,7,8. Notably, this benchmark and others are composed of multiple-choice questions, which may present limitations for the accurate assessment of LLMs9,10,11. While recent works such as CRAFT-MD have focused on converting these multiple-choice questions into more real-world assessments involving multi-turn conversations, there are still no rigorous evaluations of the quality of these multiple-choice benchmarks themselves12.
We hypothesized that existing multiple-choice question (MCQ) benchmarks are poor metrics for assessing the medical knowledge and capabilities of LLMs. To test this, we developed a benchmark of paired free-response and multiple-choice questions and developed a technique for automatically assessing free-response answers. We then compared the performance of GPT-4o13, GPT-3.514, and Llama-3-70B15 to answer questions when presented in both multiple-choice and free-response formats. We further studied the performance of these LLMs when the question stems were progressively masked in both free-response and multiple-choice formats. We hypothesized that multiple-choice performance should approach random chance at 25% as information is increasingly lost to masking. Lastly, we conducted human evaluations with medical students to establish human baselines and provide context for LLM results.
Results
FreeMedQA creation
Starting with 14,965 candidate questions from the MultiMedQA and using an LLM-based pipeline (see Methods, Extended Data Fig. 1), we created 10,278 questions with paired free-response and MCQ versions (FreeMedQA). We also built an evaluative method using GPT-4o as a judge to score free-response answers based on MCQ answers (Extended Data Fig. 2).
Evaluation of LLMs’ performance in free-response compared to multiple-choice
Using this novel benchmark, we found that GPT-4o, GPT-3.5, and Llama-3-70B-Chat showed significant drops in performance when evaluated using a free-response question format as opposed to MCQs. On average, the models exhibited a 39.43% (combined p = 1.3 × 10⁻⁵) absolute drop in performance from multiple-choice to free-response answering. Llama 3 exhibited the greatest absolute drop of 46.59% (relative drop of 59.08%; p = 0.006), followed by GPT-4o with an absolute drop of 37.50% (relative drop of 43.23%; p = 0.004), and GPT-3.5 with the smallest drop of 34.20% (relative drop of 56.51%; p = 0.004) (Fig. 1).
LLM performance on FreeMedQA. Performance of gpt-4o-2024-08-06, gpt-3.5-turbo-0125, and llama3-70B-chat on FreeMedQA (n = 10,278 for both MC and FR), as well as medical students on sample forms from FreeMedQA (n = 175). All three models showed degraded performance on free-response questions compared to multiple-choice, with a 39.43% average drop in performance. Medical students showed a 22.29% decline in performance when transitioning from multiple-choice to free-response. For the AI models, error bars represent the standard deviation across five independent experimental runs; for humans, error bars represent the standard deviation of the medical students' scores about their mean.
Evaluation of medical students’ performance in free-response compared to multiple-choice
To contextualize these findings, we also assessed medical trainees using a subset of 175 unique questions from FreeMedQA MC and a subset of 175 unique questions from FreeMedQA FR, for a total sample size of 350 questions. We found that senior medical students experienced a 22.29% decrease in performance when transitioning from multiple-choice to free-response questions (p = 0.008), with scores dropping from 39.43% on multiple-choice to 17.79% on free-response questions (Fig. 1).
Evaluation of LLMs’ performance with masked inputs
To investigate the LLMs' relative performance further, we performed a masking study in which we progressively masked the question stems of FreeMedQA questions. For the multiple-choice component, the answer options were presented without any masking. All models deteriorated in both the multiple-choice and free-response categories as increasing portions of the input were masked. A notable discrepancy appeared at 100% masking, where multiple-choice performance was on average 6.70 percentage points above the random-chance level of 25% across all models, implying that the LLMs rely on pattern recognition over the answer options themselves. GPT-4o showed the greatest deviation from random chance, with an accuracy of 37.34%, 12.34 percentage points above chance, despite complete masking of the inputs (p = 0.031). Comparatively, across all models, free-response performance declined to 0.15%, where the deviation from 0% reflects noise in the evaluative model rather than genuinely correct answers (Fig. 2).
Discussion
We present a straightforward revision of existing multiple-choice benchmarks for medical LLMs, converting them to free-response questions, that may aid researchers and clinicians in elucidating LLMs' strengths and weaknesses. Our approach uses a highly challenging dataset, on which even human medical trainees struggle, averaging 39% on the multiple-choice questions, thus providing a more rigorous test of a model's clinical reasoning. We find that medical LLMs have learned processes for determining the answer in a multiple-choice setting that are independent of their ability to answer the question being asked, as their performance declines significantly but remains above chance even when the entire question stem is masked. Therefore, medical LLM performance on benchmarks composed of multiple-choice questions does not reflect genuine understanding of medical concepts in a more general setting, and reformulating LLM assessments as free-response questions or multi-turn dialogues12 seems prudent.
All models also showed a degradation in performance in the free-response format compared to the multiple-choice format. We attribute this to learned mechanisms that LLMs use to recognize the answer relying solely on the answer options. We observed very little variability in correctness for either question format, which suggests that for these fact-based tasks the models tend to have a high degree of certainty. For the medical students, despite the difference in form sampling, the observed decline in performance from multiple-choice to free-response aligns with established psychological principles, suggesting that the format, rather than random variation, was the primary driver of the observed effect. Interestingly, the performance decrease of models is similar to that of humans, suggesting that some of this gap is due to a more general test-taking strategy leveraged by both medical LLMs and medical trainees. In fact, a parallel could be drawn from the different levels of learning witnessed in humans via Bloom's Taxonomy16 to similar mechanisms in LLM learning. While LLMs may excel at multiple-choice tasks by leveraging the provided options as cues, this does not necessarily mean they possess a true understanding of the subject matter, as their performance is less reliable on free-recall questions that require them to generate the correct answer without any assistance. Mere recognition of the correct choice becomes a more feasible task than recalling all associated information17.
While both humans and LLMs show a drop in performance, the greater degradation of LLMs compared to humans hints that LLMs may be better than humans at exploiting test-taking strategies such as testwiseness, chance guessing, and cueing. Humans may be burdened by cognitive load or fatigue, preventing them from performing at their best consistently. In contrast, the model may be more capable of effectively “reverse engineering” the question, using the provided options to guide its response and eliminate incorrect choices, a feat that becomes infeasible when the options are removed.
Passing a test is a necessary but not sufficient condition for competence, as it does not account for the collaborative and contextual nature of actual practice18. Doctors operate within a system of checks and balances with other professionals and technology, a factor that is entirely absent in a test-based assessment. Other recent works have echoed this concern over the use of multiple-choice questions for assessing medical LLMs19. The recently released CRAFT-MD benchmark adapts medical multiple-choice questions to multi-turn dialogues, which poses an interesting alternative to free-response questions12. Also, calls to consider evaluating medical LLMs using processes similar to medical trainees seem to be increasingly justified in light of such results20.
Medical exams, while a solid test of factual knowledge, are a fundamentally incomplete way to gauge a model's true clinical readiness. They fail to capture the complex, collaborative nature of real clinical practice, in which doctors rely on a network of colleagues and technology. Most importantly, these benchmarks cannot measure the emotional intelligence and empathy that are central to patient care. This is particularly relevant given that these models are known to be easily influenced by misleading cues that are common in clinical contexts21,22. This raises the concern that LLMs may be optimized for a benchmark that is detached from the reality of clinical practice and the actual exams medical students face.
This study is not without limitations. To maintain the integrity of the study, we removed 31.32% of the questions in the MultiMedQA that required knowledge of the answer options to identify the correct answer. These questions were characterized by prompts that relied heavily on the multiple-choice options. This decreased the size of our derived FreeMedQA benchmark and also raised more general concerns over test question quality. Our study was conducted only on a subset of one popular medical benchmark, but there are other popular medical benchmarks used to report LLM performance23. This study was restricted to the English language, but medical benchmarks in other languages also exist24. Our FreeMedEval approach utilizes GPT-4o, which, while efficient, is stochastic and not free of error, although GPT-4 is commonly utilized in this manner in other studies12. Our findings also indicate differential performance of Llama-3-70B, which performs closer to GPT-4o on multiple-choice questions but drops to GPT-3.5 levels on free-response tasks. We were unable to provide an explanation for this observation, since it requires detailed knowledge of model-specific training methodology, which is unfortunately not available for industry-grade models. Furthermore, our filtering methodology, which relies on an LLM to determine question answerability, also presents a limitation. Our FreeMedEval approach uses GPT-4o for evaluation, which, by its stochastic nature, can introduce noise, predominantly in the form of false positives, into the scoring process. As demonstrated by the human-LLM disagreement, our filter may have retained a notable number of questions that are inherently unsolvable in a free-response format, which could partially explain the performance decrement observed in our models. The evaluation of natural language generation is an open problem in natural language processing, which we leave for future work. For future avenues of investigation, this study could be conducted on a dataset specific to a medical specialty25,26. The study could also be scaled to additional modalities or to include multiple popular datasets27,28,29. Lastly, further efforts to improve the evaluation of natural language generation and free-response answers are sorely needed.
Conclusion
Our study and other recent studies show that the evaluation of medical LLMs and medical AI in general is clearly an open problem. The push to develop and introduce AI models into clinical care needs to be met with equally strong and creative approaches to appropriately evaluate these models to ensure their safety and reliability.
Methods
Compliance and ethical approval
All methods were performed in accordance with relevant guidelines and regulations. The study’s experimental protocols were approved by NYU Langone Medical Center, and informed consent was verbally obtained from all subjects.
FreeMedQA filtering
We utilized the MultiMedQA dataset acquired from the Hugging Face Hub30. For each question, we prompted GPT-4o to determine whether it could be answered without the multiple-choice options included. We employed few-shot prompting, a technique where a model is given a few examples to guide its response, by providing ten manually curated examples of correctly sorted questions with each call to the model. This process yielded a subset of 10,278 MultiMedQA questions that were deemed appropriate to convert to a free-response format. This dataset can be found under “Supplementary Materials.” The multiple-choice versions of these questions composed the MC portion of our new FreeMedQA dataset, also found under “Supplementary Materials” (Extended Data Fig. 1).
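A minimal sketch of this filtering call is shown below (Python, using the OpenAI chat-completions API). The system prompt, the YES/NO reply format, and the two illustrative few-shot examples are our own assumptions; the paper specifies only that ten manually curated examples accompanied each call.

```python
# Sketch of the answerability filter (assumed prompt wording and few-shot examples).
from openai import OpenAI

client = OpenAI()

# Hypothetical few-shot examples: (question stem, "YES" if answerable without options).
FEW_SHOT = [
    ("A 45-year-old man presents with crushing chest pain radiating to the left arm. "
     "What is the most likely diagnosis?", "YES"),
    ("Which of the following enzymes is deficient in the patient described above?", "NO"),
    # ...the actual pipeline supplied ten curated examples per call.
]

def answerable_without_options(question: str) -> bool:
    """Ask GPT-4o whether the question stem can be answered without its answer choices."""
    messages = [{
        "role": "system",
        "content": ("Decide whether each medical question can be answered without "
                    "seeing its multiple-choice options. Reply with YES or NO only."),
    }]
    for example, label in FEW_SHOT:
        messages.append({"role": "user", "content": example})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": question})

    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # model version reported in Fig. 1
        messages=messages,
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```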
Filtering quality control
To evaluate GPT-4o's ability to categorize questions as answerable or not without answer choices, a manual review was conducted. We sampled 100 random questions from the MultiMedQA dataset, and a senior medical student, blinded to GPT-4o's decision, judged whether each required the answer options (Extended Data Fig. 2). In this evaluation, we found that GPT-4o tended to be conservative, excluding more questions than the human reviewer. We deemed this suitable for the task: while it decreases the size of the dataset, it minimizes the number of unanswerable questions.
Free-response adaptation
We employed regular expressions (RegEx) string matching to identify and replace phrases that are specific to multiple-choice questions. For example, “which of the following” was replaced with “what” to align with the free-response structure. The resulting questions, with the corresponding correct answers, composed the free-response (FR) portion of our FreeMedQA dataset.
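A minimal sketch of this substitution step is shown below. Only the “which of the following” → “what” replacement is stated in the text; the additional pattern is an illustrative assumption.

```python
# Sketch of the free-response adaptation; patterns beyond the stated one are assumed.
import re

MCQ_PATTERNS = [
    (re.compile(r"which of the following", re.IGNORECASE), "what"),      # stated in Methods
    (re.compile(r"which one of the following", re.IGNORECASE), "what"),  # assumed variant
]

def to_free_response(stem: str) -> str:
    """Rewrite a multiple-choice question stem into a free-response question."""
    for pattern, replacement in MCQ_PATTERNS:
        stem = pattern.sub(replacement, stem)
    return stem

print(to_free_response("Which of the following is the most likely diagnosis?"))
# -> "what is the most likely diagnosis?"
```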
Performance assessment
We performed a comparative study of three industry-grade LLMs: GPT-4o, GPT-3.5, and Llama-3-70B. We first prompted each model to answer the multiple-choice version of each question in FreeMedQA and evaluated the result using string matching. We used a maximum context length of 1024 and a temperature of 0.0, and repeated each experiment five times to obtain statistical bounds. We then presented each model with the free-response version of each question. We used GPT-4o to evaluate the correctness of the answer by presenting it with the correct answer choice and the candidate's free-response answer and prompting it to evaluate whether the two answers are equivalent (one is contained in the other) (Extended Data Fig. 3). Notably, the judge was blinded to the question being asked. We similarly repeated the experiment five times.
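A minimal sketch of the free-response grading call is given below; the exact judge prompt is an assumption, but it reflects the blinded, reference-versus-candidate comparison described above.

```python
# Sketch of the GPT-4o judge (assumed prompt wording); the judge never sees the question stem.
from openai import OpenAI

client = OpenAI()

def grade_free_response(reference_answer: str, candidate_answer: str) -> bool:
    """Return True if GPT-4o judges the candidate answer equivalent to the reference answer."""
    prompt = (
        f"Reference answer: {reference_answer}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Are the two answers equivalent, i.e. is one contained in the other? "
        "Reply with CORRECT or INCORRECT only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=5,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")
```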
We obtained error bars by computing standard deviations over the reruns of our experiments (Fig. 1). We established statistical significance using the Mann-Whitney U test between the five repeats of each format, and we aggregated the per-model p-values using Fisher's method.
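As a sketch of this analysis, the snippet below applies a one-sided Mann-Whitney U test to the five per-run accuracies of one model and then combines the per-model p-values with Fisher's method; the per-run accuracies are illustrative placeholders, while the three per-model p-values are those reported in the Results.

```python
# Sketch of the statistical analysis; per-run accuracies are illustrative placeholders.
import numpy as np
from scipy import stats

mc_runs = np.array([0.867, 0.868, 0.866, 0.869, 0.867])  # multiple-choice accuracy per run
fr_runs = np.array([0.492, 0.493, 0.491, 0.494, 0.492])  # free-response accuracy per run

# One-sided Mann-Whitney U: multiple-choice accuracy exceeds free-response accuracy.
u_stat, p_model = stats.mannwhitneyu(mc_runs, fr_runs, alternative="greater")

# Fisher's method over the per-model p-values reported in the Results section.
per_model_p = [0.006, 0.004, 0.004]
chi2_stat, combined_p = stats.combine_pvalues(per_model_p, method="fisher")
print(p_model, combined_p)  # combined_p is approximately 1.3e-5
```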
Masked study
We systematically evaluated the LLMs' performance by progressively masking parts of the multiple-choice questions they were presented with. We used the results from the first experiment, with unmasked questions, as our baseline recording maximum accuracy. Next, we tokenized each question and created four progressively masked versions: one with the last 25% of tokens hidden, one with the last 50%, one with the last 75%, and one with 100% of the tokens hidden (Supplemental Fig. 1). The masked portions were replaced with a generic [MASK] token, simulating a scenario where key information is missing (Supplemental Fig. 2). The full list of answer choices remained visible for all versions, ensuring the model's task was to select the correct option from the same set, but with less contextual information from the question stem.
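A minimal sketch of the masking procedure is shown below; we assume simple whitespace tokenization, since the Methods do not specify a tokenizer.

```python
# Sketch of progressive question-stem masking (whitespace tokenization is an assumption).
def mask_question(stem: str, fraction: float) -> str:
    """Replace the last `fraction` of the question-stem tokens with [MASK]."""
    tokens = stem.split()
    n_keep = int(round(len(tokens) * (1.0 - fraction)))
    return " ".join(tokens[:n_keep] + ["[MASK]"] * (len(tokens) - n_keep))

stem = "A 62-year-old woman presents with sudden painless vision loss in one eye"
for fraction in (0.25, 0.50, 0.75, 1.00):
    print(f"{int(fraction * 100)}% masked:", mask_question(stem, fraction))
```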
For each masking level (25%, 50%, 75%, and 100%), we created a single, static masked dataset. We then ran each model on that same masked dataset five separate times to calculate the standard deviations for our error bars. To establish statistical significance in the multiple-choice, 100%-masking condition, we used the Wilcoxon signed-rank test with a null hypothesis of a random-chance probability of 25% and the alternative that models perform better than chance.
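The test against chance can be sketched as follows; the five per-run accuracies are illustrative, and with all five runs above 25% the one-sided p-value is 1/32 ≈ 0.031, matching the value reported for GPT-4o.

```python
# Sketch of the one-sided Wilcoxon signed-rank test against the 25% chance level.
from scipy import stats

run_accuracies = [0.372, 0.374, 0.375, 0.373, 0.373]  # illustrative 100%-masking runs
differences = [accuracy - 0.25 for accuracy in run_accuracies]

stat, p_value = stats.wilcoxon(differences, alternative="greater")
print(stat, p_value)  # p = 1/32 = 0.03125 when all five runs exceed chance
```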
Medical student knowledge evaluation
We created a non-overlapping set of 350 unique questions by selecting 175 questions from each of the two FreeMedQA subsets. These questions were distributed across seven Google Forms, each containing 50 questions: 25 multiple-choice and 25 free-response. The forms featured a randomized arrangement of multiple-choice and free-response questions. The multiple-choice responses were evaluated by GPT-4o via comparison to the correct answer stored in FreeMedQA, and the free-response answers were assessed for synonymy with the reference answer by GPT-4o. We performed a one-sided paired Wilcoxon signed-rank test on the multiple-choice versus free-response averages achieved on each form by the medical students, with the alternative hypothesis of better performance on the multiple-choice questions.
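A sketch of this paired test is given below; the per-form averages are illustrative, and when multiple-choice scores exceed free-response scores on all seven forms the one-sided p-value is 1/128 ≈ 0.008, consistent with the value reported in the Results.

```python
# Sketch of the one-sided paired Wilcoxon signed-rank test over the seven forms.
from scipy import stats

mc_form_means = [0.42, 0.38, 0.41, 0.37, 0.40, 0.39, 0.39]  # illustrative per-form MC averages
fr_form_means = [0.19, 0.16, 0.18, 0.17, 0.18, 0.17, 0.19]  # illustrative per-form FR averages

stat, p_value = stats.wilcoxon(mc_form_means, fr_form_means, alternative="greater")
print(stat, p_value)  # p = 1/128 ≈ 0.0078 when MC exceeds FR on every form
```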
Data availability
All data generated or analysed during this study are included in this published article.
References
Bommasani, R. et al. On the Opportunities and risks of foundation models. arXiv [cs.LG] (2021).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv [cs.CL] (2023).
Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on Medical challenge problems. arXiv [cs.CL] (2023).
Alyakin, A. et al. Repurposing the scientific literature with vision-language models. arXiv [cs.AI] (2025).
Blanco, J., Lambert, C. & Thompson, O. GPT-Neo with LoRA for better medical knowledge performance on MultiMedQA dataset. https://doi.org/10.31219/osf.io/njupy (2024).
Bolton, E. et al. Assessing the potential of mid-sized language models for clinical QA. arXiv [cs.CL] (2024).
Hamzah, F. & Sulaiman, N. Optimizing llama 7B for medical question answering: a study on fine-tuning strategies and performance on the MultiMedQA dataset. https://osf.io/g5aes/download.
Li, W. et al. Can multiple-choice questions really be useful in detecting the abilities of LLMs? arXiv [cs.CL] (2024).
Balepur, N. & Rudinger, R. Is your large language model knowledgeable or a choices-only cheater? arXiv [cs.CL] (2024).
Schubert, M. C., Wick, W. & Venkataramani, V. Performance of large language models on a neurology board-style examination. JAMA Netw. Open 6, e2346721 (2023).
Johri, S. et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nat. Med. 31, 77–86 (2025).
OpenAI et al. GPT-4 Technical Report. arXiv [cs.CL] (2023).
Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Grattafiori, A. et al. The Llama 3 herd of models. arXiv [cs.AI] (2024).
Bilon, E. Using Bloom's Taxonomy to Write Effective Learning Objectives: The ABCDs of Writing Learning Objectives: A Basic Guide (Independently Published, 2019).
Models of human memory. Google Books https://books.google.com/books/about/Models_of_Human_Memory.html?id=sGQhBQAAQBAJ.
McClelland, D. C. Testing for competence rather than for intelligence. (1973).
Griot, M., Vanderdonckt, J., Yuksel, D. & Hemptinne, C. Multiple choice questions and large languages models: a case study with fictional medical data. arXiv [cs.CL] (2024).
Rajpurkar, P. & Topol, E. J. A clinical certification pathway for generalist medical AI systems. Lancet 405, 20 (2025).
Vishwanath, K. et al. Medical large language models are easily distracted. (2025).
Vishwanath, K. et al. Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons. (2025).
Xu, J. et al. Data set and benchmark (MedGPTEval) to evaluate responses from large language models in medicine: evaluation development and validation. JMIR Med. Inform. 12, e57674 (2024).
Cai, Y. et al. MedBench: a large-scale Chinese benchmark for evaluating medical large language models. AAAI 38, 17709–17717 (2024).
Longwell, J. B. et al. Performance of large language models on medical oncology examination questions. JAMA Netw. Open 7, e2417641 (2024).
Pellegrini, C., Keicher, M., Özsoy, E. & Navab, N. Rad-Restruct: A novel VQA benchmark and method for structured radiology reporting. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 409–419 (Springer Nature Switzerland, 2023).
Adams, L. et al. LongHealth: A question answering benchmark with long clinical documents. arXiv [cs.CL] (2024).
Dada, A. et al. CLUE: A clinical language understanding evaluation for LLMs. arXiv [cs.CL] (2024).
Chen, Q. & Deng, C. Bioinfo-Bench: A simple benchmark framework for LLM bioinformatics skills evaluation. bioRxiv https://doi.org/10.1101/2023.10.18.563023 (2023).
MultiMedQA - a openlifescienceai Collection. https://huggingface.co/collections/openlifescienceai/multimedqa-66098a5b280539974cefe485.
Author information
Contributions
E.K.O. conceptualized and supervised the project. S.S. and A.A. developed the dataset conversion pipeline and the LLM evaluation pipeline. D.A.A. and A.P.S.T. performed the pipeline quality assurance. S.S. created forms for human evaluation. D.A.A, A.P.S.T., K.S., N.G., M.D.L.P., M.H., and K.Y.P. performed the medical student knowledge evaluation. S.S. and A.A. performed the statistical analysis. S.S., A.A., D.A.A., and E.K.O. drafted the manuscript text and designed the figures. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Singh, S., Alyakin, A., Alber, D.A. et al. The pitfalls of multiple-choice questions in generative AI and medical education. Sci Rep 15, 42096 (2025). https://doi.org/10.1038/s41598-025-26036-7