Introduction

In recent years, large language models have become more prevalent in medical research1,2,3,4,5. These models have purportedly demonstrated high performance across a variety of fields in medicine, and easily passed formal assessments of medical knowledge such as the United States Medical Licensing Exam2,3,4. One of the more prominent benchmarks used to report the performance of LLMs is the MultiMedQA, which encompasses questions from many fields and stages of medical training2,6,7,8. Notably, this benchmark and others are composed of multiple-choice questions, which may present limitations for the accurate assessment of LLMs9,10,11. While recent works such as CRAFT-MD have focused on converting these multiple-choice questions into more real-world assessments involving multi-turn conversations, there are still no rigorous evaluations of the quality of these multiple-choice benchmarks themselves12.

We hypothesized that existing multiple-choice question (MCQ) benchmarks are poor metrics for assessing the medical knowledge and capabilities of LLMs. To test this, we developed a benchmark of paired free-response and multiple-choice questions and developed a technique for automatically assessing free-response answers. We then compared the performance of GPT-4o13, GPT-3.514, and Llama-3-70B15 to answer questions when presented in both multiple-choice and free-response formats. We further studied the performance of these LLMs when the question stems were progressively masked in both free-response and multiple-choice formats. We hypothesized that multiple-choice performance should approach random chance at 25% as information is increasingly lost to masking. Lastly, we conducted human evaluations with medical students to establish human baselines and provide context for LLM results.

Results

FreeMedQA creation

Starting with 14,965 candidate questions from the MultiMedQA and using an LLM-based pipeline (see Methods, Extended Data Fig. 1), we created 10,278 questions with paired free-response and MCQ versions (FreeMedQA). We also built an evaluative method using GPT-4o as a judge to score free-response answers based on MCQ answers (Extended Data Fig. 2).

Evaluation of LLMs’ performance in free-response compared to multiple-choice

Using this novel benchmark, we found that GPT-4o, GPT-3.5, and Llama-3-70B-Chat exhibited significant drops in performance when evaluated using a free-response question format as opposed to MCQ. On average, the models showed a 39.43% (combined p = 1.3 × 10⁻⁵) absolute drop in performance from multiple-choice to free-response answering. Llama-3-70B exhibited the greatest absolute drop of 46.59% (relative drop of 59.08%; p = 0.006), followed by GPT-4o with an absolute drop of 37.50% (relative drop of 43.23%; p = 0.004), and GPT-3.5 with the smallest absolute drop of 34.20% (relative drop of 56.51%; p = 0.004) (Fig. 1).

Fig. 1

LLM performance on FreeMedQA. Performance of gpt-4o-2024-08-06, gpt-3.5-turbo-0125, and llama3-70B-chat on FreeMedQA (n=10,278 for both MC and FR), as well as of medical students on sample forms from FreeMedQA (n=175). All three models displayed decreased performance on free-response questions compared with multiple-choice, with a 39.43% average drop in performance. Medical students showed a 22.29% decline in performance in the transition from multiple-choice to free-response. For the AI models, error bars represent the standard deviation over five independent experimental runs. For the medical students, error bars represent the standard deviation of their scores with respect to their mean.

Evaluation of medical students’ performance in free-response compared to multiple-choice

To contextualize these findings, we also assessed medical trainees using a subset of 175 unique questions from FreeMedQA MC and a subset of 175 unique questions from FreeMedQA FR, for a total sample size of 350 questions. We found that senior medical students experienced a 22.29% decrease in performance when transitioning from multiple-choice to free-response questions (p = 0.008), with scores dropping from 39.43% on multiple-choice to 17.79% on free-response questions (Fig. 1).

Evaluation of LLMs’ performance with masked inputs

To investigate the relative performance of LLMs further, we performed a masking study in which we progressively masked out the question stems of FreeMedQA questions. For the multiple-choice component, the answer options were presented without any masking. All models deteriorated in performance in both multiple-choice and free-response categories as increasing portions of the input were masked. A notable discrepancy appears at 100% masking, where multiple-choice performance is on average 6.70% above the random-chance level of 25% across all models, implying that the LLMs rely on pattern recognition over the answer options. GPT-4o had the greatest deviation from random chance, with an accuracy of 37.34%, 12.34% higher than random chance, despite complete masking of the inputs (p = 0.031). Comparatively, across all models, free-response performance declines to 0.15%, where the deviation from 0% reflects noise in the evaluative model rather than genuinely correct answers (Fig. 2).

Fig. 2

Performance of LLMs with masking. Performance of the studied models in free-response and multiple-choice formats as the question stem is masked in 25% increments.

Discussion

We present a straightforward revision of existing multiple-choice benchmarks for medical LLMs – converting them to free-response questions – that may aid researchers and clinicians in elucidating LLMs’ strengths and weaknesses. Our approach uses a highly challenging dataset, on which even human medical trainees struggle, averaging 39% on the multiple-choice questions, thus providing a more rigorous test of a model’s clinical reasoning. We find that medical LLMs have learned processes to determine the answer in a multiple-choice setting that are independent of their ability to answer the question being asked, as their performance declines significantly but remains above chance even when the entire question stem is masked. Therefore, medical LLM performance on benchmarks composed of multiple-choice questions does not reflect genuine understanding of medical concepts in a more general setting, and reformulating LLM assessments as free-response questions or multi-turn dialogues12 seems prudent.

All models also showed a degradation in performance in the free-response format compared with the multiple-choice format. We attribute this to learned mechanisms that allow LLMs to recognize the correct answer from the answer options alone. We observed very little variability in correctness for either question format, which suggests that for these fact-based tasks the models tend to have a high degree of certainty. For the medical students, despite the difference in form sampling, the observed decline in their performance from multiple-choice to free-response aligns with established psychological principles, suggesting that the format, rather than random variation, was the primary driver of the observed effect. Interestingly, the performance decrease of models is similar to that of humans, suggesting that some of this gap is due to a more general test-taking strategy that is leveraged by both medical LLMs and medical trainees. In fact, a parallel could be drawn from the different levels of learning witnessed in humans via Bloom’s Taxonomy16 to similar mechanisms in LLM learning. While LLMs may excel at multiple-choice tasks by leveraging the provided options as cues, this does not necessarily mean they possess a true understanding of the subject matter, as their performance is less reliable on free-recall questions that require them to generate the correct answer without any assistance. Mere recognition of the correct choice is a more feasible task than recalling all associated information17.

While both humans and LLMs show a drop in performance, the greater degradation of LLMs compared to humans hints that LLMs may be better than humans at maximizing test-taking strategies such as testwiseness, chance guessing, and cueing. Humans may be burdened by cognitive load or fatigue, preventing them from consistently performing at their best. In contrast, the model may be more capable of effectively “reverse engineering” the question, using the provided options to guide its response and eliminate incorrect choices, a feat that becomes infeasible when the options are removed.

Passing a test is a necessary but not sufficient condition for competence, as it does not account for the collaborative and contextual nature of actual practice18. Doctors operate within a system of checks and balances with other professionals and technology, a factor that is entirely absent in a test-based assessment. Other recent works have echoed this concern over the use of multiple-choice questions for assessing medical LLMs19. The recently released CRAFT-MD benchmark adapts medical multiple-choice questions to multi-turn dialogues, which poses an interesting alternative to free-response questions12. Also, calls to consider evaluating medical LLMs using processes similar to medical trainees seem to be increasingly justified in light of such results20.

It should be acknowledged that medical exams, while a solid test of factual knowledge, are a fundamentally incomplete way to gauge a model’s true clinical readiness. They fail to capture the complex, collaborative nature of real hospital care, in which doctors leverage a network of colleagues and technology. Most importantly, these benchmarks cannot measure the emotional intelligence and empathy that are central to patient care. This is particularly relevant given that these models are known to be easily influenced by misleading cues that are common in clinical contexts21,22. This raises the concern that LLMs may be optimized for a benchmark that is detached from the reality of clinical practice and the actual exams medical students face.

This study is not without limitations. To maintain the integrity of the study, we removed 31.32% of the questions in the MultiMedQA that required knowledge of the answer options to identify the correct answer. These questions were characterized as having prompts that relied heavily on the multiple-choice options. This decreased the size of our derived FreeMedQA benchmark and also raised more general concerns over test question quality. Our study was conducted only on a subset of one popular medical benchmark, but there are other popular medical benchmarks used to report LLM performance23. This study was restricted to the English language, but medical benchmarks in other languages also exist24. Our FreeMedEval approach utilizes GPT-4o, which, while efficient, is stochastic and not free of error, although GPT-4 is commonly utilized in this manner in other studies12; by its stochastic nature, it can introduce noise, predominantly in the form of false positives, into the scoring process. Our findings also indicate differential performance of Llama-3-70B, which performs closer to GPT-4o on multiple-choice questions but drops to GPT-3.5 levels on free-response tasks. We were unable to provide an explanation for this observation, since doing so requires detailed knowledge of model-specific training methodology, which is unfortunately not available for industry-grade models. Furthermore, our filtering methodology, which relies on an LLM to determine question answerability, also presents a limitation. As demonstrated by the human-LLM disagreement, our filter may have retained a notable number of questions that are inherently unsolvable in a free-response format, which could partially explain the performance decrement observed in our models. The evaluation of natural language generation is an open problem in natural language processing, which we leave for future work. For future avenues of investigation, this study could be conducted on a dataset specific to a specialty in medicine25,26. The study could be scaled to additional modalities or expanded to include multiple popular datasets27,28,29. Lastly, further efforts to improve the evaluation of natural language generation and free-response answers are sorely needed.

Conclusion

Our study and other recent studies show that the evaluation of medical LLMs and medical AI in general is clearly an open problem. The push to develop and introduce AI models into clinical care needs to be met with equally strong and creative approaches to appropriately evaluate these models to ensure their safety and reliability.

Methods

Compliance and ethical approval

All methods were performed in accordance with relevant guidelines and regulations. The study’s experimental protocols were approved by NYU Langone Medical Center, and informed consent was verbally obtained from all subjects.

FreeMedQA filtering

We utilized the MultiMedQA dataset acquired from the Hugging Face Hub30. For each question, we prompted GPT-4o to determine whether it could be answered without the multiple-choice options included. We employed few-shot prompting, a technique in which a model is given a few examples to guide its response, by providing ten manually curated examples of correctly sorted questions with each call to the model. This process yielded a subset of 10,278 MultiMedQA questions that were deemed appropriate to convert to a free-response format. This dataset can be found under “Supplementary Materials.” The multiple-choice versions of these questions compose the MC portion of our new FreeMedQA dataset, also found under “Supplementary Materials” (Extended Data Fig. 1).
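For illustration, a minimal sketch of how such an answerability filter could be implemented with the OpenAI Python client is shown below; the system prompt, the two few-shot examples, and the ANSWERABLE/NOT_ANSWERABLE labels are our own assumptions and do not reproduce the exact prompts or the ten curated examples used in this study.

```python
# Minimal sketch of the answerability filter (hypothetical prompts and labels).
from openai import OpenAI

client = OpenAI()

# Two illustrative few-shot examples; the study used ten manually curated ones.
FEW_SHOT = [
    {"role": "user", "content": "A 45-year-old man has crushing chest pain radiating "
                                "to the left arm. What is the most likely diagnosis?"},
    {"role": "assistant", "content": "ANSWERABLE"},
    {"role": "user", "content": "Which of the following statements is correct?"},
    {"role": "assistant", "content": "NOT_ANSWERABLE"},
]

SYSTEM = ("Decide whether the medical question below can be answered without seeing "
          "its multiple-choice options. Reply with ANSWERABLE or NOT_ANSWERABLE.")

def is_answerable_without_options(question_stem: str) -> bool:
    """Return True if GPT-4o judges the stem answerable without its options."""
    messages = [{"role": "system", "content": SYSTEM}] + FEW_SHOT + [
        {"role": "user", "content": question_stem}
    ]
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=messages,
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().upper().startswith("ANSWERABLE")
```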

Filtering quality control

To evaluate GPT-4o’s ability to categorize questions as answerable or not without answer choices, a manual review was conducted. One hundred random questions were sampled from the MultiMedQA dataset and judged as requiring the answer options or not by a senior medical student who was blinded to GPT-4o’s decision (Extended Data Fig. 2). In this evaluation, we found that GPT-4o tended to be conservative, excluding more questions than the human reviewer. We deemed this suitable for the task: while it decreases the size of the dataset, it minimizes the number of unanswerable questions.

Free-response adaptation

We employed regular-expression (RegEx) string matching to identify and replace phrases that are specific to multiple-choice questions. For example, "which of the following" was replaced with “what” to align with the free-response structure, as sketched below. The resulting questions, paired with the corresponding correct answers, composed the free-response (FR) portion of our FreeMedQA dataset.
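A minimal sketch of this rewriting step follows; only the "which of the following" → "what" substitution comes from the text above, and the study's full pattern list is not reproduced here.

```python
# Illustrative regular-expression rewrite of an MCQ stem into free-response phrasing.
import re

MCQ_PATTERN = re.compile(r"which of the following", flags=re.IGNORECASE)

def to_free_response(stem: str) -> str:
    """Replace MCQ-specific phrasing in a question stem."""
    return MCQ_PATTERN.sub("what", stem)

print(to_free_response("Which of the following is the most likely diagnosis?"))
# -> "what is the most likely diagnosis?"
```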

Performance assessment

We performed a comparative study of three industry-grade LLMs: GPT-4o, GPT-3.5, and Llama-3-70B. We first prompted each model to answer the multiple-choice version of each question in FreeMedQA and evaluated the result using string matching. We used a maximum context length of 1024 tokens and a temperature of 0.0, and repeated each experiment five times to obtain statistical bounds. We then presented each model with the free-response version of each question. We used GPT-4o to evaluate the correctness of the answer by presenting it with the correct answer choice and the candidate answer and prompting it to evaluate whether the two answers are equivalent (i.e., one is contained in the other) (Extended Data Fig. 3). Notably, the judge was blinded to the question being asked. We similarly repeated this experiment five times.
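A sketch of the free-response grading call is given below for illustration; the grading prompt paraphrases the setup described above rather than quoting Extended Data Fig. 3, and the YES/NO output convention is an assumption.

```python
# Sketch of the GPT-4o-as-judge step for free-response answers (prompt wording assumed).
from openai import OpenAI

client = OpenAI()

def grade_free_response(reference_answer: str, candidate_answer: str) -> bool:
    """Judge whether the candidate answer matches the reference; the question is withheld."""
    prompt = (
        f"Reference answer: {reference_answer}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Are these two answers equivalent, in the sense that one is contained in the other? "
        "Reply YES or NO."
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=1024,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```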

We obtained error bars by computing standard deviations over the reruns of our experiments (Fig. 1). We established statistical significance using the Mann-Whitney U test between the five repeats of each format, and we aggregated the per-model p-values using Fisher’s method.
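A minimal SciPy sketch of this testing procedure is shown below; the per-run accuracies and the one-sided alternative are illustrative placeholders rather than values from the study.

```python
# Sketch of the significance testing; accuracies and extra p-values are placeholders.
import numpy as np
from scipy.stats import combine_pvalues, mannwhitneyu

# Hypothetical multiple-choice and free-response accuracies over five repeats (one model).
mc_runs = np.array([0.85, 0.86, 0.87, 0.88, 0.89])
fr_runs = np.array([0.47, 0.48, 0.49, 0.50, 0.51])

# Mann-Whitney U test between the five repeats of each format (one-sided alternative assumed).
_, p_model = mannwhitneyu(mc_runs, fr_runs, alternative="greater")

# Aggregate one p-value per model across the three LLMs with Fisher's method.
per_model_p = [p_model, 0.004, 0.006]  # placeholders for the other two models
_, p_combined = combine_pvalues(per_model_p, method="fisher")
print(p_model, p_combined)
```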

Masked study

We systematically evaluated each LLM’s performance by progressively masking parts of the multiple-choice questions it was presented with. We used the results from the first experiment, with unmasked questions, as our baseline to record maximum accuracy. Next, we tokenized each question and created four progressively masked versions: one with the last 25% of tokens hidden, another with the last 50%, one with the last 75%, and a final one with 100% of the tokens hidden (Supplemental Fig. 1). The masked portions were replaced with a generic [MASK] token, simulating a scenario where key information is missing (Supplemental Fig. 2). The full list of answer choices remained visible for all versions, ensuring that the model’s task was to select the correct option from the same set, but with less contextual information from the question stem.
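A minimal sketch of the masking step follows; whitespace tokenization and the example stem are assumptions, since the tokenizer used in the study is not specified.

```python
# Sketch of progressive question-stem masking; whitespace tokenization is assumed.
def mask_question(stem: str, fraction: float, mask_token: str = "[MASK]") -> str:
    """Replace the last `fraction` of the stem's tokens with a generic [MASK] token."""
    tokens = stem.split()
    n_masked = round(len(tokens) * fraction)
    kept = tokens[: len(tokens) - n_masked]
    return " ".join(kept + [mask_token] * n_masked)

stem = ("A 60-year-old woman presents with progressive dyspnea and "
        "bilateral lower-extremity edema. What is the most likely diagnosis?")
for fraction in (0.0, 0.25, 0.50, 0.75, 1.00):
    print(f"{int(fraction * 100):3d}% masked:", mask_question(stem, fraction))
```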

For each masking level (25%, 50%, 75%, and 100%), we created a single, static masked dataset. We then ran each model on that same masked dataset five separate times to calculate the standard deviations for our error bars. To establish statistical significance for multiple-choice performance at 100% masking, we used the Wilcoxon signed-rank test with a null hypothesis of random-chance performance (25%) and the alternative that models perform better than chance.
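The chance-level comparison can be sketched with SciPy as below; the per-run accuracies are placeholders.

```python
# Sketch of the test against chance at 100% masking; accuracies are placeholders.
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical multiple-choice accuracies over five runs with the stem fully masked.
masked_mc_runs = np.array([0.37, 0.38, 0.36, 0.39, 0.35])

# One-sided Wilcoxon signed-rank test against the 25% random-chance level.
_, p_value = wilcoxon(masked_mc_runs - 0.25, alternative="greater")
print(p_value)  # 1/32 ≈ 0.031 is the smallest attainable exact p with five runs
```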

Medical student knowledge evaluation

We created a non-overlapping set of 350 unique questions by selecting 175 questions from each of the two FreeMedQA subsets. These questions were distributed across seven Google Forms, each containing 50 questions: 25 multiple-choice and 25 free-response. The forms featured a randomized arrangement of multiple-choice and free-response questions. The multiple-choice answers were evaluated by GPT-4o through comparison with the correct answer stored in FreeMedQA. The free-response answers were assessed by GPT-4o via a synonymity assessment. We performed a one-sided paired Wilcoxon signed-rank test on the multiple-choice and free-response averages achieved on every form by the medical students, with the alternative hypothesis of better performance on the multiple-choice questions, as sketched below.
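A SciPy sketch of this per-form comparison is shown below; the form-level averages are placeholders, not the students' actual scores.

```python
# Sketch of the per-form paired test; the seven form-level averages are placeholders.
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical average scores on each of the seven forms (multiple-choice vs free-response).
mc_form_means = np.array([0.44, 0.38, 0.41, 0.36, 0.42, 0.39, 0.40])
fr_form_means = np.array([0.19, 0.14, 0.18, 0.14, 0.21, 0.19, 0.21])

# One-sided paired Wilcoxon signed-rank test: multiple-choice scores exceed free-response.
_, p_value = wilcoxon(mc_form_means, fr_form_means, alternative="greater")
print(p_value)
```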