Abstract
This study evaluated the effectiveness of large language models (LLMs) and vision-language models (VLMs) in gastroenterology. We used board-style multiple-choice questions to assess the performance of both proprietary and open-source LLMs and VLMs (including GPT, Claude, Gemini, Mistral, Llama, Mixtral, Phi, and Qwen) across different interfaces, computing environments, and levels of compression (quantization). Among the proprietary models, o1-preview (82.0%) and Claude3.5-Sonnet (74.0%) had the highest accuracy, outperforming the top open-source models: Llama3.3-70b (65.7%) and Qwen-2.5-72b (61.0%). Among the small quantized open-source models, the 8-bit Llama 3.2-11b (51.7%) and 6-bit Phi3-14b (48.7%) performed the best, with scores comparable to their full-precision counterparts. Notably, VLM accuracy on image-containing questions improved (~10%) when given human-generated captions, remained unchanged with original images, and declined with LLM-generated captions. Further research is warranted to evaluate model capabilities in real-world clinical decision-making scenarios.
Introduction
Large language models (LLMs) increasingly serve as healthcare decision-support tools, leveraging their ability to process text at scale with a domain-specific, evidence-based knowledge base1,2. Because LLM performance depends heavily on training data, these models often exhibit substantial variability across tasks, medical specialties, and clinical settings3. Medical subspecialties like gastroenterology pose a particular challenge, as their clinical practice requires the application of complex, specific, and rapidly evolving technical knowledge to disparate clinical data types, including text, tabular data, and images4. Vision-language models (VLMs), designed to process both textual and visual inputs, hold immense potential for augmenting or automating many clinical tasks that require both text and image inputs5. However, VLM performance with clinical data is poorly understood.
Board-exam style multiple-choice questions (MCQs) have served as imperfect, but standardized benchmarks of LLM medical reasoning capabilities4. While clinical MCQs do not capture the full complexity of real-world clinical reasoning, they can provide an objective, concrete, and reproducible gold standard to benchmark against6,7. MCQ-based studies of LLM clinical reasoning show variable performance on specialty-level datasets, with some studies showing performance approaching that of medical residents. However, the utility and generalizability of these studies are limited by their inconsistent or unspecified prompting strategies, variable model settings, and assessment of only a small subset of proprietary models. Supplementary Table 1 summarizes previous studies on LLM performance evaluation on medical MCQs8,9,10,11,12,13,14,15,16,17,18,19,20,21,22.
Our study aims to fill these critical gaps. We offer a comprehensive benchmarking strategy for both LLMs and VLMs, focusing on gastroenterology-related clinical reasoning questions, thereby addressing the paucity of standardized evaluations in this specialized field. We apply our strategy to a comprehensive sampling of 36 LLMs, including 7 open-source and proprietary model families (Llama, Phi, Gemma, Qwen, OpenAI, Claude, Gemini), 3 levels of model quantization (i.e., degree of model compression used to facilitate local use), 4 LLMs fine-tuned on medical data, and 7 VLMs. We explore a wide range of potential factors impacting model response, including the inclusion of images, question content, question difficulty, and model interface. Critically, we present a comprehensive, systematic, reproducible, and scalable benchmarking approach that can be extended to other clinical reasoning MCQ datasets.
We conducted evaluations of models across various interfaces—including application programming interfaces (APIs), web-based user interfaces (both first- and third-party), and local environments (ranging from personal laptops to high-performance servers)—to reflect the multifaceted settings in which LLMs may be utilized. A glossary of terms pertaining to LLMs is provided in Supplementary Table 2.
Results
Dataset characterization
The Experiment 1 dataset, American College of Gastroenterology Self-Assessment (ACG-SA) 2022, comprises 300 questions, 138 of which are image-based. The dataset spans 10 gastroenterology topics, with the five most prevalent being liver (N = 52), colon (N = 49), esophagus (N = 36), pancreaticobiliary (N = 32), and endoscopy (N = 26). Nearly all the questions were case-based (N = 297) and targeted higher-order thinking skills (N = 298). We identified 123 questions focused on diagnosis, 217 on treatment, 211 on investigation, 55 on complications, and 3 on pathophysiology. The average question length was 232.92 ± 94.10 tokens, roughly equivalent to 174 words. The average performance of test-takers on 2022 questions was 74.52% ± 19.49%.
The optimization dataset consisted of 60 randomly sampled text-based questions from ACG-SA 2021 and 2023. The Experiment 2 dataset comprises 138 image-based questions from ACG-SA 2022, totaling 195 images used to answer these questions. The average dimensions of the images were 257 ± 79 pixels by 206 ± 83 pixels, with a resolution of 110 ± 71 DPI. The Experiment 3 dataset included ACG-SA 2021, 2022, and 2023, each containing 300 questions.
LLM setting optimization
We found that the choice of model settings, prompt strategy, and use of structured outputs all impacted LLM performance in gastroenterology MCQ reasoning. Structured output generation, when available, enhanced performance by 5–10% (Fig. 1a and Supplementary Table 3). When evaluating the 11 prompt engineering techniques individually, we found that 5 techniques resulted in a score > 2 standard deviations above the raw prompt. The direct questioning and answer-and-justify prompts had the greatest impact on performance (mean difference and 95% CI: 4.44%; 2.78–6.11%), followed by contextual embedding (2.77%; −0.56% to 5.0%), confidence scoring (1.67%; −0.56% to 3.34%), and expert mimicry (1.67%; 0.56%–2.78%). The best performing prompt containing all 5 techniques increased performance by 5.00% (95% CI: 1.11% to 7.78%) (Fig. 1d and Supplementary Table 4). Supplementary Fig. 1 demonstrates the effectiveness of our GPT-3.5-engineered prompt compared to a raw prompt. The optimized prompt enhanced performance for GPT-4o, Llama3.1-70b, and Llama-3.1-8b models, while maintaining Phi-3.5-4b’s baseline performance. Setting the model temperature to 1 (default) resulted in optimal performance (Fig. 1b and Supplementary Table 5). Additionally, setting max_token to 512 plus the number of tokens in the input text resulted in the best performance (Fig. 1c and Supplementary Table 6). The best performing settings were used for all subsequent analyses.
LLM performance on text-based MCQ reasoning
Figure 2 illustrates the performance of full-size LLMs for all 2022 ACG self-assessment questions when only the question text (without any images) was provided. As depicted in Supplementary Fig. 2, the granular analysis at the question level reveals a significant overlap in correctly answered questions across various LLMs. OpenAI o1 achieved the highest score of 82.0%, surpassing the average score of test takers (74.52%) and the cutoff required to pass the exam (70%). When restricted to questions without associated image data, OpenAI o1 performance increased to 86.4%. Among the remaining commercial models, GPT-4o and Claude-3.5-Sonnet achieved a score of 74%. The subsequent group of models, including GPT-4 and Claude-3-Opus, attained scores ranging from 64% to 66%, which were comparable to those of the top open-source models. Notably, the API interface with optimized model settings uniformly matched the performance of the equivalent web chat interface for all proprietary models (Supplementary Fig. 3), suggesting that an unoptimized API may underperform the equivalent proprietary web chat interface. Among the open-source models, Llama3.1-405b and Llama-3.3-70b demonstrated the highest performance (64%), followed by Qwen-2.5-72b (61%). The most effective quantized models hosted on a local computer (requiring less than 16 GB of memory) were Llama-3.2-11b-Q8 (51.7%) and Phi3-14b-Q6 (48.7%).
Models are grouped by family and arranged chronologically within each cluster from oldest to newest. Stacked bars show response categories: correct, 2-option selection (2OP), external option selection (EOP), no option selection (NOP), errors, and incorrect responses. For models tested through both API and web interface, the API version is used in this figure due to similar performance between interfaces.
Model size and quantization performance loss
Figure 3a highlights the direct relationship between accuracy and model size, and Fig. 3b illustrates the performance of quantized models in comparison to the full-size model. A comparative analysis of the full-precision open-source models and their quantized counterparts demonstrated comparable results for most of the evaluated models, particularly for large models and 8-bit quantized versions. The most substantial performance change was observed in Llama3-8b, which exhibited a decrease in accuracy from 43.3% to 31% following 8-bit quantization.
a Model size (billions of parameters, logarithmic scale) versus performance accuracy (percent). LLM families are color-coded: Mistral (blue), Llama (purple), Gemma (green), Phi (red), and Qwen (orange). b Performance changes in quantized models compared to their full-precision counterparts (connected by lines).
Supplementary Fig. 4 illustrates the performance of quantized and full-size models on a single LLM architecture (Llama-3.2) across four model sizes of 1b, 3b, 11b, and 90b. The 4-bit quantized version of the 11b model exhibited superior performance compared to the full-size 3b model (44.3% vs 35.7%) while requiring equivalent memory sizes. Additionally, the 4-bit quantized version of the 90b model demonstrated enhanced performance relative to the full-size 11b model (61.0% vs 48.7%).
We evaluated two 8-bit quantized fine-tuned models: medicineLLM (based on Llama2-7b) and OpenBioLLM (based on Llama3-8b). Both models exhibited lower performance compared to their 8-bit quantized source models. OpenBioLLM demonstrated inferior performance relative to its source model, Llama3-8b (29% vs. 31%), and medicineLLM performed less effectively than Llama2-7b (27% vs. 30.3%). The meditron-70b-Q4 model, which was trained from inception rather than fine-tuned from a general-purpose LLM, achieved an accuracy of 29.7%.
LLM performance stratified by question type
We utilized the 2022 ACG dataset to investigate the impact of question characteristics on LLM performance. Figure 4 illustrates performance for top performers of each model family, stratified by these characteristics. No significant differences were observed based on the presence of images (Supplementary Fig. 5a–c), question subject (Supplementary Fig. 6a–k), inclusion of laboratory data (Supplementary Fig. 7a, b), or the patient care phases (Supplementary Fig. 8a–e) in the questions. Unsurprisingly, performance declined across all LLMs with increasing question difficulty (Supplementary Fig. 9a–d). For instance, GPT-4o’s performance decreased from 88% on easy (human average performance: quartile 4) questions to 74% on medium difficulty (quartile 2 and quartile 3) questions and 58.7% (quartile 1) on challenging questions. Question length did not show a clear relationship with performance (Supplementary Fig. 10a–c). Due to the limited number of questions aimed at lower-level thinking skills (two) and non-case-based questions (three), we did not investigate performance stratified by question taxonomy.
Performance is reported stratified by question topic, text- or image-based format, question length, patient care phase, inclusion of laboratory data in questions, and difficulty (Q1 represents challenging questions based on the average percentage of humans answering correctly, and Q4 represents easy questions). The bars represent the percentage of accurate answers with 95% confidence intervals estimated using the bootstrapping method.
VLM performance on image-inclusive questions
Figure 5 illustrates VLM performance when answering MCQs across four different scenarios: no image data, VLM-generated text descriptions of the images, direct images, and human-generated text descriptions. Supplementary Fig. 11 shows GPT-4 and Claude-3-Opus performance within the API and Web environments. Providing images directly to VLMs did not significantly improve performance over the no-image baseline, except for Claude3-Sonnet (20.2% increase, p = 0.001) and Llama3.2-11b (8.7% increase, p = 0.047). For other models, no substantial improvements were observed: GPT4 Web showed a marginal 2.9% increase, GPT4 API experienced a 3.5% decrease, Claude3-Opus-Web showed no change, Claude3-OpusV-API had a 0.7% increase, GeminiAdvancedV-Web recorded a 4.7% decrease, Phi-3.5-4b displayed a 4.4% increase, and Llama3.2-90b showed a 4.3% decrease.
Models are clustered and arranged by four scenarios: no image (baseline), VLM caption + question, image + question, and human description + question. Stacked bars show response categories: correct, 2-option selection (2OP), external option selection (EOP), no option selection (NOP), errors, and incorrect responses. Chi-square tests were performed to compare each scenario against the no-image baseline, with p-values reported adjacent to the corresponding bars. For models tested through both API and web interface, the API version is used in this figure due to similar performance between interfaces.
Generating VLM descriptions and then supplying the descriptions along with the question text did not improve the performance of GPT4V, GeminiAdvancedV, or Claude3-OpusV and worsened Llama3.2-11b performance (3% decrease). In contrast, providing a one-sentence human hint that captured the core information from the image, without directly providing the answer, resulted in performance improvements for all the models, except for GPT4V-API and Phi-3.5-4b, which showed similar performance. Human image guidance resulted in improvements ranging from 8.0% (95% CI: 0.65%, 15.35%) for GPT4V-Web to 29.3% (95% CI: 21.86%, 36.8%) for GeminiAdvancedV-Web. Interestingly, Claude-3-Sonnet performance did not improve with human hints. These findings suggest that current VLM gastroenterology image comprehension and reasoning capabilities are poor.
VLM performance stratified by question type
Figure 6 illustrates the stratified performance of VLMs across different image types for the two models that demonstrated statistically significant improvements with direct image input: Llama3.2-11b and Claude-3-Sonnet. Comparative performance metrics for GPT-4 and Claude-3-Opus are detailed in Supplementary Fig. 12. The figure delineates baseline performance without image input, the performance change with direct image input, and the maximum performance achieved with human-generated descriptions. Performance varied drastically across models and image modalities (Supplementary Fig. 13a–e). For example, with 38 endoscopy images, GPT-4’s performance remained unchanged, while Llama3.2-11b and Claude-3-Opus demonstrated performance gains of 5–10%. In contrast, for eight manometry images, GPT-4 and Llama3.2-11b exhibited a 20% improvement, whereas Claude-3-Opus showed a lack of understanding, reflected in a 10% decrease in performance when images were provided.
Two VLMs, a Llama-3.2-11b and b Claude-3-Sonnet, are arranged by three scenarios. The baseline performance is presented in the left column, while the change in performance after providing the image directly and the human description is displayed in the middle and right columns, respectively. The bars represent the percentage of accurate answers with 95% confidence intervals estimated using the bootstrapping method. Performance is reported stratified by question topic, text- or image-based format, question length, patient care phase, inclusion of laboratory data in questions, and difficulty (Q1 represents challenging questions based on the average percentage of humans answering correctly).
GPT model performance over time: model generation and question years
LLM performance remained relatively consistent across the 2021, 2022, and 2023 versions of the examination, with GPT-4-0613 scoring 71.59%, 65.38%, and 70.35%, respectively. GPT-3.5-0125 scored 44.57%, 46.30%, and 53.49%, respectively (Fig. 7a). As illustrated in Fig. 7b, model performance was impacted more by the introduction of a new model architecture (i.e., GPT-4o) than by the addition of more recent training data (i.e., successive versions of GPT-4).
Auxiliary results
During our setup experiments, we noted inconsistent model outputs across multiple model runs. To explore the impact of key model settings on model consistency, we measured differences in model accuracy over 3 runs (Fig. 8). We found that a fixed random seed, a temperature of zero, and simpler prompts led to the most consistent outputs. Notably, an increase in consistency did not necessarily lead to improved accuracy.
As a case study, we selected a medium difficulty question consisting of 217 tokens (139 words, 793 characters) that focused on the management of liver disease. After incorporating our prompt-engineered command, the total model input increased to 311 tokens (212 words, 1302 characters). Supplementary Table 7 details the execution time and cost per token and per execution, along with the minimum required infrastructure for all the setups in Experiment 1.
Discussion
Our study aimed to assess the performance of LLMs and VLMs in gastroenterology via a reproducible methodology, providing initial insights for future applications of these tools. The capabilities of LLMs have demonstrated remarkable progress, with accuracy rates rising from 12% (base GPT-3 architectures) to an impressive 82% (o1). Notably, Claude 3 Opus, the most advanced model in the Claude 3 family at the time of publication, achieved a performance of 66%, equal to that of GPT-4.
We found that open-source models underperformed proprietary models. Llama3.3-70b achieved 64% accuracy, reaching GPT-4 performance, followed by Mixtral 8x7b with 55% accuracy and the smaller Llama3.2-3b and -11b models with 35.7% and 48.7% accuracy, respectively. While the tested open-source models performed worse than commercial models, they can be run locally and offer multiple advantages in the medical field, including data privacy and model stability23. However, running these LLMs locally requires substantial infrastructure that can be challenging to scale.
For example, even a relatively small model with 7 billion parameters requires approximately 15 gigabytes of memory, not including the memory consumed by the operating system24. Larger models, with 13 billion or 70 billion parameters, demand approximately 30 GB and 150 GB of memory, respectively. Moreover, achieving acceptable inference times (i.e., the time taken to process a prompt and generate a response) necessitates processing power that can only be provided with expensive high-performance computers. Using cloud-based services can avoid the need to buy this infrastructure, but this approach is also limited by data privacy and cost considerations, leaving users with two options: either investing significantly in the necessary infrastructure or opting for smaller models that require less memory.
Quantization offers a potential solution by reducing the model size at the cost of some performance loss. A detailed explanation of this process is beyond the scope of our study (for more information, see Huggingface24 and Jacob et al.25). In essence, quantization is the process of compressing a full-precision model (with 32-bit weights) to a lower-precision version, such as 16-bit (Q16) or 8-bit (Q8), resulting in models that are half or a quarter of the original size. This reduces the computational cost of LLMs, enabling them to run on consumer-grade computers. The accessibility of this approach allows users to maintain full control over their data, thereby safeguarding patient information. However, the impact of this optimization on model reasoning has not been explored.
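As a rough, back-of-the-envelope illustration (weights only; real deployments add overhead for activations, context, and the runtime itself), the relationship between parameter count, bit width, and memory can be sketched as follows:

```python
def approx_weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Rough estimate of the memory needed to hold the model weights alone."""
    bytes_per_weight = bits_per_weight / 8
    return n_params_billion * 1e9 * bytes_per_weight / 1e9  # convert bytes to gigabytes

# A 7-billion-parameter model at different precisions (weights only, no runtime overhead):
for bits in (32, 16, 8, 4):
    print(f"7B model at {bits}-bit: ~{approx_weight_memory_gb(7, bits):.1f} GB")
# Prints roughly 28.0, 14.0, 7.0, and 3.5 GB, respectively.
```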
In our investigation, we assessed the performance of seven quantized models running on a consumer-grade laptop with 16GB memory and six models running on a high-performance computing cluster (H100). Phi3-14b-Q6 demonstrated the best performance, achieving an accuracy of 48.7%. With the exception of a performance decrease from 43% to 31% for the Llama3-8b-Q8 model, all quantized models maintained results comparable to those of their full-size counterparts (Fig. 3b).
We found that healthcare fine-tuned versions of Llama2-7b (medicineLLM) and Llama3-8b (OpenBioLLM) underperformed their general-purpose counterparts. The performance of the specialized medical language model Meditron-70b-Q4 was also found to be suboptimal. This observation may underscore the inherent challenges in developing a domain-specific LLM from inception, as opposed to adapting a general-purpose LLM. These challenges are manifest in various aspects, including the optimization of architecture and model design, as well as the acquisition and utilization of data in expert domains.
Previous studies of nonmedical tasks have indicated that LLM performance remains stable after 8-bit quantization26,27. Our findings align with these studies, as three out of four quantized models showed minimal to no drop in performance. Comparing Phi3-3b-Q16 (42%) and Phi3-14b-Q6 (48%), it appears that using larger quantized models may be preferable to running full-precision versions of smaller models. However, these conclusions are based on the narrow context of our quantization experiments. Future studies should consider the impacts of machine configurations, different quantization levels, and performance across different tasks and domains, including multimodal models.
Our evaluation of leading VLMs, including GPT-4, Claude3-Opus, GeminiAdvanced, Llama 3.2, and Phi3.5, revealed a critical performance gap in vision and vision-language processing. Counterintuitively, accuracy either remained unchanged or worsened when images from the question stems were provided directly or when VLM-generated captions of those images were supplied. Performance only improved when one-sentence human hints were provided, suggesting that the primary performance bottleneck lies in how VLMs process and represent image data rather than in their text-based understanding of image-related concepts. This finding is corroborated by a related study of GPT-4V’s clinical reasoning, which demonstrated a substantial performance drop from 80% to 58% when images were included with clinical text28.
Several factors likely contribute to this performance limitation. While the exact training data for many models are not publicly known, they likely do not contain high-quality multimodal training data for expert-level clinical reasoning in gastroenterology. This limits the models’ ability to create accurate joint visual and text representations of clinical imaging concepts. There are also likely fundamental limitations in how VLMs reason about relationships between visual and textual elements, which are exacerbated by the complex and multimodal nature of medical reasoning. Furthermore, VLMs typically downscale high-resolution images during processing, potentially losing critical fine-grained details that are especially crucial in medical contexts.
Evaluating the accuracy of VLMs and LLMs via MCQs is only the initial step in their overall clinical benchmarking. While MCQs enable quick accuracy evaluations, they have several limitations. First, the real-world application of LLMs is more dynamic and involves complex interactions with physicians, patients, and other members of the healthcare system. Second, the role of LLMs extends beyond merely answering clinical scenario questions; they can extract data, assist in documentation, summarize information, and much more29. Third, evaluating LLM responses requires a nuanced approach to fully characterize model capabilities. Even when an LLM selects the correct option, it may exhibit hallucinations in its response. Conversely, an incorrect answer might still include partially correct information that is still meaningful.
Beyond the limitations of using MCQs, LLMs should be evaluated using additional metrics beyond mere accuracy. For example, addressing the issue of model hallucination requires methods to measure model certainty30. As discussed by Liang et al., these metrics include consistency (providing similar answers in repeated runs), calibration (understanding their uncertainty), efficiency (balancing time, cost, energy, and carbon emissions with accuracy), robustness (handling typos and errors in real-world practice), and fairness4. Our study revealed challenges with consistency and instances of hallucination in language model outputs. We observed a ±5% variation when the same experiment was repeated with GPT-3.5, which aligns with our previous findings on model uncertainty in LLM performance31. Many user guides suggest setting lower temperatures to produce more deterministic outputs. OpenAI has introduced a “seed” model setting to further increase response consistency. Our initial results indicate that while setting a seed can lead to more reproducible outputs, the model’s behavior with respect to temperature settings remains unpredictable. Furthermore, the optimal setup for achieving the highest accuracy differs from that for achieving the highest consistency. A previous study on USMLE soft skills revealed that GPT-3.5 and GPT-4 changed their answers in a follow-up question in 82% and 0% of cases, respectively8. These findings suggest that future investigations are warranted to explore LLM consistency and uncertainty further.
An organizational effort is needed to develop an automated, validated, and reproducible pipeline for evaluating LLMs and VLMs in medicine. While our efforts to create such a pipeline represent progress, they also highlight the need for collaboration among multiple parties and governing bodies. The primary reason for using MCQs and focusing on accuracy in this study, as well as in much of the existing research, is the lack of access to a more suitable validation dataset4. Ideally, this dataset should reflect the real-world usage of LLMs in medicine and gastroenterology and encompass metrics beyond accuracy.
Several considerations should be taken into account when interpreting the results of this study. Although we integrated a setup experiment into the pipeline, we limited our configuration to OpenAI models and reused the best model settings for other LLMs. A more effective approach may be to create unique sets of prompts and model settings for each individual LLM. Additionally, the use of quantized models in local environments is a nascent field, with many libraries and modules still in beta development. This is also true for the quantized models accessed through the HuggingFace portal. We have already noted the potential information loss when evaluating LLM responses as simply correct or incorrect. Another limitation of our study is the limited number of experimental runs. Future research would benefit from conducting multiple runs and reporting the average and SD, as the SD can provide insights into model consistency. Furthermore, our benchmark is limited to questions pertaining to gastroenterology. While some other related topics are involved, the results may not be fully generalizable to the entire field of medicine. Finally, given resource constraints, we focused on impactful, open-source models—prioritizing those accessible via platforms like HuggingFace—to support reproducibility. We also evaluated quantized versions to assess their clinical applicability, particularly for local deployment where privacy and efficiency are key.
Our findings demonstrate that LLMs can answer gastroenterology board exam questions with a high degree of accuracy. Open-source models have progressively smaller performance gaps compared to their proprietary alternatives. Despite these promising results, practical deployment requires balancing data privacy, infrastructure costs, and model optimization strategies. Our findings also highlight the current limitations of VLMs in interpreting gastroenterology images, underscoring the need for refined architectures and training data, higher-resolution inputs, and additional contextual cues. While MCQs offer a quick means to gauge LLM and VLM accuracy, more nuanced evaluation metrics are essential to address issues like hallucinations, uncertainty, and real-world applicability. Ultimately, broader validation datasets, diverse benchmarks, and collaborative standard-setting among medical stakeholders will be crucial to safely and effectively integrate these AI-driven tools into clinical practice.
For clinicians, these results suggest that LLMs have the potential to support medical education and clinical decision-making; however, their application should be interpreted cautiously. While LLMs achieve high accuracy on MCQs, clinical reasoning in practice is more complex, requiring dynamic patient interactions without predefined options. Current models serve as a baseline for performance, but their safe implementation necessitates additional validation, integration with clinical workflows, and safeguards against erroneous outputs. Furthermore, the limited efficacy of VLMs in interpreting medical images highlights the need for continued refinement before they can be relied upon for diagnostic support.
For developers, these findings underscore the trade-offs between model performance, privacy, and computational efficiency. Open-source models approximate the capabilities of proprietary alternatives with a predictable delay of one or two iterations, allowing for privacy-preserving AI applications at the cost of marginally lower performance and increased resource demands. Moreover, quantized large models outperform smaller full-precision models while maintaining comparable storage efficiency, emphasizing the value of optimization techniques. Despite LLMs’ strong performance on MCQs, real-world deployment demands further refinement, particularly in mitigating hallucinations and enhancing explainability. Meanwhile, VLMs require targeted enhancements, including high-resolution data inputs and domain-specific fine-tuning, to achieve clinical-grade reliability.
Methods
Dataset characteristics
To evaluate gastroenterology reasoning capabilities, we employed a comprehensive set of gastroenterology board exam-style questions sourced from the American College of Gastroenterology (ACG) self-assessment examinations. These self-assessments are designed to represent the essential knowledge, skills, and attitudes required for delivering excellent patient care in the field of gastroenterology. Importantly, access to these examinations and their corresponding answers is restricted by a paywall. This limitation presumably precludes the inclusion of this content in LLM training datasets, thereby enhancing the validity of utilizing this dataset for evaluation purposes.
We used the 2021, 2022, and 2023 editions of the ACG-SA. Each edition has 300 questions with three to five answer choices and a single best answer. There were 124, 138, and 128 questions with images from the 2021, 2022, and 2023 exams, respectively. These image questions incorporated between one and four visual elements, encompassing endoscopy, radiology, histology, or physical examination images. The dataset examines the knowledge competencies of gastroenterologists.
Question and exam context
The Gastroenterology Board Exam is a rigorous and essential assessment that gastroenterologists must pass to become board-certified in their specialty. Administered by the American Board of Internal Medicine, this exam tests a candidate’s knowledge, practical skills, and critical thinking abilities in gastroenterology. The exam ensures that practitioners meet the high standards required to provide quality care in this complex field. Test takers undergo a rigorous educational pathway before taking the exam, including four years of undergraduate study, four years of medical education, three years of internal medicine residency, and three years of gastroenterology fellowship. By passing this exam, these physicians achieve board certification and affirm their expertise in providing high-quality care in gastroenterology32.
The American College of Gastroenterology (ACG) Board Prep/Self-Assessment Test (SA) was developed meticulously by an ACG committee with contributions from most postgraduate course faculty members. This comprehensive test consists of approximately 300 questions annually. These questions, presented in single best answer and true/false formats and often supplemented with photos and illustrations, come with in-depth answers and referenced annotations. It offers an essential review for both graduating fellows preparing for their GI Boards and seasoned practitioners seeking a comprehensive clinical update.
Overview of experiment design
Our analysis of LLM performance in gastroenterology reasoning was structured into several experiments, as summarized in Fig. 9. In Experiment 0, we used a subset of our dataset and GPT-3.5 to determine the best performing model settings (function, prompt, temperature, and max-token) for our subsequent experiments.
In Experiment 1, we evaluated the performance of all LLMs on both 162 text-only and 138 image-inclusive questions in the 2022 ACG-SA. In Experiment 2, we assessed the performance of VLMs on the 138 image-based questions from the 2022 ACG-SA. We compared baseline performance (without image information) to performance with accompanying LLM-generated captions or human-generated captions.
Additionally, we evaluated the historical trends of GPTs across different model versions to analyze the impact of model evolution on performance, as well as performance on different editions of ACGs from 2021 to 2023. This served as an indirect sensitivity analysis, as any degradation in performance on the recently published 2023 dataset could suggest the inclusion of previous datasets in the training material.
Experimentation environment and local computation
We operationalized “environments” as distinct computational interfaces facilitating model interaction and response capture. Five environments were evaluated: (1) proprietary web-chat interfaces for OpenAI and Claude models, (2) application programming interfaces (APIs) for OpenAI and Claude implementations, (3) Poe third-party web-chat platform for open-source model access, (4) consumer-grade laptop for quantized and small-size models, and (5) local high-performance server.
Local LLM responses were generated using a Lenovo 7 16IAX7 laptop featuring a 12th Gen Intel Core i9-12900HX processor with 16 cores, 32 GB DDR5 RAM, and an NVIDIA RTX 3080 Ti GPU equipped with 16 GB graphics memory. For high-performance computations, a dedicated Linux server was utilized, incorporating encrypted data transfer procedures. This server is equipped with an H100-PCIe GPU with 80 GB VRAM, 62-core CPUs, and 164 GB RAM.
API interactions utilized Python-based programmatic control, enabling precise manipulation of prompt structure, temperature, token limits, random seeds, and function calling capabilities (parameter definitions provided in Supplementary Table 2). Ollama version 0.5 and LM Studio version 0.2 were used to provide local API endpoints for managing responses. Web-based evaluations employed Python-generated input sequences that were manually transferred to the interface, with complete response capture executed without intermediate human interpretation, ensuring methodological consistency across platforms.
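As an illustration of this setup, a locally served quantized model can be queried through Ollama's default REST endpoint roughly as follows; the model tag and option values are illustrative, and exact parameters vary by Ollama version:

```python
import requests

def ask_local_model(prompt: str, model: str = "llama3.1:8b-instruct-q8_0") -> str:
    """Send a prompt to a model served locally by Ollama and return the raw text response."""
    payload = {
        "model": model,                     # illustrative tag; use whichever model is pulled locally
        "prompt": prompt,
        "stream": False,                    # return the full response in a single JSON object
        "options": {"temperature": 1.0, "num_predict": 512},
    }
    r = requests.post("http://localhost:11434/api/generate", json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["response"]
```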
Experiment zero: parameters
LLM performance demonstrates sensitivity to input prompting and parameters. To establish optimal configurations for analyzing LLM reasoning performance, we systematically evaluated four key parameters: output format (structured versus unstructured), prompt engineering strategies (simple prompts, 11 individual techniques, and 6 combinatorial approaches), temperature settings (0, 0.2, 0.4, 0.6, 0.8, 1.0), and maximum token limits (512, 1024, and adaptive 512+).
Our output format evaluation compared unstructured text generation with structured implementations through OpenAI’s native function calling capability and Langchain’s output parser framework (Supplementary Table 3). Structured output, also known as a function call, enables models to return dictionary-like objects rather than raw text, facilitating systematic response parsing. These implementations inject formatting instructions into the message stream and parse generated responses to extract structured parameters. Langchain enforces output conformity through explicit format constraints, ensuring reliable JSON extraction regardless of underlying model architecture. For models lacking native structured output support—including web-based interfaces, Poe environment implementations, local deployments, and legacy GPT variants (instruct-GPT, Babbage, and DaVinci)—we processed unstructured text outputs using extraction methods detailed in the Evaluation of Accuracy section.
The prompt is the instruction given to the LLM to elicit answers. Our prompt optimization employed a three-phase approach. Initially, we implemented a baseline prompt (“select the correct option from the provided options”). Subsequently, we evaluated 11 distinct prompt engineering techniques independently, then synthesized six optimal prompt combinations through an automated GPT-4 API pipeline (Supplementary Table 4). All prompts were delivered as single-request messages without system-level instructions, as complete context was embedded within each query.
Briefly, the prompt engineering techniques include: (1) Direct Questioning, a straightforward query of the question and answer choices; (2) Option Analysis, the separate evaluation of each choice for the strongest justification; (3) Chain of Thought, prompting the AI to detail its reasoning steps; (4) Answer and Justify, requesting both a final answer and a supporting explanation; (5) Elimination Method, discarding obviously incorrect options first; (6) Comparative Analysis, weighing each remaining choice against the others; (7) Contextual Embedding, providing additional relevant information to guide the AI; (8) Confidence Scoring, where the AI rates its certainty for each option; (9) Expert Mimicry, posing the prompt as though an authority in the field is responding; (10) Consensus Technique, repeating and merging inputs to find the most agreed-upon answer; and (11) Give Model Time to Think, allowing space for deeper reflection before finalizing a response.
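The verbatim wording of each technique is listed in Supplementary Table 4; purely to illustrate how such variants can be templated and swapped programmatically, a sketch with paraphrased (not verbatim) instructions might look like this:

```python
# Paraphrased, illustrative instructions; the study's exact wording is in Supplementary Table 4.
PROMPT_TECHNIQUES = {
    "baseline": "Select the correct option from the provided options.",
    "answer_and_justify": "Select the single best option and briefly justify your choice.",
    "expert_mimicry": "Respond as a seasoned gastroenterologist would, explaining your reasoning.",
    "confidence_scoring": "Select the best option and rate your confidence from 1 to 10.",
}

def build_prompt(technique: str, question_stem: str, options: str) -> str:
    """Prepend the chosen technique's instruction to the question and options."""
    return f"{PROMPT_TECHNIQUES[technique]}\nQuestion:\n{question_stem}\nOptions:\n{options}"
```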
The temperature controls the randomness of the model’s response. A lower temperature (close to 0) results in more deterministic and repetitive outputs, with a higher likelihood assigned to the most likely words. Conversely, a higher temperature leads to more random and diverse outputs, which can enhance creativity but also increase the probability of nonsensical or irrelevant responses. We evaluated six temperature levels of 0, 0.2, 0.4, 0.6, 0.8, and 1.0 (Supplementary Table 5).
The Maximum Token Count (max_token) determines the total allowable length of the input and output for the LLMs. Maximum token limits were tested at three configurations: fixed 512, fixed 1024, and adaptive (512 plus input length), the latter ensuring consistent response space independent of question complexity. Response truncation at these boundaries represented a recognized technical constraint (Supplementary Table 6).
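A minimal sketch of the adaptive limit, assuming input tokens are counted with the tiktoken library (used elsewhere in our pipeline):

```python
import tiktoken

def adaptive_max_tokens(full_input: str, model: str = "gpt-3.5-turbo", headroom: int = 512) -> int:
    """Return the input token count plus a fixed 512-token response budget."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(full_input)) + headroom

# Example: limit = adaptive_max_tokens(prompt + question_stem + options)
```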
Experiment Zero: parameter optimization
Parameter optimization employed a sequential evaluation protocol using GPT-3.5 (GPT-3.5-turbo-0125; March 2024) on 60 randomly sampled questions from 2021 to 2023 datasets. The sequential approach enabled isolation of individual parameter effects through stepwise optimization, with each parameter’s optimal configuration informing subsequent evaluations. The optimization sequence proceeded as follows.
First, output format comparison established whether structured or unstructured generation yielded superior performance. Output format evaluation compared unstructured text generation with structured implementations through OpenAI’s function calling API and Langchain’s output parser framework.
Second, prompt optimization proceeded through three empirical phases. Initial baseline assessment employed standard instruction syntax (“Answer the following question and select one option from the provided options”). Subsequently, eleven discrete prompt engineering techniques derived from established literature were evaluated independently through triplicate runs33,34,35. Techniques demonstrating mean performance exceeding the 95% confidence interval upper bound of baseline prompts qualified as effective interventions. Following identification of effective techniques, we synthesized six optimal prompt candidates through automated GPT-4 pipeline integration, implementing the methodology of Zhou et al. 36. Three additional variants emerged through GPT-4 refinement using domain-specific optimization criteria. Final prompt selection derived from averaged performance across all candidates, encompassing baseline, eleven individual techniques, and six combinatorial approaches tested across varying token allocations.
Third, temperature parameter calibration examined values of 0, 0.2, 0.4, 0.6, 0.8, and 1.0 to regulate response stochasticity.
Fourth, token allocation assessment determined optimal length constraints across the three configurations: fixed 512 token; fixed 1024 token; and adaptive, calculated as 512 plus input token count. The adaptive strategy maintained consistent response space independent of question complexity, acknowledging that total token limits encompass both input components (prompt, question, options) and generated output. This approach addressed variability in question length while ensuring equitable computational resources across diverse query structures.
Statistical validation through triplicate execution confirmed consistency across prompt strategies, establishing robust parameter selection criteria for subsequent experimental phases. This systematic optimization protocol ensures methodological reproducibility while maximizing model performance assessment reliability.
Experiment Zero: optimal parameters
The final optimal parameters, used in the main experiments across different models, were: (1) temperature of 1, (2) max_token equal to the number of input tokens plus 512, (3) use of a function call to generate a structured output, and (4) the following prompt: “Imagine you are a seasoned gastroenterologist approaching this complex question from the gastroenterology board exam. Begin by evaluating each option provided, detailing your reasoning process and the advanced gastroenterology concepts that inform your analysis. The most accurate option is chosen on the basis of your expert judgment, your choice is justified, and your confidence in this decision is rated from 1 to 10, with 10 being the most confident. Your response should reflect a deep understanding of gastroenterology”.
The prompt architecture consisted of three hierarchical components: instruction prompt, question text, and response options, each separated by line breaks for parsing clarity. Implementation diverged based on model capabilities for structured output generation.
For models supporting native function calling (OpenAI GPTs and Claude models), we implemented structured output through schema-defined function calls. The schema specification comprised three elements: a functional description (“You can generate your unstructured output and selected option”), parameter definitions for dual-output architecture, and constraint specifications. The dual-output parameters captured both reasoning process and final selection: “unstructured_output” recorded the model’s reasoning chain with the description “Your unstructured output for the answer, if needed,” while “selected_option” captured the discrete choice with the description “The selected option from the list of provided choices (Example: “F”).” Valid responses were constrained to the enumerated set {“A”, “B”, “C”, “D”, “E”}, enforcing selection validity through schema-level constraints rather than post-processing validation.
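A minimal sketch of this dual-output schema, using the current OpenAI Python client's tool-calling interface, is shown below; the function name (record_answer) and model identifier are illustrative, and the exact call syntax (legacy function calling versus tools) depends on the SDK version:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

answer_tool = {
    "type": "function",
    "function": {
        "name": "record_answer",  # illustrative name
        "description": "You can generate your unstructured output and selected option",
        "parameters": {
            "type": "object",
            "properties": {
                "unstructured_output": {
                    "type": "string",
                    "description": "Your unstructured output for the answer, if needed",
                },
                "selected_option": {
                    "type": "string",
                    "description": "The selected option from the list of provided choices (Example: 'F')",
                    "enum": ["A", "B", "C", "D", "E"],
                },
            },
            "required": ["selected_option"],
        },
    },
}

full_prompt = "<instruction>\n<question text>\n<options>"  # placeholder for the tripartite prompt

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model identifier
    temperature=1,
    messages=[{"role": "user", "content": full_prompt}],
    tools=[answer_tool],
    tool_choice={"type": "function", "function": {"name": "record_answer"}},
)
arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
selected_option = arguments["selected_option"]
```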
For models lacking native structured output capabilities (web interfaces, Poe interface, and legacy GPT variants), the identical prompt structure generated unstructured text requiring extraction algorithms. These models received the same tripartite prompt architecture but produced free-form responses necessitating pattern-matching and heuristic extraction methods detailed in the Evaluation of Responses section.
Experiment Zero: validation of optimal parameters
We utilized GPT-3.5 to determine the optimal prompt and configuration due to its accessibility via API during the experimentation phase and used this setup for subsequent experiments. However, the efficacy of this GPT-3.5-based prompt in enhancing the performance of other LLMs remains uncertain. Consequently, we conducted a validation experiment to assess the performance of both the unmodified prompt and the optimized prompt on models with different architectures and sizes: GPT-4o, Llama-3.1-70b, Llama-3.1-8b, and Phi-3.5-4b. As illustrated in Supplementary Fig. 1, the optimized prompt improved the performance for GPT-4o and Llama-3.1 models, while it yielded comparable performance to the unmodified prompt for Phi-3.5-4b. Although this provides limited validation, further research is necessary to examine the generalizability of a configuration developed for one LLM to another.
Experiment 1: text-based gastroenterology questions
The optimal configuration identified through Experiment Zero was systematically applied across all subsequent model evaluations, with default parameters employed when specific settings remained inaccessible within testing environments. Table 1 presents comprehensive model specifications including version identifiers or repository addresses, date of use, model knowledge cut-off, experiment environment(s), and modifiable parameters.
Experiment 1: quantized models
Quantization is a process that allows large models, which typically require resources beyond a standard computer, to be used on more accessible computers at the cost of lower precision. For example, even a small model with 7 billion (7B) parameters needs approximately 28 GB of memory (7 times 4) for execution, in addition to the memory the system requires for its own operation. Consequently, not only large models such as Mixtral-8x7b (47b) or Llama2-70b, but even small models with roughly 7B parameters, typically need cloud computing or supercomputers. Quantization compresses the model size from 32-bit precision to the desired N-bit precision37. We used an 8-bit quantized version for 7-billion-parameter models (Mistral-7b-Q8, Llama2-7b-Q8, and Medicine-LLM-Q8), which requires approximately 8 GB of RAM, and a Q5_K_M version for 13-billion-parameter models, which requires approximately 12 GB of RAM.
Experiment 2: image-based gastroenterology questions
For experiment 2, we assessed the visual reasoning capabilities of proprietary and open-source VLMs via image-containing questions from the 2022 ACG-SA. Historical limitations on the availability of image-capable LLMs led previous studies to exclude questions containing images, focusing instead on assessing the models’ text-based reasoning capabilities20,21,22. Our dataset, however, contains a considerable number of questions accompanied by images.
We interacted with the web interfaces of GPT-4, Claude3-Opus, and Gemini Advanced and the API interfaces of GPT-4 and Claude3-Opus, along with three open-source models: Phi-3.5-4b, Llama-3.2-11b, and Llama-3.2-90b. We assessed performance across four distinct scenarios to capture different aspects of visual understanding.
(1) No-image: This scenario excludes the image and provides only the question text to the VLM, representing baseline VLM performance without any image information. The prompt was the same as in Experiment 1: “Imagine you are a seasoned gastroenterologist approaching this complex question from the gastroenterology board exam. Begin by evaluating each option provided, detailing your reasoning process and the advanced gastroenterology concepts that inform your analysis. The most accurate option is chosen on the basis of your expert judgment, your choice is justified, and your confidence in this decision is rated from 1 to 10, with 10 being the most confident. Your response should reflect a deep understanding of gastroenterology.” + “\nQuestion:\n{question_stem}” + “\nOptions:\n{options}”.

(2) Human-generated description: This scenario provided VLMs with human-crafted image descriptions. Two independent gastroenterologists created these descriptions, aiming to convey comprehensive visual information while avoiding high-level classifications or specific terminology that might inadvertently reveal the answer. For instance, a perianal abscess image was described as “a small perianal bulge with overlying erythema and induration,” omitting diagnostic terminology. Discrepancies between evaluators were resolved through consensus discussion. The instructional prompt was structured as follows: “Imagine you are…” + “You lack direct access to the image. Instead, a human has provided relevant information about the image in the context. You may use this hint if helpful; otherwise, base your answer solely on the available information.” + “\nQuestion:\n{question_stem}” + “\nOptions:\n{options}” + “\nHuman-generated image caption: {human_generated_caption}”. This configuration represents theoretical maximum VLM performance, simulating conditions where complete visual information is textually accessible without direct image processing requirements.

(3) VLM-generated description: This scenario evaluated VLMs’ capacity to generate and subsequently utilize their own image descriptions. Each VLM first produced a caption for the medical image in an isolated instance, then received this self-generated description as textual input in a separate evaluation session. The instructional prompt was structured as follows: “Imagine you are…” + “You lack direct access to the image. Instead, another LLM has provided relevant information about it. You may use this caption if helpful; otherwise, base your answer solely on the available information.” + “\nQuestion:\n{question_stem}” + “\nOptions:\n{options}” + “\nLLM-generated image caption: {vlm_generated_caption}”. This configuration quantifies the VLM’s dual capacity for image-to-text translation and text-based reasoning, revealing how visual information is internally represented and subsequently processed within the model’s inference pipeline.

(4) Direct image: This scenario evaluated VLMs’ native visual processing capabilities through unmediated image-question pairing. Each VLM received the medical image directly alongside textual query components, enabling simultaneous multimodal processing. The instructional prompt was structured as follows: “Imagine you are…” + “Images have been provided for your reference. You may use the information in the images if helpful; otherwise, base your answer solely on the available information.” + “\nQuestion:\n{question_stem}” + “\nOptions:\n{options}” + [direct image input]. This demonstrates the VLM’s ability to interpret the image directly alongside the question text. A sketch of how these four prompt variants can be assembled programmatically follows this list.
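Purely as an illustration of how the four scenario prompts above can be composed (the instruction wording is quoted from the list; the helper names are ours and not part of the study code):

```python
BASE_INSTRUCTION = "Imagine you are a seasoned gastroenterologist ..."  # optimized Experiment 0 prompt, truncated here

SCENARIO_PREFIXES = {
    "no_image": "",
    "human_caption": ("You lack direct access to the image. Instead, a human has provided relevant "
                      "information about the image in the context. You may use this hint if helpful; "
                      "otherwise, base your answer solely on the available information."),
    "vlm_caption": ("You lack direct access to the image. Instead, another LLM has provided relevant "
                    "information about it. You may use this caption if helpful; otherwise, base your "
                    "answer solely on the available information."),
    "direct_image": ("Images have been provided for your reference. You may use the information in the "
                     "images if helpful; otherwise, base your answer solely on the available information."),
}

def build_scenario_prompt(scenario: str, question_stem: str, options: str, caption: str = "") -> str:
    """Assemble the text portion of the prompt for one of the four VLM scenarios."""
    parts = [BASE_INSTRUCTION, SCENARIO_PREFIXES[scenario],
             f"\nQuestion:\n{question_stem}", f"\nOptions:\n{options}"]
    if scenario == "human_caption":
        parts.append(f"\nHuman-generated image caption: {caption}")
    elif scenario == "vlm_caption":
        parts.append(f"\nLLM-generated image caption: {caption}")
    return " ".join(p for p in parts if p)

# For the direct-image scenario, the image file is attached to the API request separately.
```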
In some cases, VLMs avoided providing captions. We tagged a caption as failed when the VLM declined to provide it after three attempts. GPT4V-Web, GPT4V-API, Claude3-OpusV-Web, Claude3-OpusV-API, and GeminiAdvancedV-Web had 0, 19, 2, 3, and 88 failed captions, respectively.
Question type and stratified performance analysis
We reported the performance stratified by question type to identify areas where the LLMs’ answers were less accurate. On the basis of the percentage of test-takers who answered a question correctly, we classified the questions into four difficulty quartiles. We used simplified Bloom’s Taxonomy to classify questions on the basis of the cognitive level tested (lower-order, higher-order, case-based, and integrated)38. The question phase of care was defined as diagnosis, treatment, investigation, management of complications, or pathophysiology. The question topic categories were provided by the ACG. The question token length was determined via the tiktoken Python library and categorized into tertiles of short, medium, and long, as detailed in the Question length section below.
Gastroenterology Subject Category: The 2022 ACG-SA was our main databank for generating answers. The dataset includes 49 questions on the colon, 26 on endoscopy, 36 on the esophagus, 30 on inflammatory bowel disease (IBD), 14 on irritable bowel syndrome (IBS), 52 on the liver, 19 categorized as miscellaneous, 3 on nutrition, 32 on pancreaticobiliary, 22 on the small bowel, and 17 on the stomach.
Cognitive Taxonomy (Bloom’s Taxonomy): In this study, we employed a modified Bloom’s taxonomy to categorize MCQs on the basis of their required cognitive processes, as introduced by Nikki Ziadi38. This method classifies questions into two levels: lower-order thinking skills (Level 1) and higher-order thinking skills (Level 2). Level 1 questions necessitate the recall or comprehension of learned information without applying a concept, exemplified by questions such as “Which of the following drugs can cause pain, vomiting, and jaundice?” Level 2 questions demand the application of prior knowledge to new information, involving analysis, synthesis, and evaluation. An example is “A 12-year-old girl with sickle cell disease has pain in her right arm. X-ray revealed bony lesions consistent with osteomyelitis. Which of the following is the most likely causal organism?” This categorization allows for a nuanced assessment of LLM capabilities across different cognitive domains.
One author (SAASN) labeled questions via the modified Bloom’s taxonomy guide38. In the 2022 ACG-SA, only 2 questions targeted lower-order thinking skills. This is expected, as the self-assessment is designed for the gastroenterology board exam, a complex and high-level evaluation. Additionally, 297 questions were case-based, presenting clinical or laboratory scenarios, and 92 questions were integrated, combining multiple domains and disciplines into single test items.
Patient Care Phase and Laboratory Test: One author (SAASN) labeled the questions on the basis of their aim to test one or more steps in the patient care process. The questions aimed at evaluating the patient care phase were categorized into five key areas of gastroenterology: 123 questions focused on diagnosis, 217 focused on treatment, 211 focused on investigation, 55 focused on complications, and 3 focused on pathophysiology. Additionally, 170 questions included laboratory results in the question stem or options.
Difficulty using Test-Taker Scores: The question difficulty was defined as the percentage of respondents who answered the question correctly. The average scores for the human test takers were kindly provided by the ACG upon request. To evaluate performance on the basis of question difficulty, we stratified the questions into 4 quartiles. The questions were classified into Q1 (12.75%–64.92%), Q2 (64.93%–79.23%), Q3 (79.23%–89.44%), and Q4 (89.45%–99.21%), each consisting of 75 questions.
Question length
The sum of the question and option tokens was calculated via the tiktoken Python library. A token is the smallest unit of text processed by LLMs; each token is mapped to a numerical representation during embedding. The questions were categorized into three tertiles: short (49–179 tokens), medium (180–262 tokens), and long (263–588 tokens), which resulted in 99, 99, and 102 labels, respectively.
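A minimal sketch of this tertile assignment, assuming token counts from tiktoken and percentile boundaries from numpy:

```python
import numpy as np
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def length_tertiles(question_texts: list[str]) -> list[str]:
    """Label each question (stem plus options) as short, medium, or long by token-count tertile."""
    counts = np.array([len(enc.encode(text)) for text in question_texts])
    t1, t2 = np.percentile(counts, [100 / 3, 200 / 3])
    return ["short" if c <= t1 else "medium" if c <= t2 else "long" for c in counts]
```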
Image modality
The dataset includes a diverse range of imaging and diagnostic modalities with their respective frequencies of usage. Computed Tomography (CT) was the most common modality, appearing 63 times, followed by Endoscopy with 36 occurrences and Ultrasound with 26. Magnetic Resonance Imaging (MRI) was noted 15 times, while Colonoscopy appeared 7 times. Techniques such as Biopsy and X-ray were each recorded 6 times. Flexible Sigmoidoscopy and Manometry both appeared 4 times, with Anorectal Manometry, Capsule Endoscopy, and MRCP each occurring 3 times.
Less frequent methods included Endoscopic, Barium Esophagram, Abdominal X-ray, and EUS, each documented twice. The following modalities appeared only once in the dataset: Radiograph, Pathology Slide, Endosonographic, Fluoroscopy, pH monitoring, Esophageal Manometry, Esophagram, Barium Swallow, Sigmoidoscopy, MR Enterography, Pathology, ERCP, Angiography, MR Angiography, PET CT, PET, Nuclear Medicine, Enterography, and Gastric Emptying Study. This diversity highlights the wide array of diagnostic tools employed across various medical specialities.
Sensitivity experiment: trend of performance
To assess the impact of model updates and training data cutoffs on performance, we conducted an analysis focusing on the OpenAI model family accessed via API; this family was chosen because multiple model versions were available for comparison. The analysis encompassed the following models: GPT-3 (babbage-002, davinci-002), GPT-3.5 (gpt-3.5-turbo-1106, gpt-3.5-turbo-0125), GPT-4 (gpt-4-0613), GPT-4 Turbo (gpt-4-1106-preview, gpt-4-0125-preview), and GPT-4o (gpt-4o-2024-05-13).
Auxiliary experiment: consistency and cost
To investigate the consistency of model outputs, we conducted an experiment with three separate runs via the GPT-3.5 API. Specifically, we examined the effects of a fixed random seed, different temperatures, and prompt complexity on output consistency. We estimated model cost and runtime on the basis of a median question length of 216 tokens. For a representative case study, we selected a medium-difficulty question focused on liver disease, comprising 217 tokens. The question was augmented with a prompt-engineered command (using the same settings as in Experiment 1), resulting in a total input of 311 tokens. Execution time was measured from prompt submission to generation of the final token using the time library where possible, or with a timer for web interfaces. Token counts were managed via the tiktoken library. Per-token costs were sourced from OpenAI and Anthropic as of June 20, 2024. We used paid chat accounts for the ChatGPT, Claude, and Gemini services, each costing $20/month.
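The per-question cost and runtime estimates can be sketched as below; the prices in the dictionary and the 150-token response length are placeholders, not the vendor prices or outputs used in the study.

import time

# Placeholder per-1K-token prices (USD); the study used prices published by
# OpenAI and Anthropic as of June 20, 2024.
PRICE_PER_1K = {"gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015}}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICE_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

# Example: the 311-token engineered input with an assumed 150-token response.
print(f"Estimated cost: ${estimate_cost('gpt-3.5-turbo', 311, 150):.5f}")

# Runtime measured from request submission to final token generation.
start = time.perf_counter()
# response = client.chat.completions.create(...)  # actual API request goes here
elapsed = time.perf_counter() - start
print(f"Execution time: {elapsed:.2f} s")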
Response evaluation pipeline
LLM responses can be structured (e.g., dictionary objects) or unstructured (plain text). For models providing structured output, correctness was evaluated by comparing the LLM answer against the ground truth (the ACG-SA answer key) via Python code. A correct response was defined as the selection of the single correct answer, whereas an incorrect response was the selection of a single incorrect answer.
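For structured outputs, this grading step reduces to a direct comparison against the answer key, as in the minimal sketch below; the field names are assumptions for illustration, not the exact study code.

answer_key = {"q001": "C"}
structured_response = {"question_id": "q001", "selected_option": "C"}

def grade(response: dict, key: dict) -> str:
    selected = response.get("selected_option")
    if selected is None:
        return "needs_manual_review"  # parsing failed; route to human evaluation
    return "Correct" if selected == key[response["question_id"]] else "Incorrect"

print(grade(structured_response, answer_key))  # -> Correct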
For web interfaces and models lacking structured output, identifying the selected option in raw text complicates scalable evaluation. We therefore used a semiautomated approach, briefly summarized in Fig. 10: for models with unstructured responses and for web interface environments, the GPT-3.5 API was used to extract the selected option from the LLM’s textual response.
The GPT-3.5 API, configured for structured output with a temperature of 0 and a maximum of 50 tokens, was used with the following commands:
Schema Description: “Act as a helpful, reliable, and accurate assistant. This response is generated in the context of a multiple-choice question. The LLM was asked to think about each option and then select an option. Your task is to identify the exact chosen option on the basis of the provided text.” + Parameter Description: “The option selected by the LLM.” + Schema Options: “A”, “B”, “C”, “D”, “E”, “No option was selected”, “More than one option was selected”, “I am not sure.” + Available Options List [Option Letter: Option Text] + {LLM answer: {llm_answer}}
The extraction output was either a single option (“A”, “B”, “C”, “D”, or “E”) or a flag marking the question for manual evaluation by a human. In addition, some structured outputs failed to provide a final option; these were also evaluated by a human.
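A minimal sketch of this extraction call is shown below, assuming the openai Python package (v1 client); for brevity, the allowed labels are enforced through the instruction text rather than the structured-output schema described above, and the variable names are illustrative.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ALLOWED = ["A", "B", "C", "D", "E",
           "No option was selected",
           "More than one option was selected",
           "I am not sure"]

def extract_option(llm_answer: str, options_block: str) -> str:
    # Reuses the schema/parameter description as a plain instruction and asks
    # the extractor to reply with exactly one allowed label.
    prompt = (
        "Act as a helpful, reliable, and accurate assistant. "
        "This response is generated in the context of a multiple-choice question. "
        "The LLM was asked to think about each option and then select an option. "
        "Your task is to identify the exact chosen option on the basis of the provided text. "
        f"Reply with exactly one of: {', '.join(ALLOWED)}.\n\n"
        f"Available options:\n{options_block}\n\n"
        f"LLM answer: {llm_answer}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=50,
    )
    return response.choices[0].message.content.strip()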
One evaluator (SAASN), with a medical doctorate degree and 2 years of medical practice, assessed the answers requiring human evaluation. The guidelines for labeling answers are depicted in Supplementary Fig. 14. If the LLM selected one option, that option was recorded as the LLM answer and its correctness was evaluated via Python code. If more than one option was selected and the correct answer was among them, the two-option (“2OP”) label was applied. If no option was selected, or the LLM declined to provide an option, the no-option (“NOP”) label was applied. If the LLM offered an answer outside the provided options, the external-option (“EOP”) label was used. Where the LLM selected an option while expressing caution or uncertainty, the selected option was recorded as the LLM’s choice. The final evaluation categories are listed below (an illustrative scoring sketch follows the list):
Correct: One correct answer is selected.
Incorrect: One or two incorrect answers are selected.
2OP: Two answers are selected, one of which is correct.
EOP: An answer external to the answer choices is selected.
NOP: No answer is selected due to a lack of sufficient information.
Error: Incomplete or unintelligible response after two attempts.
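The sketch below maps an extracted selection and a manual label to these categories; the function signature and label handling are illustrative rather than the exact study code.

def categorize(selected: list[str], correct: str, manual_label: str | None = None) -> str:
    # Manual labels (NOP, EOP, Error) take precedence over extracted options.
    if manual_label in {"NOP", "EOP", "Error"}:
        return manual_label
    if len(selected) == 2:
        return "2OP" if correct in selected else "Incorrect"
    return "Correct" if selected == [correct] else "Incorrect"

print(categorize(["B", "C"], correct="C"))  # -> 2OP
print(categorize(["A"], correct="C"))       # -> Incorrect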
Validation of evaluation pipeline
We conducted 49 LLM runs and 36 VLM runs (9 VLMs in 4 scenarios) on the 300 questions from the 2022 ACG-SA. Eight runs produced structured outputs (four GPT runs and four Claude3 runs); among these, 215 of 2400 responses failed to parse correctly and required manual evaluation. Across the 77 runs with standard (unstructured) outputs via Poe, web interfaces, and local LM Studio servers, GPT-3.5 extracted 23,100 responses, 236 of which required human validation. In total, our semiautomated evaluation approach required manual evaluation of 451 of the 25,500 generated answers.
To evaluate the performance of GPT-3.5 in extracting selected options from raw responses, we created a dataset comprising 100 answers from various environments. Supplementary Fig. 15 illustrates the performance of the option-extraction workflow on the sampled data. GPT-3.5 did not return an option in 17 instances (3 extraction errors and 14 flagged for manual review), correctly extracted the option in 82 instances, and extracted an incorrect option in 1 instance, yielding an extraction accuracy of 98.8% (82/83). Accounting for the 17% of cases flagged for manual evaluation, the overall accuracy of our semiautomated approach reaches 99%.
Libraries
We used Python version 3.11 for API interactions and Langchain (0.1.17 and 0.2.4) to standardize them. To classify and troubleshoot LLM responses, we created a human-evaluation application using the Streamlit (1.33) library. Communication with the OpenAI models was performed via the OpenAI Python package. To interact with the Claude3 models, we provided XML-formatted requests to the API. Local models were run on a laptop equipped with an RTX 3080 Ti GPU (16 GB VRAM) and a Core i9-12900 CPU (32 GB DDR5 RAM), using a local server built with the llama.cpp library and LM Studio version 0.2; these runs offloaded half of the model layers (typically 16) to the GPU and used 10 CPU threads. A private server with an H100 GPU (80 GB VRAM, PCIe) and 16 vCPUs (188 GB RAM) was also used, with Ollama version 0.5 employed for those experiments. The tiktoken library was used to count the tokens in the question stem and options, particularly in scenarios with a maximum token limit above 512. We used asyncio to send parallel requests whenever possible.
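As an illustration of the parallel-request setup, the sketch below uses asyncio with the asynchronous OpenAI client; the model name and batch contents are placeholders, not the study configuration.

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def ask(question: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
        temperature=0,
        max_tokens=512,
    )
    return response.choices[0].message.content

async def run_batch(questions: list[str]) -> list[str]:
    # Send all questions concurrently and collect answers in the original order.
    return await asyncio.gather(*(ask(q) for q in questions))

answers = asyncio.run(run_batch(["Question 1 ...", "Question 2 ..."]))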
Statistical analysis
Categorical variables are presented as counts (N) and percentages (%). Continuous variables are presented as means ± standard deviations (SDs); where noted, we also report the median and range, formatted as median [range: minimum, maximum]. For the comparison of prompt-engineering techniques, we used the Wilcoxon rank test, with the “raw prompt” as the reference, to calculate P values, and 95% confidence intervals (95% CIs) were estimated via bootstrapping with 10,000 samples. For the comparison of the four scenarios in Experiment 2, we used a paired t-test with “no image” as the reference. Python was used for statistical analysis and visualization, employing the pandas, matplotlib, scipy, and statsmodels libraries. Where relevant, p values were two-tailed, and p < 0.05 was considered statistically significant.
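The sketch below illustrates these comparisons with scipy and a simple bootstrap, assuming per-question correctness is encoded as 0/1 arrays; the data here are synthetic, not study results.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-question correctness (1 = correct, 0 = incorrect) for a
# reference condition ("raw prompt") and a comparison condition.
raw = rng.integers(0, 2, size=300)
engineered = rng.integers(0, 2, size=300)

w_stat, w_p = stats.wilcoxon(raw, engineered)   # prompt-engineering comparison
t_stat, t_p = stats.ttest_rel(raw, engineered)  # paired t-test (Experiment 2 style)

# Bootstrap 95% CI for accuracy with 10,000 resamples.
boot = [rng.choice(engineered, size=engineered.size, replace=True).mean()
        for _ in range(10_000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"accuracy {engineered.mean():.3f} (95% CI {ci_low:.3f}-{ci_high:.3f})")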
Subgroup analysis
We assessed the performance of the six top-performing models across question topics and features. With a dataset of 300 MCQs and an overall correctness rate of 60–80%, we calculated 95% CIs for the proportion of correct responses within each subgroup using a normal approximation to the binomial distribution. The proportion was the number of correct responses divided by the total number of responses in the subgroup, and its standard error was derived from this proportion and the subgroup sample size. Using a critical value of 1.96, corresponding to a 95% confidence level, the lower and upper CI bounds were obtained by subtracting and adding the product of the standard error and the critical value, with the bounds restricted to the range 0–1 to ensure they represented valid probabilities.
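This corresponds to the standard Wald (normal-approximation) interval; a worked sketch with illustrative counts is shown below.

import math

def wald_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    # Normal (Wald) approximation to the binomial proportion CI,
    # clipped to the valid probability range [0, 1].
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)
    return max(0.0, p - z * se), min(1.0, p + z * se)

# Example: 36 of 49 colon questions answered correctly (illustrative numbers).
print(wald_ci(36, 49))  # ≈ (0.611, 0.858)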
Ethical considerations
This study employed MCQs from GI board examinations, emphasizing the ethical use of data. As the study did not involve human subjects, formal ethics approval was unnecessary. We placed a strong emphasis on accurately citing and attributing the questions to their original sources to maintain academic integrity and respect intellectual property rights. Additionally, we opted out of data use for model training in services such as ChatGPT, Claude, and Poe to prevent unintended use of the data by these platforms39,40,41. The study was conducted in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD)-LLM guidelines (Supplementary Table 8).
Data availability
The data that support the findings of this study are available from the American College of Gastroenterology (ACG); restrictions apply to the availability of these data, which were used under license for the current study, and so they are not publicly available. Data are, however, available from the authors upon reasonable request and with the permission of the ACG. In addition, the ACG self-assessment questions and answers are available to members at https://education.gi.org/. The benchmark and model performances are available on a Hugging Face leaderboard (link to the leaderboard visual interface will be added upon acceptance).
Code availability
The underlying code for this study is available in a GitHub repository at https://github.com/Sdamirsa/LLM-VLM-in-Gastroenterology.
References
Henson, J. B., Glissen Brown, J. R., Lee, J. P., Patel, A. & Leiman, D. A. Evaluation of the potential utility of an artificial intelligence chatbot in gastroesophageal reflux disease management. Am. J. Gastroenterol. 118, 2276–2279 (2023).
Safavi-Naini, S. A. A. et al. Su1962 GI-COPILOT: AUGMENTING CHATGPT WITH GUIDELINE-BASED KNOWLEDGE. Gastroenterology 166, S-882–S-883 (2024).
Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Med. 3, 141 (2023).
Liang, P. et al. Holistic evaluation of language models. Trans. Mach. Learn. Res. (2023).
Klang, E., Soroush, A., Nadkarni, G. N., Sharif, K. & Lahat, A. Evaluating the role of ChatGPT in gastroenterology: a comprehensive systematic review of applications, benefits, and limitations. Ther. Adv. Gastroenterol. 16, 17562848231218618 (2023).
Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943–950 (2025).
Li, W. et al. Can multiple-choice questions really be useful in detecting the abilities of LLMs? In Proc. International Conference on Language Resources and Evaluation https://doi.org/10.48550/arXiv.2403.17752 (2024).
Brin, D. et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci. Rep. 13, 16492 (2023).
Gilson, A. et al. How does ChatGPT perform on the United States medical licensing examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med. Educ. 9, e45312 (2023).
Kung, T. H. et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit. Health 2, e0000198 (2023).
Khorshidi, H. et al. Application of ChatGPT in multilingual medical education: how does ChatGPT fare in 2023’s Iranian residency entrance examination. Inform. Med. Unlocked 41, 101314 (2023).
Moshirfar, M., Altaf, A. W., Stoakes, I. M., Tuttle, J. J. & Hoopes, P. C. Artificial intelligence in ophthalmology: a comparative analysis of GPT-3.5, GPT-4, and human expertise in answering StatPearls questions. Cureus https://doi.org/10.7759/cureus.40822 (2023).
Mihalache, A., Popovic, M. M. & Muni, R. H. Performance of an artificial intelligence chatbot in ophthalmic knowledge assessment. JAMA Ophthalmol. 141, 589 (2023).
Cai, L. Z. et al. Performance of generative large language models on ophthalmology board–style questions. Am. J. Ophthalmol. 254, 141–149 (2023).
Suchman, K., Garg, S. & Trindade, A. J. Chat generative pretrained transformer fails the multiple-choice American College of Gastroenterology self-assessment test. Am. J. Gastroenterol. 118, 2280–2282 (2023).
Noda, R. et al. Performance of ChatGPT and bard in self-assessment questions for nephrology board renewal. Clin. Exp. Nephrol. 28, 465–469 (2024).
Skalidis, I. et al. ChatGPT takes on the European exam in core cardiology: an artificial intelligence success story? Eur. Heart J. Digit. Health 4, 279–281 (2023).
Passby, L., Jenko, N. & Wernham, A. Performance of ChatGPT on specialty certificate examination in dermatology multiple-choice questions. Clin. Exp. Dermatol. 49, 722–727 (2024).
Humar, P., Asaad, M., Bengur, F. B. & Nguyen, V. ChatGPT is equivalent to first-year plastic surgery residents: evaluation of ChatGPT on the plastic surgery In-service examination. Aesthet. Surg. J. 43, NP1085–NP1089 (2023).
Hoch, C. C. et al. ChatGPT’s quiz skills in different otolaryngology subspecialties: an analysis of 2576 single-choice and multiple-choice board certification preparation questions. Eur. Arch. Otorhinolaryngol. 280, 4271–4278 (2023).
Ali, R. et al. Performance of ChatGPT, GPT-4, and Google Bard on a neurosurgery oral boards preparation question bank. Neurosurgery 93, 1090–1098 (2023).
Lum, Z. C. Can artificial intelligence pass the American Board of Orthopaedic Surgery examination? Orthopaedic residents versus ChatGPT. Clin. Orthop. 481, 1623–1630 (2023).
Zheng, Y. et al. Large language models for medicine: a survey. Int. J. Mach. Learn. Cybern. 16, 1015–1040 (2025).
Optimum. Quantization. HuggingFace Website https://huggingface.co/docs/optimum/en/concept_guides/quantization (2024).
Zhao, W. et al. DIVKNOWQA: assessing the reasoning ability of LLMs via open-domain question answering over knowledge base and text. In Findings of the Association for Computational Linguistics: NAACL 2024, 51–68 (Association for Computational Linguistics, 2024).
Li, S. et al. Evaluating quantized large language models. In Proc. International Conference on Machine Learning https://doi.org/10.48550/arXiv.2402.18158 (2024).
Gong, Z. et al. What makes quantization for large language models hard? An empirical study from the lens of perturbation. In Proc. AAAI Conference on Artificial Intelligence, 18082–18089. https://doi.org/10.48550/arXiv.2403.06408 (2024).
Buckley, T., Diao, J. A., Rajpurkar, P., Rodman, A. & Manrai, A. K. Accuracy of a vision-language model on challenging medical cases. Preprint at https://doi.org/10.48550/arXiv.2311.05591 (2023).
Shahab, O., El Kurdi, B., Shaukat, A., Nadkarni, G. & Soroush, A. Large language models: a primer and gastroenterology applications. Ther. Adv. Gastroenterol. 17, 17562848241227031 (2024).
Lin, S., Hilton, J. & Evans, O. Teaching models to express their uncertainty in words. Preprint at https://openreview.net/forum?id=8s8K2UZGTZ (2022).
Savage, T. et al. Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment. J. Am. Med. Inform. Assoc. 32, 139–149 (2025).
American Board of Internal Medicine (ABIM) Official Website. Gastroenterology certification exam content. abim.org https://www.abim.org/certification/exam-information/gastroenterology/exam-content (2024).
Kowalczyk, M. A step-by-step guide to prompt engineering: best practices, challenges, and examples. Lakera.ai https://www.lakera.ai/blog/prompt-engineering-guide (2023).
Meskó, B. Prompt engineering as an important emerging skill for medical professionals: tutorial. J. Med. Internet Res. 25, e50638 (2023).
White, J. et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. Preprint at https://doi.org/10.48550/arXiv.2302.11382 (2023).
Zhou, Y. et al. Large language models are human-level prompt engineers. In Proc. International Conference on Learning Representations. https://openreview.net/forum?id=92gvk82DE- (2022).
Dettmers, T., Pagnoni, A., Holtzman, A. & Zettlemoyer, L. QLoRA: efficient finetuning of quantized LLMs. Preprint at https://openreview.net/forum?id=OUIFPHEgJU (2023).
Zaidi, N. Modified Bloom’s taxonomy for evaluating multiple choice questions. Baylor College of Medicine https://www.bcm.edu/sites/default/files/2019/04/principles-and-guidelines-for-assessments-6.15.15.pdf (2015).
Anthropic. Consumer terms of service. Anthropic Official Website https://www.anthropic.com/legal/consumer-terms (2024).
Poe. Poe privacy center. Poe Official Website https://poe.com/privacy_center (2024).
OpenAI. Data controls FAQ. OpenAI Official Website https://help.openai.com/en/articles/7730893-data-controls-faq (2024).
Acknowledgements
AS was funded by the American Gastroenterological Association AGA-Amgen Fellowship-to-Faculty Transition Award (AGA2023-32-06). The funder played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript. We thank the American College of Gastroenterology for kindly providing their valuable question bank, the Hugging Face team for making AI accessible and easy to use, and the Bloke account on Hugging Face for providing quantized versions of open-source LLMs. During the preparation of this work, the authors used generative AI to improve the writing style and grammar of the manuscript. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
Author information
Contributions
S.A.A.S.N.: Conceptualization, methodology, formal analysis, investigation, data curation, writing original draft, programming, project administration; S.A.: Methodology, investigation; O.S.: Methodology, investigation; Z.S.: Investigation; T.S.: Validation, investigation, writing, review and editing; S.R.: Investigation; J.S.S.: Investigation; R.A.S.: Investigation; F.L.: Investigation; J.O.Y.: Investigation; J.E.: Investigation; S.B.: Investigation; A.Sh.: Investigation; S.M.: Validation, writing original draft, writing review and editing; N.P.T.: Validation, writing, review and editing; G.N.: Methodology, validation, writing review and editing, supervision; B.E.K.: Conceptualization, methodology, resources, data curation, writing review and editing, project administration; A.So.: Conceptualization, methodology, formal analysis, resources, data curation, writing original draft, writing review and editing, programming, supervision, project administration.
Ethics declarations
Competing interests
G.N. is deputy editor at npj Digital Medicine. All authors declare no financial or non-financial competing interests related to this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.