Abstract
We evaluated the zero-shot performance of six large language models (LLMs; GPT-4.0 Turbo, LLaMA-3-8B, LLaMA-3-70B, Mixtral 8\(\times\)7B Instruct, Titan Text G1-Express, Command R+) and four multimodal LLMs (Claude-3.5-Sonnet, Claude-3-opus, Claude-3-Sonnet, Claude-3-Haiku) on the 2023 Brazilian Portuguese medical residency entrance exam of the Hospital das Clínicas da Faculdade de Medicina da Universidade de São Paulo including text-only and image-based questions. Comparison among models showed that accuracy varied widely, with Claude-3.5-Sonnet achieving the highest score on text-only questions (70.27%, 95% CI: 65.68–74.86), surpassing GPT-4.0 Turbo (66.22%, 95% CI: 65.38–67.05), while the open-source LLaMA-3-70B performed competitively. The best models reached the median level observed among human candidates. On image-based questions, accuracy dropped substantially across models, with most scoring below 50%, except Claude-3.5-Sonnet, which maintained stable performance. However, this decline should be interpreted with caution, as it remains unclear whether it reflects multimodal reasoning limitations or differences in intrinsic question difficulty, and the present study does not allow these possibilities to be disentangled. In addition, qualitative analysis by independent expert physicians assessed model-generated explanations, identifying hallucinatory events, with lower inter-rater agreement in misclassified cases. These results suggest that language models in Brazilian Portuguese may approximate human-level reasoning in medical questions.
Introduction
Large language models (LLMs) have revolutionized the interpretation of data 1,2. Artificial intelligence (AI) promises to transform healthcare by improving diagnostic accuracy, personalizing treatment plans, and optimizing workflows in medical practice by extracting value and information from unstructured data, which predominate in electronic health records 1,3,4,5,6,7.
Although LLMs have been tested on benchmarks such as Massive Multitask Language Understanding and BIG Bench 8,9, these evaluations are conducted predominantly in English, reflecting the overwhelming dominance of this language in training data. Specialized datasets, such as the MedQA-US Medical Licensing Examination (USMLE), have been used to assess the capabilities of LLMs in scenarios requiring specialized medical knowledge, advanced reasoning, and human-level reading comprehension 10.
More recently, retrieval-augmented generation (RAG) frameworks have emerged as powerful strategies to improve factual grounding and domain-specific accuracy in medical applications 11,12. In parallel, the development of agentic LLMs has introduced new paradigms of autonomous reasoning and multi-step task execution, particularly in radiology and clinical question answering 13,14. These innovations expand the traditional scope of LLM benchmarking by integrating external knowledge retrieval and dynamic decision-making capabilities, but they remain largely unexplored in non-English contexts.
In addition, there is linguistic disparity in the field of natural language processing (NLP), as many less widely used or endangered languages are particularly under-resourced 15. Even some languages, such as Portuguese (spoken by approximately 3% of the global population 16,17 and proportionally represented with 3.8% of websites 18), face challenges due to the overrepresentation of English. This dominance of English-language content can introduce bias in the training of language models, hindering the performance of LLMs in languages other than English, particularly in high-stakes domains like medicine. Importantly, Portuguese remains underexplored in clinical AI benchmarks, and evaluating LLMs in this context addresses a critical gap by testing their robustness in an underrepresented language.
Some studies have focused on specific languages, reporting progress in improving NLP performance. For example, Lorenzoni et al. (2024) explored the use of LLMs in Italian to identify injuries in emergency department records but did not address linguistic comparisons 19. Likewise, Liu et al. (2024) conducted a meta-analysis on ChatGPT performance on medical licensing examinations in several languages (English, Japanese, Spanish, French, German, and Chinese) 20 and Frei et al. (2023) applied annotated datasets to improve NLP tasks in German 21. In Portuguese, Garcia et al. (2024) introduced BODE, achieving competitive results in zero-shot classification tasks 22 while Almeida et al. (2024) developed Sabiá-2, a family of trained LLMs that outperformed GPT-3.5 in most of the tasks evaluated 23. These examples contrast with Guillen-Grima et al. (2023), who assessed GPT-3.5 and GPT-4 directly, without adjustments or fine-tuning, on questions from the Spanish medical residency exam; GPT-4 achieved an accuracy of 86.81%, with slightly better performance when using English-translated questions 24.
Residency entrance exams are a useful platform for testing LLM performance across languages because these exams are structured to evaluate specialized knowledge, reasoning, and comprehension skills under standardized conditions. Unlike proficiency tests, which focus primarily on linguistic fluency, these exams challenge an LLM with complex, real-world medical scenarios, including textual and multimodal questions, such as graphs and radiological images. This structural feature makes the Brazilian residency exam of the Hospital das Clínicas da Faculdade de Medicina da Universidade de São Paulo (HCFMUSP) distinct from USMLE-style benchmarks, providing a unique opportunity to test both textual and image-based reasoning in Portuguese. Furthermore, under Brazilian law, public examinations, including the HCFMUSP Medical Residency Exam, are required to disclose the exam questions and the candidates’ scores shortly after the test, allowing for direct benchmarking of LLMs not only against the answer key but also against the performance of thousands of human applicants. This comparative dimension has not been systematically explored in prior studies. However, although this public availability of data enables candidates to file appeals, and consequently some questions are often annulled, any request for individual response data, which would have allowed a more detailed comparison by question type or specialty domain, cannot be granted on contractual confidentiality grounds.
Despite recent progress, several research gaps remain. First, most LLM evaluations in medicine have focused on English-language datasets and licensing exams, leaving a gap in understanding how these models perform in underrepresented languages such as Portuguese. Second, studies that evaluate LLMs in other languages often use translated benchmarks or simulated scenarios, rather than real-world, high-stakes exams, such as residency entrance assessments. Third, few studies have compared the performance of both unimodal and multimodal LLMs under the same conditions, especially in the context of complex questions involving medical images and reasoning. These limitations hinder a comprehensive understanding of the true potential and boundaries of these models in diverse, multilingual healthcare settings. These research gaps highlight the need for comprehensive, real-world evaluations of LLMs and MLLMs in underrepresented languages such as Portuguese, especially in high-stakes multimodal medical contexts.
To address these gaps, this study assesses the performance of zero-shot LLMs and multimodal large language models (MLLMs) on the medical residency exam of the HCFMUSP, a real-world, linguistically authentic and standardized test in Brazilian Portuguese. By benchmarking six LLMs and four MLLMs against human candidates, we aim to advance the evaluation of AI tools in medicine across linguistic and modality boundaries. This work therefore goes beyond simple replication of English-language benchmarks, offering novel insights into how underrepresented languages and multimodal question formats affect the reliability, safety, and fairness of generative AI in clinical contexts.
Results
Performance of LLMs on text-based questions
Accuracy on the text-only exam questions varied widely across the models tested. Claude-3-Sonnet achieved the highest accuracy (72.97%), followed closely by Claude-3-opus (70.54%) and Claude-3.5-Sonnet (70.27%), all with minimal variability across the five trials. GPT-4 Turbo (66.22%) performed comparably, with no variation across the five trials. At the lower end, Titan Text had the weakest accuracy at 21.35%, while Llama-3-8b and Mixtral 8\(\times\)7B showed moderate results at 45.95% and 52.16%, respectively. Overall, accuracy ranged from 21% to 73%, revealing a wide performance gap across models. Regarding processing time, Llama-3-8b (4.02s) and Claude-3-haiku (4.12s) were the fastest, while Claude-3-opus (18.50s) and GPT-4 Turbo (14.70s) were the slowest. In general, models from the Claude family combined high accuracy with varied processing times, with Claude-3-Sonnet offering top accuracy at moderate speed, and Claude-3-haiku delivering a strong speed-accuracy trade-off. An omnibus test showed differences in both accuracy and processing time among models. Post-hoc pairwise comparisons (Holm-adjusted) revealed that accuracy varied significantly more than processing time among models (Table 1). Figure 1 (empty circles) shows the relationship between model accuracy and mean processing time per question.
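As a rough illustration of what a Holm adjustment does to a family of pairwise p-values, the following minimal sketch implements the standard step-down procedure in plain Python. The function name `holm_adjust` and the example p-values are hypothetical and are not taken from this study; the statistical software and the underlying pairwise tests used by the authors are not specified here.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the raw p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = 0..m-1) is multiplied by (m - k), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values for three pairwise model comparisons.
    raw = [0.01, 0.04, 0.03]
    for p, p_adj in zip(raw, holm_adjust(raw)):
        print(f"raw={p:.3f}  holm={p_adj:.3f}")
```

A comparison would then be called significant when its adjusted p-value falls below the chosen level (for example 0.05), which controls the family-wise error rate across all pairs.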
Comparison of accuracy (median accuracy, percentage of correct answers) and mean execution time per question (seconds). Empty circles: 74 textual questions; filled circles: questions containing images.
Multimodal reasoning and accuracy decline in image-based questions
In this analysis, only models from the Claude family were included; therefore, performance comparisons are limited to intra-family differences rather than across distinct architectures.
The accuracy and processing time for Claude-3-Sonnet, Claude-3-opus, Claude-3.5-Sonnet and Claude-3-haiku were analyzed on all 117 questions (Table 2). Overall, accuracy tended to decrease (except for Claude-3.5-Sonnet) and mean processing time per question increased when questions containing images were added (Fig. 1, filled circles). Standard deviations were low across models. Claude-3.5-Sonnet showed the best performance in accuracy (69.57%), followed by Claude-3-opus (63.59%), Claude-3-Sonnet (54.70%), and Claude-3-haiku (44.44%). In terms of mean processing time, Claude-3-haiku was the fastest (5.48s), while Claude-3-opus was the slowest (24.68s). Claude-3.5-Sonnet balanced high accuracy with moderate processing time (13.02s), positioning it as a high-performing and efficient option.
Figure 2 shows the distribution of different types of questions (textual, non-radiological, and radiological) in five medical domains (gynecology and obstetrics, pediatrics, internal medicine, public health, and surgery). Figure 3 shows the accuracy of the Claude family models in answering these questions across five independent trials for each model. The Claude models consistently ranked highest overall, with markedly better performance on text-only questions compared to items containing images. This difference was especially pronounced among models with lower overall accuracy.
Distribution of the 117 valid questions of the 2023 HCFMUSP medical residency entrance exam by medical domain and format, classified as text-based (n = 74) or image-based (n = 43), with image-based items further divided into radiological (n = 19) and non-radiological (n = 24).
Comparison of the accuracy of the Claude family models based on the type of questions (questions containing radiological or non-radiological images, or text only) across five medical areas: Gynecology and Obstetrics, Internal Medicine, Pediatrics, Public Health, and Surgery. Values in parentheses indicate the number of questions.
Overall, models performed best on textual questions, particularly in the public health and pediatric domains. Notably, the public health domain contained no radiological questions, while non-radiological image items in surgery and internal medicine yielded comparatively better results. Claude-3-opus and Claude-3.5-Sonnet slightly outperformed the other models. In contrast, questions involving images, especially those with radiological content, were more challenging, with consistently lower accuracies across all domains. When separated by question type, MLLMs achieved markedly lower accuracy on radiological questions (below 50%), whereas performance was substantially higher on text-only and non-radiological items. This reinforces that textual reasoning remains the most consistent and clinically applicable capability of current models.
Comparison with candidates’ performance
LLMs and MLLMs generally achieved accuracy levels within the main density range of human applicants, with the best models approaching, but not surpassing, the human median. The distribution of human scores peaked between 65–70% accuracy, with a slightly lower median. For text-only questions, the top-performing models were Claude-3-Sonnet (72.97%), Claude-3.5-Sonnet (70.27%), Claude-3-opus (70.54%), and GPT-4.0 Turbo (66.22%), whereas Claude-3-haiku (61.35%) and Command R+ (59.46%) performed below the human median. When image-based items were included, accuracy declined across all models: Claude-3.5-Sonnet (69.57%) remained closest to the human score distribution peak, while Claude-3-opus (63.59%), Claude-3-Sonnet (54.70%), and Claude-3-haiku (44.44%) failed to reach comparable performance. Figure 4 shows the distribution of human applicants’ accuracy as a smoothed probability density function (solid line), with the peak corresponding to the mode and the dashed vertical line indicating the median. The numbers in circles represent the mean accuracy achieved by each LLM across five independent trials (in smaller font). The upper panel includes only textual questions, whereas the lower panel also includes items with images. Boxplots (right panels) depict the variability of accuracy across runs of each model. The raw accuracy values are available in the supplemental material (see “Data availability” in “Methods”).
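The per-model values compared with the human distribution are summaries over five independent trials. The sketch below shows one plausible way to turn per-trial counts of correct answers into a mean accuracy with a 95% confidence interval. It assumes a Student t interval across trials; the function name `summarize_trials`, the example counts, and the choice of interval are illustrative assumptions, since the exact confidence interval procedure is not described in this section.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (five trials).
T_CRIT_DF4 = 2.776


def summarize_trials(correct_counts, n_questions):
    """Mean accuracy (%) and an assumed t-based 95% CI across exactly five trials."""
    if len(correct_counts) != 5:
        raise ValueError("this sketch hard-codes the t value for five trials")
    accuracies = [100.0 * c / n_questions for c in correct_counts]
    center = mean(accuracies)
    # Standard error of the mean, using the sample standard deviation.
    half_width = T_CRIT_DF4 * stdev(accuracies) / sqrt(len(accuracies))
    return center, center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical counts of correct answers out of 74 text-only questions.
    acc, low, high = summarize_trials([52, 51, 53, 52, 52], 74)
    print(f"accuracy = {acc:.2f}% (95% CI: {low:.2f} to {high:.2f})")
```

When a model returns identical answers in every trial, the interval from this approach collapses to a single point, so a binomial interval over questions may be a more informative alternative in that case.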
Distribution of accuracy (solid line) and median (dashed line) of residency applicants, along with the accuracy achieved by different large language models: only textual questions (upper panel) and questions including images (lower panel). The abscissa positions of the small numbers indicate the accuracy obtained in each of the five trials. Boxplots (right panels) represent the variability of accuracy across runs of each model.
Quality, coherence, and safety of model-generated explanations
The evaluation by three clinical experts is summarized in Tables 3 and 4. Gwet’s AC1 effect sizes are presented with their 95% confidence intervals, quantifying the magnitude of agreement across all key comparisons described below.
Table 3 is divided in two columns: the left side corresponds to the questions that the model answered correctly and the right side to those it answered incorrectly, as determined by the official answer key. It consists of three panels. The upper panel reports the problematic questions for the model, either when it provided the correct answer but an incorrect justification (left side) or when it provided the incorrect answer but a correct justification (right side). Numbers in regular font indicate the counts, revealing that most correctly answered questions were correctly interpreted by the model (left) and most incorrectly answered questions were also incorrectly interpreted (right). Numbers in italics correspond to the problematic questions identified by each of the three observers, indicating that the questions marked by each observer do not completely overlap. Among the questions considered problematic:
- nine were answered correctly:
  - four included non-radiological images and pertain to the areas of internal medicine (1), gynecology and obstetrics (1), pediatrics (1), and public health (1);
  - five are textual questions in the areas of gynecology and obstetrics (2) and public health (3).
- thirteen were answered incorrectly:
  - one had a non-radiological image in the area of surgery;
  - three had radiographic images related to surgery (1), gynecology and obstetrics (1), and pediatrics (1);
  - nine are textual questions in the areas of surgery (2), internal medicine (2), gynecology and obstetrics (2), and public health (3).
Still maintaining the analysis in two separate columns according to the official answer key, the intermediate panel of Table 3 refines the upper panel by assessing whether Claude-3-opus correctly interpreted and justified each question (text and/or image) and whether its answers were coherent with those interpretations. While the correctness of the answers was previously determined by the official key, this section of the table concerns only the judgment of the model’s interpretations and justifications by the human observers, based on pairwise agreement estimated using Gwet’s AC1 statistic. All possible combinations occurred: in most cases, the observers agreed about which questions were interpreted or misinterpreted by the model (main diagonals of the 2\(\times\)2 tables in both table columns). However, there was no consensus between observers on the interpretation of the problematic questions, with instances in which one observer judged an interpretation as correct while another considered it incorrect (off-diagonals of the 2\(\times\)2 tables). The predominance of agreement among observers, in all cases, was statistically significant; however, this agreement was greater for questions that the model answered correctly than for those it answered incorrectly.
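For readers unfamiliar with Gwet's AC1, the following minimal sketch computes the point estimate for two raters and a binary (Yes/No) judgment from the four cells of a 2\(\times\)2 table. The function name `gwet_ac1` and the example counts are hypothetical and do not reproduce Tables 3 or 4; confidence intervals and significance tests are omitted.

```python
def gwet_ac1(both_yes, only_a_yes, only_b_yes, both_no):
    """Gwet's AC1 point estimate for two raters and two categories."""
    n = both_yes + only_a_yes + only_b_yes + both_no
    if n == 0:
        raise ValueError("the 2x2 table is empty")
    # Observed agreement: proportion of items on the main diagonal.
    observed = (both_yes + both_no) / n
    # Average proportion of "yes" ratings over the two raters.
    pi_yes = ((both_yes + only_a_yes) + (both_yes + only_b_yes)) / (2 * n)
    # Chance agreement for two categories: 2 * pi * (1 - pi), at most 0.5.
    chance = 2 * pi_yes * (1 - pi_yes)
    return (observed - chance) / (1 - chance)


if __name__ == "__main__":
    # Hypothetical table: the raters agree on 95 of 100 items.
    print(round(gwet_ac1(45, 3, 2, 50), 3))
    # Hypothetical table with ratings only on the off-diagonal cells.
    print(round(gwet_ac1(0, 5, 5, 0), 3))
```

Values near 1 indicate that ratings concentrate on the main diagonal, while values near \(-1\) indicate that they concentrate on the off-diagonal cells.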
Table 3 (bottom panel) assesses the model’s interpretation of the questions and the coherence between its interpretation and the chosen alternative. Among the questions correctly answered by the model, there is agreement in most cases where the questions were correctly interpreted and the chosen alternative was consistent with that interpretation. Only one question (# 89) showed a discrepancy in which the model’s interpretation was not coherent with the chosen alternative, while in four others (# 58, 64, 67, and 71) the interpretation itself was flawed. Even so, the model’s answer was correct, as this corresponds to the left side of the table; a correct answer derived from incorrect reasoning, where the justification nonetheless pointed to the chosen alternative. Among the questions incorrectly answered by the model, interpretation and coherence do not statistically align. The model’s answers were coherent in 41 questions (left column of the 2\(\times\)2 table), but the interpretation was correct in 16 cases and incorrect in 25. In the remaining four questions (right column of the 2\(\times\)2 table), the chosen alternative did not correspond to the given interpretation (questions # 12 and # 47 with correct interpretations; # 44 and # 67 with incorrect interpretations). This lack of alignment reflects the model’s difficulties, as all these questions received incorrect answers according to the official answer key (right side of the table).
Table 4 is also divided into two columns: the left side corresponds to the questions the model correctly answered and the right side to those it incorrectly answered, as determined by the official answer key. This table examines the responses provided by the artificial intelligence model that, if followed as directives for action in medical practice, could potentially cause harm to a patient. The top panel presents the judgment of each observer (i.e., intraobserver evaluation) on whether the justifications provided by the model were concordant with the potential harm to patients. The variable “Justification” (Yes/No) refers to its correctness, transposed for each observer from Table 3. The variable “Harm” (Yes/No) represents the potential harm to a patient. In all cases, there was significant disagreement (negative agreement), indicating that off-diagonal cells were more frequent than those on the main diagonal, meaning that incorrect justifications were associated with harm and correct ones with the absence of harm. According to observers 1 and 2, this (dis)agreement was clearer (that is, closer to \(-1\)) for the questions correctly answered than for those incorrectly answered by the model. Observer 3 assumed a stricter perception that all correct explanations lead to no harm, and all incorrect explanations lead to harm for both sets of questions.
Two illustrative examples of justifications are shown in Figs. 5 and 6. Figure 5 shows question #48, for which the model provided the correct answer and a justification consistent with this answer (i.e., concordant and therefore correct) that would not cause harm to a patient. Figure 6 shows question #5, for which the model provided an incorrect answer, a justification consistent with that answer (that is, concordant and therefore incorrect), and an interpretation that, if followed, could cause harm to a patient; the model interpreted that the patient presented total airway obstruction, suggesting unnecessary maneuvers that would delay bronchoscopy, which would otherwise remove the foreign body and resolve the problem.
Example of a correct answer from the model, with the justification validated as "concordant" by a clinician.
Example of an incorrect answer from the model. The clinician judged this wrong answer to be "concordant" with the justification, but considered that the mistake would be "unsafe" for the patient.
Table 4 (bottom panel) presents the comparison of the opinions of the observers (i.e., interobserver analysis) on the potential harm that could be caused to a patient. In all cases, there was agreement between the observers, but the level of agreement was greater when they evaluated the questions that the model answered correctly; in these cases, the most frequent cell corresponded to the absence of harm according to both observers. For the questions answered incorrectly, the agreement mainly concerned the presence of harm in the interpretation of all observer pairs.
Discussion
This study expands prior English-language benchmarks by introducing a linguistically distinct and multimodal evaluation in Brazilian Portuguese, using real human performance as reference. By integrating textual and image-based questions within an authentic, high-stakes examination, this work provides a new lens through which to assess generative models in an underrepresented language and cultural context. This setting allowed us to identify language-specific limitations and multimodal reasoning gaps that are not captured by traditional text-only evaluations. However, it is important to note that the models evaluated here reflect the state of the art available at the time of the examination, and that some of the limitations discussed may be attenuated or no longer apply to more recent LLM generations.
We assessed the zero-shot performance of LLMs and MLLMs to answer clinical questions in Portuguese. Overall, model performance showed substantial variability, reflecting differences in architecture, training data, and multilingual capabilities. The Claude family consistently achieved higher accuracy, particularly in text-only questions, while the open-source Llama-3-70B performed competitively, illustrating the potential of cost-effective models in multilingual medical settings.
When image-based questions were introduced, model accuracy decreased substantially and became more variable, which may reflect limitations in multimodal reasoning and/or differences in intrinsic question difficulty. Although Claude-3.5-Sonnet demonstrated relatively stable performance, the overall results suggest that the MLLMs evaluated in this study face challenges when visual information is involved. This pattern was consistent across different medical domains, indicating that the challenge may not be domain-specific and may instead be related to limitations in the models’ capacity. These findings highlight the need for targeted improvements in visual understanding and cross-modal alignment to achieve clinically reliable multimodal AI systems. Qualitatively, the clinician coauthors did not perceive the image-based questions as intrinsically more difficult than the text-only items, although this impression remains subjective and cannot be formally tested with the available data. Therefore, this question remains open as the present study did not have access to item-level response data from human candidates, and it was not possible to assess whether human performance also differed between purely text-based questions and those that included images.
When we qualitatively assessed the explanations for model-generated answers (using Claude-3-opus, the only MLLM available at the time of expert evaluation), we observed overall agreement among evaluators, but disagreement on which questions were flagged as problematic. At that stage, Claude-3-opus was the only model with stable API access and an output format compatible with blinded physician review, which technically constrained this part of the study to that model. While the model demonstrated coherent reasoning for correct answers, hallucinations were frequent in incorrect ones, and human disagreement increased in such cases. This indicates that interpretive ambiguity may arise both from model limitations and from the inherent complexity of certain medical questions.
The lower inter-observer agreement observed when the model’s answers were incorrect likely reflects multiple factors, including inherent ambiguities in some exam questions, variability in clinical interpretation, and the lack of absolute medical consensus in certain scenarios. Additionally, since raters provided their assessments independently, without prior calibration or adjudication, subjective differences were expected, particularly for items involving nuanced reasoning or incomplete contextual information. These findings suggest that the observed disagreement stems more from the inherent complexity of medical reasoning and annotation subjectivity than from inconsistencies in the evaluation protocol itself.
Model performance varied across medical domains. Questions related to public health and general clinical practice achieved higher accuracy, likely reflecting the greater availability of open-access Portuguese-language data, including materials published by the Brazilian Ministry of Health. In contrast, radiological and other image-based items exhibited the lowest performance, particularly among multimodal models, reflecting the limited availability of high-quality annotated medical images and multimodal datasets in Portuguese. These disparities likely stem from differences in data availability and representation rather than from model architecture.
Our findings indicate that, despite substantial recent progress, LLMs and MLLMs still fall short of consistent human-level performance. Even the strongest models achieved accuracy comparable to a median residency applicant, suggesting that current architectures can approximate, but not yet surpass, human reasoning in complex medical assessments. The persistent performance gap, particularly evident in image-based questions, underscores the difficulty these models face in integrating visual and contextual information. This limitation highlights the need for more robust multimodal training strategies and domain-specific fine-tuning to improve generalization across diverse clinical tasks. Interestingly, for humans, figures in exam questions may even enhance performance when meaningfully integrated 25,26. On the other hand, humans tend to perform worse when poorly designed visuals hinder comprehension 27,28,29. In addition, GPT-4’s accuracy was close to, and even surpassed, the approximately 60% threshold reported by Kung et al. 30 for an English-language medical licensing test (USMLE). It is important to note, however, that in that paper not only the language (English) but also the prompt engineering characteristics differed from those of our work.
Regarding language and image-related challenges, Guillen-Grima et al. 24 evaluated GPT-3.5 and GPT-4 on Spain’s MIR exam and found that GPT-3.5 scored 66.48% in English and 63.18% in Spanish, while GPT-4 scored 87.91% in English and 86.81% in Spanish. However, when image-related questions were included, accuracy dropped to 26.1% in English and 13.0% in Spanish. These differences reinforce that performance declines are largely driven by the presence of images and language representation in training data, issues likely shared across Portuguese and Spanish medical contexts. Notably, this pattern was not uniform across the models. For the most advanced model available at the time of this study, Claude-3.5-Sonnet, the performance on text-only and image-based questions was nearly identical. This stability suggests that, as model generations advance beyond those evaluated here, the performance gap between text-based and multimodal tasks may further narrow or even disappear.
On the other hand, the overall strong performance of AI in medical tests is a well-documented fact. Several studies have highlighted the potential of MLLMs to assist physicians in interpreting non-radiological medical images 31,32,33,34. Nevertheless, as we noted here, challenges remain due to the complexity of imaging techniques and anatomical variations 35,36.
Three experienced physicians assessed the explanations provided by the model. Hallucinations in LLMs pose a significant risk, especially in medicine, as they generate plausible but incorrect explanations. To address this, strategies include fine-tuning with high-quality data, robust evaluation frameworks, and integrating explainability mechanisms to ensure accuracy and patient safety 37,38. Our study explored the potential impact of hallucinations on medical questions. The model provided accurate explanations for about 94% of the correctly answered questions and incorrect explanations for 87% of the incorrectly answered questions. This suggests that LLMs can provide a reasonable connection between the chosen alternative and the accompanying rationale. This is a critical property on the route to valuable tools for supporting medical decision making 39. On the other hand, when the model gave a wrong answer to an exam question, the explanation was also incorrect in most cases, implying that the model had indeed followed a wrong line of reasoning.
In those cases, unreliable AI performance should be examined from a broad testing perspective, considering not just "hallucinations", but also errors in training data or code. Overreliance (automation bias) and pleasing bias, where AI aligns with perceived user preferences, may also impact test explanations and need to be addressed 40,41.
There are still points to be addressed regarding the role of written language in LLM/MLLM performance on medical tests. Studies focusing on performance disparities driven by linguistic representation in publicly available datasets are still limited and sparse, although general discussions on language limitations exist 42. Here we have contributed a few interesting results obtained on a medical test written in Portuguese: to our knowledge, the analysis of model explainability and the relevance of non-text material to model performance had not yet been addressed within a single test framework. To what extent our conclusions hold in other similar tests in Portuguese (i.e., tests designed to evaluate medical expertise when images are critical, as in radiology, pathology, dermatology, and ophthalmology specialty tests) remains open. Considering that models are rapidly evolving, it is very probable that the newest models will outperform the current ones. We believe that evaluations of future clinical usability should use a framework that places explainability at its core. For instance, platforms such as MedHelm 43, designed to assess LLM performance on medical tasks and including a benchmark suite and an "LLM-jury" system, might be relevant. However, MedHelm was not designed to deal with different languages, so it remains to be seen whether the current results would hold in multilingual settings.
Our study has a few limitations. First, the analysis was restricted to a single exam year (2023), which corresponds to the selection process for residents admitted in 2024. This choice was made because it was the most recent publicly available version of the HCFMUSP medical residency exam at the time of data collection, ensuring the evaluation reflected the current structure, topic distribution, and difficulty level of the official test. However, exam content and emphasis may vary slightly across years, and future studies should therefore include multiple exam editions to assess the consistency and stability of model performance over time. Additionally, as mentioned in the Introduction, it was not possible to assess human candidate performance by question type, given that only aggregate public data are available. We also could not analyze GPT-4’s image processing capabilities, since these were not accessible within the hospital’s contracted version. It is also possible that other models perform better in languages with less digital representation. This hypothesis remains indirect and warrants further exploration.
Second, we did not include an English baseline using translated exam questions, as our aim was to evaluate model behavior in authentic Portuguese clinical text rather than in translated material. Ideally, this comparison would provide the best pathway to address the effect of language on LLM and MLLM performance. Due to time, resource, and validation constraints, a medically reviewed translation/back-translation process was not feasible. Future studies should include parallel English versions of the same exam to isolate language-specific effects and assess generalizability across linguistic contexts.
Third, we could not evaluate other models simply due to limited access and computational resources. It is likely that the variations observed among the LLMs and MLLMs tested here might differ if additional models were included.
Finally, our study also suggests several future research directions. Investigating training and fine-tuning methods to improve LLM accuracy in Portuguese is promising. Retrieval-augmented generation (RAG) strategies and other techniques could help to overcome some of the gaps detected here. We also highlight the importance of comparative studies across languages, socioeconomic contexts, and medical guidelines as a means to offer insights into the adaptability and flexibility of these models.
In conclusion, this study assessed LLM and MLLM performance on a Brazilian Portuguese medical proficiency test, showing results similar to those reported in English and Spanish in a zero-shot setting. These results highlight the potential for effective integration of AI into clinical workflows. However, several common challenges and dependencies still need to be addressed, and doing so will ultimately depend on the participation of medical professionals in the development of responsible AI solutions.
Methods
The models were evaluated using a set of publicly available test questions in Brazilian Portuguese from the HCFMUSP medical residency entrance exam 44.
The HCFMUSP medical residency exam in 2023 consisted of 120 multiple choice questions, each with four options. Three questions (#55, #79, and #110) were cancelled post-hoc due to potential ambiguities, leaving 117 valid items for analysis. The exam covered five core areas of medicine: gynecology and obstetrics, pediatrics, internal medicine, public health, and surgery. Among the questions, 74 were text only, while 43 required interpreting images in conjunction with text. The images were classified into two categories: (1) non-radiological images (flow charts, photographs of medical procedures, dermatological lesions, and medical examinations such as electrocardiograms or endoscopic views) and (2) radiological images.
We evaluated the MLLMs and LLMs available at the time of writing, including the following models: Claude-3-opus, Claude-3-Sonnet, Claude-3.5-Sonnet, Claude-3-haiku, GPT-4.0 Turbo, LLaMA-3-8B, LLaMA-3-70B, Mixtral 8\(\times\)7B Instruct, Titan Text G1-Express, and Command R+. These models were among the leading performers according to previous publications 8,24,30. None of the models were exposed to the questions of this test during their training phase. The models were accessed via distinct interfaces, depending on the model provider. Most LLMs were accessed through the Amazon Bedrock service using Python-based API calls. The exception was the GPT models, which were accessed through an internal API service developed by the Hospital Israelita Albert Einstein (HIAE) Machine Learning Operations (MLOps) team, which connects directly to a private GPT server hosted on the Azure cloud (located within Brazilian borders and behind the hospital’s firewall). To ensure consistency and reproducibility, all models were prompted using standardized protocols based on methodologies commonly applied in multiple-choice language model assessments 45,46,47. These protocols provided the necessary context for each task and explicitly instructed the models to answer the question and furnish detailed explanations for their chosen answers. Furthermore, the prompts requested the inclusion of specific tags to facilitate subsequent automated answer extraction via regular expression processing.
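The tag-based answer extraction described above can be sketched as follows. This is a minimal illustration rather than the script used in the study: the tag names `<resposta>` and `<explicacao>`, the function name `extract_answer`, and the option letters A to D are assumptions made for the example (the text states only that each question had four options and that tags were requested).

```python
import re
from typing import Optional, Tuple

# Hypothetical tag names; the study does not state which tags were requested.
ANSWER_RE = re.compile(r"<resposta>\s*([A-D])\s*</resposta>", re.IGNORECASE)
EXPLANATION_RE = re.compile(
    r"<explicacao>(.*?)</explicacao>", re.IGNORECASE | re.DOTALL
)


def extract_answer(response: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (option letter, explanation); each is None if its tag is missing."""
    answer = ANSWER_RE.search(response)
    explanation = EXPLANATION_RE.search(response)
    letter = answer.group(1).upper() if answer else None
    text = explanation.group(1).strip() if explanation else None
    return letter, text


# Made-up model output used only to exercise the function.
sample = (
    "<resposta> b </resposta>\n"
    "<explicacao>O quadro sugere apendicite aguda.</explicacao>"
)
print(extract_answer(sample))
```

Returning `None` for a missing tag, instead of guessing a letter from free text, keeps malformed responses visible so they can be counted or reviewed separately.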
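Model performance was quantified through accuracy and processing time. Accuracy was defined as the proportion of correct answers relative to the total number of questions evaluated in each task:
\[\text{Accuracy} = \frac{\text{number of correct answers}}{\text{total number of questions evaluated}}\]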
Each model was evaluated over five trials. Prior to running the experiments, both the order of the questions and the order of the multiple-choice answer options were randomized, generating five distinct shuffled sets. Each model was then tested using the same five shuffled sets, with one set per trial, to ensure comparability across models while allowing assessment of internal consistency. In this way, although five runs were conducted per model, these represent technical replications under controlled parameters (temperature = 0, same dataset, and predefined randomized question order) rather than independent samples. Any variability across runs, expected to be null or very small, reflects stochasticity in the prompt-order sequence rather than inferential uncertainty. Therefore, no hypothesis testing was applied.
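The randomization scheme above (five fixed shuffled sets reused for every model) might be implemented along these lines. This is a sketch under stated assumptions: the question dictionary layout, the function names `make_shuffled_sets` and `accuracy`, and the use of one fixed seed per set are all hypothetical, and option texts are assumed to be unique within a question.

```python
import random
from typing import Dict, List


def make_shuffled_sets(questions: List[Dict], n_sets: int = 5,
                       base_seed: int = 0) -> List[List[Dict]]:
    """Build n_sets reorderings; each shuffles both question and option order."""
    shuffled_sets = []
    for k in range(n_sets):
        rng = random.Random(base_seed + k)  # fixed seed per set (assumed)
        new_set = []
        for q in questions:
            options = list(q["options"])
            correct_text = options[q["correct"]]
            rng.shuffle(options)
            new_set.append({
                "id": q["id"],
                "stem": q["stem"],
                "options": options,
                # Track where the correct option landed after shuffling.
                "correct": options.index(correct_text),
            })
        rng.shuffle(new_set)
        shuffled_sets.append(new_set)
    return shuffled_sets


def accuracy(predictions: Dict[int, int], shuffled_set: List[Dict]) -> float:
    """Proportion of questions whose predicted option index matches the key."""
    hits = sum(predictions.get(q["id"]) == q["correct"] for q in shuffled_set)
    return hits / len(shuffled_set)


# Toy exam with made-up content, only to exercise the functions.
exam = [
    {"id": i, "stem": f"Question {i}",
     "options": [f"option {i}{c}" for c in "abcd"], "correct": i % 4}
    for i in range(1, 7)
]
sets = make_shuffled_sets(exam)
perfect = {q["id"]: q["correct"] for q in sets[0]}
print(len(sets), accuracy(perfect, sets[0]))
```

Because the sets are generated once from fixed seeds, every model sees identical orderings in trial k, which is what makes the five runs comparable across models.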
We evaluated model performance in terms of accuracy and processing time for both LLMs and MLLMs using exam questions containing only textual input (n = 74). For multimodal performance (restricted to MLLMs from the Claude family), we applied the same metrics to all exam questions, including text-only, radiological, and non-radiological image-based items (n = 117). For the assessment of quality, coherence, and safety of model-generated explanations, explanations produced by one MLLM (Claude-3-opus) for each of the 117 exam questions were additionally evaluated by medical experts.
The evaluations were conducted between May 14 and June 21, 2024. During this period, all models were accessed through stable production APIs to ensure reproducibility, using a temperature setting of 0 and fixed decoding parameters.
- GPT-4 Turbo was accessed via an internal API developed by the Hospital Israelita Albert Einstein Machine Learning Operations (MLOps) team, connecting to a private GPT server hosted on Microsoft Azure within the institution’s secure cloud environment (Brazilian region, behind the hospital’s firewall).
- All other LLMs and MLLMs (Claude-3 family, LLaMA, Mixtral Instruct, Titan Text G1-Express, and Command R+) were accessed through Amazon Bedrock (AWS managed service) using Python-based API calls.
The performance of LLMs and MLLMs was then compared with the real-world results of human candidates in the 2024 HCFMUSP Medical Residency Exam. Subsequently, the generated answers were evaluated by three experienced internal medicine physicians using concordance and safety criteria. The models were prompted to provide both the selected answer choice and a written explanation in Portuguese, maintaining the same linguistic structure as the original question.
1. Concordance was defined as internal coherence between the answer and its explanation, without signs of hallucination or misinterpretation.
2. Safety was defined as the absence of reasoning that could plausibly mislead a physician or cause patient harm, regardless of whether the final answer was correct.
Each explanation was therefore classified into one of four categories: concordant and safe, non-concordant and safe, concordant and unsafe, or non-concordant and unsafe.
Statistical analysis
Corrections for repeated measures were applied when evaluating accuracy across the five trials. Standard errors for descriptive statistics were computed using the “seWithin” function, available in the hausekeep package 48.
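To illustrate what a repeated-measures standard error involves, the sketch below applies the Cousineau normalization with the Morey correction, a common approach for within-unit designs. That “seWithin” uses exactly this procedure is an assumption here, not something stated in the text, and the function name `within_subject_se` and the toy numbers are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List


def within_subject_se(data: List[List[float]]) -> List[float]:
    """data[i][j] is the score of unit i (here, a trial) under condition j
    (here, a model). Returns one corrected standard error per condition."""
    n_units = len(data)
    n_cond = len(data[0])
    grand_mean = mean([v for row in data for v in row])
    # Cousineau step: remove each unit's own mean, then restore the grand mean.
    normalized = [[v - mean(row) + grand_mean for v in row] for row in data]
    # Morey step: correct the variance shrinkage caused by the normalization.
    correction = math.sqrt(n_cond / (n_cond - 1))
    return [
        stdev([row[j] for row in normalized]) * correction / math.sqrt(n_units)
        for j in range(n_cond)
    ]


# Made-up accuracies: 3 trials (rows) by 2 models (columns).
toy = [[0.70, 0.66], [0.72, 0.66], [0.68, 0.66]]
print(within_subject_se(toy))
```

The normalization removes between-trial offsets, so the resulting error bars reflect only the variability that matters for comparing models within the same trial.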
Omnibus differences in accuracy and processing time were assessed using a linear mixed-effects model with a fixed effect for the model and a random intercept for the five trials (to account for repeated measures), which is equivalent to a one-way repeated measures ANOVA in this context. Post-hoc pairwise comparisons between models were conducted using estimated marginal means with Holm’s adjustment for multiple comparisons, implemented via the ‘emmeans::emmeans’ function in R 49. The confidence level was set to 95%, and the Satterthwaite method was used to approximate degrees of freedom. For comparison, the model results were also located within the distribution of human candidate performances, using the ‘stats::density’ function in R 50, which applies a Gaussian kernel by default to estimate the probability density function of the data. We used the ‘cld’ (compact letter display) function from the ‘multcomp’ package 51, which assigns different letters to groups to facilitate visualization of statistical differences among them. Furthermore, the accuracy across the five trials of the MLLMs (Claude-3-Sonnet, Claude-3-opus, Claude-3.5-Sonnet and Claude-3-haiku) was further segmented by main medical areas (gynecology and obstetrics, pediatrics, internal medicine, public health and surgery) and by types of questions (textual or containing non-radiological or radiological images).
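Holm’s adjustment, used above for the pairwise comparisons, is a step-down procedure that can be written in a few lines. The sketch below is a generic Python illustration of the adjustment itself, not the authors’ R code; the function name `holm_adjust` and the example p-values are made up.

```python
from typing import List


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up raw p-values for three pairwise comparisons.
print(holm_adjust([0.01, 0.04, 0.03]))
```

A comparison is declared significant when its adjusted p-value falls below the chosen threshold (0.05 at the 95% confidence level used here).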
Agreement between experts was assessed using Gwet’s AC1 coefficient, following previously validated methodology 52. This coefficient ranges from -1 (complete disagreement) to +1 (complete agreement) and tests the null hypothesis of no agreement (AC1 = 0), such that disagreement or agreement between observers is significant for p < 0.05. Although not traditionally classified as such, Gwet’s AC1 can also be interpreted as a measure of effect size, in a role similar to that of Pearson’s correlation. While Pearson’s r quantifies the strength and direction of a linear relationship between variables, Gwet’s AC1 expresses the magnitude of agreement or disagreement between raters or events. Both coefficients are dimensionless, have well-defined standardized limits (ranging from \(-1\) to 1), and are not affected by the sample size, properties that characterize effect size measures Jones2025.
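For readers unfamiliar with the coefficient, the point estimate of Gwet’s AC1 for several raters can be sketched as below. This is an illustrative implementation of the standard formula, assuming every item is rated by the same number of raters with no missing ratings; it does not compute the p-value, and the function name `gwet_ac1` and the toy ratings are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def gwet_ac1(ratings: List[Sequence[Hashable]],
             categories: Sequence[Hashable]) -> float:
    """ratings[i] holds the labels given to item i by each rater."""
    n = len(ratings)          # number of items
    r = len(ratings[0])       # raters per item (assumed constant)
    q = len(categories)       # number of possible categories
    counts = [Counter(row) for row in ratings]
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    pa = sum(
        sum(c[k] * (c[k] - 1) for k in categories) / (r * (r - 1))
        for c in counts
    ) / n
    # Overall proportion of ratings that fall in each category.
    pi = [sum(c[k] for c in counts) / (n * r) for k in categories]
    # Chance agreement as defined for AC1.
    pe = sum(p * (1 - p) for p in pi) / (q - 1)
    return (pa - pe) / (1 - pe)


# Made-up ratings: 4 explanations, 2 raters, binary label (1 = concordant).
toy_ratings = [(0, 0), (0, 1), (1, 1), (0, 0)]
print(gwet_ac1(toy_ratings, categories=[0, 1]))
```

In the toy data the raters agree on three of four items (observed agreement 0.75), chance agreement is 0.46875, and the coefficient is therefore 9/17, or about 0.53.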
Finally, questions flagged by experts as having incorrect explanations, regardless of whether the selected answer was correct, were reviewed to determine whether the explanation reflected a misinterpretation of the question or a mismatch between the explanation and the chosen alternative. These analyses were based on a single trial of Claude-3-opus, which was available at the time the physicians were recruited.
Data availability
Supplementary material, including exam questions in Portuguese, explanations for the answers provided by the AI model, data, and R and Python scripts that reproduce the results presented in this article, along with some additional analyses, is available on the Harvard Dataverse at https://doi.org/10.7910/DVN/OLKIL3.
References
Sarkar, C. et al. Artificial intelligence and machine learning technology driven modern drug discovery and development. Int. J. Mol. Sci. 24. https://doi.org/10.3390/ijms24032026 (2023).
Shah, N.H., Entwistle, D., & Pfeffer, M.A. Creation and adoption of large language models in medicine. JAMA. 330. https://doi.org/10.1001/jama.2023.14217 (2023).
Haug, C. J. & Drazen, J. M. Artificial intelligence and machine learning in clinical medicine, 2023. N. Engl. J. Med. 388. https://doi.org/10.1056/nejmra2302038 (2023).
Marafino, B. J. et al. Validation of prediction models for critical care outcomes using natural language processing of electronic health record data. JAMA Netw. Open. 1. https://doi.org/10.1001/jamanetworkopen.2018.5097 (2018).
Moskovitch, R., Polubriaginof, F., Weiss, A., Ryan, P. & Tatonetti, N. Procedure prediction from symbolic Electronic Health Records via time intervals analytics. J Biomed Inform. 75. https://doi.org/10.1016/j.jbi.2017.07.018 (2017).
Harutyunyan, H., Khachatrian, H., Kale, D. C., Steeg, G. V. & Galstyan, A. Multitask learning and benchmarking with clinical time series data. Sci Data. 6. https://doi.org/10.1038/s41597-019-0103-9 (2019).
Bommasani, R., Liang, P. & Lee, T. Holistic Evaluation of Language Models. Ann. N Y Acad. Sci. 1525. https://doi.org/10.1111/nyas.15007 (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature. 620. https://doi.org/10.1038/s41586-023-06291-2 (2023).
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A.A.M., Abid, A., Fisch, A., et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Available from: https://arxiv.org/abs/2206.04615 (2023).
Nori, H., Lee, Y.T., Zhang, S., Carignan, D., Edgar, R., Fusi, N., et al. Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine; 2023. Available from: https://arxiv.org/abs/2311.16452.
Tayebi Arasteh, S. et al. RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering. Radiol. Artif. Intell. e240476 (2025).
Zakka, C., Shad, R., Chaurasia, A., Dalal, A.R., Kim, J.L., Moor, M., et al. Almanac–retrieval-augmented language models for clinical medicine. NEJM AI. 1(2), AIoa2300068 (2024).
Wind, S., Sopa, J., Truhn, D., Lotfinia, M., Nguyen, T.T., Bressem, K., et al. Agentic large language models improve retrieval-based radiology question answering. arXiv preprint arXiv:2508.00743. (2025).
Plaat, A., van Duijn, M., van Stein, N., Preuss, M., van der Putten, P., Batenburg, K.J. Agentic large language models: a survey. arXiv preprint arXiv:2503.23037. (2025).
Ranathunga, S., de Silva, N.: Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World (2022). Available from: https://arxiv.org/abs/2210.08523.
Wikipedia. Portuguese-speaking world — Wikipedia, The Free Encyclopedia; 2024. Available from: https://en.wikipedia.org/wiki/Portuguese-speaking_world.
Wikipedia. List of languages by number of native speakers; 2025. Accessed: 2025–01-11. Available from: https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers.
W3Techs. Usage Statistics of Content Languages for Websites; 2025. Accessed: 2025–01-11. Available from: https://w3techs.com/technologies/overview/content_language.
Lorenzoni, G., Gregori, D., Bressan, S., Ocagli, H., Azzolina, D., Dalt, L.D., et al. Use of a large language model to identify and classify injuries with free-text emergency department data. JAMA Netw Open. 5(7), e2413208–8. https://doi.org/10.1001/jamanetworkopen.2024.13208 (2024).
Liu, M. et al. Performance of ChatGPT across different versions in medical licensing examinations worldwide: systematic review and meta-analysis. J. Med. Internet Res. 26, e60807. https://doi.org/10.2196/60807 (2024).
Frei, J. & Kramer, F. Annotated dataset creation through large language models for non-english medical NLP. J. Biomed. Inform. 145. https://doi.org/10.1016/j.jbi.2023.104478 (2023).
Garcia, G.L., Paiola, P.H., Morelli, L.H., Candido, G., Júnior, A.C., Jodas, D.S., et al. Introducing Bode: A Fine-Tuned Large Language Model for Portuguese Prompt-Based Task. Available from: https://arxiv.org/abs/2401.02909 (2024).
Almeida, T.S., Abonizio, H., Nogueira, R., Pires, R. Sabiá-2: A New Generation of Portuguese Large Language Models (2024). Available from: https://arxiv.org/abs/2403.09887.
Guillen-Grima, F. et al. Evaluating the Efficacy of ChatGPT in Navigating the Spanish Medical Residency Entrance Examination (MIR): Promising Horizons for AI in Clinical Medicine. Clin. Pract. 13, 35. https://doi.org/10.3390/clinpract13060130 (2023).
Martín-Sanjosé, J.F., Juan, M.C., Vivó, R., Abad, F. The effects of images on multiple-choice questions in computer-based formative assessment. Digit. Educ. Rev. (2015).
Sagoo, M. G., Vorstenbosch, M. A. T. M., Bazira, P. J., Ellis, H., Kambouri, M. & Owen, C. Online Assessment of Applied Anatomy Knowledge: The Effect of Images on Medical Students’ Performance. Anat Sci Educ. 14 (2021).
Wang, Z., Ardasheva, Y., Carbonneau, K. & Liu, Q. Testing the seductive details effect: Does the format or the amount of seductive details matter? Appl Cogn Psychol. 35, 25. https://doi.org/10.1002/acp.3801 (2021).
Pouw, W., Rop, G., de Koning, B. & Paas, F. The Cognitive Basis for the Split-Attention Effect. J Exp Psychol Gen. 148. https://doi.org/10.1037/xge0000578 (2019).
Crisp, V. & Sweiry, E. Can a picture ruin a thousand words? The effects of visual resources in exam questions. Educ. Res. 48. https://doi.org/10.1080/00131880600732249 (2006).
Kung, T. H. et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PloS Digit Health. 2. https://doi.org/10.1371/journal.pdig.0000198 (2023).
Chan, S. et al. Machine learning in dermatology: current applications, opportunities, and limitations. Dermatol. Ther. 10. https://doi.org/10.1007/s13555-020-00372-0 (2020).
Du, A. X., Emam, S. & Gniadecki, R. Review of machine learning in predicting dermatological outcomes. Front Med. 7. https://doi.org/10.3389/fmed.2020.00266 (2020).
Hogarty, D. T. et al. Artificial intelligence in dermatology-where we are and the way to the future: A review. Am. J. Clin. Dermatol. 21. https://doi.org/10.1007/s40257-019-00462-6 (2020).
Panagoulias, D.P., Tsoureli-Nikita, E., Virvou, M., Tsihrintzis, G.A. Dermacen Analytica: A Novel Methodology Integrating Multi-Modal Large Language Models with Machine Learning in tele-dermatology (2024). arxiv: https://arxiv.org/abs/2403.14243.
Kelly, B.S., Judge, C., Bollard, S.M., Clifford, S.M., Healy, G.M., Aziz, A., et al. Radiology artificial intelligence: a systematic review and evaluation of methods (RAISE). Eur. Radiol. 32. https://doi.org/10.1007/s00330-022-08784-6 (2022).
Katal, S., York, B. & Gholamrezanezhad, A. AI in radiology: From promise to practice - A guide to effective integration. Eur. J. Radiol. 181. https://doi.org/10.1016/j.ejrad.2024.111798 (2024).
Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., et al. Hallucination of Multimodal Large Language Models: A Survey (2024). Available from: arxiv: https://arxiv.org/abs/2404.18930.
Li, A. et al. Mitigating Hallucinations in Large Language Models: A Comparative Study of RAG-enhanced vs. Human-Generated Medical Templates. medRxiv. (2024). Available from: https://doi.org/10.1101/2024.09.27.24314506.
Gu, Z., Yin, C., Liu, F., Zhang, P. MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context (2024). Available from: arxiv: https://arxiv.org/abs/2407.02730.
Ngo, A. et al. ChatGPT 3.5 fails to write appropriate multiple choice practice exam questions. Acad Pathol. 11, (2024).
Cecil, J., Lermer, E., Hudecek, M. F. C., Sauer, J. & Gaube, S. Explainability does not mitigate the negative impact of incorrect AI advice in a personnel selection task. Sci Rep. 14, 9736. https://doi.org/10.1038/s41598-024-60220-5 (2024).
Osama, M., Afridi, S. & Maaz, M. ChatGPT: Transcending Language Limitations in Scientific Research Using Artificial Intelligence. J. Coll. Phys. Surg. Pak. 33. https://doi.org/10.29271/jcpsp.2023.10.1198 (2023).
Bedi, S., Cui, H., Fuentes, M., Unell, A., Wornow, M., Banda, J.M., et al. MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks (2025). Available from: arxiv: https://arxiv.org/abs/2505.23802.
FUVEST. Residência Médica 2024 - FUVEST divulga questões de prova objetiva de concurso para residência médica da FMUSP - Fuvest; 2024. Available from: https://www.fuvest.br/residencia-medica-2024-fuvest-divulga-questoes-de-prova-objetiva-de-concurso-para-residencia-medica-da-fmusp/.
Ali, R., Tang, O.Y., Connolly, I.D., Fridley, J.S., Shin, J.H., Sullivan, P.L.Z., et al. Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank. Neurosurgery. 93(5). https://doi.org/10.1227/neu.0000000000002551 (2023).
Garabet, R., Mackey, B. P., Cross, J. & Weingarten, M. ChatGPT-4 Performance on USMLE Step 1 Style Questions and Its Implications for Medical Education: A Comparative Study Across Systems and Disciplines. Med. Sci. Educ. 34. https://doi.org/10.1007/s40670-023-01956-z (2024).
Mihalache, A., Huang, R. S., Popovic, M. M. & Muni, R. H. ChatGPT-4: An assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Med. Teacher. 46. https://doi.org/10.1080/0142159X.2023.2249588 (2024).
Lin, H. hausekeep: A Collection of Utility Functions for Data Science and Statistics - Compute standard errors (within-subjects);. Available from: https://hauselin.github.io/hausekeep/reference/seWithin.html.
Lenth, R.V. emmeans: Estimated marginal means, aka least-squares means. R package (version 1.7.1). R Foundation for Statistical Computing. 34,(2021).
Team, R.C. R: A Language and Environment for Statistical Computing (2024). Available from: https://www.R-project.org/.
Hothorn, T., Bretz, F. & Westfall, P. Simultaneous inference in general parametric models. Biometr. J. 6(50), 346–363. https://doi.org/10.1002/bimj.200810425 (2008).
Silveira, P. S. P. & Siqueira, J. O. Better to be in agreement than in bad company: A critical analysis of many kappa-like tests. Behav Res Methods. https://doi.org/10.3758/s13428-022-01950-0 (2022).
Funding
The authors declare that no funding was received to support this research.
Author information
Contributions
C.T. Methodology, Writing – Review and Editing, Data Curation, Writing – Original Draft, Project administration. G.M. Methodology, Writing – Review and Editing, Software, Investigation. D.L. Methodology, Writing – Review and Editing, Software, Investigation. A.R. Methodology, Writing – Review and Editing, Formal analysis, Visualization. A.P. Methodology, Writing – Review and Editing, Data Curation. U.F. Methodology, Writing – Review and Editing, Data Curation. E.R. Methodology, Writing – Review and Editing, Conceptualization. J.V. Methodology, Writing – Review and Editing. P.S. Methodology, Writing – Review and Editing, Formal analysis, Data Curation, Visualization, Supervision. E.A. Methodology, Writing – Review and Editing, Conceptualization, Supervision, Funding acquisition.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Truyts, C.A.M., Rabelo, A.G., Souza, G.M.d. et al. Zero-shot performance of selected large language and multimodal models on the 2023 Brazilian Portuguese medical residency exam. Sci Rep 16, 11756 (2026). https://doi.org/10.1038/s41598-026-42829-w