Abstract
Large language models (LLMs) show promise in healthcare, but concerns remain that they may produce medically unjustified clinical care recommendations reflecting the influence of patients’ sociodemographic characteristics. We evaluated nine LLMs, analyzing over 1.7 million model-generated outputs from 1,000 emergency department cases (500 real and 500 synthetic). Each case was presented in 32 variations (31 sociodemographic groups plus a control) while holding clinical details constant. Compared to both a physician-derived baseline and each model’s own control case without sociodemographic identifiers, cases labeled as Black or unhoused or identifying as LGBTQIA+ were more frequently directed toward urgent care, invasive interventions or mental health evaluations. For example, certain cases labeled as being from LGBTQIA+ subgroups were recommended mental health assessments approximately six to seven times more often than clinically indicated. Similarly, cases labeled as having high-income status received significantly more recommendations (P < 0.001) for advanced imaging tests such as computed tomography and magnetic resonance imaging, while low- and middle-income-labeled cases were often limited to basic or no further testing. After applying multiple-hypothesis corrections, these key differences persisted. Their magnitude was not supported by clinical reasoning or guidelines, suggesting that they may reflect model-driven bias, which could eventually lead to health disparities rather than acceptable clinical variation. Our findings, observed in both proprietary and open-source models, underscore the need for robust bias evaluation and mitigation strategies to ensure that LLM-driven medical advice remains equitable and patient centered.
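The comparison described above reduces, for each model and recommendation type, to contrasting how often a sociodemographically labeled case variant receives a given recommendation with how often the unlabeled control variant does, followed by a multiple-hypothesis correction. The sketch below is purely illustrative and is not the authors' analysis code; the counts, the use of Fisher's exact test and the Benjamini–Hochberg procedure are assumptions chosen for demonstration.

```python
# Illustrative sketch: compare how often a labeled case variant receives a
# recommendation (e.g. a mental health assessment) versus the control variant,
# then correct p values across the group/recommendation comparisons.
# Counts and statistical choices here are assumptions, not the paper's pipeline.
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# (recommended, not recommended) counts out of the same 500 cases per variant
counts = {
    "control":       (40, 460),
    "group_A_label": (270, 230),  # hypothetical labeled subgroup
    "group_B_label": (55, 445),
}

p_values, labels = [], []
for group, (rec, not_rec) in counts.items():
    if group == "control":
        continue
    table = [[rec, not_rec], [counts["control"][0], counts["control"][1]]]
    odds_ratio, p = fisher_exact(table)
    rate_ratio = (rec / 500) / (counts["control"][0] / 500)
    print(f"{group}: rate ratio vs control = {rate_ratio:.1f}, p = {p:.2e}")
    p_values.append(p)
    labels.append(group)

# Benjamini-Hochberg correction across all comparisons
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for group, p_a, sig in zip(labels, p_adj, reject):
    print(f"{group}: adjusted p = {p_a:.2e}, significant = {sig}")
```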
Data availability
All the synthetic cases are publicly available in a Hugging Face repository: https://huggingface.co/datasets/mamuto11/LLMs_Bias_Bench. Interested parties may access these materials without a data use agreement. The de-identified real data are available upon request for further research on sociodemographic or other output biases in AI. Interested parties should contact one of the corresponding authors; we will respond within 21 days.
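For readers who want to work with the released synthetic cases, a minimal loading sketch is shown below. It assumes the standard Hugging Face `datasets` API; the split and column names are not specified on this page and should be verified against the repository.

```python
# Minimal sketch for pulling the released synthetic cases; the split name and
# record structure are assumptions and should be checked against the repository.
from datasets import load_dataset

dataset = load_dataset("mamuto11/LLMs_Bias_Bench")
print(dataset)                     # shows available splits and columns
first_split = next(iter(dataset))  # e.g. "train", if that is how it is published
print(dataset[first_split][0])     # inspect one synthetic case record
```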
Code availability
All code, including scripts and instructions for generating and analyzing the synthetic vignettes, is provided in the Supplementary Information without restriction. For any questions about the data or code, please contact the corresponding author; inquiries will receive a response within 14 days.
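The Supplementary Information contains the actual generation and analysis scripts. As a rough illustration of the case-variation design described in the abstract (one clinical vignette presented with different sociodemographic labels plus an unlabeled control, while the clinical content stays fixed), a hypothetical sketch might look like the following; the vignette, label list and wording are placeholders, not the study's prompts.

```python
# Rough illustration of the case-variation idea: one fixed clinical vignette is
# prefixed with different sociodemographic labels (plus an unlabeled control)
# while the clinical details stay identical. Labels and wording are placeholders.
CLINICAL_VIGNETTE = (
    "A 54-year-old patient presents to the emergency department with "
    "two hours of substernal chest pain radiating to the left arm."
)

SOCIODEMOGRAPHIC_LABELS = [
    None,                              # control: no sociodemographic identifier
    "The patient is Black.",
    "The patient is unhoused.",
    "The patient identifies as LGBTQIA+.",
    "The patient has a high income.",
]

def build_variants(vignette, labels):
    """Return one prompt per label, keeping the clinical details constant."""
    variants = []
    for label in labels:
        prefix = f"{label} " if label else ""
        variants.append(prefix + vignette)
    return variants

for prompt in build_variants(CLINICAL_VIGNETTE, SOCIODEMOGRAPHIC_LABELS):
    print(prompt)
```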
Acknowledgements
We thank K. Devarakonda and her team for providing key edits and feedback during the submission process. Financial disclosure: this research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
Author information
Contributions
M.O. led the study design, data analysis, visualizations and paper drafting. S.S., R.A., E.K. and G.N.N. contributed to data interpretation and paper refinement. N.L.B., D.U.A., C.R.H., A.W.C., B.S.G., R.F. and B.K. provided expert review, validation and paper editing. All authors reviewed and approved the final paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Medicine thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Sections 1–6, Figs. 1–5 and Tables 1–29.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Omar, M., Soffer, S., Agbareia, R. et al. Sociodemographic biases in medical decision making by large language models. Nat Med 31, 1873–1881 (2025). https://doi.org/10.1038/s41591-025-03626-6