Abstract
Recent advancements in large language models (LLMs) have demonstrated their potential in numerous medical applications, particularly in automating clinical trial matching for translational research and enhancing medical question-answering for clinical decision support. However, our study shows that incorporating non-decisive socio-demographic factors, such as race, sex, income level, LGBT+ status, homelessness, illiteracy, disability, and unemployment, into the input of LLMs can lead to incorrect and harmful outputs. These discrepancies could worsen existing health disparities if LLMs are broadly implemented in healthcare. To address this issue, we introduce EquityGuard, a novel framework designed to detect and mitigate the risk of health inequities in LLM-based medical applications. Our evaluation demonstrates its effectiveness in promoting equitable outcomes across diverse populations.
Introduction
Large language models (LLMs)1,2,3,4,5,6 have demonstrated significant promise across a range of medical applications. Models such as GPT-4 can process vast amounts of text, generating human-like responses, summaries, and contextually relevant insights. This capability offers considerable potential for advancing both patient care and medical research. LLMs are particularly valuable in tasks such as clinical trial matching and medical question answering (MQA), which are crucial for translational research and clinical decision support, respectively. These applications underscore the transformative role LLMs can play in improving healthcare outcomes and streamlining research efforts.
However, despite these impressive capabilities, LLMs may exacerbate persistent healthcare inequities worldwide. In many clinical settings, especially in low-resource environments, biased decision-making can deepen disparities in treatment and access to care. This urgent challenge calls for artificial intelligence (AI) systems that are not only powerful but also fair and unbiased. To address this, we propose EquityGuard, a novel framework that employs contrastive learning to actively mitigate bias in LLM outputs. In this study, we validate EquityGuard on two primary medical tasks: clinical trial matching and medical question answering.
Clinical trial matching (CTM), an essential process for accelerating translational research, involves identifying and pairing patients with appropriate clinical trials based on complex eligibility criteria derived from patient medical records and trial protocols7. Although LLMs offer transformative solutions by automating this process, they can inadvertently propagate bias, leading to the systematic exclusion of certain demographic groups from clinical trials.
Similarly, medical question-answering (MQA) systems powered by LLMs8,9,10,11,12,13,14 hold great potential for enhancing clinical decision support by integrating diverse sources such as clinical guidelines, research papers, and patient-specific information. Yet, biased outputs in MQA tasks may lead to misinformation and disproportionately affect underrepresented communities.
Our evaluation includes state-of-the-art models such as GPT-4 as well as the latest releases, including Gemini and Claude (2024 versions). Although these models demonstrate remarkable performance improvements, they still inherit biases from their training data.
In this study, we aim to address two key research questions:
- RQ1: To what extent do LLMs exhibit inequities across two major medical applications, i.e., CTM and MQA tasks?
- RQ2: What techniques can be applied to mitigate inequities when applying LLMs in medical applications, and how effective are they in promoting health equity?
Understanding how inequities manifest across healthcare tasks is essential to address these issues. Previous research has identified several sources of inequity, including inherent biases in training data, underrepresentation of certain groups, and algorithmic design flaws15,16,17. However, there remains a need for focused investigations into how these inequities affect specific healthcare tasks, such as CTM and MQA. This paper aims to fill that gap by identifying and mitigating inequities in these applications. Two examples are illustrated in Fig. 1.
Fig. 1: Clinical trial matching (left) and medical question answering (right). On the left, including race and sex information (e.g., “African-American” and “woman”) in the patient note, despite being irrelevant to matching the correct clinical trials, resulted in altered clinical trial recommendations generated by the LLMs. On the right, adding race information (e.g., “Native American”) to the question, which should not affect the response, led to incorrect answers from the LLMs. These examples show that non-decisive socio-demographic factors can lead to incorrect LLM outputs for different patient populations, which may result in harmful clinical outcomes for those populations and ultimately exacerbate healthcare inequities.
The proposed EquityGuard framework is based on contrastive learning and is designed to systematically evaluate and mitigate inequities in LLMs18,19,20,21. EquityGuard uses contrastive learning techniques22,23,24 to disentangle socio-demographic determinants of health (SDOH) factors from task-related embeddings, ensuring that these attributes do not unduly influence model predictions. Through a series of experiments, we show that EquityGuard can enhance equity in LLMs for medical applications, specifically CTM and MQA tasks. EquityGuard is designed to be adaptable across diverse healthcare settings, including low-resource environments25,26, thereby effectively mitigating bias even when clinical data are scarce and promoting equitable outcomes in both CTM and MQA tasks.
Results
Our experiments focused on examining how race, sex, and SDOH factors (including low income, LGBT+, homeless, illiteracy, disabled, and unemployed) influence the outputs of LLMs and potentially introduce inequity and inaccuracy. To address these issues, we proposed the EquityGuard framework, which leverages contrastive learning to mitigate the effects of irrelevant SDOH attributes by aligning embeddings of similar inputs. This approach aims to improve the fairness of LLM outputs by reducing the influence of sensitive demographic factors.
We evaluated the models on five datasets across two key medical applications: CTM and MQA tasks. The CTM datasets include SIGIR 201627, TREC 2021, and TREC 202228, while the MQA datasets are MedQA8 and MedMCQA9. We added specific terms for race, sex, and each SDOH category to the input of different LLMs, in the same way as illustrated in Fig. 1, to examine the output. We tested four LLMs for the evaluation: GPT-4, GPT-4o Mini, Gemini (Gemini 1.5 Flash)29, and Claude (specifically, Claude-3-5-sonnet-20240620)30. For the EquityGuard implementation, we mainly used open-source LLMs, namely LLaMA3 8B and Mistral v0.3, and compared them with the baseline GPT-4 model. More details about the EquityGuard framework and approach used in this study can be found in the Methods section.
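As an illustration of this evaluation setup, the sketch below shows how a single non-decisive socio-demographic sentence can be injected into an otherwise identical input before it is sent to an LLM. The attribute phrases, the `perturb_note` helper, and the example patient note are hypothetical and are not taken from the study's datasets or prompts.

```python
# Illustrative sketch (not the authors' exact evaluation code): prepend one
# non-decisive socio-demographic sentence to an otherwise identical input.
ATTRIBUTE_PHRASES = {
    "race_black": "The patient is African-American.",
    "sex_female": "The patient is a woman.",
    "sdoh_homeless": "The patient is currently homeless.",
    "sdoh_low_income": "The patient has a low income.",
}

def perturb_note(note: str, attribute_key: str) -> str:
    """Return a copy of the note with one sensitive-attribute sentence prepended."""
    return f"{ATTRIBUTE_PHRASES[attribute_key]} {note}"

baseline_note = "58-year-old with stage II NSCLC, ECOG 1, no prior systemic therapy."
for key in ATTRIBUTE_PHRASES:
    perturbed = perturb_note(baseline_note, key)
    print(perturbed)
    # Each perturbed note is sent to the LLM, and its trial ranking (or answer)
    # is compared with the output produced for the unperturbed baseline_note.
```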
Comparison of equity in LLMs
Figure 2 presents radar plots comparing the performance of the LLMs on CTM and MQA tasks when different SDOH factors are introduced into the dataset. Performance for the CTM task is measured using the Normalized Discounted Cumulative Gain at rank 10 (NDCG@10), with higher values indicating better performance. For the MQA task, error rates are used, with lower values indicating better performance.
Fig. 2: Clinical trial matching (CTM) performance is measured using NDCG@10 (higher is better), while medical question answering (MQA) performance is measured using error rate (lower is better). SDOH factors include race, sex, low income, LGBT+ status, homelessness, illiteracy, disability, and unemployment. Each sensitive attribute was incorporated into the input data for both CTM and MQA tasks during the evaluation.
Among the evaluated models, GPT-4 consistently demonstrated the best overall performance across a variety of SDOH factors. In the CTM task, GPT-4 maintained relatively stable NDCG@10 scores, even when different SDOH factors were included in the input. For instance, GPT-4 performed particularly well across low-income, unemployed, and disabled groups, showing minimal variation in NDCG@10 scores compared to other models. Moreover, GPT-4 exhibited balanced performance across racial groups, including Black, White, and Hispanic, reflecting greater fairness in handling diverse queries (Fig. 2 left).
In contrast, both Gemini and Claude showed greater variability in performance across SDOH factors. These models experienced significant declines in NDCG@10 scores for the Native American, Middle Eastern, and Pacific Islander groups, indicating poorer handling of less-represented racial categories. Furthermore, they exhibited noticeably higher error rates in the MQA task for queries in the homelessness, unemployment, and low-income categories, revealing challenges in equity concerning SDOH factors.
While GPT-4 maintained lower error rates for most MQA categories, especially in sex and race-related queries, Gemini and Claude demonstrated a higher propensity for errors in underrepresented groups. For example, Gemini exhibited error rates as high as 0.31 for low income cases and 0.29 for homeless populations. These disparities highlight that, while GPT-4 offers better equity, Gemini and Claude are more prone to producing inequitable outputs for vulnerable groups, particularly in MQA tasks.
To quantify fairness, we used the equal opportunity (EO) and demographic parity (DP) metrics. GPT-4 again outperformed the other models, showing consistent results with higher EO and DP scores across different SDOH factors. Gemini and Claude, however, displayed greater disparities, particularly in their treatment of unemployed and low income groups, suggesting that these models struggle to maintain fairness across diverse populations (Fig. 3).
Fairness and correlation analysis
To further investigate the inequities observed in the LLM models, we conducted a correlation analysis between different categories. We evaluated whether certain race, sex and SDOH factors tend to produce similar biased outcomes, which is critical for understanding systemic biases and improving fairness across categories.
For each pair of inequity categories, we calculated the correlation based on the following criteria (a minimal scoring sketch follows the list):
- If both categories result in the same wrong answer or the same wrong rerank order, a correlation of +1 is assigned.
- If one category is correct while the other is wrong, a correlation of −1 is assigned.
- If both categories yield the correct answer or the correct rerank order, a correlation of 0 is assigned.
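Below is a minimal sketch of how per-item scores under these criteria could be averaged into a pairwise correlation matrix. The data layout (a dict of per-category model answers plus gold answers) and the handling of the case where both categories are wrong in different ways, which the criteria above do not specify, are assumptions.

```python
from itertools import combinations

def pair_score(ans_a, ans_b, gold) -> int:
    """Per-item score following the criteria above; both-wrong-but-different is
    treated as 0 here because the text does not specify that case (an assumption)."""
    correct_a, correct_b = (ans_a == gold), (ans_b == gold)
    if not correct_a and not correct_b and ans_a == ans_b:
        return 1      # same wrong answer / same wrong rerank order
    if correct_a != correct_b:
        return -1     # one correct, the other wrong
    return 0          # both correct (or both wrong in different ways)

def category_correlation(answers: dict, gold: list) -> dict:
    """Average per-item scores over all questions for every pair of inequity categories."""
    corr = {}
    for cat_a, cat_b in combinations(answers, 2):
        scores = [pair_score(a, b, g) for a, b, g in zip(answers[cat_a], answers[cat_b], gold)]
        corr[(cat_a, cat_b)] = sum(scores) / len(scores)
    return corr
```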
In the CTM task (Fig. 4, left), several race, sex and SDOH factors exhibited strong correlations, revealing compounded inequity patterns. The Black and Pacific Islander categories displayed a high correlation coefficient of 0.5, suggesting that model decisions were consistently similar for these groups. Additionally, the socioeconomic factors unemployed and low income showed a notable correlation of 0.25, indicating that inequities related to SDOH heavily influence model outputs in the CTM task.
Fig. 4: The left plot shows the correlation between inequity categories in CTM tasks, illustrating how different inequity-modified queries resulted in similar trial rankings or selections by the models. The right plot shows the correlation between inequity categories in MQA tasks, displaying how often different inequity-modified queries led to the same answers or error patterns. These heatmaps help analyze how inequities across categories are interconnected, impacting model fairness.
Other factors, such as low income and Black, demonstrated moderate correlations (0.25), pointing to shared inequities between economic disadvantage and racial categories. Conversely, Hispanic and low income showed a negative correlation (−0.26), highlighting disparities in how the model treats these categories. The unemployed and Mixed Race categories showed a weaker positive correlation of 0.2, indicating less interconnection between these inequities compared to others.
In the MQA task (Fig. 4, right), similar trends were observed. Strong correlations were found between race and SDOH factors. The unemployed group was closely related to the disability category, with correlation values exceeding 0.17. This implies that inequities in SDOH factors significantly align with racial inequities, further entrenching model biases when answering medical questions.
Interestingly, the low income category showed negative correlations with all other categories. This suggests that, in this particular task, the model treats low income as a distinct factor not strongly linked to other SDOH factors or demographic attributes. One possible explanation is that the task might focus more on clinical or health-related issues, and SDOH factors like income level may not be as directly relevant. Consequently, the model pays less attention to low-income as a critical feature in this context, leading to these lower correlations.
Inequity mitigation in CTM
We evaluated the effectiveness of the EquityGuard framework in mitigating inequities in LLMs for the CTM task by analyzing model performance across race, sex, and SDOH factors. The models assessed were LLaMA3 8B and Mistral v0.3, both with and without the application of EquityGuard (denoted as w/ EquityGuard and w/o EquityGuard, respectively). GPT-4 was included as a baseline model.
Tables 1 and 2 present the NDCG@10 scores across different race and sex categories. Models trained with EquityGuard exhibited more uniform performance across race and sex compared to their counterparts without EquityGuard. For instance, LLaMA3 8B w/ EquityGuard maintained NDCG@10 scores around 70% across all categories, whereas LLaMA3 8B w/o EquityGuard showed greater variability, with scores ranging from 67.7% (Native American) to 72.6% (Asian). The performance disparities between these groups indicate inequities in models without EquityGuard.
Table 3 extends the analysis to SDOH factors, including LGBT+, low income, unemployed, disabled, illiterate and homeless. Models with EquityGuard displayed more consistent performance across these categories. For example, LLaMA3 8B w/ EquityGuard achieved higher NDCG@10 scores in the low income (89.8%) and unemployed (87.4%) categories compared to w/o EquityGuard (81.3% and 83.4%, respectively). This improvement suggests that EquityGuard enhances fairness in CTM by mitigating inequities associated with SDOH factors.
Inequity mitigation in MQA
We further assessed EquityGuard’s impact on the MQA task using the MedQA and MedMCQA datasets. Error rates across SDOH categories are presented in Tables 4 and 5. Implementing EquityGuard led to a noticeable reduction in error rates across all race, sex, and SDOH categories. For instance, LLaMA3 8B w/ EquityGuard achieved an average error rate of 19.8%, compared to 21.2% for w/o EquityGuard, an absolute reduction of approximately 1.4 percentage points.
Notably, the reduction in error rates was significantly greater in categories that initially exhibited higher inequities. In the Black category, LLaMA3 8B’s error rate decreased from 22.0% (w/o EquityGuard) to 20.3% (w/ EquityGuard). Mistral v0.3 showed similar improvements, with error rates decreasing from 21.4% to 19.5% in the Black category after applying EquityGuard.
Table 6 presents error rates across SDOH categories for the MQA task. EquityGuard effectively reduced error rates in categories such as LGBT+, low-income, and unemployed. For LLaMA3 8B, the error rate in the low-income category decreased from 18.4% (w/o EquityGuard) to 12.7% (w/ EquityGuard). This significant reduction highlights EquityGuard’s capability to mitigate inequities associated with SDOH factors.
Enhanced fairness metrics
To quantify the fairness improvements achieved by EquityGuard, we calculated the EO and DP differences for the LLaMA3 8B models (Fig. 5). Models with EquityGuard (w/ EquityGuard) demonstrated reduced EO and DP differences across SDOH factors, indicating enhanced fairness. Specifically, the EO difference decreased by an average of 28%, and the DP difference decreased by approximately 32% compared to the models without EquityGuard.
Overall impact of EquityGuard
Our results demonstrate that EquityGuard significantly mitigates inequities in LLMs across both CTM and MQA tasks. Key observations include:
- Uniform performance across demographics: Models with EquityGuard provided more consistent NDCG@10 scores and lower error rates across all race, sex, and SDOH categories, indicating reduced inequity.
- Improved fairness metrics: Enhanced EO and DP results affirm that EquityGuard promotes equitable model behavior, ensuring that sensitive demographic factors do not disproportionately influence predictions.
Overall, the application of EquityGuard contributes to more fair and equitable decision-making processes in healthcare AI systems by minimizing the influence of sensitive attributes on model outputs. This is critical for addressing health disparities and ensuring equitable healthcare delivery.
Discussion
LLMs are typically trained on vast amounts of publicly available text data drawn from the internet, books, social media, and other human-generated content. Because this data reflects societal norms, perspectives, and historical narratives, it inevitably contains biases—particularly those that marginalize or misrepresent disadvantaged populations. As a result, LLMs can internalize and replicate these biases in the way they generate text, make predictions, or support decision-making. When such biased models are deployed in health-related tasks—such as clinical decision support, clinical trial matching, patient education, or triage—they risk propagating and even amplifying existing disparities. This may lead to skewed outcomes that disproportionately affect already underserved communities, ultimately exacerbating health inequities rather than alleviating them. In this study, we examined and confirmed the existence of such potential inequities using existing LLMs. We then introduced EquityGuard, a framework employing contrastive learning to mitigate inequities in LLMs applied to healthcare tasks, specifically CTM and MQA. By reducing the undue influence of race, sex and SDOH factors, EquityGuard enhances fairness in LLM outputs. Our experiments across five datasets demonstrated that even advanced models like GPT-4, Claude, and Gemini are susceptible to inequities, which EquityGuard effectively mitigated, leading to more equitable outcomes in both tasks.
Despite these promising results, there are limitations to our approach. One significant challenge lies in accurately identifying and processing socio-demographic determinant factors within the datasets. While we utilized Bio_ClinicalBERT for named entity recognition and developed a custom pipeline to enhance accuracy, the detection of these factors is not foolproof. Misidentification or omission can adversely affect the effectiveness of bias mitigation. Future work could explore more advanced methods for detecting SDOH factors, possibly incorporating additional context or leveraging unsupervised learning techniques31,32,33.

Another limitation pertains to balancing bias mitigation with task performance. While EquityGuard significantly improves fairness metrics, models such as LLaMA3 8B and Mistral v0.3 exhibit a slight performance decrease compared to GPT-434. This can be attributed to the fact that GPT-4, with its substantially larger number of parameters and more sophisticated architecture, inherently possesses greater capacity for handling complex tasks. In contrast, the smaller models experience a trade-off between bias mitigation and task accuracy when the additional contrastive loss is incorporated, since the extra loss component can affect overall model performance. Future work will explore parameter-efficient techniques, adaptive strategies, and alternative loss functions that more effectively balance fairness and performance35,36.

It is also important to note that LLMs inherently exhibit stochastic behavior. Although we report the average performance along with the standard deviations, the variability in outputs may pose risks in healthcare applications, such as inconsistent clinical recommendations. Future work will investigate ensemble methods and post-processing techniques to further mitigate such randomness. Finally, this study focuses exclusively on text-based clinical tasks (CTM and MQA) and does not address bias mitigation in medical imaging, a topic to be explored in future work.
Furthermore, our study focused on a limited set of SDOH factors. The complex nature of biases in healthcare suggests that other factors, such as age or the intersectionality between attributes, could also contribute to biased outcomes37,38,39,40. Expanding EquityGuard to account for a broader range of factors would enhance its applicability and robustness in real-world settings. The evaluation metrics used, DP and EO, provide insights into the models’ fairness but may not capture all dimensions relevant in healthcare contexts38,39,41. Future work could incorporate additional fairness metrics, such as equalized odds or calibration error, to provide a more comprehensive assessment of model fairness42,43. Finally, although EquityGuard exhibits strong bias mitigation capabilities, its deployment at scale within large healthcare systems entails significant challenges related to computational efficiency and ethical oversight.
In conclusion, while EquityGuard shows promise in mitigating inequities in LLMs for CTM and MQA tasks, addressing its current limitations is crucial for the advancement of AI-driven healthcare systems that are not only effective but also equitable. Future work will focus on enhancing the detection of socio-demographic determinant factors, refining inequity mitigation strategies, expanding the range of considered inequities, and exploring additional fairness metrics. By advancing these areas, we aim to contribute to the development of AI models that support fair, transparent, and equitable decision-making in healthcare, ultimately fostering more inclusive and trustworthy technologies for diverse patient populations.
Methods
Overview
We proposed EquityGuard, a contrastive learning-based framework designed to mitigate inequities in LLMs applied to healthcare tasks22,44. Contrastive learning is a self-supervised machine learning technique that aims to learn effective data representations by contrasting positive and negative pairs of samples. The core idea is to map similar data points closer together in the feature space while pushing dissimilar ones further apart. Specifically, we focus on two tasks: CTM and MQA. EquityGuard aims to reduce the influence of race, sex, and SDOH factors on model predictions by aligning embeddings through contrastive learning targeted at biased inputs.
Data processing
Our experiments evaluated EquityGuard on two tasks, CTM and MQA, using five datasets. For CTM, we used the SIGIR 2016, TREC 2021, and TREC 2022 datasets, which provide clinical trial descriptions and patient case reports. For MQA, we employed the MedQA and MedMCQA datasets, which contain complex medical questions and corresponding answer options.
To assess the impact of sex, race, and SDOH factors on model predictions, we applied Bio_ClinicalBERT for named entity recognition and collaborated with medical experts to filter out topics explicitly related to these factors. Detailed data distributions are provided in Tables 7 and 8. Additionally, we generated counterfactual queries by systematically altering these factors (e.g., changing “Black” to “White”) to enable controlled contrastive learning and bias evaluation.
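A minimal sketch of this counterfactual generation step is given below, assuming a simple term-substitution approach; the swap table and the regular-expression replacement are illustrative and are not the authors' exact pipeline.

```python
import re

# Hypothetical swap table; the paper gives "Black" -> "White" as an example.
COUNTERFACTUAL_SWAPS = {
    "Black": "White",
    "African-American": "Caucasian",
    "woman": "man",
    "unemployed": "employed",
}

def make_counterfactual(text: str, swaps: dict = COUNTERFACTUAL_SWAPS) -> str:
    """Replace each sensitive-attribute mention with its counterfactual counterpart."""
    for src, dst in swaps.items():
        text = re.sub(rf"\b{re.escape(src)}\b", dst, text)
    return text

print(make_counterfactual("A 45-year-old Black woman, currently unemployed, with type 2 diabetes."))
# -> "A 45-year-old White man, currently employed, with type 2 diabetes."
```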
EquityGuard framework
EquityGuard employs contrastive learning to minimize the influence of social demographic determinant factors on model outputs by targeting biased inputs. For each query, we construct triplets consisting of an anchor, a positive sample, and a negative sample:
- Anchor (\({x}_{{\rm{anchor}}}\)): the original query without any race, sex, or SDOH factor (neutral version).
- Positive (\({x}_{{\rm{pos}}}\)): a query that includes a race, sex, or SDOH factor, differing minimally from the anchor.
- Negative (\({x}_{{\rm{neg}}}\)): a query that includes additional or different factors compared to the anchor.
The goal is to align the model’s embeddings such that the anchor and positive samples (which share the same medical context) are close in the embedding space, while the negative sample (which introduces additional inequities) is farther away. Figure 6 illustrates the overall process of EquityGuard, where contrastive learning is employed to align the embeddings of the original (anchor) and minimally perturbed (positive) queries while separating those with additional bias (negative). This approach actively mitigates the influence of socio-demographic factors on the final model predictions.
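A minimal sketch of this triplet construction is shown below; the `Triplet` container, the sentence-prepending strategy, and the example query are assumptions made for illustration rather than the authors' exact data pipeline.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    anchor: str    # neutral query, no race/sex/SDOH mention
    positive: str  # same medical content with one sensitive attribute added
    negative: str  # same medical content with additional or different attributes added

def build_triplet(neutral_query: str, attr_sentence: str, other_attr_sentences: list) -> Triplet:
    """Construct a triplet following the anchor/positive/negative definitions above."""
    positive = f"{attr_sentence} {neutral_query}"
    negative = f"{' '.join(other_attr_sentences)} {neutral_query}"
    return Triplet(anchor=neutral_query, positive=positive, negative=negative)

triplet = build_triplet(
    neutral_query="A patient with newly diagnosed atrial fibrillation asks about anticoagulation.",
    attr_sentence="The patient is unemployed.",
    other_attr_sentences=["The patient is unemployed.", "The patient is homeless."],
)
```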
Table 9 provides examples of anchor, positive, and negative samples used in the contrastive learning process.
Model architecture and training
We build upon the LLaMA model, extending it to handle both ranking and question-answering tasks. The model shares a transformer-based backbone and is adapted for each task:
Clinical Trial Matching (Ranking Task): The model encodes patient notes and trial eligibility criteria into embeddings. We compute a relevance score between the patient note and each trial using a scoring function and rank the trials accordingly. The objective is to maximize the NDCG@10 metric while minimizing inequity.
Medical Question Answering (Classification Task): The model encodes medical questions and predicts the correct answer choice. We use cross-entropy loss for training, aiming to minimize the error rate while reducing inequity.
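The sketch below illustrates one way the two task adaptations described above could share an encoder: a relevance score for trial ranking and a classification head for answer selection. The pooled-embedding interface, the cosine-similarity scoring function, and the four-choice head are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskHeads(nn.Module):
    """Task heads on top of a shared LLaMA-based encoder (interface assumed:
    the encoder produces one pooled embedding per input text)."""
    def __init__(self, hidden_size: int, num_choices: int = 4):
        super().__init__()
        self.qa_head = nn.Linear(hidden_size, num_choices)

    def rank_score(self, patient_emb: torch.Tensor, trial_emb: torch.Tensor) -> torch.Tensor:
        # CTM: relevance score between a patient note and a trial; trials are ranked by this score.
        return F.cosine_similarity(patient_emb, trial_emb, dim=-1)

    def qa_logits(self, question_emb: torch.Tensor) -> torch.Tensor:
        # MQA: answer-choice logits, trained with cross-entropy loss.
        return self.qa_head(question_emb)
```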
The overall loss function \({\mathcal{L}}\) combines the task-specific loss \({{\mathcal{L}}}_{{\rm{task}}}\) and the contrastive loss \({{\mathcal{L}}}_{{\rm{contrastive}}}\) for inequity mitigation:
$${\mathcal{L}}={{\mathcal{L}}}_{{\rm{task}}}+\lambda \,{{\mathcal{L}}}_{{\rm{contrastive}}},$$
where λ is a hyperparameter controlling the trade-off between task performance and inequity mitigation.
For the contrastive loss, we use the triplet loss function:
$${{\mathcal{L}}}_{{\rm{contrastive}}}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\max \left(0,\,d\left(f({x}_{i}^{{\rm{anchor}}}),f({x}_{i}^{{\rm{pos}}})\right)-d\left(f({x}_{i}^{{\rm{anchor}}}),f({x}_{i}^{{\rm{neg}}})\right)+m\right),$$
where d( ⋅ , ⋅ ) is a distance metric (e.g., cosine distance), m is the margin, f( ⋅ ) is the embedding function, and N is the number of triplets.
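A minimal PyTorch sketch of this combined objective is given below, using cosine distance as d and the λ and m values reported in the training details; the function names and batch layout are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_anchor, f_pos, f_neg, margin: float = 1.0):
    """Triplet loss with cosine distance d(u, v) = 1 - cos(u, v), averaged over N triplets."""
    d_pos = 1.0 - F.cosine_similarity(f_anchor, f_pos, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(f_anchor, f_neg, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

def total_loss(task_loss, f_anchor, f_pos, f_neg, lam: float = 0.1, margin: float = 1.0):
    """L = L_task + lambda * L_contrastive, as in the formulation above."""
    return task_loss + lam * contrastive_loss(f_anchor, f_pos, f_neg, margin)
```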
We trained the models using the Adam optimizer with a learning rate of 1 × 10⁻⁵. The hyperparameter λ was set to 0.1, and the margin m was set to 1.0, tuned on a validation set. The training was conducted on four NVIDIA V100 GPUs with 32 GB of memory each. We repeated all experiments five times with different random seeds and report the average performance along with the standard deviation to quantify the stability of the model outputs. We performed a sensitivity analysis for the contrastive loss weight λ over the range 0.05 to 0.20. The experimental results (see Table 10) indicate that λ = 0.10 achieves the best trade-off between task performance and fairness improvement.
Evaluation
To evaluate the effectiveness of EquityGuard, we measured both task performance and fairness metrics. For CTM, we used the Normalized Discounted Cumulative Gain at rank 10 (NDCG@10) to evaluate the ranking quality. For MQA, we used the error rate to assess the accuracy of the model in answering questions.
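For reference, a minimal sketch of these two task metrics is given below. The linear-gain form of NDCG and the list-based interface are assumptions; the study may use a different gain or implementation.

```python
import numpy as np

def ndcg_at_k(relevances_in_ranked_order, k: int = 10) -> float:
    """NDCG@k with linear gain: DCG of the predicted ranking divided by DCG of the ideal ranking."""
    rel = np.asarray(relevances_in_ranked_order, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum(rel * discounts))
    ideal = np.sort(np.asarray(relevances_in_ranked_order, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0

def error_rate(predictions, gold) -> float:
    """Fraction of MQA questions answered incorrectly."""
    wrong = sum(p != g for p, g in zip(predictions, gold))
    return wrong / len(gold)
```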
To assess the models’ fairness, we computed two metrics (a minimal computation sketch follows the list):
- Demographic parity (DP): measures the difference in the probability of positive outcomes across different demographic groups.
- Equal opportunity (EO): measures the difference in true positive rates across different demographic groups.
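A minimal sketch of these group-difference computations is shown below. Reporting the maximum pairwise gap across groups and the binary-prediction encoding are assumptions; the paper does not specify the exact aggregation.

```python
import numpy as np

def demographic_parity_diff(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Largest gap in positive-prediction rate between any two demographic groups."""
    rates = [float(y_pred[group == g].mean()) for g in np.unique(group)]
    return max(rates) - min(rates)

def equal_opportunity_diff(y_pred: np.ndarray, y_true: np.ndarray, group: np.ndarray) -> float:
    """Largest gap in true positive rate (recall among positives) between any two groups."""
    tprs = []
    for g in np.unique(group):
        mask = (group == g) & (y_true == 1)
        if mask.any():
            tprs.append(float(y_pred[mask].mean()))
    return max(tprs) - min(tprs) if len(tprs) > 1 else 0.0
```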
We compared EquityGuard with several baseline models: LLaMA3 8B without inequity mitigation, Mistral v0.3 without inequity mitigation, and GPT-4 (a state-of-the-art LLM without explicit inequity mitigation). We also included versions of LLaMA3 8B and Mistral v0.3 with EquityGuard applied to assess the effectiveness of our proposed method. This approach promotes fairness and equity in healthcare applications by mitigating inequities in LLM predictions. The performance of the teacher model and the API costs are reported in Supplementary Tables 1–7.
Data availability
The datasets used in this work, including SIGIR 2016, TREC 2021 and 2022, MedQA, and MedMCQA, are publicly available.
Code availability
The code used in the article is available in this GitHub repository: https://github.com/PittNAIL/EquityGuard.
References
Achiam, J. et al. Gpt-4 technical report. arXiv https://arxiv.org/abs/2303.08774 (2023).
Dubey, A. et al. The llama 3 herd of models. arXiv https://arxiv.org/abs/2407.21783 (2024).
Grosse, R. et al. Studying large language model generalization with influence functions. arXiv https://arxiv.org/abs/2308.03296 (2023).
Benary, M. et al. Leveraging large language models for decision support in personalized oncology. JAMA Network Open 6, e2343689–e2343689 (2023).
Zhou, L. et al. Larger and more instructable language models become less reliable. Nature 634, 61–68 (2024).
Kaplan, J. et al. Scaling laws for neural language models. arXiv https://arxiv.org/abs/2001.08361 (2020).
Jin, Q. et al. Matching patients to clinical trials with large language models. Nat. Commun. 15, 9074 (2023).
Jin, D. et al. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
Pal, A., Umapathi, L. K. & Sankarasubbu, M. Medmcqa: a large-scale multi-subject multi-choice dataset for medical domain question answering. arXiv https://arxiv.org/abs/2203.14371 (2022).
Acikgoz, E. C. et al. Hippocrates: an open-source framework for advancing large language models in healthcare. arXiv https://arxiv.org/abs/2404.16621 (2024).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of gpt-4 on medical challenge problems. arXiv https://arxiv.org/abs/2303.13375 (2023).
Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv https://arxiv.org/abs/2311.16452 (2023).
Singhal, K. et al. Towards expert-level medical question answering with large language models. Nat. Med. 31, 943–950 (2023).
Bai, X., Wang, A., Sucholutsky, I. & Griffiths, T. L. Measuring implicit bias in explicitly unbiased large language models. arXiv https://arxiv.org/abs/2402.04105 (2024).
Yu, C. et al. Credit card fraud detection using advanced transformer model. arXiv https://arxiv.org/abs/2406.03733 (2024).
Dai, S. et al. Unifying bias and unfairness in information retrieval: a survey of challenges and opportunities with large language models. arXiv https://arxiv.org/abs/2404.11457 (2024).
Tu, T. et al. Towards conversational diagnostic AI. arXiv https://arxiv.org/abs/2401.05654 (2024).
Tu, T. et al. Towards generalist biomedical AI. NEJM AI 1, AIoa2300138 (2024).
Tanno, R. et al. Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation (2024).
Dash, D. et al. Evaluation of gpt-3.5 and gpt-4 for supporting real-world information needs in healthcare delivery. arXiv https://arxiv.org/abs/2304.13714 (2023).
Chuang, C.-Y., Robinson, J., Lin, Y.-C., Torralba, A. & Jegelka, S. Debiased contrastive learning. Adv. Neural Inf. Process Syst. 33, 8765–8775 (2020).
Tian, Y. et al. What makes for good views for contrastive learning? Adv. Neural Inf. Process Syst. 33, 6827–6839 (2020).
Rim, D. N., Heo, D. & Choi, H. Adversarial training with contrastive learning in nlp. arXiv https://arxiv.org/abs/2109.09075 (2021).
V, J. P., S, A. A. V., P, G. K. & N,K, K. A novel attention-based cross-modal transfer learning framework for predicting cardiovascular disease. Comput. Biol. Med. 170, 107977 (2024).
Prakash, V. J. & Vijay, S. A. A. A unified framework for analyzing textual context and intent in social media. ACM Trans. Intell. Syst. Technol.15, https://doi.org/10.1145/3682064 (2024).
Koopman, B. & Zuccon, G. A test collection for matching patients to clinical trials. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. https://api.semanticscholar.org/CorpusID:5630619 (2016).
Roberts, K., Demner-Fushman, D., Voorhees, E. M., Bedrick, S. & Hersh, W. R. Overview of the trec 2022 clinical trials track. https://trec.nist.gov/pubs/trec31/papers/Overview_trials.pdf (2022).
Team, G. et al. Gemini: a family of highly capable multimodal models. arXiv https://arxiv.org/abs/2312.11805 (2023).
Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. https://api.semanticscholar.org/CorpusID:268232499 (2024).
Velupillai, S. et al. Using clinical natural language processing for health outcomes research: overview and actionable suggestions for future advances. J. Biomed. Inform. 88, 11–19 (2018).
Nazi, Z. A. & Peng, W. Large language models in healthcare and medical domain: a review. arXiv https://arxiv.org/abs/2401.06775 (2024).
Tavabi, N., Singh, M., Pruneski, J. & Kiapour, A. M. Systematic evaluation of common natural language processing techniques to codify clinical notes. Plos One 19, e0298892 (2024).
Li, J. & Li, G. The triangular trade-off between robustness, accuracy and fairness in deep neural networks: a survey. ACM Comput. Surv. 57, 6 (2024).
Kirchdorfer, L. et al. Analytical uncertainty-based loss weighting in multi-task learning. arXiv https://arxiv.org/abs/2408.07985 (2024).
Wu, H., Li, B., Tian, L., Feng, J. & Dong, C. An adaptive loss weighting multi-task network with attention-guide proposal generation for small size defect inspection. Vis. Comput. 40, 681–698 (2024).
Kundi, B., El Morr, C., Gorman, R. & Dua, E. Artificial intelligence and bias: a scoping review. AI Soc. 10, 510 (2023).
Ferrara, E. Fairness and bias in artificial intelligence: A brief survey of sources, impacts, and mitigation strategies. Sci 6, 3 (2023).
Chen, R. J. et al. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat. Biomed. Eng. 7, 719–742 (2023).
Timmons, A. C. et al. A call to action on assessing and mitigating bias in artificial intelligence applications for mental health. Perspect. Psychol. Sci. 18, 1062–1096 (2023).
Polevikov, S. Advancing AI in healthcare: a comprehensive review of best practices. Clin Chim. Acta 548, 117519 (2023).
Romano, Y., Bates, S. & Candes, E. Achieving equalized odds by resampling sensitive attributes. Adv. Neural Inform. Process. Syst. 33, 361–371 (2020).
Roelofs, R., Cain, N., Shlens, J. & Mozer, M. C. Mitigating bias in calibration error estimation. arXiv https://arxiv.org/abs/2012.08668 (2022).
Xiao, T., Wang, X., Efros, A. A. & Darrell, T. What should not be contrastive in contrastive learning. arXiv https://arxiv.org/abs/2008.05659 (2020).
Acknowledgements
The research reported in this article was partially supported by the National Institutes of Health awards UL1 TR001857, U24 TR004111, U01MH136020, and R01 LM014306. The sponsors had no role in study design, data collection, analysis, interpretation, report writing, or decision to submit the paper for publication. We would like to thank Qiao Jin, Yifan Yang, and Zhiyong Lu from the National Center for Biotechnology Information for their insightful explanations of the TrialGPT results, which greatly assisted our work.
Author information
Contributions
Y.J. conceptualized, designed, and organized this study, analyzed the results, and wrote, reviewed, and revised the paper. W.M. analyzed the results, and wrote, reviewed, and revised the paper. S.S., H.Z., E.M.S., Z.L., X.W., and S.V. wrote, reviewed, and revised the paper. Y.W. conceptualized, designed, and directed this study, wrote, reviewed, and revised the paper.
Ethics declarations
Competing interests
Y.W. has ownership and equity in BonafideNLP, LLC, and S.V. has ownership and equity in Kvatchii, Ltd., READE.ai, Inc., and ThetaRho, Inc. The other authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ji, Y., Ma, W., Sivarajkumar, S. et al. Mitigating the risk of health inequity exacerbated by large language models. npj Digit. Med. 8, 246 (2025). https://doi.org/10.1038/s41746-025-01576-4