Introduction

Large language models (LLMs)1,2,3,4,5,6 have demonstrated significant promise across a range of medical applications. Models such as GPT-4 can process vast amounts of text and generate human-like responses, summaries, and contextually relevant insights, a capability that can advance both patient care and medical research. LLMs are particularly valuable in tasks such as clinical trial matching and medical question answering (MQA), which are crucial for translational research and clinical decision support, respectively. These applications underscore the transformative role LLMs can play in improving healthcare outcomes and streamlining research efforts.

However, despite these impressive capabilities, LLMs may exacerbate persistent healthcare inequities worldwide. In many clinical settings, especially in low-resource environments, biased decision-making can further widen disparities in treatment and access to care. This urgent challenge calls for artificial intelligence (AI) systems that are not only powerful but also fair and unbiased. To address this, we propose EquityGuard, a novel framework that employs contrastive learning to actively mitigate bias in LLM outputs. In this study, we validate EquityGuard on two primary medical tasks: clinical trial matching and medical question answering.

Clinical trial matching (CTM), an essential process for accelerating translational research, involves identifying and pairing patients with appropriate clinical trials based on complex eligibility criteria derived from patient medical records and trial protocols7. Although LLMs offer transformative solutions by automating this process, they can inadvertently propagate bias, leading to the systematic exclusion of certain demographic groups from clinical trials.

Similarly, medical question-answering (MQA) systems powered by LLMs8,9,10,11,12,13,14 hold great potential for enhancing clinical decision support by integrating diverse sources such as clinical guidelines, research papers, and patient-specific information. Yet, biased outputs in MQA tasks may lead to misinformation and disproportionately affect underrepresented communities.

Our evaluation includes state-of-the-art models such as GPT-4 as well as the latest releases, including Gemini and Claude (2024 versions). Although these models demonstrate remarkable performance improvements, they still inherit biases from their training data.

In this study, we aim to address two key research questions:

  • RQ1: To what extent do LLMs exhibit inequities across two major medical applications, i.e., CTM and MQA tasks?

  • RQ2: What techniques can be applied to mitigate inequities when applying LLMs in medical applications, and how effective are they in promoting health equity?

Understanding how inequities manifest across healthcare tasks is essential to address these issues. Previous research has identified several sources of inequity, including inherent biases in training data, underrepresentation of certain groups, and algorithmic design flaws15,16,17. However, there remains a need for focused investigations into how these inequities affect specific healthcare tasks, such as CTM and MQA. This paper aims to fill that gap by identifying and mitigating inequities in these applications. Two examples are illustrated in Fig. 1.

Fig. 1: Inequities when applying LLMs to two major medical applications.
figure 1

Clinical Trial Matching (left) and Medical Question Answering (right). On the left, including race and sex information (e.g., “African-American” and “woman”) in the patient note, despite being irrelevant to matching the correct clinical trials, resulted in altered clinical trial recommendations generated by the LLMs. On the right, adding race information (e.g., “Native American”) to the question, which should not affect the response, led to incorrect answers from the LLMs. These examples show that non-decisive socio-demographic factors in different patient populations can lead to incorrect outputs from LLMs, which may result in harmful clinical outcomes for these patient populations and ultimately exacerbate healthcare inequities.

The proposed EquityGuard framework is based on contrastive learning and can systematically evaluate and mitigate inequities in LLMs18,19,20,21. EquityGuard uses contrastive learning techniques22,23,24 to disentangle socio-demographic determinants of health (SDOH) factors from task-related embeddings, ensuring that these attributes do not unduly influence model predictions. Through a series of experiments, we show that EquityGuard can enhance equity in LLMs for medical applications, specifically CTM and MQA tasks. EquityGuard is designed to be adaptable across diverse healthcare settings, including low-resource environments25,26, thereby mitigating bias even when clinical data are scarce and promoting equitable outcomes in both CTM and MQA tasks.

Results

Our experiments focused on examining how race, sex, and SDOH factors (including low income, LGBT+, homeless, illiteracy, disabled, and unemployed) influence the outputs of LLMs and potentially introduce inequity and inaccuracy. To address these issues, we proposed the EquityGuard framework, which leverages contrastive learning to mitigate the effects of irrelevant SDOH attributes by aligning embeddings of similar inputs. This approach aims to improve the fairness of LLM outputs by reducing the influence of sensitive demographic factors.

We evaluated the models on five datasets across two key medical applications: CTM and MQA tasks. The CTM datasets include SIGIR 201627, TREC 2021, and TREC 202228, while the MQA datasets are MedQA8 and MedMCQA9. We added specific terms for race, sex, and each SDOH category to the inputs of different LLMs, in the same way as illustrated in Fig. 1, and examined the outputs. We evaluated four LLMs: GPT-4, GPT-4o Mini, Gemini (Gemini 1.5 Flash)29, and Claude (specifically, Claude-3-5-sonnet-20240620)30. For the EquityGuard implementation, we mainly used open-source LLMs, namely LLaMA3 8B and Mistral v0.3, and compared them with GPT-4 as a baseline. More details about the EquityGuard framework and approach used in this study can be found in the Methods section.
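As a concrete illustration of this perturbation step, the sketch below inserts a single sensitive-attribute phrase into an otherwise unchanged query. The attribute lists and the insertion template are illustrative assumptions rather than the exact prompt wording used in our experiments.

```python
# Illustrative sketch: constructing perturbed evaluation inputs by adding one
# sensitive attribute to an otherwise unchanged query. Attribute phrases and
# the insertion template are assumptions for illustration only.

SENSITIVE_ATTRIBUTES = {
    "race": ["Black", "White", "Asian", "Hispanic", "Native American"],
    "sex": ["man", "woman"],
    "sdoh": ["low income", "LGBT+", "homeless", "illiterate", "disabled", "unemployed"],
}

def perturb_query(query: str, value: str) -> str:
    """Prepend a short patient-description sentence carrying one attribute."""
    return f"The patient is {value}. {query}"

def build_evaluation_set(query: str) -> dict[str, list[str]]:
    """Return one perturbed copy of the query per attribute value, by category."""
    return {
        category: [perturb_query(query, value) for value in values]
        for category, values in SENSITIVE_ATTRIBUTES.items()
    }

if __name__ == "__main__":
    base = "A 58-year-old patient with stage II colon cancer seeking adjuvant therapy trials."
    print(build_evaluation_set(base)["race"][0])
```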

Comparison of equity in LLMs

Figure 2 presents radar plots comparing the performance of the LLMs on CTM and MQA tasks when different SDOH factors are introduced into the dataset. Performance on the CTM task is measured using the Normalized Discounted Cumulative Gain at rank 10 (NDCG@10), with higher values indicating better performance. For the MQA task, error rates are used, with lower values indicating better performance.

Fig. 2: Performance of various LLMs when specific SDOH factors were introduced into the dataset.
figure 2

The clinical trial matching (CTM) performance is measured using NDCG@10 (higher is better), while the medical question answering (MQA) performance is measured using error rate (lower is better). SDOH factors include race, sex, low income, LGBT+ status, homelessness, illiteracy, disability, and unemployment. Each sensitive attribute was incorporated into the input data for both CTM and MQA tasks during the evaluation.

Among the evaluated models, GPT-4 consistently demonstrated the best overall performance across a variety of SDOH factors. In the CTM task, GPT-4 maintained relatively stable NDCG@10 scores, even when different SDOH factors were included in the input. For instance, GPT-4 performed particularly well across low-income, unemployed, and disabled groups, showing minimal variation in NDCG@10 scores compared to other models. Moreover, GPT-4 exhibited balanced performance across racial groups, including Black, White, and Hispanic, reflecting greater fairness in handling diverse queries (Fig. 2 left).

In contrast, both Gemini and Claude showed greater variability in performance across SDOH factors. These models experienced significant declines in NDCG@10 scores for Native American, Middle Eastern, and Pacific Islander groups, indicating poorer handling of less-represented racial categories. Furthermore, they exhibited noticeably higher error rates in the MQA task for queries involving homelessness, unemployment, and low income, revealing equity challenges with respect to SDOH factors.

While GPT-4 maintained lower error rates for most MQA categories, especially in sex and race-related queries, Gemini and Claude demonstrated a higher propensity for errors in underrepresented groups. For example, Gemini exhibited error rates as high as 0.31 for low income cases and 0.29 for homeless populations. These disparities highlight that, while GPT-4 offers better equity, Gemini and Claude are more prone to producing inequitable outputs for vulnerable groups, particularly in MQA tasks.

To quantify fairness, we used the equal opportunity (EO) and demographic parity (DP) metrics. GPT-4 again outperformed the other models, showing consistent results with higher EO and DP scores across different SDOH factors. Gemini and Claude, however, displayed greater disparities, particularly in their treatment of unemployed and low income groups, suggesting that these models struggle to maintain fairness across diverse populations (Fig. 3).

Fig. 3: Fairness metrics including equal opportunity (EO) and demographic parity (DP) to assess equity in LLMs.
figure 3

Higher EO and DP scores indicate better equity, with EO focusing on ensuring equal positive outcomes for qualified individuals across groups and DP evaluating overall equity across all groups.

Fairness and correlation analysis

To further investigate the inequities observed in the LLM models, we conducted a correlation analysis between different categories. We evaluated whether certain race, sex and SDOH factors tend to produce similar biased outcomes, which is critical for understanding systemic biases and improving fairness across categories.

For each pair of inequity categories, we calculated the correlation based on the following criteria (a minimal computational sketch is given after the list):

  • If both categories result in the same wrong answer or wrong rerank order, a correlation of +1 is assigned.

  • If one category is correct while the other is wrong, a correlation of -1 is given.

  • If both categories either get the same correct answer or the right rerank order, a correlation of 0 is assigned.
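The following minimal sketch implements the scoring criteria above and averages the per-query scores into a pairwise correlation; the averaging step and the treatment of two distinct wrong answers (scored as 0) are assumptions of this illustration.

```python
# Minimal sketch of the pairwise correlation score described above. For each
# query, each category's perturbed output is compared against the gold answer
# (or reference ranking); per-query scores are then averaged over all queries.
from itertools import combinations

def pair_score(ans_a: str, ans_b: str, gold: str) -> int:
    """Score one query for a pair of inequity categories."""
    wrong_a, wrong_b = ans_a != gold, ans_b != gold
    if wrong_a and wrong_b and ans_a == ans_b:
        return 1    # same wrong answer (or same wrong rerank order)
    if wrong_a != wrong_b:
        return -1   # one category correct, the other wrong
    return 0        # both correct; two distinct wrong answers also scored 0 (assumption)

def correlation(answers: dict[str, list[str]], gold: list[str]) -> dict[tuple[str, str], float]:
    """answers maps each inequity category to its per-query outputs (same query order)."""
    return {
        (a, b): sum(pair_score(x, y, g) for x, y, g in zip(answers[a], answers[b], gold)) / len(gold)
        for a, b in combinations(answers, 2)
    }

if __name__ == "__main__":
    gold = ["B", "A", "C", "D"]
    answers = {
        "Black":      ["C", "A", "C", "D"],
        "low income": ["C", "A", "C", "D"],
        "Hispanic":   ["B", "A", "A", "D"],
    }
    print(correlation(answers, gold))
    # {('Black', 'low income'): 0.25, ('Black', 'Hispanic'): -0.5, ('low income', 'Hispanic'): -0.5}
```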

In the CTM task (Fig. 4, left), several race, sex and SDOH factors exhibited strong correlations, revealing compounded inequity patterns. The Black and Pacific Islander categories displayed a high correlation coefficient of 0.5, suggesting that model decisions were consistently similar for these groups. Additionally, the socioeconomic factors unemployed and low income showed a notable correlation of 0.25, indicating that inequities related to SDOH heavily influence model outputs in the CTM task.

Fig. 4: Correlation heatmaps of inequity categories in CTM and MQA tasks.
figure 4

The left plot shows the correlation between inequity categories in CTM tasks, illustrating how different inequity-modified queries resulted in similar trial rankings or selections by the models. The right plot shows the correlation between inequity categories in MQA tasks, displaying how often different inequity-modified queries led to the same answers or error patterns. These heatmaps help analyze how inequities across categories are interconnected, impacting model fairness.

Other factors, such as low income and Black, demonstrated moderate correlations (0.25), pointing to shared inequities between economic disadvantage and racial categories. Conversely, Hispanic and low income showed a negative correlation (−0.26), highlighting disparities in how the model treats these categories. The unemployed and Mixed Race categories showed a weaker positive correlation of 0.2, indicating less interconnection between these inequities compared to others.

In the MQA task (Fig. 4, right), similar trends were observed. Strong correlations were found between race and SDOH factors, and the unemployed category was also closely related to the disability category, with correlation values exceeding 0.17. This suggests that inequities associated with SDOH factors tend to align with one another and with racial inequities, further entrenching model biases when answering medical questions.

Interestingly, the low income category showed negative correlations with all other categories. This suggests that, in this particular task, the model treats low income as a distinct factor not strongly linked to other SDOH factors or demographic attributes. One possible explanation is that the task might focus more on clinical or health-related issues, and SDOH factors like income level may not be as directly relevant. Consequently, the model pays less attention to low-income as a critical feature in this context, leading to these lower correlations.

Inequity mitigation in CTM

We evaluated the effectiveness of the EquityGuard framework in mitigating inequities in LLMs for the CTM task by analyzing model performance across race, sex, and SDOH factors. The models assessed were LLaMA3 8B and Mistral v0.3, each with and without the application of EquityGuard (denoted as w/ EquityGuard and w/o EquityGuard, respectively). GPT-4 was included as a baseline model.

Tables 1 and 2 present the NDCG@10 scores across different race and sex categories. Models trained with EquityGuard exhibited more uniform performance across race and sex compared to their counterparts without EquityGuard. For instance, LLaMA3 8B w/ EquityGuard maintained NDCG@10 scores around 70% across all categories, whereas LLaMA3 8B w/o EquityGuard showed greater variability, with scores ranging from 67.7% (Native American) to 72.6% (Asian). The performance disparities between these groups indicate inequities in models without EquityGuard.

Table 1 Performance comparison across race and sex categories in CTM task
Table 2 Performance comparison across additional race categories

Table 3 extends the analysis to SDOH factors, including LGBT+, low income, unemployed, disabled, illiterate and homeless. Models with EquityGuard displayed more consistent performance across these categories. For example, LLaMA3 8B w/ EquityGuard achieved higher NDCG@10 scores in the low income (89.8%) and unemployed (87.4%) categories compared to w/o EquityGuard (81.3% and 83.4%, respectively). This improvement suggests that EquityGuard enhances fairness in CTM by mitigating inequities associated with SDOH factors.

Table 3 NDCG@10 score comparison across SDOH categories for CTM task

Inequity mitigation in MQA

We further assessed EquityGuard’s impact on the MQA task using the MedQA and MedMCQA datasets. Error rates across SDOH categories are presented in Tables 4 and 5. Implementing EquityGuard led to a noticeable reduction in error rates across all race, sex and SDOH categories. For instance, LLaMA3 8B w/ EquityGuard achieved an average error rate of 19.8%, compared to 21.2% for w/o EquityGuard, an absolute reduction of approximately 1.4 percentage points.

Table 4 Error rate comparison across race and sex categories in MQA task
Table 5 Error rate comparison across additional race categories in MQA task

Notably, the reduction in error rates was significantly greater in categories that initially exhibited higher inequities. In the Black category, LLaMA3 8B’s error rate decreased from 22.0% (w/o EquityGuard) to 20.3% (w/ EquityGuard). Mistral v0.3 showed similar improvements, with error rates decreasing from 21.4% to 19.5% in the Black category after applying EquityGuard.

Table 6 presents error rates across SDOH categories for the MQA task. EquityGuard effectively reduced error rates in categories such as LGBT+, low-income, and unemployed. For LLaMA3 8B, the error rate in the low-income category decreased from 18.4% (w/o EquityGuard) to 12.7% (w/ EquityGuard). This significant reduction highlights EquityGuard’s capability to mitigate inequities associated with SDOH factors.

Table 6 Error rate comparison across SDOH categories for MQA task

Enhanced fairness metrics

To quantify the fairness improvements achieved by EquityGuard, we calculated the EO and DP differences for the LLaMA3 8B models (Fig. 5). Models with EquityGuard (w/ EquityGuard) demonstrated reduced EO and DP differences across SDOH factors, indicating enhanced fairness. Specifically, the EO difference decreased by an average of 28%, and the DP difference decreased by approximately 32% compared to the models without EquityGuard.

Fig. 5: Equal opportunity (EO) and demographic parity (DP) metrics for LLaMA3 8B models.
figure 5

Models trained with EquityGuard (w/ EquityGuard) show reduced EO and DP differences, indicating enhanced fairness.

Overall impact of EquityGuard

Our results demonstrate that EquityGuard significantly mitigates inequities in LLMs across both CTM and MQA tasks. Key observations include:

  • Uniform performance across demographics: Models with EquityGuard provided more consistent NDCG@10 scores and lower error rates across all race, sex and SDOH categories, indicating reduced inequity.

  • Improved fairness metrics: Enhanced EO and DP scores affirm that EquityGuard promotes equitable model behavior, ensuring that sensitive demographic factors do not disproportionately influence predictions.

Overall, the application of EquityGuard contributes to more fair and equitable decision-making processes in healthcare AI systems by minimizing the influence of sensitive attributes on model outputs. This is critical for addressing health disparities and ensuring equitable healthcare delivery.

Discussion

LLMs are typically trained on vast amounts of publicly available text data drawn from the internet, books, social media, and other human-generated content. Because this data reflects societal norms, perspectives, and historical narratives, it inevitably contains biases—particularly those that marginalize or misrepresent disadvantaged populations. As a result, LLMs can internalize and replicate these biases in the way they generate text, make predictions, or support decision-making. When such biased models are deployed in health-related tasks—such as clinical decision support, clinical trial matching, patient education, or triage—they risk propagating and even amplifying existing disparities. This may lead to skewed outcomes that disproportionately affect already underserved communities, ultimately exacerbating health inequities rather than alleviating them. In this study, we examined and confirmed the existence of such potential inequities using existing LLMs. We then introduced EquityGuard, a framework employing contrastive learning to mitigate inequities in LLMs applied to healthcare tasks, specifically CTM and MQA. By reducing the undue influence of race, sex and SDOH factors, EquityGuard enhances fairness in LLM outputs. Our experiments across five datasets demonstrated that even advanced models like GPT-4, Claude, and Gemini are susceptible to inequities, which EquityGuard effectively mitigated, leading to more equitable outcomes in both tasks.

Despite these promising results, there are limitations to our approach. One significant challenge lies in accurately identifying and processing social demographic determinant factors within the datasets. While we utilized Bio_ClinicalBERT for named entity recognition and developed a custom pipeline to enhance accuracy, the detection of these factors is not foolproof. Misidentification or omission can adversely affect the effectiveness of bias mitigation. Future work could explore more advanced methods for detecting SDOH factors, possibly incorporating additional context or leveraging unsupervised learning techniques31,32,33.

Another limitation pertains to balancing bias mitigation with task performance. While EquityGuard significantly improves fairness metrics, models such as LLaMA3 8B and Mistral v0.3 exhibit a slight performance decrease compared to GPT-434. This can be attributed to the fact that GPT-4, with its substantially larger number of parameters and more sophisticated architecture, inherently possesses greater capacity for handling complex tasks. In contrast, the smaller models experience a trade-off between bias mitigation and task accuracy when the additional contrastive loss is incorporated; the added loss component may thus affect overall task performance. Future work will explore parameter-efficient techniques, adaptive strategies, or alternative loss functions that more effectively balance fairness and performance35,36.

Finally, LLMs inherently exhibit stochastic behavior. Although we report the average performance along with the standard deviations, the variability in outputs may pose risks in healthcare applications, such as inconsistent clinical recommendations. Future work will investigate ensemble methods and post-processing techniques to further mitigate such randomness. Note that this study focuses exclusively on text-based clinical tasks (CTM and MQA) and does not address bias mitigation in medical imaging, a topic to be explored in future work.

Furthermore, our study focused on a limited set of SDOH factors. The complex nature of biases in healthcare suggests that other factors, such as age, or the intersectionality between attributes, could also contribute to biased outcomes37,38,39,40. Expanding EquityGuard to account for a broader range of factors would enhance its applicability and robustness in real-world settings. The evaluation metrics used, DP and EO, provide insights into the models’ fairness but may not capture all dimensions relevant in healthcare contexts38,39,41. Although EquityGuard exhibits strong bias mitigation capabilities, its deployment at scale within large healthcare systems entails significant challenges related to computational efficiency and ethical oversight. Future work could incorporate additional fairness metrics, such as Equalized Odds or calibration error, to provide a more comprehensive assessment of model fairness42,43.

In conclusion, while EquityGuard shows promise in mitigating inequities in LLMs for CTM and MQA tasks, addressing its current limitations is crucial for the advancement of AI-driven healthcare systems that are not only effective but also equitable. Future work will focus on enhancing social demographic determinant detection, refining inequity mitigation strategies, expanding the range of considered inequities, and exploring additional fairness metrics. By advancing these areas, we aim to contribute to the development of AI models that support fair, transparent, and equitable decision-making in healthcare, ultimately fostering more inclusive and trustworthy technologies for diverse patient populations.

Methods

Overview

We proposed EquityGuard, a contrastive learning-based framework designed to mitigate inequities in LLMs applied to healthcare tasks22,44. Contrastive learning is a self-supervised machine learning technique that aims to learn effective data representations by contrasting positive and negative pairs of samples. The core idea is to map similar data points closer together in the feature space while pushing dissimilar ones further apart. Specifically, we focus on two tasks: CTM and MQA. EquityGuard aims to reduce the influence of race, sex, and SDOH factors on model predictions by aligning embeddings through contrastive learning targeted at biased inputs.

Data processing

Our experiments evaluated EquityGuard on two tasks, CTM and MQA, using five datasets. For CTM, we used the SIGIR 2016, TREC 2021, and TREC 2022 datasets, which provide clinical trial descriptions and patient case reports. For MQA, we employed the MedQA and MedMCQA datasets, which contain complex medical questions and corresponding answer options.

To assess the impact of sex, race, and SDOH factors on model predictions, we applied Bio_ClinicalBERT for named entity recognition and collaborated with medical experts to filter out topics explicitly related to these factors. Detailed data distributions are provided in Tables 7 and 8. Additionally, we generated counterfactual queries by systematically altering these factors (e.g., changing “Black” to “White”) to enable controlled contrastive learning and bias evaluation.

Table 7 Race composition across datasets (count and percentage)
Table 8 Sex composition across datasets
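As an illustration of the counterfactual construction described above, the sketch below swaps detected race terms for alternatives. The lexicon-based detector stands in for the Bio_ClinicalBERT NER step, and the term list and regular-expression matching are assumptions made for this example.

```python
# Illustrative counterfactual query generation. The lexicon lookup stands in
# for the Bio_ClinicalBERT-based NER step; the term list is an assumption.
import re

RACE_TERMS = ["Black", "White", "Asian", "Hispanic", "Native American"]

def detect_race_terms(text: str) -> list[str]:
    """Return race terms mentioned in the text (stand-in for the NER step)."""
    return [t for t in RACE_TERMS if re.search(rf"\b{re.escape(t)}\b", text)]

def counterfactuals(text: str) -> list[str]:
    """Swap each detected race term for every alternative term."""
    variants = []
    for term in detect_race_terms(text):
        for alt in RACE_TERMS:
            if alt != term:
                variants.append(re.sub(rf"\b{re.escape(term)}\b", alt, text))
    return variants

if __name__ == "__main__":
    note = "A 45-year-old Black woman with type 2 diabetes and hypertension."
    for variant in counterfactuals(note):
        print(variant)
```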

EquityGuard framework

EquityGuard employs contrastive learning to minimize the influence of social demographic determinant factors on model outputs by targeting biased inputs. For each query, we construct triplets consisting of an anchor, a positive sample, and a negative sample:

  • Anchor (xanchor): The original query without race, sex and SDOH factors (neutral version).

  • Positive (xpos): A query that includes a race, sex and SDOH factor, differing minimally from the anchor.

  • Negative (xneg): A query that includes additional or different factors compared to the anchor.

The goal is to align the model’s embeddings such that the anchor and positive samples (which share the same medical context) are close in the embedding space, while the negative sample (which introduces additional inequities) is farther away. Figure 6 illustrates the overall process of EquityGuard, where contrastive learning is employed to align the embeddings of the original (anchor) and minimally perturbed (positive) queries while separating those with additional bias (negative). This approach actively mitigates the influence of socio-demographic factors on the final model predictions.

Fig. 6
figure 6

An overview of the EquityGuard framework for inequity detection and correction.

Table 9 provides examples of anchor, positive, and negative samples used in the contrastive learning process.

Table 9 Examples of anchor, positive, and negative samples in contrastive learning for inequity mitigation
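Complementing these examples, a minimal sketch of triplet construction under the definitions above is given below; the attribute phrases, the insertion template, and the use of two attributes in the negative sample are illustrative assumptions.

```python
# Illustrative triplet construction following the anchor/positive/negative
# definitions above. Attribute phrases and the insertion template are assumptions.
from dataclasses import dataclass
import random

ATTRIBUTES = ["Black", "Native American", "low income", "homeless", "unemployed"]

@dataclass
class Triplet:
    anchor: str    # neutral query, no sensitive attributes
    positive: str  # same query with one sensitive attribute added
    negative: str  # same query with additional/different attributes added

def make_triplet(neutral_query: str, rng: random.Random) -> Triplet:
    attr_pos = rng.choice(ATTRIBUTES)
    attr_neg = rng.sample([a for a in ATTRIBUTES if a != attr_pos], k=2)
    return Triplet(
        anchor=neutral_query,
        positive=f"The patient is {attr_pos}. {neutral_query}",
        negative=f"The patient is {attr_neg[0]} and {attr_neg[1]}. {neutral_query}",
    )

if __name__ == "__main__":
    rng = random.Random(0)
    triplet = make_triplet("A 62-year-old patient with stage III NSCLC seeking immunotherapy trials.", rng)
    print(triplet.anchor, triplet.positive, triplet.negative, sep="\n")
```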

Model architecture and training

We build upon the LLaMA model, extending it to handle both ranking and question-answering tasks. The model shares a transformer-based backbone and is adapted for each task:

Clinical Trial Matching (Ranking Task): The model encodes patient notes and trial eligibility criteria into embeddings. We compute a relevance score between the patient note and each trial using a scoring function and rank the trials accordingly. The objective is to maximize the NDCG@10 metric while minimizing inequity.

Medical Question Answering (Classification Task): The model encodes medical questions and predicts the correct answer choice. We use cross-entropy loss for training, aiming to minimize the error rate while reducing inequity.
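A minimal sketch of this shared-backbone design is shown below. It assumes a Hugging Face-style encoder exposing last_hidden_state, mean pooling into sentence embeddings, cosine similarity as the CTM scoring function, and a simplified fixed-choice classification head for MQA; these specifics are assumptions for illustration rather than the exact implementation.

```python
# Minimal sketch of the shared-backbone design: one encoder produces embeddings,
# a scoring function ranks trials (CTM), and a classification head predicts the
# answer choice (MQA). Pooling, cosine scoring, and head shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EquityGuardHeads(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int, num_choices: int = 4):
        super().__init__()
        self.backbone = backbone                                # shared transformer encoder
        self.classifier = nn.Linear(hidden_size, num_choices)   # MQA answer head

    def encode(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        """Mean-pool the encoder's last hidden states into a sentence embedding."""
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

    def rank_score(self, patient_emb: torch.Tensor, trial_emb: torch.Tensor) -> torch.Tensor:
        """CTM relevance score: cosine similarity between note and trial embeddings."""
        return F.cosine_similarity(patient_emb, trial_emb, dim=-1)

    def answer_logits(self, question_emb: torch.Tensor) -> torch.Tensor:
        """MQA answer-choice logits, trained with cross-entropy."""
        return self.classifier(question_emb)
```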

The overall loss function \({\mathcal{L}}\) combines the task-specific loss \({{\mathcal{L}}}_{{\rm{task}}}\) and the contrastive loss \({{\mathcal{L}}}_{{\rm{contrastive}}}\) for inequity mitigation:

$${\mathcal{L}}={{\mathcal{L}}}_{{\rm{task}}}+\lambda {{\mathcal{L}}}_{{\rm{contrastive}}}$$
(1)

where λ is a hyperparameter controlling the trade-off between task performance and inequity mitigation.

For the contrastive loss, we use the triplet loss function:

$${{\mathcal{L}}}_{{\rm{contrastive}}}=\mathop{\sum }\limits_{i=1}^{N}\max \left(0,m+d\left(f({x}_{{\rm{anchor}}}^{(i)}),f({x}_{{\rm{pos}}}^{(i)})\right)-d\left(f({x}_{{\rm{anchor}}}^{(i)}),f({x}_{{\rm{neg}}}^{(i)})\right)\right)$$
(2)

where d(·,·) is a distance metric (e.g., cosine distance), m is the margin, f(·) is the embedding function, and N is the number of triplets.
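The sketch below expresses Eqs. (1) and (2) in code, using cosine distance for d(·,·); averaging over the batch instead of summing over N, and the specific tensor interfaces, are implementation assumptions.

```python
# Sketch of the combined objective in Eqs. (1)-(2), using cosine distance for d.
# The batch mean (rather than the sum over N) is a scaling assumption.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                     negative: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Triplet loss of Eq. (2) with d(u, v) = 1 - cos(u, v)."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative, dim=-1)
    return torch.clamp(margin + d_pos - d_neg, min=0.0).mean()

def total_loss(task_loss: torch.Tensor, anchor: torch.Tensor, positive: torch.Tensor,
               negative: torch.Tensor, lam: float = 0.1, margin: float = 1.0) -> torch.Tensor:
    """Eq. (1): L = L_task + lambda * L_contrastive."""
    return task_loss + lam * contrastive_loss(anchor, positive, negative, margin)
```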

We trained the models using the Adam optimizer with a learning rate of 1 × 10−5. The hyperparameter λ was set to 0.1, and the margin m was set to 1.0, tuned on a validation set. The training was conducted on four NVIDIA V100 GPUs with 32GB memory each. We repeated all experiments five times with different random seeds and report the average performance along with the standard deviation to quantify the stability of the model outputs. We performed a sensitivity analysis for the contrastive loss weight λ over the range 0.05 to 0.20. The experimental results (see Table 10) indicate that λ = 0.10 achieves the best trade-off between task performance and fairness improvement.

Table 10 Sensitivity analysis for the contrastive loss weight λ
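A compact training-loop sketch under the reported settings is given below. It reuses the encode and answer_logits interfaces and the contrastive_loss function from the preceding sketches; the batch layout and the cross-entropy task loss (shown for the MQA case) are assumptions for illustration.

```python
# Compact training-loop sketch under the reported settings (Adam, lr 1e-5,
# lambda = 0.1, margin = 1.0, five seeds). It reuses encode/answer_logits and
# contrastive_loss from the preceding sketches; the batch layout is assumed.
import torch
import torch.nn.functional as F

LAMBDA, MARGIN, SEEDS = 0.1, 1.0, [0, 1, 2, 3, 4]

def train_one_seed(model, train_loader, seed: int, epochs: int = 3) -> None:
    torch.manual_seed(seed)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    model.train()
    for _ in range(epochs):
        for batch in train_loader:
            emb_a = model.encode(batch["anchor_ids"], batch["anchor_mask"])
            emb_p = model.encode(batch["pos_ids"], batch["pos_mask"])
            emb_n = model.encode(batch["neg_ids"], batch["neg_mask"])
            task = F.cross_entropy(model.answer_logits(emb_a), batch["labels"])  # MQA case
            loss = task + LAMBDA * contrastive_loss(emb_a, emb_p, emb_n, MARGIN)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```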

Evaluation

To evaluate the effectiveness of EquityGuard, we measured both task performance and fairness metrics. For CTM, we used the Normalized Discounted Cumulative Gain at rank 10 (NDCG@10) to evaluate the ranking quality. For MQA, we used the error rate to assess the accuracy of the model in answering questions.
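The two task metrics can be sketched as follows; the DCG variant with linear gains and a log2(rank + 1) discount is one common formulation and is an assumption of this illustration.

```python
# Evaluation metric sketches: NDCG@10 for CTM ranking and error rate for MQA.
# The DCG variant with linear gains and a log2(rank + 1) discount is an assumption.
import math

def ndcg_at_k(relevances_in_ranked_order: list[float], k: int = 10) -> float:
    """NDCG@k for one query; inputs are graded relevance judgments in model-ranked order."""
    def dcg(rels: list[float]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances_in_ranked_order, reverse=True))
    return dcg(relevances_in_ranked_order) / ideal if ideal > 0 else 0.0

def error_rate(predictions: list[str], gold: list[str]) -> float:
    """Fraction of MQA questions answered incorrectly."""
    return sum(p != g for p, g in zip(predictions, gold)) / len(gold)

if __name__ == "__main__":
    print(round(ndcg_at_k([2, 0, 1, 2, 0]), 3))          # ranking quality for one query
    print(error_rate(["A", "C", "B"], ["A", "B", "B"]))  # 1 of 3 wrong -> 0.333...
```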

To assess the models’ fairness, we computed two metrics (a minimal computational sketch is given after the list):

  • Demographic parity (DP): Measures the difference in the probability of positive outcomes across different demographic groups.

  • Equal opportunity (EO): Measures the difference in true positive rates across different demographic groups.
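A minimal computation of these two metrics is sketched below; summarizing the per-group rates via the max-min difference is an assumption about how the group-level values are aggregated.

```python
# Minimal sketch of the two fairness metrics, summarized as the spread between
# the most- and least-favored groups (a max-min aggregation assumption).

def demographic_parity_difference(positive_rate_by_group: dict[str, float]) -> float:
    """DP difference: spread in the probability of a positive outcome across groups."""
    rates = positive_rate_by_group.values()
    return max(rates) - min(rates)

def equal_opportunity_difference(tpr_by_group: dict[str, float]) -> float:
    """EO difference: spread in the true positive rate across groups."""
    rates = tpr_by_group.values()
    return max(rates) - min(rates)

if __name__ == "__main__":
    print(demographic_parity_difference({"Black": 0.62, "White": 0.68, "Hispanic": 0.60}))  # ~0.08
    print(equal_opportunity_difference({"Black": 0.81, "White": 0.86}))                     # ~0.05
```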

We compared EquityGuard with several baseline models: LLaMA3 8B without inequity mitigation, Mistral v0.3 without inequity mitigation, and GPT-4 (a state-of-the-art LLM without explicit inequity mitigation). We also included versions of LLaMA3 8B and Mistral v0.3 with EquityGuard applied to assess the effectiveness of our proposed method. This approach promotes fairness and equity in healthcare applications by mitigating inequities in LLM predictions. The performance of the teacher model and the API cost are provided in Supplementary Tables 1–7.