Introduction

In recent years, deep learning (DL)-based algorithms have demonstrated exceptional performance across various medical applications, including classification, segmentation, and detection1. These advancements in digital healthcare technologies hold significant potential for promoting health equity, particularly in low-income countries or resource-limited regions. By leveraging high-quality pre-trained model weights optimized with extensive medical data from diverse organizations and sites, DL-based solutions can provide accessible and efficient diagnostic tools to underserved populations.

Ultrasound imaging is a widely used medical imaging modality that utilizes high-frequency sound waves to generate real-time images of internal organs, tissues, and blood flow2. Its non-invasive nature, portability, and cost-effectiveness make it an especially convenient diagnostic tool, particularly for point-of-care and bedside applications. The general workflow for ultrasound diagnosis is illustrated in Fig. 1. After acquiring ultrasound images, physicians typically identify and annotate the lesion area, calculate radiologic characteristics such as lesion area and length, and then provide a final diagnosis. In this process, DL models play a critical role in automating lesion segmentation. This automation is often achieved through models pre-trained on large-scale ultrasound imaging datasets3.

Fig. 1: General pipeline of ultrasound image diagnosis.

It usually consists of four stages: image acquisition, lesion segmentation, radiologic parameters extraction, and diagnosis.

Although numerous studies have demonstrated that the use of pre-trained weights can significantly enhance the utility of DL models for ultrasound image diagnosis, recent research has revealed that these pre-trained models often inherit biases from their training datasets, leading to unfair performance in downstream tasks4. This presents significant challenges to achieving health equity. For instance, Jin et al.5 observed that their pre-trained MedicalMAE model exhibited substantial utility disparities between male and female groups in pleural effusion diagnosis using chest X-ray images.

This is a critical issue that demands attention, as model unfairness undermines patients’ rights to equitable treatment, diminishes the reliability and trustworthiness of DL models, and poses severe risks to privacy preservation and health equity6,7. In essence, algorithmic fairness is the principle that DL models should deliver comparable outcomes across diverse patient demographics, particularly those involving sensitive attributes such as age, sex, and race.

Currently, researchers aim to mitigate unfairness in DL-based healthcare applications through three primary approaches: 1) Pre-processing, which involves modifying the input data distribution using sampling strategies8 or generative adversarial networks (GANs)9; 2) In-processing, which incorporates fairness constraints10 or employs adversarial network architectures11; and 3) Post-processing, which adjusts the output logits for different subgroups10 or prunes model parameters12. However, these methods often require re-training or fine-tuning model weights, which is both resource-intensive and time-consuming—particularly when addressing unfairness in large foundation models.

In this paper, we address the challenge of improving fairness in ultrasound image diagnosis, focusing specifically on lesion segmentation, under a stricter scenario where the pre-trained model weights remain fixed due to limited computational resources. Inspired by ref.13, which generates perturbations using GANs, we propose a novel approach called APPLE. APPLE achieves fairness by manipulating the latent embedding with protected-attribute-aware perturbations. Unlike methods that perturb the input image directly14, perturbing the latent space is more effective, as features in deeper layers carry higher-level semantic information and are more separable.

Specifically, APPLE comprises a generator that perturbs the latent embedding of the fixed segmentation encoder to obscure the sensitive attribute of the input image and a discriminator that attempts to identify the sensitive attribute from the perturbed latent embedding. This adversarial setup ensures that the learned perturbations enhance fairness without modifying the pre-trained model weights.

Results

Unfairness exists in ultrasound image segmentation

Accurate segmentation of lesion regions in ultrasound images is a critical first step in ultrasound image diagnosis, as it forms the foundation for subsequent analysis. A precise lesion mask provides abundant information, including the lesion’s surface area, aspect ratio, and sphericity. Conversely, biased segmentation directly impacts the computation of these variables, potentially leading to inaccuracies in diagnosis.

To evaluate segmentation precision, we utilize the Dice Similarity Coefficient (DSC)15, a widely used metric that quantifies the overlap between the ground truth mask and the predicted mask. The group-wise DSC results on two ultrasound datasets are shown in Fig. 2. The results reveal a consistent pattern: across all four deep learning models, the DSC scores for the Female group are higher than those for the Male group in both datasets. For the TUSC dataset, the Young group achieves slightly higher DSC scores than the Old group, whereas this trend reverses in the QDUS dataset.

Fig. 2: Subgroup performance on the two datasets.

All the results are evaluated on the respective test set. The U-Net and TransUnet are trained on the TUSC / QDUS training set. SAM and MedSAM are used as zero-shot segmentators with the ground truth bounding box as the prompt. a Result on TUSC-Sex; b Result on TUSC-Age; c Result on QDUS-Sex; d Result on QDUS-Age.

Interestingly, U-Net and TransUnet are trained on stratified-sampled training sets, while SAM and MedSAM operate as zero-shot segmentators. Although the SAM-based models exhibit less bias and significantly higher overall utility than the train-from-scratch models (i.e., U-Net and TransUnet), the direction of the bias remains consistent: DSCFemale > DSCMale on both datasets, and DSCYoung > DSCOld on the TUSC dataset but DSCYoung < DSCOld on the QDUS dataset. The lower bias and higher utility of the SAM-based models may stem from their use of the ground truth bounding box of the lesion as a prompt, which provides additional information that benefits segmentation.
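For reference, the sketch below shows how such a zero-shot prediction can be obtained with the publicly released segment-anything interface (SamPredictor), prompted by the ground truth bounding box. The checkpoint path, backbone choice, and box format are placeholders rather than the exact settings used in our experiments.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Checkpoint path and backbone choice are placeholders.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def zero_shot_segment(image_rgb: np.ndarray, gt_box_xyxy: np.ndarray) -> np.ndarray:
    """Segment one lesion with SAM, prompted by its ground-truth bounding box."""
    predictor.set_image(image_rgb)              # H x W x 3, uint8 image
    masks, scores, _ = predictor.predict(
        box=gt_box_xyxy,                        # [x0, y0, x1, y1]
        multimask_output=False,                 # one mask per prompt
    )
    return masks[0]                             # boolean H x W lesion mask
```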

These results highlight the persistent presence of unfairness in ultrasound image segmentation tasks, emphasizing the need to address this issue to ensure fair and reliable downstream diagnosis.

Downstream diagnosis inherits unfairness from predicted masks

In the general pipeline of ultrasound diagnosis, several radiomics parameters are extracted from both the ultrasound images and their corresponding predicted masks. These parameters are crucial for clinicians in determining whether a lesion is benign or malignant. However, disparities in lesion mask predictions between subgroups can propagate through to the computation of these parameters, leading to biased or unfair diagnoses.

To investigate this issue, we selected three commonly used parameters in ultrasound diagnosis—surface area, energy, and perimeter—and computed the absolute differences between values derived from the ground truth mask and those derived from the predicted mask using PyRadiomics16. The subgroup distributions for these parameters are visualized in the first row of Figs. 3 and 4.
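The snippet below sketches this computation, assuming PyRadiomics' RadiomicsFeatureExtractor interface; the exact feature names (e.g., treating MeshSurface as the 2-D surface area) follow PyRadiomics' naming convention and are assumptions rather than the precise configuration used in our experiments.

```python
from radiomics import featureextractor

# Feature names below follow PyRadiomics' naming convention and are assumptions.
PARAMS = ["original_shape2D_MeshSurface",   # surface area
          "original_shape2D_Perimeter",     # perimeter
          "original_firstorder_Energy"]     # energy

extractor = featureextractor.RadiomicsFeatureExtractor(force2D=True)
extractor.disableAllFeatures()
extractor.enableFeatureClassByName("shape2D")
extractor.enableFeatureClassByName("firstorder")

def radiomics_errors(image_path, gt_mask_path, pred_mask_path):
    """Absolute difference of each parameter computed from the GT mask vs. the predicted mask."""
    gt = extractor.execute(image_path, gt_mask_path)
    pred = extractor.execute(image_path, pred_mask_path)
    return {p: abs(float(gt[p]) - float(pred[p])) for p in PARAMS}
```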

Fig. 3: Disparity of radiomics parameter errors between the male and the female.

All results are computed on the TUSC dataset. JS denotes Jensen-Shannon divergence and a lower JS means the two distributions are closer to each other. a–c Subgroup disparity of surface area, perimeter, and energy of the original TransUnet; d–f Subgroup disparity of surface area, perimeter, and energy of the TransUnet+APPLE.

As illustrated in the figures, all three parameters show significant disparities between subgroups. Specifically, the absolute differences are consistently higher in the Male group compared to the Female group. Similarly, with respect to age, the Old group exhibits a greater absolute difference than the Young group. These results are consistent with the observed segmentation accuracy patterns, emphasizing that unfairness in the segmentation process can propagate into downstream diagnostic tasks. This further underscores the importance of addressing bias in lesion segmentation to ensure fair and equitable diagnoses.

Unfairness mitigation via APPLE

To address the issues outlined in ultrasound diagnosis, we focus on the lesion segmentation step and propose using adversarial protected-attribute-aware perturbations within the latent embedding space. The results of our approach are presented in Table 1. For each backbone, we evaluate APPLE against two other state-of-the-art unfairness mitigation methods, FEBS17 and InD18. FEBS adds a fairness loss term to the original objective, i.e., the Dice-CE loss. InD fine-tunes the base model using subgroup-only data, resulting in a subgroup-specific model for each group.

Table 1 Unfairness mitigation using APPLE

Our proposed APPLE framework effectively reduces unfairness while preserving comparable overall segmentation performance across the two datasets, as compared to baseline models and other unfairness mitigation methods. Notably, TransUnet+APPLE achieves both a higher average DSC (\(\overline{\,\text{DSC}}\)) and a lower disparity (DSCΔ) for the sex attribute on the QDUS dataset, underscoring its capability to enhance fairness without compromising segmentation accuracy.

Furthermore, using TransUnet and the TUSC dataset as an example, we present the absolute difference distributions of both the baseline model and our proposed model for the sex and age attributes in Figs. 3 and 4. Compared to the baseline model, the subgroup distributions of the proposed model are more similar in overall shape, and the JS divergence19 between subgroups is lower, indicating a greater similarity between the subgroups.
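The JS divergence between two subgroup error distributions can be estimated as sketched below, using SciPy's jensenshannon (which returns the JS distance, i.e., the square root of the divergence); the histogram binning is an assumption.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def subgroup_js_divergence(errors_a, errors_b, n_bins=30):
    """JS divergence between the error distributions of two subgroups.

    errors_a / errors_b: 1-D arrays of absolute radiomics-parameter errors,
    e.g. for the Male and Female test samples.
    """
    hi = max(errors_a.max(), errors_b.max())
    bins = np.linspace(0.0, hi, n_bins + 1)
    p, _ = np.histogram(errors_a, bins=bins, density=True)
    q, _ = np.histogram(errors_b, bins=bins, density=True)
    # SciPy returns the JS *distance*; square it to obtain the divergence.
    return jensenshannon(p, q, base=2) ** 2
```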

Fig. 4: Disparity of radiomics parameter errors between the young and the old.

All results are computed on the TUSC dataset. JS denotes Jensen-Shannon divergence. a–c Subgroup disparity of surface area, perimeter, and energy of the original TransUnet; d–f Subgroup disparity of surface area, perimeter, and energy of the TransUnet+APPLE.

Integrating APPLE with segmentation foundation models

We also apply APPLE to segmentation foundation models, including SAM and MedSAM, to evaluate its potential in addressing fairness in larger, prompt-based models, which have demonstrated strong performance in medical segmentation tasks. The results are shown in Table 2. By integrating APPLE, the fairness of MedSAM improves slightly, while the DSCΔ for SAM increases by ~0.2%, demonstrating the potential for mitigating unfairness in foundation models with a large number of parameters.

Table 2 Results on TUSC dataset focusing on age attribute using SAM-family models with APPLE integrated

Confounder removal in latent space

To verify that unfairness mitigation is achieved through perturbation of the latent space, we use t-SNE20 to visualize the latent feature space of both the baseline model and the perturbed model. Additionally, we employ the Davies-Bouldin Index (DBI)21 and Silhouette Score (SS)22 to assess clustering quality. A lower Davies-Bouldin score and a higher Silhouette Score indicate better clustering quality. Intuitively, the perturbed model should exhibit more challenging clustering due to the attribute-aware perturbations. The results, shown in Fig. 5, demonstrate that compared to the pre-trained model, the mitigation model exhibits higher DBI and lower SS, indicating a decrease in clustering quality.
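A minimal sketch of this analysis is given below, using scikit-learn's TSNE, davies_bouldin_score, and silhouette_score with the sensitive attribute as the cluster label; whether the scores are computed on the 2-D projection or on the raw latents is an implementation choice we leave as an assumption here.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import davies_bouldin_score, silhouette_score

def cluster_quality(latents: np.ndarray, attrs: np.ndarray):
    """Project latent embeddings to 2-D with t-SNE and score how well the
    sensitive attribute separates in that space (lower DBI / higher SS means
    cleaner clustering, i.e. more attribute information retained)."""
    feats = latents.reshape(len(latents), -1)          # flatten per-sample features
    emb2d = TSNE(n_components=2, random_state=0).fit_transform(feats)
    dbi = davies_bouldin_score(emb2d, attrs)
    ss = silhouette_score(emb2d, attrs)
    return emb2d, dbi, ss
```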

Fig. 5: t-SNE of the latent feature space.

a t-SNE of the TUSC dataset regarding sex, red: male, green: female; b t-SNE of the TUSC dataset regarding age, red: young, green: old. Compared to the pre-trained model, the mitigation model has a higher Davies-Bouldin index (DBI) and lower silhouette score (SS), denoting a worse clustering quality.

Achieving controllable fairness constraining

In APPLE, a weighting factor, β, is used to adjust the degree of constraint on the target task and fairness. Experiments are conducted on the TUSC dataset focusing on the age attribute, where we vary β from 0.01 to 10.0 to test whether it successfully controls the strength of the fairness constraint. The results are shown in Fig. 6. As β increases, both \(\overline{\,\text{DSC}}\) and DSCΔ show a descending trend, indicating a drop in overall utility and an improvement in fairness. This phenomenon is consistent with the original assumption, i.e., controllable fairness constraining can be achieved by adjusting β.

Fig. 6: Ablation on fairness constraining factor β.

β = 0.0 denotes the baseline model.

Discussion

In this paper, we address unfairness mitigation in deep learning-based ultrasound image diagnosis under a more stringent constraint, where the weights of pre-trained models are frozen and cannot be modified. This is achieved by manipulating latent embeddings using protected-attribute-aware perturbations. The following key observations are highlighted for further discussion.

Firstly, unfairness is prevalent across the two ultrasound datasets with respect to both sex and age attributes, regardless of the segmentation backbone (i.e., CNN-based, Transformer-based, or SAM-based models). As shown in Fig. 2, this biased trend persists across each dataset, although different architectures exhibit varying degrees of unfairness. This indicates that unfairness arises from both data and model factors, which aligns with the findings in ref.6. Even for U-Net and TransUnet, which are trained on stratified-sampled training sets, disparities in subgroup performance remain and are even more pronounced than those of the zero-shot SAM and MedSAM models. In terms of overall performance, SAM-based models demonstrate higher zero-shot performance on the TUSC dataset, possibly due to the inclusion of bounding-box prompts. However, SAM's overall performance drops significantly on the QDUS dataset.

Secondly, unfairness in lesion mask predictions propagates to downstream diagnosis. By comparing the errors in radiomics parameters such as surface area, perimeter, and energy between the predicted and ground truth masks, we observe that the Male group tends to have higher errors compared to the Female group on the TUSC dataset, and the Old group tends to show higher errors than the Young group. These discrepancies in error directly impact the clinicians’ diagnostic process and may lead to unfair treatment.

Thirdly, extensive experiments across two ultrasound image datasets demonstrate the effectiveness of the proposed method in improving fairness across different backbones and attributes. These improvements are reflected in both numeric fairness metrics (shown in Table 1) and error distribution similarities (depicted in Figs. 3 and 4). Unlike previous fairness mitigation approaches that modify training data through sampling or data synthesis, or alter network architectures with adversarial branches or pruning biased neurons, our method introduces protected-attribute-aware perturbations directly into the latent feature embeddings, preserving both the original network architecture and weights. The strong performance of our method across various experimental settings highlights its generalizability.

Compared to other unfairness mitigation methods, APPLE achieves the best or second-best performance in each group under most experimental settings. Although InD attains the best fairness scores and utilities on TUSC-Sex, we emphasize that it requires the sensitive attribute at inference time, an additional information requirement compared to APPLE and FEBS that may complicate real-world deployment. On the other hand, although FEBS preserves utility better than APPLE, it outperforms APPLE on fairness metrics only on TUSC-Sex with U-Net as the backbone.

Fourthly, given the increasing use of large-scale foundation models for image segmentation, such as SAM and MedSAM, we also integrate our proposed APPLE into these pre-trained models. As shown in Table 2, APPLE effectively improves fairness in these foundation models, which is often challenging, as noted by Jin et al.23, who found that fairness mitigation methods for smaller models might not always be effective for foundation models. In contrast, our approach preserves the pre-trained model weights while applying perturbations to the latent space, thereby leveraging large pre-training datasets more effectively.

Moreover, we observed that the integration of APPLE with the two SAM-family models effectively preserved overall model utility while simultaneously mitigating bias. This finding is particularly significant, as most existing bias mitigation approaches typically compromise model performance to achieve fairness, thereby diminishing the inherent advantages of large-scale models trained on extensive datasets.

In addition, by visualizing the latent feature space of the pre-trained models, we observe group-wise aggregations based on both sex and age attributes. This supports the hypothesis proposed in ref.24 that the encoding of sensitive attribute information by deep learning models may contribute to unfair outcomes. Upon comparing the latent spaces of the pre-trained model and the mitigation model, we find that group-wise clustering is reduced, leading to improvements in the corresponding fairness metrics. This relationship provides valuable insights for the development of more effective unfairness mitigation algorithms in future studies.

Lastly, while both model utility and fairness are crucial in real-world applications, there are scenarios where the utility is too low for fairness analysis to be meaningful. For example, if the overall DSC is around 40%, the focus on fairness becomes less relevant. Conversely, when utility is sufficiently high for diagnosis, ensuring fairness across different demographics becomes more critical. To address these varying scenarios, a manually adjustable weighting factor is necessary to balance the trade-off between utility and fairness. Through ablation of the hyperparameter β (ranging from 0.01 to 10.0), we observe that adjusting β allows fine-tuning of the balance between utility and fairness, offering flexibility for deployers to meet specific fairness requirements based on the application context.

Our study does have limitations. We focus solely on unfairness mitigation in segmentation tasks, leaving other important tasks in medical image analysis, such as classification, detection, and low-level tasks, unevaluated. While the characteristics of our method suggest potential applicability to these tasks (e.g., the latent space between the feature extractor and the multilayer perceptron (MLP) head for classification, before ROI pooling layers for object detection, and between encoder and decoder for low-level tasks), further experiments are needed to verify its effectiveness. Additionally, current fairness measurements for segmentation tasks treat all pixels equally, which may not be ideal. Pixels at the center of the ground truth mask and those at the boundary likely carry different significance. Therefore, fairness metrics that account for pixel importance should be developed to provide a more accurate evaluation of fairness in medical segmentation tasks.

In summary, we propose an algorithm, APPLE, designed to improve model fairness in ultrasound segmentation and diagnosis tasks without altering the architecture or parameters of pre-trained base models. APPLE achieves this by perturbing the latent embeddings of the base model using a GAN, ensuring that sensitive information is not transmitted to the decoder of the segmentor. Extensive experiments on a public and a private ultrasound dataset demonstrate APPLE’s effectiveness across different base segmentors. We hope this research emphasizes the importance of fairness in AI models, draws attention to fairness in medical segmentation, and provides insights for mitigating unfairness in other tasks and foundation models, ultimately contributing to the development of a more equitable healthcare system and health equity for all.

Methods

Datasets

Two ultrasound image datasets are involved in this study: a publicly available Thyroid UltraSound Cine-clip dataset25 (TUSC) and a private Thyroid Ultrasound dataset (QDUS)26. The TUSC dataset comprises ultrasound cine-clip images from 167 patients gathered at Stanford University Medical Center. Because consecutive frames extracted from the cine-clips differ only slightly, we subsampled the frames at a ratio of 5, resulting in a dataset of 860 images. The download link for the TUSC dataset can be found in the Supplementary Materials. The QDUS dataset was collected from the Ultrasound Department of the Affiliated Hospital of Qingdao University and consists of 1942 images from 192 cases, annotated by three experienced radiologists. We categorized the samples into Male and Female groups by sex, and into Young and Old groups using a threshold of 50 years. The data distributions of the two datasets are presented in Table 3. For both datasets, we partition them into training and testing sets at a ratio of 7:3 using a stratified sampling strategy to ensure equal representation of each subgroup.
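As a sketch, the stratified split can be implemented with scikit-learn as below; stratifying jointly on sex and age is an assumption about how the subgroups are balanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_split(sex_labels, age_labels, test_size=0.3, seed=0):
    """70/30 split of sample indices, keeping every sex x age subgroup
    proportionally represented (joint stratification is an assumption)."""
    strata = [f"{s}_{a}" for s, a in zip(sex_labels, age_labels)]  # e.g. "F_young"
    indices = np.arange(len(strata))
    train_idx, test_idx = train_test_split(
        indices, test_size=test_size, stratify=strata, random_state=seed
    )
    return train_idx, test_idx
```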

Table 3 Dataset distribution of TUSC and QDUS

Image pre-processing and model pre-training

Before passing the image into deep learning models, all the images were resized to 256 × 256 and normalized with a mean of (0.485, 0.456, 0.406) and standard deviation of (0.229, 0.224, 0.225). Random rotation and random flipping were applied for data augmentation.
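A minimal torchvision sketch of this preprocessing is given below; the rotation range and flip probability are assumptions, since only the types of augmentation are stated above, and the same geometric transforms would additionally need to be applied to the masks.

```python
from torchvision import transforms

# Rotation range and flip probability are assumptions.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])

test_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])
```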

One CNN-based segmentator, i.e., U-Net27, and one Transformer-based segmentator, i.e., TransUnet28, are used to segment the lesion from the ultrasound images. The two models are trained from scratch following the general pipeline in ref.29. In addition, we include two Segment Anything Models (SAM30, MedSAM31), which have been pre-trained on large-scale image datasets, and evaluate their zero-shot performance.

The implementation details and hyperparameters can be found in the Supplementary Materials.

Adversarial protected-attribute-aware perturbation on latent embeddings

The key idea of APPLE stems from the finding in ref.24 that a model's predictions are fairer when less sensitive-attribute-related information is encoded into the latent feature embeddings, largely because the model then relies less on spurious relationships between the confounding attributes and the target task.

Commonly used segmentators, including U-Net, TransUnet, SAM, and MedSAM, can be regarded as consisting of an encoder Es and a decoder Ds.

Generally, the input image x passes through the encoder and decoder sequentially to generate the predicted mask \(\hat{y}\). In our paper, the latent embedding denotes the feature vector fo extracted by the encoder Es. The subscript s denotes segmentation, while o denotes the original network. Thus, the pipeline is given by the following equation:

$$\hat{y}={D}_{s}({E}_{s}(x))={D}_{s}({f}_{o})$$
(1)

Based on this motivation, we mitigate unfairness by manipulating the latent embedding fo with a generative adversarial network, perturbing it such that the sensitive attribute can no longer be recognized from the perturbed embedding.

Our APPLE contains two networks: the adversarial perturbation generator Gp and the sensitive attribute discriminator Dp. Gp takes the original latent embedding fo as input and generates a perturbation δ with the same shape as fo. The original embedding fo and the perturbation δ are then added together to obtain the perturbed latent embedding fp = fo + δ. Finally, the perturbed fp is passed to the decoder Ds to predict the segmentation mask. The pipeline is shown in Eq. (2).

$$\hat{y}={D}_{s}({f}_{p})={D}_{s}({f}_{o}+\delta )={D}_{s}\left({E}_{s}(x)+{G}_{p}({E}_{s}(x))\right).$$
(2)
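In PyTorch-style pseudocode, Eqs. (1) and (2) correspond to the forward pass sketched below, which treats the segmentor as a single latent bottleneck; skip connections and prompt encoders are omitted for brevity, so this is a simplified sketch rather than the exact implementation.

```python
import torch
import torch.nn as nn

def perturbed_forward(x: torch.Tensor,
                      enc: nn.Module,   # frozen segmentation encoder E_s
                      dec: nn.Module,   # frozen segmentation decoder D_s
                      gen: nn.Module):  # trainable perturbation generator G_p
    """Eq. (2): y_hat = D_s(E_s(x) + G_p(E_s(x))). Only G_p receives gradients."""
    with torch.no_grad():
        f_o = enc(x)                    # original latent embedding (weights frozen)
    delta = gen(f_o)                    # protected-attribute-aware perturbation
    f_p = f_o + delta                   # perturbed latent embedding
    y_hat = dec(f_p)                    # predicted lesion mask
    return y_hat, f_p
```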

On the other hand, the sensitive attribute discriminator Dp tries to recognize the sensitive attribute from the perturbed latent embedding fp and is optimized with Eq. (3).

$${{\mathcal{L}}}_{D}={{\mathcal{L}}}_{{\rm{CE}}}({D}_{p}({f}_{p}),a),$$
(3)

where a is a binary variable denoting the sensitive attributes of the sample. a = 0 denotes the male/young group, while a = 1 denotes the female/old group.

The loss function of the perturbation generator Gp consists of two parts: the segmentation-utility-preserving part \({{\mathcal{L}}}_{G}^{seg}\) and the fairness-constraining part \({{\mathcal{L}}}_{G}^{fair}\). \({{\mathcal{L}}}_{G}^{seg}\) is defined as the Dice-CE loss between the predicted mask and the ground truth mask, as shown in Eq. (4):

$${{\mathcal{L}}}_{G}^{seg}={{\mathcal{L}}}_{{\rm{Dice}}\text{-}{\rm{CE}}}=\frac{1}{2}({{\mathcal{L}}}_{{\rm{CE}}}(y,\hat{y})+{{\mathcal{L}}}_{{\rm{Dice}}}(y,\hat{y})).$$
(4)

The fairness-constraining part aims to generate perturbations that confuse the prediction of the sensitive attribute, which is enforced by the negative cross-entropy loss of the sensitive attribute prediction. In addition, a regularization term on the entropy \({\mathcal{H}}\) of Dp(fp) is added to avoid collapse of the perturbed space. Therefore, \({{\mathcal{L}}}_{G}^{fair}\) is given by Eq. (5):

$${{\mathcal{L}}}_{G}^{fair}=-{{\mathcal{L}}}_{{\rm{CE}}}({D}_{p}({f}_{p}),a)-\alpha {\mathcal{H}}({D}_{p}({f}_{p})),$$
(5)

where \({\mathcal{H}}\) denotes the entropy. The overall loss of Gp is then defined as Eq. (6):

$${{\mathcal{L}}}_{G}=\beta {{\mathcal{L}}}_{G}^{seg}+{{\mathcal{L}}}_{G}^{fair},$$
(6)

where β is a weighting factor that balances the requirements for fairness and utility; a higher β places more emphasis on utility than on fairness.
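The generator objective of Eqs. (4)-(6) can be written as the sketch below, where the entropy \({\mathcal{H}}\) is computed on the softmax output of Dp; the default α and β values are placeholders, not the tuned settings of our experiments.

```python
import torch
import torch.nn.functional as F

def entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean entropy H of D_p's softmax output, used to avoid collapse."""
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()

def generator_loss(seg_loss: torch.Tensor,
                   attr_logits: torch.Tensor,
                   attr_labels: torch.Tensor,
                   alpha: float = 0.1,
                   beta: float = 1.0) -> torch.Tensor:
    """Eq. (6): L_G = beta * L_G^seg + L_G^fair, with L_G^fair from Eq. (5).

    seg_loss is the Dice-CE term of Eq. (4), computed on (y_hat, y);
    attr_logits = D_p(f_p); attr_labels = a.
    """
    l_fair = -F.cross_entropy(attr_logits, attr_labels) - alpha * entropy(attr_logits)
    return beta * seg_loss + l_fair
```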

Specifically, Gp consists of a 3-layer encoder with channels (32, 64, 128), a 4-layer ResBlock-based bottleneck, and a 3-layer decoder with channels (64, 32, Nec), where Nec is the number of embedding channels. Dp is composed of three Linear-BatchNorm blocks, with the output dimension set to the number of sensitive attribute classes. The detailed architectures of Gp and Dp are shown in Fig. 7.
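A PyTorch sketch of these two networks is given below; kernel sizes, hidden widths, the assumption of a convolutional (spatial) latent, and the flattening of that latent before Dp are not fixed by the description above and should be read as illustrative choices.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual conv block (kernel sizes and activations are assumptions)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class PerturbationGenerator(nn.Module):
    """G_p: 3-layer encoder (32, 64, 128) -> 4 ResBlocks -> 3-layer decoder (64, 32, N_ec)."""
    def __init__(self, n_embed_channels):
        super().__init__()
        self.encoder = nn.Sequential(conv_block(n_embed_channels, 32),
                                     conv_block(32, 64),
                                     conv_block(64, 128))
        self.bottleneck = nn.Sequential(*[ResBlock(128) for _ in range(4)])
        self.decoder = nn.Sequential(conv_block(128, 64),
                                     conv_block(64, 32),
                                     nn.Conv2d(32, n_embed_channels, 3, padding=1))
    def forward(self, f_o):
        return self.decoder(self.bottleneck(self.encoder(f_o)))   # perturbation delta

class AttributeDiscriminator(nn.Module):
    """D_p: three Linear-BatchNorm blocks ending in the attribute logits."""
    def __init__(self, in_dim, hidden=256, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, n_classes),
        )
    def forward(self, f_p):
        return self.net(f_p.flatten(1))   # flatten the spatial latent to a vector
```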

Fig. 7: Scheme of the attribute-aware latent feature perturbation.

a Pipeline of the proposed method. x: input image; y: ground truth segmentation mask; \(\hat{y}\): predicted segmentation mask; Es: segmentation encoder; Ds: segmentation decoder; Gp: perturbation generator; Dp: perturbation discriminator; fo: origin feature vector; fp: perturbed feature vector; pa: predicted attribute; a: sensitive attribute; \({\mathcal{H}}\): entropy; α, β: weighted factor. b Detailed scheme of Gp; c Detailed scheme of Dp. Note that the parameters of Es and Ds are pre-trained and frozen; only the parameters of Gp and Dp are trainable.

Following the adversarial training paradigm, APPLE is optimized via a min-max game; the pseudocode of our algorithm is presented in Algorithm 1.

Algorithm 1

Pseudo code of APPLE

Input: Encoder Es, Decoder Ds, input image x, segmentation mask y, sensitive attribute a, weighting factors α, β, number of epochs Nepoch

 for i = 1: Nepoch do

  Get the original latent embedding: foEs(x).

  Get the perturbed latent embedding: fpfo + Gp(fo).

  Get the predicted segmentation mask: \(\hat{y}\leftarrow {D}_{s}({f}_{p})\).

  Compute Loss of Dp: \({{\mathcal{L}}}_{D}={{\mathcal{L}}}_{CE}({D}_{p}({f}_{p}),a)\).

  Optimize Dp using \({{\mathcal{L}}}_{D}\).

  Compute Fairness Loss of Gp: \({{\mathcal{L}}}_{G}^{fair}=-{{\mathcal{L}}}_{CE}({D}_{p}({f}_{p}),a)-\alpha {\mathcal{H}}({D}_{p}({f}_{p}))\).

  Compute Segmentation Loss of Gp: \({{\mathcal{L}}}_{G}^{seg}=\frac{1}{2}({{\mathcal{L}}}_{CE}(y,\hat{y})+{{\mathcal{L}}}_{Dice}(y,\hat{y}))\).

  Optimize Gp using \({{\mathcal{L}}}_{G}=\beta {{\mathcal{L}}}_{G}^{seg}+{{\mathcal{L}}}_{G}^{fair}\).

 end for

Output: Perturbation Generator Gp and Discriminator Dp
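A PyTorch rendering of Algorithm 1 is sketched below. It reuses the generator_loss helper from the sketch after Eq. (6); the optimizer choice, learning rate, and hyperparameter values are placeholders, and seg_loss_fn stands for the Dice-CE loss of Eq. (4) (e.g., a MONAI DiceCELoss or a hand-rolled equivalent).

```python
import torch
import torch.nn.functional as F

def train_apple(enc, dec, gen, disc, loader, seg_loss_fn,
                alpha=0.1, beta=1.0, n_epochs=100, lr=1e-4, device="cuda"):
    """Alternating min-max optimization of Algorithm 1.

    enc/dec: frozen segmentation encoder/decoder; gen/disc: trainable G_p/D_p.
    seg_loss_fn is the Dice-CE loss of Eq. (4); alpha, beta, lr, n_epochs are placeholders.
    """
    for p in list(enc.parameters()) + list(dec.parameters()):
        p.requires_grad_(False)                    # keep the base segmentor fixed
    opt_g = torch.optim.Adam(gen.parameters(), lr=lr)
    opt_d = torch.optim.Adam(disc.parameters(), lr=lr)

    for _ in range(n_epochs):
        for x, y, a in loader:                     # image, GT mask, sensitive attribute
            x, y, a = x.to(device), y.to(device), a.to(device)
            with torch.no_grad():
                f_o = enc(x)                       # original latent embedding
            f_p = f_o + gen(f_o)                   # perturbed latent embedding

            # Step 1: update D_p to recognise the attribute from f_p (Eq. (3)).
            loss_d = F.cross_entropy(disc(f_p.detach()), a)
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()

            # Step 2: update G_p to preserve utility while hiding the attribute
            # (L_G = beta * L_G^seg + L_G^fair; generator_loss defined above).
            y_hat = dec(f_p)
            loss_g = generator_loss(seg_loss_fn(y_hat, y), disc(f_p), a,
                                    alpha=alpha, beta=beta)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return gen, disc
```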

Fairness criteria for segmentation and downstream diagnosis

Metrics evaluating segmentation utility, diagnosis parameters, and fairness are used to validate the effectiveness of our proposed method.

Dice Similarity Coefficient (DSC) is used to measure the precision of the predicted segmentation mask, which is given by Eq. (7).

$${\text{DSC}}=\frac{2\,| \hat{y}\cap y| }{| \hat{y}| +| y| }$$
(7)

where y and \(\hat{y}\) denote the ground truth mask and predicted mask, respectively.
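For binary masks, Eq. (7) reduces to the short function below (the small epsilon is an assumption to guard against empty masks).

```python
import numpy as np

def dice_similarity(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Eq. (7): DSC = 2|pred ∩ gt| / (|pred| + |gt|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float(2.0 * inter / (pred.sum() + gt.sum() + eps))
```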

For radiomics parameters, three metrics, including the surface area, perimeter, and the energy of the lesion part are selected and calculated using PyRadiomics16 for diagnosis.

As for fairness criteria, following previous work23, we use the subgroup disparity (DSCΔ), standard deviation (DSCSTD), and skewness (DSCSKEW), which are given by Eqs. (8), (9), and (10).

$${\text{DSC}}_{\Delta }=\mathop{\max }\limits_{k}({\text{DSC}}^{k})-\mathop{\min }\limits_{k}({\text{DSC}}^{k})$$
(8)
$${\text{DSC}}_{{\rm{STD}}}=\sqrt{\frac{\mathop{\sum }\nolimits_{k = 1}^{K}{({\text{DSC}}^{k}-\overline{\text{DSC}})}^{2}}{K-1}}$$
(9)

where \(\overline{\,\text{DSC}}=\frac{1}{K}\mathop{\sum }\nolimits_{k = 1}^{K}{\text{DSC}}^{k}\), and DSCk is the utility on the k-th subgroup.

$${\text{DSC}}_{{\rm{SKEW}}}=\frac{\mathop{\max }\nolimits_{k}(1-{\text{DSC}}^{k})}{\mathop{\min }\nolimits_{k}(1-{\text{DSC}}^{k})}$$
(10)