Introduction

Recent studies indicate that AI systems can exhibit biases against demographic groups based on attributes such as age, race, gender, and socioeconomic status, particularly in medical imaging tasks like disease detection and treatment recommendation, which raises significant ethical concerns1,2. To address privacy issues related to medical data, federated learning (FL) has emerged as a preferred approach, and several key FL methods tackle data heterogeneity and fairness in different ways. FedAvg3 averages model updates from clients but may yield biased outcomes when certain demographic groups are underrepresented. FedProx4 introduces a proximal term that keeps local models aligned with the global model, mitigating the impact of out-of-distribution (OoD) data. FedNova5 normalizes local updates by the number of training steps to balance contributions from clients with varying data amounts, improving convergence. SCAFFOLD6 employs control variates to stabilize learning and reduce bias from OoD data distributions. Owing to these privacy-preserving capabilities, FL has gained significant attention in the medical field in recent years.
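FedAvg's aggregation step, for instance, reduces to a data-size-weighted average of client parameters. The following is a minimal illustrative sketch (not the implementation used in this study), with model parameters represented as plain Python lists rather than tensors:

```python
# Minimal sketch of FedAvg's server-side aggregation (illustrative only).
# Each client submits its locally trained parameters and its sample count;
# the server returns the data-size-weighted average of the parameters.

def fedavg_aggregate(client_params, client_sizes):
    """client_params: list of dicts mapping layer name -> list of floats."""
    total = sum(client_sizes)
    global_params = {}
    for key in client_params[0]:
        n_vals = len(client_params[0][key])
        global_params[key] = [
            sum(p[key][i] * n / total for p, n in zip(client_params, client_sizes))
            for i in range(n_vals)
        ]
    return global_params

# Two clients with unequal data: the larger client dominates the average,
# which is exactly how underrepresented groups can be marginalized.
clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
sizes = [100, 300]
print(fedavg_aggregate(clients, sizes))  # {'w': [2.5, 3.5]}
```

Because the weighting is purely by sample count, a demographic group concentrated in small clients contributes little to the global model, which motivates the fairness-aware methods discussed below.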

However, most FL research has focused on performance fairness—achieving consistent accuracy across clients—while overlooking group fairness, a gap that risks exacerbating healthcare disparities by underrepresenting diverse demographic groups. Addressing this issue requires models that go beyond privacy preservation to actively ensure fairness. Prioritizing such models is critical to tackling ethical concerns, fostering equitable treatment across demographic groups, and mitigating biases in healthcare applications. Moreover, these advancements would enhance the robustness and generalizability of medical AI systems, paving the way for more inclusive and dependable healthcare solutions.

Several algorithms aim to enhance group fairness in centralized learning7,8,9,10. For example, Fair-Mixup10 generates a distribution path connecting sensitive groups and regularizes the smoothness of the path to improve the generalization of group fairness metrics. Some works11,12 formulate the constrained optimization problem as a two-player game and analyze the solutions and generalization bounds. Additionally, Fair-CDA13 illustrates that group fairness can be promoted by regularizing models along the transitional paths of sensitive attributes among groups. While these strategies show promise for achieving better accuracy and fairness across various benchmarks, their application in medical settings often requires data from multiple centers, which raises privacy concerns. Integrating FL with fair machine learning presents a potential solution by allowing institutions to collaboratively train models without sharing raw patient data14. However, directly addressing group fairness in heterogeneous distributed settings often necessitates feature exchange10,13, which can lead to privacy leaks and contradict the fundamental principles of FL.

Most importantly, achieving fairness while preserving privacy in medical imaging tasks remains a significant challenge15,16,17. FairFed18 addresses this challenge by using fairness-aware aggregation to enhance fairness in FL. However, FairFed primarily focuses on group fairness criteria and is not specifically designed to ensure performance fairness, which we refer to as equal accuracy (EA) in this study. To address these limitations, we propose FlexFair, an FL framework that integrates three fairness criteria: EA19, demographic parity19 (DP), and equal opportunity8 (EO). FlexFair was evaluated across four distinct medical imaging tasks: polyp segmentation, fundus vascular segmentation, cervical cancer segmentation, and skin disease classification. Specifically, for cervical cancer segmentation, we curated a multi-centre and diverse dataset of 678 patients from four hospitals, reflecting the demographic diversity in clinical settings. By leveraging data from multiple institutions, FlexFair enhances generalizability and ensures its approach can be effectively applied in real-world scenarios. The pipeline of FlexFair is shown in Fig. 1. Our framework adaptively balances trade-offs between fairness and accuracy, and theoretical analysis shows it can accommodate different fairness metrics by modifying a component of the loss function. Results demonstrate that FlexFair achieves high performance and robustness while adhering to fairness criteria such as DP, EO, and EA, providing an effective mechanism for ensuring fairness and privacy protection in medical imaging research.

Fig. 1: Overview of our method for fairness and privacy.

a Overview of the proposed FlexFair and its comparison with both the centralized learning and the vanilla FL method, FedAvg. FlexFair effectively mitigates prediction disparities from task models through a weighted penalty mechanism while prioritizing data privacy by integrating a federated framework. b Detailed design of FlexFair. FlexFair addresses fairness and privacy challenges in federated environments by incorporating multiple sensitive attributes, e.g., age, gender, and site, into its framework. It evaluates fairness using metrics like EA, DP, and EO, and integrates these attributes into a weighted regularized loss to ensure the training process promotes fairness across all groups.

Results

To evaluate the performance of FlexFair, we conducted experiments across various medical imaging tasks, including segmentation and diagnostic challenges, covering real-world scenarios with diverse data distributions and complexities. To ensure a thorough assessment, we employed a comprehensive set of evaluation metrics that simultaneously measured accuracy and fairness. These metrics included dice scores for segmentation and overall accuracy for diagnostic tasks, alongside fairness criteria such as EA, DP, and EO. Each experiment was executed across five distinct random seeds to enhance the reliability of our findings, and we reported our results with accompanying statistical analyses to provide a comprehensive understanding of FlexFair’s performance and consistency across different initializations.
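For reference, the dice score used for the segmentation tasks is the standard overlap metric; a minimal version is sketched below with binary masks given as flat 0/1 lists rather than image tensors (the epsilon term, our addition, guards against empty masks):

```python
def dice_score(pred, target, eps=1e-8):
    """Dice coefficient for binary masks given as flat lists of 0/1.

    Dice = 2 * |pred ∩ target| / (|pred| + |target|)
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)

pred   = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
print(round(dice_score(pred, target), 3))  # 0.667
```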

We conducted a comparative analysis of FlexFair against several established FL and fair machine learning methods, such as FedAvg, FedProx, FedNova, SCAFFOLD, FairFed, and FairMixup. This allowed us to assess its ability to balance accuracy and fairness in diverse scenarios. The results demonstrated FlexFair’s superior capacity to manage the inherent variability in medical imaging tasks while reducing performance disparities across clients. By enhancing the overall effectiveness of FL in medical applications, FlexFair also highlights its potential to promote equitable healthcare delivery, particularly in multi-institutional and resource-limited settings.

FlexFair achieves flexible fairness in diverse medical imaging scenarios

FlexFair exhibits consistent improvements in both fairness and performance metrics, as demonstrated by the comparative analysis in Fig. 2. The figure highlights FlexFair’s superior performance across multiple datasets, including polyp, fundus vascular, cervical cancer, and skin disease, where it consistently achieves lower fairness gaps, indicating more equitable outcomes. By running each method across five random seeds and averaging the results, the robustness of FlexFair’s performance is confirmed. The Pareto front plots, which report the top 20 test results under different weights, further underscore FlexFair’s ability to balance high accuracy (measured by dice scores and accuracy) with minimal fairness gaps (EA, DP, EO). This comprehensive evaluation reveals that FlexFair not only excels in predictive accuracy but also maintains fairness across various demographic attributes, making it a promising approach for both segmentation and diagnostic tasks in diverse medical applications.

Fig. 2: FlexFair achieves superior fairness and accuracy across diverse medical datasets.

We compare FlexFair with six baseline methods (FedAvg, FedNova, FedProx, SCAFFOLD, FairFed, and FairMixup) across four datasets: polyp, fundus vascular, cervical cancer, and skin disease. Each method is evaluated on fairness (EA, DP, EO) and accuracy metrics (dice score for segmentation tasks and accuracy for diagnostic tasks). a–c illustrate the Pareto front for segmentation datasets, highlighting trade-offs between fairness and accuracy; FlexFair, highlighted in red, consistently achieves superior dice scores and smaller fairness gaps. d–f depict maximum gap values for dice scores, where lower values indicate greater fairness; FlexFair outperforms other methods by minimizing the max dice gap across sites. g–j analyze fairness and accuracy in diagnostic tasks on the skin disease dataset, emphasizing FlexFair’s ability to balance demographic parity and equal opportunity across age and gender attributes. k–n confirm that FlexFair achieves the lowest maximum gap values, ensuring equitable performance across all metrics and datasets. Source data are provided as a Source Data file.

In terms of EA, we evaluated three segmentation tasks and chose the site as the sensitive attribute. We calculated the overall mean dice performance and reported the maximum gap between each site’s dice performance and the overall mean dice as the fairness gap, as shown in Equation (1). Figure 2a–c presents the Pareto front, highlighting the trade-off between dice scores and fairness across various methods. FlexFair, depicted in red, stands out for its superior performance, achieving high dice scores while maintaining robust fairness metrics. Among the baseline methods, most advanced FL approaches outperform FedAvg, which yields moderate dice scores but often falls short in terms of fairness. FedProx demonstrates competitive performance but struggles to strike an optimal balance between fairness and accuracy, frequently exhibiting higher max dice gaps that indicate greater unfairness in segmentation outcomes. SCAFFOLD shows inconsistent results, performing well on the fundus vascular dataset but significantly underperforming on the cervical cancer dataset, as shown in Fig. 3. Due to its poor performance on the cervical cancer dataset, the Pareto front for SCAFFOLD is omitted. More detailed results are shown in Table 1. Figure 2d–f presents the maximum gap values of dice performance across different sites, serving as a measure of unfairness: higher maximum gap values indicate greater unfairness. For each method, we report the minimal maximum gap value among runs exceeding a performance threshold. In Fig. 2d, FedAvg exhibits the highest maximum gap, signalling the greatest unfairness, while FlexFair achieves the lowest maximum gap, indicating the most equitable performance. In Fig. 2e, for the fundus vascular dataset, the maximum gap values span from ~0.11 to 0.13. FedAvg again demonstrates the highest maximum gap, suggesting the highest level of unfairness, whereas FlexFair maintains the lowest maximum gap, indicating the most balanced performance. In Fig. 2f, for the cervical cancer dataset, SCAFFOLD shows the highest maximum gap, reflecting significant unfairness, while FlexFair again exhibits the lowest maximum gap, reinforcing its effectiveness in ensuring fairness.
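The EA fairness gap described above can be sketched as follows, assuming Equation (1) takes the form of the largest absolute deviation of any site’s mean dice from the overall mean dice (pooling all samples for the overall mean is our assumption here, and the input values are hypothetical):

```python
# Sketch of the EA fairness gap: the largest absolute deviation of any
# site's mean dice score from the overall mean dice score. The pooled
# overall mean and the exact form of Equation (1) are assumptions.

def ea_gap(site_dices):
    """site_dices: dict mapping site name -> list of per-sample dice scores."""
    all_scores = [d for scores in site_dices.values() for d in scores]
    overall = sum(all_scores) / len(all_scores)
    site_means = {s: sum(v) / len(v) for s, v in site_dices.items()}
    gap = max(abs(m - overall) for m in site_means.values())
    return gap, overall

# Hypothetical per-sample dice scores from two sites.
gap, overall = ea_gap({
    "site_A": [0.90, 0.88],
    "site_B": [0.78, 0.80],
})
print(round(gap, 3), round(overall, 3))  # 0.05 0.84
```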

Fig. 3: Comparative segmentation analysis.

We evaluate FlexFair against baseline methods on segmentation tasks across three datasets: cervical cancer, polyp, and fundus vascular. Violin-box plots depict the distribution of the top 20 test results for each method across different weight configurations and random seeds. The boxes represent the interquartile range (IQR), with the median marked by the white line and the mean indicated by the red 'x'. Whiskers extend to data points within 1.5 times the IQR, with black diamonds showing outliers. Dice scores, serving as a metric for segmentation accuracy, highlight FlexFair’s consistently superior performance across all tasks, characterized by a tighter distribution around higher median and mean values compared to the baseline methods. Source data are provided as a Source Data file.

Table 1 Performance was assessed on three segmentation datasets (polyp, fundus vascular, and cervical cancer) by reporting the mean ± standard deviation of the top 20 test results for each random seed (5 seeds in total)

In terms of DP, we ensured that the predictor \(\hat{Y}\) treats different sensitive attribute groups equally by requiring that the prediction probabilities remain the same regardless of the value of the sensitive attribute A. We selected age and gender as sensitive attributes for this metric. To measure the expected predictions under a binary classification task, we applied the softmax function to the output logits of samples and selected the output value at index 1 to represent the probability of positive labels. We then calculated the overall expected predictions and reported the maximum gap between the expected predictions of each sensitive group and the overall expected predictions as the fairness gap, as shown in Equation (2).
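Under these definitions, the DP gap computation can be sketched as follows. The logits and group labels are hypothetical, and the exact form of Equation (2) is assumed to be the maximum absolute deviation of any group’s mean positive probability from the overall mean:

```python
import math

def softmax_pos_prob(logits):
    """P(positive) from a pair of binary-classification logits (index 1)."""
    exps = [math.exp(z) for z in logits]
    return exps[1] / sum(exps)

def dp_gap(logits, groups):
    """Max gap between each group's mean predicted positive probability
    and the overall mean predicted positive probability."""
    probs = [softmax_pos_prob(z) for z in logits]
    overall = sum(probs) / len(probs)
    gaps = []
    for g in set(groups):
        group_probs = [p for p, gg in zip(probs, groups) if gg == g]
        gaps.append(abs(sum(group_probs) / len(group_probs) - overall))
    return max(gaps)

# Hypothetical logits for four samples and their sensitive-attribute groups.
logits = [[0.2, 1.4], [1.0, -0.5], [0.0, 0.0], [-1.0, 2.0]]
groups = ["male", "male", "female", "female"]
print(round(dp_gap(logits, groups), 3))
```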

In terms of EO, we ensured that the predictor \(\hat{Y}\) maintains EO for correct predictions across different sensitive attribute groups. We applied age and gender as sensitive attributes for this metric as well. To measure the expected predictions under a binary classification task, we applied the same approach as for DP: we applied the softmax function to the output logits of samples and selected the output value at index 1 to represent the probability of positive labels. For samples with positive labels, we calculated the overall expected predictions and reported the maximum gap between the expected predictions of each sensitive group and the overall expected predictions as the fairness gap, as shown in Equation (3). Similar to the segmentation tasks, Fig. 2g–n presents the Pareto front and bar charts for the skin disease diagnosis task. The Pareto front uses accuracy as the performance metric, while DP and EO with respect to different attributes (age and gender) are used as fairness metrics. Notably, FlexFair (represented by the red line) consistently demonstrates superior performance across these metrics. For instance, in the age DP plot (g), FlexFair shows a significant improvement in accuracy as the DP gap increases. Similarly, the gender DP plot (i) highlights FlexFair’s ability to maintain fairness across genders while improving accuracy. These trends are consistently observed in the age EO (h) and gender EO (j) plots as well, further reinforcing FlexFair’s balanced and equitable performance across different demographic groups. The bottom row bar charts (k–n) provide a comparative analysis of the fairness metrics across different methods. In all four charts—age DP (k), age EO (l), gender DP (m), and gender EO (n)—FlexFair consistently exhibits lower values compared to other methods. The lower maximum gap value across all metrics indicates that FlexFair achieves more equitable performance, effectively reducing bias related to age and gender in diagnostic tasks.
Overall, these visualizations underscore FlexFair’s superior performance in maintaining high accuracy while ensuring fairness across diverse patient groups.
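The EO gap differs from the DP gap only in restricting the expectation to positively labelled samples. A sketch under the same assumptions (hypothetical logits and groups, and an assumed max-deviation form for Equation (3)):

```python
import math

def eo_gap(logits, labels, groups):
    """Equal-opportunity gap: among samples with positive labels, the max
    gap between each group's mean predicted positive probability and the
    overall mean over positives (assumed form of Equation (3))."""
    def pos_prob(z):
        exps = [math.exp(v) for v in z]
        return exps[1] / sum(exps)

    # Keep only positively labelled samples, as EO conditions on Y = 1.
    pos = [(pos_prob(z), g) for z, y, g in zip(logits, labels, groups) if y == 1]
    overall = sum(p for p, _ in pos) / len(pos)
    gaps = []
    for g in {g for _, g in pos}:
        group_probs = [p for p, gg in pos if gg == g]
        gaps.append(abs(sum(group_probs) / len(group_probs) - overall))
    return max(gaps)

# Hypothetical logits, labels, and sensitive-attribute groups.
logits = [[0.0, 2.0], [0.0, 0.5], [1.0, -1.0], [0.0, 1.0]]
labels = [1, 1, 0, 1]
groups = ["under60", "over60", "under60", "over60"]
print(round(eo_gap(logits, labels, groups), 3))
```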

FlexFair enhances accuracy in both segmentation and diagnostic tasks

Figure 3 presents a comparative analysis of various FL methods alongside FlexFair on segmentation tasks for three medical conditions. The dice scores, which serve as a measure of segmentation accuracy, consistently indicate that FlexFair outperforms the other methods under comparison. This superior performance is reflected in the violin plots, which visualize central tendency measures (mean and median), interquartile ranges, and outliers. The results highlight FlexFair’s ability to deliver both accuracy and robustness across diverse datasets, emphasizing its effectiveness.

To determine whether there are statistically significant differences in performance between FlexFair and the baseline methods, we performed t tests on each method’s accuracy, calculating the p value for each comparison. Table 1 and Table 2 present the mean and standard deviation from five random seeds for both segmentation and diagnostic tasks, which include polyp segmentation, fundus vascular segmentation, cervical cancer segmentation, and skin disease classification. The p values, which highlight the statistical significance of the results, were derived by comparing each method’s accuracy with that of FlexFair.

Table 2 Performance was assessed on one classification dataset, the skin disease dataset, by reporting the mean ± standard deviation of the top 20 test results for each random seed (5 seeds in total), with fairness gaps lower than a specified threshold

In these tables, FlexFair consistently outperforms other methods under similar fairness gaps. For example, in the polyp dataset, FlexFair achieves a dice score of 0.885 ± 0.004, and in the fundus vascular dataset, it scores 0.658 ± 0.007. FlexFair also excels in the cervical cancer dataset with a dice score of 0.801 ± 0.003. For skin disease classification, FlexFair records an accuracy of 0.824 ± 0.004 for age DP and 0.824 ± 0.003 for gender DP. The low p values (<0.05) in these tables confirm the statistical significance of FlexFair’s superior performance, indicating that its improvements are not due to random variations.

FlexFair exhibits consistent improvements in fairness and performance while maintaining user privacy

FlexFair prioritizes user privacy through its FL framework, which ensures that sensitive data remains decentralized. By allowing individual users to retain control of their data, FlexFair mitigates the risks associated with data sharing and potential breaches. This decentralized approach not only aligns with privacy regulations but also fosters a collaborative environment for model training without compromising user confidentiality. The distributed training approach enables multiple clients to work together effectively, training a shared model while keeping their raw data secure and local.

In addition to its decentralized architecture, FlexFair incorporates a flexible regularization term that accommodates various fairness criteria, addressing the inherent variability in medical imaging tasks. This adaptability is crucial for reducing disparities in model performance across different demographic groups, ensuring that healthcare delivery is equitable. Through empirical validation across diverse datasets, FlexFair demonstrates its ability to maintain high accuracy and fairness while safeguarding user privacy, making it an ideal solution for sensitive applications in resource-constrained settings.
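As a rough illustration of what such a flexible regularization term might look like: the total objective combines a task loss with a weighted group-gap penalty, where the penalty can be swapped for the EA, DP, or EO gap. The exact loss used by FlexFair is not reproduced here; the penalty form, the weight `lam`, and the inputs below are our assumptions.

```python
# Illustrative sketch of a weighted regularized loss of the kind FlexFair's
# description suggests (assumed form, not the paper's exact loss).

def group_gap_penalty(per_sample_values, groups):
    """Max absolute deviation of any group's mean from the overall mean."""
    overall = sum(per_sample_values) / len(per_sample_values)
    gaps = []
    for g in set(groups):
        vals = [v for v, gg in zip(per_sample_values, groups) if gg == g]
        gaps.append(abs(sum(vals) / len(vals) - overall))
    return max(gaps)

def regularized_loss(task_loss, per_sample_preds, groups, lam=0.5):
    """Task loss plus a fairness penalty weighted by lam: a larger lam
    pushes the optimizer harder toward equalizing groups."""
    return task_loss + lam * group_gap_penalty(per_sample_preds, groups)

# Hypothetical per-sample predictions from two sensitive groups.
loss = regularized_loss(0.30, [0.9, 0.8, 0.5, 0.6], ["A", "A", "B", "B"], lam=0.5)
print(round(loss, 3))  # 0.375
```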

Discussion

FlexFair stands out by effectively integrating three critical fairness criteria: EA, DP and EO. These criteria are essential for ensuring that machine learning models provide equitable outcomes across diverse demographic groups while safeguarding user privacy, particularly in sensitive applications like healthcare. In terms of EA, FlexFair demonstrates significant superiority over existing methods. FairMixup emphasizes fairness and achieves better Pareto front results than some baseline methods, though it remains less competitive compared to FlexFair. However, FairMixup’s approach comes with a notable drawback: it compromises privacy, a crucial concern in FL scenarios. On three segmentation datasets, FairFed underperforms compared to FlexFair, suggesting that it is not specifically tailored for these scenarios. In terms of DP and EO, FairFed occasionally attains higher accuracy. Despite this, FairFed struggles to achieve a better trade-off between fairness and accuracy, highlighting that while it excels in specific conditions, FlexFair demonstrates more consistent and superior performance across various tasks. Also, FlexFair’s simple implementation allows for easy integration and plug-and-play use, making it accessible for a wide range of applications. By consistently achieving lower fairness gaps and higher performance metrics, FlexFair showcases its ability to address the inherent variability in medical imaging tasks while maintaining robust and accurate outcomes. This dual capability highlights FlexFair’s potential as a valuable tool in achieving both fairness and accuracy in FL settings.

FlexFair distinguishes itself from traditional FL and centralized learning methods by its ability to meet various fairness criteria while ensuring privacy. Traditional FL methods often struggle to guarantee group fairness, as this typically requires centralized access to raw data or features, which is challenging to achieve without compromising data privacy. FlexFair addresses these limitations by incorporating a flexible regularization term within an FL framework. While FairFed aims to enhance group fairness in FL, it struggles to achieve a better balance between fairness and accuracy compared to FlexFair. In conclusion, FlexFair enables decentralized data processing, preserves user privacy, and promotes equity, making it particularly vital for applications in healthcare where both fairness and privacy are critical.

Another key contribution of this work is the collection of a private dataset for the important yet data-scarce disease of cervical cancer. By gathering clinical data from multiple centres and establishing a comprehensive multi-centre data setting, we have validated FlexFair’s effectiveness in a real-world scenario. This dataset not only enhances the demographic diversity required for thorough evaluation but also supports the development of an efficient diagnostic model for cervical cancer. It underscores the practical applicability and potential of FlexFair to support equitable and precise diagnostic outcomes, ultimately contributing to improved healthcare delivery in multi-institutional and resource-constrained settings.

After consulting with several clinicians, we gathered valuable feedback on the FlexFair algorithm. These clinicians, with extensive experience in the collection and analysis of multi-centre clinical data, provided insightful input on the algorithm’s real-world applications. They unanimously agreed that FlexFair significantly improved both model fairness and accuracy while ensuring data privacy. The ability of FlexFair to address various fairness criteria, such as DP and EO, was particularly praised, as it is crucial for mitigating diagnostic bias across diverse demographic groups. Additionally, clinicians highlighted its exceptional performance in multi-centre data environments, showcasing its potential for widespread adoption across different medical institutions. They believe that applying this algorithm will substantially enhance the fairness and effectiveness of medical diagnostics, particularly in resource-limited healthcare settings.

In conclusion, FlexFair’s innovative integration of fairness criteria significantly enhances model performance while addressing critical privacy concerns in healthcare applications. By effectively balancing EA, DP, and EO, FlexFair offers a promising approach for promoting equitable outcomes in FL. Despite its advantages, FlexFair is not without limitations. One notable challenge is the potential communication overhead associated with FL frameworks. As model updates are transmitted between local devices and the central server, there may be delays that can impact the speed of convergence during training. Additionally, achieving group fairness often necessitates access to sensitive attributes, which complicates effective performance in scenarios where such labels are not available. Therefore, developing our method in an unsupervised manner is essential. Addressing these challenges will be crucial for maximizing FlexFair’s effectiveness in diverse real-world scenarios, ensuring that it continues to deliver on its promise of fairness and accuracy while upholding user privacy.

Methods

The private dataset for this retrospective study was created under a waiver of informed consent, as the institutional review boards determined that the retrospective design and use of de-identified data posed minimal risk to participants. We conducted the research with a strict commitment to fairness, transparency, and respect, ensuring that all data were meticulously handled and protected. Additionally, we upheld diversity, inclusivity, academic integrity, and ethical guidelines, with no conflicts of interest throughout the study. The ethical considerations underlying this work were rigorously reviewed and approved by the ethics committees of all participating institutions: the Ethics Committee of Sun Yat-sen Memorial Hospital of Sun Yat-sen University (SYSKY-2024-400-01), the Ethics Committee of Guangdong Maternal and Child Health Hospital (202401201), the Ethics Committee of Sun Yat-sen University Cancer Center (SL-G2023-231-01), and the Ethics Committee of Guangdong Province Traditional Medical Hospital (BE2023-146).

Private cervical cancer dataset collection

Cervical cancer

Cervical cancer is the fourth most common malignancy in women, with a 6.5% incidence and 7.7% mortality worldwide. In 2020, more than 340,000 deaths due to cervical cancer were reported worldwide, and the disease remains a significant threat to female health20. External beam radiation therapy (EBRT) and brachytherapy (BT) are the primary radiation modalities for locally advanced cervical cancer21. Accurate segmentation of clinical target volumes and organs at risk is a crucial step for EBRT and BT treatment options, as inaccuracies may result in either over-irradiation of normal tissues or insufficient radiation dose delivery to the tumor22. Magnetic resonance imaging (MRI)-guided treatment planning in EBRT and BT for cervical cancer demonstrates a significant advantage in tumor localization and assessment of tumor infiltration23,24. However, manual MRI image segmentation is a cumbersome process and may be inaccurate due to the inherent bias of radiation oncologists. This underlines the necessity of rapid and accurate automatic segmentation methods that would improve the workflow efficiency of clinicians and reduce variability in radiotherapy planning.

Cervical cancer dataset collection

The dataset is composed of multiple pre-treatment pelvic MRI scans of female patients at four institutions, i.e., Sun Yat-sen Memorial Hospital of Sun Yat-sen University (center A), Sun Yat-sen University Cancer Center (center B), Guangdong Province Traditional Medical Hospital (center C), and Guangdong Maternal and Child Health Hospital (center D). We define the inclusion criteria as follows: (1) age ≥18 years; (2) confirmed pathological diagnosis of cervical cancer. We define the exclusion criteria as follows: (1) previous history of chemoradiotherapy for cervical cancer; (2) tumors with a diameter of <5 mm that were invisible on MRI images; (3) image quality with severe artifacts affecting the subsequent analysis. The workflow is illustrated in Fig. 4. The MRI protocols for the four centers are as follows: center A includes an axial T2-weighted sequence (repetition time [TR], 3500 ms; echo time [TE], 129 ms; slice thickness, 5 mm; acquisition matrix, 384 × 269); center B includes an axial T2-weighted sequence (TR, 5100 ms; TE, 85 ms; slice thickness, 6 mm; acquisition matrix, 320 × 224); center C includes an axial T2-weighted sequence (TR, 5050 ms; TE, 72 ms; slice thickness, 5 mm; acquisition matrix, 256 × 320); center D includes an axial T2-weighted sequence (TR, 3000 ms; TE, 98 ms; slice thickness, 6 mm; acquisition matrix, 256 × 288).

Fig. 4: Multi-center cervical cancer dataset collection.

The cervical cancer dataset is collected across four medical centers with a detailed process outlining patient selection, exclusion criteria, and final cohort composition. From an initial pool of 1144 patients, individuals who meet the inclusion criteria (age ≥18 years and pathology-confirmed cervical cancer) and do not meet exclusion criteria (prior chemoradiotherapy, tumor diameter <5 mm, or severe motion artifacts in MRI) are included in the analysis. After applying these criteria, the final dataset comprises 89, 65, 278, and 246 patients from centers A, B, C, and D, respectively.

Cervical cancer dataset annotation

Radiologists used ITK-SNAP software (www.itksnap.org) to draw regions of interest (ROIs) around the tumor on T2W images to delineate the whole tumor volume. Discrepancies between the readers were resolved through consensus. These labeled ROIs are considered the ground truth data in training.

Dataset statistics

Polyp dataset

The polyp dataset comprises data from two distinct FL clients: CVC-30025 and Kvasir26, with sample sizes of 610 and 1000, respectively, as shown in Table 3. The table also displays the proportion of samples between these datasets, showing Kvasir constituting 62.1% and CVC-300 making up 37.9% of the total dataset. This notable discrepancy in data volume among groups underscores the imperative for models capable of adapting to diverse data sources while preserving accuracy amidst such imbalances.

Table 3 Datasets overview across four medical domains: polyp detection, fundus vascular segmentation, skin disease classification (stratified by age and gender), and cervical cancer diagnosis

Skin disease dataset

The skin disease dataset leverages the HAM-1000027 and BCN-2000028 datasets, known for their comprehensive annotations encompassing sensitive attributes such as age, gender, and skin type. After filtering samples without gender and age information, the sample sizes are 8819 for HAM-10000 and 7705 for BCN-20000. The gender distribution shows a slight male majority with 52.7% male and 47.3% female in HAM-10000, and a slight female majority with 48.7% male and 51.3% female in BCN-20000. The age distribution indicates that most samples (80.3% in HAM-10000 and 72.6% in BCN-20000) fall in the age group under 60 for both datasets. This dataset provides a robust foundation for evaluating the fairness and efficacy of skin disease detection algorithms across varied demographics.

Fundus vascular dataset

The fundus vascular dataset includes three primary datasets: CHASE-DB129, DRIVE30, and STARE31, with sample sizes of 28, 40, and 20, respectively. As illustrated in Table 3, these datasets are small, which presents significant challenges for training robust machine learning models owing to the risk of overfitting.

Cervical cancer dataset

As previously mentioned, the cervical cancer dataset is collected from four centers: center A, center B, center C, and center D, with sample sizes of 1383, 1332, 323, and 328 respectively. Table 3 illustrates the distribution of samples among these centers, with center A having the largest share at 41.1%.

Network architecture

We selected SANet32 as our foundational architecture due to its effectiveness in medical image segmentation. SANet addresses challenges across three key dimensions: image color, background noise, and foreground-background distribution. It uses data augmentation methods like random color swapping during training to focus on shape and structural information, rather than lesion color. Attention mechanisms are employed to suppress background noise, and a post-processing strategy balances the distribution of predicted results during inference. For segmentation tasks, SANet serves as the backbone model, while ResNet-5033 is used for classification tasks.

Experimental settings

In our comparative study, we assess and improve upon four FL methods, FedAvg3, FedProx4, SCAFFOLD6, and FedNova34, and two state-of-the-art group fairness methods, FairMixup10 and FairFed18. For each dataset and task, we ensure that all methods use the same learning rate, batch size, and number of epochs.

To achieve the fairness-accuracy trade-off, we adjust specific hyperparameters within each method. For FlexFair, we adjust the weight of the fairness penalty, while FedAvg remains unchanged as it is not designed for fairness. For FedProx, we adjust the penalty constant μ. Similarly, SCAFFOLD, FedNova, FairMixup, and FairFed have hyperparameter adjustments tailored to their frameworks, such as the local client stepsize ηt, client momentum factor ρ, penalty weight λ, and perturbation ratio β. These adjustments allow us to comprehensively evaluate and enhance each method, highlighting the trade-offs between fairness and accuracy.

Statistics & reproducibility

To comprehensively evaluate both the performance and fairness of each algorithm, all experiments were conducted under five different random seeds, and each method was further explored with multiple hyperparameter settings (e.g., μ in FedProx, fairness penalty λ in FlexFair and FairMixup, and local client stepsize ηt in SCAFFOLD). This strategy ensures a thorough search of the fairness-accuracy trade-off space and provides robust conclusions about each method’s sensitivity to initialization and parameter choices.

In the bar charts and tables (Tables 1 and 2), each bar or entry corresponds to the averaged result and associated standard deviation over five seeds at a particular hyperparameter setting. To establish the statistical significance of any observed performance differences, we conducted paired t-tests under the null hypothesis that no performance difference exists relative to our proposed method; p values below 0.05 indicate statistically significant improvements.
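The seed-wise significance test described above can be sketched as follows; the Dice scores below are invented for illustration and are not results from the paper:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical Dice scores over the five random seeds for a baseline
# method and for the proposed method (made-up numbers, paired by seed).
baseline = np.array([0.801, 0.795, 0.810, 0.788, 0.803])
proposed = np.array([0.823, 0.815, 0.829, 0.811, 0.820])

# Paired t-test: the null hypothesis is that the per-seed differences
# have zero mean; p < 0.05 is read as a significant improvement.
t_stat, p_value = ttest_rel(proposed, baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

A paired (rather than unpaired) test is appropriate here because both methods are evaluated under the same five seeds, so per-seed differences cancel out seed-level variance.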

By systematically varying hyperparameters and repeating experiments under multiple random seeds, we ensure that our comparisons and conclusions about fairness and accuracy are robust to minor changes in initialization and parameter tuning. Sample size selection was based on literature references and practical data considerations. No statistical method was used to predetermine the sample size. No data were excluded from the analyses. Furthermore, the experimental process did not involve randomization, and no blinding was performed.

Evaluation metrics

We utilize three fairness metrics, EA, DP, and EO, to assess the fairness of our methods. Specifically, we use EA to measure performance fairness in segmentation tasks, and DP and EO to measure group fairness in diagnostic tasks. Suppose that we have a set of data points \(\{({{\bf{x}}}_{i},{y}_{i})\}\) drawn from an unknown joint distribution over \({{\mathcal{X}}}\times {{\mathcal{Y}}}\), where \({{\mathcal{X}}}\) is a subset of \({{\mathbb{R}}}^{d}\). The attribute A represents a sensitive characteristic that should not influence decision-making. \({{{\mathcal{A}}}}_{k}\) represents the k-th condition set, satisfying \({\cup }_{k}{{{\mathcal{A}}}}_{k}={{\mathcal{A}}}\).

EA is a fairness metric used to address disparities in prediction accuracy across different groups. It quantifies unfairness by measuring the maximum difference in prediction accuracy between these groups. We specifically apply it to assess the gap in Dice performance for segmentation tasks. The formula for EA is shown in Equation (1). In this context, \({{{\mathcal{A}}}}_{k}\) represents distinct hospitals. A predictor \(\hat{Y}\) satisfies EA if it minimizes the maximum difference in prediction accuracy across these hospital groups.

$${F}_{{{\rm{EA}}}}={\max }_{k}| {{\rm{Dice}}}({{{\mathcal{A}}}}_{k})-\overline{{{\rm{Dice}}}}| .$$
(1)
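A minimal sketch of Equation (1), assuming the per-hospital Dice scores have already been computed (the scores below are hypothetical):

```python
import numpy as np

def equal_accuracy_gap(dice_per_hospital: np.ndarray) -> float:
    """EA metric: largest absolute gap between any hospital's Dice
    score and the mean Dice across hospitals (Equation (1))."""
    mean_dice = dice_per_hospital.mean()
    return float(np.max(np.abs(dice_per_hospital - mean_dice)))

# Hypothetical Dice scores for three hospitals (e.g. the three
# fundus centers); not actual results from the paper.
dice = np.array([0.78, 0.82, 0.74])
print(equal_accuracy_gap(dice))  # largest deviation from the mean
```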

DP ensures that a predictor \(\hat{Y}\) treats different sensitive attribute groups equally. Specifically, it requires that the prediction probabilities remain the same regardless of the value of the sensitive attribute A: \(P(\hat{Y}| A=0)=P(\hat{Y}| A=1)\). In other words, the model’s predictions should not be influenced by variations in the sensitive attribute. DP emphasizes group fairness, aiming to ensure that individuals within different groups based on sensitive features receive positive decisions at equal rates. To evaluate the fairness of a trained model f under the DP definition, we use a relaxed metric called FDP. This metric quantifies the difference between the expected predictions for different sensitive attribute groups, as shown in Equation (2). The goal is for FDP to approach zero, indicating that the model achieves DP. However, meeting strict DP requirements can lead to reduced prediction accuracy, especially for predictions (such as hobbies or expertise) where genuine differences exist between groups. As an alternative, we consider other fairness criteria to address these limitations.

$${F}_{{{\rm{DP}}}}={\max }_{k}| \Pr ({{{\rm{f}}}}_{{{\boldsymbol{\theta }}}}({{\bf{x}}})=1| {{{\mathcal{A}}}}_{k})-\Pr ({{{\rm{f}}}}_{{{\boldsymbol{\theta }}}}({{\bf{x}}})=1)| .$$
(2)
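Equation (2) can be estimated from binary predictions and sensitive-group labels as follows (a minimal sketch with made-up data):

```python
import numpy as np

def dp_gap(preds: np.ndarray, groups: np.ndarray) -> float:
    """F_DP (Equation (2)): largest gap between any group's
    positive-prediction rate and the overall positive rate."""
    overall = (preds == 1).mean()
    gaps = [abs((preds[groups == g] == 1).mean() - overall)
            for g in np.unique(groups)]
    return float(max(gaps))

# Toy example: binary model outputs and a binary sensitive attribute
# (e.g. gender); values are illustrative only.
preds = np.array([1, 0, 1, 1, 0, 0, 1, 0])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(dp_gap(preds, groups))
```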

EO requires that a predictor \(\hat{Y}\) produces correct positive predictions at equal rates across different sensitive attribute groups. Specifically, the EO violation metric is given by Equation (3). Unlike DP, EO considers the correlation between Y and A, allowing for variations in base rates across groups. In real-world applications, EO serves as a fairness criterion when accurate predictions are strictly required and decisions should reflect candidates’ qualifications. The goal is to minimize FEO while ensuring fairness across different sensitive attribute groups.

$${F}_{{{\rm{EO}}}}={\max }_{k}| \Pr ({{{\rm{f}}}}_{{{\boldsymbol{\theta }}}}({{\bf{x}}})=1| y=1,{{{\mathcal{A}}}}_{k})-\Pr ({{{\rm{f}}}}_{{{\boldsymbol{\theta }}}}({{\bf{x}}})=1| y=1)| .$$
(3)
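A sketch of Equation (3), which computes the same gap as DP but only over samples whose true label is positive, i.e. disparities in the true-positive rate (data are made up):

```python
import numpy as np

def eo_gap(preds: np.ndarray, labels: np.ndarray, groups: np.ndarray) -> float:
    """F_EO (Equation (3)): DP-style gap restricted to y = 1 samples,
    i.e. the largest true-positive-rate gap across sensitive groups."""
    pos = labels == 1
    preds, groups = preds[pos], groups[pos]   # keep positive-labeled samples
    overall = (preds == 1).mean()
    gaps = [abs((preds[groups == g] == 1).mean() - overall)
            for g in np.unique(groups)]
    return float(max(gaps))

# Illustrative binary predictions, ground-truth labels, and groups.
preds  = np.array([1, 0, 1, 1, 0, 0, 1, 1])
labels = np.array([1, 1, 1, 0, 1, 1, 1, 0])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(eo_gap(preds, labels, groups))
```

This mirrors the grouping-strategy change noted later for Example 2: each group \({{{\mathcal{A}}}}_{k}\) is intersected with {y = 1} before the gap is taken.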

Algorithm

In the context of collaborative training in FL, it is possible to create a high-performing global model that may inadvertently incorporate latent discriminatory biases against specific demographic groups in the dataset. To address this issue, we develop a weighted-variance-regularization approach that aims to enhance the model’s fairness while preserving prediction accuracy.

For group fairness assessment, various metrics have been proposed that quantify the disparity between the model’s performance on specific demographic groups and the average performance across all groups8,35,36,37,38,39,40,41,42. In this study, we use the maximum performance gap among different groups to gauge the fairness of the learned model. The maximum performance gap is defined as follows:

$$F({\mathcal{A}},\ell,{\bf{w}})\triangleq \mathop{\max }\limits_{k\in \{1,\ldots,K\}}| {R}_{k}({\boldsymbol{\theta }};{{\mathcal{A}}}_{k})-\bar{R}({\boldsymbol{\theta }},{\bf{w}};{\mathcal{A}})| \\ =\mathop{\max }\limits_{k\in \{1,\ldots,K\}}| {\mathbb{E}}[\ell ({{\rm{f}}}_{{\boldsymbol{\theta }}}({\bf{x}}),y)| {{\mathcal{A}}}_{k}]-{\mathbb{E}}[\ell ({{\rm{f}}}_{{\boldsymbol{\theta }}}({\bf{x}}),y)| {\mathcal{A}}]|,$$
(4)

where \({{\bf{w}}}\) represents a weight vector, \(\ell\) denotes a utility function, \({{{\mathcal{A}}}}_{k}\) represents the k-th group, and \({{\mathcal{A}}}={{{\mathcal{A}}}}_{1}\bigcup {{{\mathcal{A}}}}_{2}\bigcup \cdots \bigcup {{{\mathcal{A}}}}_{K}.\) Here \({R}_{k}({{\boldsymbol{\theta }}};{{{\mathcal{A}}}}_{k})\) and \(\bar{R}({{\boldsymbol{\theta }}},{{\bf{w}}};{{\mathcal{A}}})\) are empirical estimates of \({\mathbb{E}}[\ell ({{{\rm{f}}}}_{{{\boldsymbol{\theta }}}}({{\bf{x}}}),y)| {{{\mathcal{A}}}}_{k}]\) and \({\mathbb{E}}[\ell ({{{\rm{f}}}}_{{{\boldsymbol{\theta }}}}({{\bf{x}}}),y)| {{\mathcal{A}}}]\), respectively:

$${R}_{k}({{\boldsymbol{\theta }}};{{{\mathcal{A}}}}_{k})=\frac{{\sum }_{({{\bf{x}}},y)}\ell ({{{\rm{f}}}}_{{{\boldsymbol{\theta }}}}({{\bf{x}}}),y){\mathbb{I}}\{({{\bf{x}}},y)\in {{{\mathcal{A}}}}_{k}\}}{{\sum }_{({{\bf{x}}},y)}{\mathbb{I}}\{({{\bf{x}}},y)\in {{{\mathcal{A}}}}_{k}\}}$$
(5)

and

$$\bar{R}({{\boldsymbol{\theta }}},{{\bf{w}}};{{\mathcal{A}}})=\sum _{k=1}^{K}{w}_{k}{R}_{k}({{\boldsymbol{\theta }}};{{{\mathcal{A}}}}_{k}).$$
(6)
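Equations (5) and (6) can be sketched as follows, assuming per-sample losses are available and using group proportions as the weights w (one plausible choice, consistent with the weights used in Example 1):

```python
import numpy as np

def group_risks(losses: np.ndarray, groups: np.ndarray):
    """Equation (5): R_k is the mean loss over samples in group A_k.
    Also returns weights w_k set to each group's sample proportion."""
    group_ids = np.unique(groups)
    r = np.array([losses[groups == g].mean() for g in group_ids])
    w = np.array([(groups == g).mean() for g in group_ids])
    return r, w

# Hypothetical per-sample losses and group assignments.
losses = np.array([0.2, 0.4, 0.1, 0.9, 0.7])
groups = np.array([0, 0, 0, 1, 1])
r, w = group_risks(losses, groups)
r_bar = float(np.dot(w, r))  # Equation (6): weighted mean risk
print(r, w, r_bar)
```

With proportion weights, the weighted mean in Equation (6) coincides with the overall mean loss; other weightings (e.g. uniform across groups) would emphasize small groups more.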

Next, we show that both DP and EO can be written in the form of Equation (4).

Example 1 (DP)

The DP violation metric is given by

$${F}_{{\rm{DP}}} =\mathop{\max }\limits_{k\in \{1,\ldots,K\}}| \Pr ({{\rm{f}}}_{{\boldsymbol{\theta }}}({\bf{x}})=1| {{\mathcal{A}}}_{k})-\Pr ({{\rm{f}}}_{{\boldsymbol{\theta }}}({\bf{x}})=1)| \\ =\mathop{\max }\limits_{k\in \{1,\ldots,K\}}| {\mathbb{E}}[{\mathbb{I}}\{{{\rm{f}}}_{{\boldsymbol{\theta }}}({\bf{x}})=1\}| {{\mathcal{A}}}_{k}]-{\mathbb{E}}[{\mathbb{I}}\{{{\rm{f}}}_{{\boldsymbol{\theta }}}({\bf{x}})=1\}| {\mathcal{A}}]| .$$
(7)

Therefore, \({F}_{{{\rm{DP}}}}=F({{\mathcal{A}}},\ell,{{\bf{w}}})\) if \(\ell ({{{\rm{f}}}}_{{{\boldsymbol{\theta }}}},z)={\mathbb{I}}\{{{{\rm{f}}}}_{{{\boldsymbol{\theta }}}}({{\bf{x}}})=1\}\) and \({w}_{k}=\frac{{\sum }_{({{\bf{x}}},y)}{\mathbb{I}}\{({{\bf{x}}},y)\in {{{\mathcal{A}}}}_{k}\}}{{\sum }_{k}{\sum }_{({{\bf{x}}},y)}{\mathbb{I}}\{({{\bf{x}}},y)\in {{{\mathcal{A}}}}_{k}\}}.\)

Example 2 (EO)

The EO violation metric is given by

$${F}_{{\rm{EO}}} =\mathop{\max }\limits_{k\in \{1,\ldots,K\}}| \Pr ({{\rm{f}}}_{{\boldsymbol{\theta }}}({\bf{x}})=1| y=1,{{\mathcal{A}}}_{k})-\Pr ({{\rm{f}}}_{{\boldsymbol{\theta }}}({\bf{x}})=1| y=1)| \\ =\mathop{\max }\limits_{k\in \{1,\ldots,K\}}| {\mathbb{E}}[{\mathbb{I}}\{{{\rm{f}}}_{{\boldsymbol{\theta }}}({\bf{x}})=1\}| y=1,{{\mathcal{A}}}_{k}]-{\mathbb{E}}[{\mathbb{I}}\{{{\rm{f}}}_{{\boldsymbol{\theta }}}({\bf{x}})=1\}| y=1,{\mathcal{A}}]| .$$
(8)

Then, \({F}_{{{\rm{EO}}}}=F({{\mathcal{A}}},\ell,{{\bf{w}}})\) if we change the grouping strategy \(\left\{{{{\mathcal{A}}}}_{k},k=1,\ldots,K\right\}\) to \(\left\{{{{\mathcal{A}}}}_{k}\cap \{ \, y=1\},k=1,\ldots,K\right\}\).

To enable smooth training, we propose a variance-regularization approach that directly bounds the maximum performance gap and allows for easy extension to incorporate other types of fairness considerations. The regularization is designed as:

$${{\rm{Penalty}}}({{\boldsymbol{\theta }}},{{\bf{w}}};{{\mathcal{A}}})=\sum _{k=1}^{K}{\left[{R}_{k}({{\boldsymbol{\theta }}};{{{\mathcal{A}}}}_{k})-\bar{R}({{\boldsymbol{\theta }}},{{\bf{w}}};{{\mathcal{A}}})\right]}^{2}.$$
(9)

Note that \(F({{\mathcal{A}}},\ell,{{\bf{w}}})\le \sqrt{{{\rm{Penalty}}}({{\boldsymbol{\theta }}},{{\bf{w}}};{{\mathcal{A}}})}\le \sqrt{K}F({{\mathcal{A}}},\ell,{{\bf{w}}}),\) which implies that constraining the variance regularization is equivalent, up to a factor of \(\sqrt{K}\), to constraining the max performance gap. To facilitate optimization and implementation, we use Dice loss as \({R}_{k}({{\boldsymbol{\theta }}};{{{\mathcal{A}}}}_{k})\) in the segmentation task and cross-entropy loss in the classification task. This approach bears a similarity to the VREx41 method. However, our method is not limited to the task loss; \({R}_{k}({{\boldsymbol{\theta }}};{{{\mathcal{A}}}}_{k})\) can be adapted based on specific metrics. This flexibility allows our approach to be tailored to different fairness metrics and objectives.
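A minimal sketch of Equation (9) and its relation to the max performance gap, using hypothetical per-group risks; in training, this penalty would be added to the task loss with a tunable weight:

```python
import numpy as np

def variance_penalty(group_risks: np.ndarray, weights: np.ndarray) -> float:
    """Equation (9): sum of squared gaps between each group risk R_k
    and the weighted mean risk (Equation (6))."""
    r_bar = float(np.dot(weights, group_risks))
    return float(np.sum((group_risks - r_bar) ** 2))

# Hypothetical per-group risks and weights (weights sum to 1).
risks = np.array([0.30, 0.50, 0.40])
weights = np.array([0.5, 0.3, 0.2])
penalty = variance_penalty(risks, weights)

# Sanity check of the sandwich bound: F <= sqrt(Penalty) <= sqrt(K) * F.
max_gap = np.max(np.abs(risks - np.dot(weights, risks)))
K = len(risks)
assert max_gap <= np.sqrt(penalty) <= np.sqrt(K) * max_gap
print(penalty)
```

In a training loop, the overall objective would look like `task_loss + lam * penalty`, where `lam` is the fairness weight tuned in the experiments above; this composition is a sketch of the idea, not the paper's exact implementation.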

Our method is specifically designed for the FL context, addressing the unique challenges of FL environments. By incorporating various fairness criteria while ensuring privacy, our weighted-variance-regularization approach provides a robust framework for achieving equitable outcomes. It allows decentralized data processing, preserves user privacy, and promotes equity, making it particularly crucial for applications in healthcare and other sensitive fields.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.