Background & Summary

Bladder cancer (BCa) is the most common malignant tumor of the urinary system, with an incidence ranking tenth among malignant tumors worldwide1. The diagnosis and treatment of BCa involve various facets, such as imaging-based diagnosis, pathological image analysis, prognostic prediction, treatment planning, and research on molecular markers and genomics2. In diagnostic imaging studies of BCa, artificial intelligence (AI) techniques, particularly deep learning (DL) methods, have demonstrated efficacy comparable to that of experienced radiologists3,4. In those investigations, researchers use multi-center data to develop DL models, aiming to enhance their accuracy and adaptability, and use external datasets for validation. However, rising privacy concerns complicate the sharing of sensitive medical data between centers, making it increasingly difficult to collect extensive clinical data from numerous centers for the training of DL models5.

In 2016, Google introduced Federated Learning (FL)6 in response to data privacy concerns arising from multi-center data modeling. FL is a distributed machine learning paradigm that allows each center (acting as a “client”) to train models locally and then combines the local models into a global model. FL thus achieves the goal of jointly training a global model without exchanging local data. At present, the limited availability of multi-center, standardized BCa medical imaging datasets poses a significant challenge to the application and advancement of FL in this field.

In this study, we present a standardized multi-center BCa magnetic resonance imaging (MRI) dataset7, derived from real clinical scenarios. The dataset gathers data from four hospitals located in three cities, and data collection follows the same patient inclusion and exclusion criteria at every center. This variability in data sources, combined with differing characteristics such as scanning equipment and data volume, makes the dataset well suited to FL applications, as it captures the heterogeneity found in real-world clinical settings.

The dataset consists of 275 three-dimensional (3D) T2-weighted (T2W) MRI scans of 228 BCa patients, with each patient bearing one or more bladder tumors. Each tumor comes with a manually contoured lesion annotation and a muscle invasion label derived from pathological examination. BCa is typically categorized into non-muscle invasive bladder cancer (NMIBC) and muscle invasive bladder cancer (MIBC), depending on how deeply the tumor has grown into the bladder’s muscle wall. The two tumor types differ in treatment modalities, prognostic indicators, and survival8,9,10,11,12, making accurate preoperative prediction of muscle invasion crucial for the clinical management of BCa. Tumor segmentation also plays a critical role in clinical treatment, especially in radiation therapy-based oncology treatments. With the release of the dataset, the development of automated bladder tumor segmentation based on T2-weighted imaging (T2WI) can be significantly advanced.

To the best of our knowledge, existing FL medical image datasets are released through challenge competitions and are no longer accessible once the competition ends. Consequently, the introduction of an open-access multi-center MRI dataset carries significant implications for advancing FL research in medical image analysis. The dataset includes labels for tumor muscle invasion and tumor lesion annotations, making it possible to train both MIBC diagnostic models and automated tumor segmentation models. The diversity of the dataset’s labels provides a basis for investigating multi-task learning13 and mixed supervised learning14. Sourced from four different centers, the BCa dataset enables researchers to explore areas such as FL6, domain generalization15, and domain adaptation16.

To validate the use of the dataset for FL studies, we survey classical FL methods. These include FedAvg6, a pioneering approach in the field, and SiloBN17, an FL method that addresses disparate data distributions across centers through its handling of batch normalization layers. Additionally, we examine FedProx18, which is designed to manage data heterogeneity, and FedBN19, which keeps batch normalization layers local to mitigate feature shift across clients. Using these four FL methods, we construct corresponding baselines for the dataset on two tasks: diagnosing MIBC and automatically segmenting BCa.

Methods

Cohort

The dataset for this retrospective study was created under a waiver of informed consent, as the Ethics Committees determined the data to be non-sensitive and the study to pose minimal risk to participants. The waiver was approved in accordance with the recommendations of the Ethics Committee of Dongguan Hospital affiliated with Southern Medical University (KYKT2019-027), the Ethics Committee of Sun Yat-sen University Cancer Prevention and Treatment Centre (B2023-552-01), the Ethics Committee of Zhuhai Hospital affiliated with Jinan University (2024-KT-34), and the Ethics Committee of Fifth Affiliated Hospital of Sun Yat-Sen University (L011-1). We retrospectively collected bladder T2WI data and clinical information from Dongguan Hospital affiliated with Southern Medical University (center 1), the Sun Yat-Sen University Cancer Centre (center 2), the Zhuhai Hospital affiliated with Jinan University (center 3), and the Fifth Affiliated Hospital of Sun Yat-Sen University (center 4) between November 2019 and July 2022, including a total of 279 patients. All patients underwent radical cystectomy, partial cystectomy, or transurethral resection of bladder tumor within 2 weeks after multiparametric MRI scanning.

The inclusion criteria for this study are as follows: (a) patients who are untreated or have received only diagnostic transurethral resection of bladder tumor; and (b) patients with bladder cancer confirmed by radical or partial cystectomy or transurethral resection of bladder tumor within 2 weeks of the multiparametric MRI. The exclusion criteria are as follows: (a) no surgical treatment, so that the pathological T stage could not be obtained (11 tumors); (b) a histopathological type other than urothelial carcinoma (inverted papilloma in two tumors, leiomyoma in two tumors, adenocarcinoma in three tumors, glandular cystitis in two tumors, and mesenchymal tumor in one case); (c) tumor recurrence after BCa surgery (6 tumors); and (d) multiple tumors for which the corresponding pathological diagnosis results were lost (11 patients). The final dataset includes 228 patients. All patients are Asian owing to the geographical location of the hospitals. Table 1 presents the data characteristics of tumors at each center.

Table 1 Patient data characteristics.

MRIs

The T2W images are acquired on four MRI scanners, one at each of the four hospitals. This results in large data variability, owing to the different imaging protocols used on different machines as well as scanner changes and updates. In brief, all T2WI scans are performed at 3.0 T. The acquisition parameters are summarized in Table 2. The T2WI scans have high resolution (1 × 1 mm or finer) in the horizontal plane and a slice thickness (3–5 mm) typical of clinical practice.

Table 2 Scanning parameters of bladder cancer in four different centers.

The images are fully de-identified by removing all direct and indirect identifiers protected under HIPAA (Health Insurance Portability and Accountability Act). The original DICOM (Digital Imaging and Communications in Medicine) files are converted to Neuroimaging Informatics Technology Initiative (Nifti) format (nii.gz) using dcm2niix (https://github.com/rordenlab/dcm2niix) with the anonymization option. Another round of visual quality control is performed to ensure complete anonymization, including 3D reconstruction of each image to guarantee that individuals cannot be identified. The overall structure of the archive is represented in Fig. 1.

Fig. 1

Overall description of the archive. All images are anonymized in Nifti format. The itemized description of the metadata is recorded in “.xlsx” format.

Tumor annotations

The tumor annotations in the dataset are delineated on the T2WI images by an experienced radiologist (J.L., who has 14 years of work experience). These annotations are then reviewed and, if necessary, modified by another experienced radiologist (L.D. with 14 years of work experience). In instances of disagreement between the two radiologists, discussions are held until a consensus is reached, ensuring the quality of the annotations. Examples of T2WI and annotations are shown in Fig. 2.

Fig. 2

Example T2-weighted imaging (T2WI) images. Columns A and C show T2WI images from the four centers. Examples in column A are all NMIBC, while examples in column C are all muscle invasive bladder cancer (MIBC). Columns B and D show tumor annotations for images in columns A and C.

All patients have pathologically confirmed BCa. For patients who underwent transurethral resection of bladder tumor (TURBT), a piece of detrusor muscle tissue at the tumor base is also removed for histopathologic examination to evaluate detrusor muscle invasion. Pathologic specimens are obtained by TURBT in 222 tumors and by surgical resection in the other 57 tumors. Since each patient may have multiple tumors, the BCa dataset includes data for 275 tumors, with 160 tumors from Center 1, 48 from Center 2, 32 from Center 3, and 35 from Center 4. A total of 27 patients in the study cohort have multiple tumors.

Typical tumors from four centers are shown in Fig. 2. Each center utilizes unique MRI equipment and scanning parameters, as detailed in Table 2.

In the BCa dataset, to protect the privacy of patients, basic clinical information (e.g., gender and age) is not disclosed. Figure 3 presents the distribution of tumor characteristics among the four centers.

Fig. 3

(A) shows the image intensity distribution of the bladder region at each center. (B) shows the bladder tumor voxel distribution at each center. (C) shows the number of tumors at each center. (D) shows the distribution of NMIBC/MIBC at each center.

Data Records

The dataset7 is deposited in Zenodo (https://zenodo.org/records/10409145). Because the data are originally assembled under a waiver of patient consent, the dataset is released under a CC-BY license, allowing for open access and use with proper attribution. The data structure, format, and naming are shown as follows (Fig. 4):

Fig. 4

The dataset’s data structure, format, and nomenclature.

The process of our data collection and processing is illustrated in Fig. 1.

  1. Within the “FedBCa” directory, image data in the Nifti format (nii.gz) is sorted into four subdirectories, each corresponding to a different data collection center:

    “Center 1” stores T2W images collected from Center 1.

    “Center 2” stores T2W images collected from Center 2.

    “Center 3” stores T2W images collected from Center 3.

    “Center 4” stores T2W images collected from Center 4.

  2. Each “Center X” folder is meticulously organized to encompass two subfolders and an Excel spreadsheet.

    The “Image” subfolder is dedicated to storing T2W image data specific to the center, containing a collection of subject images in the Nifti format (nii.gz).

    The “Annotation” subfolder within each center’s directory contains the manual annotation data for the images, offering a precise delineation of tumors, also in the Nifti format (nii.gz).

    “Center_X_label.xlsx” records the filenames of the image data, their corresponding annotation filenames, and includes the pathological labels of MIBC.
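
As a rough illustration of how the directory layout described above could be traversed, the following Python sketch collects the image and annotation files of each center. It is a minimal sketch under stated assumptions: the helper name `list_center_files` is hypothetical, the glob pattern `Center*` is a guess at how the center folders are spelled on disk, and the spreadsheet is not read here because its column names are not given in the text.

```python
from pathlib import Path


def list_center_files(root):
    """Map each center folder name to its sorted image and annotation paths."""
    layout = {}
    # Assumption: center folders start with "Center" (e.g. "Center 1").
    for center_dir in sorted(Path(root).glob("Center*")):
        if not center_dir.is_dir():
            continue
        layout[center_dir.name] = {
            "images": sorted((center_dir / "Image").glob("*.nii.gz")),
            "annotations": sorted((center_dir / "Annotation").glob("*.nii.gz")),
        }
    return layout


if __name__ == "__main__":
    # "FedBCa" is the top-level directory named in the text; adjust to the local path.
    for center, files in list_center_files("FedBCa").items():
        print(center, len(files["images"]), len(files["annotations"]))
```

The pairing of each image with its annotation and MIBC label should be taken from “Center_X_label.xlsx” rather than inferred from file order.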

Technical Validation

Quality control for images and annotations

In this study, rigorous quality control is applied to MRI images and annotations. First, to ensure that the study population is sufficiently consistent in key characteristics, all MRI images are selected based on uniform inclusion and exclusion criteria. Second, each image undergoes quality assessment to ensure the absence of motion blur or artifacts and sufficient clarity to depict the details of the regions of interest. For image annotations, experienced radiologists are tasked with precise tumor localization and annotation. To guarantee the accuracy and consistency of annotations, a double-review process is employed, in which one radiologist performs the annotation and another experienced radiologist reassesses each annotation. We calculate inter-rater reliability using the Dice similarity coefficient, which indicates whether the same voxels are selected as part of the lesion mask. Comparing the annotations of the two radiologists for all 275 cases gives an inter-rater Dice coefficient of 0.870. We also calculate the intraclass correlation coefficient (ICC) for the lesion volumes. The ICC ranges from 0 to 1, where 1 indicates total agreement. The inter-rater ICC is 0.988. These quality control measures aim to enhance the validity and credibility of the dataset in bladder cancer diagnosis research.
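
For readers unfamiliar with the Dice similarity coefficient used to compare the two radiologists' masks, it is defined as 2|A ∩ B| / (|A| + |B|) for two binary masks A and B. The sketch below is only an illustration of that definition on flattened toy masks, not the authors' evaluation code; the function name and the convention for two empty masks are assumptions.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two flattened binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Assumption: two empty masks are treated as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: each mask marks three voxels, two of which overlap.
rater_1 = [0, 1, 1, 1, 0, 0]
rater_2 = [0, 0, 1, 1, 1, 0]
print(dice_coefficient(rater_1, rater_2))
```

In practice the masks would be 3D arrays loaded from the annotation files and flattened before comparison.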

Experimental verification in federated learning tasks

To assess the improvement in accuracy and generalization provided by FL, we use FL methods, centralized training (pooled data from the four centers), and single-center training to develop both an MIBC prediction model and an automated tumor segmentation model. To build FL baselines for the dataset, we survey classical FL methods.

These methods include FedAvg6, SiloBN17, FedProx18, and FedBN19, each with distinct algorithm designs and implementation details. FedAvg6 is a foundational algorithm that trains a global model across multiple clients while keeping data localized. FedAvg involves initializing a global model, performing local training on each client, sending model updates to the server, and averaging these updates to form a new global model, iterating until convergence. SiloBN17 addresses data heterogeneity in multi-center medical studies by using batch normalization (BN) layers whose activation statistics are kept local to each center. This approach results in a model that is jointly trained yet tailored to each center. SiloBN enhances robustness under varying data conditions while minimizing the risk of information leakage by avoiding the sharing of center-specific activation statistics. FedProx18 improves the handling of non-IID data by adding a proximal term to each client's local objective, which limits how far local updates drift from the global model. FedProx also allows for varying amounts of local work across devices. FedBN19 mitigates feature shift among heterogeneous clients by keeping BN layers local: BN parameters are excluded from aggregation, so each client retains normalization matched to its own feature distribution while all other parameters are shared.
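
To make the aggregation step concrete, the following Python sketch shows a sample-size-weighted FedAvg average and a FedProx-style proximal penalty on toy parameters. It is a simplified illustration under stated assumptions, not the FedBCa implementation: parameters are plain lists of floats rather than tensors, and the function names, the parameter name `layer.weight`, the client sizes, and the value of `mu` are all hypothetical.

```python
def fedavg(client_states, client_sizes):
    """Aggregate client parameters by a sample-size-weighted average."""
    total = sum(client_sizes)
    global_state = {}
    for name in client_states[0]:
        n_values = len(client_states[0][name])
        global_state[name] = [
            sum(state[name][i] * size for state, size in zip(client_states, client_sizes)) / total
            for i in range(n_values)
        ]
    return global_state


def proximal_penalty(local_state, global_state, mu):
    """FedProx-style term (mu / 2) * ||w_local - w_global||^2 added to the local loss."""
    squared_distance = sum(
        (w - g) ** 2
        for name in local_state
        for w, g in zip(local_state[name], global_state[name])
    )
    return 0.5 * mu * squared_distance


# Toy example with two clients holding 3 and 1 samples (hypothetical values).
client_a = {"layer.weight": [1.0, 2.0]}
client_b = {"layer.weight": [5.0, 6.0]}
new_global = fedavg([client_a, client_b], client_sizes=[3, 1])
penalty = proximal_penalty(client_a, new_global, mu=0.01)
print(new_global, penalty)
```

A FedBN-style variant would follow the same averaging but skip the entries that belong to batch normalization layers, leaving those values untouched on each client.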

We use these four FL methods to build corresponding baselines for the dataset in diagnosing MIBC and performing automatic segmentation of BCa. Subsequently, we compare the performance of these methods on the test set (Tables 3 & 4).

Table 3 Results of Classification Task for the Dataset.
Table 4 Results of segmentation tasks for the Dataset.

We conduct all experiments in a Python environment (version 3.8; https://www.python.org/) using PyTorch (version 1.13.1; https://pytorch.org/), with training performed on NVIDIA A100 GPUs. Our computing system is equipped with Intel Xeon Gold 6326 processors.

We adapt the preprocessing of bladder MR images to each task. Each slice of the 3D T2WI is cropped to uniform dimensions. For the classification task, the original T2WI slices are cropped to 128 × 128 patches centered on the tumor annotations. For the segmentation task, the cropping frame is set to 160 × 160. The cropping frame, centered on the annotations, is randomly offset by 10 to 15 pixels along the x and y axes. Figure 5 shows an overview of the experimental process.
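
The cropping step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' preprocessing code: slices and masks are nested Python lists, the frame is centered on the mask centroid, the direction of the 10 to 15 pixel shift is chosen at random (the text gives only its magnitude), and the frame is clamped so that it stays inside the slice. The function names are hypothetical.

```python
import random


def crop_origin(mask, size, rng, min_shift=10, max_shift=15):
    """Top-left corner of a size x size frame around the mask centroid, randomly shifted."""
    height, width = len(mask), len(mask[0])
    if size > height or size > width:
        raise ValueError("crop size exceeds the slice dimensions")
    rows = [r for r, row in enumerate(mask) for value in row if value]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        raise ValueError("mask contains no annotated pixels")

    def start(center, limit):
        # Assumption: the shift direction is random; only its magnitude is specified.
        shift = rng.choice((-1, 1)) * rng.randint(min_shift, max_shift)
        # Clamp so the frame never leaves the slice.
        return max(0, min(center + shift - size // 2, limit - size))

    return start(sum(rows) // len(rows), height), start(sum(cols) // len(cols), width)


def crop(slice_2d, top, left, size):
    """Cut a size x size patch out of a 2D slice."""
    return [row[left:left + size] for row in slice_2d[top:top + size]]


# Toy 256 x 256 mask with a 20 x 20 annotated block (hypothetical data).
mask = [[1 if 100 <= r < 120 and 120 <= c < 140 else 0 for c in range(256)] for r in range(256)]
top, left = crop_origin(mask, size=160, rng=random.Random(0))
patch = crop(mask, top, left, size=160)
print(top, left, len(patch), len(patch[0]))
```

The same origin would be applied to the image slice and its annotation so that the two patches stay aligned.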

Fig. 5

An overview of the experimental procedure. Each center acts as a client. In each round of communication, a certain percentage of clients are randomly selected to train the local model and send it to the server. The server aggregates the local models into a new global model and updates the clients' models.

We use image augmentation techniques, including horizontal and vertical flipping, image cropping, and affine transformations, to make fuller use of the available data. For model optimization, we use the Adam optimizer with a fixed learning rate of 1e-05. The cross-entropy loss function20 is adopted for classification tasks, while Dice loss21 is used for segmentation tasks. The batch size is set to 24, and training runs for 500 epochs. Considering the limited sample sizes from centers 2, 3, and 4, we select the U-Net22 network, which is effective with small datasets, as the backbone for the segmentation tasks. We select ResNet-5023, a well-regarded classification network, as the backbone for the classification tasks. For classification tasks, we randomly select 40% of the data from each center for testing. For segmentation tasks, a randomly selected subset of 30% of the patients from each center is used to assess model performance.
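
The per-center hold-out described above could be reproduced along the following lines. This is a hedged sketch, not the split shipped with the dataset: the function name, the seed, the rounding rule, and the placeholder case IDs are assumptions, and the per-center counts are the tumor counts reported in this paper.

```python
import random


def split_cases(case_ids, test_fraction, seed=0):
    """Randomly hold out a fraction of one center's cases for testing."""
    shuffled = sorted(case_ids)
    random.Random(seed).shuffle(shuffled)
    # Assumption: the test-set size is rounded to the nearest integer.
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Tumor counts per center as reported in the text; the case IDs are placeholders.
center_sizes = {"Center 1": 160, "Center 2": 48, "Center 3": 32, "Center 4": 35}
for center, n_cases in center_sizes.items():
    train_ids, test_ids = split_cases(range(n_cases), test_fraction=0.4)
    print(center, len(train_ids), len(test_ids))
```

For the segmentation task the same function would be applied to patient identifiers with `test_fraction=0.3`, so that all tumors of one patient fall on the same side of the split.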

To balance computational efficiency and model accuracy, the proportion of clients participating in federated aggregation per round is set to 0.5, meaning that approximately half of the clients participate in each global model aggregation. The number of local training epochs before each aggregation is set to 1. The batch size for local model training is set to 24. We use the area under the ROC curve (AUC) to evaluate the classification models and the Dice similarity coefficient (DSC) to evaluate segmentation performance.
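
Two of the settings above lend themselves to a short illustration: the random selection of half of the clients in each round, and the AUC used for evaluation. The sketch below is illustrative only and uses hypothetical function names and toy scores; the AUC is computed from its pairwise definition (the fraction of positive/negative pairs ranked correctly, with ties counted as one half), whereas a real evaluation would normally rely on an established library routine.

```python
import random


def select_clients(clients, fraction, rng):
    """Randomly pick the clients that take part in one communication round."""
    n_selected = max(1, round(fraction * len(clients)))
    return rng.sample(clients, n_selected)


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute the AUC")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


clients = ["Center 1", "Center 2", "Center 3", "Center 4"]
rng = random.Random(0)
for round_index in range(3):
    print(round_index, select_clients(clients, fraction=0.5, rng=rng))

# Toy labels (1 = MIBC, 0 = NMIBC) and predicted scores, not taken from the dataset.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

With four clients and a fraction of 0.5, two centers contribute to each aggregation.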

The classification task results for the dataset are presented in Table 3. Centralized training, which pools the training data of the four centers, achieves the highest AUC, with a mean value of 0.866. Among FL methods, SiloBN achieves the highest average AUC (0.849), followed by FedBN (AUC = 0.842). FedAvg and FedProx follow with AUCs of 0.839 and 0.824, respectively.

The prediction models trained on a single center achieve average AUCs ranging from 0.783 to 0.811. Among these, the model trained on Center 1 data achieves the highest diagnostic accuracy. Notably, the diagnostic accuracy of every single-center model is lower than that of the FL methods.

Centralized training achieves the highest automatic segmentation accuracy (DSC = 0.841), as detailed in Table 4. The model trained on data from Center 1 achieves the best single-center result (DSC = 0.770), which may be due to its larger data volume. All four FL methods outperform single-center training. Among them, FedProx achieves a segmentation accuracy (DSC = 0.840) second only to centralized training. FedBN and SiloBN show competitive performance, with DSCs of 0.837 and 0.831, respectively. Notably, the FL methods surpass single-center training not only in average DSC; the same trend is observed consistently at each center. Figure 6 presents the segmentation results of four typical cases from the dataset with different methods, indicating that the models trained by centralized training and FL segment more accurately.

Fig. 6

Four typical cases from the dataset. Each case includes the T2-weighted image, segmentation annotations (ground truth), and the predicted segmentation results.

It is worth noting that models trained at a single center do not always perform well on test data from their own center, in both classification and segmentation tasks. Analysis of the data suggests several reasons for this. First, each center’s dataset may not capture the full range of variability in the overall data distribution, leading to models that are overly specialized and fail to generalize well even within the same center. For example, the model trained at Center 1 has an AUC of 0.720 on its own test data but performs better on data from other centers, achieving an AUC of 0.900 on Center 3’s test data. Second, small sample sizes and data noise within each center can impair the model’s ability to learn robust features, leading to suboptimal performance. This is evident in the model trained at Center 4, which has an AUC of 0.750 on its own test data. These performance discrepancies highlight the challenges of single-center training and emphasize the advantages of centralized and federated learning approaches in developing more robust and generalizable models.

Usage Notes

The FAIR (Findable, Accessible, Interoperable, and Reusable) Principles24 have gained widespread adoption in open data management. Existing FL datasets, such as those used in the FeTS challenge (https://fets-ai.github.io/Challenge/) and the FL Breast Density Challenge (https://zenodo.org/records/6362204) from the MICCAI challenge, do not fulfill the “Accessible” principle after the competition.

We share a multi-center bladder T2WI dataset with labels for tumor muscle invasion and tumor lesion annotations, in alignment with the broad aim of the biomedical community to share FAIR data. Despite the inherent challenges it poses for image processing, image heterogeneity is an important feature of the dataset, as it helps ensure that tools developed using these images can be applied broadly. As shown in Fig. 3, BCa from different centers differs in grey value distribution, tumor size, tumor number, and NMIBC/MIBC ratio on T2WI. Sourced from four centers, the dataset proposed in this study facilitates research into FL6, domain generalization15, and domain adaptation16.

We have organized the data in accordance with the structure used in the Medical Segmentation Decathlon25, a popular medical image segmentation competition. To facilitate the sharing and replication of findings, we have segregated the data into training and testing sets. Additionally, we provide the code for FL model training, which can be accessed at https://github.com/MedcAILab/FedBCa. Our data are deposited in Zenodo (https://zenodo.org/), where they can be easily used by the AI community, and are organized in a user-friendly way to improve access for non-expert data analysts.

The dataset introduced in this study, being the first open-source multi-center bladder T2WI dataset, exhibits substantial research potential. In this study, we explore the use of this dataset for FL studies. The strength of FL lies in its ability to train a global model that surpasses the diagnostic accuracy and generalization performance of a model trained at a single center, while preserving data privacy. Our findings support these advantages. To facilitate such studies, we build an FL model training framework, FedBCa (https://github.com/MedcAILab/FedBCa), based on PyTorch (https://pytorch.org/). Because preprocessed image data are provided, users only need to adjust the data paths within the code. The FedBCa framework is a publicly available, user-friendly FL training tool that comes with detailed user instructions and code for four classical FL methods.