Introduction

Prostate cancer (PCa) is the third most commonly diagnosed cancer globally and a leading cause of cancer-related deaths among males1,2,3,4,5,6,7. The diagnostic pathway typically starts with measuring prostate-specific antigen (PSA) levels8,9,10,11,12,13,14,15,16,17,18,19. If PSA levels are elevated, clinicians may proceed with additional assays, such as 4Kscore20,21,22,23,24,25,26, PCA327,28,29,30,31,32,33, or phi34,35,36,37,38,39, followed by imaging studies like MRI to identify potential PCa lesions40,41,42,43,44,45. In cases where abnormalities are detected, biopsies are often performed for histopathological and genomic evaluation. This diagnostic approach is critical for determining PCa prognosis, especially in men under active surveillance, which has significantly increased in the past decade46,47,48,49. However, overdiagnosis remains a significant concern, contributing to adverse physical and psychological health outcomes50. Therefore, improving diagnostic accuracy is essential to enhance patient outcomes.

Machine learning (ML) has gained traction in recent years for automating various stages of PCa diagnosis and prognosis51. ML algorithms, trained on data from tissue samples, medical images, and clinical information, can detect patterns associated with the disease51,52,53,54,55,56,57. Particularly effective in the automated analysis of MRI scans and biopsy slides, ML models are being used to identify suspicious prostate areas with greater precision than traditional manual methods58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78. Technologies such as PathAI, PaigeAI, and Deciplex convert digital histology into data that can differentiate between cancerous and non-cancerous areas79,80,81,82,83.

Despite these advances, current technologies face limitations, primarily due to the reliance on clinical trial datasets for training84,85,86,87. Clinical trial data often represent specific patient groups and disease stages, limiting the generalizability of models to the broader population81,88,89,90,91. Additionally, inconsistencies in staining, slide preparation, and imaging resolution introduce variability that can negatively impact ML model performance92,93. Moreover, ethical concerns arise from the use of sensitive patient data, which must be handled in accordance with strict privacy guidelines.

To address these challenges, synthetic data generation has emerged as a complementary strategy to expand training datasets while preserving patient privacy94. Our previous study demonstrated that GANs and other generative models can create histopathological images that capture critical morphological features for training AI models94. While prior research has successfully used synthetic data augmentation in medical imaging and general histopathology, its application to Gleason grading for prostate cancer remains underexplored95. Furthermore, the integration of synthetic data into AI-driven pathology models has largely focused on image augmentation rather than standalone training and validation strategies95,96, highlighting the need for further investigation.

To address these challenges, we propose reducing reliance on clinical trial data by generating synthetic data. This study presents a novel approach using customized generative adversarial networks (GANs) to create synthetic histopathological images from a small set of original radical prostatectomy and needle biopsy images. The synthetic images were analyzed using spatial heterogeneous recurrence quantification analysis (SHRQA) modules to assess their granularity relative to real images. We then trained a convolutional neural network (CNN, EfficientNet) on these synthetic image patches and evaluated its grading performance against a CNN trained on real image patches. For validation, we utilized histology images from various sources, including the TCGA, the PANDA Challenge83, and the MAST trial.

The CNN model trained on synthetic data significantly improved grading accuracy compared to models trained solely on original data. These findings demonstrate the potential of this method to overcome data scarcity in AI model training, offering a promising solution for PCa diagnostics in clinical settings.

Results

Network architecture and model selection

We evaluated four CNN models using 20 randomly selected histology images from the TCGA Prostate Adenocarcinoma dataset. AlexNet, the first widely adopted CNN architecture, consisting of a series of convolutional and pooling layers and a fully connected output layer97, achieved 55% accuracy when compared with pathologists’ assessments and TCGA scoring. ResNet, which incorporates residual connections (skip connections) that substantially improve on AlexNet98, also achieved 55% accuracy on the same images. The Xception model, which builds on the Inception architecture and separates spatial and depthwise operations into different dimensions in its convolutional layers99, improved accuracy to 60%. EfficientNet, which utilizes a novel scaling method that employs a compound coefficient to scale up a CNN model in a more structured manner100,101, outperformed all other models with 65% accuracy. Based on these results, EfficientNet was selected for this study (Supplementary Fig. 1, Supplementary Table 1).

Image preprocessing

To address variability in staining protocols, tissue quality, and section thickness102, pre-processing normalization was applied to the datasets. For the TCGA dataset (500 images), 21 outliers with mean RGB intensity beyond two standard deviations were excluded. Similarly, one outlier was removed from 32 radical prostatectomy section images from the University of Miami. For 3949 needle biopsy slides from the PANDA challenge83, area evaluation and color normalization excluded 257 images. These steps minimized variability and ensured a more consistent dataset for AI model predictions (Table 1).
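For illustration, the outlier-exclusion rule described above can be sketched as follows. This is a minimal Python example, not the exact pipeline code; the thumbnail loading and directory layout are assumptions.

```python
# Minimal sketch of the outlier-exclusion step: slides whose mean RGB
# intensity falls more than two standard deviations from the cohort mean
# are dropped. Paths and thumbnail loading are illustrative only.
from pathlib import Path

import numpy as np
from PIL import Image

def mean_rgb_intensity(path: Path) -> float:
    """Mean intensity across all pixels and RGB channels of a slide thumbnail."""
    with Image.open(path) as img:
        arr = np.asarray(img.convert("RGB"), dtype=np.float32)
    return float(arr.mean())

def exclude_intensity_outliers(slide_paths, n_std: float = 2.0):
    """Return (kept, excluded) slide paths based on a 2-SD intensity cutoff."""
    intensities = np.array([mean_rgb_intensity(p) for p in slide_paths])
    mu, sigma = intensities.mean(), intensities.std()
    keep_mask = np.abs(intensities - mu) <= n_std * sigma
    kept = [p for p, k in zip(slide_paths, keep_mask) if k]
    excluded = [p for p, k in zip(slide_paths, keep_mask) if not k]
    return kept, excluded

# Example usage (hypothetical directory of slide thumbnails):
# kept, excluded = exclude_intensity_outliers(sorted(Path("thumbnails").glob("*.png")))
```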

Table 1 Similarity index values returned from the 10-fold cross-validation run for RP images (left) and needle biopsy images (right)

Quality-controlled annotation and patch generation

RP sections from TCGA and UM were divided into pseudo-cohorts based on Gleason patterns and analyzed with HistoQC103 to exclude low-quality samples. Of 143 randomly reviewed sections, only 33 with agreement between two pathologists and TCGA scoring were selected for gold standard training data. Using PyHIST, 96 × 96 and 256 × 256 pixel patches were generated, each containing at least 75% tissue and no more than 25% whitespace. The minimum patch size was set at 96 pixels, matching the smallest overlapping annotation. From 33 images, 219 patches were created and augmented to 2082 patches (Fig. 1A).
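The tissue-content criterion used for patch selection can be expressed as a simple filter, sketched below. The near-white threshold used to estimate whitespace is an assumption; only the 75% tissue / 25% whitespace requirement comes from the text.

```python
# Minimal sketch of the patch-filtering criterion: a patch is kept only if at
# least 75% of its pixels are tissue (i.e., no more than 25% near-white
# background). The 220-intensity whitespace threshold is an assumption.
import numpy as np

def tissue_fraction(patch: np.ndarray, white_thresh: int = 220) -> float:
    """Fraction of pixels that are not near-white background in an RGB patch."""
    is_white = np.all(patch >= white_thresh, axis=-1)
    return 1.0 - is_white.mean()

def keep_patch(patch: np.ndarray, min_tissue: float = 0.75) -> bool:
    return tissue_fraction(patch) >= min_tissue

# Example: a synthetic 96 x 96 patch that is half tissue-like, half white.
demo = np.full((96, 96, 3), 255, dtype=np.uint8)
demo[:, :48] = 150            # left half: mid-gray "tissue"
print(tissue_fraction(demo))  # ~0.5
print(keep_patch(demo))       # False (below the 75% tissue requirement)
```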

Fig. 1: Overview of synthetic image generation workflow from prostate cancer histology.
figure 1

A Illustration of the pipeline used to generate synthetic images from prostate cancer digital histology. Images were preprocessed with PyHIST and HistoQC. Those that passed QC were given to pathologists for scoring and then cut into small patches for modeling. B Original and synthetic images generated for each primary Gleason pattern (3, 4, and 5).

GAN model selection

GANs are a class of deep learning models used to generate synthetic images by learning patterns from real data through an adversarial training process. A GAN consists of two neural networks: a generator, which produces synthetic images, and a discriminator, which differentiates between real and generated images. The networks are trained iteratively, with the generator improving its ability to create realistic images while the discriminator refines its capacity to distinguish them. Various GAN architectures have been developed to optimize image synthesis for different applications. In this study, three GAN variants—conditional GAN (cGAN), StyleGAN, and deep convolutional GAN (dcGAN)104,105,106 were evaluated. cGAN incorporates labeled input data to guide the generation process, allowing for controlled image synthesis based on predefined categories. StyleGAN introduces a style-based generator that enhances image quality and provides greater control over fine-grained image features. dcGAN, a simpler and computationally efficient architecture, utilizes convolutional layers to improve image synthesis while maintaining relatively low training complexity.

To select the optimal GAN architecture, the 2082 RP image patches were divided by Gleason pattern (GS3, GS4, GS5) and evaluated with each GAN model. Each GAN generated 1000 synthetic images, which were classified using a CNN. StyleGAN and dcGAN showed similar accuracies (0.65 and 0.64, respectively), but dcGAN was significantly faster (901 min vs. 2372 min for 1000 images). Based on these results, we selected dcGAN to generate synthetic images for training. Using the Adam optimization algorithm, the optimal training length was determined to be 14,000 iterations. We generated synthetic images at 128 × 128 and 256 × 256 pixel sizes and performed a manual quality control (QC) assessment on a randomized set of images; board-certified pathologists approved 80% of the images.

For needle biopsy analysis, 300 annotated biopsies produced 1712 cancer-specific patches (752 GS3, 592 GS4, and 368 GS5) and 539 benign tissue patches (256 × 256 pixels). Various patch sizes (from 512 × 512 down to 32 × 32) were tested, with CNN accuracy dropping significantly below 64 × 64 pixels. We therefore selected 64 × 64 patches, upscaled to 256 × 256, for optimal performance. GAN training on a single NVIDIA T1000 GPU generated 1000 synthetic images in 2.5 h, with the loss functions balancing at approximately 50,000 epochs (Supplementary Fig. 2). A repository of 2000 patches was created for tumor and normal samples (Fig. 2). Pathologists manually assessed these images, with an 80% approval rate for adequacy of patch size.
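The upscaling of the selected 64 × 64 biopsy patches to 256 × 256 can be sketched as follows; the choice of bicubic resampling is an assumption, not necessarily the resampling method used in the pipeline.

```python
# Minimal sketch of the patch rescaling used for needle biopsies: 64 x 64
# patches are upscaled to 256 x 256 before downstream training/inference.
# Bicubic resampling is an assumption.
from PIL import Image

def upscale_patch(path: str, size: int = 256) -> Image.Image:
    with Image.open(path) as img:
        return img.convert("RGB").resize((size, size), resample=Image.BICUBIC)

# Example usage (hypothetical file):
# big = upscale_patch("biopsy_patch_64.png")
# big.save("biopsy_patch_256.png")
```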

Fig. 2: GAN training and synthetic biopsy image generation pipeline.
figure 2

A Workflow for the needle biopsy images used to develop the GAN training database. Images were normalized, fed into the GAN, and then assessed for quality. B Example original and synthetic histology images generated for prostate cancer needle biopsies.

Benchmarking

To determine the optimal number of synthetic images for EfficientNet training, we fine-tuned the model with synthetic data generated by dcGAN and assessed overfitting using the similarity index and Frechet Inception Distance (FID)107. We generated synthetic images based on Gleason patterns in batches of 10 K, 50 K, and 100 K. FID scores for RP images were 25.1 (10 K), 18.8 (50 K), and 36.2 (100 K). The 50 K batch produced the best results, with accuracy increasing from 33% to 70% for GS3, 35% to 65% for GS4, and 41% to 71% for GS5, stabilizing beyond 50 K images. Thus, we integrated a maximum of 50 K synthetic images into the model.

To further avoid overfitting, we applied 10-fold cross-validation, randomly organizing all real and synthetic image tiles. Accuracy remained consistent, ranging from 0.68 to 0.71 for GS3, 0.63 to 0.70 for GS4, and 0.66 to 0.73 for GS5, with similarity index values between 0.8 and 1.0 (Fig. 2, Table 1). This process confirmed the quality and variability of the synthetic images.

For needle biopsies, FID scores were 21.2 (10 K), 20.2 (50 K), and 33.7 (100 K). Cross-validation of 1000 samples yielded stable accuracy across Gleason scores (GS3: 0.67–0.71, GS4: 0.63–0.70, GS5: 0.65–0.73) with similarity index values ranging from 0.92 to 1.0 (Table 1), confirming the quality of synthetic needle biopsy images (Supplementary Fig. 2).

Quality evaluation of synthetic images

Next, we sought to address the extent of technical variations that exist between the synthetically generated images and the original images. This examination is crucial because each Gleason pattern possesses specific morphological characteristics108,109 that synthetically generated images should ideally replicate. For this, we utilized SHRQA, a robust technique capable of measuring complex microstructures based on spatial patterns110,111,112. The SHRQA process, as shown and detailed in Supplementary Fig. 3, involves six key steps111,112,113,114,115,116,117,118. It begins with the application of the 2D-Discrete Wavelet Transform (2D-DWT) using the Haar wavelet to reveal patterns that are not apparent in the original image111,112,113,114,115,116. Subsequently, each image is converted into an attribute vector using the Space-Filling Curve (SFC), which plays a crucial role in maintaining the spatial proximity between pixels within the vector. This step is essential for representing the image’s geometric recurrence in vector form. The attribute vector is then projected into state space, forming a trajectory that emphasizes the geometric structure of the image. To further analyze spatial transition patterns, Quadtree segmentation is employed to divide the state space into distinct subregions117,118. Following this, an Iterated Function System projection is applied, transforming each attribute vector into a fractal plot that captures recurrence within the fractal topology. Finally, the fractal structures are quantified to provide insights into the geometric properties of the image. In one of our earlier studies, we used the SHRQA method to quantify synthetically generated image patches from eight different genitourinary organs, including the testis, kidney, prostate, bladder, vagina, cervix, ovary, and uterus94. Following a similar approach, the SHRQA method was applied to examine the spatial recurrence properties of real and synthetic image patches across four Gleason patterns (GS0/3/4/5) in the RP section and needle biopsies.
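The first SHRQA step, a one-level 2D-DWT with the Haar wavelet, can be illustrated with PyWavelets as below. This is a minimal sketch of the decomposition only; the subsequent space-filling-curve, quadtree, and iterated-function-system steps are not shown, and the random input image is purely illustrative.

```python
# Minimal sketch of the first SHRQA step: a one-level 2D discrete wavelet
# transform with the Haar wavelet, decomposing a grayscale patch into an
# approximation sub-image and three detail sub-images. Uses PyWavelets.
import numpy as np
import pywt

def haar_dwt2(patch_gray: np.ndarray):
    """One-level 2D-DWT; returns approximation and (horizontal, vertical, diagonal) details."""
    cA, (cH, cV, cD) = pywt.dwt2(patch_gray.astype(np.float32), "haar")
    return cA, cH, cV, cD

# Example on a random 256 x 256 "patch"; each sub-image is 128 x 128.
rng = np.random.default_rng(0)
cA, cH, cV, cD = haar_dwt2(rng.uniform(0, 255, size=(256, 256)))
print(cA.shape, cH.shape, cV.shape, cD.shape)
```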

For RP, our sample set included an equal number of patches from real and synthetic sources, with balanced representation of each Gleason pattern. We analyzed 4000 image patches, each 256 × 256 pixels, evenly split between real and synthetic. A one-level 2D-DWT with the Haar wavelet decomposed each image into four sub-images, revealing fine details, and SHRQA quantitatively characterized each patch’s microstructures. From an initial extraction of 1997 spatial recurrence features per patch, LASSO119 selected 1819 features as significant to the Gleason pattern. Hotelling’s T-squared test was used to compare the spatial recurrence attributes of real and synthetic patches. A p-value of 0.8991 indicated no significant difference in spatial recurrence properties between real and synthetic patches, as confirmed by the T-squared tests for each Gleason pattern (Table 2, left). We also applied PCA to the spatial recurrence properties120,121,122,123,124,125,126 and visualized the results using radar charts, revealing that the first six principal components capture 82% of the variability. This allowed us to map the distributions of spatial properties for real and synthetic images across Gleason patterns (Fig. 3). Notably, while distributions for the same Gleason pattern aligned closely between real and synthetic images, significant differences were evident across different patterns.
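The two-sample Hotelling’s T-squared comparison can be sketched as below. This is a standard formulation rather than the authors’ exact implementation; a pseudo-inverse is used for the pooled covariance in case it is singular (which can occur when the number of features approaches the number of patches), and the example data are random.

```python
# Minimal sketch of a two-sample Hotelling's T-squared test, as used to compare
# spatial recurrence feature vectors from real vs. synthetic patches.
import numpy as np
from scipy.stats import f as f_dist

def hotelling_t2_two_sample(X: np.ndarray, Y: np.ndarray):
    """X: (n1, p) features for real patches; Y: (n2, p) for synthetic patches."""
    n1, p = X.shape
    n2, _ = Y.shape
    diff = X.mean(axis=0) - Y.mean(axis=0)
    S_pooled = ((n1 - 1) * np.cov(X, rowvar=False) +
                (n2 - 1) * np.cov(Y, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.pinv(S_pooled) @ diff
    # Convert T^2 to an F statistic with (p, n1 + n2 - p - 1) degrees of freedom.
    df2 = n1 + n2 - p - 1
    f_stat = t2 * df2 / (p * (n1 + n2 - 2))
    p_value = 1.0 - f_dist.cdf(f_stat, p, df2)
    return t2, p_value

# Example with 5 features so the F degrees of freedom stay positive.
rng = np.random.default_rng(1)
real = rng.normal(size=(40, 5))
synthetic = rng.normal(size=(40, 5))
print(hotelling_t2_two_sample(real, synthetic))
```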

Fig. 3: Comparison of spatial recurrence features in real and synthetic radical prostatectomy images.
figure 3

A The distributions of spatial recurrence properties (in the first six principal components (PCs), which contain 82% of the data variability) underlying different Gleason scores for both real and synthetic patches from radical prostatectomy. The purple lines indicate the mean values of each feature, and the gray area shows the 95% confidence interval. While the distributions of spatial properties are closely aligned between real and synthetic images under the same Gleason score, they differ markedly when comparing different Gleason scores. B Comparison of spatial recurrence properties between real and synthetic images on the first six PCs (containing 82% of the data variability). The distributions of these PCs are similar between real and synthetic images.

Table 2 Hotelling’s T-squared Two-sample Test: comparisons of spatial recurrence properties between real and synthetic images under different Gleason patterns for radical prostatectomy (RP: left) and needle biopsies (NB: right)

A similar analysis was conducted for the needle biopsy (NB) sections (Fig. 4), examining 1600 patches with an equal split between real and synthetic images distributed evenly across GS0, GS3, GS4, and GS5. SHRQA extracted 2585 initial features, with LASSO identifying 1578 as significant. Hotelling’s T-squared tests in the NB section corroborated the RP results (Table 2, right), demonstrating that synthetic images reliably replicate the geometric nuances of real images for each Gleason pattern. These findings across both RP and NB sections validate the model’s efficiency in capturing geometric intricacies consistent with real images.

Fig. 4: Validation of spatial recurrence consistency in synthetic needle biopsy images.
figure 4

A The distributions of spatial recurrence properties (in the first 16 principal components (PCs), which contain 80% of the data variability) underlying different Gleason scores for both real and synthetic patches from needle biopsies. The purple lines indicate the mean values of each feature, and the gray area shows the 95% confidence interval. While the distributions of spatial properties are closely aligned between real and synthetic images under the same Gleason score, they differ markedly when comparing different Gleason scores. B Comparison of spatial recurrence properties between real and synthetic images on the first eight PCs (containing 70% of the data variability). The distributions of these PCs are similar between real and synthetic images.

Furthermore, randomized sets of synthetic image patches assigned to Gleason patterns 3, 4, and 5 by the SHRQA quantification models were cross-validated through characterization by pathologists. The SHRQA models accurately identified specific features associated with each phenotype: uniform glandular arrangement, moderate differentiation, fibrous stroma, mild nuclear atypia, clear lumina, and minimal desmoplasia for Gleason 3; moderate to severe nuclear atypia, poor differentiation, irregular and cribriform patterns, and inflammatory infiltrate for Gleason 4; and sheets of cells, scant stromal tissue, undifferentiated tumor cells, severe nuclear atypia, and complete loss of glandular structure for Gleason 5 (Fig. 5A–C).

Fig. 5: SHRQA-derived granular features distinguishing Gleason patterns in synthetic images.
figure 5

Distributions of granular features associated with (A) Gleason pattern 3, (B) Gleason pattern 4, and (C) Gleason pattern 5, as identified by SHRQA quantification and verified by pathologists.

Validation of CNN performance post-training with enhanced synthetic data

Next, to determine whether synthetic images could substitute for original images in training a CNN model, we trained the CNN with two sources of image patches. The first source was patches derived from original RP sections from TCGA and in-house images, classified according to Gleason pattern (normal (n = 175), GS3 (n = 726), GS4 (n = 1029), and GS5 (n = 152)). The second source combined original and synthetic images, with 5000 image patches used for each Gleason pattern. The grading capabilities of the two CNN models were then compared by having each assign Gleason scores to the RP images in the TCGA (n = 475). The CNN’s accuracy improved significantly in GS3 from 0.53 to 0.67 (p = 0.0010), in GS4 from 0.55 to 0.63 (p = 0.0274), and in GS5 from 0.57 to 0.75 (p < 0.0001) when trained with the combination of original and synthetic image patches rather than original image patches alone. Moreover, the comparative analysis revealed a notable enhancement in accuracy and the receiver operating characteristic (ROC) curve for the combined (original and synthetic) dataset relative to the original (p = 0.0381). Additionally, the in-house RP images (n = 24) were graded using the CNN models described above, and the results demonstrated a significant improvement in the model’s ability to assign the correct grade.

Furthermore, we extended the validation of the CNN model to needle biopsies. As with the RP sections, we trained the CNN model with two sources of image patches. The first source was patches derived from original needle biopsy sections, classified according to Gleason pattern (normal (n = 539), GS3 (n = 610), GS4 (n = 890), and GS5 (n = 212)). The second source combined original and synthetic images, classified according to Gleason pattern, with a total of 2000 image patches used for each Gleason pattern. The grading capabilities of the two CNN models were then compared by having each assign Gleason grades to the needle biopsy images (n = 3649). The comparative analysis revealed a significant enhancement in accuracy and the ROC for the combined dataset relative to the original (Fig. 6). Specifically, the enhancement was consistent across both benign and malignant samples, with overall accuracy increasing from 91% when using the original training images to 95% when using original and synthetic images combined (p = 0.0402). The combined original and synthetic training database yielded a sensitivity of 0.81 and a specificity of 0.92.
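The metrics reported here (accuracy, sensitivity, specificity, and ROC-AUC) can be computed with scikit-learn as sketched below for a binary benign-vs-malignant comparison; the labels and scores in the example are synthetic and purely illustrative.

```python
# Minimal sketch of the validation metrics reported above, using scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def summarize_performance(y_true, y_score, threshold: float = 0.5):
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "roc_auc": roc_auc_score(y_true, y_score),
    }

# Example with simulated predictions (1 = malignant, 0 = benign).
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=200), 0, 1)
print(summarize_performance(y_true, y_score))
```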

Fig. 6
figure 6

ROC curves showing the cumulative improvement in accuracy of the synthetic + original training set over the original dataset for RP (left) and needle biopsies (right); p < 0.05 in both cases.

Moreover, to confirm the efficiency of the validated AI model, we used it to evaluate images derived from MAST trial samples. While the MAST trial analysis focused on baseline data, it provides a unique real-world validation setting for AI applications in active surveillance workflows. The AI model’s output was compared with the pathologists’ grading. Results showed an overall accuracy of 87% for Gleason grading, with sensitivity and specificity of 81% and 92%, respectively. Cross-validation of the final model yielded consistent performance, with mean accuracy across folds varying by less than 1.5 percentage points (95% CI, ±1.5%). The improved accuracy across datasets highlights the potential of synthetic data for enhancing model robustness and reproducibility (Fig. 4).

Discussion

The integration of AI-based tools into cancer diagnostics, particularly for prostate cancer (PCa), holds significant promise. Over the past decade, numerous image analysis algorithms have been developed to assist with tumor grading, classification, and metastasis detection127,128,129,130,131,132,133,134,135,136,137. Despite their potential, many AI tools face substantial barriers to widespread clinical adoption138. These challenges stem primarily from biases in training data, the inability to generalize across diverse populations, and technological constraints such as model drift139,140. This study sought to address these limitations by employing GANs to generate synthetic histopathological data, supplementing real-world datasets, and enhancing the accuracy and robustness of convolutional neural network (CNN) models for PCa Gleason grading81,141,142.

AI models trained on datasets with narrow demographic representation often fail to generalize effectively to diverse clinical environments, risking misdiagnosis and suboptimal performance when applied to unseen populations. This study addressed this limitation by integrating data from three diverse datasets: TCGA, PANDA Challenge, and the MAST trial. While these datasets provided robust foundations for model training and validation, they are not fully representative of global patient populations. Future research should prioritize external validation across broader ethnic, geographic, and demographic cohorts. Doing so would ensure that AI models trained for Gleason grading are equitable and clinically reliable in a wide range of settings.

GANs were leveraged to generate synthetic histopathological images, providing a scalable solution to address data scarcity and augment training datasets. Unlike prior approaches where synthetic data was used primarily for general augmentation, this study validated synthetic images against pathologist-annotated real-world images using SHRQA, FID, and Hotelling’s T-squared test. These quantitative validation methods confirmed that GAN-generated images preserved key histological structures, including cribriform patterns and glandular arrangements, with no significant differences in spatial recurrence properties. However, while synthetic data significantly improved model performance, real-world validation remains critical, as synthetic images may not fully capture the biological variability inherent in tissue samples. Future studies should explore the optimal balance between synthetic and real-world data to maximize both accuracy and generalizability in AI-driven diagnostic models.

Given the growing prominence of diffusion models for synthetic data generation, their potential role in histopathology warrants discussion. While diffusion models offer superior image fidelity143,144,145, they require significantly greater computational resources and iterative refinement steps, making them less practical for large-scale AI training146,147. In contrast, dcGAN was chosen for its computational efficiency, rapid convergence, and ability to generate clinically relevant images validated through SHRQA and FID metrics. While diffusion models remain a promising alternative, their added complexity did not provide a clear advantage for this study’s objectives. However, future work should investigate hybrid models that integrate the strengths of both GANs and diffusion-based approaches for optimizing synthetic histopathological image generation.

A crucial consideration in this study was dataset heterogeneity, particularly in combining RP and needle biopsy datasets. While merging these datasets into a single training framework was considered, it posed technical and biological challenges. Technically, needle biopsy images were processed at 64 × 64 and upscaled to 256 × 256, whereas RP sections were directly processed at 256 × 256. Combining these datasets without accounting for resolution differences could have led to inconsistencies in feature extraction and CNN performance. Biologically, RP specimens provide larger, whole-mount tissue architecture, while needle biopsies capture smaller, fragmented tumor regions, leading to distinct histological patterns relevant for Gleason grading148,149. Training separate CNN models for RP and needle biopsies ensured that tissue-specific morphological patterns were preserved, improving the robustness and clinical applicability of the AI-based grading system.

The impact of patch size and resolution variability on CNN performance was also systematically evaluated as it influences feature extraction, model convergence, and classification accuracy150,151,152. Ablation studies demonstrated that patches smaller than 64 × 64 resulted in decreased accuracy, likely due to the loss of critical histological details, while patches larger than 512 × 512 increased computational cost without proportional accuracy gains. These findings align with prior studies in digital pathology (Litjens et al. 2017; Campanella et al. 2019) and emphasize the importance of optimizing input resolution to balance accuracy and efficiency. Additionally, image standardization techniques, including color normalization and histogram-based intensity adjustments, were employed to minimize artifacts and domain shift across datasets. A key challenge in deep learning models for histopathology is interpretability, which is essential for clinical adoption. While this study prioritized quantitative validation techniques (SHRQA, FID, Hotelling’s T-squared test, PCA), which provide structured insights into synthetic data fidelity, we acknowledge the importance of explicit interpretability techniques such as Grad-CAM or SHAP. Future work will explore attention-based visualization methods to further enhance clinician trust in AI-driven Gleason grading.

Performance comparison with existing AI-based Gleason grading models highlights the effectiveness of incorporating synthetic histopathological images into CNN training pipelines. The proposed model achieved 95% accuracy when trained on a combined dataset of real and synthetic images, surpassing the 91% accuracy of models trained on real data alone (p = 0.0402). Independent validation on the MAST trial dataset demonstrated 87% accuracy, reinforcing the robustness and generalizability of the approach in real-world clinical settings. These findings align with existing AI-based models such as PathAI and PaigeAI, which, despite high diagnostic accuracy, face dataset bias and generalizability issues153. Notably, the inclusion of GAN-generated synthetic data improved classification performance across all Gleason patterns. Unlike prior AI approaches, which rely solely on real-world datasets, this study demonstrates that synthetic data can be strategically integrated to enhance AI-based pathology models.

While this study demonstrated significant technical advancements, transitioning from research to real-world clinical implementation requires addressing practical challenges. Integration into clinical workflows demands robust digital infrastructure, clinician training, and seamless interoperability with existing pathology systems. To facilitate adoption, future efforts should focus on developing intuitive user interfaces and decision-support systems that enhance, rather than disrupt, clinician workflows. Moreover, regulatory approvals and adherence to ethical guidelines will be pivotal in fostering trust and acceptance among healthcare providers and patients.

Although this study focused on PCa diagnosis, the methodologies demonstrated here—particularly the use of GAN-generated synthetic data—can be extended to other cancer types. The ability to generate high-fidelity synthetic data holds immense potential for augmenting AI models across a spectrum of oncological applications. Future research should refine these models for PCa-specific contexts, such as active surveillance and advanced disease stages, while also exploring their adaptability to other malignancies.

Together, this study addresses key limitations in AI-driven cancer diagnostics by introducing synthetic data to overcome dataset scarcity, integrating diverse datasets to improve generalizability, and demonstrating the feasibility of using CNN models for Gleason grading. While challenges such as model drift, workflow integration, and demographic diversity remain, these findings lay a strong foundation for future longitudinal studies and clinical applications. As AI continues to evolve, its potential to transform cancer diagnostics and personalized medicine becomes increasingly evident, particularly when developed in collaboration with clinicians and validated in real-world settings.

Methods

Scanning and image annotation

Images are deconstructed into individual areas in a patchwise fashion using the software package PyHIST. Briefly, the raw .svs image is scanned for tissue regions, which are identified by color relative to the whitespace background. Each predefined block is taken as a 512 × 512 pixel area, identified as the smallest region graded by our pathologists in any given image. These patches were then used as the training database against which other images are annotated. To annotate a given test image, a sliding-window approach moves through the whole image, and each pixel window is scored against the training database.
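The sliding-window scoring step can be sketched as follows. The classifier is represented by a placeholder function standing in for the trained model; only the 512-pixel window follows the text, and the non-overlapping stride is an assumption.

```python
# Minimal sketch of the sliding-window annotation described above: a test
# image is traversed in fixed-size windows and each window is scored by a
# trained classifier (here a placeholder callable).
import numpy as np

def sliding_window_annotation(image: np.ndarray, classify_patch, window: int = 512):
    """Yield (row, col, label) for each window of an RGB image array."""
    h, w, _ = image.shape
    for top in range(0, h - window + 1, window):
        for left in range(0, w - window + 1, window):
            patch = image[top:top + window, left:left + window]
            yield top, left, classify_patch(patch)

# Example with a dummy classifier that labels a patch by its mean intensity.
dummy = np.random.default_rng(3).integers(0, 256, size=(1024, 1024, 3), dtype=np.uint8)
labels = list(sliding_window_annotation(dummy, lambda p: int(p.mean() > 128)))
print(labels[:4])
```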

Cohorts used

Samples were taken from TCGA; histology images from 500 individuals were considered. An additional 32 local samples were obtained from the University of Miami pathology core. The Institutional Review Board (IRB) protocol was approved by the University of Miami Miller School of Medicine, Miami, FL, ensuring that the research adhered to ethical guidelines and principles. Furthermore, the study included 3949 needle biopsies sourced from Radboud University Medical Center and the Karolinska Institute (the PANDA challenge)83,154. The MAST trial dataset comprised 141 patients stratified into NCCN risk groups: Very Low (44.2%), Low (38.1%), and Intermediate (17.3%). Mean patient age in the MAST dataset was 62.8 years (SD, 8.5 years; range, 43–85 years) (Table 3). Across datasets, pathologists annotated Gleason patterns, ensuring robust and consistent training data (Fig. 1).

Table 3 Summary of Key Variables in MAST Trial Data

Algorithm design, training, and testing procedure

We selected three initial algorithms with which to test the convolutional network approach, representing major stepwise breakthroughs in AI: the AlexNet, ResNet, and Xception models. Initially, training images were taken from areas defined by a single pathologist, and 10 random images were used as test images. We used granular-level annotation (detailed in the Results section) to select the model with the highest accuracy. After model selection, the hyperparameters were tuned. A Tree-structured Parzen Estimator was implemented to perform sequential model-based optimization. In addition, tree weights were investigated using a population-based training approach. The highest-performing hyperparameters were then used, along with the appropriate tree weights, to define the network.
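The text does not name a specific TPE implementation; the sketch below uses Optuna’s TPE sampler as one possible realization, with a placeholder objective standing in for CNN training and validation accuracy, and illustrative hyperparameter ranges.

```python
# Minimal sketch of Tree-structured Parzen Estimator hyperparameter search
# using Optuna's TPE sampler. The objective is a stand-in for training the
# CNN and returning validation accuracy for a sampled configuration.
import optuna

def objective(trial: optuna.Trial) -> float:
    # Hyperparameters that might be tuned for the CNN (illustrative ranges).
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    # Placeholder score; in practice this would train and validate the CNN.
    return 1.0 - abs(dropout - 0.2) - abs(lr - 1e-3) - (0.001 if batch_size == 16 else 0.0)

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=25)
print(study.best_params)
```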

Calculation of performance

We defined accuracy in two ways. First, for granular accuracy, we started with the areas that were selected and annotated with exactly the same Gleason pattern by all three pathologists. Gold-standard granular accuracy is achieved when the convolutional network’s output overlaps with the agreement between all pathologists; it is defined by the number of overlapping pixels in the predicted area compared with the pathologist-annotated areas. Second, patient-level accuracy is achieved when the artificial intelligence pipeline correctly identifies a patient’s Gleason score as defined by the TCGA pathology-confirmed score.
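A minimal sketch of the pixel-overlap (granular) accuracy computation is shown below; the boolean masks are illustrative, and the normalization by the consensus region is one straightforward reading of the definition above.

```python
# Minimal sketch of granular (pixel-level) accuracy: the fraction of pixels
# inside the pathologist-agreed region that the network labels with the same
# Gleason pattern.
import numpy as np

def granular_accuracy(consensus_mask: np.ndarray, prediction_mask: np.ndarray) -> float:
    """Overlap between the pathologist consensus region and the network's prediction."""
    agreed_pixels = consensus_mask.sum()
    if agreed_pixels == 0:
        return float("nan")
    return float(np.logical_and(consensus_mask, prediction_mask).sum() / agreed_pixels)

# Example: the prediction covers three quarters of the consensus region.
consensus = np.zeros((100, 100), dtype=bool)
consensus[20:80, 20:80] = True
prediction = np.zeros_like(consensus)
prediction[20:80, 20:65] = True
print(granular_accuracy(consensus, prediction))  # 0.75
```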

Development and evaluation of a conditional Generative Adversarial Network

A preliminary conditional Generative Adversarial Network (cGAN) was designed and implemented to assess the performance accuracy of various GAN architectures. The cGAN was developed utilizing Python 3.7.3 and the Tensorflow Keras 2.7.0 package. The generator component of the cGAN comprises three input layers and a single output layer. In parallel, the discriminator component is configured with analogous input, hidden, and output layers. The cGAN’s total parameter count was 19.2 million for each of the evaluated Gleason patterns (Supplementary Fig. 1).

Implementation and adaptation of StyleGAN for tissue image analysis

StyleGAN, a progressive generative adversarial network architecture, serves as a baseline for comparison to the cGAN, featuring a distinct generator configuration. The architecture was adopted from the original StyleGAN publication with minimal alterations to the generator and discriminator networks. A notable modification involved substituting human face images in the StyleGAN with tissue images to create a tissue image GAN. The generator’s total parameter count amounted to 28.5 million, in contrast to 26.2 million in the original StyleGAN publication and 23.1 million in a conventional generator. Particular emphasis was placed on refining the GS4 and GS5 images to ensure adequate representation of tumor heterogeneity.

dcGAN

The dcGAN weights were initialized randomly from a normal distribution with mean 0 and a standard deviation of 0.02. The generator neural network was constructed using 13 layers consisting of transposed convolutions with batch normalization and ReLU activations. The walkthrough of the generator layers is as follows:

ConvTranspose2d -> BatchNorm2d -> ReLU -> ConvTranspose2d -> BatchNorm2d -> ReLU -> ConvTranspose2d -> BatchNorm2d -> ReLU -> ConvTranspose2d -> BatchNorm2d -> ReLU -> ConvTranspose2d -> Tanh.

The discriminator neural network was constructed using 12 layers of Conv2d and LeakyReLU functions with batch normalization. The walkthrough of the layers is as follows:

Conv2d -> LeakyReLU -> Conv2d -> BatchNorm2d -> LeakyReLU -> Conv2d -> BatchNorm2d -> LeakyReLU -> Conv2d -> BatchNorm2d -> LeakyReLU -> Conv2d -> Sigmoid.
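The two layer sequences above can be written out in PyTorch as sketched below, for 64 × 64 RGB patches. The channel widths, kernel sizes, strides, and latent dimension are assumptions; only the layer ordering and the N(0, 0.02) weight initialization follow the description.

```python
# A minimal PyTorch sketch of the generator and discriminator layer sequences
# described above. Channel widths, kernel sizes, and the latent dimension are
# assumptions; the weight initialization follows the text.
import torch
import torch.nn as nn

LATENT_DIM = 100  # assumed size of the generator's noise input

def weights_init(m: nn.Module) -> None:
    # Weights drawn from a normal distribution with mean 0 and std 0.02.
    if isinstance(m, (nn.ConvTranspose2d, nn.Conv2d)):
        nn.init.normal_(m.weight, 0.0, 0.02)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.normal_(m.weight, 1.0, 0.02)
        nn.init.zeros_(m.bias)

generator = nn.Sequential(
    nn.ConvTranspose2d(LATENT_DIM, 512, 4, 1, 0, bias=False), nn.BatchNorm2d(512), nn.ReLU(True),
    nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False), nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False), nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False), nn.BatchNorm2d(64), nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False), nn.Tanh(),
)

discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1, bias=False), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(64, 128, 4, 2, 1, bias=False), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(128, 256, 4, 2, 1, bias=False), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(256, 512, 4, 2, 1, bias=False), nn.BatchNorm2d(512), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(512, 1, 4, 1, 0, bias=False), nn.Sigmoid(),
)

generator.apply(weights_init)
discriminator.apply(weights_init)

# Sanity check: noise vectors map to 64 x 64 RGB images, and the
# discriminator maps those images to per-image probabilities.
noise = torch.randn(8, LATENT_DIM, 1, 1)
fake = generator(noise)  # shape: (8, 3, 64, 64)
print(fake.shape, discriminator(fake).shape)
```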

The estimated computational cost of a single run ranged from approximately 1.3 to 1.4 trillion calculations to generate 1000 synthetic images on standard GPU chips. The time taken for a single run depended on the number of GPU processors available: for a multi-threaded GPU setup it ranged from 1.5 h to 5 h, while for a single GPU it ranged from 3 h to 12 h. Time also depended on the desired level of resolution.

Finally, the BCE loss function was used and the Adam optimizer was implemented for both the generator and the discriminator. Iteration-level statistics were generated at the end of each run and saved for further analysis and plotting with matplotlib.
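A single adversarial training step with BCE loss and Adam optimizers is sketched below, reusing the `generator`, `discriminator`, and `LATENT_DIM` from the sketch above. The learning rate and betas are assumptions, and a random tensor stands in for a batch of real image patches.

```python
# Minimal sketch of one dcGAN training step with BCE loss and Adam.
import torch
import torch.nn as nn

criterion = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def train_step(real_batch: torch.Tensor):
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # Discriminator update: real batch scored as 1, generated batch as 0.
    opt_d.zero_grad()
    d_real = discriminator(real_batch).view(batch_size, 1)
    noise = torch.randn(batch_size, LATENT_DIM, 1, 1)
    fake_batch = generator(noise)
    d_fake = discriminator(fake_batch.detach()).view(batch_size, 1)
    loss_d = criterion(d_real, real_labels) + criterion(d_fake, fake_labels)
    loss_d.backward()
    opt_d.step()

    # Generator update: try to make the discriminator score fakes as real.
    opt_g.zero_grad()
    loss_g = criterion(discriminator(fake_batch).view(batch_size, 1), real_labels)
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

# One illustrative step on a random "batch" of 64 x 64 RGB patches in [-1, 1].
print(train_step(torch.rand(8, 3, 64, 64) * 2 - 1))
```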

EfficientNet

The EfficientNet baseline model was set up with the B3 function within the TensorFlow Python backend. In initial testing, the B6 variant was found to be too restrictive and B1 too simple to capture complex patterns, so B3 was selected as our model function. To adapt the model to our images, a GlobalMaxPooling2D layer was added after the initial base, as well as a Dropout layer to help avoid overfitting. The dropout rate was set to 0.2 after initial estimates proved too high. The number of classes in the prediction layer was set to 4, representing the normal tissue group and the primary Gleason patterns GS3, GS4, and GS5. The EfficientNet model was set up using pre-trained ImageNet weights to take advantage of transfer learning and reduce analysis time.
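The classifier head described above can be assembled in Keras as sketched below; the input size, optimizer, and loss are assumptions, while the B3 base, ImageNet weights, GlobalMaxPooling2D, Dropout(0.2), and four-class output follow the text.

```python
# Sketch of the EfficientNetB3 classifier: ImageNet-pretrained base followed
# by GlobalMaxPooling2D, Dropout (rate 0.2), and a 4-class prediction layer
# (normal, GS3, GS4, GS5).
import tensorflow as tf

NUM_CLASSES = 4  # normal tissue plus primary Gleason patterns 3, 4, and 5

base = tf.keras.applications.EfficientNetB3(
    include_top=False, weights="imagenet", input_shape=(256, 256, 3)
)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalMaxPooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```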

Image augmentation was performed using the Keras ImageDataGenerator function. The image rotation range was set to 45, width_shift_range and height_shift_range were set to 0.2, and horizontal_flip was set to true. Fill mode defaulted to “nearest”. Validation data were not augmented, but the images were passed through a rescale function to ensure that every test image was uniform before annotation.
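These settings map directly onto ImageDataGenerator arguments, as sketched below. The 1/255 rescale factor (applied to both training and validation for consistency) and the directory layout in the usage comment are assumptions about the “rescale function” mentioned above.

```python
# Sketch of the augmentation settings listed above; validation images are
# only rescaled, not augmented.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rotation_range=45,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    fill_mode="nearest",
    rescale=1.0 / 255,
)

val_datagen = ImageDataGenerator(rescale=1.0 / 255)

# Example usage (hypothetical directory layout with one subfolder per class):
# train_gen = train_datagen.flow_from_directory("patches/train", target_size=(256, 256),
#                                               class_mode="categorical", batch_size=32)
# val_gen = val_datagen.flow_from_directory("patches/val", target_size=(256, 256),
#                                           class_mode="categorical", batch_size=32)
```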

Statistical calculations

FID was implemented in custom scripts developed in-house. The FID model was pre-trained using Inception V3 weights for transfer learning. The in-house code was built around the FID model and inserted into the dcGAN to run during each iteration. Statistics were reported at intervals of 1000 iterations and graphed with in-house Python scripts.
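A standard formulation of the FID computation is sketched below; this is not the in-house script, but it illustrates the Inception V3 feature extraction and the Frechet distance between Gaussians fitted to real and synthetic activations.

```python
# Minimal sketch of the Frechet Inception Distance: InceptionV3 (ImageNet
# weights, average pooling) yields 2048-dim activations, and the FID is the
# Frechet distance between Gaussians fitted to real and synthetic activations.
import numpy as np
import tensorflow as tf
from scipy import linalg

inception = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg", input_shape=(299, 299, 3)
)

def activations(images: np.ndarray) -> np.ndarray:
    """images: float array in [0, 255], shape (n, 299, 299, 3)."""
    x = tf.keras.applications.inception_v3.preprocess_input(images.astype(np.float32))
    return inception.predict(x, verbose=0)

def frechet_distance(act_real: np.ndarray, act_fake: np.ndarray) -> float:
    mu1, mu2 = act_real.mean(axis=0), act_fake.mean(axis=0)
    sigma1 = np.cov(act_real, rowvar=False)
    sigma2 = np.cov(act_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Usage: fid = frechet_distance(activations(real_images), activations(fake_images))
```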

PCA was performed by first transforming the images into numerical arrays. Images were separated into normal and synthetic batches and then distributed by primary Gleason score. Intensity was calculated (using the R packages imgpalr and magick) as the average color of the entire image while keeping the matrix framework (i.e., positional arguments were retained). PCA was conducted using the prcomp function in R, and results were plotted with ggplot2.