Annotation-free deep learning for predicting gene mutations from whole slide images of acute myeloid leukemia

Wei, Bo-Han; Tsai, Xavier Cheng-Hong; Sun, Kuo-Jui; Lo, Min-Yen; Hung, Sheng-Yu; Chou, Wen-Chien; Tien, Hwei-Fang; Hou, Hsin-An; Chen, Chien-Yu

doi:10.1038/s41698-025-00804-0

Published: 03 February 2025

Bo-Han Wei^1,2,
Xavier Cheng-Hong Tsai^3,4,5,
Kuo-Jui Sun^3,6,7,
Min-Yen Lo⁸,
Sheng-Yu Hung ORCID: orcid.org/0000-0002-5196-8797⁵,
Wen-Chien Chou^3,7,
Hwei-Fang Tien^3,9,
Hsin-An Hou³ &
Chien-Yu Chen ORCID: orcid.org/0000-0002-6940-6389^1,2,6,10,11

npj Precision Oncology volume 9, Article number: 35 (2025)

Abstract

The rapid development of deep learning has revolutionized medical image processing, including analyzing whole slide images (WSIs). Despite the demonstrated potential for characterizing gene mutations directly from WSIs in certain cancers, challenges remain due to image resolution and reliance on manual annotations for acute myeloid leukemia (AML). We, therefore, propose a deep learning model based on multiple instance learning (MIL) with ensemble techniques to predict gene mutations from AML WSIs. Our model predicts NPM1 mutations and FLT3-ITD without requiring patch-level or cell-level annotations. Using a dataset of 572 WSIs, the largest database with both WSI and genetic mutation information, our model achieved an AUC of 0.90 ± 0.08 for NPM1 and 0.80 ± 0.10 for FLT3-ITD in the testing cohort. Additionally, we found that blasts are pivotal indicators for gene mutation predictions, with their proportions varying between mutated and standard WSIs, highlighting the clinical potential of AML WSI analysis.

Introduction

Acute myeloid leukemia (AML) is an aggressive hematologic malignancy with significant biological and clinical heterogeneity, characterized by uncontrolled proliferation and impaired differentiation of hematopoietic precursors¹. Precise risk stratification is the cornerstone for optimizing treatment efficacy and minimizing treatment-related complications. Currently, a range of cytogenetic changes and gene mutations have been integrated into the risk stratification, shaping the treatment landscape². Nucleophosmin 1 (NPM1) and FMS-like tyrosine kinase-3 internal tandem duplication (FLT3-ITD) are the most prevalent recurrent gene mutations in patients with AML^3,4. The NPM1 mutation occurs in 15-20% of cases, FLT3-ITD in 10-20%, and both mutations together in 10.5%^5,6. Considerable efforts have been devoted to developing targeted therapies against these mutations^7,8, highlighting their crucial significance in clinical practice. However, molecular testing for these mutations presents significant challenges. While some institutions can provide results for FLT3-ITD and NPM1 mutations within days using straightforward PCR methods, disparities in resources and laboratory infrastructure across facilities often limit accessibility and prolong test turnaround times.

Whole slide images (WSIs) employ digital imaging technology to transform pathological specimens into high-resolution digital images detailing cellular and histological structures⁹. While deep learning has shown promise in tasks such as binary morphological classification and histological grading using WSIs^10,11, challenges persist in analyzing bone marrow aspirates due to their complex cytological nature. Aspirates typically feature small, cluttered regions with various cell types and non-cellular debris. Identifying regions of interest (ROIs) and distinguishing individual cells or objects from the background require multi-step preprocessing, including segmentation and denoising^12,13. Despite advances in deep learning for object detection, manual annotation of segmented cells by experts remains labor-intensive and time-consuming¹⁴. Developing a more efficient and accurate method for analyzing bone marrow aspirate WSIs is essential.

Contemporary histopathology research often follows a two-stage workflow, focusing on patch-level and slide-level training¹⁵. Initially, a Convolutional Neural Network (CNN) is trained on patches extracted from WSIs with patch-level annotations, learning complex patterns. In the second stage, features learned at the patch level are utilized to train a slide-level model, which is necessary for diagnosing WSIs, capitalizing on insights from patch-level analysis. Widely used for cancer identification^16,17,18, classification^19,20, and metastasis detection¹⁶, these approaches require substantial manual annotations. Multiple Instance Learning (MIL) has been employed to use slide-level labels directly^21,22,23, reducing the annotation burden by classifying slides based on the highest-scoring patch.

Apart from patch-level MIL, numerous related studies have utilized individual cells as inputs for MIL training^24,25, underscoring the existence of disease-associated cytomorphologies in cellular images^26,27,28. Therefore, we hypothesized that deep learning could predict gene mutations based on cellular morphology. Here, we present an end-to-end artificial intelligence framework for bone marrow cytology, uniquely trained using WSIs with slide-level annotations. By leveraging annotation-free WSIs for gene mutation predictions, we demonstrate the capability of deep learning in predicting gene mutations. Our results highlight that models trained at the cell level outperform those trained at the patch level. Additionally, we illustrate how techniques, such as upsampling and ensemble learning, can enhance the predictive performance of the model, especially in scenarios with limited training data.

Methods

Dataset

Between 1994 and 2015, 572 patients diagnosed with de novo AML at National Taiwan University Hospital (NTUH) were enrolled in this study. Bone marrow smears and peripheral blood smears were scanned as WSIs after being stained with a modified Romanowsky stain (Fig. 2a). We utilized a Hamamatsu NanoZoomer Digital Slide Scanner equipped with a 40x NA 0.75 objective lens, achieving a resolution of approximately 0.23 μm/pixel. Gene mutation status was determined using the TruSight myeloid panel on the HiSeq platform (Illumina, San Diego, CA)⁶, and mutations were annotated as 1 (indicating the presence of pathogenic or likely pathogenic mutations) or 0 (indicating the absence of such mutations). In our cohort, a total of 34 genes were frequently found to be mutated; we list the mutations with frequencies above 10% and selected the two most frequent mutations as prediction targets (Table 1). This retrospective study was approved by the NTUH Research Ethics Committee (Approval number: 201802021RINC), and written informed consent was obtained from all participants in accordance with the Declaration of Helsinki.

Table 1 Number of WSIs with mutation frequencies exceeding 10% in our cohort

Cell image generation

The bone marrow smear WSIs underwent a three-step filtering process to identify the cells for subsequent model training. First, we used the PyHIST tool to generate patches²⁹. This tool was applied to filter out background regions and non-smear areas in the WSIs (Fig. 1a). Patches (512 × 512 pixels) were extracted at the highest resolution (×40 magnification), using a graph-based segmentation method³⁰ as the “generation method” parameter in PyHIST, with a content threshold of 0.05. The parameters ‘tilecross-downsample’ and ‘mask-downsample’ were left at their default values, and the parameter ‘output-downsample’ was set to 1 in order to obtain patches at the original resolution.
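As a rough illustration of what the content threshold controls, the sketch below keeps a patch only when the fraction of foreground pixels in its segmentation mask reaches 0.05. It is not PyHIST code: the function names `tissue_fraction` and `keep_patch`, the list-of-rows mask format, and the use of `>=` rather than `>` are assumptions made for the example.

```python
def tissue_fraction(mask):
    """Fraction of foreground (1) pixels in a binary mask given as a list of rows."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    return sum(sum(row) for row in mask) / total


def keep_patch(mask, content_threshold=0.05):
    """Keep the patch if enough of it is stained content (hypothetical rule)."""
    return tissue_fraction(mask) >= content_threshold


# Toy 10 x 10 masks: 6% foreground is kept, 4% foreground is dropped.
mostly_blank = [[0] * 10 for _ in range(10)]
sparse_kept = [row[:] for row in mostly_blank]
for col in range(6):
    sparse_kept[0][col] = 1
sparse_dropped = [row[:] for row in mostly_blank]
for col in range(4):
    sparse_dropped[0][col] = 1
```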

**Fig. 1: Overview of the proposed method.**

Secondly, the patches generated from the PyHIST tool were further classified into ROI and non-ROI patches. Due to the absence of manually annotated ROI and non-ROI data, we utilized an ROI detection model with training weights obtained from a previous study³¹. The model, based on DenseNet121³², was pre-trained and fine-tuned on labeled patches (ROI/non-ROI) in ref. ³¹. In this step (Fig. 1b), patches containing densely packed leukocytes or areas without any leukocytes were removed, significantly reducing the number of input data for the subsequent step: cell detection modeling.

After ROI selection, the ROI patches in each WSI were used to extract cells in the third step. We applied the cell detection model using the training weights from the same study³¹. This model, based on YOLOv4³³, was trained in ref. ³¹ to predict the bounding boxes of leukocyte objects. In this step, all leukocytes in each ROI patch in bone marrow aspirates were detected (Fig. 1c).

Data upsampling

The WSI cohort exhibited a significant class imbalance, with relatively few mutated samples compared to a large number of standard samples. This imbalance could have influenced training outcomes, as the model might become biased towards predicting “standard” for all WSIs and consequently overlook cases with mutations. To tackle this imbalance, we introduced an upsampling strategy by dividing the cells of a mutated WSI into multiple sets (bags), each containing a fixed number of cells (Fig. 1d). A previous study demonstrated that 500-cell sets are sufficient to capture the details of bone marrow³⁴. In this study, we increased this number fourfold to 2000 cells to enhance the representativeness of each WSI. These bags were treated as separate samples during MIL training, effectively increasing the mutated class in the training data and resulting in a more balanced dataset.

In addition to addressing data imbalance, we faced the challenge of working with standard WSIs, which were not upsampled and each contained tens of thousands of cells. This large volume of input data could significantly slow down model training. To address this, we randomly selected a fixed number (2000) of cells per standard WSI as a bag, thereby expediting the training process by reducing input quantity.
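The bag construction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `make_bags`, the shuffling of cells before partitioning a mutated WSI, and the fixed random seed are assumptions.

```python
import random

BAG_SIZE = 2000  # cells per bag, as stated in the text


def make_bags(cells, mutated, bag_size=BAG_SIZE, seed=0):
    """Return a list of bags (lists of cells) for one WSI.

    Mutated WSI: all cells are partitioned into consecutive bags of
    `bag_size`; the last bag may be smaller (upsampling).
    Standard WSI: a single bag of `bag_size` randomly sampled cells.
    """
    rng = random.Random(seed)
    cells = list(cells)
    if mutated:
        rng.shuffle(cells)  # assumption: order is randomized before splitting
        return [cells[i:i + bag_size] for i in range(0, len(cells), bag_size)]
    return [rng.sample(cells, min(bag_size, len(cells)))]


# A mutated WSI with 4500 detected cells yields three bags (2000, 2000, 500);
# a standard WSI with 30000 cells yields one bag of 2000 cells.
mutated_bags = make_bags(range(4500), mutated=True)
standard_bags = make_bags(range(30000), mutated=False)
```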

Multiple instance learning

Fully supervised approaches for histopathology image analysis require detailed manual annotations, which are time-consuming and intrinsically ambiguous, even for well-trained experts. Standard unsupervised approaches usually fail because of the complicated patterns in these images. MIL, a weakly supervised learning approach, addresses this challenge by utilizing labeled groups of instances, known as bags, instead of requiring individual instance labels. It suits the current study because it combines advantages of supervised and unsupervised approaches. The main idea of MIL is to learn local patterns using global annotations. In MIL, each bag contains multiple instances, but only the bag carries a class label; the instances within the bag are unlabeled. Previous research transformed binary classification tasks into MIL problems by dividing WSIs into multiple instances. Here, the instances could be patches²¹ or cells^24,25,26,27,28. Positive WSIs contain at least one positive instance, while negative WSIs do not contain any positive instances.

In this study, “bags” refers to the bags created after upsampling, while “instances” denotes individual cells in bags. After the cell detection process described above, bags B = {B_i: i = 1, 2, …, n} are generated, where ${B}_{i}=\{{c}_{i,1},{c}_{i,2},\ldots ,{c}_{i,{m}_{i}}\}$ is a bag containing m_i cells (Fig. 1e). For most bags, m_i = 2000. However, a single bag from a mutated WSI might contain fewer than 2000 cells, as the total cell count of a WSI is typically not an exact multiple of 2000. We labeled bags as “mutated” or “standard” according to the labels of their originating WSIs. In a bag, all cells are classified using an embedding model, the training process for which is detailed in the following paragraphs. The cells are then ranked based on their probability of being mutated. The MIL task involves learning a cell-level embedding that effectively identifies and distinguishes characteristic cells within mutated bags from standard cells.

During training (Fig. 1e), the MIL process alternates between two phases: the inference phase and the learning phase. First, the DenseNet121 model was employed and fine-tuned to serve as the embedding model. It was initialized with pre-trained weights provided by the PyTorch library, which were previously trained on the ImageNet dataset. The model is represented as a function f_θ, where the current parameters θ map the input cells c_i,j to probabilities of “mutated.” In the cell inference phase of MIL training, a list of probabilities ${P}_{i}=\{{p}_{i,1},{p}_{i,2},\ldots ,{p}_{i,{m}_{i}}\}$ is derived for the bag B_i, representing the probabilities of the “mutated” class for the cells c_i,j (j = 1, 2, …, m_i). The weights of the model function f_θ are frozen during this phase, ensuring that the parameters remain fixed while producing the estimated probabilities.

During the learning phase of MIL training, we introduce a hyperparameter K to select the top K cells with the highest probabilities (p_i,j) from each bag as training data for the embedding model. The true labels (y_i,j) of these selected cells are determined based on the label of their respective bags: the top K cells in each mutated bag are labeled as mutated cells, while the top K cells in standard bags are labeled as standard cells. Then, the network’s output ${\widetilde{y}}_{i,j}={f}_{\theta }({c}_{i,j})={p}_{i,j}$ is compared to the label y_i,j using the cross-entropy loss function l as in Eq. (1):

$$l=\sum _{i,j}\left(-{w}_{1}\,{y}_{i,j}\log ({\widetilde{y}}_{i,j})-{w}_{0}\,(1-{y}_{i,j})\log (1-{\widetilde{y}}_{i,j})\right)$$

(1)

where w₁ is the number of mutated bags divided by the total number of bags and w₀ = 1 − w₁. Based on the loss function in Eq. (1), the weights of f_θ are then updated to minimize the discrepancy between the predicted output and the actual label.
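One learning step, covering the top-K selection and the weighted cross-entropy of Eq. (1), can be sketched in plain Python. The function names, the clipping constant `eps`, and the toy probabilities are assumptions for illustration; in the study the probabilities come from the DenseNet121 embedding model, and the parameter update itself is omitted here.

```python
import math


def select_top_k(probs, bag_label, k):
    """Return (cell_index, label) pairs for the k cells with the highest p_{i,j}.

    Every selected cell inherits the label of its bag (1 = mutated, 0 = standard).
    """
    ranked = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    return [(j, bag_label) for j in ranked[:k]]


def class_weights(n_mutated_bags, n_total_bags):
    """w1 = mutated bags / total bags and w0 = 1 - w1, as defined after Eq. (1)."""
    w1 = n_mutated_bags / n_total_bags
    return 1.0 - w1, w1


def weighted_cross_entropy(labels, preds, w0, w1, eps=1e-12):
    """Eq. (1) summed over the selected cells; eps (an assumption) avoids log(0)."""
    loss = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)
        loss += -w1 * y * math.log(p) - w0 * (1 - y) * math.log(1.0 - p)
    return loss


# Toy mutated bag with four cells and K = 2: cells 1 and 3 are selected.
probs = [0.10, 0.90, 0.40, 0.70]
selected = select_top_k(probs, bag_label=1, k=2)
w0, w1 = class_weights(n_mutated_bags=63, n_total_bags=100)
loss = weighted_cross_entropy(
    [label for _, label in selected],
    [probs[j] for j, _ in selected],
    w0,
    w1,
)
```

The bag counts 63 and 100 are chosen only so that (w0, w1) reproduces the (0.37, 0.63) pair reported for FLT3-ITD; they are not the actual bag counts of the study.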

Ensemble Learning

Ensemble methods are a powerful approach that involves training and combining multiple models to tackle complex problems. The core idea is that a collection of individual “weak learners” can synergistically work to create a “strong learner.” Each model contributes its perspective through voting, while the ensemble method combines these inputs to generate a final prediction. The overarching objective of ensembles is to mitigate bias and variance in predictions by harnessing the combined capabilities of multiple models instead of relying solely on one single model.

In this study, we employed an ensemble method that utilizes a loss-based weighting approach, where each model’s contribution within the ensemble is determined by its individual loss ${l}_{{M}_{K}}$ as in Eq. (1). This weighting strategy assigns greater influence to models with lower loss values, thereby enhancing the impact of more accurate models in the ensemble’s final probability (P_ensemble). Only the top three MIL models, determined by their loss values, were used as weak learners within the ensemble. This approach aimed to optimize the combination of base models $\{{M}_{{K}_{1}},{M}_{{K}_{2}},{M}_{{K}_{3}}\}={\rm{argmin}}_{{M}_{K}\in \{{M}_{1},\ldots ,{M}_{30}\}}\,{l}_{{M}_{K}}$, where K₁, K₂, and K₃ index the three MIL models with the lowest loss values among the models in {M₁, …, M₃₀}. Each base model within the ensemble was assigned a weight W_i determined by its loss l_i as in Eq. (2):

$${W}_{i}=\frac{\exp (-{l}_{i})}{{\sum }_{j\in \{{M}_{{K}_{1}},{M}_{{K}_{2}},{M}_{{K}_{3}}\}}\exp (-{l}_{j})}$$

(2)

The ensemble model combined the highest predicted probability ($\max {P}_{i}$) from each of these three models according to their respective weights (${W}_{i}$), as illustrated in Eq. (3):

$${P}_{\rm{ensemble}}=\sum _{i\in \{{M}_{{K}_{1}},{M}_{{K}_{2}},{M}_{{K}_{3}}\}}{W}_{i}\times \max {P}_{i}$$

(3)
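Eqs. (2) and (3) amount to a softmax over negative losses restricted to the three best models, followed by a weighted average of their bag-level probabilities. The sketch below illustrates this with made-up loss values; the function name `ensemble_probability` and the dictionary inputs keyed by K are assumptions, and the source does not state on which data split the losses are computed.

```python
import math


def ensemble_probability(losses, bag_max_probs, top_n=3):
    """Combine the top_n lowest-loss models for one bag.

    losses:        {K: loss of model M_K}
    bag_max_probs: {K: max P_i, the highest cell probability model M_K gives the bag}
    Returns (P_ensemble, {K: W_K}) following Eqs. (2) and (3).
    """
    best = sorted(losses, key=losses.get)[:top_n]        # lowest losses first
    exps = {k: math.exp(-losses[k]) for k in best}
    norm = sum(exps.values())
    weights = {k: exps[k] / norm for k in best}          # Eq. (2)
    p_ensemble = sum(weights[k] * bag_max_probs[k] for k in best)  # Eq. (3)
    return p_ensemble, weights


# Four hypothetical models; the model with K = 3 has the highest loss and is excluded.
toy_losses = {1: 0.5, 2: 0.2, 3: 0.9, 4: 0.3}
toy_probs = {1: 0.8, 2: 0.8, 3: 0.1, 4: 0.8}
p_ens, w = ensemble_probability(toy_losses, toy_probs)
```

Because the weights are normalized, P_ensemble always lies between the smallest and largest of the three combined probabilities.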

Results

Automatic selection of ROI patches

In digital pathology, glass slides of bone marrow aspirate smears were scanned using a digital slide scanner to generate high-resolution WSIs for hematopathologist analysis (Fig. 2a). To initiate this process, a dataset was sampled from 572 bone marrow aspirate WSIs obtained at the National Taiwan University Hospital. To address the issue of detecting ROI patches, we developed a pipeline to select ROI patches in stages, automatically identifying areas within the bone marrow aspirate WSIs that are suitable for cytological analysis.

A single WSI might contain only a few areas suitable for cytology. These areas are sparsely distributed, with minimal cell overlap and staining artifacts, and exhibit subtle and intricate cytological features required for cell classification. To acquire such areas efficiently, the PyHIST tool²⁹ was first used with the default graph method to remove the blank background regions, retaining only the stained portions. Subsequently, we employed a fine-tuned DenseNet121 architecture, utilizing pre-trained weights from previous research³¹ to classify individual patches as ROI patches or non-ROI patches (Fig. 2b).

We observed that the classification results aligned with expectation, i.e., a real-world scenario where usually only 10-20% of a WSI might be the ROI regions for cytology. The results obtained after applying DenseNet121 reveal that for most WSIs, the ROI patches were reduced to around 10-25% of their original counts after undergoing this selection process, indicating a substantial reduction in patch counts (Supplementary Fig. 1). This screening process significantly reduced the time required for subsequent cell detection processes and exhibited outstanding filtering effectiveness in removing areas with excessive cell overlap, excessive tissue artifacts, or patches devoid of blood cell presence within the designated range. These problematic patches were effectively eliminated, while, as shown in Supplementary Fig. 2, patches with a small number of cells were not erroneously dropped. This approach ensures that patches with few cells are retained, supporting data completeness and avoiding unnecessary loss of valuable information (Fig. 2b).

Leukocyte detection

After the ROI selection process, we employed a YOLOv4 model to automatically detect and classify cells and non-cellular objects within the selected ROI patches. In the absence of manually annotated cells, we used training weights that had been pre-trained and fine-tuned on labeled patches from the same study for selecting ROI patches³¹, applying them directly in this detection model. Using the ROI patches identified by the ROI detection model as input, this YOLOv4 model was designed to automatically detect and classify all cellular and non-cellular objects in bone marrow samples. We further set a confidence threshold of 0.5. Cells with confidence scores below this threshold were not captured, ensuring high-quality cell selection for subsequent MIL model training. In addition to managing confidence levels, we factored in cell size during the process. Under ultraviolet light with a 400 nm wavelength, the theoretical resolution is approximately 325.33 nm. We set the cell size range to 51 × 51 pixels to 80 × 80 pixels. This means that a cell size of 51 pixels represents approximately 16.6 μm, while 80 pixels corresponds to roughly 25.6 μm. This range was chosen based on the typical sizes of white blood cells (WBCs), which range from approximately 10–25 μm, and blasts, typically around 15-20 μm. By setting the selection threshold between 51 and 80 pixels, we aimed to capture the most representative cells that are complete and relevant for feature extraction. The chosen cells were directly resized to 64 × 64 pixels to standardize input size for MIL training, avoiding irrelevant features like neighboring cells or excessive background that could be introduced by alternative padding methods (Fig. 2c).
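The quoted resolution of 325.33 nm is consistent with the Rayleigh criterion, d = 0.61λ/NA, evaluated for λ = 400 nm and the NA 0.75 objective described in Methods. The source does not name the formula, so treating it as the Rayleigh criterion is an assumption. The sketch below reproduces that number and shows one possible form of the confidence and size filter; the function names and the requirement that both box sides fall within 51 to 80 pixels are likewise assumptions.

```python
def rayleigh_resolution_nm(wavelength_nm, numerical_aperture):
    """Rayleigh criterion: smallest resolvable distance d = 0.61 * lambda / NA."""
    return 0.61 * wavelength_nm / numerical_aperture


def is_valid_cell(width_px, height_px, confidence,
                  min_side=51, max_side=80, min_conf=0.5):
    """Keep a detection if it is confident enough and its box is within the size range."""
    if confidence < min_conf:
        return False
    return (min_side <= width_px <= max_side) and (min_side <= height_px <= max_side)


resolution = rayleigh_resolution_nm(400, 0.75)  # about 325.33 nm
```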

The number of cells retained per WSI varied between 100 and 100,000, with the majority of cell counts being below 20,000 and an average of 11,273 (Fig. 3). This outcome effectively reduced the input quantity for the MIL model while maintaining high quality of the input cells, accelerating training without compromising accuracy. We therefore used leukocytes, including basophils, blasts, eosinophils, lymphocytes, metamyelocytes, monocytes, myelocytes, neutrophils, and promyelocytes, as inputs for MIL training, denoted as ‘all cells’. Among these cells, 2000 were randomly selected as representative cells of a WSI, i.e., a bag. During selection, we applied the upsampling technique to the mutated WSIs to balance the data. By separating each mutated WSI into multiple bags, we generated numerous mutated bags, each consisting of up to 2000 cells.

**Fig. 3: ROC curves for MIL models trained using different strategies.**

Cell-Level Multiple Instance Learning

The total dataset of 572 WSIs was split into training, validation, and test sets at the slide level, with a ratio of around 7:1:2 (400:56:116 WSIs). We used DenseNet121 as the base embedding model, initialized with pre-trained weights provided by the PyTorch library based on training with the ImageNet dataset. The MIL model was trained for 100 epochs with a learning rate of 0.0001. The loss was minimized by stochastic gradient-based optimization using the Adam optimizer. The batch size was determined by the number of cells within each bag, restricted to a maximum of 2000 instances (cells) per batch. Since the training data still exhibited minor class imbalance after upsampling, we set weights (w0, w1) to (0.51, 0.29) for NPM1 mutations and (0.37, 0.63) for FLT3-ITD (see the Multiple instance learning section in Methods). No upsampling was applied to the validation or testing sets to emulate a real-world scenario. Each epoch was evaluated using an imbalanced validation set of 56 WSIs containing more standard than mutated WSIs. Early stopping was implemented to prevent overfitting. The separate testing set included 96 standard and 20 NPM1-mutated WSIs, along with 95 standard and 21 FLT3-ITD WSIs.
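Splitting at the slide level means that all cells and bags from one WSI end up in the same subset. A minimal sketch of such a split with the reported sizes is shown below; the function name `split_slides`, the purely random (non-stratified) assignment, and the seed are assumptions, since the source does not describe how slides were allocated.

```python
import random


def split_slides(slide_ids, n_train=400, n_val=56, n_test=116, seed=0):
    """Randomly assign whole slides to training, validation, and test sets."""
    ids = list(slide_ids)
    if len(ids) != n_train + n_val + n_test:
        raise ValueError("subset sizes must add up to the number of slides")
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Hypothetical slide identifiers 0..571 standing in for the 572 WSIs.
train_ids, val_ids, test_ids = split_slides(range(572))
```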

In addition to established MIL training, we leveraged ensemble learning by employing different values of K (from K = 1 to K = 30) in MIL to diversify the models’ perspectives on the data. Each K represents the number of positive instances considered within a bag during the MIL training process, leading to the creation of models with varying focuses on distinct data subsets. Based on Eq. (2) and Eq. (3), we calculated the respective weights W_i of the top 3 MIL models with the lowest loss values among the models and then estimated the final prediction. The models’ predictive performance on the training, validation, and test datasets is shown in Supplementary Fig. 4.

The integration of upsampling at the cell level improved model performance on the testing set for both gene mutations, achieving AUC values of 0.90 ± 0.08 for NPM1 mutations and 0.80 ± 0.10 for FLT3-ITD (Fig. 3). For NPM1 mutations, upsampling cell-level MIL achieved an AUC of 0.82 ± 0.11, which exceeded the non-upsampling approaches at both the patch and cell levels. The addition of ensemble learning further enhanced the AUC to 0.90 ± 0.08, comparable to results from models trained on cell-level annotations in prior studies²⁸.

In the case of FLT3-ITD, upsampling cell-level MIL achieved an AUC of 0.77 ± 0.10, also higher than the non-upsampling patch-level and cell-level MIL models. With ensemble learning, the AUC further increased to 0.80 ± 0.10. This result demonstrates that upsampling at the cell level, combined with ensemble learning, addresses class imbalance and improves detection accuracy. The combination of these methods performs favorably compared to recent techniques based on cell-level annotation²⁸ and exceeds previously reported results¹⁴. This approach provides a structured framework for identifying mutations in bone marrow samples, especially in cases with limited annotation resources.

Moreover, our ensemble cell-level MIL, when using sensitivity as the standard for the reliability analysis of clinical diagnosis, exhibited better performance than the traditional MIL methods. As the sensitivity was set to 0.75, our approach reduced the false positive rate from 0.36 to 0.09 for NPM1 mutations. For FLT3-ITD, there was also a notable reduction from 0.60 to 0.26 (Fig. 3). This outcome once again confirms that ensemble learning and upsampling can enhance the effectiveness of the MIL model.
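Reading a false positive rate at a fixed sensitivity corresponds to picking one operating point on the ROC curve. The sketch below shows one way to compute it from slide-level labels and scores; the function name `fpr_at_sensitivity`, the `score >= threshold` decision rule, and the toy data are assumptions and do not reproduce the reported values.

```python
def fpr_at_sensitivity(labels, scores, target_sensitivity=0.75):
    """Lowest false positive rate among thresholds reaching the target sensitivity.

    labels: 1 for mutated, 0 for standard; scores: predicted probabilities.
    A slide is called positive when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # Lowering the threshold can only raise sensitivity and FPR, so the first
    # threshold that reaches the target gives the lowest achievable FPR.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        if tp / n_pos >= target_sensitivity:
            return fp / n_neg
    return 1.0


toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.3, 0.1, 0.05]
fpr_075 = fpr_at_sensitivity(toy_labels, toy_scores, 0.75)
fpr_100 = fpr_at_sensitivity(toy_labels, toy_scores, 1.0)
```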

To clarify the importance of cell features for prediction in the learning process of MIL, we compared the proportional representation of nine types of cells. Before applying the MIL model, all cell types in each WSI would be divided by the total number of cells of all types in that WSI to obtain the proportion of each type of cell. After prediction, we listed the top 100 images of cells with the highest probability of being predicted by the model as the highest correlated with mutations in each bag. Then, we calculated the proportion of all representative cells predicted by MIL in the original WSI based on their cell types (Fig. 4). This can reveal which cells are more meaningful for MIL predictions of gene mutations, even if we were only using slide-level annotations.
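The comparison can be sketched as two proportion calculations, one over all cells and one over the top-ranked cells. The function names, the `(cell_type, probability)` tuple format, and the four-cell toy example (with the top 2 instead of the top 100) are assumptions for illustration.

```python
from collections import Counter


def type_proportions(cell_types):
    """Proportion of each cell type among the given cells."""
    counts = Counter(cell_types)
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


def top_n_types(cells, n=100):
    """Cell types of the n cells with the highest predicted mutation probability.

    cells: list of (cell_type, probability) tuples.
    """
    ranked = sorted(cells, key=lambda cell: cell[1], reverse=True)
    return [cell_type for cell_type, _ in ranked[:n]]


toy_cells = [("blast", 0.9), ("blast", 0.8), ("lymphocyte", 0.7), ("neutrophil", 0.1)]
before = type_proportions([cell_type for cell_type, _ in toy_cells])
after = type_proportions(top_n_types(toy_cells, n=2))
```

In this toy example the blast proportion rises from 0.5 before ranking to 1.0 among the top-ranked cells, which is the kind of shift the comparison is designed to reveal.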

**Fig. 4: Comparison of cell type proportions before and after MIL.**

The trend for blasts is consistent between the NPM1 and FLT3-ITD models. In bags with either genetic mutation, the proportion of blasts among the top 100 representative cells increases notably after applying MIL. In contrast, in bags without genetic mutations, the proportion of blasts among the top 100 representative cells decreases after applying MIL. This trend indicates that the presence of blasts is an essential indicator that the deep learning model uses to determine whether mutations are present. Additionally, in bags with genetic mutations, the proportions of other cell types decrease, indicating that the majority of features used by the model to determine the presence of mutations are in the blast category. Therefore, in non-mutated bags, because these features cannot be found in blast cells, the proportions of other cell types increase while the proportion of blasts significantly decreases.

Discussion

Analysis of AML cells often involves morphological examination^24,25,26,27,28 and complete blood count-based approaches³⁵, which have been applied in subtype classification and disease-associated cytomorphologies. Compared to previous studies, this work explores the potential of cell-level analysis using annotation-free data, demonstrating promising performance in predicting NPM1 mutations and FLT3-ITD. Furthermore, by applying MIL with upsampling and ensemble techniques, our research achieved AUC values of 0.90 ± 0.08 for NPM1 mutations and 0.80 ± 0.10 for FLT3-ITD. This underscores the strength of our approach in addressing the challenges of gene mutation prediction without the need for manual annotation of cell images.

We further explored the impact of gene interactions in our models. We observed that the presence of FLT3-ITD does not significantly influence the predictions of the NPM1 model (Supplementary Fig. 5.a). Conversely, for the FLT3-ITD model, the absence of concurrent NPM1 mutations leads to superior predictive performance compared to cases with concurrent NPM1 mutations (Supplementary Fig. 5.b). This suggests that predictions of FLT3-ITD in the presence of NPM1 mutations require greater scrutiny and caution, as the morphological impact of NPM1 mutations may significantly obscure or overshadow the features associated with FLT3-ITD. This overlap in morphological characteristics could potentially reduce the model’s ability to discern FLT3-related features accurately. We also investigated the concurrence of DNMT3A. As shown in Supplementary Fig. 6, the presence or absence of DNMT3A mutations does not appear to have a substantial impact on the performance of either model.

In addition to examining the impact of concurrent gene mutations on model performance, we also explored other factors. Age was found to have no significant effect on the models’ performance. For hyperleukocytosis, defined as WBCs ≥ 50,000/μL, the NPM1 model showed a decline in performance for subsets with hyperleukocytosis (Supplementary Fig. 7.a), while the FLT3-ITD model was relatively unaffected (Supplementary Fig. 7.b). Furthermore, we evaluated the impact of the NPM1 VAF cutoff using a threshold of 0.44³⁶. The NPM1 model performed exceptionally well in the high-VAF subgroup, achieving a PPV of 100% (Supplementary Fig. 8.a), though the small sample size requires further validation. Interestingly, for the FLT3-ITD model, performance declined in the high-VAF subgroup (Supplementary Fig. 8.b), consistent with the observation that NPM1 mutations negatively affect FLT3-ITD predictions. These findings demonstrate the potential for our models to assist clinicians in re-evaluating mutation status, opening opportunities for targeted therapies. Importantly, this methodology enables the reduction of false positive rate for NPM1 mutations (from 0.36 to 0.09) and FLT3-ITD (from 0.60 to 0.26), highlighting its utility in enhancing diagnostic confidence in mutation identification.

Our model demonstrates significant promise for clinical application. It is well-established that FLT3-ITD is an unstable mutation that can undergo clonal evolution during treatment³⁷. While the false positive rate of our model for predicting FLT3-ITD is not fully optimized, the results could encourage clinicians to perform confirmatory testing. This is particularly valuable given the availability of effective targeted therapies for patients with FLT3-ITD, even when the mutation is acquired during relapsed or refractory stages. Moreover, this study demonstrates the importance of different cell types in predicting gene mutations using MIL. Specifically, blasts were identified as critical indicators for both NPM1 mutations and FLT3-ITD prediction. Our model significantly enhanced the representation of blasts in mutated whole slide images (WSIs), underscoring their importance in mutation detection. Conversely, in standard WSIs, the increased proportion of other cell types alongside a reduced blast representation highlights a potential morphological distinction between mutated and standard cases.

Despite the advantages of our model, we acknowledge that a 26% false positive rate for FLT3-ITD mutations is not ideal for clinical application, particularly if this method were to be employed as the sole diagnostic tool during initial diagnosis. We currently recommend using a companion GENESCAN to confirm FLT3-ITD mutation status. Therefore, potential improvements of our model include embedding more detailed features that capture the relationships between additional myeloid-related gene mutations, which depend heavily on next-generation sequencing (NGS). Since NGS typically requires around three weeks to deliver results, incorporating these predictions into our model could dramatically shorten the time to actionable insights in clinical practice, enhancing its utility in time-sensitive scenarios. Additionally, enhancing the resolution of smear images would enable more precise visualization of morphological regions affected by mutations, providing clearer links between cellular morphology and gene mutations.

In conclusion, our findings underscore the potential of ensemble learning, upsampling techniques, and MIL for predicting gene mutations in AML patients. Importantly, we accomplished this using training data labeled exclusively at the slide level, eliminating the labor-intensive manual annotation required for cell images. This approach streamlined the end-to-end training and prediction process and emphasizes the value of integrating advanced machine-learning techniques to address complex real-world challenges in medical image analysis, offering a robust foundation for further exploration of the correlation between genetic status and cellular morphology.
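
The slide-level training setup can be illustrated with a small sketch of MIL pooling followed by ensemble averaging. This is not the authors' implementation: the attention-style pooling, the simple averaging of ensemble members, the toy linear scorers, and every name and number below are assumptions chosen for illustration. The point is only that each slide is treated as a bag of patch feature vectors carrying a single label, so no patch-level or cell-level annotation enters the computation.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax; the outputs sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(bag: Sequence[Vector], attention_weights: Vector) -> List[float]:
    """Collapse a bag of patch features into one slide-level vector."""
    weights = softmax([dot(patch, attention_weights) for patch in bag])
    dim = len(bag[0])
    return [sum(w * patch[i] for w, patch in zip(weights, bag)) for i in range(dim)]


def slide_probability(bag: Sequence[Vector], attention_weights: Vector,
                      classifier_weights: Vector, bias: float) -> float:
    """Slide-level mutation probability from one (hypothetical) MIL model."""
    pooled = attention_pool(bag, attention_weights)
    return 1.0 / (1.0 + math.exp(-(dot(pooled, classifier_weights) + bias)))


def ensemble_probability(bag: Sequence[Vector],
                         members: Sequence[Tuple[Vector, Vector, float]]) -> float:
    """Average the slide-level probabilities of all ensemble members."""
    probs = [slide_probability(bag, att, clf, b) for att, clf, b in members]
    return sum(probs) / len(probs)


# Toy slide: three patches, each described by two made-up features.
toy_bag = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
# Two made-up ensemble members: (attention weights, classifier weights, bias).
toy_members = [
    ([1.0, -1.0], [2.0, -2.0], 0.0),
    ([0.5, 0.0], [1.5, -1.0], -0.2),
]
toy_probability = ensemble_probability(toy_bag, toy_members)
print(f"Ensemble slide-level probability: {toy_probability:.3f}")
```

In a trained model of this kind, the per-patch attention weights also indicate which patches drive a prediction, which is one way a slide-level model can be related back to cell types such as blasts.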

Data availability

The gene mutation data used in this study was published in a previous work⁶. The WSIs contain sensitive patient information and are, therefore, not publicly available. However, the corresponding author can provide access to the WSIs upon request, subject to data-sharing agreements and privacy considerations.

Code availability

The code used for computational analysis in this study can be found at https://github.com/c4lab/AML-WSI.git. For any additional questions, please contact the corresponding author.

References

Döhner, H., Weisdorf, D. J. & Bloomfield, C. D. Acute myeloid leukemia. N. Engl. J. Med. 373, 1136–1152 (2015).
Döhner, H. et al. Diagnosis and management of AML in adults: 2022 recommendations from an international expert panel on behalf of the ELN. Blood 140, 1345–1377 (2022).
Stone, R. M. et al. Midostaurin plus chemotherapy for acute myeloid leukemia with a FLT3 mutation. N. Engl. J. Med. 377, 454–464 (2017).
Erba, H. P. et al. Quizartinib plus chemotherapy in newly diagnosed patients with FLT3-internal-tandem-duplication-positive acute myeloid leukaemia (QuANTUM-First): a randomised, double-blind, placebo-controlled, phase 3 trial. Lancet 401, 1571–1583 (2023).
Papaemmanuil, E. et al. Genomic classification and prognosis in acute myeloid leukemia. N. Engl. J. Med. 374, 2209–2221 (2016).
Tsai, X. C.-H. et al. Poor prognostic implications of myelodysplasia-related mutations in both older and younger patients with de novo AML. Blood Cancer J. 13, 4 (2023).
Perl, A. E. et al. Gilteritinib or chemotherapy for relapsed or refractory FLT3-mutated AML. N. Engl. J. Med. 381, 1728–1740 (2019).
Issa, G. C. et al. The menin inhibitor revumenib in KMT2A-rearranged or NPM1-mutant leukaemia. Nature 615, 920–924 (2023).
Bera, K., Schalper, K. A., Rimm, D. L., Velcheti, V. & Madabhushi, A. Artificial intelligence in digital pathology-new tools for diagnosis and precision oncology. Nat. Rev. Clin. Oncol. 16, 703–715 (2019).
Sirinukunwattana, K. et al. Gland segmentation in colon histology images: the GlaS challenge contest. Med. Image Anal. 35, 489–502 (2017).
McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).
Chandradevan, R. et al. Machine-based detection and classification for bone marrow aspirate differential counts: initial development focusing on nonneoplastic cells. Lab. Investig. 100, 98–109 (2020).
Reta, C. et al. Segmentation and classification of bone marrow cells images using contextual information for medical diagnosis of acute leukemias. PLoS ONE 10, e0130805 (2015).
Kockwelp, J. et al. Deep learning predicts therapy-relevant genetics in acute myeloid leukemia from Pappenheim-stained bone marrow smears. Blood Adv. 8, 70–79 (2024).
Dimitriou, N., Arandjelović, O. & Caie, P. D. Deep learning for whole slide image analysis: an overview. Front. Med. 6, 264 (2019).
Bejnordi, B. E. et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318, 2199–2210 (2017).
Burlutskiy, N., Gu, F., Wilen, L. K., Backman, M. & Micke, P. A deep learning framework for automatic diagnosis in lung cancer. Preprint at https://arxiv.org/abs/1807.10466 (2018).
Chuang, W.-Y. et al. Successful identification of nasopharyngeal carcinoma in nasopharyngeal biopsies using deep learning. Cancers 12, 507 (2020).
Coudray, N. et al. Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
Santos, A. G., da Rocha, G. O. & de Andrade, J. B. Occurrence of the potent mutagens 2-nitrobenzanthrone and 3-nitrobenzanthrone in fine airborne particles. Sci. Rep. 9, 1 (2019).
Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25, 1301–1309 (2019).
Wang, X. et al. Weakly supervised deep learning for whole slide lung cancer image analysis. IEEE Trans. Cybern. 50, 3950–3962 (2019).
Pan, S. & Secrier, M. HistoMIL: a Python package for training multiple instance learning models on histopathology slides. iScience 26, 108073 (2023).
Yudistira, N., Kavitha, M. S., Rajan, J. & Kurita, T. Attention-effective multiple instance learning on weakly stem cell colony segmentation. Intell. Syst. Appl. 17, 200187 (2023).
Sadafi, A. et al. Attention based multiple instance learning for classification of blood cell disorders. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part V 23, (eds Martel, A. L., et al.) 246–256 (Springer, 2020).
Sidhom, J.-W. et al. Deep learning for diagnosis of acute promyelocytic leukemia via recognition of genomically imprinted morphologic features. NPJ Precis. Oncol. 5, 38 (2021).
de Almeida, J. G. et al. Computational analysis of peripheral blood smears detects disease-associated cytomorphologies. Nat. Commun. 14, 4378 (2023).
Eckardt, J.-N. et al. Deep learning detects acute myeloid leukemia and predicts NPM1 mutation status from bone marrow smears. Leukemia 36, 111–118 (2022).
Muñoz-Aguirre, M., Ntasis, V. F., Rojas, S. & Guigó, R. PyHIST: a histological image segmentation tool. PLoS Comput. Biol. 16, e1008349 (2020).
Felzenszwalb, P. F. & Huttenlocher, D. P. Efficient graph-based image segmentation. Int. J. Comput. Vis. 59, 167–181 (2004).
Tayebi, R. M. et al. Automated bone marrow cytology using deep learning to generate a histogram of cell types. Commun. Med. 2, 45 (2022).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 4700–4708 (IEEE, 2017).
Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. YOLOv4: Optimal speed and accuracy of object detection. Preprint at https://arxiv.org/abs/2004.10934 (2020).
Abdulrahman, A. A. et al. Is a 500-cell count necessary for bone marrow differentials? A proposed analytical method for validating a lower cutoff. Am. J. Clin. Pathol. 150, 84–91 (2018).
Alcazer, V. et al. Evaluation of a machine-learning model based on laboratory parameters for the prediction of acute leukaemia subtypes: a multicentre model development and validation study in France. Lancet Digital Health 6, e323–e333 (2024).
Patel, S. S. et al. High NPM1-mutant allele burden at diagnosis predicts unfavorable outcomes in de novo AML. Blood 131, 2816–2825 (2018).
Schmalbrock, L. K. et al. Clonal evolution of acute myeloid leukemia with FLT3-ITD mutation under treatment with midostaurin. Blood 137, 3093–3104 (2021).

Acknowledgements

This project was supported by research grants from the National Science and Technology Council in Taiwan (NSTC 112-2221-E-002-184-MY3 and NSTC 112-2314-B-002-116-MY3). The authors express their sincere gratitude to the laboratory department staff at NTUH for their exceptional management of the WSIs and to the National Center for High-performance Computing (NCHC) for providing computational and storage resources.

Author information

Authors and Affiliations

Center for Advanced Computing and Imaging in Biomedicine, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei, 10617, Taiwan
Bo-Han Wei & Chien-Yu Chen
Department of Biomechatronics Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei, 10617, Taiwan
Bo-Han Wei & Chien-Yu Chen
Division of Hematology, Department of Internal Medicine, National Taiwan University Hospital, No.1, Changde St., Zhongzheng Dist., Taipei, 100229, Taiwan
Xavier Cheng-Hong Tsai, Kuo-Jui Sun, Wen-Chien Chou, Hwei-Fang Tien & Hsin-An Hou
Department of Medical Education and Research, National Taiwan University Hospital Yunlin Branch, No. 579, Sec. 2, Yunlin Rd., Douliu City, Yunlin County, 640203, Taiwan
Xavier Cheng-Hong Tsai
Department of Hematological Oncology, National Taiwan University Cancer Center, No.57, Ln. 155, Sec. 3, Keelung Rd., Da’an Dist., Taipei City, 106, Taiwan
Xavier Cheng-Hong Tsai & Sheng-Yu Hung
Genome and Systems Biology Degree Program, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei, 10617, Taiwan
Kuo-Jui Sun & Chien-Yu Chen
Department of Laboratory Medicine, National Taiwan University Hospital, No.1, Changde St., Zhongzheng Dist., Taipei, 100229, Taiwan
Kuo-Jui Sun & Wen-Chien Chou
Division of Hematology, Department of Internal Medicine, National Taiwan University Hospital Yunlin Branch, No. 579, Sec. 2, Yunlin Rd., Douliu City, Yunlin County, 640203, Taiwan
Min-Yen Lo
Department of Internal Medicine, Far-Eastern Memorial Hospital, No. 21, Section 2, Nanya S. Road, Banqiao District, New Taipei City, 220, Taiwan
Hwei-Fang Tien
Smart Medicine and Health Informatics Program, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei, 10617, Taiwan
Chien-Yu Chen
Center for Computational and Systems Biology, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei, 10617, Taiwan
Chien-Yu Chen

Authors

Bo-Han Wei
Xavier Cheng-Hong Tsai
Kuo-Jui Sun
Min-Yen Lo
Sheng-Yu Hung
Wen-Chien Chou
Hwei-Fang Tien
Hsin-An Hou
Chien-Yu Chen

Contributions

Bo-Han Wei, Xavier Cheng-Hong Tsai, and Chien-Yu Chen developed the overall research idea; Bo-Han Wei and Xavier Cheng-Hong Tsai designed the methods and procedures; Bo-Han Wei implemented the analytic pipeline; Xavier Cheng-Hong Tsai, Min-Yen Lo, and Sheng-Yu Hung collected image and clinical data; Bo-Han Wei and Kuo-Jui Sun analyzed the data; Bo-Han Wei, Xavier Cheng-Hong Tsai, Kuo-Jui Sun, and Chien-Yu Chen wrote the manuscript; Xavier Cheng-Hong Tsai, Wen-Chien Chou, Hwei-Fang Tien, Hsin-An Hou, and Chien-Yu Chen reviewed and edited the manuscript; and all authors approved the final manuscript.

Corresponding author

Correspondence to Chien-Yu Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

supplementary materials

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Wei, BH., Tsai, X.CH., Sun, KJ. et al. Annotation-free deep learning for predicting gene mutations from whole slide images of acute myeloid leukemia. npj Precis. Onc. 9, 35 (2025). https://doi.org/10.1038/s41698-025-00804-0

Received: 31 July 2024
Accepted: 02 January 2025
Published: 03 February 2025
DOI: https://doi.org/10.1038/s41698-025-00804-0