Introduction

Acute myeloid leukemia (AML) accounts for approximately 15–20% of childhood leukemia cases. Despite the widespread use of cytogenetics, genetic aberrations, and minimal residual disease-based assessment of treatment response for risk stratification in pediatric AML protocols1,2,3,4,5, about 30% of children with AML still experience treatment failure. While outcomes for low-risk pediatric AML have improved significantly over recent decades, high-risk AML in children remains associated with an adverse prognosis despite intensified chemotherapy1,6. The inherent heterogeneity of AML, reflected in the diversity of blast morphology, surface markers, and cytogenetic and genetic alterations, continues to pose a major challenge to effective treatment7.

The rapid identification of novel recurrent somatic mutations in adult patients with AML through next-generation sequencing has greatly advanced molecular classification. The first recurrent somatic mutations identified were in the IDH1 gene and its related gene IDH2, discovered via whole-genome sequencing. This was soon followed by the identification of a frameshift mutation in the DNA methyltransferase gene DNMT3A, observed in the first two patients studied8,9. Subsequent unbiased genome and exome sequencing revealed mutations in core RNA spliceosome genes (including SF3B1, SRSF2, U2AF1, and ZRSR2) in patients with myelodysplastic syndromes (MDS)10. These spliceosome mutations are present in 7% of de novo adult AML and 50% of secondary AML arising from MDS11,12. Additionally, mutations in genes of the multiprotein cohesin complex, such as RAD21, SMC1A, SMC3, and STAG2, occur in 10–15% of AML cases13,14. The discovery of these molecular genetic alterations, together with established cytogenetic abnormalities and genetic alterations, has contributed to refining the prognostic classification system in AML15,16,17,18.

Following the identification of recurrent genomic alterations in adult AML, several studies focusing on the pediatric population have revealed that gene mutations are significantly less frequent in pediatric AML than in adult cases19,20,21,22. RNA sequencing (RNA-seq) has shown that chromosomal translocations are more prevalent in pediatric patients, especially among younger children23,24. Notably, rearrangements involving KMT2A, as well as fusions with NUP98, members of the GLIS gene family, and UBTF tandem duplications (UBTF-TDs), are enriched in pediatric AML but are less common in adult patients25,26,27,28. These molecular distinctions underscore the need for dedicated studies to better understand pediatric AML.

Recent AML classification systems, including the International Consensus Classification (ICC) and the World Health Organization (WHO) classification, have primarily been based on recurrent chromosomal translocations and genetic alterations identified over the past 15 years29,30. However, the utility of these frameworks for clinical risk stratification in pediatric patients remains insufficiently studied. Many newly identified recurrent driver alterations are still classified as “AML with other defined genetic alterations” under the WHO classification, or as “AML, not otherwise specified” in the ICC system. Although Umeda et al.31 conducted a comprehensive analysis of previously published samples (primarily from individuals of European ancestry) and defined 23 distinct molecular subtypes based on specific mutations, these proposed subtypes require further validation in populations of non-European ancestry, such as those of East Asian descent. Therefore, the aim of this study was to apply a similar classification strategy to pediatric AML cases in Taiwan and to examine the association between genetic alterations and clinical outcomes in this population.

Results

Patient cohort

Clinical features and major molecular subtypes are summarized in Table 1. A total of 105 pediatric patients with AML of East Asian ancestry (59 male, 46 female) were included in the study. The median age at onset was 11.55 years (range: 0.11–17.75). Initial white blood cell (WBC) counts were below 100 × 10³/μL in 79% of patients, with a median WBC count of 21.58 × 10³/μL. According to the French–American–British (FAB) classification, the most common subtype was M2 (44.9%), followed by M4 and M5. The FAB subtype was unknown in four cases (Supplementary Fig. 1). Available cytogenetic data were consistent with the results of transcriptome analysis.

Table 1 Patients’ demographic characteristics.

Molecular classification of AML by transcriptome analysis

To genetically characterize AML according to the ICC or WHO 2022 criteria, RNA-seq and whole-exome sequencing (WES) data were used. RNA-seq was employed to assess gene expression profiles and detect fusion transcripts, enabling the identification of specific AML subtypes. Tumor-only whole-transcriptome RNA-seq was performed on all 105 patients. Tumor–germline WES was performed on 61 patients, and tumor-only WES data were available for 43 samples. Pathogenic fusions or structural variants were systematically screened, with most of these genetic alterations being class-defining for AML. Notably, 80% of patients had subtype-defining fusions such as KMT2Ar, RUNX1::RUNX1T1, CBFB::MYH11, and NUP98r. Subtypes defined by NPM1r and CEBPA mutations were also identified. Following the classification recently published by Umeda et al.31, other subtypes identified included UBTF-TD, GATA1, and MECOMr (Table 2). One rare fusion involving NPM1r is shown in Supplementary Fig. 2.

Table 2 Genetic subtypes defined by transcriptome analysis.

Transcriptomic analysis of 105 AML RNA-seq samples revealed distinct molecular subgroups within the disease. Uniform Manifold Approximation and Projection (UMAP) based on the top 1,000 most variable genes identified three clearly separable subtypes (RUNX1::RUNX1T1, PML::RARA [acute promyelocytic leukemia, APL], and CBFB::MYH11), while the remaining subtypes were not distinctly separated by gene expression alone (Fig. 1A and Supplementary Table 1). This pattern was consistent when the top 351, 500, 1,000, or 2,000 most variable genes were used (Supplementary Fig. 3). Heatmap clustering of the 13 AML subtypes with at least three samples confirmed distinct expression patterns for these three subtypes and highlighted a fourth cluster corresponding to KMT2Ar AML (Fig. 1B; see Supplementary Table 2 for differentially expressed genes across the 13 AML subtypes). Subtype-specific genes with the most elevated expression were identified for each of these four groups (Fig. 1C–F). Gene set enrichment analysis (GSEA) of 50 Hallmark pathways, based on differentially expressed genes for each of the four subtypes (RUNX1::RUNX1T1, APL, CBFB::MYH11, and KMT2Ar) compared with all other subtypes as controls, is shown in Fig. 1G (Supplementary Table 3).

Fig. 1

Integrative RNA sequencing (RNA-seq) analysis of acute myeloid leukemia (AML) tumors. (A) Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) plot based on the top 1,000 most variable genes from 105 AML tumor RNA-seq samples. The analysis reveals three major AML subtypes, RUNX1::RUNX1T1, APL (acute promyelocytic leukemia), and CBFB::MYH11, which are clearly separable from the others, while the remaining subtypes are not distinguished by gene expression alone. (B) Heatmap showing subtype-level gene expression profiles (column-wise normalized log₂[CPM]) across 13 AML subtypes with at least three samples (see Supplementary Table 1 for differentially expressed genes across 13 AML subtypes). Although, consistent with (A), only the same three subtypes are clearly separated, the cluster branches specific to four AML subtypes, comprising RUNX1::RUNX1T1 (n = 25), APL (n = 7), CBFB::MYH11 (n = 9), and KMT2Ar (n = 14), are highlighted in different colors. (C–F) Top genes with higher expression in each of these four AML subtypes. (G) Gene set enrichment analysis (GSEA) of 50 Hallmark pathways based on differentially expressed genes for each of the four subtypes highlighted in (B) compared with all other subtypes as controls.

We compared gene set enrichment profiles between the NTU cohort and the dataset reported by Umeda et al. (NG cohort)31 for the RUNX1::RUNX1T1, APL, KMT2Ar, and CBFB::MYH11 subtypes, focusing on gene sets with upregulated genes in each major subtype (see Methods for details). A total of 193 gene sets were commonly upregulated and reproducibly enriched (FDR q < 0.05) in both cohorts. Clustering by normalized enrichment score (NES) revealed distinct subtype-associated patterns (Supplementary Fig. 4A). CBFB::MYH11 exhibited predominant enrichment of 12 gene sets, including KRAS oncogene signature and TRIM24 target up (Supplementary Fig. 4B). KMT2Ar and CBFB::MYH11 shared 40 gene sets linked to immune and cancer-related pathways (Supplementary Fig. 4C). RUNX1::RUNX1T1 showed five uniquely enriched gene sets (Supplementary Fig. 4D). APL demonstrated selective enrichment of Targets of IGF1 and IGF2 up and shared Basement membranes enrichment with RUNX1::RUNX1T1 (Supplementary Fig. 4E). Additional clusters of gene sets (n = 71) enriched primarily in CBFB::MYH11 but also observed in other subtypes are shown in Supplementary Fig. 4F–H.

Somatic mutations determined by WES

Putative driver mutations were identified from tumor–germline matched samples using multiple somatic variant callers, including Mutect2 (v4.1.2.0)32, SomaticSniper (v1.0.5.0)33, VarScan2 (v2.4.3)34, MuSE (v1.0rc)35, and Strelka2 (v2.9.10)36. The median somatic mutation rate across the cohort was 0.97 mutations per megabase (Mb), and the tumor mutation burden for available samples and molecular subtypes is shown in Supplementary Fig. 5. Identified mutations were categorized into functional pathways commonly implicated in the pathogenesis of pediatric AML, including signaling pathways, transcription factors, epigenetic regulators, RNA-binding proteins, the cohesin complex, cell cycle regulators, and protein modification pathways (Fig. 2). This classification was largely based on the report of Umeda et al.31. Frequently mutated pathways included signaling (68.6%, 72/105), transcription factors (37.1%, 39/105), epigenetic regulators (26.7%, 28/105), the cohesin complex (7.6%, 8/105), protein modification (5.7%, 6/105), and RNA-binding proteins (3.8%, 4/105).

Fig. 2

Genomic landscape and WHO classification of pediatric acute myeloid leukemia (AML). This figure summarizes the genomic landscape of pediatric AML with a focus on key genetic alterations identified through Genomic Random Interval analysis. Genetic mutations are illustrated across six mutated pathways: signaling, transcription factors, epigenetic regulators, cohesin complex, protein modification, and RNA-binding proteins. The genetic subtypes are categorized according to the WHO 2022 5th edition of the classification. Risk stratification was determined by the classification framework proposed by Umeda et al.31. WHO, World Health Organization; MDS, myelodysplastic syndromes; FAB, French–American–British; WES, whole-exome sequencing; RNA-seq, RNA sequencing.

Signaling pathway mutations were identified in 80% (20/25) of patients with RUNX1::RUNX1T1 and in 89% (8/9) of patients with CBFB::MYH11. Notably, 11 c-KIT mutations were detected in 10 patients with RUNX1::RUNX1T1, representing 40% of this subtype, and in one patient with CBFB::MYH11. Fisher’s exact test showed that c-KIT mutations were significantly more frequent in CBF AML than in other genetic subtypes (p < 0.0001). ASXL1 or ASXL2 mutations were identified in 7 of 25 patients with RUNX1::RUNX1T1, but were absent in CBFB::MYH11 cases and detected in only 3 of 71 patients with other genetic subtypes. Overall, ASXL1 (n = 2) and ASXL2 (n = 5) mutations were significantly enriched in RUNX1::RUNX1T1 cases compared with all other subtypes combined (28% vs. 3.8%, Fisher’s exact test, p = 0.001, odds ratio = 9.68). In addition, NRAS mutations were detected in 36% (5 of 14) of patients harboring KMT2Ar. Two ASXL2 mutations and one ASXL1 mutation were identified in this group. Overall, RUNX1::RUNX1T1 exhibited a higher frequency of epigenetic mutations (56%) than CBFB::MYH11 (0%) and KMT2Ar (21.4%) (p = 0.0046, Fisher’s exact test). In contrast, the frequency of signaling pathway mutations did not differ significantly among the three genetic subtypes. Similarly, there were no significant differences in the transcription factor and cohesin pathways, with all three subtypes showing relatively low and comparable mutation frequencies. A separate heatmap focusing on these three genetic subtypes is shown in Supplementary Fig. 6.
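As an illustration of the enrichment test used above, the following minimal sketch applies a two-sided Fisher's exact test to the ASXL1/ASXL2 counts given in the text (7 of 25 RUNX1::RUNX1T1 cases versus 3 of 80 cases in all other subtypes). The function name is hypothetical and the code is a plain hypergeometric implementation, not the software used in the study. It returns the sample odds ratio, which may differ slightly from a conditional maximum-likelihood estimate such as the one some statistical packages report.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample_odds_ratio, p_value). The p-value sums the hypergeometric
    probabilities of all tables no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability of k mutated cases in row 1 with all margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    p_value = sum(
        prob(k) for k in range(k_min, k_max + 1)
        if prob(k) <= p_obs * (1 + 1e-9)  # tolerance for floating-point ties
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)

# Counts from the text: 7/25 mutated in RUNX1::RUNX1T1,
# 3/80 mutated in all other subtypes combined (0/9 + 3/71).
odds_ratio, p_value = fisher_exact_two_sided(7, 25 - 7, 3, 80 - 3)
print(f"sample odds ratio = {odds_ratio:.2f}, two-sided p = {p_value:.4f}")
```

The same function can be reused for the other subgroup comparisons by substituting the corresponding counts.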

FLT3 alterations were identified in 23 patients (21.9%), including 18 with FLT3 internal tandem duplications (FLT3-ITD; 17.1%), three with FLT3 tyrosine kinase domain mutations, and two with non-canonical FLT3 mutations (1.9%). These alterations were most prevalent in patients with NPM1 mutations (75%, 3/4), NUP98r (50%, 6/12), and DEK::NUP214 fusions (100%, 4/4). Additionally, five patients harbored RUNX1 mutations. A schematic of the RUNX1 protein domains and annotated mutations is presented in Supplementary Fig. 7. Rare somatic mutations of USP9X were also identified in two patients. Differences in clinical features, including initial white cell counts, cytogenetics, and FAB classification, between patients with FLT3-ITD and those with RUNX1 mutations are summarized in Supplementary Table 4.

Tumor mutational signature in pediatric AML

Mutational signatures were constructed across all paired samples using WES. The most prevalent signatures identified in this cohort were COSMIC signatures 1, 5, and 6, observed across nearly all AML subtypes. According to COSMIC (v102)37, Signature 1 and Signature 5 are classified as clock-like mutational signatures, both correlating with the age of the individual. Notably, Signature 5 has also been associated with bladder cancer harboring ERCC2 mutations and with cancers linked to tobacco smoking38. Signature 3, indicative of defective homologous recombination repair, was also detected and is known to be strongly associated with biallelic inactivation of BRCA1 and BRCA239. Consistent with previous studies40, AML subtypes with more than three samples in this cohort typically exhibited a combination of Signatures 1 and 5. No strong subtype-specific mutational signatures were detected, likely due to the limited sample size and the restriction of WES to coding regions (Supplementary Fig. 8).

Impacts of genetic subtypes on clinical outcomes

Because several subtypes had few cases, and in accordance with previous reports, we combined the high-risk genetic subtypes (UBTF-TD, GLISr, PICALM::MLLT10, and HOXr) into a single high-risk category. Subtypes represented by only one case, such as KMT2A-PTD and MECOMr, were excluded from the survival analysis. We then compared RUNX1::RUNX1T1, NPM1r, KMT2Ar, CBFB::MYH11, GATA1, CEBPA, and DEK::NUP214 with one another and with the combined high-risk category. The 5-year event-free survival (EFS) and overall survival (OS) for each molecular type are presented in Supplementary Fig. 9; the difference in EFS was of borderline significance (p = 0.06, Supplementary Fig. 9A), whereas the difference in OS was statistically significant (p = 0.034, Supplementary Fig. 9B). The high-risk genetic subtypes were significantly associated with poorer 5-year OS in multivariable analysis (p = 0.037; adjusted hazard ratio (HR) = 7.27; 95% confidence interval (CI): 1.13–46.69), although the association with poorer 5-year EFS was not significant (p = 0.255; adjusted HR = 4.91; 95% CI: 0.32–76.25) (Supplementary Table 5–1 and 5–2).

Additional prognostic markers provided by somatic mutations

The 5-year OS rate was significantly lower in patients with FLT3-ITD mutations (33.86%; 95% CI, 17.06–67.23) than in those without the mutation (65.96%; 95% CI, 55.63–78.21; p = 0.0066). The 5-year EFS was also poorer among patients with FLT3-ITD, although the difference was of borderline statistical significance (p = 0.058) (Fig. 3A and B). However, FLT3-ITD did not remain a significant factor in the multivariable analysis (Supplementary Table 6–1 and 6–2). Patients with RUNX1 mutations had significantly inferior 5-year EFS and OS rates compared with those without the mutations (EFS: 0.0%; 95% CI, 0.0–0.0% vs. 51.70%; 95% CI, 42.02–63.60; p = 0.013; OS: 0.0%; 95% CI, 0.0–0.0% vs. 62.35%; 95% CI, 52.71–73.75%; p = 0.00057) (Fig. 3C and D). RUNX1 mutations remained a significant adverse prognostic factor in the multivariable analysis (p = 0.009; HR, 5.09; 95% CI, 1.51–17.14) (Supplementary Table 7–1 and 7–2). Although c-KIT mutations in RUNX1::RUNX1T1 were associated with a higher risk of relapse, this association did not reach statistical significance in this cohort.

Fig. 3

Kaplan–Meier estimates of 5-year event-free survival (EFS) and overall survival (OS) according to FLT3-ITD and RUNX1 mutation status. (A, B) Patients with FLT3-ITD mutations showed significantly lower 5-year OS rates compared with those without the mutation (p = 0.007) and a trend toward poorer EFS (p = 0.058). (C, D) Patients with RUNX1 mutations demonstrated markedly inferior 5-year EFS and OS compared with wild-type cases (p = 0.013 and p = 0.0006, respectively). Survival curves were compared using the log-rank test.

Discussion

Although the cohort was relatively small due to the rarity of pediatric AML, RNA-seq analysis enabled the identification of several recently recognized genetic subtypes, including ETS, UBTF-TD, GLISr, and GATA1. Clinical outcomes varied among the genetic subtypes. FLT3-ITD mutations may help identify patients who could benefit from targeted therapy; however, such treatments are currently not approved for pediatric AML in Taiwan. While RUNX1 mutations are less common in pediatric AML, their presence was associated with worse outcomes.

In the current ICC and WHO classifications, many fusion genes enriched in pediatric AML have yet to be assigned to distinct categories29,30. Our pediatric cohort included several rare, high-risk AML subtypes. AML with NUP98r has been confirmed as high-risk in multiple pediatric studies due to its association with an increased risk of relapse27,28. Most cases of AML with KMT2Ar are also considered high-risk, though prognosis may vary depending on the fusion partner gene31. AML with GLISr, first described by Tanja et al.25 in acute megakaryocytic leukemia (AMKL), is not currently defined as a distinct molecular subtype in the ICC or WHO classifications. The recently discovered UBTF-TDs, initially identified in pediatric AML, have also been reported in adults and are associated with poor prognosis26,41. When analyzed together, the high-risk genetic subtypes including GLISr, UBTF-TDs, PICALM::MLLT10, and HOXr were significantly associated with poorer 5-year OS in multivariable analysis. The identification of these high-risk genetic fusions in pediatric AML is crucial not only for refined risk classification but also for enabling clinical trials of novel targeted agents in these subtypes. In the Children’s Oncology Group trial, treatment with gemtuzumab ozogamicin significantly improved EFS in pediatric AML with KMT2Ar, though OS remained unchanged42. Menin inhibitors represent another promising therapeutic class and may be effective in patients harboring KMT2Ar, NPM1 mutations, NUP98r, or UBTF-TDs43,44,45. Recent preclinical studies have further shown that CBFA2T3::GLIS2-associated AMKL is critically dependent on apoptotic pathways mediated by the BCL2 and BCL-xL proteins. This dependence highlights the therapeutic potential of targeting these pathways using venetoclax and navitoclax, respectively46. Venetoclax, which selectively inhibits BCL2, may be effective when combined with chemotherapy in some patients with relapsed CBFA2T3::GLIS2 AMKL47. Collectively, these early findings provide hope that novel, targeted therapies will improve outcomes for children with high-risk genetic subtypes of AML48.

Faber et al.49 utilized whole-exome or whole-genome sequencing to characterize RUNX1::RUNX1T1 (n = 85) and CBFB::MYH11 (n = 80) AML, identifying NRAS as the most frequently mutated gene in CBF-AMLs. They also observed enrichment of c-KIT exon 17 mutations in the RUNX1::RUNX1T1 cohort (p = 0.005), which correlated with poorer outcomes (p = 0.01), and reported that frameshift mutations in ASXL2 were common and exclusive to RUNX1::RUNX1T1 cases. In the present cohort, ASXL1/ASXL2 and c-KIT mutations were similarly enriched in RUNX1::RUNX1T1 AML, consistent with previous reports. While c-KIT mutations have been associated with inferior prognosis in CBF-AML in prior studies, this association was not observed in our cohort50,51,52. Beyond genetic fusion subtypes, several somatic genetic alterations also warrant consideration in risk stratification for pediatric AML. Although patients harboring FLT3-ITD tended to have poorer outcomes, this difference did not reach statistical significance in multivariable analyses within our cohort. Nevertheless, multiple international pediatric trials have demonstrated that the addition of sorafenib, an FLT3 inhibitor, may improve outcomes for patients with FLT3-ITD53. Expanding government-funded access to FLT3 inhibitors in regions such as Taiwan could provide substantial clinical benefit for this population. RUNX1 mutations represent another adverse prognostic marker, consistently associated with higher rates of treatment failure across both pediatric and adult AML cohorts54,55,56. In a Japanese cohort of 503 pediatric patients with AML, RUNX1 mutations were identified in 2.8% of cases and correlated with poor clinical outcomes, findings that are in line with our data60. Rare USP9X somatic mutations were also identified in two patients. USP9X has been identified as a female-specific B-ALL susceptibility gene associated with multiple congenital anomalies57. In sporadic pediatric B-ALL, USP9X acts as a tumor suppressor gene in both sexes57. In AML, inhibition of USP9X induces apoptosis in FLT3-ITD-positive cells58. However, the impact and pathogenesis of USP9X somatic mutations require further investigation.

This study has certain limitations. First, the cohort size was relatively small, which may have limited the detection of rare subtypes and reduced the statistical power of the survival analyses. Second, our previous analysis of treatment outcomes in pediatric AML showed that patients with low-risk genetic subtypes, particularly RUNX1::RUNX1T1 and CBFB::MYH11, experienced poorer outcomes than those reported in other international studies. This discrepancy may have confounded the identification of certain genetic prognostic markers in the present study59. A prospective clinical trial applying the same genetic tools to a larger cohort may be warranted in the future.

In conclusion, our findings are consistent with previous studies and highlight transcriptome analysis as a valuable tool for molecular subtyping in pediatric AML. Patients harboring high-risk genetic subtypes exhibited poorer outcomes compared to those with low-risk subtypes. Specifically, patients with FLT3-ITD mutations may benefit from targeted therapies, which necessitates government support in Taiwan. Additionally, patients with RUNX1 mutations showed particularly dismal outcomes, indicating a need for novel therapeutic agents or alternative treatment strategies. Finally, integrating minimal residual disease monitoring could enhance risk-adapted protocol design for pediatric patients with AML in Taiwan in the future60,61,62.

Methods

Patients and protocols

AML diagnosis was established based on morphological analysis of bone marrow or peripheral blood, combined with immunophenotyping using multiple combinations of monoclonal antibodies and cytogenetic studies. Between January 1997 and December 2019, a total of 105 pediatric patients with AML were enrolled. All patients received treatment according to the previously described TPOG AML-97A protocol59,63. The FAB classification system was used to determine AML morphological subtypes. Karyotypes were interpreted according to the International System for Human Cytogenetic Nomenclature. Beginning in 2012, reverse transcription-polymerase chain reaction assays were used to screen for two common fusion transcripts: RUNX1::RUNX1T1 and CBFB::MYH11. All patients underwent morphological, immunophenotypic, and cytogenetic analyses.

The Institutional Review Boards of the participating institutions approved the protocols, and the patients, parents, or guardians provided written informed consent, as appropriate. The TPOG-AML-97A protocol has been published in detail previously59,63.

DNA and RNA extraction

Total RNA was extracted from leukemic cells using NucleoZOL (Macherey–Nagel, Düren, Germany) according to the manufacturer’s instructions, while genomic DNA was extracted using standard phenol/chloroform-based methods. These procedures have been described in detail previously64,65,66. Briefly, one million cells were lysed overnight using lysis buffer. PureLink RNase A (Thermo Fisher Scientific, Waltham, MA, USA) was added to degrade total RNA, followed by incubation at 37 °C for 10 min. An equal volume of phenol–chloroform–isoamyl alcohol (25:24:1) was then vigorously mixed with the lysate, followed by centrifugation at 16,100 × g at 4 °C for 5 min. The upper aqueous phase was transferred to a fresh tube, and genomic DNA was precipitated by adding two volumes of pure ethanol pre-chilled to –80 °C. The DNA pellet was washed with 75% ethanol and rehydrated in Tris–EDTA buffer.

FLT3-ITD determined by DNA-PCR

FLT3-ITDs were detected by PCR amplification of genomic DNA using previously described primers20,67: forward (F) 5′-GCAATTTAGGTATGAAAGCCAGC-3′ and reverse (R) 5′-CTTTCAGCATTTTGACGGCAACC-3′. The 50 μL reaction mixture contained approximately 500 ng of genomic DNA and 10 pmol of each primer. Samples were amplified using standard PCR conditions, and PCR products were resolved on a 2% agarose gel alongside positive and negative controls.

Transcriptome sequencing and bioinformatic analyses

The TruSeq library preparation kit and the HiSeq 2000 sequencer (Illumina, San Diego, CA, USA) were used for RNA sequencing. The sequence reads were paired-end, 150 base pairs (bp) in length. All software was run using default parameters in a high-performance computing environment, and FASTQ files were aligned to the GRCh38 human reference genome using STAR (v2.5.3a)68. Gene annotation downloaded from the Ensembl website (http://www.ensembl.org/) on January 1, 2023, was used for STAR mapping and subsequent read-count evaluation. The bioinformatics tools used for fusion calling included Cicero69, STAR-Fusion (v1.8.1)68, FusionCatcher (1.2.0)70, squid (1.5)71, pizzly (0.37)72, arriba (1.2.0)73, and Pindel (0.2.0)74; Pindel, arriba, and Cicero were used in combination to search for internal tandem duplications75.

UMAP analysis

Gene read counts were calculated with RSEM76 from the STAR-aligned68 GRCh38 RNA-seq BAM files of the 105 AML samples. The resulting counts were transformed into log2(CPM) using the limma voom R package77. UMAP for dimension reduction78 with default settings (n components = 2, n neighbors = 15, and min distance = 0.3) was conducted using the top 1,000 most variable genes, selected by the median absolute deviation method as described by Umeda et al.31.
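The gene-selection step can be sketched as follows. This is a simplified, standard-library illustration: the log2(CPM) transform uses a small prior count and is not the limma voom implementation, and the function names and toy gene labels are hypothetical. In practice, the selected genes-by-samples matrix would then be passed to a UMAP implementation.

```python
from math import log2
from statistics import median

def log2_cpm(counts, prior=0.5):
    """Convert {gene: [count per sample]} to log2 counts-per-million.

    A small prior count avoids log(0); this follows the general idea of a
    voom-style transform but is only an approximation for illustration.
    """
    genes = list(counts)
    n_samples = len(counts[genes[0]])
    lib_sizes = [sum(counts[g][j] for g in genes) for j in range(n_samples)]
    return {
        g: [log2((counts[g][j] + prior) / (lib_sizes[j] + 1.0) * 1e6)
            for j in range(n_samples)]
        for g in genes
    }

def mad(values):
    """Median absolute deviation of a sequence of numbers."""
    center = median(values)
    return median([abs(v - center) for v in values])

def top_variable_genes(expr, n_top=1000):
    """Rank genes by MAD of their expression values and keep the top n_top."""
    ranked = sorted(expr, key=lambda g: mad(expr[g]), reverse=True)
    return ranked[:n_top]

# Toy read counts (hypothetical genes, four samples).
toy_counts = {
    "GENE_A": [10, 200, 15, 180],       # bimodal across samples
    "GENE_B": [50, 52, 49, 51],         # stable
    "GENE_D": [1000, 1000, 1000, 1000], # stable, highly expressed
}
toy_expr = log2_cpm(toy_counts)
print(top_variable_genes(toy_expr, n_top=1))
```

MAD is used rather than variance because it is less sensitive to a single outlying sample, which matters when some subtypes are represented by only a few cases.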

Identifying differentially expressed genes and enriched gene sets for each major subtype

A read count matrix for 105 AML RNA-seq samples (log₂[CPM]) was generated using the Limma Voom pipeline77. Differential gene expression analysis was then performed for each AML subtype (with n ≥ 3 samples) compared to all remaining samples combined as the control group. In the NTU cohort, two subtypes were dominant, including RUNX1::RUNX1T1 (n = 25) and KMT2Ar (n = 14), while other subtypes, such as CBFB::MYH11 (n = 9) and APL (n = 7), had fewer samples. Due to this imbalance, differential expression analyses using a simple “subtype vs. others” design may result in biased detection of genes that are lowly expressed in minority subtypes but highly expressed in the dominant ones. To minimize such bias and better capture subtype-specific expression patterns, we focused only on upregulated genes within each subtype. This approach reduces the likelihood of identifying genes upregulated in major subtypes but downregulated in minor ones, ensuring that the selected genes are more informative for distinguishing subtypes.

For each subtype, genes significantly upregulated at FDR q < 0.01 were selected. The mean expression of these genes across all samples was clustered and visualized using the MATLAB (version 2025b) function clustergram, which allows interactive selection and coloring for major subtype-specific clusters. Through this supervised clustering approach, we identified sets of highly expressed genes corresponding to four major AML subtypes with distinct clustering branches: RUNX1::RUNX1T1, CBFB::MYH11, APL, and KMT2Ar. These genes were further visualized in separate heatmaps to highlight top differentially upregulated genes among these subtypes.
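The selection of upregulated genes at a given FDR threshold can be sketched as below. The multiple-testing method is not stated in the text; this sketch assumes the Benjamini–Hochberg procedure, and the function names, input layout, and gene labels are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

def select_upregulated(results, q_cutoff=0.01):
    """Keep genes with positive log fold change and q below the cutoff.

    results: list of (gene, log_fold_change, p_value) tuples from a
    "subtype vs. all other samples" comparison.
    """
    qvalues = benjamini_hochberg([p for _, _, p in results])
    return [gene for (gene, lfc, _), q in zip(results, qvalues)
            if lfc > 0 and q < q_cutoff]

# Toy comparison for one subtype (hypothetical values).
toy_results = [
    ("G1", 2.1, 0.0001),   # up, significant
    ("G2", -3.0, 0.0001),  # down, significant: dropped by design
    ("G3", 1.5, 0.2),      # up, not significant
    ("G4", 0.8, 0.002),    # up, significant after adjustment
]
print(select_upregulated(toy_results))
```

Restricting the output to positive log fold changes mirrors the rationale given above: genes that appear "different" only because they are high in a dominant subtype are not reported as markers of a minority subtype.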

Gene set enrichment analysis (GSEA) was performed using the GSEA software with MSigDB human gene sets: H (hallmark, n = 50) and C2 (curated, n = 7,561)79. Since only the four major subtypes, comprising RUNX1::RUNX1T1, CBFB::MYH11, APL, and KMT2Ar, showed clear subtype-specific expression clusters in previous clustering analysis, GSEA was focused on pathways enriched in upregulated genes within these subtypes.

To assess reproducibility, the GSEA results from the NTU cohort were compared with those reported by Umeda et al.31 for the NG cohort. Although Umeda et al. did not provide raw expression matrices or DEG lists for their 887 AML samples, they published GSEA results (based on MSigDB C2 gene sets) obtained using a similar case–control design. We therefore overlapped the top enriched and upregulated gene sets (FDR < 0.05) from both NTU and NG cohorts to identify replicable, subtype-enriched pathways for the four major subtypes.

Finally, gene sets significantly enriched and upregulated (FDR < 0.05) in both the NTU and NG cohorts (n = 193) were clustered using the MATLAB clustergram function based on their normalized enrichment scores (NES). Sub-branches showing distinct or shared enrichment patterns across subtypes were colored and visualized in individual heatmaps to facilitate comparison of subtype-specific and common biological pathways.
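The cross-cohort overlap described above amounts to intersecting two filtered result tables. A minimal sketch follows, assuming each GSEA result is available as a (gene set name, NES, FDR q) tuple; the function names and gene set labels are hypothetical, and positive NES is used here as the working definition of "upregulated".

```python
def upregulated_sets(gsea_rows, fdr_cutoff=0.05):
    """Keep gene sets with positive NES and FDR q below the cutoff.

    gsea_rows: iterable of (gene_set_name, nes, fdr_q) tuples.
    Returns {gene_set_name: nes}.
    """
    return {name: nes for name, nes, fdr in gsea_rows
            if nes > 0 and fdr < fdr_cutoff}

def replicated_sets(cohort_a, cohort_b, fdr_cutoff=0.05):
    """Gene sets enriched in both cohorts, with the NES from each cohort."""
    a = upregulated_sets(cohort_a, fdr_cutoff)
    b = upregulated_sets(cohort_b, fdr_cutoff)
    return {name: (a[name], b[name]) for name in sorted(a.keys() & b.keys())}

# Toy results for one subtype in two cohorts (hypothetical values).
ntu_rows = [("SET_1", 2.3, 0.001), ("SET_2", 1.8, 0.04),
            ("SET_3", -2.0, 0.001), ("SET_4", 1.2, 0.30)]
ng_rows = [("SET_1", 2.0, 0.010), ("SET_2", 1.1, 0.20),
           ("SET_3", -1.9, 0.002), ("SET_5", 2.5, 0.001)]

shared = replicated_sets(ntu_rows, ng_rows)
print(shared)
```

The paired NES values returned for each shared gene set are the quantities that would then be clustered across subtypes.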

WES

The SureSelect Human All Exon 50 Mb or 38 Mb kit (Agilent) was used for library preparation, targeting a sequencing depth of 120×. Paired-end sequencing (2 × 150 bp) was performed on the Illumina HiSeq 2000 instrument. Sequencing adapters and low-quality reads (defined as reads in which more than 50% of bases had a Phred quality score below 5) were removed. The remaining high-quality reads were aligned to the human reference genome (GRCh38). Variant sites differing from the reference genome were then independently identified in each sample26. Somatic variants were called by paired analysis of diagnosis–germline samples. For 41 tumor WES samples without matched germline samples, we performed tumor-only variant calling using Mutect280, with a panel of normal samples (n = 96) aggregated internally at St. Jude Children’s Research Hospital80. To remove common SNPs, we excluded variants observed in gnomAD (v3.1.2)81 with a population-level allele frequency > 0.01 in any of four major populations (European, East Asian, South Asian, and African). Variant annotation was performed using ANNOVAR80, and coding mutations that potentially disrupt protein-coding sequences further underwent manual assessment, including evaluation of read depth, mapping quality, and strand bias, to eliminate additional artifacts82.
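The population-frequency filter can be illustrated with the short sketch below. The variant record layout and population keys are hypothetical and do not correspond to actual gnomAD field names; variants absent from the database are treated as having an allele frequency of zero, which is an assumption of this sketch.

```python
# Hypothetical population labels standing in for the four groups in the text.
POPULATIONS = ("european", "east_asian", "south_asian", "african")

def is_rare(variant, max_af=0.01):
    """True unless the allele frequency exceeds max_af in any population.

    variant: dict with an optional "af" mapping of population -> frequency.
    Missing frequencies are treated as 0.0 (not observed).
    """
    freqs = variant.get("af", {})
    return all(freqs.get(pop, 0.0) <= max_af for pop in POPULATIONS)

def filter_common_variants(variants, max_af=0.01):
    """Drop variants that are common in at least one population."""
    return [v for v in variants if is_rare(v, max_af)]

# Toy variant calls (hypothetical identifiers and frequencies).
toy_variants = [
    {"id": "var1", "af": {"european": 0.0001, "east_asian": 0.0002}},
    {"id": "var2", "af": {"european": 0.0005, "east_asian": 0.03}},  # common in one group
    {"id": "var3"},                                                  # not in database
]
kept = [v["id"] for v in filter_common_variants(toy_variants)]
print(kept)
```

Testing each population separately, rather than a pooled frequency, avoids retaining variants that are rare globally but common in the ancestry group of the cohort.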

Statistics

Complete remission (CR) was defined as trilineage hematopoietic recovery with less than 5% blasts in the bone marrow. Failure to achieve CR after two courses of induction chemotherapy was considered refractory disease. Relapsed disease was defined as recurrence after an initial CR. Secondary malignancies were categorized as separate events. Risk classification was based on the criteria described by Umeda et al. and prior cohort studies31. Patients were classified as high-risk if they had one of the following alterations: UBTF-TD, GLISr, PICALM::MLLT10, or HOXr. Subtypes represented by a single case, including KMT2A-PTD and MECOMr, were excluded from survival analysis, as were patients with acute promyelocytic leukemia. Differences in mutation frequencies between subgroups were evaluated using Fisher’s exact test. A two-sided p-value < 0.05 was considered statistically significant. OS was defined as the time from the start of treatment to death from any cause. EFS was defined as the time from the start of treatment to relapse or death from any cause. Patients who did not achieve first remission were assigned an EFS of zero. Patients without an event were censored at their last follow-up. Survival curves were estimated using the Kaplan–Meier method and compared by the log-rank test. Survival rates are reported as mean percentages ± standard deviation. Cox proportional hazard models were constructed for EFS and OS, with covariates including sex and white blood cell count (< 10 × 10³/μL or > 100 × 10³/μL). All statistical analyses, unless otherwise specified, were performed using SAS software, version 9.4 (SAS Institute, Inc., Cary, NC, USA).
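To make the survival definitions concrete, the following minimal sketch computes a Kaplan–Meier estimate from follow-up times and event indicators. It is a standard-library illustration with hypothetical function names and toy data, not the SAS procedure used in the study. A patient who never achieved remission enters with an EFS time of zero and an observed event, as defined above.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times: follow-up durations (e.g., years from start of treatment).
    events: 1 if the event was observed, 0 if the patient was censored.
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_removed
    return curve

def survival_at(curve, t):
    """Survival probability at time t (step function, 1.0 before first event)."""
    prob = 1.0
    for event_time, s in curve:
        if event_time <= t:
            prob = s
    return prob

# Toy EFS data: first patient failed induction (time 0, event observed).
toy_times = [0.0, 1.2, 2.5, 2.5, 4.0, 5.0]
toy_events = [1, 1, 0, 1, 0, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
print(toy_curve)
print(survival_at(toy_curve, 5.0))
```

Censored patients reduce the number at risk without lowering the curve, which is why patients still in follow-up at the data cutoff do not count as failures.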