Introduction

Acute myeloid leukemia (AML) is a heterogeneous hematologic malignancy primarily driven by gene mutations, gene rearrangements, and subsequent acquired cooperative mutations1,2. AML is characterized by over 20 frequently mutated genes, which can be broadly categorized into epigenetic regulators (e.g., DNMT3A, ASXL1, TET2), splicing factors (e.g., SF3B1, SRSF2), signaling transduction genes (e.g., FLT3, KIT, KRAS), tumor suppressors (e.g., TP53, WT1), and transcription factors (e.g., RUNX1, MYC)3,4. Identifying these mutations in AML patients is critical for guiding treatment options and prognostic stratification. However, some cases lack detectable coding mutations5, and patients within the same molecular subtype often exhibit significant clinical heterogeneity6. This suggests the involvement of unidentified non-coding mutations in AML pathogenesis. For instance, deletions in lineage-specific enhancer clusters can disrupt the spatiotemporal expression of MYC, contributing to AML development7. Functional screening of enhancers with high-frequency mutations in AML has further revealed that such mutations can lead to abnormal expression of genes like KRAS and PER28. Thus, detecting and interpreting non-coding mutations is essential for understanding AML heterogeneity.

The majority of disease-associated mutations are located in non-coding regions, particularly in cis-regulatory elements (CREs)9. Enhancers, a type of transcriptionally activated CRE, play a pivotal role in gene expression regulation and are typically found in open chromatin regions marked by specific histone modifications10. These elements contain transcription factor binding sites that recruit co-factors to initiate gene expression. Enhancers can function independently or synergistically as clusters, creating functional redundancy and complicating the study of individual enhancers11. Additionally, enhancers exhibit cell-type-specific activity, controlling the spatiotemporal expression of genes12. Mutations in the transcription factor binding sites of enhancers can disrupt this regulation, contributing to disease7. However, identifying functional non-coding mutations and elucidating their mechanisms remains a significant challenge.

Single-cell chromatin accessibility sequencing (scATAC-seq) has been employed to study epigenetic regulation in AML pathogenesis and relapse13,14,15,16,17, but has rarely been used to predict functional non-coding mutations. Unlike bulk sequencing, scATAC-seq enables the identification of non-coding mutations at single-cell resolution18,19 and links them to cellular traits such as epigenetic regulation and cell-type abundance20,21,22. Although methods exist to detect non-coding mutations using single-cell chromatin accessibility data18,19,23, the functional consequences and significance of these mutations remain poorly characterized.

In this study, we performed multi-omics sequencing, including scATAC-seq, ATAC-seq, RNA-seq, and H3K27ac CUT&Tag, on bone marrow samples from 10 AML patients. We revealed cellular composition heterogeneity in AML and developed an integrated pipeline based on scATAC-seq data to detect non-coding mutations and characterize their functional consequences. We identified sample-specific and lineage-specific non-coding mutations in AML and elucidated the functional importance of the regulatory elements they reside in, as well as their potential impact on transcriptional regulation.

Results

Single-cell and bulk multi-omics profiling of bone marrow in AML

To better resolve the heterogeneity of AML and enable subsequent validation, we performed scATAC-seq, ATAC-seq, H3K27ac CUT&Tag, and RNA-seq on bone marrow samples from 10 AML patients (AML1 to AML10) at the time of diagnosis (Fig. 1a, Supplementary Data 1). For the scATAC-seq experiment, we incorporated modifications inspired by Paired-seq24 methodology, while preserving the core ATAC sequencing component (“Methods” section). Each sample was uniquely labeled using Tn5 barcodes, allowing simultaneous processing of multiple samples with a low doublet rate (Supplementary Fig. 1a–c). We observed that the fraction of fragments in peaks (FriP) and the fraction of fragments in transcription start sites (FriTSS) were lower in the NIH3T3 cell line compared to SHARE-seq (Supplementary Fig. 1d–f), which may be caused by the lower sequencing depth. We performed scATAC-seq on 10 bone marrow samples, obtaining 69,088 high-quality cells with 1239 median unique fragments per cell (Supplementary Fig. 1g) and identifying 8 distinct clusters (Fig. 1b). Based on marker gene activity derived from scATAC-seq (Fig. 1c, Supplementary Fig. 1h) and gene expression measured by RNA-seq (Supplementary Fig. 1i), we annotated major malignant cell types, including hematopoietic stem cell-like (HSC-like, CD34), immature myeloid progenitor-like (IMP-like, MPO and PF4), megakaryocytic progenitor-like (MkP-like, PF4), granule-monocytic progenitor-like (GMP-like, MPO), monocyte-like (Mono-like, CD14, CEBPD), and classical dendritic cell-like (cDC-like, CD14, CLEC10A). Each tumor sample exhibited a dominant cell cluster with the expansion of tumor cells (Fig. 1d). Furthermore, the expression of marker genes in the annotated cell types aligned with RNA-seq results (Supplementary Fig. 1i), and the promoters of these marker genes showed varying degrees of accessibility across cell types (Fig. 1e). Additionally, these scATAC clusters were enriched for lineage-determining transcription factors (TFs) (Fig. 1f), such as RUNX1 and TAL1 in hematopoietic stem/progenitor cells (HSPC) clusters, CEBPA and SPI1 (PU.1) in myeloid lineage, and GATA2 and MYC in erythroid progenitor cells25,26.

Fig. 1: Single-cell and bulk multi-omics profiling of bone marrow samples from AML patients at diagnosis.
figure 1

a Overview of study design, sample collection, single-cell profiling, and analyses. b UMAP visualization for scATAC-seq across samples (left) and major hematopoietic cell types (right). c scATAC-seq gene activity scores for the representative marker genes across scATAC-seq clusters. Color indicates relative gene activity across all clusters, and dot size indicates the percentage of cells showing gene activity in each cluster. d Fraction of cells in each scATAC-seq cluster across samples. e Genome tracks of the accessibility near the marker genes. f UMAP for the representative motif activity score identified by ChromVAR across major hematopoietic cell types.

eMut detects and interprets single-cell non-coding mutations based on scATAC-seq data

The current mutation detection pipelines using single-cell ATAC-seq (scATAC) data can be categorized into three distinct groups based on their specific applications (Supplementary Data 2): (i) Monopogen19 and SComatic23, which are specifically designed for identifying putative somatic single-nucleotide variants (SNVs) in single-cell data without requiring matched normal samples; (ii) Alleloscope27, Copy-scAT28, epiAneufinder29, and AtaCNA30, which specialize in detecting copy number variants (CNVs) from scATAC data; and (iii) popular sequencing mutation detection tools, such as GATK31, SAMtools32, Strelka233, VarScan234, and VarCA18—an ensemble classifier integrating multiple variant callers—that have been adapted for single-cell data analysis. However, identifying functional non-coding mutations that lead to dysregulation of target genes and contribute to disease pathogenesis remains a significant challenge.

To address this, we developed eMut, an integrative pipeline designed with mutation detection and functional interpretation modules to systematically detect non-coding mutations using single-cell chromatin accessibility data and to characterize their potential functional impacts on gene regulatory programs and cellular lineages in AML (Fig. 2, see “Methods” section for details). First, leveraging reads from open chromatin regions, eMut identifies mutations in individual cells using tools such as Monopogen19 or GATK Mutect235, irrespective of whether tumor samples are paired with normal samples. Given the sparsity of scATAC-seq data, we further imputed candidate mutated cells by network propagation36, utilizing mutated cells (seed cells) within a cell-cell similarity graph (Fig. 2a). Additionally, eMut incorporates four modules for functional interpretation of detected non-coding mutations (Fig. 2b): (1) identifying cell-type- or lineage-specific mutations and assessing their potential roles in cell fate determination; (2) detecting hypermutated CREs with a significant excess of mutations to pinpoint potentially critical enhancers; (3) predicting the effects of mutations on transcription factor motifs (loss or gain) and linking mutated enhancers to target genes to identify functionally impactful non-coding mutations; (4) comparing target gene expression changes between mutated and wild-type cells (or samples). Collectively, eMut represents an integrative pipeline for detecting, imputing, and characterizing non-coding regulatory mutations with functional consequences at the single-cell resolution.

Fig. 2: Overview of eMut workflow to detect and interpret non-coding mutations at the single-cell level.
figure 2

a Mutation detection and imputation. eMut splits the alignment file (BAM-formatted) according to cellular barcode and detects mutations in open regions of the genome by matching sequence information to the reference genome. For each mutation, the mutated cells were imputed by network propagation using the SCAVENGE approach. b Functional interpretation of mutations at different levels, including for cell-type enrichment, hypermutated CREs identification, impact on transcription factor binding event, and target gene expression.

Both coding and non-coding mutations exhibit heterogeneity among AML patients

To characterize the genomic landscape of AML, we analyzed known coding genetic variants in 10 AML patients using diagnostic genetic testing, including cytogenetic and molecular genetic analyses (Fig. 3a, Supplementary Data 1). Consistent with previous studies37, coding mutations were detected in FLT3-ITD rearrangement (n = 3), and other AML-associated genes, such as NPM1, DNMT3A, CEBPA, IDH2, NRAS, and WT1. These mutations displayed distinct combinatorial patterns, unique to each patient (Fig. 3a). For non-coding mutations, we employed eMut to identify mutations in accessible chromatin regions using scATAC-seq data from these patients (without paired normal samples) (Fig. 3b). In total, we identified 2878 potential somatic non-coding mutations (median 317 per patient), most of which were patient-specific, underscoring the heterogeneity of non-coding mutations in AML (Fig. 3b, Supplementary Data 2). Among these, 1930 (67%) were supported by bulk ATAC-seq data, and 1569 (54.5%) were validated by H3K27ac CUT&Tag data (Supplementary Fig. 2). In addition, we found a positive correlation between the fraction of cells with somatic non-coding mutations and the percentage of AML blast cells (Pearson R = 0.57, p = 0.053; Fig. 3c and Supplementary Data 1). These non-coding mutations were mainly concentrated in the 5’ flanking regions, introns, and intergenic regions (Fig. 3d). To explore mutagenic mechanisms, we analyzed mutational spectra and signatures38. The non-coding mutation spectrum derived from scATAC-seq data showed high concordance with whole genome sequencing (WGS) from public AML datasets39 (cosine similarity 0.82), comparable to bulk ATAC-seq predictions (Fig. 3e, “Methods” section). By fitting COSMIC signatures to the observed mutational spectra, we identified three COSMIC single base substitution (SBS) signatures: a myeloid signature, SBS5 (a clock-like signature), and SBS6 (DNA mismatch repair) (Fig. 3f). These findings suggest both non-coding and coding mutations exhibit heterogeneity among AML patients but share underlying mutational processes.

Fig. 3: Mutation detection in AML by eMut.
figure 3

a Heatmap showing the somatic mutations and fusion genes supported by diagnostic genetic testing (cytogenetic and molecular genetic) (blue) or bulk RNA-seq data (yellow), both (red). b The distribution of somatic non-coding mutations across patients identified by eMut. AML8 was excluded from the analysis due to the absence of detectable somatic mutations. c Correlation between the percentage of blast cells with the fraction of mutated cells with somatic mutations (Pearson’s correlation test, alternative = “greater”). d The genomic localization of putative somatic mutations annotated using RefSeq hg38. e Comparison between the mutational spectra of the mutations detected in AML samples using scATAC-seq (mutations detected by Monopogen), ATAC-seq (mutations detected by GATK), WES-seq (data from TCGA), and WGS-seq (data from ICGC). f Cosine similarity values between mutational spectra with COSMIC mutational signatures (myeloid, SBS5, and SBS6) are shown in the bar plot.

Recurrent non-coding mutations across AML patients are cell-type-specific

AML is a disease characterized by abnormal hematopoietic differentiation and proliferation40. To investigate the potential role of non-coding mutations, particularly recurrent ones, in cell fate determination, we identified 52 recurrent non-coding mutations present in at least 3 patients (Fig. 4a). Cell-type enrichment analysis revealed that these recurrent non-coding mutations are cell-type-specific across AML patients (Fig. 4b, “Methods” section). For example, the mutation chr17:8156823:C:A, located on the enhancer of PER1, was identified in 6 AML samples, and showing significant enrichment in proliferating cells across various cell types (Fig. 4c). Due to the sparsity of single-cell data, we performed imputation on mutated cells to infer additional potential mutated cells (“Methods” section). Per1, which is involved in circadian rhythm transcription-translation feedback loops, has been implicated in leukemia stem cell function41. Another mutation, chr8:38030499:C:G, located on the promoter of EIF4EBP1, a gene essential for sustaining AML cells proliferation and survival42, was enriched in the HSC-like cluster (Fig. 4d). Additionally, we identified a mutation chr16:74666845:G:C, enriched in the Mono-like lineage and located in the promoter of RFWD3 (Fig. 4e), as well as a mutation chr20:62832771:G:A, enriched in the GMP-like lineage and situated in the enhancer of DPH3P1 (Fig. 4f). Collectively, these findings suggest that cell-type-specific recurrent non-coding mutations may play a role in determining cell fate in AML.

Fig. 4: Recurrent cell-type-specific non-coding mutations in AML.
figure 4

a Heatmap shows the 52 recurrent non-coding mutation profiles across patients (ranked by the number of patients with individual mutation). The nearest neighbor genes, annotation (E, enhancer; P, promoter), and chromosome coordinate for each mutation are shown on the left of the heatmap. The cumulative numbers of mutations for each sample are shown at the top of the heatmap. b The enrichment of mutations across various cell types. The dot size represents the number of mutated cells, the outer circle represents the enrichment significance using the hypergeometric test. cf The distribution of raw mutated cells (left) and imputed mutated cells (right) with mutations highlighted in (a), chr17:8156823:C:A (PER1, enhancer) (c), chr8:38030499:C:G (EIF4EBP1, promoter) (d), chr16:74666845:G:C (RFWD3, promoter) (e), and chr20:62832771:G:A (DPH3P1, enhancer) (f).

Cis-regulatory elements with mutations implicated in leukemogenesis

To investigate the hypothesis that genetic variants may drive enhancer dysregulation and contribute to AML, we mapped the identified putative somatic mutations to accessible cis-regulatory elements (CREs), identifying 2542 mutated CREs (Supplementary Data 3). Target genes linked to these mutated CREs showed significantly higher enrichment of AML-related genes and cancer-driver genes compared to CREs without somatic mutations (median enrichment of 5.7-fold, Fig. 5a). Functional enrichment analysis43 revealed that these mutated CREs were associated with apoptosis and hematopoietic cell fate determination, including the regulation of lymphocyte differentiation (Fig. 5b).

Fig. 5: Characterization and validation of mutated or hypermutated CREs.
figure 5

a The observed fraction of AML-related genes and cancer-driver genes (COSMIC) within the genes linked to CREs with (red) or without (gray) somatic mutations is shown. b The selected top functional enrichment analysis results (via rGREAT) of CREs with somatic mutations. c The observed fraction of AML-related genes and cancer-driver genes within the genes linked to hypermutated CREs is shown in orange, with gray representing the remaining CREs. d, e Percentage of mutations annotated with blood-related GWAS SNPs (d) and cis-eQTL (e). fh Functional validation of mutated CREs identified through the somatic mutation mode (f), germline mode (g), using eMut and by SComatic (h). Venn diagram representing the overlap of all CREs, mutated CREs, and functional CREs (overlapped with CREs reported by previous study) (top panel), along with a bar plot illustrating the percentage of mutated CREs confirmed by experimental validation compared to the rest of the CREs (bottom panel). The p-value was computed using the hypergeometric distribution (lower.tail = F).

In addition to somatic mutations, germline mutations have also been implicated in leukemogenesis44,45. Therefore, we analyzed germline mutations in AML (Supplementary Fig. 3). Given the substantial number of germline mutations, we adapted our approach to identify hypermutated CREs (defined by an observed excess expected somatic/germline mutations, see “Methods” section for details) rather than simply mutated CREs (with or without somatic mutations). Similarly, target genes of hypermutated CREs were enriched for AML-related genes and cancer-driver genes (Fig. 5c). We further assessed the prevalence of GWAS SNPs46 and cis-eQTL47 within these hypermutated CREs, observing a 1.6-fold enrichment of SNPs associated with blood-related traits and a 1.4-fold enrichment of cis-eQTL (Fig. 5d, e).

To validate the mutations detected by eMut could help screen for functional CREs in leukemogenesis, we collected experimentally discovered functional CREs in human leukemia8. For comparison, we also applied another mutation detection method, SComatic, to identify putative somatic mutations (Supplementary Fig. 4 and Data 2). Compared to SComatic, eMut demonstrated a superior balance between the number and proportion of functional CREs enriched among mutated CREs (Fig. 5f–h). For the mutation detection pipelines Monopogen and GATK utilized in eMut, the additional functional interpretation modules—such as the perturbed TF binding sites module for Monopogen and the hypermutated CREs module for GATK—enhanced the screening efficiency of functional CREs (Fig. 5f, g). Together, these findings suggest that CREs harboring genetic variants detected by eMut may play a critical role in regulating leukemia-related genes and contributing to leukemogenesis.

Candidate functional non-coding mutations in AML

Enhancers serve as platforms for binding lineage-regulating TFs, which control gene expression13,21. To further investigate the impact of non-coding mutations on transcriptional regulation, we assessed alterations of transcription factor binding motifs caused by these mutations. We found that the majority (83%) of somatic mutations significantly altered TF binding motifs, resulting in either loss or gain of binding sites for prominent factors such as EGR1 and lineage-determining TFs like SPI1 (Fig. 6a). Given the sparsity of single-cell data, we examined the impact of non-coding mutations on target gene expression by comparing mutated samples with wild-type samples for each mutation. Notably, 334 non-coding mutations were found to affect downstream target genes, with 208 mutations associated with upregulated expression and 126 mutations with downregulated expression, respectively (Fig. 6b). These findings suggested that non-coding mutations can induce transcriptional dysregulation to a significant extent. Furthermore, these transcriptionally dysregulated genes were significantly enriched for AML survival-related genes48 (Supplementary Fig. 5a), offering a potential basis for developing diagnostic biomarkers for risk stratification.

Fig. 6: Candidate functional non-coding mutations in AML.
figure 6

a The top 40 transcription factors ranked based on the number of mutations that could potentially affect their binding motifs. Pie chart showing the proportion of mutations affecting transcription factor binding (gain or loss). b The left heatmap displays the mutation profiles across patients, while the right heatmap shows the gene expression of the linked target genes for each mutation across samples by bulk RNA-seq (mutated patients vs wild-type, with three replicates per patient). c, d Genomic visualization of six mutations. In each example, the second row represents the reference sequence, the bottom row represents the mutated sequence (with the alt-allele base), and the top row represents the gain (c) or loss (d) motif by the mutation (top panel). The mutated position is highlighted by the red box. The boxplot shows target gene expression between mutated and wild-type samples (middle panel). The UMAP plots show the distribution of raw mutated cells (bottom panel). e The genomic track around the TCF3 locus. The position of the mutation is shown on the scATAC-seq tracks with the mutated cluster labeled in red. The CRE with mutation is highlighted in a gray background, and significant peak-to-gene linkages are indicated by loops. f The genomic track of H3K27ac CUT&Tag signal around the TCF3 with the mutated sample labeled in red.

Therefore, from the perspective of transcriptional dysregulation, we identified potential functional mutations that could directly alter TF binding and induce gene expression dysregulation (Supplementary Data 4). For instance, certain mutations were predicted to create a potential TF motif for transcriptional activators, which might associate with alterations in gene expression. For example, in HSC-like cells, gains in RXRA and E2F2 motifs were associated with increased expression of RUNX3 and ESR1, respectively, whereas ERG motif gain correlated with elevated ETV6 levels in GMP-like cells (Fig. 6c, Supplementary Fig. 6). In contrast, the disruption of ZNF628 motifs coincided with reduced RUNX2 expression in mono-like cells, and the loss of EGR1 motifs was accompanied by downregulation of BCL2L1 in cDC-like cells (Fig. 6d, Supplementary Fig. 6). Similarly, the loss of SP1 motifs was primarily associated with reduced TCF3 expression in mono-like cells, alongside decreased chromatin accessibility and diminished H3K27ac signals (Fig. 6d–f, Supplementary Fig. 6). Notably, these functional somatic mutations affecting TF binding and AML-related gene expression were specifically depleted in T/NK cells, whereas passenger mutations showed no such depletion in these cells (Supplementary Fig. 5b–j, see “Methods” section for details).

Additionally, our observations revealed that in certain cases, the expression alterations of downstream target genes may result from the cumulative effects of multiple mutations through regulatory cascades, rather than individual mutation events. For example, the mutation chr11:32421211:C:T resides in an active enhancer exhibited preferential accessibility in the HSPC-like clusters and was linked to the promoter of WT1 (Wilms tumor 1) (Supplementary Fig. 7a), a critical regulator of normal and malignant hematopoiesis49. This mutation was predicted to gain binding motifs of MYC and ATF4, but a downregulation trend in its target gene was observed in AML10 (Supplementary Fig. 7b–d). Interestingly, recurrent non-coding somatic and germline mutations within this enhancer suppressed enhancer activity and WT1 expression by disrupting MYB binding in acute promyelocytic leukemia (APL)45. Consequently, we focused our investigation on MYB, the primary transcription factor that natively occupies this enhancer region. Our analysis revealed that a specific mutation chr6:135184142:A:C, within the enhancer region was predicted to disrupt the RARA binding motif, resulting in decreased MYB expression in the same patient (Supplementary Fig. 7e–h). This regulatory alteration may potentially contribute to the observed downregulation of WT1 expression.

In summary, our findings demonstrated that functional non-coding mutations can alter TF motifs, thereby modulating the expression of downstream target genes, suggesting their potential role as regulatory drivers in AML pathogenesis. While further experimental validation is needed to fully establish the causal relationship between non-coding mutations and gene expression changes, the current evidence provides a foundation for understanding their contribution to disease mechanisms.

Discussion

Identifying mutations and gene rearrangements is crucial for the diagnosis and subtyping of AML. However, the role of non-coding genetic variants in AML pathophysiology and heterogeneity remains largely unclear. Here, we performed single-cell and bulk multi-omics profiling of bone marrow in AML and explored candidate functional non-coding mutations. Large-scale perturbation screening and functional analyses have demonstrated that non-coding mutations contribute to enhancer dysregulation in leukemogenesis8. Nevertheless, many of these mutations lack functional interpretation at the resolution of individual cell types. In this study, we developed eMut to identify and analyze potential functional non-coding mutations that contribute to cellular heterogeneity in AML. Similar to existing single-cell mutation detection methods, eMut addresses the challenges of low-quality and sparse single-cell data18,19,23 by performing imputation to enhance mutation profiles. Importantly, eMut employed a strategy with several key differences compared to the previous mutation detection pipelines, focusing on the functional consequences of non-coding mutations at four levels: cell-type-specific enrichment, the functional importance of the mutated enhancers, the potential disruption of transcription factor binding motifs, and the effects on downstream target genes.

Our results indicate that the majority of non-coding mutations are patient-specific, while recurrent mutations are cell-type-specific, highlighting the diversity of non-coding mutations in AML. Mutated CREs are enriched with blood-related genetic variants, potentially linked to AML-associated genes, and harbor a higher abundance of functional CREs, supporting the development of therapies targeting functionally mutated CREs, such as those targeting RUNX144. Furthermore, these functional non-coding mutations, which alter TF binding and impact proto-oncogenes (via gain-of-function) and tumor suppressors (via loss-of-function), help prioritize candidate mutations for experimental validation and may also serve as potential therapeutic targets through genetic, such as base editing of BCL11A enhancer for β-hemoglobinopathies50. Our results provide insights into the complex heterogeneity and pathogenesis of AML and offer a valuable reference to guide the development of diagnostic biomarkers for risk stratification and to advance the identification of therapeutic targets beyond the coding region.

This study has several limitations. First, unlike solid tumors, obtaining paired normal blood samples for somatic mutation detection in AML is challenging. Although using other normal tissues as controls is feasible, it is important to acknowledge that chromatin accessibility varies across cell types and patients12,13. This variability may limit the utility of scATAC-seq data in detecting somatic mutations, particularly those specific to hematopoietic cell types. To address this, we employed the Monopogen method19 to predict potential non-coding mutations within the open chromatin regions. However, it is essential to note that this method currently only predicts non-coding mutations on autosomes. Second, due to the lack of matched single-cell transcriptome data for most samples and the sparsity of such data, we assessed the impact of non-coding mutations on downstream target genes using bulk RNA-seq data. Consequently, it was not possible to entirely exclude the influence of co-occurring mutations on these genes. Third, while these analyses provide a valuable framework for linking genetic variation to disease phenotypes, experimental validation in an appropriate cellular context is essential to confirm the regulatory interactions between individual mutations and genes. In the future, it will be important to collect additional AML samples and conduct further experiments to validate the functional impact of selected candidate non-coding mutations, as demonstrated by the validation of WT1 non-coding mutations45.

Methods

Sample acquisition and patient consent

This study includes ten samples of acute myeloid leukemia (AML). Samples were collected from the Fujian Institute of Hematology, the Fujian Provincial Key Laboratory on Hematology, and Fujian Medical University Union Hospital, and all patients provided written informed consent. Clinical information (e.g., age and gender) along with diagnostic genetic testing (cytogenetic and molecular genetic) were obtained for each patient (Supplementary Data 1). The studies involving human participants were reviewed and approved by the ethics committee of Union Hospital, Fujian Medical University. All ethical regulations relevant to human research participants were followed.

Cell lines

The HEK293T cell line was a gift from Huafeng Xie’s laboratory at South China University of Technology, and the NIH3T3 cell line was a gift from Dawang Zhou’s laboratory at Xiamen University. Both HEK293T and NIH3T3 cells were cultured in DMEM with 10% FBS medium at 37 °C with 5% CO2 in a humidified incubator.

Bone marrow sample preparation

Bone marrow samples from patients were 1:1 mixed with cold D-PBS, then loaded to Ficoll-Paque PLUS (cytiva, 17144002) and centrifuged at 400 × g for 30 min. The bone marrow mononuclear cells (BMMCs) in the middle layer were transferred out and washed twice with cold D-PBS.

RNA-seq

One million BMMCs were collected, and total RNA was extracted by the MiniBEST Universal RNA Extraction Kit (Takara, 9767). Libraries of mRNA were constructed by Novogene (Beijing, China). Briefly, mRNA was purified using poly-T oligo-attached magnetic beads. Fragmentation was performed by divalent cations before first-strand cDNA synthesis by random hexamer and M-MuLV reverse transcriptase. The second strands were synthesized using DNA polymerase I and RNase H. After adapter ligation, 370~420 bp cDNA fragments were purified and amplified for PE150 sequencing by Illumina NovaSeq 6000.

Bulk ATAC-seq

One million BMMCs were collected and centrifuged at 300 × g for 5 min, resuspended in 200 μL lysis buffer (10 mM Tris buffer pH 8.0, 10 mM NaCl, 3 mM MgCl2, 0.1% CA630, 0.1% Tween-20, 0.01% Digitonin) on ice for 3 min, after adding 1 ml wash buffer, cells were centrifuged at 500 × g for 5 min, and resuspend with wash buffer. 50,000 cells from each AML sample were taken for ATAC reaction. ATAC reactions were performed following the manuscript of the TruePrep DNA Library Prep Kit (Vazyme, TD501).

CUT&Tag

CUT&Tag experiments were performed according to the manuscript of the Hyperactive Universal CUT&Tag Assay Kit (Vazyme, TD903). Briefly, 50,000 BMMCs were collected and incubated with ConA Beads, anti-H3K27ac antibody (Abcam, ab177178), goat-anti-rabbit secondary antibody (Vazyme), pA/G-Tnp, after tagmentation, accessible chromatin regions were purified and amplified for PE150 sequencing by Illumina NovaSeq 6000.

Diagnostic genetic testing

Diagnostic genetic testing was conducted by Tianjin Sino-US Diagnostics Laboratory, with G-band analysis performed on bone marrow samples obtained from patients to determine karyotypes. Molecular mutation screening for 40 genes (Supplementary Data 1) was performed using the Illumina high-throughput sequencing platform, based on target region probe capture technology. The mutation data were filtered and analyzed using public databases such as dbSNP (V156), 1000 Genomes51, gnomAD52, COSMIC53, and ClinVar54, alongside functional prediction software like PolyPhen-255 and SIFT56, as well as the internal database of Tianjin Sino-US Diagnostics Laboratory, to obtain the final analysis results.

Single-cell ATAC-seq library preparation

Cell preparation was similar to bulk ATAC-seq. Single-cell ATAC library construction steps were mainly referenced from Paired-seq24 and SHARE-seq57. Briefly, ATAC reactions were performed in 50ul ATAC buffer (20 mM Tris-HCl pH 8.0, 66 mM KCl, 3 mM MgCl2, 16% DMF) with barcoded-Tn5 at 37 °C for 30 min, and stopped by adding 10 μL stop buffer (50 mM Tris-HCl pH 8.0, 50 mM EDTA). Transposed cells were then pooled and split into 96-well plates containing distinct barcodes (Supplementary Data 5), after three rounds of barcode ligation, cells were pooled together and split into sub-libraries. Sub-libraries were reverse-crosslinked, and purified using PCR & DNA Cleanup Kit (NEB, T1030S), and amplified for 15 cycles. PCR products were then purified by 0.8× DNA Clean Beads (Vazyme, N411-02) for PE150 sequencing by Illumina NovaSeq X plus or BGI DNBSEQ-T7.

Processing and analysis of scATAC-seq data

Data preprocessing

The scATAC-seq sequencing data from 10 individuals were processed similarly to SHARE-seq57. First, cell barcodes were extracted from Read2 and added to the SeqID of Read1. Merged Read1 was mapped to the genome by BWA58. PCR duplicates were removed after mapping using Picard (http://broadinstitute.github.io/picard/). Fragment files were generated from BAM files using a custom script.

Identification of cell types

Then, preprocessing and downstream analysis of data used fragment files as input to ArchR (Version 1.0.2)59. Cells with a FriTSS below 2 or nFragments below 300 were all filtered from the following analysis. After quality control, we used the ‘addIterativeLSI’ function with 25,000 variable features and 30 dimensions for dimensionality reduction and ‘addClusters’ with resolution=0.8 to identify clusters. The two-dimensional representation of the data was generated using the ‘addUMAP’ function. The gene activity scores were similarly smoothed using the MAGIC algorithm60, and motif activity scores were computed with ChromVAR61. Both sets of scores were then visualized on Uniform Manifold Approximation and Projection (UMAP).

Linkage of cis-regulatory elements to target genes

Candidate peak-gene links were identified by the ‘addPeak2GeneLinks’ function in ArchR59. This procedure firstly links peaks located within a genomic distance of 250 kb to the TSS of each gene. Subsequently, the Pearson correlation coefficient of log2-normalized accessibility and gene activity scores was computed. Eventually, 33,483 high-confidence peak-to-gene linkages were determined by selecting those pairs with a Pearson correlation coefficient exceeding 0.2.

eMut overview

Mutation detection at the single-cell level

The eMut began with individual BAM files of single-cell sequencing data. It removed reads that have high alignment mismatches (default of four mismatches) and low mapping quality. For tumor samples with paired normal samples, somatic single-nucleotide variants (SNVs) and small insertions and deletions (INDELs) were called using GATK Mutect235 with its default settings and retained based on “PASS” labeling. For tumor samples without paired normal samples, eMut applied GATK Mutect2 to identify germline and somatic mutations and utilized Monopogen19 with its default setting to predict potential somatic mutations for each cell. All mutations were annotated by VEP (V102) and the dbSNP database (V156). The vcf2maf method was used to convert the VCF format files of annotated mutations into the MAF format, which is subsequently better handled.

Imputation of mutation profile

Considering dropout events due to the sparsity of single-cell technical data, we applied SCAVENGE36 with its default parameters to infer potential mutated cells. Briefly, an M-kNN graph was constructed based on scATAC-seq data to represent cell-cell similarity. For a given mutation, the mutated cells (as seed cells) were projected onto the M-kNN graph. Through network propagation of these seed nodes, relevant cells were identified as potential mutated cells.

Identification of cell-type-specific mutations

For each cell type, we determined whether the number of mutated cells for a given mutation significantly exceeds what would be expected by chance using a hypergeometric test. We defined mutations as over-occurring in a specific cell type if they have a false discovery rate (FDR) of less than 0.05.

Prediction of the disruptiveness of mutation on TF binding sites

To explore the impact of mutations in their located enhancer, motifbreakR62 was applied to predict TF motif disruptions (loss or gain) for numerous single-nucleotide variants using various sources of TF motifs (e.g., JASPAR63, ENCODE64). In the predicted results, “strong” effect motif change will be considered as the potential impact of mutations.

Identification of hypermutated CREs

Among all accessible regions, we adapted the ActiveDriverWGS method65 with modifications to identify hypermutated CREs based on scATAC-seq data. Specifically, we changed adjacent flanking genomic regions to flanking accessible regions (±500 kbps) for training the model of expected mutations, we identified hypermutated CREs (observed excess expected mutations) in each sample.

Mutation analysis in AML

Mutation detection

Considering AML tumor samples without paired normal samples, we applied the Monopogen module in eMut to identify their potential somatic mutations, and the GATK module to identify germline mutations. Additionally, SComatic was employed to call somatic mutations using parameters min_cell_type = 1 and min_cell_type = 2.

Hypermutated CREs annotation

We identified the hypermutated CREs of each AML sample based on germline and somatic mutations identified by the GATK module in eMut. The AML-related genes (MalaCards), hematopoietic-related GWAS46, and cis-eQTL47 (https://www.eqtlgen.org/phase1.html) datasets were downloaded to annotate these hypermuated CREs.

Mutational spectra and signatures

We conducted mutational spectra analyses for mutations in AML identified by scATAC-seq (detected by Monopogen or SComatic), ATAC-seq (detected by GATK), WES-seq (The Cancer Genome Atlas, https://portal.gdc.cancer.gov), and WGS-seq (ICGC39, http://dcc.icgc.org/releases/PCAWG/consensus_snv_indel/) using the R package MutationalPatterns38. Based on the COSMIC Mutational Signatures catalog v.366, we estimated the similarity of selected mutational processes to the observed mutational spectrum using the function get_known_signatures with default parameter values.

The depletion of functional somatic mutations in T/NK

Statistical analysis of the mutated cell frequency (for somatic mutations) and mutation frequency (for germline mutations) revealed a significant reduction in T/NK cells (Supplementary Fig. 5b–e). To account for confounding effects of open chromatin regions and sequencing depth, we normalized somatic mutation frequency using germline mutation frequency as a background for each cell (Supplementary Fig. 5f). Additionally, to determine whether functional somatic mutations were depleted in T/NK cells while passenger mutations were not, we calculated the somatic-to-germline mutation rate for four somatic mutation categories: all somatic mutations, somatic mutations altering TF motif, somatic mutations altering TF motif near AML-related genes, and somatic mutations altering TF motif that modulate AML-related gene expression. Cells were ranked based on these rates, and the top cells at different thresholds were selected to assess whether T/NK cells exhibited significant depletion compared to random.

Statistics and reproducibility

Gene expression differences and mutation frequencies/ratios were compared using the Wilcoxon rank-sum test (two-sided). Enrichment analyses, including cell type, functional CRE, survival gene, and mutated cell enrichments, were conducted using the hypergeometric test. Pearson correlation analysis evaluated the correlation between the mutated cell and blast cell proportions. Survival differences between groups were assessed using the log-rank test. All p-values were adjusted for multiple testing using the Benjamini–Hochberg false discovery rate (FDR) correction. Statistical significance was defined as: P ≥ 0.05 (not significant, ns); P < 0.05 (*); P < 0.01 (**); and P < 0.001 (***). To ensure reproducibility, bulk ATAC-seq served as biological replicates (10 AML samples) to validate scATAC-seq findings, while RNA sequencing included three technical replicates for data robustness.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.