Abstract
The distribution of m6A across RNA isoforms and its heterogeneity within single cells are still not well understood. Here, we develop m6A-isoSC-seq, which applies both Oxford Nanopore long-read and Illumina short-read sequencing to the same 10x Genomics single-cell cDNA library carrying APOBEC1-YTH-induced C-to-U mutations near m6A sites. Applying m6A-isoSC-seq to a pooled sample of three cell lines, we unveil a profound degree of m6A heterogeneity at both the isoform and single-cell levels. Through comparisons across single cells, we identify widespread isoform-specific m6A methylation, usually on misprocessed RNA isoforms. Compared to the coding isoforms of the same genes, the expression of highly methylated misprocessed RNA isoforms is more sensitive to METTL3 depletion. These misprocessed RNAs tend to have excessive m6A sites in coding regions, which are targets of CDS-m6A decay (CMD). This study offers previously undocumented insights into the role of m6A in RNA surveillance.
Introduction
N6-methyladenosine (m6A) is the most prevalent RNA modification on mRNAs. The best characterized role of m6A is promoting RNA degradation through its reader protein YTHDF2, suggesting a potential role of m6A in RNA surveillance1,2. Recent investigations have uncovered an inhibitory role of the exon junction complex (EJC) in the interaction between methyltransferase complexes and RNAs, which suppresses m6A near exon junctions3,4,5,6. These findings explain how the topological distribution of m6A on RNAs is formed and imply that m6A selectively modifies distinct mRNA isoforms. Notably, the vast majority of internal exons in canonical transcripts are short. Cells may therefore use this pattern as a rule to recognize normally processed RNA isoforms, because incorrect processing tends to produce RNAs with long internal exons. However, because the assignment of m6A to individual RNA isoforms has been ambiguous, whether m6A contributes to RNA surveillance by facilitating the degradation of misprocessed RNAs is still unknown.
The ambiguity of m6A modifications on isoforms primarily arises from the fact that the majority of m6A identification methods are unable to preserve intact RNAs. Antibody-based methods7,8,9,10,11,12 such as MeRIP-seq (also known as m6A-seq) fragment the RNAs before immunoprecipitation with an m6A antibody. Antibody-independent methods13,14,15 such as GLORI, which converts unmethylated adenosine into inosine, also cause severe chemically mediated RNA breakdown, despite providing accurate locations and quantifications of m6A sites at single-nucleotide resolution16. DART-seq induces C-to-U mutations beside m6A sites by expressing an APOBEC1-YTH fusion protein in cells; because it marks m6A in live cells, it preserves intact RNAs17.
m6A detection at single-cell resolution has recently been developed based on MeRIP-seq or DART-seq. MeRIP-seq based single-cell detection methods include scm6A-seq and picogram-scale MeRIP–seq (picoMeRIP-seq)18,19. scm6A-seq achieves single-cell resolution by barcoding the fragmented RNAs of each cell before MeRIP-seq, and it revealed that RNA abundances in the m6A immunoprecipitation (IP) fraction could contribute to the gene expression-based clustering of cells in mouse early development19. picoMeRIP-seq performs MeRIP-seq on picogram-scale input mRNAs by optimizing the original MeRIP-seq protocol18. However, MeRIP-based methods are not compatible with droplet-based single-cell sequencing platforms such as 10x Genomics; they can therefore profile only a small number of single cells, and their applications are largely restricted to early embryo development. In contrast, scDART-seq20, which is based on DART-seq17, can detect m6A in single cells using conventional single-cell RNA-seq because DART-seq introduces C-to-U mutations beside m6A sites in living cells. A recent study reported that scDART-seq could reveal m6A differences across cell populations clustered by gene expression in the hippocampus21. However, current scDART-seq based on droplet-based single-cell sequencing can only detect m6A sites at the 3’ end of RNAs. Additionally, because of the technical challenges in accurately quantifying m6A in single cells, it remains unclear whether m6A signatures alone, rather than gene expression, can correctly cluster different cell populations.
Because DART-seq preserves intact RNAs, it has the potential to detect m6A in full-length RNAs at single-cell resolution when combined with single-cell long-read sequencing. Third-generation long-read sequencing has demonstrated its capability to unravel the complexities of RNA isoforms22,23. Nevertheless, because of the relatively high sequencing error rate of long reads, accurately identifying cell barcodes and UMIs is a major challenge. Short-read single-cell sequencing of the same library has been shown to correct the demultiplexing errors of long-read single-cell sequencing24,25,26. However, whether this strategy, applied to DART-seq, can accurately decipher the heterogeneity of m6A on isoforms and in single cells is still unknown.
To comprehensively characterize m6A modification at the single-cell and isoform levels, we developed a method called m6A-isoSC-seq. This method combines Oxford Nanopore (ONT) long-read sequencing and Illumina short-read sequencing to sequence the same 10x Genomics library derived from cells transfected with the APOBEC1-YTH fusion protein. We applied m6A-isoSC-seq to a pooled sample of three different cell lines and revealed a high degree of m6A heterogeneity at the transcript isoform and single-cell levels. We also discovered a previously uncharacterized role of m6A in the RNA surveillance of misprocessed isoforms.
Results
Enhanced DART-seq improves the identification and quantification of m6A
To improve the C-to-U transition efficiency of DART-seq, we added a copGFP tag and a T2A (Thosea asigna virus 2A) self-cleaving peptide to the original APOBEC1-YTH vector used in DART-seq17 and used flow cytometry to sort the cells with high fluorescence intensity (Fig. 1a). We refer to this improved method as enhanced DART-seq (eDART-seq) and applied it to HEK293T, HeLa, and HepG2 cells, respectively (Supplementary Fig. 1a). Although eDART-seq uses the YTH domain of YTHDF2, which is known to promote RNA decay27, we did not observe significant down-regulation of gene expression in APOBEC1-YTH transfected cells compared to the control cells (Supplementary Fig. 1b, c). In addition, gene expression fold changes between edited and non-edited samples were very similar for genes with different mutation rates in eDART-seq (Supplementary Fig. 1d, e). These results suggest that the YTH domain alone is unlikely to induce RNA degradation, which is consistent with previous reports that YTHDF2 recruits the degradation machinery through its N-terminal domain rather than the YTH domain located at the C-terminus27,28,29,30. In addition, we found that transfected HA-tagged APOBEC1-YTH-copGFP proteins were expressed at approximately the same levels in both the nucleus and cytoplasm (Supplementary Fig. 1f). Furthermore, as with GLORI, the C-to-U mutation numbers in eDART-seq had a positive rather than negative correlation with the previously reported Nuc/Cyto ratio31 of RNA for each gene (Supplementary Fig. 1g–j), suggesting that eDART-seq is unlikely to introduce a technical bias in C-to-U mutations for RNAs with different cellular localizations. These results suggest that eDART-seq is a reliable method to measure m6A.
a Schematic diagram of the eDART-seq method. b Scatter plot demonstrating the correlation of C-to-U mutation rate (n = 9380) between DART-seq and eDART-seq. Pearson correlation coefficient and P value based on a two-tailed t test were shown. c Histogram showing the distributions of C-to-U mutation rates in DART-Seq and eDART-seq, respectively. d Normalized distributions of C-to-U mutation sites identified by eDART-seq across 5′UTR, CDS, and 3’UTR in HEK293T, HeLa, and HepG2, respectively. e Scatter plot demonstrating the correlation of C-to-U mutation rates between the mutation sites adjacent to m6A site and the mutation sites at two nucleotides downstream of the m6A sites (n = 404). Pearson correlation coefficient and P value based on a two-tailed t test were shown. f Scatter plot showing the −log10 Pearson correlation P values based on a two-tailed t test (y-axis) of the C-to-U mutation rates between each mutation site with certain distance (x-axis) to the m6A site and the adjacent C-to-U mutation site (+1 position) of the m6A site. g Histogram showing the distribution of the lengths of mutation peaks. h Box plot showing the correlation between max methylation levels of mutation peaks and the GLORI determined max m6A levels of the sites within the peak regions. P values of two-tailed Wilcoxon tests are indicated. For box plots, the horizontal line represents the median value, the lower and upper quartiles represent the 25th and 75th percentiles, and the whiskers are extended to 1.5 × IQR. Source data are provided as a Source Data file.
After filtering out all the C-to-U mutations that could also be induced by transfecting YTH-lacking APOBEC132, we identified 199,227 C-to-U mutation sites in the HEK293T cell line (Supplementary Fig. 1k), which is 4.9-fold the number of C-to-U mutation sites identified in the original DART-seq study17. Moreover, the mutation rates were also significantly higher than those observed in the original DART-seq study17 (Fig. 1b, c). The significantly higher editing efficiencies of eDART-seq were also observed at the known single-nucleotide m6A sites determined by GLORI16, miCLIP9, eTAM-seq33, m6A-SAC-seq34, m6ACE-seq35, MAZTER14, and m6A-REF-seq36, respectively (Supplementary Fig. 2a). Ectopic expression of 244 nt regions of ACTB and FEN1, each containing a GLORI-determined lowly methylated m6A site (level = 0.11 and 0.278, respectively), in HEK293T cells also confirmed that copGFP-positive cells had significantly higher editing efficiencies than unsorted cells (Supplementary Fig. 2b).
Consistent with the original DART-seq study17, the mutation sites were enriched in DRACH m6A motifs and near stop codons (Fig. 1d and Supplementary Fig. 2c–g). Of the mutation sites, 78.5% occurred 1 bp downstream of an adenosine, and 46.5% of these were in DRACH motifs; however, only 14.0% of these C-to-U mutations in DRACH were GLORI-annotated m6A sites in HEK293T cells (Supplementary Fig. 2h, i), suggesting that most APOBEC1-YTH-induced C-to-U mutations in DRACH do not represent genuine m6A sites. Similar results were observed in HeLa and HepG2 cell lines. On the other hand, although 64.3% of the total C-to-U mutations were not within a DRACH motif, the mutation sites in HEK293T were enriched within 100 bp of GLORI16-identified m6A sites of the same cell line, in contrast to randomly selected adenosines on expressed transcripts (Supplementary Fig. 3a, b). We further found that the C-to-U mutation rates at the sites immediately following the m6A sites (+1 positions) were significantly correlated with those at the +2 positions as well as those within 100 bp of the m6A sites (Fig. 1e, f). These results strongly indicate that the mutations away from the +1 position of m6A sites were also induced through the binding of APOBEC1-YTH to the m6A sites. Consistent with a previous study37, we found that the single-nucleotide m6A levels of both DART-seq and eDART-seq showed poor correlations with those measured by GLORI (Supplementary Fig. 3c–f). However, when we divided the transcriptome into 100 bp sliding windows and took the maximal m6A level in each window, the maximal m6A levels of the windows in eDART-seq showed significant positive correlations with those in GLORI and eTAM-seq (Supplementary Fig. 3g–j). Moreover, the average mutation rates of individual genes in eDART-seq showed significant positive correlations with the average m6A levels in GLORI and eTAM-seq, respectively (Supplementary Fig. 3k–n).
Overall, these results suggest that eDART-seq is a reliable method to quantify m6A at region level but not single-nucleotide level.
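The window-level comparison described above can be illustrated with a minimal Python sketch. It assigns each site to every 100 bp window that covers it and keeps the maximal level per window. The step size, the input layout `(transcript, position, level)`, and the function name `window_max_levels` are assumptions made for illustration; they are not taken from the study's pipeline.

```python
def window_max_levels(sites, window=100, step=50):
    """Return {(transcript, window_start): max level} for sliding windows.

    sites  : iterable of (transcript_id, position, level) tuples
    window : window length in bp (100 bp, as in the text)
    step   : distance between window starts (assumed; not stated in the text)
    """
    best = {}
    for transcript, pos, level in sites:
        # Windows start at k * step and span [start, start + window).
        first = max(0, (pos - window) // step + 1)  # first window covering pos
        last = pos // step                          # last window covering pos
        for k in range(first, last + 1):
            key = (transcript, k * step)
            if level > best.get(key, -1.0):
                best[key] = level
    return best


example_sites = [("tx1", 30, 0.2), ("tx1", 60, 0.5), ("tx1", 120, 0.1)]
window_levels = window_max_levels(example_sites)
```

The resulting per-window maxima from two methods (for example, eDART-seq and GLORI) could then be paired by window key and correlated.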
We then found that 57.5% of the 199,227 C-to-U mutation sites (Supplementary Fig. 3o), including those not within m6A motifs, were clustered into 14,918 m6A peaks with a median length of 240 bp in HEK293T cells (Fig. 1g). The m6A motif DRACH was significantly enriched in the peaks (Supplementary Fig. 3p), and 46.8% and 59.7% of the peaks contained single-nucleotide m6A sites identified by GLORI16 and miCLIP38, respectively (Supplementary Fig. 3q). To calculate the m6A levels of the peaks, we used the previously defined metric ε-score39 to measure the mean C-to-U mutation rates (MMRs) of the peaks. We found that the m6A levels of the mutation peaks were significantly and positively correlated with GLORI-quantified m6A levels (Fig. 1h), supporting the reliability of peak-based quantification of m6A levels. In addition, although the motif composition of single-nucleotide C-to-U mutations in DRACH is not consistent with the DRACH composition of real m6A sites, the most common m6A 5-mers, such as GGACT, were the most enriched in our m6A peaks, as defined by clusters of C-to-U mutations (Supplementary Fig. 3r), further suggesting that peak-based analyses of eDART-seq are much more accurate than single-nucleotide resolution analyses.
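To make the peak-level quantification concrete, the following Python sketch groups nearby mutation sites into peaks and computes a simple mean mutation rate per peak. The gap threshold, the minimum number of sites, and both function names are hypothetical, and the plain average used here is a simplification that stands in for, and is not the same as, the ε-score39 used in the study.

```python
def call_peaks(positions, max_gap=50, min_sites=2):
    """Merge sorted site positions into (start, end) peaks.

    Sites closer than or equal to max_gap are joined; clusters with fewer
    than min_sites sites are discarded. Both thresholds are assumed values.
    """
    peaks, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                peaks.append((current[0], current[-1]))
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        peaks.append((current[0], current[-1]))
    return peaks


def mean_mutation_rate(site_counts):
    """Average the per-site C-to-U rate over sites with coverage.

    site_counts : list of (u_reads, total_reads) tuples, one per site
    """
    rates = [u / n for u, n in site_counts if n > 0]
    return sum(rates) / len(rates) if rates else 0.0


peaks = call_peaks([10, 30, 60, 400, 900, 920])
mmr = mean_mutation_rate([(2, 10), (1, 4)])
```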
Single-cell level characterization of m6A
To demonstrate the ability of eDART-seq to resolve m6A modifications at the single-cell level, we evaluated a mixed cell population by pooling the copGFP-positive cells of three copGFP-tagged APOBEC1-YTH overexpressing cell lines (HepG2, HeLa, and HEK293T). We used both Illumina short-read and Oxford Nanopore long-read sequencing (ONT) platforms on the same 10x Genomics single-cell cDNA library with cell barcodes and unique molecular identifiers (UMIs) (Fig. 2a). A total of 310 million Illumina reads and 339 million ONT long-reads from 6917 cells were generated. As expected, gene expression based on Illumina reads clustered the cells into three distinct populations originating from the three different cell lines (Supplementary Fig. 4a–h). After filtering out all the C-to-U mutations that could also be induced by transfecting YTH-lacking APOBEC132, the Illumina single-cell sequencing data yielded 122,110 C-to-U mutation sites with distributions (Supplementary Fig. 5a–c) and mutation rates (Fig. 2b and Supplementary Fig. 5d, e) similar to those from bulk eDART-seq. We did observe an enrichment of C-to-U mutation sites at the 3’ end of RNAs (Supplementary Fig. 5c), reflecting the common 3’ bias of 10x Genomics single-cell data.
a Experimental workflow of m6A-isoSC-seq. b Scatter plot demonstrating the correlation of C-to-U mutation rate in bulk eDART-seq and pseudobulk of Illumina based single-cell data of HEK293T cells (n = 1584). Pearson correlation coefficient and P value based on a two-tailed t test were shown. c Volcano plot showing the correlation between m6A methylation level and gene expression at single-cell level in HeLa. Pearson correlation coefficient and P value based on a two-tailed t-test were shown. d Box plot demonstrating the correlation between normalized gene expression and the m6A-isoSC-seq determined m6A levels of FTH1 gene in single-cell level. P values of two-tailed Wilcoxon tests are indicated. e UMAP visualization of all cells clustered by MMRs in Illumina based single-cell data. f Heatmap of the MMRs showing top 50 differentially methylated marker peaks across different MMR classified cell clusters. g Stacked bar plot showing the percentages of cell types for each MMR classified cell cluster. The number of cells for the major cell type of each cluster is indicated in the corresponding area. h Distributions of the percentage of cells methylated in each mutation peak in HeLa, HepG2, and HEK293T, respectively. The methylated cell was defined as those cells with at least one C-to-U mutation in the peak. i Scatter plot demonstrating the correlation between the m6A methylation level of pseudobulk and the average m6A methylation levels of single cells for each mutation peak (n = 3161) in HEK293T. Pearson correlation coefficient and P value based on a two-tailed t-test were shown. j Scatter plot demonstrating the correlation between the m6A methylation level of pseudobulk and the percentage of methylated cells for each mutation peak (n = 2878) in HEK293T. Pearson correlation coefficient and P value based on a two-tailed t test were shown. 
For box plots, the horizontal line represents the median value, the lower and upper quartiles represent the 25th and 75th percentiles, and the whiskers are extended to 1.5 × IQR. Source data are provided as a Source Data file.
We then asked whether m6A levels alone could separate the pooled cells into the correct subpopulations of different cell line origins. We observed a significant negative correlation between gene expression and m6A levels calculated as MMRs at the single-cell level (Fig. 2c, d and Supplementary Fig. 5f, g). Furthermore, we found that the overall m6A modification levels of single cells were positively correlated with the expression of METTL3 (r = 0.27), METTL14 (r = 0.34), and WTAP (r = 0.28), the key components of the m6A methyltransferase complex (Supplementary Fig. 5h–j). These results suggest that m6A-isoSC-seq provides a reliable quantification of m6A at the single-cell level. To rule out the possibility that genes differentially expressed among the 3 cell lines technically contribute to the clustering of cells, we filtered out all 1023 marker genes that contributed to the clustering based on gene expression. Based on the m6A levels of 11,951 mutation peaks (C-to-U mutation clusters) (Supplementary Fig. 5k, l) with an average length of 330 bp (Supplementary Fig. 5m), the cells could be unambiguously grouped into 3 major clusters (Fig. 2e, f), in which more than 90% of the cells were HeLa, HepG2, and HEK293T cells, respectively (Fig. 2g), suggesting that m6A profiles are cell line-specific at the single-cell level. Indeed, 84 (73.6%) of the 114 marker peaks are methylated differently among the different cell lines (Supplementary Fig. 5n–q).
The bulk m6A methylation level of each peak depends on both the percentage of methylated cells and the m6A levels in the methylated cells. In contrast to the previous report, based on the original DART-seq with low C-to-U transition efficiency, that most m6A sites are methylated in a small fraction of cells17, we found that about half of the cells were methylated for each peak on average (Fig. 2h), suggesting that the high transition efficiency of eDART-seq is critical for identifying the correct m6A pattern at the single-cell level. Although the mean MMRs in single cells methylated at the corresponding peaks were strongly correlated with the MMRs of the pooled single-cell data (Fig. 2i and Supplementary Fig. 5r, s), we observed a much stronger correlation between the fraction of cells methylated at the corresponding peaks and the MMRs of the pooled single-cell data in all three cell lines (Fig. 2j and Supplementary Fig. 5t, u). These results suggest that the fraction of cells methylated at a peak makes a major contribution to the bulk m6A level of that peak.
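The three quantities compared above (the pseudobulk level, the fraction of methylated cells, and the mean level within methylated cells) can be sketched for a single peak as follows. Following the figure legend, a cell counts as methylated when it has at least one C-to-U mutation in the peak. The input layout, the use of covered cells as the denominator, and the function name `summarize_peak` are assumptions for illustration only.

```python
def summarize_peak(cell_counts):
    """Summarize one peak across single cells.

    cell_counts : {cell_barcode: (mutated_reads, total_reads)} for the peak
    Returns (pseudobulk_level, fraction_methylated, mean_level_in_methylated).
    """
    # Only cells with at least one read over the peak are considered (assumed).
    covered = {c: (m, n) for c, (m, n) in cell_counts.items() if n > 0}
    if not covered:
        return 0.0, 0.0, 0.0
    # Methylated cell: at least one C-to-U mutation in the peak.
    methylated = [m / n for m, n in covered.values() if m > 0]
    pseudobulk = sum(m for m, _ in covered.values()) / sum(n for _, n in covered.values())
    fraction = len(methylated) / len(covered)
    mean_level = sum(methylated) / len(methylated) if methylated else 0.0
    return pseudobulk, fraction, mean_level


cells = {"c1": (2, 10), "c2": (0, 10), "c3": (5, 10), "c4": (0, 0)}
pseudobulk, fraction, mean_level = summarize_peak(cells)
```

Computing these values for every peak and correlating each of the last two with the first would reproduce the type of comparison shown in Fig. 2i, j.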
Although a mixture of cell lines with different transfection efficiencies was used in this study, we additionally used two human embryonic stem cell (hESC) differentiation models to test whether the transfection process affects the estimation of cell proportions: one is the differentiation of hESCs into mesoderm, and the other is the differentiation of hESCs into definitive endoderm using an inefficient differentiation protocol (~60%), as described in a previous study40. Imaging with an HA antibody showed that the HA-tagged APOBEC1-YTH fusion protein was distributed in both the nucleus and cytoplasm of hESCs (Supplementary Fig. 6a). After differentiation into mesoderm, the proportions of BRACHYURY-marked mesoderm cells were equally high (98.15% vs. 97.62%) in the GFP+ and GFP− cell populations (Supplementary Fig. 6a, b). Similarly, after differentiation into definitive endoderm with the inefficient protocol, the FOXA2-marked (FOXA2+) endoderm cell population and the FOXA2− cell population had similar fractions of GFP+ cells (34.18% vs. 32.51%, P = 0.68, Supplementary Fig. 6c, d). These results suggest that although transfection efficiencies differ across cell lines, transfection of stem cells is not biased toward specific cell populations during differentiation, which enables the reliable use of m6A-isoSC-seq in profiling single-cell and isoform level m6A modifications across cell populations during differentiation.
Isoform level characterization of m6A at single-cell resolution
We then used the ONT reads of single cells to decipher the landscape of isoform and single-cell level m6A modification. Similar to previous single-cell ONT data24, the single-cell ONT reads had a median length (N50) of 917 bp and an average length of 862 bp. About 192 million of the 339 million raw ONT long-reads were uniquely mapped reads with poly(A) tails. We then used the cell barcodes and UMIs of the Illumina short-reads originating from the same 10x Genomics cDNA library as a whitelist to guide the identification of cell barcodes and UMIs in the ONT long-reads, and obtained ~38 million (19.8% of the ~192 million) long-reads with cell barcodes and UMIs for downstream analyses (Fig. 3a). Of the mappable ONT reads, 97% were mapped to the same genes as the Illumina reads with the same cell barcodes and UMIs, suggesting that the barcodes and UMIs of the ONT reads were correctly identified. Of the reads with a UMI and cell barcode, 61.1% covered more than 90% of the length of the full-length RNAs and 32.3% covered the entire full-length RNAs, whereas 17% covered only the 3’UTR of the mRNA. Both the ONT-determined expression of genes and that of transcripts could be used to classify the cells into 3 clusters similar to those classified by the gene expression determined from Illumina reads, further suggesting the high quality of the ONT single-cell sequencing (Supplementary Fig. 7a–f).
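The idea of whitelist-guided barcode identification can be illustrated with a short Python sketch: a barcode extracted from an error-prone long read is accepted only if it matches exactly one whitelist barcode within a small edit distance. This is a simplified stand-in, not the tool or parameters used in the study; the distance threshold, the example barcodes, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def assign_barcode(observed, whitelist, max_dist=1):
    """Return the unique whitelist barcode within max_dist, else None."""
    if observed in whitelist:
        return observed
    hits = [bc for bc in whitelist if edit_distance(observed, bc) <= max_dist]
    # Ambiguous matches are rejected rather than guessed.
    return hits[0] if len(hits) == 1 else None


# Whitelist built from short-read barcodes of the same library (toy values).
short_read_barcodes = {"ACGTACGT", "TTTTCCCC"}
corrected = assign_barcode("ACGTACGA", short_read_barcodes)
```

Rejecting ambiguous matches trades read yield for assignment accuracy, which is in line with only a fraction of the mapped long-reads being retained with a barcode and UMI.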
a Bar plot showing the proportion of remaining single-cell long-reads after step-by-step filtering. b Scatter plot demonstrating the correlation of C-to-U mutation rate between bulk eDART-seq and pseudobulk of ONT single-cell reads in HeLa (n = 51,184). Pearson correlation coefficient and P value based on a two-tailed t-test were shown. c Distributions of the percentage of cells with m6A methylation on each isoform in HeLa, HepG2, and HEK293T cells. The methylated cell refers to the cell with at least one C-to-U mutation in the specific transcript. d Scatter plot demonstrating the correlation between the m6A methylation level of pseudobulk and the percentage of methylated cells for each isoform (n = 3576) in ONT single-cell data of HeLa cells. Pearson correlation coefficient and P value based on a two-tailed t test were shown. e Scatter plot demonstrating the correlation between the m6A methylation level of pseudobulk and the average m6A methylation levels of single cells for each isoform (n = 2592) in ONT single-cell data of HeLa cells. Pearson correlation coefficient and P value based on a two-tailed t test were shown. f UMAP visualization of all cells clustered by isoform m6A levels in ONT single-cell data. g Heatmap of normalized m6A methylation levels for top 20 differentially methylated marker transcripts for each isoform m6A level classified cell cluster. h Stacked bar plot showing the percentages of cell types for each isoform m6A level classified cell cluster. The number of cells for the major cell type of each cluster is indicated in the corresponding area. i UMAP visualization of all HEK293T cells clustered by isoform m6A levels in ONT single cell data. j Top gene ontologies of HEK293T cluster2 (HEK293T_2) specifically expressed genes. Source data are provided as a Source Data file.
The long-read data yielded 69,428 C-to-U mutation sites from 5.7% of the long-reads. These mutation sites accounted for 24.5% of the C-to-U mutations in the Illumina-based bulk eDART-seq. Of note, the C-to-U mutation rates correlated well between the pooled single-cell ONT reads and the Illumina-based bulk eDART-seq reads at the single-site level (r = 0.75, 0.61, 0.67 for HeLa, HEK293T, and HepG2, respectively) and at the peak level (r = 0.4, 0.32, 0.39 for HeLa, HEK293T, and HepG2, respectively) (Fig. 3b and Supplementary Fig. 8a–e).
Similar to the Illumina single-cell data (Fig. 2h), on average, about half of the cells were methylated for each transcript (Fig. 3c). Moreover, in all three cell lines, the isoform m6A levels of the pooled single-cell data had a much greater correlation coefficient with the fraction of cells methylated for the isoforms (Fig. 3d and Supplementary Fig. 8f, g) than with the mean isoform m6A levels of the single cells methylated for the isoforms (Fig. 3e and Supplementary Fig. 8h, i), further suggesting that the fraction of m6A methylated cells makes a major contribution to the bulk m6A levels of isoforms.
To further clarify whether the m6A levels of RNA isoforms could be used to classify cells from different cell lines, we also used the mean C-to-U mutation rate (MMR) as the metric to measure the m6A level of each transcript. Based on the m6A levels of the transcripts without differential expression across the three cell lines, the single cells could still be clustered into 3 distinct subpopulations of different cell line origins (Fig. 3f–h). Moreover, 639 (72.0%) of the 887 marker transcripts displayed differential m6A levels in addition to differential expression across the cells from the 3 cell lines. As exemplified by the transcript NME2-201, the distributions of m6A levels are significantly different across single cells of different cell line origins (Supplementary Fig. 8j, k). Specific regions of this transcript isoform also exhibit distinct mutation rates in single cells of different cell line origins (Supplementary Fig. 8j).
Interestingly, we found that, based on the m6A levels of transcripts, the HEK293T cells could be grouped into three distinct subpopulations (Fig. 3i), and the expression of APOBEC1-YTH showed no difference among these subpopulations (Supplementary Fig. 9a). These three subpopulations could also be clustered according to isoform expression (Supplementary Fig. 9b, c). We further found that the marker genes of cluster 2 of the HEK293T cells were significantly enriched in GO terms associated with cellular detoxification, suggesting that the cells in cluster 2 are more sensitive to the toxicity of the treatments applied to the cells, such as transfection (Fig. 3j).
Isoform-specific m6A methylation contributes to cell line-specific expression of isoforms
We found 516 isoforms with significant negative correlations and only 25 isoforms with significant positive correlations between m6A methylation level and the expression of the corresponding isoforms across the single cells of HeLa (Fig. 4a). Moreover, these 516 isoforms with m6A levels negatively correlated with their expression had significantly higher m6A methylation than the other isoforms (Fig. 4b), which is consistent with the well characterized role of m6A in degrading RNAs. As exemplified in Fig. 4c, there is a strong negative correlation between the expression of the transcript ISG15-203 and its m6A levels across single cells. In addition, these 516 isoforms had significantly fewer exons than the other isoforms (Fig. 4d), which is consistent with the previous reports that mRNAs with fewer exons have higher m6A modification because the EJC inhibits m6A deposition3,5.
a Volcano plot showing the correlation between m6A methylation level and isoforms expression at single-cell level in HeLa. P values based on a two-tailed t test are shown. b Comparing the m6A level for isoforms (n = 516) with negative correlation between m6A level and expression and other isoforms (n = 3140) in single cells of HeLa. P value of two-tailed Wilcoxon test is indicated. c Box plot demonstrating the correlation between normalized isoforms expression and the m6A-isoSC-seq determined m6A levels of ISG15-203 in single-cell level. P values of two-tailed Wilcoxon tests are indicated. d Box plot comparing the number of exons for isoforms (n = 516) with negative correlation between m6A level and expression and other isoforms (n = 3140) in single cells of HeLa. P value of two-tailed Wilcoxon test is indicated. e Volcano plot showing the differential m6A levels of the isoforms in the same genes in HeLa. f Comparing the mean expression of highly methylated isoforms (n = 507) and lowly methylated isoforms (n = 354) in single cells of HeLa. g, h The overlaps of highly methylated isoforms (g) and lowly methylated isoforms (h) identified in different cells. P values from one-tailed Fisher’s exact tests are indicated. Comparing the m6A levels (i) and expression (j) between HeLa-specific highly methylated and lowly methylated isoforms in indicated cell lines. k C-to-U mutations on all the ONT reads of the different isoforms in indicated cell lines. l Violin plots comparing the m6A levels between EEF1D-202 and EEF1D-223 in different cell lines. m UMAP comparing the transcript expression level of the isoform EEF1D-202 (left) and isoform EEF1D-223 (right) in ONT single-cell data. The UMAP_1 and UMAP_2 values for each cell were identical to those presented in Fig. 3f. 
For box plots and violin plots, P values of two-tailed Wilcoxon tests are indicated, the horizontal line represents the median value, the lower and upper quartiles represent the 25th and 75th percentiles, and the whiskers are extended to 1.5 × IQR. Source data are provided as a Source Data file.
We then asked whether the isoforms of the same genes are methylated differently in single cells. To identify the isoforms that are methylated higher or lower than the other isoforms of the same genes, which are referred to as highly methylated isoforms (HMIs) or lowly methylated isoforms (LMIs), we compared the mean m6A levels of each isoform across methylated single cells with the combination of all other isoforms in the same genes. We found 507 HMIs and 354 LMIs out of 1973 transcripts included for the analysis (see “Methods”) in the single cells of HeLa origin (Fig. 4e). Notably, the HMIs had significantly lower expression levels than the LMIs, which is consistent with the notion that high m6A methylation of these isoforms may play a role in the degradation of these RNAs (Fig. 4f). Similar results were also observed in HEK293T and HepG2 cell lines (Supplementary Fig. 9d–i).
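As a rough illustration of the comparison described above, the sketch below labels an isoform as HMI or LMI by comparing its per-cell m6A levels with those of all other isoforms of the same gene. It substitutes a simple two-sided permutation test on the difference of means for the statistical procedure described in the Methods, which is not reproduced here; the significance cutoff, the number of permutations, and the function name `classify_isoform` are assumptions.

```python
import random


def classify_isoform(iso_levels, other_levels, n_perm=2000, alpha=0.05, seed=0):
    """Label one isoform relative to the other isoforms of its gene.

    iso_levels   : per-cell m6A levels of the isoform (methylated cells)
    other_levels : per-cell m6A levels of all other isoforms combined
    Returns (label, p_value) with label in {"HMI", "LMI", "unclassified"}.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible

    def mean(xs):
        return sum(xs) / len(xs)

    observed = mean(iso_levels) - mean(other_levels)
    pooled = list(iso_levels) + list(other_levels)
    k = len(iso_levels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:k]) - mean(pooled[k:])
        if abs(diff) >= abs(observed):
            extreme += 1
    p_value = (extreme + 1) / (n_perm + 1)  # add-one correction avoids p = 0
    if p_value >= alpha:
        return "unclassified", p_value
    return ("HMI" if observed > 0 else "LMI"), p_value


high = [0.5, 0.6, 0.55, 0.6, 0.5, 0.65]
low = [0.1, 0.12, 0.08, 0.1, 0.11, 0.09]
label, p_value = classify_isoform(high, low)
```

In a transcriptome-wide analysis, the resulting P values would additionally need correction for multiple testing.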
To further address whether the specifically methylated isoforms of the same genes are cell line-specific, we compared the HMIs identified in HEK293T, HeLa, and HepG2. We found that 91.7% and 87.5% of the highly methylated isoforms in HEK293T and HepG2, respectively, are shared with other cell lines (Fig. 4g), suggesting that m6A deposition on different isoforms is, in general, common across different types of cells. However, 346 (68.2%) of the 507 highly methylated isoforms in HeLa were HeLa-specific (Fig. 4g). Similar results were observed for the above-identified lowly methylated isoforms (Fig. 4h). We further confirmed that although these 346 HeLa-specific highly methylated isoforms were methylated significantly higher than the 242 HeLa-specific lowly methylated isoforms in HeLa, the two groups were methylated at similar levels in HEK293T and HepG2, respectively, indicating genuine HeLa-specific high and low methylation of these isoforms (Fig. 4i). Consistent with this finding, the expression levels of the HeLa-specific highly methylated isoforms were significantly lower than those of the HeLa-specific lowly methylated isoforms in HeLa but not in the other two cell lines (Fig. 4j).
As exemplified in the gene EEF1D, compared with transcript EEF1D-223, protein-coding transcript EEF1D-202 has higher m6A level and lower expression level in HeLa, which is not the case in the other two cell lines (Fig. 4k–m and Supplementary Fig. 9j). The results indicate that at least some of the isoform-specific m6A methylation is cell line dependent and may be involved in regulating the cell line-specific expression of transcript isoforms.
Intronic polyadenylation RNA isoforms with fewer exons are highly m6A methylated for RNA surveillance
Compared with the LMIs, the HMIs had significantly fewer exons (P = 4.1 × 10^−10, two-tailed Wilcoxon test), and in particular a significantly larger fraction of isoforms with no more than four exons (P = 4.0 × 10^−6, Fisher’s exact test) (Fig. 5a, b). This is consistent with the above finding that the isoforms whose m6A levels were negatively correlated with their expression across single cells had fewer exons (Fig. 4d). Similar results were also observed in HepG2 and HEK293T cell lines (Supplementary Fig. 10a–d). Based on the m6A levels of 3047 transcripts measured in the pseudobulk of the ONT single-cell data, we found a strong negative correlation between the number of exons and the m6A levels, with notably higher m6A levels for the transcripts with no more than four exons (Fig. 5c). Similarly, based on m6A levels in HEK293T cells measured by m6Aiso41, an ONT direct RNA sequencing (DRS)-based method that can detect m6A at single-read resolution, and by GLORI, we consistently found high m6A methylation of transcripts or genes with no more than four exons (Fig. 5d and Supplementary Fig. 10e), which is consistent with previous reports3,5.
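The enrichment of isoforms with no more than four exons is a 2 × 2 contingency comparison (HMI versus LMI, ≤4 exons versus >4 exons). The sketch below shows how a one-tailed Fisher’s exact test of this kind can be computed from the hypergeometric distribution using only the Python standard library. The function name `fisher_exact_greater` is an assumption for illustration, and the example counts are hypothetical: only the group sizes (507 HMIs, 354 LMIs) come from the text, while the split by exon number is invented.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, i.e. the probability of
    seeing at least `a` items in the top-left cell with all margins fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    # comb(n, k) is 0 when k > n, so impossible tables contribute nothing.
    favourable = sum(comb(row1, x) * comb(row2, col1 - x)
                     for x in range(a, upper + 1))
    return favourable / total

# Small sanity check with a table that can be verified by hand.
p_small = fisher_exact_greater(3, 1, 1, 3)

# Hypothetical split of 507 HMIs and 354 LMIs into <=4 exons / >4 exons.
p_example = fisher_exact_greater(200, 307, 90, 264)
```

Exact integer arithmetic is used for the binomial coefficients, so the only rounding happens in the final division.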
a Box plot comparing the number of exons for HMIs (n = 507) and LMIs (n = 354) in single cells of HeLa. b Stacked bar plot comparing the percentages of the isoforms with different numbers of exons between HMIs (n = 507) and LMIs (n = 354) of HeLa. c Comparing the mutation rates of C-to-U mutation sites in the isoforms with different numbers of exons in the pseudobulk of ONT single-cell data. d Box plot comparing the modification levels of m6A sites in the isoforms with different numbers of exons in HEK293T DRS data. e Pie charts comparing the percentages of IpA isoforms between the HMIs (n = 507) and LMIs (n = 354) of HeLa. P value of two-tailed Chi-squared test is indicated. f Box plot comparing the C-to-U mutation rates of mutation sites in IpA isoforms and their counterpart canonical isoforms in the pseudobulk of ONT single-cell data. g Stacked bar plot comparing the percentages of isoforms with different numbers of exons between IpA isoforms and their counterpart methylated canonical isoforms in the pseudobulk of ONT single-cell data. h Box plot comparing the distances of m6A sites to the nearest exon-exon junctions for IpA transcripts with their counterpart canonical transcripts in DRS datasets. i C-to-U mutations on all the ONT single-cell reads of RAD51C-203 (bottom, IpA) and RAD51C-201 (top, protein-coding). j UMAP showing the m6A level of RAD51C-203 (left, IpA) and RAD51C-201 (right, protein-coding) in each cell. The UMAP_1 and UMAP_2 values for each cell were identical to those presented in Fig. 3f. The appeared single cells are those with at least one C-to-U mutation in the corresponding isoform. For box plots and violin plot, P values from two-tailed Wilcoxon test are indicated, the horizontal line represents the median value, the lower and upper quartiles represent the 25th and 75th percentiles, and the whiskers are extended to 1.5 × IQR. For stacked bar plot, P values of one-tailed Fisher’s exact tests are indicated. 
Source data are provided as a Source Data file.
In addition, we found that 8.3% of these HMIs were intronic polyadenylation (IpA) RNA isoforms, compared with only 2.5% of the LMIs (Fig. 5e). Consistently, based on all the 2761 expressed protein-coding IpA isoforms, we found that the C-to-U mutations on IpA isoforms had significantly higher mutation rates than those on their counterpart canonical isoforms (P = 6.2 × 10^−6, two-tailed Wilcoxon test) (Fig. 5f). Similar results were also observed using m6Aiso-measured m6A levels (P = 9.1 × 10^−8, two-tailed Wilcoxon test) (Supplementary Fig. 10f). Compared with their m6A-methylated counterpart canonical isoforms, the m6A-methylated IpA isoforms were significantly enriched in those with no more than four exons (P = 4.8 × 10^−6, Fisher’s exact test) (Fig. 5g and Supplementary Fig. 10g). Moreover, we found that the m6A sites on IpA isoforms had significantly longer distances to their closest exon-exon junctions than those on their counterpart canonical isoforms (Fig. 5h and Supplementary Fig. 10h). The m6A sites on IpA isoforms that were more than 200 nt from their closest exon-exon junctions had significantly higher m6A levels than those on their counterpart canonical isoforms, whereas the difference was not significant for sites within 200 nt (Supplementary Fig. 10i). As demonstrated in Fig. 5i, the IpA isoform RAD51C-203, which has two exons, is methylated at a higher level than RAD51C-201, a protein-coding isoform that contains nine exons. In addition, RAD51C-203 also displayed a higher m6A level than RAD51C-201 at the single-cell level (Fig. 5j and Supplementary Fig. 10j). These results suggest that high m6A methylation of these prematurely terminated isoforms resulting from IpA may play a role in RNA surveillance of these misprocessed RNAs.
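The distance from an m6A site to its closest exon-exon junction, and the 200 nt cutoff used above, can be sketched as follows. The sketch assumes that a transcript is described by its ordered exon lengths and that a site is given as a 0-based position on the spliced transcript; the function names and the coordinate convention are illustrative assumptions, not the pipeline used in this study.

```python
from itertools import accumulate

def junction_positions(exon_lengths):
    """Transcript coordinates of exon-exon junctions.

    A junction at position j lies between bases j-1 and j of the spliced
    transcript. A transcript with n exons has n-1 junctions.
    """
    return list(accumulate(exon_lengths))[:-1]

def distance_to_nearest_junction(site, exon_lengths):
    """Distance (nt) from a 0-based transcript position to the closest junction.

    Returns None for single-exon transcripts, which have no junction.
    """
    junctions = junction_positions(exon_lengths)
    if not junctions:
        return None
    return min(abs(site - j) for j in junctions)

def is_far_from_junctions(site, exon_lengths, cutoff=200):
    """True if the site is more than `cutoff` nt from every junction."""
    distance = distance_to_nearest_junction(site, exon_lengths)
    return distance is not None and distance > cutoff

# Toy transcript with exons of 150, 900 and 120 nt (junctions at 150 and 1050).
toy_exons = [150, 900, 120]
d_mid = distance_to_nearest_junction(600, toy_exons)   # middle of the long exon
d_edge = distance_to_nearest_junction(170, toy_exons)  # just past the first junction
```

With this toy transcript, a site in the middle of the long internal exon lies well beyond the 200 nt cutoff, whereas a site 20 nt downstream of the first junction does not.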
Mis-spliced RNAs are highly m6A methylated by generating long internal exons
Besides IpA isoforms, we also found that the HMIs had significantly higher proportions of isoforms that are targets of nonsense-mediated mRNA decay (NMD), non-coding isoforms of protein-coding genes (non-coding RNAs), and isoforms with retained introns (Fig. 6a and Supplementary Fig. 11a, b). Moreover, we found that the m6A levels of all the analyzed NMD isoforms, non-coding isoforms, and isoforms with retained introns across the methylated single cells were also significantly higher than those of the protein-coding isoforms of the same genes, based on both DART-seq of pseudobulk and m6Aiso of DRS data (Fig. 6b, c and Supplementary Fig. 11c–e). In ONT direct RNA sequencing data of HEK293T cells42, the m6A levels of isoforms were positively correlated with the degrees of expression up-regulation upon METTL3 knockout (Supplementary Fig. 11f, g). Using these data, we found that, compared with the protein-coding isoforms of the same genes, the NMD, retained-intron, and non-coding isoforms exhibited greater degrees of up-regulation upon METTL3 knockout (Fig. 6d). Meanwhile, for all isoforms in each of these categories, we also found that the log2 fold changes (LFCs) of isoform expression upon METTL3 KO were positively correlated with the m6A levels measured by m6Aiso (Supplementary Fig. 11h).
a River plot comparing the compositions of transcripts between HMIs (n = 507) and LMIs (n = 354) in HeLa. P value of two-tailed Chi-squared test is indicated. b Box plot comparing the mean m6A methylation levels of single cells for NMD (n = 86), retained intron (n = 155), and non-coding transcripts of protein-coding genes (n = 78) with their counterpart protein-coding transcripts in HeLa. c Box plot comparing the methylation levels of m6A sites in misprocessed isoforms with their counterpart protein-coding transcripts in HEK293T DRS data (PRJNA1044456). d Comparing the expression changes of misprocessed isoforms with their counterpart protein-coding transcripts in wild-type and METTL3 knockout cells in HEK293T DRS data (PRJEB40872). e Metagene plot showing the C-to-U mutation sites distributed on NMD transcripts across the 5′UTR, CDS, and 3’UTR defined by the premature stop codon (green line) and the canonical stop codon used by the counterpart non-NMD isoforms (purple line), respectively. f Stacked bar plot comparing the percentages of isoforms with at least one long internal exon (>400 nt) out of methylated retained-intron isoforms and their counterpart methylated canonical isoforms in the HEK293T DRS dataset. P values of one-tailed Fisher’s exact tests are indicated. g C-to-U mutations on all the ONT single-cell reads of the retained-intron isoform RPS11-209 (bottom) and its counterpart protein-coding isoform RPS11-201 (top). h Comparing the expression changes for misprocessed isoforms with their counterpart protein-coding transcripts, respectively, in DMSO and SMG1i treated HEK293T cells. Violin plot comparing the expression of canonical transcripts (i) and NMD transcripts (j) with different m6A levels in DMSO treated HEK293T cells. k Violin plot comparing the expression changes of NMD transcripts across different groups of m6A levels in DMSO and SMG1i treated HEK293T cells. 
For box plots and violin plots, P values from two-tailed Wilcoxon test are indicated, the horizontal line represents the median value, the lower and upper quartiles represent the 25th and 75th percentiles, and the whiskers are extended to 1.5 × IQR. Source data are provided as a Source Data file.
We were then interested in how these mis-spliced RNA isoforms became highly methylated. When we studied the locations of the m6A sites on the NMD isoforms, retained-intron isoforms, and non-coding isoforms, we found that the m6A sites of these isoforms were still enriched near the positions corresponding to the stop codons of their counterpart protein-coding isoforms (Fig. 6e and Supplementary Fig. 11i). In addition, the NMD RNAs showed apparent m6A enrichment near their premature termination codons (PTCs) (Fig. 6e). For example, FIS1-204, an NMD transcript of FIS1, is methylated at a higher level than the protein-coding transcript FIS1-201 at the same m6A sites near the PTC (Supplementary Fig. 11j). We then found that these mis-spliced RNA isoforms were significantly enriched for those with long internal exons (length > 400 nt), especially the retained-intron RNAs (P = 1.9 × 10^−44, Fisher’s exact test) (Fig. 6f and Supplementary Fig. 11k, l). Consistently, the m6Aiso-determined m6A sites on these RNAs had significantly longer distances from their closest exon-exon junctions (Supplementary Fig. 11m). Furthermore, the m6A sites on retained-intron RNAs that were more than 200 nt from their closest exon-exon junctions had significantly higher m6A levels than those on the protein-coding isoforms of the same genes, whereas the difference was not significant for the m6A sites within 200 nt of their closest exon-exon junctions (Supplementary Fig. 11n, o), suggesting that the EJC plays an important role in the isoform-specific m6A of retained-intron RNAs. For example, RPS11-209, an isoform of RPS11 with retained introns, is methylated at a higher level than the protein-coding transcript RPS11-201 at the internal exon (Fig. 6g). These results suggest that mis-spliced RNAs are labeled by m6A for RNA surveillance because of their abnormal exon-intron structures.
We then asked whether the m6A-directed degradation of the mis-spliced RNAs, especially the NMD-targeted RNA isoforms, depends on the NMD pathway. NMD is a well-studied mechanism for RNA surveillance, in which the SMG1-mediated phosphorylation of UPF1 is critical for UPF1 to recognize the PTCs43,44. To answer this question, we inhibited the phosphorylation of UPF1 with the SMG1 inhibitor compound 11j (Supplementary Fig. 12a), followed by ONT DRS sequencing. Upon SMG1 inhibition, we observed significant up-regulation of these mis-spliced transcripts, but not of their counterpart protein-coding transcripts of the same genes, indicating that SMG1 inhibition selectively inhibits the NMD pathway as expected (Fig. 6h). In control cells, the expression of both the canonical isoforms and the mis-spliced RNA isoforms was negatively correlated with their m6A levels (Fig. 6i, j and Supplementary Fig. 12b, c). However, the expression changes of these NMD targets upon NMD inhibition (SMG1 inhibitor treatment) showed no correlation with their m6A levels (Fig. 6k and Supplementary Fig. 12d–f). Collectively, these data indicate that m6A-mediated degradation of mis-spliced RNAs is independent of the NMD mechanism.
To further demonstrate that m6A-mediated degradation of mis-spliced RNAs is independent of UPF1, we knocked down UPF1 by siRNA, followed by ONT DRS sequencing. Similar to the SMG1 inhibitor treatment, UPF1 knockdown led to significant up-regulation of these mis-spliced transcripts, but the expression changes of these transcripts did not show a significant correlation with their m6A levels (Supplementary Fig. 12g–i). In addition, there was no significant change in the m6A levels of mis-spliced transcripts upon UPF1 knockdown (Supplementary Fig. 12j). To further test the relationship between UPF1 and m6A, we took advantage of public knockdown RNA-seq data of NMD regulators45, including UPF1, SMG6, and SMG7. Consistent with our DRS sequencing of UPF1 knockdown and SMG1 inhibition, we found that the gene expression changes upon knockdown of NMD regulators did not show significant correlations with the numbers of m6A sites in those genes (Supplementary Fig. 12k–n). We then found that the binding sites of UPF1 were enriched in the 3′UTRs of genes, a distribution distinct from those of YTHDF2 and m6A (Supplementary Fig. 12o). In HeLa cells, only 8% of UPF1 binding sites overlapped with GLORI-determined m6A sites and with YTHDF2 binding sites, respectively (Supplementary Fig. 12p). Indeed, these results are also consistent with the previous observation that m6A-mediated decay through UPF1 does not require translation30. Overall, these results suggest that m6A-mediated RNA degradation and UPF1-mediated RNA decay are two independent RNA surveillance mechanisms.
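To make the link between intron retention and long internal exons concrete, the following sketch shows how retaining one intron merges its two flanking exons into a single, longer exon, and how the >400 nt criterion for internal exons can then be checked. The helper names `retain_intron` and `has_long_internal_exon` and all lengths are hypothetical illustrations, not part of the analysis code of this study.

```python
def retain_intron(exon_lengths, intron_lengths, index):
    """Exon lengths after retaining the intron between exons `index` and `index + 1`.

    exon_lengths:   ordered exon lengths of the canonical isoform.
    intron_lengths: ordered intron lengths (one fewer than exons).
    """
    merged = exon_lengths[index] + intron_lengths[index] + exon_lengths[index + 1]
    return exon_lengths[:index] + [merged] + exon_lengths[index + 2:]

def has_long_internal_exon(exon_lengths, threshold=400):
    """True if any internal exon (neither first nor last) exceeds `threshold` nt."""
    internal = exon_lengths[1:-1]
    return any(length > threshold for length in internal)

# Toy canonical isoform: four short exons separated by three introns.
canonical_exons = [120, 130, 110, 300]
introns = [1000, 250, 800]

# Retaining the second intron fuses exons 2 and 3 into one 490-nt internal exon.
retained_exons = retain_intron(canonical_exons, introns, index=1)
```

In this toy case the canonical isoform has no internal exon above 400 nt, while the retained-intron isoform does, which is the structural change that would move m6A sites away from exon-exon junctions.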
CMD contributes to m6A-mediated degradation of misprocessed RNAs
Besides NMD, a recent study reported that m6A in the coding sequence triggers mRNA degradation via CDS-m6A decay (CMD), a pathway that is distinct from previously reported m6A-dependent degradation mechanisms46. CMD relies on translation, during which m6A deposition in the CDS causes ribosome pausing and transcript destabilization. The subsequent decay involves YTHDF2, which recruits DCP2 rather than UPF1, CNOT1, or HRSP1246. Nevertheless, why cells need the CMD mechanism is incompletely understood, partly due to the lack of m6A information on specific isoforms. Because mis-splicing of RNAs is likely to affect their coding regions, we suspected that the misprocessed isoforms can be degraded through the CMD mechanism. By analyzing the m6A of IpA and NMD isoforms, we found that the levels of m6A in the CDS of these misprocessed isoforms were significantly higher than those of the protein-coding isoforms of the same genes (Fig. 7a and Supplementary Fig. 13a–c), suggesting that misprocessed RNAs may be degraded by the CMD pathway.
a CDS C-to-U mutation rates in IpA isoforms (exon number ≤ 4) and their corresponding protein-coding isoforms in the pseudobulk of ONT single-cell data. b Density plot showing the length of coding sequence in last exon for CMD targeted genes and CMD non-targeted genes, respectively. P value from two-tailed Wilcoxon test is indicated. c Comparing the percentages of isoforms with coding sequence in last exon longer than 200 bp of IpA isoforms (exon number ≤ 4) and their corresponding protein-coding isoforms in the same genes. d Comparing the percentages of CDS m6A sites in IpA isoforms with different lengths of coding sequence in last exon in HEK293T DRS data. Comparing the modification levels of CDS m6A sites in IpA isoforms (e) and canonical isoforms (f) with different lengths of coding sequence in last exon in the HEK293T DRS dataset. Comparing the expression changes of genes with different lengths of coding sequence in last exon in wild-type and YTHDF2 (g) or DCP2 (h) knockdown HeLa cells. i Comparing the percentages of DCP2-targeted genes across categories defined by the lengths of coding sequence in last exon. j Comparing the percentages of isoforms with different lengths of internal CDS exons between highly methylated NMD isoforms and their corresponding protein-coding isoforms in the same genes. k Comparing the percentages of genes (coding regions in the last exons ≤ 200 nt) with different lengths of internal CDS exons between CMD target genes and CMD non-target genes. l Comparing the modification levels of CDS m6A sites in canonical protein-coding isoforms (coding regions in the last exons ≤ 200 nt) with different lengths of maximal internal CDS exons in the DRS dataset. For box plots, P values of two-tailed Wilcoxon test are indicated, the horizontal line represents the median value, the lower and upper quartiles represent the 25th and 75th percentiles, and the whiskers are extended to 1.5 × IQR. 
For stacked bar plot, P values of one-tailed Fisher’s exact test are indicated. Source data are provided as a Source Data file.
We were then curious about why misprocessed RNAs tend to have higher levels of m6A in the CDS. Since IpA usually creates alternative last exons from intronic sequences and uses stop codons that occur there by chance, we studied the coding regions in last exons. First, by reanalyzing the CMD-targeted genes determined in the original paper, we found that CMD-targeted genes had significantly longer coding sequences in their last exons (Fig. 7b). Of note, 69.3% of CMD-targeted genes had coding sequences > 200 nt in their last exons, 200 nt being the distance threshold within which EJCs can inhibit m6A deposition; in contrast, this was the case for only 35.4% of non-CMD-targeted genes (Fig. 7b). We then found that 32.1% of severely truncated IpA isoforms (exon number ≤ 4) had coding regions > 200 nt in their last exons, which is significantly higher than the 17.9% observed for canonical last exons (Fig. 7c, exemplified in Fig. 5i). Moreover, the IpA isoforms with longer coding sequences in the last exons exhibited a higher fraction of m6A sites in the CDS and significantly higher methylation levels of CDS m6A sites (Fig. 7d, e and Supplementary Fig. 13d, e). Collectively, these results suggest that the IpA RNAs with extended coding sequences in the last exon are more likely to be degraded through the CMD pathway.
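The quantity compared above, the length of coding sequence that falls in the last exon, can be computed from an isoform's exon lengths and CDS coordinates. The sketch below assumes 0-based, half-open transcript coordinates for the CDS (with the stop codon included) and uses the 200 nt cutoff from the text; the function name and the toy isoforms are hypothetical and only illustrate the calculation.

```python
def cds_length_in_last_exon(exon_lengths, cds_start, cds_end):
    """Number of CDS nucleotides located in the last exon.

    exon_lengths:        ordered exon lengths of the isoform.
    cds_start, cds_end:  0-based, half-open CDS interval on the spliced
                         transcript (assumed to include the stop codon).
    Returns 0 when the CDS ends before the last exon begins.
    """
    last_exon_start = sum(exon_lengths[:-1])
    return max(0, cds_end - max(cds_start, last_exon_start))

def has_long_last_exon_cds(exon_lengths, cds_start, cds_end, cutoff=200):
    """True if more than `cutoff` nt of CDS lie in the last exon."""
    return cds_length_in_last_exon(exon_lengths, cds_start, cds_end) > cutoff

# Toy IpA-like isoform: three exons, stop codon deep inside a long last exon.
ipa_like = cds_length_in_last_exon([200, 150, 600], cds_start=50, cds_end=700)

# Toy canonical-like isoform: five exons, stop codon shortly after the last junction.
canonical_like = cds_length_in_last_exon([200, 150, 100, 100, 400],
                                         cds_start=50, cds_end=650)
```

Here the IpA-like isoform carries 350 nt of CDS in its last exon and the canonical-like isoform 100 nt, so only the former exceeds the 200 nt cutoff.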
Furthermore, a similar correlation was observed in canonical protein-coding isoforms, in which longer coding sequences in the last exon corresponded to higher CDS m6A levels (Fig. 7f and Supplementary Fig. 13f, g). Knockdown of METTL3, YTHDF2, or DCP2 (but not YTHDF1, YTHDF3, or UPF1) led to greater up-regulation and RNA stability for transcripts with coding sequences in the last exon longer than 200 nt (Fig. 7g, h and Supplementary Fig. 13h, l). As expected, genes with longer coding sequence in last exon were more frequently targeted by DCP2, a key factor in the CMD mechanism (Fig. 7i). This result suggested that the CMD mechanism can supervise and degrade the RNAs with unusual location of stop codons in the last exons, created by IpA or other misprocessed events.
Besides IpA RNAs, which result from incorrect polyadenylation, NMD RNAs are usually created by incorrect splicing. We found that the highly methylated NMD isoforms with maximum m6A level > 0.5 were significantly enriched in long internal exons with coding regions ≥ 400 bp, compared to their corresponding protein-coding isoforms (Fig. 7j). Consistently, the CMD-targeted genes also showed significant enrichment of long internal exons with coding regions ≥ 400 bp (Fig. 7k). The genes containing those long internal coding exons are methylated at significantly higher levels in coding regions than the other genes (Fig. 7l and Supplementary Fig. 13m). This result suggests that NMD RNAs can also be targeted by CMD, which is independent of the NMD mechanism.
Discussion
In this study, we found profound heterogeneity of m6A across single cells, and m6A alone could cluster the cells into the correct subpopulations. Previous single-cell m6A studies mostly analyzed the m6A of different subpopulations clustered based on gene expression21. Here, we found that different kinds of cells could be clearly separated based on m6A signatures of non-differentially expressed genes, suggesting that m6A provides an additional layer of cell identity. Nevertheless, we also realized that the m6A-based clustering of cell populations is much less distinct than the gene expression-based clustering. Indeed, how much m6A varies across different cell types or tissues has long been an open question. Although the dynamic regulation of m6A by trans-acting m6A factors is widely accepted47, multiple studies revealed that the stoichiometry of m6A could be well predicted using sequence alone14,48, implying that m6A is determined to a large extent by the surrounding sequence and may not vary across different tissues as greatly as gene expression does. However, probably due to the larger heterogeneity of gene expression across single cells, there is great heterogeneity of m6A across single cells. In addition, the thousands of cells with great m6A heterogeneity provide a valuable resource for uncovering the specific isoforms with special patterns of m6A modification based on robust comparisons and correlation analyses with large sample sizes. We were thereby able to accurately identify the specific isoforms with higher or lower m6A levels relative to other isoforms of the same genes, which further led to the discovery of the high m6A methylation of misprocessed RNAs.
Compared to the previous DART-seq based methods20,21, the optimized quantification of m6A using DART-seq is critical for the m6A-based cell clustering and the subsequent analyses. DART-seq has the irreplaceable advantages of measuring m6A in parallel in a large number of single cells and of keeping RNAs intact, owing to its endogenous labeling. However, as we found, the APOBEC1-YTH fusion protein also induces a large number of C-to-U mutations around, but not immediately adjacent to, the m6A sites. Possibly due to the intrinsic preference of APOBEC1 to create C-to-U mutations after adenosines, most of the C-to-U mutations in DRACH motifs were not adjacent to genuine single-nucleotide m6A sites, and the 5-mer composition of DRACH for these C-to-U mutations was not consistent with that of the annotated m6A sites (Supplementary Fig. 2c–f and i). Therefore, previous cell clustering based on quantification of single m6A sites according to the adjacent C-to-U mutations suffers from biased quantification of m6A. In this study, although we sacrificed single-nucleotide resolution, we obtained high accuracy in quantifying the m6A of each isoform. First, the composition of enriched DRACH 5-mers in m6A peak regions is consistent with the known m6A sites (Supplementary Fig. 3r). Second, the overall m6A levels of single cells were significantly correlated with the expression levels of the methyltransferase complex (Supplementary Fig. 5h–j). Moreover, the correlations between m6A levels and the expression of specific isoforms were strongly biased toward negative rather than positive values (Fig. 4a).
Our study highlights the advantages of identifying m6A in the context of full-length RNAs at single-cell resolution. The vast majority of eukaryotic genes produce multiple RNA isoforms, which is a way of producing larger functional and regulatory diversity from a finite number of genes49,50. RNA transcript isoforms from the same genes can differ in internal exons, transcriptional starts, and transcriptional terminations through various mechanisms such as alternative splicing, alternative promoters, and alternative polyadenylation51. Here, we found that different transcript isoforms are marked at different m6A sites. This finding adds an additional layer of complexity to RNA regulation. It is possible that the m6A on specific isoforms contributes to the functional and regulatory diversity of different transcript isoforms.
Here, we revealed a role of m6A in RNA surveillance through recognizing RNAs with unusual exon-intron structures resulting from misprocessing. Eukaryotic cells have evolved specialized mechanisms to distinguish incorrectly processed RNAs from correct RNA products. For example, the NMD mechanism specifically detects the PTCs that are more than 50–55 nt upstream of the last exon-exon junctions to prevent the production of truncated protein products52. As a result, the stop codons of 95.4% of the protein-coding genes are located in the last exons according to our calculation. On the other hand, as we calculated, 84% of canonical transcripts do not have any long internal exon (>400 bp), and short internal exons are less likely to be m6A methylated5,6. The IpA RNAs usually produce new stop codons from the intronic regions of the canonical isoforms, resulting in unusually long distances between stop codons and last exon-exon junctions. Our study provides an explanation for these special patterns: cells may use them as rules to recognize normally processed RNA isoforms, because incorrect processing readily produces RNAs that break these rules. Mistakes in splicing tend to generate unusually long internal exons, especially for intron retention events (Fig. 6f). Our results indicate that these unusual patterns are captured by m6A RNA methylation, allowing cells to label the misprocessed RNAs for CMD or other m6A-mediated degradation mechanisms. On the other hand, differences in nuclear vs. cytoplasmic localization between isoforms may contribute to the differences in detected m6A levels between isoforms. Indeed, we found that the Nuc/Cyto ratios of genes were positively correlated with both the number of GLORI m6A sites and the number of C-to-U mutation sites in eDART-seq (Supplementary Fig. 1g–j), suggesting that the nuclear localization of the endogenous m6A methyltransferase may account for the higher m6A of nuclear-localized RNAs. 
Because misprocessed mRNAs tend to be deficient in nuclear export53,54, it is also possible that differences in the cellular localization of isoforms lead to the specific m6A methylation and degradation of misprocessed mRNAs that are retained in the nucleus. On the other hand, using APEX-Seq-determined RNA subcellular localization55, we found that the endoplasmic reticulum (ER)-associated transcripts exhibited similarly higher average numbers of GLORI m6A sites and eDART-seq C-to-U mutation sites compared to other cytoplasmic RNAs (Supplementary Fig. 14a–d). Moreover, we found that ER membrane (ERM)-associated genes have longer coding sequences in last exons compared to other cytoplasmic genes (Supplementary Fig. 14e, f). Therefore, we cannot rule out the possibility that isoform-specific m6A modifications may contribute to the selection of particular RNA isoforms for distinct cellular localization.
It is also reasonable that this type of m6A-mediated RNA surveillance is independent of the NMD mechanism, because the degradation of these misprocessed RNAs is mediated by the abnormal deposition of m6A due to their escape from EJC inhibition rather than by the existence of a PTC. However, this does not rule out the possibility that some NMD RNAs may be subject to both types of RNA surveillance. We did find that the NMD RNAs are highly m6A methylated; indeed, the PTCs of some NMD RNAs are created by mis-splicing such as intron retention, which is also likely to produce long internal exons. In addition, some misprocessed RNAs do not have a PTC but have long internal exons, and m6A-mediated RNA surveillance is important for the degradation of these RNAs.
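The contrast drawn in this Discussion between the positional NMD rule (a stop codon more than 50–55 nt upstream of the last exon-exon junction) and the structural features associated with m6A deposition (long internal exons) can be sketched in a few lines. This is an illustrative simplification under stated assumptions: transcripts are described only by exon lengths, the stop codon end is a 0-based transcript coordinate, and the function names and example isoform are hypothetical.

```python
def is_predicted_nmd_target(exon_lengths, stop_codon_end, min_distance=50):
    """Apply the positional NMD rule to one isoform.

    stop_codon_end: transcript coordinate (0-based, exclusive) of the end of
                    the stop codon.
    Returns True if the stop codon ends more than `min_distance` nt upstream
    of the last exon-exon junction. Single-exon transcripts have no junction.
    """
    if len(exon_lengths) < 2:
        return False
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_codon_end > min_distance

def has_long_internal_exon(exon_lengths, threshold=400):
    """True if any internal exon is longer than `threshold` nt."""
    return any(length > threshold for length in exon_lengths[1:-1])

# Toy isoform with a long internal exon but a stop codon in the last exon:
# not flagged by the positional NMD rule, yet structurally unusual.
toy_exons = [200, 650, 300]
flag_nmd = is_predicted_nmd_target(toy_exons, stop_codon_end=1000)
flag_long_exon = has_long_internal_exon(toy_exons)
```

The toy isoform illustrates the case described above of a misprocessed RNA that lacks a PTC but carries a long internal exon, so that only the exon-structure criterion flags it.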
There are also limitations of m6A-isoSC-seq. First, as we have demonstrated in this study, it cannot be used to quantify m6A at single-nucleotide resolution. Second, the localization of the editing machinery in the cell may cause biased C-to-U mutations on certain RNAs. Although we have shown that the APOBEC1-YTH proteins are expressed in both the cytoplasm and the nucleus, we still cannot guarantee that all RNAs in the cell have an equal chance of being edited. Third, we cannot rule out the possibility that the binding of APOBEC1-YTH to RNAs affects the fates of endogenous RNAs. Although we found that binding of the YTH domain from YTHDF2 is not likely to induce RNA degradation, it may still have some effects on the targeted RNAs, for example, by competing with other RNA binding proteins. Fourth, high editing efficiency is critical for m6A-isoSC-seq; since the transfection efficiencies of different cell lines differ remarkably, applying m6A-isoSC-seq to cell lines with low transfection efficiencies is still challenging. Fifth, similar to other widely used m6A detection approaches (e.g., MeRIP-seq8 and m6A-SAC-seq34), eDART-seq generates semi-quantitative measurements of m6A levels rather than absolute quantifications such as those achieved by GLORI16. Consequently, direct comparisons of m6A levels across distinct genomic loci should be interpreted with caution. Sixth, because it requires exogenously expressed APOBEC1-YTH protein, applying m6A-isoSC-seq to tissue samples poses an additional challenge. Low or varying gene delivery efficiency to tissues and different cell populations (e.g., HepG2 in Supplementary Fig. 1) could hinder the detection of m6A heterogeneity in tissues with complex cell populations.
Further studies are necessary to improve the method and broaden the usage of m6A-isoSC-seq. As we have demonstrated, transfected stem cells do not show editing bias for specific cell populations during cell differentiation (Supplementary Fig. 6a–d); thus, m6A-isoSC-seq is able to profile single-cell and isoform-level m6A modifications across cell populations during cell differentiation. In addition, purified APOBEC1-YTH protein has the potential to be used to mark RNA modifications in vitro, so it may be possible to edit single-cell or microscale RNAs in tissue samples in parallel by adapting the approach to current single-cell and spatial transcriptomics technologies. On the other hand, inducible eDART-seq can be developed to profile single-cell and isoform-level m6A in specific conditions.
Methods
Cell culture and treatment
HEK293T cells (ATCC, #CRL-3216), HeLa cells (CELLCOOK, #CC1101), and HepG2 cells (CELLCOOK, #CC0101) were cultured in Dulbecco’s modified Eagle’s medium (DMEM; Corning, #10-013-CVRC) supplemented with 10% fetal bovine serum (FBS; QuaCell, #B21001). The cells were maintained at 37 °C in a humidified chamber with 5% CO2 and were verified to be free of mycoplasma contamination. The pcDNA3.1-APOBEC1-YTH-T2A-copGFP construct was transfected into HEK293T cells, HepG2 cells, and HeLa cells, respectively, using Lipofiter 3.0 (Hanbio, #HB-TRLF3-2000) according to the manufacturer’s instructions. The T2A self-cleaving peptide was included to separate copGFP from APOBEC1-YTH in cells, in order to avoid copGFP aggregation, which may affect APOBEC1 editing activity. The medium was refreshed 6 h after transfection, and the copGFP-positive cells of the three cell lines with similar ranges of fluorescence intensities were isolated, respectively, through flow cytometry 3 days after transfection.
For NMD pathway inhibition, ~1 × 10^7 HEK293T cells were treated with either DMSO (Sigma-Aldrich, #D2650) or 300 nM of the small molecule NMD inhibitor SMG1i (compound 11j, MedChemExpress, #HY-124719) for 4 h prior to cell harvest.
The APOBEC1-YTH-T2A-copGFP sequence is provided in Supplementary Table 1.
Bulk RNA sequencing
The total RNA of pcDNA3.1-APOBEC1-YTH-T2A-copGFP transfected HEK293T, HeLa and HepG2 was harvested using Trizol (Invitrogen, #15596018) followed by treatment with RNase-Free DNase (NEB) to remove possible DNA contamination, respectively. RNA-seq libraries were prepared with Dynabeads mRNA Purification Kit (Ambion) and TruSeq Stranded mRNA Library Prep Kit (Illumina). Sequencing was performed with Illumina HiSeq 2000 to generate about 55 million strand-specific 150 bp paired-end reads for each cell line.
Single-cell samples preparation and sequencing
Approximately 1700 copGFP-positive cells (estimated with a cell counter) from each of the HEK293T, HepG2, and HeLa cell lines were mixed in a 1:1:1 ratio to create a single-cell suspension. The mixed single-cell suspension was used to generate a barcoded single-cell RNA-seq library using the 10x Genomics 3’ gene expression kit (v3) following the manufacturer’s recommended protocol. Half of the amplified cDNA was used for short-read library preparation following the 10x Genomics protocol and sequenced on an Illumina NovaSeq 6000 system to generate 310,040,913 150 bp paired-end reads. The other half of the full-length cDNA generated by 10x Genomics was amplified by five cycles of PCR with TruSeq read 1 forward primer 5′-NNN CTA CAC GAC GCT CTT CCG ATC T-3′ and 3′ partial TSO reverse primer 5′-NNN AAG CAG TGG TAT CAA CGC AGA GTA CAT-3′ for Oxford Nanopore sequencing. The amplified full-length cDNA was then treated with 0.55 × SPRIselect to remove PCR primers and cDNAs shorter than 1000 bp. The purified cDNA was eluted in a final volume of 51 µL of Buffer EB to provide adequate template volume for quality control assessment and PromethION library preparation. The single-cell nanopore sequencing library was prepared with the Oxford Nanopore LSK-110 kit (PCR free) and sequenced on a PromethION following the manufacturer’s instructions to generate 339,434,069 long reads derived from 6917 cells.
Direct RNA library preparation and sequencing
Total RNA was extracted from the HEK293T cell line using TRIzol reagent (Invitrogen, #15596018), according to the manufacturer’s instructions. RNA concentrations and RNA integrity were measured using a NanoDrop 2000 Spectrophotometer and Agilent 4200 TapeStation, respectively. A 20 μg aliquot of total RNA was subjected to poly(A) + RNA selection using the Dynabeads mRNA DIRECT purification kit (Invitrogen, #61011) following the manufacturer’s instructions. Subsequently, approximately 500 ng of poly(A) + RNA was used for direct RNA library preparation. The libraries were generated following the standard ONT SQK-RNA002 protocol with the reverse transcription step included. The libraries were loaded onto R9.4.1 flow cells (ONT, FLO-PRO002) and sequenced on PromethION 2 Solo (ONT, PRO-SEQ002).
Lentivirus production and generation of stable gene-overexpression hESC lines
The coding sequence of APOBEC1-YTH-HA was subcloned into the pLVX-Tet-One-Puro (ClonTech, 631847) vector and verified through Sanger sequencing. Lentivirus production was carried out by co-transfecting HEK293T cells with the lentiviral vector, an envelope plasmid (pMD2.G, Addgene, 12259), and a packaging plasmid (psPAX2, Addgene, 12260) using Lipofectamine™ 2000 (Thermo Fisher, 11668019). The virus-containing medium was then harvested from the HEK293T cells. H1 hESCs were infected with the virus in two rounds (6 h each round) in the presence of 8 µg/mL polybrene. After the final infection, transduced cells were selected with 1 µg/mL puromycin for three successive passages.
Definitive endoderm differentiation
Undifferentiated hESCs cultured in E8 medium (STEMCELL Technologies, 05990) were dissociated into a single-cell suspension using Accutase (STEMCELL Technologies, 07920) and then plated onto Matrigel (BD, 354277)-coated 24-well plates at a density of 10^5 cells per well in E8 medium containing 10 µM Y-27632 (Selleck, S1049). When the cells reached 80% confluence, definitive endoderm differentiation was initiated by switching to the endoderm differentiation medium (DMEM/F-12 (Gibco, 11330032) supplemented with 50 U/mL Penicillin-Streptomycin (Gibco, 15070063), Chemically Defined Lipid Concentrate (1:100, Gibco, 11905031), 10.7 µg/mL holo-Transferrin human (Sigma-Aldrich, T0665), 71 µg/mL L-Ascorbic acid (Sigma-Aldrich, A8960), 14 ng/mL Sodium selenite (Sigma-Aldrich, S5261), and 10 ng/mL Activin A (PeproTech, 12014E)). The cells were cultured under these conditions for 2 days. Additionally, 3 µM CHIR99021 (Selleck, S1263) was added to the medium during the first 24 h of differentiation and subsequently removed. Samples were collected and analyzed on day 2.
Immunofluorescence analysis
For intracellular protein immunofluorescence staining, cells were fixed in 4% paraformaldehyde (PFA) for 15 min at room temperature and subsequently washed twice with DPBS. Cells were permeabilized in the permeabilization buffer (DPBS supplemented with 0.3% Triton X-100) for 30 min at room temperature, washed once in DPBS, and blocked in the blocking-permeabilization buffer (DPBS supplemented with 3% bovine serum albumin (BSA) and 0.1% Triton X-100) for 1 h at room temperature. Following this, cells were incubated overnight at 4 °C with HA-Tag antibody (1:200, Cell Signaling Technology, 2367), Brachyury antibody (1:200, R&D, AF2085), and FOXA2 antibody (1:200, Cell Signaling Technology, 8186) diluted in the blocking-permeabilization buffer. After washing with DPBS, cells were stained with secondary antibodies (1:500, Thermo Fisher, A-21244, A-21469, and A-11001) diluted in the secondary antibody buffer (DPBS supplemented with 1% BSA and 0.05% Triton X-100) at room temperature for 45 min. Stained samples were imaged using the Operetta CLS™ high-content analysis system (Perkin Elmer). Data analyses were performed using the Harmony 4.5 software.
Detection of RNA editing efficiency
The mRNA sequences containing m6A sites of ACTB and FEN1, flanked by adapter sequences (upstream: 5′-ATGCAGCGACTCAGCCTCTG-3′, downstream: 5′-CACGCACTACGGTACTGGCTA-3′), were cloned into the pcDNA 3.1 vector. These plasmids were co-transfected with APOBEC1-YTH-T2A-copGFP into HEK293T cells. Subsequently, the copGFP-positive cells were isolated by flow cytometry sorting. Total RNA was extracted with the FastPure Cell/Tissue Total RNA Isolation Kit V2 (Vazyme, RC112-01). Editing efficiency was detected by PCR using primers specific to the adapter sequences (F: 5′-ATGCAGCGACTCAGCCTCTG-3′; R: 5′-TAGCCAGTACCGTAGTGCGTG-3′). The ACTB spike-in, FEN1 spike-in, and PCR primer sequences used in this study are provided in Supplementary Table 2.
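As a quick consistency check on the primer design described above, the reverse primer should be the reverse complement of the downstream adapter, and the forward primer should equal the upstream adapter. The following minimal sketch confirms this for the sequences quoted in the text; the helper name `reverse_complement` is illustrative and not part of the study's code.

```python
# Sequences copied from the Methods text (5' -> 3').
UPSTREAM_ADAPTER = "ATGCAGCGACTCAGCCTCTG"
DOWNSTREAM_ADAPTER = "CACGCACTACGGTACTGGCTA"
FORWARD_PRIMER = "ATGCAGCGACTCAGCCTCTG"
REVERSE_PRIMER = "TAGCCAGTACCGTAGTGCGTG"

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


# The forward primer anneals to the upstream adapter as written;
# the reverse primer anneals to the opposite strand of the downstream adapter.
forward_ok = FORWARD_PRIMER == UPSTREAM_ADAPTER
reverse_ok = REVERSE_PRIMER == reverse_complement(DOWNSTREAM_ADAPTER)
print(forward_ok, reverse_ok)
```

Because both primers sit in the adapters rather than in the ACTB or FEN1 inserts, the same primer pair can amplify either spike-in.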
Cell fraction and western blot
HEK293T cells were collected, washed with 1 mL of cold PBS buffer containing 1 mM EDTA, and centrifuged at 500 × g for 5 min at room temperature. The cell pellet was resuspended in 200 μL of ice-cold cytoplasmic lysis buffer (10 mM Tris-HCl, pH 7.5, 0.1% IGEPAL® CA-630, 150 mM NaCl) and incubated on ice for 5 min. Following incubation, 2.5 volumes of chilled sucrose cushion were added, and the mixture was centrifuged at 15,000 × g for 10 min at 4 °C. The supernatant was collected as the cytoplasmic fraction. The nuclear pellet was resuspended in RIPA buffer supplemented with 1 mM phenylmethylsulphonyl fluoride (PMSF; ST506, Beyotime Biotech, China) immediately before use. Subsequently, nuclear lysates were sonicated and centrifuged at 12,000 × g for 15 min at 4 °C. The samples were then mixed with 6 × SDS sample buffer, boiled at 100 °C for 10 min, and resolved by SDS-PAGE. Finally, HA antibody (1:2000, Cell Signaling Technology, 3724), GAPDH antibody (1:10000, ABclonal, A19056), and Histone 3 antibody (1:5000, Cell Signaling Technology, 4499) were used for detection.
Mutation identification and analyses of bulk RNA-seq
The adapters of raw paired-end sequencing reads were removed by Cutadapt (v2.7)56. The reads were then aligned to the human GRCh38/hg38 reference genome by STAR (Version 2.7.3a)57. We then took advantage of rMATS-DVR58, which implements Picard (v2.22.3, https://broadinstitute.github.io/picard/) to remove PCR duplicates and the GATK toolkit (v3.8.0)59 to calibrate the BAM files sequentially by splitting ‘N’ cigar reads (i.e., splice junction reads), mapping quality reassignment (program: SplitNCigarReads), and base quality score recalibration (program: BaseRecalibrator)58,60. Mutations were identified by CTK-tools (v1.1.3)61 based on the calibrated BAM files with default parameters. The mutation sites that match dbSNP-annotated SNPs were filtered out. Samtools (v1.7)62 was used to calculate the number of reference and mutated nucleotides for each mutation site. Only the mutation sites with read coverage ≥ 50, mutation count ≥ 5, and mutation rate ≥ 0.1 were used for the downstream analyses. The mutation sites that were also identified in cells transfected with empty vectors, or that were induced by transfecting APOBEC1 without YTH, were removed from further analyses32. To define the C-to-U mutation peaks, we tiled the exon regions with 100 bp sliding windows that overlap each other by 20 bp. The sliding windows with at least 3 C-to-U mutation sites were identified as mutation peaks for each cell line separately. Overlapping mutation peaks were then merged. Homer (v 4.11.1)63 software was used to identify the motifs enriched in the mutation peaks, with all exonic regions of the genome as the background. The longest isoforms were used to determine the distance of C-to-U mutation sites to the closest known m6A sites.
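The site filter and window-based peak definition described above can be sketched as follows. This is a minimal illustration rather than the study's pipeline: it assumes that read coverage is the sum of reference and mutated counts, that "20 bp overlapping" means consecutive windows advance by 80 bp, and that coordinates are 0-based half-open; the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def passes_site_filter(ref_count: int, mut_count: int,
                       min_cov: int = 50, min_mut: int = 5,
                       min_rate: float = 0.1) -> bool:
    """Keep a site with coverage >= 50, mutation count >= 5 and rate >= 0.1."""
    coverage = ref_count + mut_count  # assumption: coverage = C reads + U reads
    if coverage < min_cov or mut_count < min_mut:
        return False
    return mut_count / coverage >= min_rate


def call_peaks(exon_start: int, exon_end: int, sites: Iterable[int],
               window: int = 100, overlap: int = 20,
               min_sites: int = 3) -> List[Tuple[int, int]]:
    """Tile one exon with overlapping windows and merge qualifying windows."""
    step = window - overlap  # assumption: 100 bp windows sharing 20 bp
    positions = sorted(sites)
    peaks: List[Tuple[int, int]] = []
    start = exon_start
    while start < exon_end:
        end = min(start + window, exon_end)
        n_sites = sum(1 for pos in positions if start <= pos < end)
        if n_sites >= min_sites:
            if peaks and start < peaks[-1][1]:
                # Window overlaps the previous peak: extend it.
                peaks[-1] = (peaks[-1][0], end)
            else:
                peaks.append((start, end))
        if end == exon_end:
            break
        start += step
    return peaks


example_peaks = call_peaks(0, 300, [10, 50, 90, 95, 170])
print(example_peaks)
```

In the example, the windows [0, 100) and [80, 180) each contain at least three sites and overlap, so they are merged into a single peak.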
Gene expression analyses for short-read single-cell sequencing data
Raw data were demultiplexed by mkfastq software. The reads were then aligned to the human GRCh38/hg38 reference genome using Cell Ranger (v6.0.1)64, followed by count calculation and cell-barcode filtering with the default parameters. The UMI counts of all genes across all single cells were used to cluster the single cells using Seurat (v3.1.0)65. Briefly, only cells with at least 200 and at most 5000 genes and no more than 20% mitochondrial reads were considered for the downstream analyses. After filtering, the RNA counts were normalized using the “LogNormalize” method implemented in Seurat with a scale factor of 10,000. The top 2000 variable genes defined by the “FindVariableFeatures” function of Seurat were used for PCA. We then took the top 10 principal components for cell clustering and UMAP projection. Clusters were identified using the SNN graph and the original Louvain algorithm with a resolution of 0.1. The “FindAllMarkers” function in Seurat was used to determine the marker genes of each cell cluster by requiring p < 0.01 and log2 fold-change ≥ 0.25. The cell types of the clusters were determined by comparing the expression of the marker genes of each cluster with their expression in the bulk RNA-seq of the three cell lines. The cell type determined for each single cell was used consistently in all single-cell analyses in the study.
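The cell quality filter and the arithmetic behind Seurat's "LogNormalize" method can be written out in a few lines. The sketch below is a plain-Python illustration of those two steps only (Seurat itself is an R package, and the analysis was run there); the function names `keep_cell` and `log_normalize` are hypothetical. "LogNormalize" divides each feature count by the cell's total count, multiplies by the scale factor, and applies a natural-log transform with a pseudocount of 1.

```python
import math
from typing import Dict


def keep_cell(n_genes: int, pct_mito: float,
              min_genes: int = 200, max_genes: int = 5000,
              max_pct_mito: float = 20.0) -> bool:
    """Cell QC: 200-5000 detected genes and at most 20% mitochondrial reads."""
    return min_genes <= n_genes <= max_genes and pct_mito <= max_pct_mito


def log_normalize(counts: Dict[str, float],
                  scale_factor: float = 10_000) -> Dict[str, float]:
    """Per-cell normalization: log1p(count / total_count * scale_factor)."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: math.log1p(count / total * scale_factor)
            for gene, count in counts.items()}


normalized = log_normalize({"ACTB": 1, "GAPDH": 3})
print(normalized)
```

The same normalization is applied later to mutation-rate matrices, where the "counts" are MMR values rather than UMI counts.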
Mutation analyses for short-read single-cell sequencing data
The single-cell RNA sequencing reads were combined to call the raw mutations using the same pipeline as we did for the bulk RNA-seq data. The VarTrix (v1.1.22, https://github.com/10XGenomics/vartrix) was used to calculate the counts of reference and mutated nucleotides for each mutation site in each cell. The mutation sites that match dbSNP annotated SNPs were filtered out. Only the mutation sites with UMI coverage ≥ 50, mutation reads coverage ≥ 5, and mutation rate ≥ 0.05 in the combined single-cell data were used for the downstream analyses. The mutation peaks were identified based on the combined single-cell data using the same approach as we did for the bulk RNA-seq data.
The mean C-to-U mutation rate (MMR), a metric for the mutation rate of a mutation peak, was defined similarly to the ε-score39 as follows:
Here, n represents the number of C-to-U mutation sites in the region of the corresponding mutation peak; Ui and Ci represent the counts of U and C at the ith mutation site, respectively.
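A minimal sketch of an MMR calculation from the quantities defined above (n sites, with U and C counts at each site). It assumes that MMR is the unweighted mean of the per-site rates Ui / (Ui + Ci); that reading is an assumption based on the name "mean C-to-U mutation rate". A coverage-pooled ratio, sum(Ui) / sum(Ui + Ci), is shown alongside for comparison, since the two can differ when site coverage is uneven. Both function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Site = Tuple[int, int]  # (U count, C count) at one C-to-U mutation site


def mean_mutation_rate(sites: Sequence[Site]) -> Optional[float]:
    """Assumed MMR: average of U / (U + C) over the covered sites of a peak."""
    rates: List[float] = [u / (u + c) for u, c in sites if u + c > 0]
    if not rates:
        return None  # no covered site, rate undefined
    return sum(rates) / len(rates)


def pooled_mutation_rate(sites: Sequence[Site]) -> Optional[float]:
    """Alternative reading: total U reads over total (U + C) reads."""
    total_u = sum(u for u, _ in sites)
    total = sum(u + c for u, c in sites)
    return total_u / total if total > 0 else None


peak = [(1, 3), (1, 1)]  # per-site rates 0.25 and 0.5
print(mean_mutation_rate(peak), pooled_mutation_rate(peak))
```

The same structure applies to the isoform-level mean m6A level from direct RNA sequencing, with modified and unmodified read counts in place of U and C.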
Only the cells with >200 methylated mutation peaks (at least one C-to-U mutated read) were used in the downstream analysis. To rule out the effects of gene expression on m6A-based cell clustering, we filtered out the mutation peaks on the marker genes of the gene expression-based cell clusters. Then, the MMRs of the filtered mutation peaks across the filtered single cells were used to cluster the single cells using Seurat (v3.1.0)65. The MMRs of the filtered mutation peaks were normalized using the “LogNormalize” method implemented in Seurat with a scale factor of 10,000. The top 3000 variable peaks defined by the “FindVariableFeatures” function of Seurat were used for PCA. We then took the top 10 principal components for cell clustering and UMAP projection. Clusters were identified using the SNN graph and the original Louvain algorithm with a resolution of 0.30. Marker mutation peaks of each cluster were identified using the “FindAllMarkers” function in Seurat by requiring p < 0.01 and log2 fold-change ≥ 0.1. For each mutation peak, a methylated cell is a single cell that has at least one read with at least one C-to-U mutation in the mutation peak. The fraction of methylated cells is the number of methylated cells divided by the number of single cells with at least one read overlapping the mutation peak.
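The fraction of methylated cells for one mutation peak follows directly from the definition above. The sketch below assumes a hypothetical per-peak input that maps each cell barcode to a pair (reads with at least one C-to-U mutation in the peak, reads overlapping the peak); the data layout and function name are illustrative.

```python
from typing import Dict, Optional, Tuple


def fraction_methylated(cells: Dict[str, Tuple[int, int]]) -> Optional[float]:
    """cells maps barcode -> (mutated_reads, overlapping_reads) for one peak."""
    # Only cells with at least one read overlapping the peak are counted.
    covered = [mutated for mutated, total in cells.values() if total >= 1]
    if not covered:
        return None
    # A methylated cell has at least one C-to-U mutated read in the peak.
    methylated = sum(1 for mutated in covered if mutated >= 1)
    return methylated / len(covered)


peak_cells = {
    "cell_1": (2, 5),  # methylated
    "cell_2": (0, 3),  # covered, not methylated
    "cell_3": (0, 0),  # not covered, excluded from the denominator
    "cell_4": (1, 1),  # methylated
}
print(fraction_methylated(peak_cells))
```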
Base calling of ONT single-cell sequencing data
Base calling was performed on raw FAST5 files of nanopore sequencing data to generate FASTQ files using Guppy (v3.1.3) in high-accuracy mode (https://community.nanoporetech.com/downloads).
Barcode and UMI calling of ONT single-cell sequencing data
The Illumina short-read single-cell data were used as a reference to identify cell barcodes and UMIs from nanopore long reads, following a strategy similar to that reported in a previous study66. The identification process started by searching for the 10x Genomics 5′ adapter (“CTACACGACGCTCTTCCGATCT”) and the poly(T) sequence within the nanopore long reads, which helped locate the cell barcode and UMI sequence. To match putative cell barcodes and UMIs identified from the error-prone nanopore long reads to those defined from the high-fidelity Illumina short reads, the maximum edit distances allowed between the long-read and short-read data were set to 3 for cell barcodes and 2 for UMIs.
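The matching step can be illustrated with a small edit-distance search against the short-read-derived reference set. This is a simplified sketch, not the Sicelore implementation: it uses a plain Levenshtein distance, scans every reference sequence, and (as an assumption) rejects a query when two references tie for the best distance. The function names are hypothetical; the thresholds of 3 (barcodes) and 2 (UMIs) come from the text.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        cur = [i]
        for j, char_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (char_a != char_b)))  # substitution
        prev = cur
    return prev[-1]


def best_match(query: str, references: Iterable[str],
               max_dist: int) -> Optional[str]:
    """Return the unique closest reference within max_dist, else None."""
    best, best_dist, tie = None, max_dist + 1, False
    for ref in references:
        dist = edit_distance(query, ref)
        if dist < best_dist:
            best, best_dist, tie = ref, dist, False
        elif dist == best_dist and dist <= max_dist:
            tie = True  # ambiguous assignment (assumed to be discarded)
    return None if tie else best


illumina_barcodes = ["ACGTACGTACGTACGT", "TTTTTTTTTTTTTTTT"]
nanopore_barcode = "ACGTACGTACGTACGA"  # one sequencing error
print(best_match(nanopore_barcode, illumina_barcodes, max_dist=3))
```

In practice the search space is usually restricted first (for example, to barcodes seen for the same gene or genomic region in the short-read data) so that a permissive edit distance does not produce spurious matches.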
Gene expression-based cell clustering of ONT single-cell sequencing data
Only the cells with ≥ 750 expressed genes were used in cell clustering. The read counts of genes were normalized using the “LogNormalize” method implemented in Seurat65 with a scale factor of 10,000. The top 3000 variable genes defined by the “FindVariableFeatures” function of Seurat were used for PCA. We then took the top 10 principal components for cell clustering and UMAP projection. Clusters were identified using the SNN graph and the original Louvain algorithm with a resolution of 0.017. The “FindAllMarkers” function in Seurat was used to determine the marker genes of each cell cluster by requiring p < 0.01 and log2 fold-change ≥ 0.25.
Isoform expression-based cell clustering of ONT single-cell sequencing data
As in the gene expression-based cell clustering, only the cells with ≥ 750 expressed genes were used in the cell cluster analyses. The read counts of annotated isoforms (GENCODE V32) were normalized using the “LogNormalize” method implemented in Seurat65 with a scale factor of 10,000. The top 3000 variable isoforms defined by the “FindVariableFeatures” function of Seurat were used for PCA. We then took the top 15 principal components for cell clustering and UMAP projection. Clusters were identified using the SNN graph and the original Louvain algorithm with a resolution of 0.03. The “FindAllMarkers” function in Seurat was used to determine the marker isoforms and marker genes of each cell cluster by requiring p < 0.01 and log2 fold-change ≥ 0.25.
Mutation identification of ONT single-cell sequencing data
First of all, we made a list of reference C-to-U mutation sites by merging the C-to-U mutation sites identified from bulk RNA-seq data in HeLa, HepG2, and HEK293T. Instead of identifying mutations solely based on ONT single-cell reads, we calculated the C-to-U mutation rates at the reference C-to-U mutation sites from the ONT single-cell reads with previously assigned cell barcodes and UMIs. “SNPMatrix” command of Sicelore tool (v2.0)66 was employed with default parameters to calculate the C and U counts of the reference mutation sites in each single-cell. The obtained C-to-U mutations on the ONT reads derived from the same genes and isoforms were then aggregated to calculate the gene-wise and isoform-wise C-to-U mutation rates, respectively, for each cell.
Isoform m6A level-based cell clustering of ONT single-cell sequencing data
We filtered out the marker isoforms of the above isoform expression-based cell clustering analysis. Only the cells with ≥ 750 expressed genes and ≥ 100 methylated transcripts (at least one C-to-U mutated read) were used in the downstream analyses. The MMRs of transcripts in single cells were used to cluster the single cells using Seurat (v3.1.0)65. The MMRs of the filtered isoforms were normalized using the “LogNormalize” method implemented in Seurat with a scale factor of 10,000. The top 3000 isoforms with the most variable MMRs, as determined by the “FindVariableFeatures” function of Seurat, were used for PCA. We then took the top 10 principal components for cell clustering and UMAP projection. Clusters were identified using the SNN graph and the original Louvain algorithm with a resolution of 0.08. The marker isoforms of each cluster with distinct m6A methylation levels were identified using the “FindAllMarkers” function in Seurat by requiring p < 0.01 and log2 fold-change ≥ 0.25.
For the pseudobulk analyses, only the C-to-U mutation sites with at least 5 mutated reads and 10 covering reads at the transcript level were used to determine the m6A levels of each transcript.
Direct RNA sequencing data analysis preprocess
For direct RNA sequencing data, base calling was performed on each raw FAST5 file to generate FASTQ files using Guppy (v3.1.3) in high-accuracy mode. The reads with a mean quality score exceeding 7 were aligned to the human GRCh38/hg38 reference transcriptome (GENCODE V32) using minimap267. The multi-mapped reads were removed using samtools68. The expression of transcripts was calculated based on uniquely mapped reads. Subsequently, the reads and nanopore current signals were mapped to their corresponding transcript positions by the eventalign module of Nanopolish (version 0.13.2)69. At the same time, the mean, standard deviation, and dwell time of the ionic current from all the reads mapped to the same continuously segmented 5-mers were extracted by Nanopolish (version 0.13.2)69. Based on the extracted ionic current signals, m6Aiso was used to predict m6A modification at the single-molecule level41. Only the GLORI-annotated m6A sites with read coverage ≥ 20 and modified reads ≥ 10 at the gene level were used for the downstream analyses. The m6A level of each isoform was then determined based on the m6A sites with read coverage ≥ 10 and modified reads ≥ 5 in the isoform. Similar to MMR, the mean m6A level of an isoform was calculated as follows:
Here, n represents the number of m6A sites in the isoform; m6Ai and Ai represent the counts of modified and unmodified reads at the ith m6A site, respectively.
Identification and analyses of highly and lowly methylated isoforms
Only the isoforms with at least one C-to-U mutated read in at least 10 single cells were used in the analyses. The genes with at least two eligible isoforms were then used to identify the highly and lowly methylated isoforms. The MMR for each isoform in each cell was calculated as described above. For each isoform, we compared its isoform-level MMRs, in the single cells with at least one C-to-U mutated read of that isoform, with those of all other isoforms of the same gene combined using the Wilcoxon test. The isoforms with MMRs significantly higher or lower than those of the other isoforms in the same gene (FDR < 0.05) were identified as highly methylated and lowly methylated isoforms, respectively. The fractions of GENCODE-annotated isoform types were then compared between the highly and lowly methylated isoforms using a two-tailed Chi-squared test.
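The multiple-testing step can be sketched as follows. The text specifies an FDR threshold of 0.05 but not the correction procedure; this example assumes the Benjamini-Hochberg adjustment, which is a common choice, and takes the per-isoform Wilcoxon p-values as given. The function name `bh_adjust` and the example values are hypothetical.

```python
from typing import List, Sequence


def bh_adjust(pvals: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical Wilcoxon p-values for four isoforms.
isoform_pvals = [0.01, 0.04, 0.03, 0.005]
fdr = bh_adjust(isoform_pvals)
significant = [q < 0.05 for q in fdr]
print(fdr, significant)
```

An isoform passing the threshold would then be labeled highly or lowly methylated according to whether its MMRs are higher or lower than those of the other isoforms of the same gene.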
Comparison of m6A levels across different types of isoforms
The isoform type, number of exons, length of the 3’UTR, and length of internal exons for each isoform were identified based on the annotation of the GRCh38/hg38 reference transcriptome (GENCODE V32). The isoforms with polyadenylation sites located within the intronic regions of the annotated canonical isoforms were defined as IpA isoforms. To compare the m6A levels of different isoforms using single-cell, pseudobulk, and ONT DRS data, the mean m6A methylation levels of NMD, intron retention, non-coding, and IpA transcripts were each compared with those of protein-coding transcripts of the same genes.
For single-cell analyses, the isoforms with at least one C-to-U mutated read in at least 10 cells were used in downstream analyses. The mean m6A methylation levels in methylated cells (at least one read with C-to-U mutation) were used to represent the m6A levels of isoforms in single cells. The meta gene analyses of NMD transcripts were performed using the start codons and stop codons of the NMD transcripts and the counterpart canonical isoforms, respectively. The GO analyses were performed using clusterProfiler70.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The raw sequence data have been deposited in the GEO database under the accession number GSE239765. The public GLORI, m6A-REF-seq, eTAM-seq, m6A-SAC-seq, m6ACE-seq, MAZTER-seq, miCLIP, and m6A-seq datasets were downloaded from the GEO database (GSE210563[https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE210563], GSE125240, GSE211303, GSE162356, GSE124509, GSE122961, GSE63753, GSE29714). The YTHDF1-3, DCP2, UPF1, SMG1, and SMG7 knockdown RNA-seq datasets were downloaded from the GEO database (GSE134380[https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE134380], GSE143662, GSE86148). The public YTHDF2 and UPF1 CLIP data were downloaded from the GEO database (GSE248574[https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE248574], GSE47976). The public nuclear-cytoplasmic fraction RNA-seq data were downloaded from the GEO database (GSE190303[https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE190303]). The HEK293T DRS data were downloaded from the EBI database (PRJNA1044456[https://www.ebi.ac.uk/ena/browser/view/PRJNA1044456]). The METTL3 knockout DRS data were downloaded from the EBI database (PRJEB40872[https://www.ebi.ac.uk/ena/browser/view/PRJEB40872]). Source data are provided with this paper. The original un-cropped images of western blots in this study are provided in Supplementary Fig. 15.
Code availability
Source code and analysis scripts for sgRNA design and analyses are available on GitHub (https://github.com/ZJRen9/m6A-isoSC-seq, https://doi.org/10.5281/zenodo.15100885).
References
Wang, X. et al. N6-methyladenosine-dependent regulation of messenger RNA stability. Nature 505, 117–120 (2014).
Zaccara, S. & Jaffrey, S. R. A unified model for the function of YTHDF proteins in regulating m(6)A-modified mRNA. Cell 181, 1582–1595.e1518 (2020).
Luo, Z. et al. Exon-intron boundary inhibits m(6)A deposition, enabling m(6)A distribution hallmark, longer mRNA half-life and flexible protein coding. Nat. Commun. 14, 4172 (2023).
He, P. C. et al. Exon architecture controls mRNA m(6)A suppression and gene expression. Science 379, 677–682 (2023).
Yang, X., Triboulet, R., Liu, Q., Sendinc, E. & Gregory, R. I. Exon junction complex shapes the m(6)A epitranscriptome. Nat. Commun. 13, 7904 (2022).
Uzonyi, A. et al. Exclusion of m6A from splice-site proximal regions by the exon junction complex dictates m6A topologies and mRNA stability. Mol. Cell 83, 237–251.e237 (2023).
Meyer, K. D. et al. Comprehensive analysis of mRNA methylation reveals enrichment in 3’ UTRs and near stop codons. Cell 149, 1635–1646 (2012).
Dominissini, D. et al. Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq. Nature 485, 201–206 (2012).
Linder, B. et al. Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome. Nat. Methods 12, 767–772 (2015).
Dierks, D. et al. Multiplexed profiling facilitates robust m6A quantification at site, gene and sample resolution. Nat. Methods 18, 1060–1067 (2021).
Chen, K. et al. High-resolution N6-methyladenosine (m6A) map using photo-crosslinking-assisted m6A sequencing. Angew. Chem. Int. Ed. Engl. 54, 1587–1590 (2015).
Ule, J. et al. CLIP identifies Nova-regulated RNA networks in the brain. Science 302, 1212–1215 (2003).
Zhang, Z. et al. Single-base mapping of m6A by an antibody-independent method. Sci. Adv. 5, eaax0250 (2019).
Garcia-Campos, M. A. et al. Deciphering the “m6A Code” via antibody-independent quantitative profiling. Cell 178, 731–747.e716 (2019).
Shu, X. et al. A metabolic labeling method detects m6A transcriptome-wide at single base resolution. Nat. Chem. Biol. 16, 887–895 (2020).
Liu, C. et al. Absolute quantification of single-base m(6)A methylation in the mammalian transcriptome using GLORI. Nat. Biotechnol. 41, 355–366 (2023).
Meyer, K. D. DART-seq: an antibody-free method for global m6A detection. Nat. Methods 16, 1275–1280 (2019).
Li, Y. et al. Single-cell m6A mapping in vivo using picoMeRIP-seq. Nat. Biotechnol. 42, 591–596 (2024).
Yao, H. et al. scm6A-seq reveals single-cell landscapes of the dynamic m6A during oocyte maturation and early embryonic development. Nat. Commun. 14, 315 (2023).
Tegowski, M., Flamand, M. N. & Meyer, K. D. scDART-seq reveals distinct m6A signatures and mRNA methylation heterogeneity in single cells. Mol. Cell 82, 868–878.e810 (2022).
Feng, S. et al. Single-cell discovery of m6A RNA modifications in the hippocampus. Genome Res. 34, 822–836 (2024).
Gao, Y. et al. ESPRESSO: Robust discovery and quantification of transcript isoforms from error-prone long-read RNA-seq data. Sci. Adv. 9, eabq5072 (2023).
Wang, Y., Zhao, Y., Bollas, A., Wang, Y. & Au, K. F. Nanopore sequencing technology, bioinformatics and applications. Nat. Biotechnol. 39, 1348–1365 (2021).
Shiau, C. K. et al. High throughput single cell long-read sequencing analyses of same-cell genotypes and phenotypes in human tumors. Nat. Commun. 14, 4124 (2023).
Philpott, M. et al. Nanopore sequencing of single-cell transcriptomes with scCOLOR-seq. Nat. Biotechnol. 39, 1517–1520 (2021).
Joglekar, A. et al. Single-cell long-read sequencing-based mapping reveals specialized splicing patterns in developing and adult mouse and human brain. Nat. Neurosci. 27, 1051–1063 (2024).
Du, H. et al. YTHDF2 destabilizes m(6)A-containing RNA through direct recruitment of the CCR4-NOT deadenylase complex. Nat. Commun. 7, 12626 (2016).
Sikorski, V., Selberg, S., Lalowski, M., Karelson, M. & Kankuri, E. The structure and function of YTHDF epitranscriptomic m(6)A readers. Trends Pharmacol. Sci. 44, 335–353 (2023).
Park, O. H. et al. Endoribonucleolytic Cleavage of m(6)A-Containing RNAs by RNase P/MRP Complex. Mol. Cell 74, 494–507.e498 (2019).
Boo, S. H. et al. UPF1 promotes rapid degradation of m(6)A-containing RNAs. Cell Rep. 39, 110861 (2022).
Choi, S. H. et al. RBM45 is an m(6)A-binding protein that affects neuronal differentiation and the splicing of a subset of mRNAs. Cell Rep. 40, 111293 (2022).
Flamand, M. N., Ke, K., Tamming, R. & Meyer, K. D. Single-molecule identification of the target RNAs of different RNA binding proteins simultaneously in cells. Genes Dev. 36, 1002–1015 (2022).
Xiao, Y. L. et al. Transcriptome-wide profiling and quantification of N(6)-methyladenosine by enzyme-assisted adenosine deamination. Nat. Biotechnol. 41, 993–1003 (2023).
Hu, L. et al. m(6)A RNA modifications are measured at single-base resolution across the mammalian transcriptome. Nat. Biotechnol. 40, 1210–1219 (2022).
Koh, C. W. Q., Goh, Y. T. & Goh, W. S. S. Atlas of quantitative single-base-resolution N(6)-methyl-adenine methylomes. Nat. Commun. 10, 5636 (2019).
Chen, H. X., Zhang, Z., Ma, D. Z., Chen, L. Q. & Luo, G. Z. Mapping single-nucleotide m(6)A by m(6)A-REF-seq. Methods 203, 392–398 (2022).
Zhao, X., Ye, H., Li, T., Rigden, D. J. & Wei, Z. m6AConquer: a data resource for unified quantification and integration of m6A detection techniques. bioRxiv. https://doi.org/10.1101/2024.09.10.612173 (2024).
Tang, Y. et al. m6A-Atlas: a comprehensive knowledgebase for unraveling the N6-methyladenosine (m6A) epitranscriptome. Nucleic Acids Res. 49, D134–D143 (2021).
Brannan, K. W. et al. Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes. Nat. Methods 18, 507–519 (2021).
Li, Q. V. et al. Genome-scale screens identify JNK-JUN signaling as a barrier for pluripotency exit and endoderm differentiation. Nat. Genet. 51, 999–1010 (2019).
Guo, W. et al. Single-molecule m(6)A detection empowered by endogenous labeling unveils complexities across RNA isoforms. Mol. Cell 85, 1233–1246.e1237 (2025).
Pratanwanich, P. N. et al. Identification of differential RNA modifications from nanopore direct RNA sequencing with xPore. Nat. Biotechnol. 39, 1394–1402 (2021).
Kashima, I. et al. Binding of a novel SMG-1-Upf1-eRF1-eRF3 complex (SURF) to the exon junction complex triggers Upf1 phosphorylation and nonsense-mediated mRNA decay. Genes Dev. 20, 355–367 (2006).
Yamashita, A., Ohnishi, T., Kashima, I., Taya, Y. & Ohno, S. Human SMG-1, a novel phosphatidylinositol 3-kinase-related protein kinase, associates with components of the mRNA surveillance complex and is involved in the regulation of nonsense-mediated mRNA decay. Genes Dev. 15, 2215–2228 (2001).
Karousis, E. D., Gypas, F., Zavolan, M. & Mühlemann, O. Nanopore sequencing reveals endogenous NMD-targeted isoforms in human cells. Genome Biol. 22, 223 (2021).
Zhou, Y. et al. m6A sites in the coding region trigger translation-dependent mRNA decay. Mol. Cell 84, 4576–4593.e4512 (2024).
Cun, Y., Guo, W., Ma, B., Okuno, Y. & Wang, J. Decoding the specificity of m6A RNA methylation and its implication in cancer therapy. Mol. Ther. 32, 2461–2469 (2024).
Luo, Z., Zhang, J., Fei, J. & Ke, S. Deep learning modeling m6A deposition reveals the importance of downstream cis-element sequences. Nat. Commun. 13, 2720 (2022).
Baralle, F. E. & Giudice, J. Alternative splicing as a regulator of development and tissue identity. Nat. Rev. Mol. Cell Biol. 18, 437–451 (2017).
Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101–108 (2012).
Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
Kurosaki, T., Popp, M. W. & Maquat, L. E. Quality and quantity control of gene expression by nonsense-mediated mRNA decay. Nat. Rev. Mol. Cell Biol. 20, 406–420 (2019).
Guo, C. J., Xu, G. & Chen, L. L. Mechanisms of long noncoding RNA nuclear retention. Trends Biochem. Sci. 45, 947–960 (2020).
Khan, M., Hou, S., Chen, M. & Lei, H. Mechanisms of RNA export and nuclear retention. Wiley Interdiscip. Rev. RNA 14, e1755 (2023).
Fazal, F. M. et al. Atlas of subcellular RNA localization revealed by APEX-seq. Cell 178, 473–490.e426 (2019).
Kechin, A., Boyarskikh, U., Kel, A. & Filipenko, M. cutPrimers: a new tool for accurate cutting of primers from reads of targeted next generation sequencing. J. Comput. Biol. 24, 1138–1143 (2017).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Wang, J., Pan, Y., Shen, S., Lin, L. & Xing, Y. rMATS-DVR: rMATS discovery of differential variants in RNA. Bioinformatics 33, 2216–2217 (2017).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
Shah, A., Qian, Y., Weyn-Vanhentenryck, S. M. & Zhang, C. CLIP Tool Kit (CTK): a flexible and robust pipeline to analyze CLIP sequencing data. Bioinformatics 33, 566–567 (2017).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e1821 (2019).
Lebrigand, K., Magnone, V., Barbry, P. & Waldmann, R. High throughput error corrected Nanopore single cell transcriptome sequencing. Nat. Commun. 11, 4025 (2020).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods 14, 407–410 (2017).
Wu, T. et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation 2, 100141 (2021).
Acknowledgements
We thank Prof. Yi Xing for technical support. This work was supported by the National Natural Science Foundation of China (32300455, Y.C.), the Guangdong Science and Technology Program (2022A1515110255, Y.C.; 2024A1515011841, J.W.; 2021A1515110650, Y.G.), the China Postdoctoral Science Foundation (2024M753726, Z.R.; 2023M734053, X.H.), and the Guangzhou Science and Technology Program (2025A04J7115, J.W.).
Author information
Contributions
Y.C., J.W. and L.L. designed and supervised the project; X.H. performed the experiments with help from C.W., J.Z., F.W., Q.Z., N.C. and X.S.; Z.R., J.H. and Y.G. performed the data analyses with help from W.G. and Z.W.; J.W., Z.R., Y.C. and L.L. wrote the manuscript with input from all authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ren, Z., He, J., Huang, X. et al. Isoform characterization of m6A in single cells identifies its role in RNA surveillance. Nat Commun 16, 5828 (2025). https://doi.org/10.1038/s41467-025-60869-0
DOI: https://doi.org/10.1038/s41467-025-60869-0