Determination of isoform-specific RNA structure with nanopore long reads

Aw, Jong Ghut Ashley; Lim, Shaun W.; Wang, Jia Xu; Lambert, Finnlay R. P.; Tan, Wen Ting; Shen, Yang; Zhang, Yu; Kaewsapsak, Pornchai; Li, Chenhao; Ng, Sarah B.; Vardy, Leah A.; Tan, Meng How; Nagarajan, Niranjan; Wan, Yue

doi:10.1038/s41587-020-0712-z

Article
Published: 26 October 2020


Nature Biotechnology volume 39, pages 336–346 (2021)

A Publisher Correction to this article was published on 23 March 2021

An Author Correction to this article was published on 12 November 2020

This article has been updated

Abstract

Current methods for determining RNA structure with short-read sequencing cannot capture most differences between distinct transcript isoforms. Here we present RNA structure analysis using nanopore sequencing (PORE-cupine), which combines structure probing using chemical modifications with direct long-read RNA sequencing and machine learning to detect secondary structures in cellular RNAs. PORE-cupine also captures global structural features, such as RNA-binding-protein binding sites and reactivity differences at single-nucleotide variants. We show that shared sequences in different transcript isoforms of the same gene can fold into different structures, highlighting the importance of long-read sequencing for obtaining phase information. We also demonstrate that structural differences between transcript isoforms of the same gene lead to differences in translation efficiency. By revealing isoform-specific RNA structure, PORE-cupine will deepen understanding of the role of structures in controlling gene regulation.

Main

RNAs can fold into complex secondary and tertiary structures to regulate every step of their life cycle¹. The ability to assign correct structures to the right transcript is key to understanding RNA-based gene regulation. Recently, enzymatic and chemical probes have been coupled with high-throughput sequencing to generate large-scale RNA structure information across transcriptomes in vitro and in vivo^{2,3,4,5,6,7,8,9,10}. This has yielded insights into the pervasive regulatory roles of structure in diverse organisms and cellular conditions^1,11. Although powerful, current high-throughput structure mapping approaches suffer from drawbacks, such as complex library preparation protocols and the lack of structural information for full-length transcripts owing to short-read sequencing. As short reads cannot distinguish structures in shared regions between isoforms, this poses a challenge in the ability to correctly assign RNA structural information in individual gene-linked isoforms, limiting understanding of the role of structure in gene regulation.

Recent developments in high-throughput sequencing have enabled long-read, amplification-free complementary DNA (cDNA) sequencing on both the Pacific Biosciences and Oxford Nanopore platforms^12,13,14, enabling the phasing of alternative exons¹⁵. In nanopore sequencing, RNA and DNA molecules can be directly sequenced by measuring the current as the molecules are threaded through a biological pore^16,17. Additionally, natural modifications along DNA or RNA can result in current perturbations, leading to decodable signal anomalies that reveal both the position and identity of modifications along the genome^18,19,20. In principle, artificial RNA modifications caused by structure-probing chemicals are also decodable, but their use for studying full-length RNA structure has yet to be explored.

In this study, we coupled chemical modifications with direct RNA sequencing on nanopores to identify structural patterns in the transcriptome of human embryonic stem cells (hESCs). Our method, which we term PORE-cupine, identifies single-stranded bases along an RNA by detecting current changes induced by structure-dependent modifications (Fig. 1a). As the method involves only a simple two-step ligation protocol (with a preparation time of 2 h before sequencing) without the need for polymerase chain reaction (PCR) amplification, PORE-cupine captures structural information in a transcriptome rapidly and directly. The nature of long-read sequencing through nanopores also allows one to accurately assign and capture structures and their connectivity along individual gene-linked isoforms, deepening one’s understanding of how the complex and extensively spliced transcriptome could take on different structures to regulate isoform-specific gene expression.

**Fig. 1: PORE-cupine leverages machine learning to profile RNA secondary structure.**

Results

Chemical modifications on RNAs can result in detectable errors in direct RNA sequencing

Many chemical probes can modify single-stranded bases in folded RNAs²¹. To determine which chemical probes can result in a detectable signal change during direct RNA sequencing, we tested five different structure-probing compounds that modify single-stranded bases (Extended Data Fig. 1a). These include SHAPE reagents that acylate 2′ hydroxyl (OH) groups of flexible bases: N-methylisatoic anhydride (NMIA), 2-methylnicotinic acid imidazolide (NAI) and 2-methylnicotinic acid imidazolide azide (NAI-N3), as well as base-specific chemical probes: dimethyl sulfate (DMS) and 1-cyclohexyl-3-(2-morpholinoethyl) carbodiimide metho-p-toluenesulfonate (CMCT) (Extended Data Fig. 1b)^6,21,22,23. DMS alkylates single-stranded bases, specifically at N1 of adenines, N3 of cytosines and N7 of guanines, whereas CMCT primarily reacts with single-stranded uracils at N3 and guanines at N1 positions (Extended Data Fig. 1b). We first performed in vitro structure probing using each of these chemicals on Tetrahymena RNA, which has a well-defined secondary structure²⁴. Modified and unmodified Tetrahymena RNAs were ligated to an adapter and attached to a motor protein before being directly sequenced on the nanopore MinION system¹⁶ (Fig. 1a). We sequenced 19,000–68,000 modified Tetrahymena RNA reads for each of the compounds individually and 20,000 unmodified Tetrahymena RNA sequences (Supplementary Table 1).

Tetrahymena RNAs that were modified by the different compounds individually showed similar lengths upon mapping to the sequence, although DMS-modified mapped sequences were slightly shorter (Extended Data Fig. 1c). As bases with modifications can result in errors during base-calling along the sequence, we calculated the proportion of mismatches, insertions and deletions in our mapped modified versus unmodified Tetrahymena RNA reads. Notably, we observed a higher mismatch rate (modified: 6.5%–11.7%, unmodified: 5.3%), deletion rate (modified: 8.7–13.2%, unmodified: 8.7%) and insertion rate (modified: 3.4–4.1%, unmodified: 3%) in modified Tetrahymena RNA sequences for all of the five chemical compounds (Extended Data Fig. 1d–f). Aligning the error rates for the five compounds along the sequence and structure of the Tetrahymena RNA showed that there are peaks that are shared by multiple compounds (Extended Data Figs. 2 and 3a). We further observed that modification-induced mismatches show base-specific changes such that NAI-N3 and DMS modifications on cytosines tend to be miscalled as uracils, whereas NAI-N3 and CMCT modifications on uracils tend to be miscalled as cytosines (Extended Data Fig. 3b). This suggests that there are systematic errors in base-calling when bases are modified, and this could be leveraged for detecting them.
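
The error-rate comparison above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the input format (a list of aligned reference/read base pairs with `-` marking a gap), the function name `error_rates` and the choice of denominator (aligned reference positions) are all assumptions made for the example.

```python
from collections import Counter

def error_rates(aligned_pairs):
    """Return mismatch, insertion and deletion rates for one alignment.

    aligned_pairs: iterable of (ref_base, read_base); '-' marks a gap.
    Rates are expressed per aligned reference position (an assumption;
    the study does not state its denominator in this section).
    """
    counts = Counter()
    for ref, read in aligned_pairs:
        if ref == "-":
            counts["insertion"] += 1   # base present only in the read
        elif read == "-":
            counts["deletion"] += 1    # reference base missing from the read
        elif ref != read:
            counts["mismatch"] += 1
        else:
            counts["match"] += 1
    ref_len = counts["match"] + counts["mismatch"] + counts["deletion"]
    if ref_len == 0:
        raise ValueError("alignment covers no reference positions")
    return {k: counts[k] / ref_len for k in ("mismatch", "insertion", "deletion")}

# Toy alignment: one mismatch, one insertion, one deletion over 7 reference bases.
pairs = list(zip("ACGU-ACG", "ACCUAA-G"))
print(error_rates(pairs))
```

Running the same function on modified and unmodified reads and comparing the three rates reproduces the kind of contrast reported above.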

To determine whether the mismatches, insertions and deletions caused by the compounds reside in expected single-stranded positions according to the Tetrahymena RNA secondary structure, we calculated the performance of these errors using area under the receiver operating characteristic curve (AUC-ROC) analysis on footprinting signals for the Tetrahymena RNA. NAI-N3-induced mismatches resulted in the best performance for detecting single-stranded bases in the secondary structure of the Tetrahymena RNA (Extended Data Fig. 1g–i), suggesting that it is a promising structure-probing compound for further optimization with direct RNA sequencing signals.
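
For readers unfamiliar with the metric, AUC-ROC can be computed directly as the probability that a randomly chosen single-stranded base scores higher than a randomly chosen double-stranded base. The sketch below uses that pairwise definition; the function name, the label convention (1 for single-stranded, 0 for double-stranded) and the toy values are assumptions for illustration only.

```python
def auc_roc(scores, labels):
    """AUC-ROC via pairwise comparison (equivalent to the Mann-Whitney U statistic).

    scores: per-base signal, e.g. a mismatch rate.
    labels: 1 if the base is single-stranded in the reference structure, else 0.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one base of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy example with four bases.
print(auc_roc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

A value of 0.5 corresponds to a signal that carries no structural information, and 1.0 to perfect separation of single-stranded from double-stranded bases.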

PORE-cupine accurately identifies NAI-N3 modifications using machine learning

To detect NAI-N3 signals with higher accuracy, we next used a machine learning strategy known as support vector machine (SVM) to perform anomaly detection on our modified RNA. Upon sequencing and mapping, we used the program Nanopolish, which was first developed to align DNA signals to detect 5-mC¹⁸, to align the current signals to the RNA sequence (Extended Data Fig. 4a). We extracted three features from the current that flows through the channel of the nanopores during sequencing—the current mean, s.d. and dwell time—and determined their distribution in modified versus unmodified bases along the Tetrahymena RNA. Modified single-stranded bases undergo current changes in their s.d. and mean, but not in their dwell time, as compared to unmodified bases, suggesting that we could distinguish modification status based on the above two features (Fig. 1b and Extended Data Fig. 4b–e). By using footprinting data from the Tetrahymena RNA, we then optimized one-class SVM parameters with these two features to best distinguish signals from modified versus unmodified bases (Methods). The extent of modified outliers per base could be calculated as a ‘reactivity score’, whereby, the higher the score, the more single-stranded a base is predicted to be. For example, the double-stranded base 182 in the Tetrahymena RNA is not modified by NAI-N3 and does not show current changes with or without chemical modification upon sequencing (Fig. 1c). However, the single-stranded base 129 is modified upon structure probing, and this is reflected by the ‘comet tail’, indicating deviations in current for the modified base (Fig. 1c).
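
The idea of a reactivity score as the fraction of outlier events at a position can be illustrated with a short sketch. To keep the example dependency-free, the one-class SVM is replaced here by a much simpler stand-in: a z-score distance on the two features (current mean and s.d.) relative to the unmodified sample. The stand-in detector, the threshold of 3 and all function names are assumptions, not the trained model described above.

```python
from statistics import mean, stdev

def fit_reference(unmodified_events):
    """Summarize unmodified events at one position.

    unmodified_events: list of (current_mean, current_sd) tuples, one per read.
    Needs at least two reads, and non-constant values, to estimate spread.
    """
    means = [m for m, _ in unmodified_events]
    sds = [s for _, s in unmodified_events]
    return (mean(means), stdev(means)), (mean(sds), stdev(sds))

def is_outlier(event, reference, threshold=3.0):
    """Stand-in for the one-class SVM: flag events far from the unmodified cloud."""
    (mu_m, sd_m), (mu_s, sd_s) = reference
    z_m = (event[0] - mu_m) / sd_m
    z_s = (event[1] - mu_s) / sd_s
    return (z_m ** 2 + z_s ** 2) ** 0.5 > threshold

def reactivity_score(modified_events, reference, threshold=3.0):
    """Fraction of modified-sample events flagged as outliers at this position."""
    flags = [is_outlier(e, reference, threshold) for e in modified_events]
    return sum(flags) / len(flags)

unmodified = [(100.0, 2.0), (101.0, 2.2), (99.0, 1.8), (100.5, 2.1), (99.5, 1.9)]
modified = [(100.2, 2.0), (110.0, 4.0), (99.8, 2.1), (108.0, 3.5)]
print(reactivity_score(modified, fit_reference(unmodified)))
```

In this toy case, two of the four modified-sample events fall far outside the unmodified distribution, which is the behavior the 'comet tail' describes for a single-stranded base.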

To determine the effect of different extents of modification on direct RNA sequencing, we modified the Tetrahymena RNA in vitro using two different conditions, a standard structure-probing condition (5 min) and an over-modified condition (25 min) (Methods), and sequenced 10,000–51,000 reads for each (Supplementary Table 1). We observed that over-modifying the RNA did not cause RNA degradation (Extended Data Fig. 4f) but resulted in much poorer mappability rates (mappability rates of unmodified, 5 min and 25 min were 75.2%, 81.4% and 16.5%, respectively; Extended Data Fig. 4g), indicating that the high error rates make it difficult to align reads accurately to known sequences. Over-modified RNAs are also shorter in length (median length of 25-min modified RNA = 348 bases, 5-min modified = 378 bases and unmodified = 379 bases; Extended Data Fig. 4h), suggesting that over-modified reads could be prematurely ejected from pores during sequencing. Plotting the coverage of the mapped reads along the length of the Tetrahymena RNA sequence showed the largest decrease in the first 50 bases of the 5′ end for unmodified and 5-min modified samples and in the first 100 bases for the over-modified samples (Extended Data Fig. 4i). Based on these results, we continued with the standard 5-min modification protocol for all subsequent structure experiments.

We observed that reactivity signals from two replicates of the Tetrahymena RNA were highly correlated, indicating that our data are reproducible (R = 0.97; Extended Data Fig. 5a). Bases with high reactivity scores were not observed in additional replicates of unmodified Tetrahymena RNA and were evenly distributed in modified denatured RNA in a non-structure-specific manner, further indicating that the reactivity scores represent real structure modifications (Fig. 1d). In addition to unimodal current profiles for each k-mer, we observed that 2.9% of Tetrahymena RNA k-mers show bimodal current profiles for both mean and s.d. (Extended Data Fig. 5b). Comparing PORE-cupine’s reactivity profile to footprinting signals showed that PORE-cupine has a two-base frameshift relative to footprinting (Extended Data Fig. 5c) and that correcting for this frameshift results in a high Pearson correlation coefficient (Fig. 1e, Extended Data Fig. 5d,e and Supplementary Data 1). As five bases occupy the nanopore channel at a time, this two-base shift indicates that modifications on the third base, which is at the center of the channel, result in the largest current difference in our study. We performed this two-base shift for all of our downstream analysis.
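
The frameshift correction can be illustrated by scanning candidate offsets between the two profiles and keeping the one with the highest Pearson correlation. The sketch below is a hypothetical illustration: the function names, the scanned range of offsets and the toy profiles are assumptions, and the sign convention of the shift is chosen arbitrarily because the direction is not stated in this section.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def shifted_correlation(nanopore, footprint, shift):
    """Correlate nanopore[i] with footprint[i + shift] over the overlapping bases."""
    if shift >= 0:
        a, b = nanopore[: len(nanopore) - shift], footprint[shift:]
    else:
        a, b = nanopore[-shift:], footprint[: len(footprint) + shift]
    n = min(len(a), len(b))
    return pearson(a[:n], b[:n])

def best_shift(nanopore, footprint, max_shift=4):
    """Offset in [-max_shift, max_shift] that maximizes the correlation."""
    return max(range(-max_shift, max_shift + 1),
               key=lambda s: shifted_correlation(nanopore, footprint, s))

# Toy profiles: the nanopore signal is the footprint displaced by two bases.
footprint = [0.1, 0.9, 0.2, 0.8, 0.05, 0.4, 0.7, 0.3, 0.6, 0.15, 0.95, 0.25]
nanopore = footprint[2:] + [0.5, 0.5]
print(best_shift(nanopore, footprint))
```

Once the offset is known, the same fixed shift can be applied to every transcript, as was done for the downstream analysis.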

To optimize SVM’s ability to distinguish modified from unmodified bases accurately in different RNAs, we expanded our training and test set to 14 RNAs (2,663 bases, including two human messenger RNAs (mRNAs)) and used an 80%/20% train-test split (at the RNA level) to refine model parameters (Fig. 1e,f, Extended Data Fig. 5d–f and Supplementary Data 1). Except for the 16S ribosomal RNA (rRNA), for which we performed in vivo structure probing, the remaining 13 RNAs were structure probed in vitro. All RNAs were sequenced as a pool to obtain 0.5–2 million reads using direct RNA sequencing (Supplementary Tables 1 and 2). PORE-cupine analysis showed high reproducibility in reactivity between different biological replicates for RNAs in the test set (Extended Data Fig. 5g–j). Our revised SVM parameters performed similarly to the initial SVM parameters based on the Tetrahymena RNA but exhibited a slightly higher median AUC-ROC score of 0.79 on the test set (Extended Data Fig. 5k). The overall performance of the model was also robust across various train-test splits of the 14 RNAs (Extended Data Fig. 5l and Methods). Separate training based on unimodal and bimodal k-mers did not improve the performance of the SVM for them (Extended Data Fig. 5m). Comparing PORE-cupine’s reactivity with footprints on randomly selected regions of test RNAs (full-length 16S rRNA, RPS29 and AdoCbl riboswitch) demonstrated good correlation, similar to what is observed with biological replicates for footprinting (Supplementary Table 3, Fig. 2a–c, Extended Data Fig. 6 and Supplementary Data 1).
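
Splitting at the RNA level means that all bases of a given RNA fall entirely in either the training set or the test set, so that no RNA contributes to both. A minimal sketch of such a split is shown below; the function name, the rounding rule and the placeholder RNA names are assumptions, and the actual assignment of RNAs used in the study is given in the figures and Methods.

```python
import random

def split_by_rna(rna_names, test_fraction=0.2, seed=0):
    """Assign whole RNAs to train or test so that no RNA is shared between sets."""
    names = sorted(rna_names)              # fixed order before shuffling
    rng = random.Random(seed)              # seeded for reproducibility
    rng.shuffle(names)
    n_test = max(1, round(len(names) * test_fraction))
    return set(names[n_test:]), set(names[:n_test])

# Hypothetical placeholder names for 14 RNAs.
rnas = [f"rna_{i}" for i in range(14)]
train, test = split_by_rna(rnas)
print(len(train), len(test))
```

Repeating the split with different seeds gives a simple way to check that performance does not depend on which RNAs happen to land in the test set.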

**Fig. 2: PORE-cupine performs accurately in vitro and in vivo and captures riboswitch structural dynamics.**

As RNA structures can be dynamic, we next tested whether structural changes in a riboswitch (with and without ligand) can be detected using PORE-cupine²⁵. Applying PORE-cupine on the in vitro modified thiamine pyrophosphate (TPP) riboswitch resulted in 5,000–64,000 sequenced reads (Supplementary Table 1). We observed that corresponding reactivity signals were robust (Extended Data Fig. 7) and that the binding of TPP results in structure differences in the aptamer region (R = 0.3 between water and 10 μM TPP in the aptamer region versus R = 0.9 in non-aptamer regions; Fig. 2d). In addition, we also observed a gradated change in reactivity under different concentrations of TPP in vitro, indicating that PORE-cupine could detect gradual changes in RNA secondary structure (Fig. 2d,e).

Genome-wide analysis of RNA structures in hESCs using PORE-cupine

Groups of transcripts can share similar structures and perform related functions in cells²⁶. We applied PORE-cupine to study the RNA structural landscape in hESCs by sequencing four biological replicates of NAI-N3-modified and two biological replicates of unmodified hESC transcriptomes, totaling 10 million sequenced reads in each condition (Supplementary Table 4 and Extended Data Fig. 8a,b). The mappability of unmodified and modified reads was 86.1% and 59.6%, respectively (Extended Data Fig. 8c), with most reads having a modification rate of 1–2% (Extended Data Fig. 8d), providing good performance for detecting secondary structures in terms of AUC-ROC (Extended Data Fig. 8e). We did not observe a decrease in mapping rate around exon–exon junctions for our modified libraries (Extended Data Fig. 8f). We observed that 0.29% of the k-mers in the hESC transcriptome showed bimodal current profiles for both the current mean and s.d., and that this bimodality is enriched in specific bases along a k-mer (Extended Data Fig. 8g–j). We also observed a drop in read coverage at the extreme 5′ and 3′ ends of transcripts, indicating that these could be blindspots in calculating reactivities (Extended Data Fig. 8k).

To determine the number of reads needed for accurate structure determination, we subsampled the number of unmodified and modified reads of the Tetrahymena RNA and compared the reactivity information obtained from the subsampled set to that of the full data set. As expected, the correlation increased with the number of reads used and began to plateau at around R = 0.8, with 200 reads of unmodified RNA and 100 reads of modified RNA (Fig. 3a). At this threshold, we observed that transcript abundances and reactivity profiles of hESC RNAs were highly correlated across biological replicates, independent of modification status (Fig. 3b), indicating that our data are reproducible (Extended Data Fig. 9a). We obtained structural information for 1,582 coding genes, 98 noncoding genes, 67 pseudogenes and four rRNAs across the hESC transcriptome after filters (Fig. 3c and Extended Data Fig. 9b). The median length of mapped reads was 772 and 752 bases for unmodified and modified libraries, respectively (Extended Data Fig. 9c), with 37.9% and 42.8% of the unmodified and modified transcripts having more than 90% of the annotated gene length, respectively (Extended Data Fig. 9d). We observed that the reactivity profiles of our transcripts were highly consistent with those for near full-length transcripts (>99% of annotated length, n = 83, median Pearson correlation = 0.93; Extended Data Fig. 9e), indicating that our structure information is robust and captures what is found in vivo.

**Fig. 3: Genome-wide structure analysis in hESCs with PORE-cupine confirms known global structural features.**

To benchmark the accuracy of PORE-cupine with other widely used high-throughput, structure-probing methods, such as icSHAPE and SHAPE-MaP^5,6, we performed icSHAPE and SHAPE-MaP on the hESC transcriptome and on known structural RNAs, such as 16S rRNA and the Tetrahymena RNA. We observed that PORE-cupine performed similarly to icSHAPE and SHAPE-MaP on the 16S rRNA and Tetrahymena RNA (AUC-ROC for Tetrahymena RNA: 0.93 for SHAPE-MaP and PORE-cupine and 0.91 for icSHAPE; AUC-ROC for 16S rRNA: 0.8 for icSHAPE and PORE-cupine and 0.77 for SHAPE-MaP; Fig. 3d). To determine whether high-reactivity sites observed in one method were also seen in another independent method, we overlapped PORE-cupine signals with icSHAPE and SHAPE-MaP signals. We observed that 38% of PORE-cupine’s high-reactivity positions overlapped with icSHAPE or SHAPE-MaP sites, whereas 36% of SHAPE-MaP high-reactivity positions overlapped with icSHAPE or PORE-cupine sites, and 39% of icSHAPE high-reactivity positions overlapped with SHAPE-MaP or PORE-cupine sites (Fig. 3e,f and Methods). Although these overlap rates are low, they are consistent with previous observations on read-through versus reverse transcription (RT) stop methods²⁷ and point to the complementary range of various genome-wide, structure-probing methods in capturing different populations of single-stranded bases.
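
The overlap figures above describe, for each method, the fraction of its high-reactivity positions that are also called by at least one of the other two methods. A minimal sketch of that calculation follows. The definition of a high-reactivity position used here (the top fraction of positions ranked by reactivity) is an assumption for illustration; the criterion actually used is described in Methods.

```python
def high_positions(reactivity, top_fraction=0.1):
    """Indices of the most reactive positions (top fraction by rank; an assumed criterion)."""
    n_top = max(1, int(len(reactivity) * top_fraction))
    ranked = sorted(range(len(reactivity)), key=lambda i: reactivity[i], reverse=True)
    return set(ranked[:n_top])

def overlap_fraction(query, *others):
    """Fraction of positions in `query` found in at least one of the other sets."""
    if not query:
        raise ValueError("query set is empty")
    union = set().union(*others)
    return len(query & union) / len(query)

# Toy position sets standing in for three methods.
method_a = {1, 2, 3, 4}
method_b = {2, 9}
method_c = {4, 7}
print(overlap_fraction(method_a, method_b, method_c))
```

Because the denominator is the query method's own set of sites, the three reported percentages need not be equal even though they are computed from the same three sets.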

To determine whether PORE-cupine could capture global structural properties seen in other high-throughput, structure-probing data sets, we calculated the average reactivity signal in different RNA classes. As expected, we observed that rRNAs are the most highly structured, followed by long noncoding RNAs (lncRNAs) and mRNAs, in agreement with the importance of structure for noncoding RNAs³ (Extended Data Fig. 9f). Metagene analysis of mRNAs aligned by their translational start and stop sites showed the classic three-nucleotide structural periodicity in their coding sequences and not in their 5′ and 3′ untranslated regions (UTRs)^2,10,28 (Extended Data Fig. 9g), highlighting PORE-cupine’s ability to recapitulate known structural patterns in other data sets.

As icSHAPE and SHAPE-MaP can identify RNA-binding protein (RBP) sites by detecting different SHAPE reactivities in bound versus unbound positions²⁹, we evaluated whether PORE-cupine could also detect RBP binding sites in our hESC data. We examined the reactivity profiles of Lin28 binding sites with and without high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation (HITS-CLIP) binding evidence in hESCs³⁰ (Methods). We observed that Lin28 binding sites with HITS-CLIP evidence showed an increase in reactivity in the bases flanking the binding motif and a decrease in reactivity within the binding motif (Fig. 3g). This indicates that real Lin28 binding sites are more structurally accessible around the motif and that Lin28 binding likely prevents NAI-N3 from modifying the RNA sequence.

Besides RBP binding, previous studies also showed that single-nucleotide variants (SNVs) can result in structural changes along an RNA³. To determine whether PORE-cupine could identify structural changes due to point mutations, we first identified mutations in the hESC transcriptome using Illumina RNA sequencing data (Methods). We then separated mapped direct RNA sequencing reads based on the different alleles observed. We identified 90 transcripts with two or more SNVs and sufficient coverage for reactivity analysis: 10/90 SNVs were observed to result in statistically significant reactivity changes (11.1%, Fisher’s exact test; Methods). Metagene analysis of the reactivity profiles across alleles showed that the largest reactivity differences occur locally and extend up to 25 bases upstream and downstream of the SNV location (Fig. 3h,i).
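
For reference, Fisher's exact test on a 2×2 table can be written in a few lines. The sketch below is generic: how the table is populated for the allele comparison (for example, counts of outlier versus non-outlier events for each allele) is an assumption here, and the exact construction is described in Methods.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Toy table: 3 vs 1 outlier events for allele 1, 1 vs 3 for allele 2.
print(fisher_exact_two_sided(3, 1, 1, 3))
```

With many SNVs tested, P values from such a test would normally be interpreted alongside a correction for multiple testing.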

Detecting structural differences in shared exons from alternative isoforms

The human transcriptome is extensively spliced^31,32. As most short reads fall in sequences shared between transcript isoforms, short-read sequencing cannot reveal structural differences between isoforms, which can have considerable functional consequences³³. To analyze RNA structure across isoforms, we mapped direct RNA sequencing data to transcripts present in the Ensembl database (Methods) and obtained 104 genes (corresponding to 204 pairwise transcript comparisons) that had two or more isoforms with sufficient coverage for accurate structure characterization for downstream analysis (Fig. 4a and Extended Data Fig. 10a).

**Fig. 4: PORE-cupine reveals structural differences in shared exons between alternative isoforms.**

We observed that most gene-linked isoforms (178 of 204, 87%) showed reactivity differences in shared regions (Methods). Globally, this is reflected in lower reactivity similarities in shared sequences across gene-linked isoforms as compared to biological replicates of the same transcript (Fig. 4b). In general, isoform pairs with greater sequence similarities are more structurally similar in shared regions (Fig. 4c). This correlation is stronger when there are two or more alternative splice sites along a transcript, resulting in widespread reactivity changes across the entire transcript (Fig. 4c). Although the largest reactivity differences appear to occur immediately around the alternative splice site, 70% of isoforms contain both local and distal (>200 bp away) reactivity changes relative to splice sites (Fig. 4d). We confirmed this distal effect on structures by showing that identical sequences that are far away from an alternative splice site show lower reactivity correlation between gene-linked isoforms in the same replicate than between identical transcripts across biological replicates (Fig. 4e).

PORE-cupine can phase structures along long isoforms

The presence of two or more alternative structures that reside in and span identical sequences makes it particularly challenging for short-read sequencing to determine which combinations of RNA structures coexist in an isoform (Extended Data Fig. 10b). An example of this is RPS8, which is alternatively spliced into two isoforms that share identical sequences for three exons near the 3′ end but are alternatively spliced near the 5′ end (Fig. 4f). PORE-cupine analysis shows that the two isoforms contain different structures (A1 versus A2 and B1 versus B2) that are separated by ~400 bases from each other in the shared sequences. In short-read sequencing, the lack of connectivity between structures A and B makes it difficult to know whether A1 is linked to B1 or B2 in the blue isoform and vice versa. However, PORE-cupine enables us to link and correctly assign structural information to their individual isoforms in shared regions (Fig. 4g). Globally, 36.4% of the transcripts contained two structure-changing regions that were more than 200 bases apart (Extended Data Fig. 10c), demonstrating the importance of PORE-cupine for phasing structures along long isoforms by providing connectivity in RNA structure information across the transcriptome.

Isoforms with structural differences show differences in translation efficiency

Different RNA structures are used to regulate gene expression, including translation, splicing and decay¹. To determine whether structural differences between isoforms could regulate translation, we performed transcript isoform in polysomes sequencing (TrIP-seq) on hESCs to analyze the distribution of gene-linked isoforms across a polysome gradient³⁴. Isoforms that are found predominantly in higher polysome fractions are typically associated with more ribosomes and have higher translation rates, although RNAs could also be associated with other high-molecular-weight complexes in high polysome fractions (Extended Data Fig. 10d,e). We obtained a high degree of correlation between two biological replicates across polysome fractions (Extended Data Fig. 10f) and observed that highly translated transcripts are found in high polysome fractions, whereas poorly translated transcripts are found in low polysome fractions as expected (Extended Data Fig. 10g), indicating that our TrIP-seq data provide an accurate reflection of mRNA–polysome association.

Of 178 structure-changing isoform pairs, 153 pairs had polysome fractionation data. We observed that 28 pairs showed changes in translation efficiency by TrIP-seq (18.3%, Fisher’s exact test; Methods). Gene-linked isoforms that are structurally similar are translated at similar rates, whereas structurally more divergent pairs show greater differences in their translation, suggesting that isoform-specific structures could affect isoform-specific translation (Fig. 5a). We observed that one of the isoform pairs of RPL17 showed reactivity differences in shared regions based on PORE-cupine analysis, as well as translation efficiency differences. The transcript ENST00000618619.4 (RPL17_1) is highly translated, whereas ENST00000579408.5 (RPL17_2) is poorly translated and contains a retained intron of 161 bases in the 5′ UTR (Fig. 5b,c). To study how this retained intron resulted in structural changes in the 5′ UTR and in translation repression, we examined pairwise RNA interactions in this region from a previously published data set that uses proximity ligation sequencing (Sequencing of Psoralen crosslinked, Ligated and Selected Hybrids (SPLASH))³⁵. SPLASH reads showed strong interactions between the retained intron and sequences upstream and downstream to it, resulting in an extensively structured environment around the start codon (Fig. 5d). In the absence of the retained intron (RPL17_1), the isoform folds into a simpler structure, allowing the start codon to be more accessible for translation.

**Fig. 5: Structure differences between isoforms are correlated with translation efficiency.**

To experimentally validate that the poor translatability of RPL17_2 is indeed due to extensive structures formed by the retained intron around the start codon, we cloned the 5′ ends of RPL17_1 and RPL17_2 in front of a luciferase reporter and performed mutagenesis experiments on RPL17_2 (Fig. 5e and Methods). We confirmed that RPL17_1 indeed translates much better than RPL17_2, as shown by >10-fold higher luciferase units upon RNA transfection (Fig. 5f). Mutations that disrupt the pairwise interactions of three different stems (1.1 or 1.2, 2.1 or 2.2 and 3.1 or 3.2; Fig. 5e) open the structures around the start codon and increase the translatability of RPL17_2, whereas compensatory mutations that restore the helical structures partially rescue the poor translatability of RPL17_2 (Fig. 5f). These results confirm that structure plays an important role in regulating isoform-specific translation of RPL17.

Discussion

The human transcriptome is tightly regulated by sequence and structural features along each transcript^32,36. Assigning correct structural information to individual transcripts is the first step in understanding structure-based gene regulation. In this study, we coupled RNA structure probing with direct RNA sequencing on nanopores to better understand structure–function relationships in the cell. We initially tested five different structure-probing compounds, as it is unclear how the size, charge and location of the modifications (on the base or sugar) along RNA could perturb the current flowing through the nanopore. Although we observed that DMS modifications resulted in the highest number of errors upon mapping, these errors occurred in double-stranded as well as single-stranded bases, probably due to the ability of DMS to modify guanines in a structure-independent manner. In contrast, the SHAPE compound NAI-N3 modifies RNAs to result in errors that are more enriched in single-stranded bases.

Although errors detected during nanopore sequencing, such as mismatches, insertions and deletions, are a convenient way to determine whether a modification has an effect on direct RNA sequencing, machine learning strategies, such as SVMs, enable the identification of NAI-N3-modified bases with high accuracy. In addition, the frequency of outliers at each base can provide a reactivity score that serves as a proxy for single-strandedness at that base. Similarly to other direct RNA sequencing experiments, we observed that RNAs longer than 200 bases are sequenced better by direct RNA sequencing¹⁶, and that there is a 5′ end decay in signal particularly in the first 10% of bases of the transcript, which could be due to degradation during poly(A) selection or when the RNA is being sequenced. We also observed a signal decay at the very 3′ end of the transcripts, which could indicate incorrect annotations.

Compared to other short-read, high-throughput sequencing methods^2,6,8,10, PORE-cupine provides a fast, direct and complementary way to assay for RNA structure and dynamics genome wide. Low-reactivity regions could indicate either double-strandedness or protection from the compound, such as upon RBP binding. PORE-cupine also requires a minimum number of reads (200 reads of modified RNA and 100 reads of unmodified RNA) for accurate reactivity analysis, as most of the reads contain only 1–2% modifications and hence require us to aggregate signals across the strands to be able to detect structure. As such, deep sequencing is needed to be able to obtain RNA structure information using direct RNA sequencing. We obtained structure information for a total of 1,751 transcripts with ~40 million sequencing reads and estimate that at least 26 million more reads are needed to double the amount of structure information transcriptome wide. As we are currently mapping reads to transcripts in the Ensembl database, we are detecting structures on transcripts that have been annotated. Further efforts in de novo assembly of RNA transcripts would enable us to identify structural information on novel transcripts.

As PORE-cupine obtains structural information along long stretches of RNA, we can determine and phase RNA reactivities in gene-linked isoforms. However, alternative splicing at the extreme 5′ or 3′ ends of the transcripts limits the ability to uniquely map reads to individual gene-linked isoforms due to 5′ and 3′ end decay. We focused our analysis on mRNA gene-linked isoforms due to controversies on whether lncRNAs could be translated. We observed that many gene-linked isoforms exhibit reactivity differences in shared regions, and that this is associated with changes in translational efficiency using polysome profiling. Although polysome profiling data are not a perfect proxy for translation, as RNAs could also be associated with other RBPs that reside in different polysome fractions, we do observe that highly translated RNAs, such as ACTB, and poorly translated RNAs, such as ATF4, are in high and low polysome fractions, respectively, in our data. Furthermore, our luciferase experiments on a poorly translated versus highly translated isoform validated the polysome fractionation results, suggesting that transcripts were largely translated according to their presence in the polysome fractions. Lastly, our mutational experiments on RPL17 further demonstrate the importance of isoform-specific structure in regulating translation.

PORE-cupine expands the current repertoire of RNA structure probing strategies to deepen our understanding of the role of RNA structure in isoform-specific gene regulation²,⁶,⁹,¹⁰,³⁷. Although the initial version still requires an aggregate of signals across many strands to obtain accurate structure reactivity information, further developments of the method, by increasing the modification frequency per strand and improving sequencing and analytical techniques, could enable the study of RNA structures at a single-molecule level in the future.

Methods

RNA-modifying reagents

CMCT, NAI, NMIA and DMS were purchased from Sigma-Aldrich. NAI-N3 was synthesized from ethyl 2-methylnicotinate in four steps, as previously described by Spitale et al.⁶.

In vitro transcription, folding and in vitro structure probing

RNA was transcribed from PCR-amplified inserts using the HiScribe T7 High Yield Synthesis Kit (NEB). The RNA of interest was folded and structure probed in the presence or absence of ligand (TPP). Depending on the solvent for the RNA-modifying chemical, DMSO or water was added to the negative control. CMCT, NAI and NAI-N3 were added to final concentrations of 100 mM, whereas NMIA and DMS were used at final concentrations of 20 mM and 5% (vol/vol), respectively. DMS reactions were quenched with 30% β-mercaptoethanol in 0.3 M sodium acetate. Reactions were column purified (Zymo Research) and resuspended in nuclease-free water.

RNA footprint analysis

To determine sites of modifications along an RNA, an RT primer (IDT) was designed around 20 bp downstream of the region of interest. Primers were radiolabeled with ³²P and purified on a 15% TBE-urea PAGE gel. The purified labeled primer (1 µl) was then incubated with 500 ng of RNA (in 5.5 µl) for RT and run on an 8% TBE-urea PAGE sequencing gel. Gels were dried for 2 h on a vacuum gel drier before being exposed to a phosphorimager plate for 24 h. The phosphorimager plates were imaged on an Amersham Typhoon 5 Biomolecular Imager (General Electric). Gel images were quantified using semi-automated footprinting analysis (SAFA) software³⁸.

Human and bacterial cell culture with in vivo SHAPE modification

hESCs (H9) were obtained from labs in the Genome Institute of Singapore and cultivated in feeder-free conditions with mTeSR basal medium supplemented with mTeSR supplement (STEMCELL Technologies).

For in vivo SHAPE modification, cells were rinsed once on the plate with room temperature PBS (Thermo Fisher Scientific) before being incubated with Accutase (STEMCELL Technologies) for 10 min at 37 °C to dissociate cells. The cells were washed with PBS and spun down at 400g for 5 min before being resuspended in 950 µl of PBS in a 1.7-ml Eppendorf. NAI-N3 (2 M in 50 µl) in DMSO⁺ or DMSO⁻ was then added to two separate cell suspensions and immediately mixed by inversion, and the cells were incubated at 37 °C at 10 r.p.m. (Model 400 hybridization incubator, SciGene) for 5 min. After the incubation period, the cell suspensions were immediately spun down at 400g for 5 min at 4 °C. The supernatant was removed before total RNA was isolated with the addition of TRIzol reagent (Thermo Fisher Scientific).

Bacillus subtilis strain 168 was grown in LB media at 37 °C to an OD₆₀₀ = 0.6. Cells were harvested by centrifugation at 3,000 r.p.m. for 5 min and pelleted and washed in PBS, before being treated with 100 mM NAI-N3 as in the H9 example above for 5 min at 37 °C. The cells were pelleted after incubation, resuspended in bacterial lysis buffer (1% SDS, 8 mM EDTA and 100 mM NaCl) and lysozyme (final concentration, 15 mg ml⁻¹), and incubated for 15 min at 37 °C. Total RNA was then isolated with the addition of TRIzol reagent.

Direct RNA sequencing library preparation

Direct RNA sequencing libraries were constructed using an input of 500 ng of RNA for single templates or 1 µg of mRNA isolated from H9 hESC total RNA. For the training and test set of RNAs, a total of 14 RNAs were structure probed individually in vitro and then pooled together and sequenced as a mix. Of these 14 RNAs, 11 were used as the training set for the SVM, and the remaining three were used as a test set.

H9 mRNA was isolated using the Poly(A) Purist MAG Kit (Thermo Fisher Scientific). In this study, the direct RNA library preparation kit SQK-RNA001 from Oxford Nanopore Technologies was used for all sequencing runs. All preparation steps were followed according to the manufacturer’s specifications, except for the omission of a single RT step. The libraries were loaded onto R9.4.1 or R9.5 flow cells and sequenced on a MinION device. For the in vitro and 16S rRNA templates, the RTA DNA adapter from the sequencing kit in the first step was replaced with a DNA adapter complementary to the 3′ end of the RNA. The sequences of these adapters are detailed in Supplementary Table 1. The sequencing parameters were modified for all runs, and the specific changes are described in the document ‘Modified Minkown parameters’ found in our GitHub repository: https://github.com/awjga/PORE-cupine.

Polysome fractionation

H9 cells from a 15-cm plate were treated for 10 min with 100 µg ml⁻¹ cycloheximide at 37 °C. The cells were next washed with warm PBS, dissociated with trypsin and neutralized with ice-cold media containing FBS (all supplemented with 100 µg ml⁻¹ cycloheximide). Next, they were pelleted before being resuspended in 1× RSB buffer (10 mM Tris-HCl (pH 7.4), 150 mM NaCl and 15 mM MgCl₂) with 200 µg ml⁻¹ cycloheximide, lysed in lysis buffer (10 mM Tris-HCl (pH 7.4), 150 mM NaCl, 15 mM MgCl₂, 1% Triton-X, 2% Tween-20 and 1% deoxycholate) and incubated on ice for 10 min. Centrifugation at 12,000g for 3 min removed the nuclei, and the supernatant was removed for a subsequent centrifugation.

Equal OD units were loaded onto a linear 10–50% sucrose gradient that was made using the 107 Gradient Master Ip (BioComp Instruments). The gradients containing the cell lysate were centrifuged in SW41 bucket rotors (Beckman Coulter) at 36,000 r.p.m. at 8 °C for 2 h. Twelve fractions were separated and collected from the top of the gradient using the PGF Ip Piston Gradient Fractionator (BioComp Instruments) and the Fraction Collector (FC-203B, Gilson). The absorbance readings were collected at 260 nm with an Econo UV Monitor (EM-1 220 V, Bio-Rad). After fractionation, 110 µl of 20% SDS and 12 µl of proteinase K (Thermo Fisher Scientific) were added to each fraction for a 30-min incubation at 42 °C, after which 10 µl of GeneChip Eukaryotic Poly-A RNA controls (Affymetrix; final concentration, 1:120,000) were added to each fraction. Total RNA from each fraction was extracted using phenol-chloroform-isoamyl alcohol (25:24:1, Sigma-Aldrich), poly(A) selected and made into a cDNA library using Ultra Directional RNA Library Prep Kit (NEB), following the manufacturer’s instructions.

Data analysis

Processing and quantification of TrIP-seq data

Raw paired-end reads were first trimmed and then mapped to the human transcriptome (Ensembl version GRCh38.93) using Salmon³⁹ (options: -l A --seqBias --gcBias --posBias). Relative abundances of each isoform were estimated as transcripts per million (TPM) by Salmon, and corresponding values were used for downstream analysis.

Base-calling and mapping of nanopore reads

Reads were base-called with Albacore version 2.3.3 or Guppy version 3.1.5 without filtering. Base-called sequences were aligned with GraphMap version 0.5.2 (ref. ⁴⁰). For single-gene mapping, references for the individual genes were used. For H9 transcriptome mapping, cDNA and noncoding reference sequences obtained from Ensembl were used (GRCh38).

Determination of single-stranded positions

To evaluate the performance of various structure-probing methods, single-stranded positions were determined as those having a value greater than 1 s.d. above the median of the SAFA value within a gel. Single-stranded positions were then used to evaluate various methods and determine ROC curves.
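The thresholding rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' script: the function name, the use of the sample standard deviation and the example intensities are all assumptions.

```python
from statistics import median, stdev


def single_stranded_positions(safa_values, n_sd=1.0):
    """Return 0-based indices whose value exceeds median + n_sd * s.d."""
    # Sample s.d. is an assumption; the text does not say which estimator was used.
    threshold = median(safa_values) + n_sd * stdev(safa_values)
    return [i for i, value in enumerate(safa_values) if value > threshold]


# Hypothetical band intensities for one gel lane (illustrative only).
gel = [0.1, 0.2, 0.1, 1.5, 0.2, 0.1, 0.3, 2.0, 0.1, 0.2]
print(single_stranded_positions(gel))
```

The returned indices can then serve as the positive class when computing ROC curves for each probing method.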

Calculation and comparison of error rates

Mismatch, deletion and insertion rates were calculated using custom Python scripts from aligned BAM files. The Wilcoxon rank sum test was used to calculate significance of error rates between modified and unmodified samples. Fold changes in mismatch, deletion and insertion rates per position were calculated by dividing the error rate in modified samples by the rate in unmodified samples. The fold change was Winsorized: values ≥99th percentile were set to the value at the 99th percentile, and values <1 were set to 1. AUC-ROC values for mismatch, deletion and insertion rate-based prediction of single-stranded bases were calculated and compared between modified and unmodified libraries using the Wilcoxon test.
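As a rough illustration of the fold-change and Winsorization step described above, the following Python sketch clamps per-position fold changes between 1 and the value at an upper percentile. The function name, the small `eps` guard against division by zero and the nearest-rank percentile estimator are assumptions made for this example; the text does not specify them.

```python
import math


def winsorized_fold_change(modified, unmodified, upper_pct=99.0, floor=1.0, eps=1e-6):
    """Per-position error-rate fold change, clamped to [floor, percentile cap]."""
    # eps guards against division by zero; it is an assumption, not from the paper.
    fold = [m / max(u, eps) for m, u in zip(modified, unmodified)]
    # Nearest-rank percentile (the estimator is also an assumption).
    ranked = sorted(fold)
    rank = max(1, math.ceil(upper_pct / 100.0 * len(ranked)))
    cap = max(ranked[rank - 1], floor)
    return [min(max(f, floor), cap) for f in fold]


# Hypothetical per-position mismatch rates (illustrative only).
modified = [0.5, 0.125, 0.75, 0.25]
unmodified = [0.25, 0.25, 0.125, 0.25]
print(winsorized_fold_change(modified, unmodified))
print(winsorized_fold_change(modified, unmodified, upper_pct=75.0))
```

Setting the floor to 1 means that positions where the modified library has fewer errors than the unmodified library are treated as showing no enrichment.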

Alignment of nanopore signals

Current measurements above 200 pA and below 0 pA were considered as outlier values and were removed from the raw nanopore sequencing files. The current signal was aligned using Nanopolish (version 0.10.2). As the current mean drifts with increasing sequencing time, we normalized the current per strand to that of the expected model current mean in Nanopolish. Multiple events from the same read and position were collapsed into a single value by calculating the weighted average of event mean and event s.d. and taking the sum of event lengths.
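The event-collapsing step can be sketched as follows. This is a minimal sketch under one stated assumption: events are weighted by their length, which the text does not confirm. The function name and the example tuples are hypothetical.

```python
def collapse_events(events):
    """Collapse (mean, sd, length) events from one read at one position."""
    total_length = sum(length for _, _, length in events)
    if total_length <= 0:
        raise ValueError("events must have positive total length")
    # Length-weighted averages; the weighting scheme is an assumption.
    mean = sum(m * length for m, _, length in events) / total_length
    sd = sum(s * length for _, s, length in events) / total_length
    return mean, sd, total_length


# Two hypothetical events aligned to the same read and position.
events = [(100.0, 2.0, 10), (110.0, 4.0, 30)]
print(collapse_events(events))
```

Each read then contributes exactly one (mean, s.d., length) triple per position, which keeps the feature vectors used downstream the same size for every read.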

Training of parameters, determination of thresholds and calculation of reactivity profiles

A one-class SVM was used to determine the percentage of modifications per position. Specifically, the current mean and current s.d. were used as features for each base.

To determine the number of unmodified and modified reads needed for robust analysis, we subsampled reads to various depths (25–500, 100 iterations) and compared the reactivities of the subsampled strands to that of the full data set using Pearson correlation. Reactivity scores were determined per position by calculating the percentage of modified bases detected using one-class SVM. For the hESC H9 transcriptome, transcripts that were present in both replicates of the modified libraries with a minimum of 100× coverage and transcript length >200 nucleotides were retained for analysis. To compare the reactivity across isoforms, a five-nucleotide moving average was applied to the transcript reactivity profile followed by z-score normalization.
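The smoothing and normalization used for cross-isoform comparison can be illustrated with the short Python sketch below. It assumes a centered window that is truncated at the transcript ends and a population standard deviation for the z-score; neither detail is given in the text, and the function name and example profile are hypothetical.

```python
from statistics import mean, pstdev


def smooth_and_zscore(reactivity, window=5):
    """Apply a centered moving average, then z-score the smoothed profile."""
    half = window // 2
    smoothed = []
    for i in range(len(reactivity)):
        # The window is truncated at the ends of the transcript (an assumption).
        lo, hi = max(0, i - half), min(len(reactivity), i + half + 1)
        smoothed.append(mean(reactivity[lo:hi]))
    mu, sigma = mean(smoothed), pstdev(smoothed)
    if sigma == 0:
        raise ValueError("profile is constant; z-score is undefined")
    return [(value - mu) / sigma for value in smoothed]


# Hypothetical per-position reactivities (fraction of reads called modified).
profile = [0.0, 0.1, 0.4, 0.1, 0.0, 0.0, 0.3, 0.5, 0.2, 0.0]
z = smooth_and_zscore(profile)
print([round(value, 2) for value in z])
```

Normalizing each profile to zero mean and unit variance lets isoforms sequenced at different depths be compared on the same scale.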

Reactivity near RBP binding sites

HITS-CLIP libraries for the RBP Lin28 (SRR531463 and SRR531464) were downloaded from the Sequence Read Archive and mapped to the human genome (Homo_sapiens.GRCh38) using BWA⁴¹. Binding peaks were enriched using the CLIP Tool Kit package⁴², and binding motifs were detected using HOMER⁴³. We analyzed the reactivities −50 bp to +50 bp around Lin28 motifs (G{GU}AG{C}A) that have CLIP binding peaks and randomly selected the same number of motifs outside of CLIP peaks as controls (compared using a t-test). In total, 552 HITS-CLIP binding sites on 316 transcripts were used for the analysis of Lin28 binding in our data. We randomly sampled 552 sites with the same motif sequence on 351 transcripts as control.

SNV structure analysis

Illumina RNA sequencing reads from the libraries RHN1291 and RHN1295 were mapped to the human genome (Homo_sapiens.GRCh38) using BWA. SNVs were called using bcftools with default parameters⁴⁴. From the identified SNV positions, nanopore mapped reads were separated into three categories, corresponding to matching (reads that match the annotation), mutated (mutated reads based on the variant calling results) and unclassified (reads that are not found in the previous two groups) by Biostar214299 from Jvarkit. Matching and mutated reads were used for further SNV analysis.

For each of the SNV pairs, both transcripts were filtered for having >200 unmodified reads and >100 modified reads. SNV pairs with at least one changing region were considered as having a change in structure.

Determination of structure changing regions

To determine whether a base changes structure significantly between two reactivity profiles, we used Fisher’s exact test to compare the number of unmodified and modified reads at each position between two transcripts. A five-nucleotide sliding window was applied across the transcripts; regions with two or more positions with P < 0.05 across a transcript pair that were not significant between biological replicates were identified as structurally changing. Hommel’s method was used for false discovery rate (FDR) correction of the P value. As a structure-changing region cannot be structurally different across biological replicates, this allows us to filter out regions that fluctuate in coverage across biological replicates, reducing the amount of noise that is called as structurally significant.
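One way to read the procedure above is sketched below in Python. It is an interpretation, not the authors' code: a position counts as a hit when it is significant between the two transcripts and not significant between replicates, and a window is reported when it holds at least two hits. The Hommel correction is omitted for brevity, and all function names and counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def changing_windows(p_pair, p_replicate, window=5, min_sig=2, alpha=0.05):
    """Start indices of windows holding >= min_sig replicate-stable hits."""
    hits = [pp < alpha and pr >= alpha for pp, pr in zip(p_pair, p_replicate)]
    return [
        start
        for start in range(len(hits) - window + 1)
        if sum(hits[start:start + window]) >= min_sig
    ]


# Hypothetical modified-read counts out of 100 reads per position.
mod_a = [10, 40, 12, 35, 9, 11, 10, 12]
mod_b = [11, 10, 12, 8, 10, 12, 9, 11]
mod_a_rep = [11, 38, 12, 36, 10, 11, 9, 12]
p_pair = [fisher_exact_two_sided(x, 100 - x, y, 100 - y) for x, y in zip(mod_a, mod_b)]
p_rep = [fisher_exact_two_sided(x, 100 - x, y, 100 - y) for x, y in zip(mod_a, mod_a_rep)]
print(changing_windows(p_pair, p_rep))
```

Requiring replicate stability is what removes positions whose apparent change reflects coverage fluctuation and not structure.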

Analysis of isoform pairs

Transcript coordinates were converted into genomic coordinates to allow ease of comparison across isoforms. Two transcripts were considered to be a gene-linked isoform pair if they had overlapping genomic positions and >100 bp of unique positions. Reactivity values from shared positions for each isoform pair were retained for comparisons. For global analysis, all shared positions were used to calculate the Pearson correlation. For local analysis, 100 nucleotides to the left and right of differential splice sites were used (sites with fewer adjacent bases were excluded).
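The pairing rule and the global comparison can be sketched as follows. This is a hedged example: ">100 bp of unique positions" is interpreted here as the number of genomic positions present in exactly one of the two transcripts, which is an assumption, and the function names and coordinates are hypothetical.

```python
from math import sqrt


def is_isoform_pair(positions_a, positions_b, min_unique=100):
    """True if two transcripts overlap and differ at more than min_unique positions."""
    a, b = set(positions_a), set(positions_b)
    # Unique positions are those present in exactly one transcript (an assumption).
    return bool(a & b) and len(a ^ b) > min_unique


def shared_reactivity(react_a, react_b):
    """Paired reactivity lists over genomic positions present in both dicts."""
    shared = sorted(set(react_a) & set(react_b))
    return [react_a[p] for p in shared], [react_b[p] for p in shared]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical genomic coordinates covered by two transcripts of one gene.
iso_a = range(1000, 1300)
iso_b = range(1150, 1500)
print(is_isoform_pair(iso_a, iso_b))

# Hypothetical reactivities keyed by genomic position.
xs, ys = shared_reactivity({1: 1.0, 2: 2.0, 3: 3.0, 9: 0.5}, {1: 2.0, 2: 4.0, 3: 6.0})
print(pearson(xs, ys))
```

Working in genomic coordinates means that a shared exon has the same keys in both isoforms, so shared positions fall out of a simple set intersection.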

Translation efficiency (TE) for each transcript was determined by TrIP-seq. To calculate the significance in TE differences between two isoform pairs, the raw counts for alleles across both low polysome fractions (sum of fractions 6 and 7) and high polysome fractions (sum of fractions 9 and 10) were compared using Fisher’s exact test to assess if the reads derived from the reference allele are significantly enriched or depleted in high polysome fractions (P < 0.05). Hommel’s method was used for FDR correction of P values.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

Raw sequencing data and reactivity profiles can be downloaded from the Gene Expression Omnibus under accession number GSE133361. Source data are provided with this paper.

Code availability

Source code for all scripts (R version 3.4.1) and commands used for analysis can be found at http://github.com/awjga/PORE-cupine.

Change history

12 November 2020
A Correction to this paper has been published: https://doi.org/10.1038/s41587-020-00755-w
23 March 2021
A Correction to this paper has been published: https://doi.org/10.1038/s41587-021-00889-5

References

Wan, Y., Kertesz, M., Spitale, R. C., Segal, E. & Chang, H. Y. Understanding the transcriptome through RNA structure. Nat. Rev. Genet. 12, 641–655 (2011).
Kertesz, M. et al. Genome-wide measurement of RNA secondary structure in yeast. Nature 467, 103–107 (2010).
Wan, Y. et al. Landscape and variation of RNA secondary structure across the human transcriptome. Nature 505, 706–709 (2014).
Wan, Y. et al. Genome-wide measurement of RNA folding energies. Mol. Cell 48, 169–181 (2012).
Siegfried, N. A., Busan, S., Rice, G. M., Nelson, J. A. & Weeks, K. M. RNA motif discovery by SHAPE and mutational profiling (SHAPE-MaP). Nat. Methods 11, 959–965 (2014).
Spitale, R. C. et al. Structural imprints in vivo decode RNA regulatory mechanisms. Nature 519, 486–490 (2015).
Lucks, J. B. et al. Multiplexed RNA structure characterization with selective 2′-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proc. Natl Acad. Sci. USA 108, 11063–11068 (2011).
Rouskin, S., Zubradt, M., Washietl, S., Kellis, M. & Weissman, J. S. Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature 505, 701–705 (2014).
Zubradt, M. et al. DMS-MaPseq for genome-wide or targeted RNA structure probing in vivo. Nat. Methods 14, 75–82 (2017).
Ding, Y. et al. In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features. Nature 505, 696–700 (2014).
Strobel, E. J., Yu, A. M. & Lucks, J. B. High-throughput determination of RNA structures. Nat. Rev. Genet. 19, 615–634 (2018).
Sharon, D., Tilgner, H., Grubert, F. & Snyder, M. A single-molecule long-read survey of the human transcriptome. Nat. Biotechnol. 31, 1009–1014 (2013).
Au, K. F. et al. Characterization of the human ESC transcriptome by hybrid sequencing. Proc. Natl Acad. Sci. USA 110, E4821–E4830 (2013).
Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 30, 693–700 (2012).
Tilgner, H. et al. Comprehensive transcriptome analysis using synthetic long-read sequencing reveals molecular co-association of distant splicing events. Nat. Biotechnol. 33, 736–742 (2015).
Garalde, D. R. et al. Highly parallel direct RNA sequencing on an array of nanopores. Nat. Methods 15, 201–206 (2018).
Oikonomopoulos, S., Wang, Y. C., Djambazian, H., Badescu, D. & Ragoussis, J. Benchmarking of the Oxford Nanopore MinION sequencing for quantitative and qualitative assessment of cDNA populations. Sci. Rep. 6, 31602 (2016).
Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods 14, 407–410 (2017).
Parker, M. T. et al. Nanopore direct RNA sequencing maps the complexity of Arabidopsis mRNA processing and m⁶A modification. eLife 9, e49658 (2020).
Liu, H. et al. Accurate detection of m⁶A RNA modifications in native RNA sequences. Nat. Commun. 10, 4079 (2019).
Weeks, K. M. Advances in RNA structure analysis by chemical probing. Curr. Opin. Struct. Biol. 20, 295–304 (2010).
Spitale, R. C. et al. RNA SHAPE analysis in living cells. Nat. Chem. Biol. 9, 18–20 (2013).
Sachsenmaier, N., Handl, S., Debeljak, F. & Waldsich, C. Mapping RNA structure in vitro using nucleobase-specific probes. Methods Mol. Biol. 1086, 79–94 (2014).
Guo, F., Gooding, A. R. & Cech, T. R. Structure of the Tetrahymena ribozyme: base triple sandwich and metal ion at the active site. Mol. Cell 16, 351–362 (2004).
Winkler, W., Nahvi, A. & Breaker, R. R. Thiamine derivatives bind messenger RNAs directly to regulate bacterial gene expression. Nature 419, 952–956 (2002).
Jambhekar, A. et al. Unbiased selection of localization elements reveals cis-acting determinants of mRNA bud localization in Saccharomyces cerevisiae. Proc. Natl Acad. Sci. USA 102, 18005–18010 (2005).
Sexton, A. N., Wang, P. Y., Rutenberg-Schoenberg, M. & Simon, M. D. Interpreting reverse transcriptase termination and mutation events for greater insight into the chemical probing of RNA. Biochemistry 56, 4713–4721 (2017).
Li, F. et al. Global analysis of RNA secondary structure in two metazoans. Cell. Rep. 1, 69–82 (2012).
Sun, L. et al. RNA structure maps across mammalian cellular compartments. Nat. Struct. Mol. Biol. 26, 322–330 (2019).
Wilbert, M. L. et al. LIN28 binds messenger RNAs at GGAGA motifs and regulates splicing factor abundance. Mol. Cell 48, 195–206 (2012).
Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
Pan, Q., Shai, O., Lee, L. J., Frey, B. J. & Blencowe, B. J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40, 1413–1415 (2008).
Moqtaderi, Z., Geisberg, J. V. & Struhl, K. Extensive structural differences of closely related 3′ mRNA isoforms: links to Pab1 binding and mRNA stability. Mol. Cell 72, 849–861 (2018).
Floor, S. N. & Doudna, J. A. Tunable protein synthesis by transcript isoforms in human cells. eLife 5, e10921 (2016).
Aw, J. G. et al. In vivo mapping of eukaryotic RNA interactomes reveals principles of higher-order organization and regulation. Mol. Cell 62, 603–617 (2016).
Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
Mustoe, A. M. et al. Pervasive regulatory functions of mRNA structure revealed by high-resolution SHAPE probing. Cell 173, 181–195 (2018).
Das, R., Laederach, A., Pearlman, S. M., Herschlag, D. & Altman, R. B. SAFA: semi-automated footprinting analysis software for high-throughput quantification of nucleic acid footprinting experiments. RNA 11, 344–354 (2005).
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
Sovic, I. et al. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat. Commun. 7, 11307 (2016).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
Shah, A., Qian, Y., Weyn-Vanhentenryck, S. M. & Zhang, C. CLIP Tool Kit (CTK): a flexible and robust pipeline to analyze CLIP sequencing data. Bioinformatics 33, 566–567 (2017).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).


Acknowledgements

We thank members of the Wan and Tan labs and F. Yao, H. M. Loh, C. C. Khor and M. Sikic for helpful discussions. Y.W. is supported by funding from A*STAR (A*STAR investigatorship 1630700155), the National Research Foundation Singapore (NRF2019-NRF-ISF003-2970 and CRP21-2018-0101), the EMBO Young Investigatorship and the CIFAR global scholarship. F.R.P.L. is supported by a doctoral scholarship from the Warwick-A*STAR research attachment programme.

Author information

These authors contributed equally: Jong Ghut Ashley Aw, Shaun W. Lim, Jia Xu Wang.

Authors and Affiliations

Stem Cell and Regenerative Biology, Genome Institute of Singapore, A*STAR, Singapore, Singapore
Jong Ghut Ashley Aw, Shaun W. Lim, Jia Xu Wang, Finnlay R. P. Lambert, Wen Ting Tan, Yu Zhang, Pornchai Kaewsapsak, Meng How Tan & Yue Wan
Division of Biomedical Sciences, Warwick Medical School, University of Warwick, Coventry, UK
Finnlay R. P. Lambert
Computational and Systems Biology, Genome Institute of Singapore, A*STAR, Singapore, Singapore
Yang Shen, Chenhao Li & Niranjan Nagarajan
Genome Technologies Platform, Genome Institute of Singapore, A*STAR, Singapore, Singapore
Sarah B. Ng
Skin Research Institute of Singapore, A*STAR, Immunos, Singapore
Leah A. Vardy
School of Chemical and Biomedical Engineering, Nanyang Technological University, Singapore, Singapore
Meng How Tan
Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
Niranjan Nagarajan & Yue Wan
School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
Yue Wan

Authors

Jong Ghut Ashley Aw
Shaun W. Lim
Jia Xu Wang
Finnlay R. P. Lambert
Wen Ting Tan
Yang Shen
Yu Zhang
Pornchai Kaewsapsak
Chenhao Li
Sarah B. Ng
Leah A. Vardy
Meng How Tan
Niranjan Nagarajan
Yue Wan

Contributions

Y.W. conceived the project. Y.W., N.N., M.H.T., B.S.N. and L.V. designed the experiments and analysis. S.W.L. and J.X.W. performed the experiments with help from P.K. J.G.A.A. and Y.S. performed the computational analysis with help from C.L. and E.P.K. Y.W. organized and wrote the paper with J.G.A.A., S.W.L. and all other authors.

Corresponding authors

Correspondence to Niranjan Nagarajan or Yue Wan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Chemical structures of RNA structure probing compounds, associated reaction products, mapped length and statistics of error rates.

a, Chemical structures of RNA structure probing compounds. Side chains for the carbodiimide of CMCT are highlighted and abbreviated as R’ and R’ for part (b). b, RNA nucleotide triphosphates with chemical adducts formed from reaction with structure probing compounds. Adducts are highlighted in green. c, Median lengths of mapped nanopore reads for unmodified and modified Tetrahymena RNA with different structure probing compounds. d, e, f, Boxplots showing the frequency of mismatch (d), deletion (e) and insertion (f) rates for different structure probing chemicals on Tetrahymena RNA, as compared to unmodified RNA. P-values were calculated using the two-sided Wilcoxon Rank Sum test. h–j, Boxplots showing the AUC-ROC performance of mismapping (h), deletion (i) and insertion (j) rates for the different compounds on the Tetrahymena RNA secondary structure. P-values were calculated using two-sided Wilcoxon rank-sum test. c-j, 6962-42107 reads from different libraries were used for comparisons (Supplementary Table 1). The middle, lower and upper boundary lines in the boxplot correspond to median, first and third quartiles. The upper whisker extends to the largest value no further than 1.5 × IQR from the hinge (where IQR is the inter-quartile range) and the lower whisker extends to the smallest value at most 1.5 × IQR of the hinge.


Extended Data Fig. 2 Distribution of mismatches, insertions, deletions along Tetrahymena RNA sequence.

Line plots of normalized number of mismatches (a), deletions (b) and insertions (c) caused by the different compounds and unmodified, along the length of the Tetrahymena RNA sequence. The red bars on top of the plots indicate the location of single-stranded bases in the secondary structure.

Extended Data Fig. 3 Error characterization of the modifications along the secondary structure of the Tetrahymena RNA.

a, Positions and intensity of mismatches (red), deletions (green) and insertions (purple) caused by the different chemical compounds are mapped along the secondary structure of the Tetrahymena RNA. b, Percentage of observed bases (upper number) and corresponding P-values were shown (lower number) for each observation. P-value was calculated using two-sided chi-square test for all modified versus unmodified comparisons.


Extended Data Fig. 4 Schematic of the bioinformatic workflow of PORE-cupine and characteristics of direct RNA sequencing signal.

a, Sequenced reads were basecalled using Albacore or Guppy, and mapped to the reference sequences using GraphMap. We used Nanopolish to align the raw signals and to extract current features, which were used to train the SVM on unmodified data. We then filtered for reads that are longer than 50% of annotated lengths and for transcripts that have at least 100 reads in the modified library and 200 reads in the unmodified library for downstream analysis. b, Normalized current dwell time for single-stranded regions on Tetrahymena RNA modified with NAI-N3. With footprinting gels as a guide, the top 10% of the single-stranded regions on Tetrahymena RNA were selected for these plots. c-e, Normalized current mean (c), standard deviation (d) and dwell time (e) distributions for all positions on unmodified Tetrahymena RNA and RNA modified with NAI-N3. f, Bioanalyzer traces of in vitro transcribed, full-length, unmodified, and NAI-N3 (100 mM for 5 mins or 25 mins) modified Tetrahymena RNA. g, Mapping rates for modified versus unmodified Tetrahymena RNA. The total number of sequenced reads for unmodified, NAI-N3 (5 mins), and NAI-N3 (25 mins) are 20149, 51760, and 22155 reads respectively. The percentage of mapped reads for unmodified, NAI-N3 (5 mins), and NAI-N3 (25 mins) are 75%, 81%, and 17% respectively. h, Density plots showing the distribution of lengths of sequenced unmodified and modified Tetrahymena RNA. Top: unmodified and NAI-N3 modified (100 mM, 5 min) RNA. Bottom: unmodified and NAI-N3 modified (100 mM, 25 min) RNA. i, Coverage of reads mapping to Tetrahymena RNA along its length, for unmodified (top), NAI-N3 modified (100 mM, 5 min) (middle) and extended NAI-N3 modified (100 mM, 25 min) RNA (bottom).

Extended Data Fig. 5 Optimization of PORE-cupine using 11 RNAs as training set.

a, Scatterplot showing the distribution of normalized base reactivity between N = 2 biological replicates of modified Tetrahymena RNA. R = 0.97, CI_95% = [0.97,0.98] (Pearson correlation). P-value = 2.5×10^-262, two-tailed Student’s T-test. b, Distribution of current mean and standard deviation for a unimodal (left) and bimodal (right) position in two biological replicates. c, AUC-ROC performance of the correlation of NAI-N3 reactivities of the training set based on PORE-cupine versus footprinting from 11 transcripts. d,e, Comparison of PORE-cupine reactivity and traditional footprinting. Two replicates of gels were shown for Tetrahymena RNA (d, R = 0.80) and lysine riboswitch (e, R = 0.74). Lane 1 of the footprinting gels show A (left, Tetrahymena) or G (right, Tetrahymena and lysine) ladder. Lane 2 shows unmodified RNA, and lane 3 shows NAI-N3 modified RNA. Quantification of the bands on the gels was done using SAFA. Pearson correlation was used to compare between SAFA and PORE-cupine signals. f, List of RNAs used for training and test. g, Scatter plot of per-base reactivity in two biological replicates of the three test RNAs. P-value = 0 using two-tailed Student’s T-test. R = 0.877, CI_95% = [0.87, 0.89], by Pearson correlation. h-j, Line plots showing the per-base reactivity along the length of three test RNAs, for two biological replicates. R ≥ 0.89, using Pearson correlation. k, Boxplot showing the performance of the SVM parameters on the 3 test RNAs, based on training on the Tetrahymena RNA (left) or on 11 RNAs (right, footprinting gels). l, AUC-ROC performance of SVM parameters on 3 test RNAs (red, based on our current 11 training RNAs) versus test RNAs after random selection of 11/14 RNAs as training, for 20 times. m, Boxplot showing the performance of all, unimodal and bimodal positions on test RNAs using AUC-ROC based on footprinting gels from 3 transcripts.
In c, k-m, the middle, lower and upper boundary lines in the boxplot correspond to median, first and third quartiles. The upper whisker extends to the largest value no further than 1.5 × IQR from the hinge (where IQR is the inter-quartile range) and the lower whisker extends to the smallest value at most 1.5 × IQR of the hinge. Outliers are shown as dots.
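The boxplot convention above can be made concrete with a short sketch. It assumes quartiles are computed by linear interpolation between data points (the `inclusive` method of Python's `statistics.quantiles`); the caption does not state which quantile definition was used, and the function name `boxplot_stats` is introduced only for illustration.

```python
import statistics


def boxplot_stats(values):
    """Compute box, whisker and outlier values under the 1.5 x IQR convention.

    Assumption: quartiles use linear interpolation (method="inclusive").
    """
    data = sorted(float(v) for v in values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points that lie within the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    # Points beyond the fences are the outliers drawn as dots.
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": outliers,
    }
```

Note that the whiskers stop at actual data values rather than at the fences themselves, which is why a whisker can be shorter than 1.5 × IQR.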

Source data

Extended Data Fig. 6 Comparison between PORE-cupine and footprinting signals.

a, Bioanalyzer traces of unmodified and in vivo NAI-N3 modified (100 mM, 5 min) total B. subtilis RNA. b, Secondary structure model of B. subtilis 16S rRNA. The structure-probed regions are boxed in pink, green and blue. c-e, Comparisons between PORE-cupine and footprinting. Two replicates of gels are shown for each of the three regions along B. subtilis RNA. The gels show G ladder (lane 1), unmodified RNA (lane 2) and NAI-N3 modified RNA (lane 3), with a correlation of R (Pearson) = 0.91 (c), 0.74 (d) and 0.24 (e) between the gels. Quantification of the bands on the gels was done using SAFA. Comparison between SAFA quantification and PORE-cupine for each of the regions is shown as a line plot to the right of the gels. R = 0.52 (c), 0.76 (d), 0.62 (e) by Pearson correlation. f,g, Bioanalyzer traces of unmodified and in vitro NAI-N3 modified RPS29 (100 mM, 5 min) (f) and Adocbl riboswitch (g). h,i, Comparisons between PORE-cupine and footprinting. Two replicates of gels are shown for RPS29 and Adocbl riboswitch RNA, R (Pearson) = 0.93 (h) and 0.73 (i). The gels show G ladder (lane 1), unmodified RNA (lane 2) and NAI-N3 modified RNA (lane 3). Quantification of the bands on the gels was done using SAFA. Comparison between SAFA quantification and PORE-cupine for each of the regions is shown to the right of the gels. R (Pearson) = 0.68 (h) and 0.69 (i).

Source data

Extended Data Fig. 7 PORE-cupine reactivity signals on TPP.

Line plots showing 2 replicates of PORE-cupine reactivities along TPP riboswitch in the presence of water (R = 0.86), 250 nM TPP (R = 0.87), 750 nM TPP (R = 0.83), and 10 μM TPP (R = 0.94). Pearson correlation is used to calculate the similarities between the reactivities of the replicates.
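The replicate similarity quoted in these captions is a Pearson correlation between two per-base reactivity profiles. A minimal sketch of that calculation is given below; the function name and the example values are hypothetical placeholders and are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the centred values.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-base reactivities for two replicates (placeholder values).
replicate_1 = [0.10, 0.40, 0.05, 0.80, 0.30]
replicate_2 = [0.12, 0.35, 0.07, 0.75, 0.33]
r = pearson_r(replicate_1, replicate_2)
```

Positions must be matched between replicates before the comparison, so any base missing a reactivity value in either replicate would need to be dropped from both profiles first.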

Source data

Extended Data Fig. 8 PORE-cupine results on the hESC transcriptome.

a,b, Bioanalyzer traces of unmodified and modified total (a) and polyA(+) selected (b) hESC RNA. c, Barplots showing the number of reads after basecalling (12007032 and 10118432) and mapping (86% and 60%) in unmodified and modified hESC samples, respectively. d, Histogram showing the distribution of reads with different amounts of modification in hESC. e, Boxplots showing the performance of reads with different amounts of modification, calculated using AUC-ROC on the test set of 3 RNAs, based on 10 footprinting regions. Reads were grouped into different classes: all reads (current, 147670 reads), with only 1 modification for the strand (only 1, 15461 reads), with 0-1% modification (68025 reads), with 1-2% modification (9771 reads), with 2-3% modification (803 reads) and with 3-4% modification (76 reads). P-value is calculated using two-sided Wilcoxon Rank Sum test. The middle, lower and upper boundaries of the boxplot correspond to the median, first and third quartiles, while the upper and lower whiskers extend from the hinge to the largest and smallest values no further than 1.5 × IQR from the hinge, respectively. f, Top: Fraction of modified reads mapped across exon-exon junctions (position 0) in hESC (black) and across artificial junctions with 50-base insertions (red). Bottom: Difference between mapping rates across normal versus artificial exon-exon junctions at each base; p-value was calculated using two-tailed Wilcoxon Rank Sum test. g, Line graph showing the percentage of bimodal positions observed (for both current standard deviation and mean) for all 1024 kmers. The orange line indicates the top 1% of bimodal signals across all kmers, and the identities of the corresponding kmers are labelled above. h-j, Base composition along positions 1, 2, 3, 4 and 5 of unimodal (left) and bimodal kmers (right). These include kmers that show bimodal current mean (h), bimodal current standard deviation (i), and both bimodal current mean and current standard deviation (j). k, Coverage of unmodified (left) and modified (right) reads along the hESC transcriptome using direct RNA sequencing.

Source data

Extended Data Fig. 9 Structural properties of the hESC transcriptome.

a, Scatterplot showing the distribution of total reads per position for each hESC transcript, for N = 2 biologically independent replicates, of unmodified (left, p-value = 0, using two-tailed Student’s t-test, CI_95% = [0.98, 0.98], 1613 transcripts) and modified transcripts (right, 1751 transcripts, p-value = 0 using two-tailed Student’s t-test, CI_95% = [0.97, 0.97]). R (Pearson) = 0.98 (left) and 0.97 (right). b, Barplot showing the number of transcripts left after abundance and length filter. The number of transcripts in each group is shown above the plot. c, Boxplots showing the distribution of median mapped lengths of unmodified (left) and NAI-N3 modified (middle) hESC mRNAs (1751 transcripts). Annotated refers to the distribution of expected lengths for each transcript based on ENSEMBL GRCh38 annotation (right). d, Histogram showing the distribution of transcripts having different fractions of annotated length in unmodified and modified samples. e, Distribution of Pearson correlations between full-length (>99% of known length) and partial transcripts in hESC (83 transcripts from N = 2 biological replicates were used). The Y-axis shows the fraction of transcripts with a particular correlation. The X-axis depicts Pearson correlation coefficients. f, Boxplot showing PORE-cupine reactivity of different classes of transcripts. P-values were calculated using two-sided Wilcoxon Rank Sum test. 1584 coding genes, 67 pseudogenes, 81 non-coding genes and 4 rRNAs were used. g, Top, Metagene analysis of PORE-cupine-derived mean reactivities aligned according to start (Upper) and stop (Lower) codons for all 559 transcripts. Bottom, Metagene autocorrelation function (ACF) plot for the 5’ UTR, CDS and 3’ UTR. In c and f, the middle, lower and upper boundaries of the boxplot correspond to the median, first and third quartiles. The upper and lower whiskers extend from the hinge to the largest and smallest values no further than 1.5 × IQR from the hinge. Outliers are shown as dots.

Source data

Extended Data Fig. 10 RNA structures in gene-linked isoforms.

a, Upper, Transcript organization of different RPLP0 isoforms. Alternative exons seen in our structural data are highlighted in red. Lower, Normalized reactivity profiles for the different isoforms and their aggregate signal. b, Upper, Transcript organization of different RACK1 isoforms. The alternative exon is shown in red (also in inset). Lower, Line plots for the aggregate reactivity signal between the two isoforms are shown (Top). Middle, Line plots showing the expanded view of the reactivity difference between the isoforms. Bottom, Line plots showing the individual reactivity information for each isoform along its length. c, Number of transcripts with two structure-changing regions that are more than 100, 200, 300, 400 or 500 bases apart. The value for each group is shown above the bar. d, Schematic of the TrIP-seq workflow. e, Line plot showing the absorbance (A260) of each fraction (2-12) after polysome fractionation. f, Pair-wise correlations (Spearman correlation) of the read-counts/transcript for each fraction between two biological replicates. Fractions and batches are denoted as F2-12 and B1-2, respectively. g, Distribution of read-counts across different polysome fractions for two biological replicates of Actin B (left) and Activating transcription factor 4 (right).

Source data

Cite this article

Aw, J.G.A., Lim, S.W., Wang, J.X. et al. Determination of isoform-specific RNA structure with nanopore long reads. Nat Biotechnol 39, 336–346 (2021). https://doi.org/10.1038/s41587-020-0712-z

Received: 28 May 2019
Accepted: 18 September 2020
Published: 26 October 2020
Issue date: March 2021
DOI: https://doi.org/10.1038/s41587-020-0712-z
