Fig. 1: Design of a random oligonucleotide-based RNA modification (RM) training dataset with diverse sequence contexts and distinguishable RM signals. | Nature Communications

From: Comprehensive discovery of m6A sites in the human transcriptome at single-molecule resolution

a Overview of DeepRM (top) versus previous approaches (bottom). In the DeepRM dataset, signals of individual modifications are isolated (see panel b), whereas in most previous datasets, modification signals overlap because all adenosines are m6A-modified, which reduces prediction accuracy. Because nanopore signals are strongly affected by flanking sequences116, the training dataset must include diverse local sequence contexts for sensitive RM detection. The DeepRM dataset encompasses all possible 11-mer sequence contexts (Fig. 2d), while the previous datasets contain limited and short sequence contexts (Fig. 2e). DeepRM uses raw electric current as its input feature and employs a large Transformer architecture to capture the full information from nanopore sequencing. In contrast, previous models use a few statistical features and shallow architectures. Consequently, DeepRM achieves unprecedentedly high accuracy in RM detection. The precision and recall values shown were calculated from the precision-recall curves of DeepRM (top) and m6Anet (bottom) in Fig. 4c, using the F1-maximizing threshold. b, c Effects of a single RM on surrounding nucleotides. Mean electric currents (b) and base quality (c) are plotted for 20 flanking random unmodified nucleotides centered on A or m6A (gray or red hexagons). Central 5-mer sequences were sampled in equal numbers. n = 1,228,800 for both A and m6A. d Design of 87-nt building blocks (BBs) for large-scale dataset generation. Each BB contains three 21-nt local sequence context blocks (LCBs) in green and four 6-nt spacers in light blue. The base of interest is A (gray) or m6A (red). e Sequence of synthetic 49-nt RNA oligonucleotides and digested fragments for liquid chromatography-tandem mass spectrometry (LC-MS/MS). f MS chromatogram (left) and spectra (right) of a 13-nt fragment containing the designated site, showing different retention times between A (gray) and m6A (red) oligonucleotides.
Peak areas for each fragment were integrated to measure A or m6A purity. MS1 spectra of triply charged ions ([M−3H]3−) display measured and theoretical mass-to-charge ratio (m/z) values, with the mass difference (△m) between A and m6A oligonucleotides indicated.
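
The building-block arithmetic in panel d (three 21-nt LCBs plus four 6-nt spacers giving 87 nt) can be checked with a minimal Python sketch. The lengths and counts come from the caption. The alternating spacer/LCB order, the placement of the base of interest at the center of each LCB, the placeholder spacer sequence, and all function names are assumptions made for illustration, not the authors' actual design.

```python
import random

# Lengths and counts stated in the caption (panel d).
LCB_LEN = 21     # local sequence context block
SPACER_LEN = 6   # spacer
N_LCB = 3
N_SPACER = 4

# Assumption: the base of interest sits at the center of each LCB,
# leaving 10 random nucleotides on each side (20 flanking in total).
FLANK = (LCB_LEN - 1) // 2


def make_lcb(rng, center="A"):
    """Return one LCB: random flanks around the base of interest."""
    left = "".join(rng.choice("ACGU") for _ in range(FLANK))
    right = "".join(rng.choice("ACGU") for _ in range(FLANK))
    return left + center + right


def make_building_block(rng, spacers, center="A"):
    """Assemble a BB, assuming the order spacer, LCB, spacer, LCB, spacer, LCB, spacer."""
    if len(spacers) != N_SPACER or any(len(s) != SPACER_LEN for s in spacers):
        raise ValueError("expected four 6-nt spacers")
    parts = []
    for i in range(N_LCB):
        parts.append(spacers[i])
        parts.append(make_lcb(rng, center))
    parts.append(spacers[-1])
    return "".join(parts)


def center_positions():
    """0-based indices of the bases of interest under the assumed layout."""
    return [SPACER_LEN * (i + 1) + LCB_LEN * i + FLANK for i in range(N_LCB)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical placeholder spacers; the real spacer sequences are not given here.
    bb = make_building_block(rng, ["GGGCCC"] * N_SPACER)
    print(len(bb), center_positions())
```

In this sketch the modification state (A versus m6A) is not encoded in the sequence string. It would have to be tracked separately, for example as a label attached to each of the three center positions.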
