Fig. 4: Long-read RNA-seq data improves read-to-transcript assignment and transcript abundance estimation compared to short-read RNA-seq data.

a, Scatterplots of log2-transformed CPM values obtained from long-read direct cDNA and PCR cDNA, and short-read RNA-seq, compared with expected log2-transformed CPMs for transcripts of four different spike-in RNAs. Light blue points represent Sequin Mix A version 1 and SIRV E2; dark blue points represent Sequin Mix A version 2 and SIRV E0 + long SIRV RNAs. b, Box plots showing the median, upper and lower quartiles, and 1.5 times the interquartile ranges of the Spearman correlation coefficient for mean log2-transformed CPM estimates for dominant-status-categorized protein-coding gene isoforms between different RNA-seq protocols for each cell line (n = 7). Dark blue indicates comparison between long-read RNA-seq protocols; light blue indicates comparison between long-read and short-read protocols. c, Scatterplot of log2-transformed CPM for dominant-status-categorized protein-coding gene isoforms obtained from long-read direct cDNA RNA-seq compared with those obtained from short-read RNA-seq in the A549 cell line. d, Fraction of alternative events identified when comparing major isoforms only in long-read (long-read-specific major isoform) and major isoforms only in short-read RNA-seq (short-read-specific major isoform). Background simulation distribution with mean ± s.d. represented by a point with an error bar (n = 20). e–g, Box plots showing the median, upper and lower quartiles, and 1.5 times the interquartile ranges of the fraction of dominant-status-categorized protein-coding gene isoforms expressed with at least 1 CPM (e), the number of junctions covered per read (f) and the number of transcripts uniquely assigned per read for all experiments categorized by five RNA-seq protocols (g; n = 55, 30, 27, 6 and 21, for direct RNA, direct cDNA, cDNA, PacBio and Illumina, respectively).
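
The two quantities used throughout the panels, log2-transformed CPM and the Spearman correlation coefficient, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names are hypothetical, the pseudocount of 1 added before the log transform is an assumption (the caption does not state one), and the read counts in the usage note are toy values, not data from the study.

```python
import math


def cpm(counts):
    """Scale raw read counts to counts per million (CPM)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("library has no reads")
    return [c * 1_000_000 / total for c in counts]


def log2_cpm(counts, pseudocount=1.0):
    """log2(CPM + pseudocount); the pseudocount is an assumption of this sketch."""
    return [math.log2(v + pseudocount) for v in cpm(counts)]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, with toy per-transcript counts from two protocols, `spearman(log2_cpm([120, 30, 0, 850]), log2_cpm([95, 40, 2, 900]))` compares their abundance estimates. Because the log transform is monotonic, the Spearman coefficient is the same whether it is computed on CPM or on log2-transformed CPM.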