Abstract
Accurate assembly of rRNA-depleted (total) RNA sequencing (RNA-seq) remains challenging because existing methods often conflate incomplete, nascent RNA with fully processed mature isoforms, leading to misassemblies and quantification errors. Here, we present StringTie3, a major update to the widely used StringTie assembler, specifically designed for total RNA-seq. StringTie3 introduces a nascent mode that models co-transcriptional splicing to separate nascent from mature transcripts, and a refined long-read module that distinguishes genuine polyadenylation sites from poly(A)-priming artifacts. Across short-, long- and hybrid-read datasets, StringTie3 substantially reduces assembly errors and outperforms existing tools. In Argonaute knockout experiments, nascent-mode analysis reveals that single knockouts predominantly alter nascent transcripts while leaving mature RNA largely unchanged, whereas double or triple knockouts disrupt both fractions. In breast cancer samples, certain extracellular matrix and tumor suppressor genes show discordant nascent and mature expression, suggesting posttranscriptional regulation. StringTie3 provides a framework for investigating transcriptional and posttranscriptional processes in total RNA-seq data.
Data availability
The RNA-seq data for DLPFC poly(A) and DLPFC RiboZ libraries are available through the Lieber Institute for Brain Development at http://eqtl.brainseq.org/phase2/ and http://eqtl.brainseq.org/phase1/, respectively. The neuron differentiation long- and short-read RNA-seq data are accessible in the Gene Expression Omnibus (GEO) under accession number GSE245325, and the breast cancer RNA-seq dataset is available in the GEO under accession number GSE103001.
The ENCODE Consortium datasets used in the LRGASP challenges for human WTC‑11 and H1‑mix samples across ONT dRNA, PacBio cDNA and Illumina cDNA are available from ENCODE under accessions ENCSR392BGY, ENCSR673UKZ, ENCSR507JOF, ENCSR967FTZ, ENCSR154RVC and ENCSR731MFY. LRGASP sample and file accessions used in this study (including ENCSR and ENCFF identifiers for WTC‑11 and H1‑mix) are provided in Supplementary Data 13. Illumina short‑read files from these datasets were used only to derive the splice‑junction BED file supplied to minimap2. The Argonaute dataset is available under accession GSE146688. VASA-seq data are available under accession GSE176588; our analyses used only the Mus musculus runs (n = 45).
Code availability
StringTie3 is implemented in C++ and is freely available as open-source software under the MIT license at https://github.com/gpertea/stringtie/; it is also archived at Zenodo (https://doi.org/10.5281/zenodo.17604767; ref. 52). Additional instructions for running nascent mode, including parameters and command-line flags, can be found in the project documentation.
References
Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat. Rev. Genet. 20, 631–656 (2019).
Yao, L. et al. A comparison of experimental assays and analytical methods for genome-wide identification of active enhancers. Nat. Biotechnol. 40, 1056–1065 (2022).
Chu, T. et al. Chromatin run-on and sequencing maps the transcriptional regulatory landscape of glioblastoma multiforme. Nat. Genet. 50, 1553–1564 (2018).
Gaidatzis, D., Burger, L., Florescu, M. & Stadler, M. B. Analysis of intronic and exonic reads in RNA-seq data characterizes transcriptional and post-transcriptional regulation. Nat. Biotechnol. 33, 722–729 (2015).
Ameur, A. et al. Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain. Nat. Struct. Mol. Biol. 18, 1435–1440 (2011).
Zhao, S., Zhang, Y., Gamini, R., Zhang, B. & von Schack, D. Evaluation of two main RNA-seq approaches for gene quantification in clinical RNA sequencing: polyA+ selection versus rRNA depletion. Sci. Rep. 8, 4781–4812 (2018).
Adiconis, X. et al. Comparative analysis of RNA sequencing methods for degraded or low-input samples. Nat. Methods 10, 623–629 (2013).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
Choquet, K. et al. Pre-mRNA splicing order is predetermined and maintains splicing fidelity across multi-intronic transcripts. Nat. Struct. Mol. Biol. 30, 1064–1076 (2023).
Svoboda, M., Frost, H. R. & Bosco, G. Internal oligo(dT) priming introduces systematic bias in bulk and single-cell RNA sequencing count data. NAR Genom. Bioinform. 4, lqac035 (2022).
Viscardi, M. J. & Arribere, J. A. Poly(a) selection introduces bias and undue noise in direct RNA-sequencing. BMC Genomics 23, 530 (2022).
Shumate, A., Wong, B., Pertea, G. & Pertea, M. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLoS Comput. Biol. 18, e1009730 (2022).
Zhang, Q., Shi, Q. & Shao, M. Accurate assembly of multi-end RNA-seq data with Scallop2. Nat. Comput. Sci. 2, 148–152 (2022).
Collado-Torres, L. et al. Regional heterogeneity in gene expression, regulation, and coherence in the frontal cortex and hippocampus across development and schizophrenia. Neuron 103, 203–216 (2019).
Jaffe, A. E. et al. Developmental and genetic regulation of the human cortex transcriptome illuminate schizophrenia pathogenesis. Nat. Neurosci. 21, 1117–1125 (2018).
Wenric, S. et al. Transcriptome-wide analysis of natural antisense transcripts shows their potential role in breast cancer. Sci. Rep. 7, 17452 (2017).
Ulicevic, J. et al. Uncovering the dynamics and consequences of RNA isoform changes during neuronal differentiation. Mol. Syst. Biol. 20, 767–798 (2024).
Varabyou, A. et al. CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure. Genome Biol. 24, 249–316 (2023).
Nido, G. S. et al. Common gene expression signatures in Parkinson’s disease are driven by changes in cell composition. Acta Neuropathol. Commun. 8, 55 (2020).
Pardo-Palacios, F. J. et al. Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. Nat. Methods 21, 1349–1363 (2024).
Prjibelski, A. D. et al. Accurate isoform discovery with IsoQuant using long reads. Nat. Biotechnol. 41, 915–918 (2023).
Chen, Y. et al. Context-aware transcript quantification from long-read RNA-seq data with Bambu. Nat. Methods 20, 1187–1195 (2023).
Chu, Y. et al. Argonaute binding within 3′-untranslated regions poorly predicts gene repression. Nucleic Acids Res. 48, 7439–7453 (2020).
Cernilogar, F. M. et al. Chromatin-associated RNAi components contribute to transcriptional regulation in Drosophila. Nature 480, 391–395 (2011).
Zaytseva, O. et al. Transcriptional repression of Myc underlies the tumour suppressor function of AGO1 in Drosophila. Development 147, dev190231 (2020).
Huang, V. et al. Ago1 interacts with RNA polymerase II and binds to the promoters of actively transcribed genes in human cancer cells. PLoS Genet. 9, e1003821 (2013).
Mayr, C., Hemann, M. T. & Bartel, D. P. Disrupting the pairing between let-7 and Hmga2 enhances oncogenic transformation. Science 315, 1576–1579 (2007).
Dang, C. MYC on the path to cancer. Cell 149, 22–35 (2012).
Matsui, M. et al. Activation of LDL receptor expression by small RNAs complementary to a noncoding transcript that overlaps the LDLR promoter. Chem. Biol. 17, 1344–1355 (2010).
Wagschal, A. et al. Genome-wide identification of microRNAs regulating cholesterol and triglyceride homeostasis. Nat. Med. 21, 1290–1297 (2015).
Su, H., Trombly, M. I., Chen, J. & Wang, X. Essential and overlapping functions for mammalian Argonautes in microRNA silencing. Genes Dev. 23, 304–317 (2009).
De Martino, D. & Bravo-Cordero, J. J. Collagens in cancer: structural regulators and guardians of cancer progression. Cancer Res. 83, 1386–1392 (2023).
Chen, X. et al. COL5A1 promotes triple-negative breast cancer progression by activating tumor cell-macrophage crosstalk. Oncogene 43, 1742–1756 (2024).
Shi, Y. et al. Reduced expression of METTL3 promotes metastasis of triple-negative breast cancer by m6A methylation-mediated COL3A1 up-regulation. Front. Oncol. 10, 1126 (2020).
Kwon, J. J., Factora, T. D., Dey, S. & Kota, J. A systematic review of miR-29 in cancer. Mol. Ther. Oncolytics 12, 173–194 (2019).
Zhu, J. et al. Chaperone Hsp47 drives malignant growth and invasion by modulating an ECM gene network. Cancer Res. 75, 1580–1591 (2015).
Wang, Y., Mei, X., Song, W., Wang, C. & Qiu, X. LncRNA LINC00511 promotes COL1A1-mediated proliferation and metastasis by sponging miR-126-5p/miR-218-5p in lung adenocarcinoma. BMC Pulm. Med. 22, 272 (2022).
Wang, Y. et al. MiR-410 is overexpressed in liver and colorectal tumors and enhances tumor cell growth by silencing FHL1 via a direct/indirect mechanism. PLoS ONE 9, e108708 (2014).
Wang, J. et al. lncRNA ZNRD1-AS1 promotes malignant lung cell proliferation, migration, and angiogenesis via the miR-942/TNS1 axis and is positively regulated by the m6A reader YTHDC2. Mol. Cancer 21, 229 (2022).
Pool, A., Poldsam, H., Chen, S., Thomson, M. & Oka, Y. Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references. Nat. Methods 20, 1506–1515 (2023).
Salmen, F. et al. High-throughput total RNA sequencing in single cells using VASA-seq. Nat. Biotechnol. 40, 1780–1793 (2022).
Morales, J. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604, 310–315 (2022).
Shinder, I. & Pertea, M. Filtered CHESS 3.0.1 human transcript annotation for StringTie3 benchmarking. Zenodo https://doi.org/10.5281/zenodo.18223655 (2026).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
Shinder, I., Hu, R., Ji, H. J., Chao, K. & Pertea, M. EASTR: identifying and eliminating systematic alignment errors in multi-exon genes. Nat. Commun. 14, 7223 (2023).
Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res. 9, 304 (2020).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Pertea, G., Shinder, I. & Pertea, M. StringTie 3.0.3. Zenodo https://doi.org/10.5281/zenodo.17604767 (2025).
Acknowledgements
This work was supported in part by National Science Foundation grant DBI-2412449 (to M.P.) and National Institutes of Health grants R01-MH123567 (to M.P.) and R35-GM156470 (to M.P.).
Author information
Contributions
I.S. conceived the study, designed and contributed to the implementation of the software, conducted computational analyses, analyzed and interpreted the results, and wrote the manuscript. Z.R. and R.H. aided in the analysis of results. G.P. assisted with software development. M.P. conceived the study, contributed to software, assisted in writing and editing the manuscript, and supervised the entire project. All authors reviewed and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Adam Ameur, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Lei Tang and Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Annotation‑guided long‑read assembly performance across three modalities.
Per‑replicate scatter plots of precision (%) (x‑axis) versus the number of reference transcripts correctly reconstructed by each assembler (y‑axis; ×10³) for assemblies run in annotation‑guided mode against the CHESS 3.0.1 reference. Rows show (top to bottom): neuron‑differentiation ONT cDNA (days 0/3/5; three biological replicates each), LRGASP ONT dRNA (H1‑mix and WTC11; three replicates each), and LRGASP PacBio cDNA (H1‑mix and WTC11; three replicates each). Colors denote methods (StringTie2, StringTie3, StringTie3 nascent mode, IsoQuant, Bambu); shapes indicate replicate identity (see panel legends). “X” marks the per‑method mean across replicates.
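The two axes and the per-method mean marker described in this caption can be sketched in a few lines of Python. This is a minimal illustration, assuming precision is the percentage of assembled transcripts that match a reference transcript; the function names and the input layout are hypothetical and are not taken from the StringTie3 codebase or the paper's analysis scripts.

```python
from collections import defaultdict
from statistics import mean


def precision_pct(n_matching: int, n_assembled: int) -> float:
    """Percentage of assembled transcripts that match a reference transcript."""
    if n_assembled == 0:
        return 0.0
    return 100.0 * n_matching / n_assembled


def per_method_mean(points):
    """Average the per-replicate points of each method.

    points: iterable of (method, precision_pct, n_correct), one per replicate.
    Returns {method: (mean precision, mean number of correct transcripts)},
    i.e. the position an "X" marker would occupy for that method.
    """
    grouped = defaultdict(list)
    for method, prec, n_correct in points:
        grouped[method].append((prec, n_correct))
    return {
        method: (mean(p for p, _ in vals), mean(n for _, n in vals))
        for method, vals in grouped.items()
    }
```

Each replicate contributes one (precision, count) point, and the mean is taken separately on each axis.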
Extended Data Fig. 2 SIRV-only assembly accuracy.
Scatter plots show SIRV-only precision (x-axis) versus the number of correctly assembled SIRV transcripts (true positives; y-axis) in annotation-free (left column) and annotation-guided (right column) modes. Each point represents one replicate. In the guided panels, CHESS was provided to each method but SIRV models were withheld to test generalization to novel transcript discovery beyond the guidance set; metrics in all four panels are computed only on SIRVs.
Extended Data Fig. 3 Transcript length profiles in assemblies from short-read, long-read, and hybrid data of matched samples.
A. Reference transcripts (CHESS 3.0.1) were binned by exonic length (x-axis). For each modality (short‑read rRNA‑depleted, poly(A)‑selected long‑read, and hybrid, which pairs one short‑read and one long‑read library from the same day), the y‑axis shows the fraction of expressed references in each bin that were exactly reconstructed. A transcript was considered expressed if it had coverage ≥ 1 read per bp in any modality, and exactly reconstructed if its intron chain was identical to that of the reference and its start and end coordinates were within ±100 bp of the reference start and end. Curves plot the weighted average across assemblies, with weights proportional to the number of expressed references in each bin. Shaded areas indicate 95% confidence intervals across assemblies. Numbers above the x‑axis represent the median number of expressed references per bin across assemblies. Results aggregate 14 short‑read assemblies, 9 long‑read assemblies, and 42 hybrid assemblies from the iPSC‑to‑neuron series. B. Longest reference reconstructed per assembly (kb). Each point represents one assembly. Box plots display the median (center line), 25th and 75th percentiles (box bounds), and whiskers extending to the most extreme data points within 1.5× the interquartile range; points beyond whiskers represent outliers. The same 14 short‑read, 9 long‑read, and 42 hybrid assemblies described in panel A were analyzed.
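The criteria stated in panel A (expressed, exactly reconstructed, and the weighted average across assemblies) can be written out as a short Python sketch. The class and function names are hypothetical, transcripts are assumed to have already been matched on chromosome and strand, and the weighting reflects one reading of the caption: each assembly's fraction is weighted by its number of expressed references in the bin, which reduces to total reconstructed over total expressed.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Transcript:
    start: int
    end: int
    # Intron chain as (intron_start, intron_end) pairs in genomic order.
    introns: Tuple[Tuple[int, int], ...]


def is_expressed(coverage_by_modality: Dict[str, float], min_cov: float = 1.0) -> bool:
    """Expressed if coverage is >= 1 read per bp in any modality."""
    return any(cov >= min_cov for cov in coverage_by_modality.values())


def exactly_reconstructed(asm: Transcript, ref: Transcript, tol: int = 100) -> bool:
    """Identical intron chain, and both ends within +/- tol bp of the reference."""
    return (
        asm.introns == ref.introns
        and abs(asm.start - ref.start) <= tol
        and abs(asm.end - ref.end) <= tol
    )


def weighted_fraction(per_assembly: List[Tuple[int, int]]) -> float:
    """Weighted mean of per-assembly fractions for one length bin.

    per_assembly: (n_reconstructed, n_expressed) for each assembly.
    With weights proportional to n_expressed, the weighted mean of
    n_reconstructed / n_expressed equals sum(reconstructed) / sum(expressed).
    """
    total_expressed = sum(n for _, n in per_assembly)
    if total_expressed == 0:
        return 0.0
    return sum(k for k, _ in per_assembly) / total_expressed
```

Under this reading, assemblies with few expressed references in a bin contribute proportionally less to the plotted curve.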
Extended Data Fig. 4 StringTie3’s nascent mode algorithm.
A. Candidate transcript selection and quantification in a gene locus. Read coverage is shown at the top, with regions corresponding to the candidate transcript highlighted in orange. Arcs below the coverage plot represent splice junctions supported by spliced reads; thicker arcs indicate stronger read support. The splice graph, with the candidate transcript (that is, the heaviest path) highlighted in orange, is shown below. Arrows beneath the coverage plot indicate the genomic regions corresponding to the nodes in the splice graph. Dashed nodes and edges (for example, nodes 1_2 and 5_6) represent intronic nodes that, due to sparse coverage, are not included in the splice graph when using the non-nascent mode. A flow network (highlighted in green) is then constructed using all nodes from the heaviest path, with edges connecting two nodes if a transfrag starts at one and ends at the other. B. Nascent transcript quantification. Splice graphs and flow networks corresponding to the two nascent transcripts derived from the transcript in panel A are shown. The paths of the nascent transcripts are highlighted in orange in the splice graphs.
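The two steps named in panel A, picking the heaviest path through the splice graph and then connecting its nodes wherever a transfrag starts at one and ends at the other, can be illustrated with a simplified Python sketch. This is not StringTie3's implementation. It assumes nodes are integers in genomic order, defines the heaviest path as the one with the largest summed node coverage, and uses summed transfrag abundance as the edge capacity. The actual scoring and flow computation may differ.

```python
from collections import defaultdict


def heaviest_path(node_cov, edges):
    """Return the path with the largest summed node coverage.

    node_cov: {node_id: coverage}; node ids are ints in genomic order,
              so sorting them gives a topological order (assumption).
    edges:    iterable of (u, v) pairs with u < v.
    """
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)

    nodes = sorted(node_cov)
    best = {n: node_cov[n] for n in nodes}  # best path weight ending at n
    prev = {n: None for n in nodes}         # predecessor on that path
    for u in nodes:                         # predecessors are final before u is used
        for v in succ[u]:
            if best[u] + node_cov[v] > best[v]:
                best[v] = best[u] + node_cov[v]
                prev[v] = u

    node = max(nodes, key=lambda n: best[n])
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def flow_network_edges(path, transfrags):
    """Connect two path nodes if a transfrag starts at one and ends at the other.

    transfrags: iterable of (first_node, last_node, abundance).
    Returns {(first_node, last_node): summed abundance} for transfrags
    whose first and last nodes both lie on the path.
    """
    on_path = set(path)
    capacity = defaultdict(float)
    for first, last, abundance in transfrags:
        if first in on_path and last in on_path and first != last:
            capacity[(first, last)] += abundance
    return dict(capacity)
```

For a toy graph with coverages {1: 5, 2: 1, 3: 8, 4: 6} and edges 1→2, 1→3, 2→4, 3→4, the heaviest path is 1, 3, 4, and only transfrags that begin and end on those nodes contribute edges to the flow network.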
Supplementary information
Supplementary Information
Supplementary Tables 1–3, Supplementary Figs. 1–14 and Supplementary Notes 1–4.
Supplementary Data
Excel workbook containing 13 tabs: (1–2) Per-sample assembly precision and sensitivity for rRNA-depleted and poly(A)+ DLPFC libraries; (3) Transcript overlap between library types; (4) Long-read alignment statistics; (5–8) Differential expression results for AGO1, AGO2, AGO1/AGO2 and AGO1/AGO2/AGO3 knockouts; (9) Breast cancer tumor versus normal differential expression; (10–12) Runtime and memory benchmarks; (13) LRGASP sample and file accessions.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shinder, I., Pertea, G., Hu, R. et al. StringTie3 improves total RNA-seq assembly by resolving nascent and mature transcripts. Nat Methods (2026). https://doi.org/10.1038/s41592-026-03080-3
DOI: https://doi.org/10.1038/s41592-026-03080-3