StringTie3 improves total RNA-seq assembly by resolving nascent and mature transcripts

Shinder, Ida; Pertea, Geo; Hu, Richard; Rudnick, Zoe; Pertea, Mihaela

doi:10.1038/s41592-026-03080-3

Article
Published: 19 May 2026

Nature Methods (2026)

Abstract

Accurate assembly of rRNA-depleted (total) RNA sequencing (RNA-seq) remains challenging because existing methods often conflate incomplete, nascent RNA with fully processed mature isoforms, leading to misassemblies and quantification errors. Here, we present StringTie3, a major update to the widely used StringTie assembler, specifically designed for total RNA-seq. StringTie3 introduces a nascent mode that models co-transcriptional splicing to separate nascent from mature transcripts, and a refined long-read module that distinguishes genuine polyadenylation sites from poly(A)-priming artifacts. Across short-, long- and hybrid-read datasets, StringTie3 substantially reduces assembly errors and outperforms existing tools. In Argonaute knockout experiments, nascent-mode analysis reveals that single knockouts predominantly alter nascent transcripts while leaving mature RNA largely unchanged, whereas double or triple knockouts disrupt both fractions. In breast cancer samples, certain extracellular matrix and tumor suppressor genes show discordant nascent and mature expression, suggesting posttranscriptional regulation. StringTie3 provides a framework for investigating transcriptional and posttranscriptional processes in total RNA-seq data.

**Fig. 1: Comparison of poly(A)+ selection and rRNA depletion reveals misassemblies arising from nascent transcripts.**

**Fig. 2: Modeling partially spliced nascent transcripts and mature isoforms within StringTie3’s splicing graph.**

**Fig. 3: Comparison of StringTie3 (nascent mode), StringTie2 and Scallop2 in rRNA-depleted RNA-seq evaluated against the CHESS 3.0 annotation.**

**Fig. 4: Precision, sensitivity and coverage differences for StringTie2 versus StringTie3 (nascent mode) in poly(A)+ and rRNA-depleted libraries.**

**Fig. 5: Long-read assembly performance of StringTie3, StringTie2, IsoQuant and Bambu in annotation-free mode.**

**Fig. 6: Argonaute knockouts reveal transcriptional versus posttranscriptional regulation.**

Data availability

The RNA-seq data for the DLPFC poly(A) and DLPFC RiboZ libraries are available through the Lieber Institute for Brain Development at http://eqtl.brainseq.org/phase2/ and http://eqtl.brainseq.org/phase1/, respectively. The neuron differentiation long- and short-read RNA-seq data are accessible in the Gene Expression Omnibus (GEO) under accession number GSE245325, and the breast cancer RNA-seq dataset is available in the GEO under accession number GSE103001.

The ENCODE Consortium datasets used in the LRGASP challenges for human WTC‑11 and H1‑mix samples across ONT dRNA, PacBio cDNA and Illumina cDNA are available from ENCODE under accessions ENCSR392BGY, ENCSR673UKZ, ENCSR507JOF, ENCSR967FTZ, ENCSR154RVC and ENCSR731MFY. LRGASP sample and file accessions used in this study (including ENCSR and ENCFF identifiers for WTC‑11 and H1‑mix) are provided in Supplementary Data 13. Illumina short‑read files from these datasets were used only to derive the splice‑junction BED file supplied to minimap2. The Argonaute dataset is available under accession GSE146688. VASA-seq data are available under accession GSE176588; our analyses used only the Mus musculus runs (n = 45).

Code availability

StringTie3 is implemented in C++ and is freely available as open-source software under the MIT license at https://github.com/gpertea/stringtie/ and is archived at Zenodo (https://doi.org/10.5281/zenodo.17604767)⁵². Additional instructions for running nascent mode, including parameters and command-line flags, can be found in the project documentation.

References

Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat. Rev. Genet. 20, 631–656 (2019).
Yao, L. et al. A comparison of experimental assays and analytical methods for genome-wide identification of active enhancers. Nat. Biotechnol. 40, 1056–1065 (2022).
Chu, T. et al. Chromatin run-on and sequencing maps the transcriptional regulatory landscape of glioblastoma multiforme. Nat. Genet. 50, 1553–1564 (2018).
Gaidatzis, D., Burger, L., Florescu, M. & Stadler, M. B. Analysis of intronic and exonic reads in RNA-seq data characterizes transcriptional and post-transcriptional regulation. Nat. Biotechnol. 33, 722–729 (2015).
Ameur, A. et al. Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain. Nat. Struct. Mol. Biol. 18, 1435–1440 (2011).
Zhao, S., Zhang, Y., Gamini, R., Zhang, B. & von Schack, D. Evaluation of two main RNA-seq approaches for gene quantification in clinical RNA sequencing: polyA+ selection versus rRNA depletion. Sci. Rep. 8, 4781–4812 (2018).
Adiconis, X. et al. Comparative analysis of RNA sequencing methods for degraded or low-input samples. Nat. Methods 10, 623–629 (2013).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
Choquet, K. et al. Pre-mRNA splicing order is predetermined and maintains splicing fidelity across multi-intronic transcripts. Nat. Struct. Mol. Biol. 30, 1064–1076 (2023).
Svoboda, M., Frost, H. R. & Bosco, G. Internal oligo(dT) priming introduces systematic bias in bulk and single-cell RNA sequencing count data. NAR Genom. Bioinform. 4, lqac035 (2022).
Viscardi, M. J. & Arribere, J. A. Poly(a) selection introduces bias and undue noise in direct RNA-sequencing. BMC Genomics 23, 530 (2022).
Shumate, A., Wong, B., Pertea, G. & Pertea, M. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLoS Comput. Biol. 18, e1009730 (2022).
Zhang, Q., Shi, Q. & Shao, M. Accurate assembly of multi-end RNA-seq data with Scallop2. Nat. Comput. Sci. 2, 148–152 (2022).
Collado-Torres, L. et al. Regional heterogeneity in gene expression, regulation, and coherence in the frontal cortex and hippocampus across development and schizophrenia. Neuron 103, 203–216 (2019).
Jaffe, A. E. et al. Developmental and genetic regulation of the human cortex transcriptome illuminate schizophrenia pathogenesis. Nat. Neurosci. 21, 1117–1125 (2018).
Wenric, S. et al. Transcriptome-wide analysis of natural antisense transcripts shows their potential role in breast cancer. Sci. Rep. 7, 17452 (2017).
Ulicevic, J. et al. Uncovering the dynamics and consequences of RNA isoform changes during neuronal differentiation. Mol. Syst. Biol. 20, 767–798 (2024).
Varabyou, A. et al. CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure. Genome Biol. 24, 249–316 (2023).
Nido, G. S. et al. Common gene expression signatures in Parkinson’s disease are driven by changes in cell composition. Acta Neuropathol. Commun. 8, 55 (2020).
Pardo-Palacios, F. J. et al. Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. Nat. Methods 21, 1349–1363 (2024).
Prjibelski, A. D. et al. Accurate isoform discovery with IsoQuant using long reads. Nat. Biotechnol. 41, 915–918 (2023).
Chen, Y. et al. Context-aware transcript quantification from long-read RNA-seq data with Bambu. Nat. Methods 20, 1187–1195 (2023).
Chu, Y. et al. Argonaute binding within 3′-untranslated regions poorly predicts gene repression. Nucleic Acids Res. 48, 7439–7453 (2020).
Cernilogar, F. M. et al. Chromatin-associated RNAi components contribute to transcriptional regulation in Drosophila. Nature 480, 391–395 (2011).
Zaytseva, O. et al. Transcriptional repression of Myc underlies the tumour suppressor function of AGO1 in Drosophila. Development 147, dev190231 (2020).
Huang, V. et al. Ago1 interacts with RNA polymerase II and binds to the promoters of actively transcribed genes in human cancer cells. PLoS Genet. 9, e1003821 (2013).
Mayr, C., Hemann, M. T. & Bartel, D. P. Disrupting the pairing between let-7 and Hmga2 enhances oncogenic transformation. Science 315, 1576–1579 (2007).
Dang, C. MYC on the path to cancer. Cell 149, 22–35 (2012).
Matsui, M. et al. Activation of LDL receptor expression by small RNAs complementary to a noncoding transcript that overlaps the LDLR promoter. Chem. Biol. 17, 1344–1355 (2010).
Wagschal, A. et al. Genome-wide identification of microRNAs regulating cholesterol and triglyceride homeostasis. Nat. Med. 21, 1290–1297 (2015).
Su, H., Trombly, M. I., Chen, J. & Wang, X. Essential and overlapping functions for mammalian Argonautes in microRNA silencing. Genes Dev. 23, 304–317 (2009).
De Martino, D. & Bravo-Cordero, J. J. Collagens in cancer: structural regulators and guardians of cancer progression. Cancer Res. 83, 1386–1392 (2023).
Chen, X. et al. COL5A1 promotes triple-negative breast cancer progression by activating tumor cell-macrophage crosstalk. Oncogene 43, 1742–1756 (2024).
Shi, Y. et al. Reduced expression of METTL3 promotes metastasis of triple-negative breast cancer by m⁶A methylation-mediated COL3A1 up-regulation. Front. Oncol. 10, 1126 (2020).
Kwon, J. J., Factora, T. D., Dey, S. & Kota, J. A Systematic review of miR-29 in cancer. Mol. Ther. Oncolytics 12, 173–194 (2019).
Zhu, J. et al. Chaperone Hsp47 drives malignant growth and invasion by modulating an ECM gene network. Cancer Res. 75, 1580–1591 (2015).
Wang, Y., Mei, X., Song, W., Wang, C. & Qiu, X. LncRNA LINC00511 promotes COL1A1-mediated proliferation and metastasis by sponging miR-126-5p/miR-218-5p in lung adenocarcinoma. BMC Pulm. Med. 22, 272 (2022).
Wang, Y. et al. MiR-410 is overexpressed in liver and colorectal tumors and enhances tumor cell growth by silencing FHL1 via a direct/indirect mechanism. PLoS ONE 9, e108708 (2014).
Wang, J. et al. lncRNA ZNRD1-AS1 promotes malignant lung cell proliferation, migration, and angiogenesis via the miR-942/TNS1 axis and is positively regulated by the m⁶A reader YTHDC2. Mol. Cancer 21, 229 (2022).
Pool, A., Poldsam, H., Chen, S., Thomson, M. & Oka, Y. Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references. Nat. Methods 20, 1506–1515 (2023).
Salmen, F. et al. High-throughput total RNA sequencing in single cells using VASA-seq. Nat. Biotechnol. 40, 1780–1793 (2022).
Morales, J. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604, 310–315 (2022).
Shinder, I. & Pertea, M. Filtered CHESS 3.0.1 human transcript annotation for StringTie3 benchmarking. Zenodo https://doi.org/10.5281/zenodo.18223655 (2026).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
Shinder, I., Hu, R., Ji, H. J., Chao, K. & Pertea, M. EASTR: identifying and eliminating systematic alignment errors in multi-exon genes. Nat. Commun. 14, 7223 (2023).
Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res. 9, 304 (2020).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Pertea, G., Shinder, I. & Pertea, M. StringTie 3.0.3. Zenodo https://doi.org/10.5281/zenodo.17604767 (2025).

Acknowledgements

This work was supported in part by National Science Foundation grant DBI-2412449 (to M.P.) and National Institutes of Health grants R01-MH123567 (to M.P.) and R35-GM156470 (to M.P.).

Author information

Authors and Affiliations

Cross Disciplinary Graduate Program in Biomedical Sciences, Johns Hopkins School of Medicine, Baltimore, MD, USA
Ida Shinder
Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
Ida Shinder, Richard Hu, Zoe Rudnick & Mihaela Pertea
Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, USA
Geo Pertea
Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Richard Hu & Mihaela Pertea
Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, Baltimore, MD, USA
Zoe Rudnick & Mihaela Pertea
Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
Mihaela Pertea

Contributions

I.S. conceived the study, designed and contributed to the implementation of the software, conducted computational analyses, analyzed and interpreted the results, and wrote the manuscript. Z.R. and R.H. aided in the analysis of results. G.P. assisted with software development. M.P. conceived the study, contributed to software, assisted in writing and editing the manuscript, and supervised the entire project. All authors reviewed and approved the final manuscript.

Corresponding authors

Correspondence to Ida Shinder or Mihaela Pertea.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Adam Ameur, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Lei Tang and Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Annotation‑guided long‑read assembly performance across three modalities.

Per‑replicate scatter plots of precision (%) (x‑axis) versus the number of reference transcripts correctly reconstructed by each assembler (y‑axis; ×10³) for assemblies run in annotation‑guided mode against the CHESS 3.0.1 reference. Rows show (top to bottom): neuron‑differentiation ONT cDNA (days 0/3/5; three biological replicates each), LRGASP ONT dRNA (H1‑mix and WTC11; three replicates each), and LRGASP PacBio cDNA (H1‑mix and WTC11; three replicates each). Colors denote methods (StringTie2, StringTie3, StringTie3 nascent mode, IsoQuant, Bambu); shapes indicate replicate identity (see panel legends). “X” marks the per‑method mean across replicates.
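
The two axes in these panels can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each transcript has been reduced to a hashable key (for example chromosome, strand and intron chain) and that a match means key equality; the function name and keys are hypothetical, and the exact matching rules used in the study are not stated in this caption.

```python
def precision_and_matches(assembled, reference):
    """Return (precision in %, number of reference transcripts matched).

    `assembled` and `reference` are iterables of hashable transcript keys.
    Treating equal keys as a match is an assumption made for this sketch.
    """
    asm_set = set(assembled)
    ref_set = set(reference)
    matched = asm_set & ref_set
    # Precision: share of assembled transcripts that match a reference transcript.
    precision = 100.0 * len(matched) / len(asm_set) if asm_set else 0.0
    return precision, len(matched)


if __name__ == "__main__":
    assembled = ["tx_a", "tx_b", "tx_c", "tx_d"]
    reference = ["tx_a", "tx_c", "tx_e"]
    print(precision_and_matches(assembled, reference))
```

A point further right and higher in such a plot corresponds to a larger first and second value, respectively.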

Extended Data Fig. 2 SIRV-only assembly accuracy.

Scatter plots show SIRV-only precision (x-axis) versus the number of correctly assembled SIRV transcripts (true positives; y-axis) in annotation-free (left column) and annotation-guided (right column) modes. Each point represents one replicate. In the guided panels, CHESS was provided to each method but SIRV models were withheld, to test whether each method can discover novel transcripts beyond the guidance set; metrics in all four panels are computed only on SIRVs.

Extended Data Fig. 3 Transcript length profiles in assemblies from short-read, long-read, and hybrid data of matched samples.

A. Reference transcripts (CHESS 3.0.1) were binned by exonic length (x-axis). For each modality (short‑read rRNA‑depleted, poly(A)‑selected long‑read, and hybrid, which pairs one short‑read and one long‑read library from the same day), the y‑axis shows the fraction of expressed references in each bin that were exactly reconstructed. A transcript was considered expressed if it had coverage ≥ 1 read per bp in any modality, and exactly reconstructed if its intron chain was identical and its start and end coordinates were within ±100 bp of the reference start and end. Curves plot the weighted average across assemblies, with weights proportional to the number of expressed references in each bin. Shaded areas indicate 95% confidence intervals across assemblies. Numbers above the x‑axis represent the median number of expressed references per bin across assemblies. Results aggregate 14 short‑read assemblies, 9 long‑read assemblies, and 42 hybrid assemblies from the iPSC‑to‑neuron series. B. Longest reference reconstructed per assembly (kb). Each point represents one assembly. Box plots display the median (center line), 25th and 75th percentiles (box bounds), and whiskers extending to the most extreme data points within 1.5× the interquartile range; points beyond the whiskers represent outliers. The same 14 short‑read, 9 long‑read, and 42 hybrid assemblies described in panel A were analyzed.
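
The "exactly reconstructed" criterion and the weighted average in panel A can be sketched in Python as follows. This is a minimal illustration, not the authors' evaluation code: the `Transcript` class, the function names and the example coordinates are hypothetical, and only the rules stated in the caption (identical intron chain, ends within ±100 bp, weights proportional to the number of expressed references) are taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Transcript:
    start: int
    end: int
    introns: Tuple[Tuple[int, int], ...]  # ordered (intron_start, intron_end) pairs


def exactly_reconstructed(ref: Transcript, asm: Transcript, tol: int = 100) -> bool:
    """Identical intron chain, and both ends within +/- tol bp of the reference."""
    return (
        ref.introns == asm.introns
        and abs(ref.start - asm.start) <= tol
        and abs(ref.end - asm.end) <= tol
    )


def weighted_fraction(per_assembly: List[Tuple[int, int]]) -> Optional[float]:
    """Weighted mean of per-assembly fractions for one length bin.

    Each item is (n_reconstructed, n_expressed); the weight of an assembly
    is its number of expressed references in the bin.
    """
    usable = [(k, n) for k, n in per_assembly if n > 0]
    total_weight = sum(n for _, n in usable)
    if total_weight == 0:
        return None  # no expressed references in this bin
    return sum(n * (k / n) for k, n in usable) / total_weight


if __name__ == "__main__":
    ref = Transcript(1000, 5000, ((1200, 1800), (2500, 3000)))
    asm = Transcript(1050, 4950, ((1200, 1800), (2500, 3000)))
    print(exactly_reconstructed(ref, asm))
    print(weighted_fraction([(5, 10), (30, 90)]))
```

With weights proportional to the number of expressed references, the weighted mean reduces to the pooled ratio of reconstructed to expressed references, so assemblies with few expressed transcripts in a bin contribute little to the curve.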

Extended Data Fig. 4 StringTie3’s nascent mode algorithm.

A. Candidate transcript selection and quantification in a gene locus. Read coverage is shown at the top, with regions corresponding to the candidate transcript highlighted in orange. Arcs below the coverage plot represent splice junctions supported by spliced reads; thicker arcs indicate stronger read support. The splice graph, with the candidate transcript (that is, the heaviest path) highlighted in orange, is shown below. Arrows beneath the coverage plot indicate the genomic regions corresponding to the nodes in the splice graph. Dashed nodes and edges (for example, nodes 1_2 and 5_6) represent intronic nodes that, due to sparse coverage, are not included in the splice graph when using the non-nascent mode. A flow network (highlighted in green) is then constructed using all nodes from the heaviest path, with edges connecting two nodes if a transfrag starts at one and ends at the other. B. Nascent transcript quantification. Splice graphs and flow networks corresponding to the two nascent transcripts derived from the transcript in panel A are shown. The paths of the nascent transcripts are highlighted in orange in the splice graphs.
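
The flow-network construction described in panel A (an edge between two path nodes whenever a transfrag starts at one and ends at the other) can be sketched in Python. This is a simplified illustration under stated assumptions, not StringTie3's C++ implementation: transfrags are represented here as hypothetical `(first_node, last_node, abundance)` tuples, only their endpoints are checked against the path, and summing abundances into an edge capacity is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_flow_network(
    path_nodes: List[int],
    transfrags: Iterable[Tuple[int, int, float]],
) -> Dict[Tuple[int, int], float]:
    """Connect two nodes of the heaviest path if a transfrag starts at one
    and ends at the other; accumulate transfrag abundance on that edge."""
    on_path = set(path_nodes)
    capacity: Dict[Tuple[int, int], float] = defaultdict(float)
    for first, last, abundance in transfrags:
        # Skip transfrags that begin or end outside the candidate path,
        # and single-node transfrags, which do not define an edge.
        if first in on_path and last in on_path and first != last:
            capacity[(first, last)] += abundance
    return dict(capacity)


if __name__ == "__main__":
    path = [1, 3, 5]  # node identifiers along a hypothetical heaviest path
    transfrags = [
        (1, 3, 4.0),
        (1, 3, 2.0),
        (3, 5, 5.0),
        (2, 3, 7.0),  # starts at a node that is not on the path
        (1, 5, 1.0),
    ]
    print(build_flow_network(path, transfrags))
```

In this toy input the transfrag starting at node 2 is ignored because node 2 is not on the candidate path, which mirrors how intronic nodes excluded from a path would not contribute to that path's network.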

Supplementary information

Supplementary Information

Supplementary Tables 1–3, Supplementary Figs. 1–14 and Supplementary Notes 1–4.

Reporting Summary

Peer Review File

Supplementary Data

Excel workbook containing 13 tabs: (1–2) Per-sample assembly precision and sensitivity for rRNA-depleted and poly(A)+ DLPFC libraries; (3) Transcript overlap between library types; (4) Long-read alignment statistics; (5–8) Differential expression results for AGO1, AGO2, AGO1/AGO2 and AGO1/AGO2/AGO3 knockouts; (9) Breast cancer tumor versus normal differential expression; (10–12) Runtime and memory benchmarks; (13) LRGASP sample and file accessions.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shinder, I., Pertea, G., Hu, R. et al. StringTie3 improves total RNA-seq assembly by resolving nascent and mature transcripts. Nat Methods (2026). https://doi.org/10.1038/s41592-026-03080-3

Received: 21 May 2025
Accepted: 26 March 2026
Published: 19 May 2026
Version of record: 19 May 2026
DOI: https://doi.org/10.1038/s41592-026-03080-3