Fig. 1: A computational pipeline to accurately quantify 3′-UTR isoform abundances from scRNA-seq data.
From: The landscape of alternative polyadenylation in single cells of the developing mouse embryo

a Venn diagram of three PAS annotation resources and their degree of intersection. A PAS located within ± 20 nt of another was considered an intersecting hit, to account for the heterogeneity of the cleavage and polyadenylation machinery¹⁶. b Profile of nucleotide frequencies in the ± 50-nt vicinity of the annotated cleavage site position, derived from the union of the three databases. Shown above the plot are the positionally enriched mammalian motifs known to guide mRNA cleavage³⁴. c Distribution of scRNA-seq reads mapping within the ± 400-nt vicinity of the annotated cleavage site position, derived from the union of the three databases. To avoid an ambiguous signal, the analysis was restricted to PASs not within the same ± 400-nt window as another PAS. Data are binned at 5-nt resolution. The dotted red lines delimit the acceptable distance range for associating a read with an annotated PAS. See also Supplementary Fig. 1 for comparisons of (b, c) for each individual PAS database. d Schematic depicting the association of each scRNA-seq read with a PAS in order to quantify relative PAS abundances for a gene. Shown from top to bottom are: (i) The read coverage of scRNA-seq reads mapped to the gene. (ii) The three PAS annotation resources considered, showing the location of each PAS along the 3′-UTR. (iii) The subset of chosen PASs to which reads were greedily assigned, colored from blue to green to indicate which reads from the coverage plot were assigned to them. (iv) The three gene annotation databases integrated with bulk 3P-seq data from ten tissues and cell lines⁸ to identify the longest known 3′-UTR. This integrated 3′-UTR was used to associate PASs with the gene. (v) A visualization of relative 3′-UTR isoform abundances after read-to-PAS assignment, with a vertical line at each chosen PAS whose height is proportional to the number of reads assigned to it. 
Reads not falling within the −300 to +20 nt vicinity of a known PAS were treated as likely internal priming artifacts and discarded. (vi) The resulting isoform inclusion rate (IIR) plot, which quantifies the cumulative proportion of 3′-UTR isoforms remaining along the length of a 3′-UTR. See also Supplementary Data 1 for the integrated 3′-UTR database and gene annotations. e Scatter plots comparing gene expression levels estimated from scRNA-seq read abundances mapping to the full gene body (left panel) or from the sum of reads mapping to PASs (right panel), relative to median gene expression levels from bulk RNA-seq data³⁷ (n = 19,517 protein-coding genes). Regions are colored according to the density of data, from light blue (low density) to yellow (high density). The corresponding Pearson (r) and Spearman (rho) correlations are shown for each comparison. See also Supplementary Fig. 2 for sequence features explaining biased estimates in the gene body approach. f IIR plots for two genes, comparing the profiles for the raw scRNA-seq data and for the post-processed data after read-to-PAS assignment with the profile for bulk 3P-seq data⁸ as a gold standard. Slight vertical jitter was added for enhanced line visibility. See also Supplementary Fig. 3 for a global comparison among all genes.
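
The read-to-PAS assignment and IIR steps of panel d can be illustrated with a minimal Python sketch. This is not the authors' implementation: the caption specifies only the −300 to +20 window and that assignment is greedy, so the greedy rule used here (repeatedly pick the PAS whose window holds the most unassigned reads, with ties going to the most upstream PAS), the function names, and the representation of reads and PASs as strand-oriented integer offsets along the 3′-UTR are all assumptions made for illustration. The IIR is computed here as the fraction of assigned reads whose PAS lies at or beyond each chosen PAS.

```python
UPSTREAM, DOWNSTREAM = 300, 20  # window from the caption: -300 to +20 nt around a PAS


def in_window(read_pos, pas_pos):
    """True if a read falls in the -300..+20 nt window of a PAS (offsets along the 3'-UTR)."""
    return pas_pos - UPSTREAM <= read_pos <= pas_pos + DOWNSTREAM


def assign_reads_greedily(read_positions, pas_positions):
    """Hypothetical greedy assignment: returns ({pas: read count}, n_discarded_reads)."""
    unassigned = list(read_positions)
    candidates = sorted(set(pas_positions))  # sorted so ties go to the most upstream PAS
    counts = {}
    while candidates and unassigned:
        # Pick the PAS whose window currently holds the most unassigned reads.
        best = max(candidates, key=lambda p: sum(in_window(r, p) for r in unassigned))
        hits = [r for r in unassigned if in_window(r, best)]
        if not hits:
            break  # no remaining PAS explains any remaining read
        counts[best] = len(hits)
        unassigned = [r for r in unassigned if not in_window(r, best)]
        candidates.remove(best)
    # Reads left over are outside every window and are discarded
    # (treated in the caption as likely internal priming artifacts).
    return counts, len(unassigned)


def isoform_inclusion_rate(counts):
    """Return [(pas, fraction of isoforms that extend to or beyond this PAS)]."""
    total = sum(counts.values())
    if total == 0:
        return []
    remaining = total
    curve = []
    for pas in sorted(counts):
        curve.append((pas, remaining / total))
        remaining -= counts[pas]  # isoforms ending at this PAS do not continue
    return curve


# Toy example with made-up positions (not data from the paper).
reads = [100, 150, 190, 900, 950, 1500]
pas_sites = [200, 210, 1000]
counts, discarded = assign_reads_greedily(reads, pas_sites)
print(counts, discarded)
print(isoform_inclusion_rate(counts))
```

In this toy example, the PASs at offsets 200 and 210 compete for the same three reads, and the tie-break keeps only the upstream one, which mirrors the idea in (iii) of selecting a subset of annotated PASs. The read at offset 1500 lies outside every window and is dropped.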