Fig. 2: mRNA isoform diversity revealed by lrCaptureSeq.

a Total number of isoforms cataloged for each gene after completion of the lrCaptureSeq bioinformatic pipeline. b UpSet plot comparing isoform numbers in the PacBio lrCaptureSeq dataset with public databases (RefSeq, UCSC Genes). Intersections show that 53.9% of NCBI RefSeq isoforms were detected in the PacBio dataset (255 RefSeq isoforms, 4th + 6th columns from left). For UCSC Genes, 72.3% of isoforms annotated in this database were detected in the PacBio dataset (102 UCSC isoforms, 5th + 6th columns). c Lorenz plots depicting the total number of isoforms cataloged for each gene (right Y intercepts), and the fraction of each gene’s total reads represented by each of its isoforms (dots). Curves are cumulative functions, with isoforms displayed in order from highest (left) to lowest (right) fraction of total gene reads. Also see Supplementary Fig. 2D. d Shannon diversity index was used to compare the relative diversity of each gene. A higher Shannon index reflects both higher isoform number and greater parity of isoform expression. e Treeplot depicting relative abundance of genes (colors) and isoforms (nested rectangles) within the entire dataset. Rectangle size is proportional to total read number. The most abundant isoform belonged to Crb1; the most abundant gene was Nrcam. f, g Unsupervised clustering applied at the single-gene level identifies families of related isoforms that share specific sequence elements. The Ptprd gene is shown as an example. A subset of Ptprd isoforms cluster into 5 groups (f, bottom). These differ based upon 3 variables: length of 5′ UTR; length of 3′ UTR; and splicing of a variable exon cluster (f, top). The same groups segregate within the principal components plot (g).
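
As an illustration of the quantities plotted in panels c and d, the following is a minimal sketch of how a per-gene Shannon index and the cumulative read fractions behind a Lorenz-style curve could be computed from isoform read counts. It is not the pipeline used for the figure: the function names, the input format (a list of read counts, one per isoform), and the use of the natural logarithm are assumptions, since the caption does not state the logarithm base or the implementation.

```python
import math


def shannon_index(read_counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over isoforms of one gene.

    read_counts: hypothetical list of read counts, one per isoform.
    Natural log is an assumption; the caption does not specify the base.
    """
    total = sum(read_counts)
    if total == 0:
        return 0.0
    # Isoforms with zero reads contribute nothing to the sum.
    proportions = [c / total for c in read_counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def cumulative_read_fractions(read_counts):
    """Cumulative fraction of a gene's reads, isoforms ordered from the
    highest to the lowest read count (the ordering described for panel c)."""
    total = sum(read_counts)
    if total == 0:
        return []
    cumulative = []
    running = 0
    for count in sorted(read_counts, reverse=True):
        running += count
        cumulative.append(running / total)
    return cumulative
```

With this definition, a gene whose reads are spread evenly over four isoforms gives an index of ln 4, while a gene with a single isoform gives 0, which matches the statement that the index rises with both isoform number and parity of expression.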