Introduction

Differential alternative splicing is a fundamental mechanism that increases molecular diversity. Splicing involves processing of pre-mRNAs by the spliceosome, resulting in removal of intronic sequences. Most metazoan genes undergo splicing, and splicing is critical not only for producing mature mRNAs but also for nuclear export and therefore translation. Alternative splicing (AS) occurs when a pre-mRNA is processed in more than one way, resulting in removal of different introns and the consequent production of mature mRNAs with different sequences. AS can alter the mRNA coding potential, resulting in expression of different protein isoforms. AS can also affect the stability and other features of the mRNA. The majority of human genes undergo AS1,2, and defects in AS have been linked to disease3.

Differential AS occurs when AS is regulated spatially or temporally, so that different cells express separate isoforms (differential AS is often referred to simply as ‘AS’; here, the terms are distinct). In the nervous system, where it is most prevalent4,5, differential AS controls multiple aspects of neuron identity6, including global AS switches during development7, isoform differences in neuron type specification8,9, axon specification, guidance, and synaptogenesis10,11, and has been linked to brain disorders3,12,13. Differential AS has also been studied in cancer; indeed, some forms of cancer are dependent on differential AS14. AS, because it is not cell-specific, can be identified by methods such as sequencing bulk cDNA. By contrast, identification of differential AS requires comparison of splicing between different samples, and thus has been less well characterized. One theme that has emerged is that differential AS is often not all-or-none, but rather is characterized by different ratios of splice isoforms across time or space.

How is splicing regulated? For AS, cis-acting factors—sequences or structures within the pre-mRNA—can regulate splicing diversity. The logic of cis regulation has been analyzed to identify features that give rise to AS15,16,17,18,19. But for differential AS, features within the nascent transcript are not sufficient, since the transcript is the same in all cell types. There must also exist trans-acting factors that regulate AS in a cell type-specific manner, for example by interacting with the spliceosome to promote or inhibit particular splicing events3. Several approaches have been developed to establish splicing regulatory networks5,20,21,22,23,24,25, or to integrate trans features within a framework developed on sequence motifs26,27,28,29. Nevertheless, our understanding of the ‘splicing code’—the regulatory framework involving cis and trans elements that determines differential AS across all transcripts—is incomplete.

The nematode Caenorhabditis elegans is a powerful model organism for studies of the nervous system including AS30. Although gene expression atlases of developing and aging C. elegans cells are now available31,32,33,34,35,36,37,38, less has been done to systematically establish AS patterns39,40,41.

Here, we analyze data generated by the CeNGEN project to produce an atlas of AS for 55 single neuron types in C. elegans. We develop analytical tools to make the data available to the research community. We study differential AS between neuron types and show that key neuronal genes display broad patterns of differential AS. Focusing on canonical AS events, we establish overall patterns of differential splicing. Finally, we develop a principled computational approach to extract a regulatory network for differential AS, and use the network to identify candidate factors that regulate differential alternative splicing.

Results

Visualization and analysis of alternative exon usage

The CeNGEN project generated a data set covering 55 individual neuron types suitable for splicing analysis. As described previously42,43, a series of C. elegans strains were used, each with specific promoters that uniquely label individual neuron types. For each neuron type, neurons were recovered by Fluorescence Activated Cell Sorting (FACS) from L4 hermaphrodites, with multiple independent biological replicates (average 3.8 replicates per neuron type, Supp. Data 1). Libraries were prepared using an optimized ribodepletion protocol43 and sequenced on the Illumina platform, an approach that yielded robust coverage across the gene body (Fig. 1A). Our recent analysis distinguished 128 neuron types in the C. elegans nervous system each defined by its unique transcriptome34. Here, we isolated 55 neuron types, or 43% of known neuron types. To analyze alternative splicing (AS) in this data set we used three parallel approaches: (1) raw data visualization, (2) local quantification, and (3) transcript-level quantification. We illustrate our approaches on the gene ric-4, the homolog of the SNARE protein SNAP25, which displays an alternative first exon expressed differentially between neuron types44,45. All approaches are available online at www.splicing.cengen.org.

Fig. 1: Overview of data collection and splicing analysis.

A Schematic of the experimental procedure. B-E Four methods to analyze alternative splicing, applied to the gene ric-4/SNAP25 in the neurons NSM and PVM. B Raw data visualization. Top track: Gene model of ric-4 (along with non-coding RNAs 21ur-13262 and Y22F5A.10). Bottom tracks: Read coverage and junction counts. Numbers denote junction-spanning reads for the splice junctions of interest. Blue and orange boxes indicate alternative first exons. C Local Splicing Variation (LSV) visualization. Top: Gene model of ric-4, highlighting the alternative first exons corresponding to ric-4a (blue) and ric-4b (orange). Bottom: Posterior mPSI (MAJIQ-defined Percent Selected Index) estimates displayed as violin plots. Numbers indicate the posterior expected mPSI (summing to one in each neuron, as the orange and blue junctions are mutually exclusive). D Transcript-level quantification. Top: Average transcript TPM across all sequenced neurons. Middle: Relative transcript usage (in proportion to the total TPM for the gene, in each neuron), in NSM and PVM. Bottom: Transcript quantification, indicating the mean +/- standard deviation of TPM across samples for each transcript in each neuron. E Neuron-wise comparison between PVM and NSM. Top: The application returns a list of splice junctions with differential usage between the two neurons, including two junctions belonging to the same LSV in the gene ric-4. Bottom: Visualization of junction usage for this LSV in neurons PVM and NSM.

Raw data visualization to detect alternative splicing

Raw data visualization is a direct approach for splicing analysis that depends on displaying raw read counts in a genome browser. Direct visualization allows inspection of exon and splice usage in the full context of all the data for that gene. Raw data visualization does not use a statistical model and can be applied to individual biological replicates, to pooled data for each individual neuron type, or to all samples grouped together. Our browser is based on JBrowse246. For each individual biological replicate, we generated a pair of browser tracks. The tracks underwent minimal filtering for clarity (see Methods).

First, a density plot indicates the number of reads aligned at a particular genomic position (normalized by the total number of reads in that sample and multiplied by one million, yielding Counts Per Million). Second, a splice junction track indicates the number of junction-spanning reads supporting that junction, without any normalization. We computed similar tracks for each neuron type, using all biological replicates for that type. Here, the density histogram represents the mean coverage across replicates at each genomic position (for each base pair). In addition, the junction-spanning reads (see Methods) are summed for each junction, to give a total junction usage track for that neuron. Finally, to allow rapid examination of a genomic locus across many neurons, we generated an additional set of six “global” tracks: the mean coverage (for each genomic position) across all neuron types, the minimum and maximum coverage at each genomic position across all neuron types, and the sum, minimum, and maximum of junction-spanning reads for each splice junction across neuron types. The mean exonic coverage and sum of splice junction tracks enable convenient visualization of an “average” transcript across neuron types. The minimum and maximum tracks facilitate the identification of rare transcripts: if a single neuron type expresses a given exon, it will be apparent in the maximum coverage track; if a single neuron does not express a given exon, it will not appear in the minimum track (and similarly for splice junctions).
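The track definitions above can be illustrated with a minimal sketch. It assumes per-position read depths are already available as plain lists; the function names and the toy depths are hypothetical and do not describe the pipeline actually used to build the browser tracks.

```python
from statistics import mean

def counts_per_million(coverage, total_reads):
    """Scale raw per-position read depth to Counts Per Million for one sample."""
    return [depth * 1_000_000 / total_reads for depth in coverage]

def neuron_track(replicate_cpm):
    """Mean CPM across biological replicates at each genomic position."""
    return [mean(values) for values in zip(*replicate_cpm)]

def global_tracks(neuron_tracks):
    """Mean, minimum and maximum coverage across neuron types, per position."""
    columns = list(zip(*neuron_tracks.values()))
    return {
        "mean": [mean(c) for c in columns],
        "min": [min(c) for c in columns],
        "max": [max(c) for c in columns],
    }

# Toy locus of three positions with made-up depths (assumption, not real data).
rep1 = counts_per_million([10, 0, 4], total_reads=2_000_000)
rep2 = counts_per_million([3, 0, 1], total_reads=1_000_000)
tracks = {
    "NSM": neuron_track([rep1, rep2]),
    "PVM": neuron_track([[0.0, 6.0, 1.5]]),
}
summary = global_tracks(tracks)
```

In this toy locus, the second position is covered only in PVM, so it is visible in the maximum track but drops to zero in the minimum track, which is the behavior used to spot rare exons.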

For example, for the gene ric-4 (Fig. 1B), we apply raw data visualization to the neuron types NSM and PVM. Read coverage shows that the distal first exon (blue box), corresponding to transcript ric-4a, is preferentially expressed in NSM. PVM displays weaker preferential usage of the proximal first exon (orange box), corresponding to transcript ric-4b. In addition, NSM displays 292 junction-spanning reads connecting the distal first exon, and only 18 junction-spanning reads connecting the proximal first exon. By contrast, in PVM there are 124 junction-spanning reads connecting the distal first exon, and 283 junction-spanning reads connecting the proximal first exon. This analysis indicates that ric-4 undergoes differential alternative splicing in NSM vs PVM.

Quantification of alternative splicing

Although inspecting raw data on the genome browser is useful for visualizing differential AS at the single-gene level, additional methods are needed for genome-wide analysis. For this purpose, we quantified splice junction usage with the software package MAJIQ47. MAJIQ defines a Local Splicing Variation (LSV) as a set of splice junctions (SJ) starting from the same source exon or ending in the same target exon. For each LSV, the relative usage of each possible SJ is quantified, and a MAJIQ-defined Percent Selected Index (mPSI) is estimated from a Bayesian model. For example, an exon skipping/inclusion event (known as a ‘cassette exon’) is represented by MAJIQ as two LSVs: one upstream LSV, containing two splice junctions (one SJ that links the upstream exon to the cassette exon, and one SJ that skips the cassette to link to the downstream exon), and a second LSV, also with two SJs (one SJ from the upstream exon into the downstream exon, the other SJ from the cassette exon into the downstream exon). Similarly, an alternative first exon is represented in MAJIQ by a single LSV, immediately downstream of the alternative exons. Quantitative data generated using MAJIQ can be represented using VOILA47, which displays the mPSI of junctions belonging to an LSV using violin plots.

In the case of ric-4, the alternative first exon is quantified as a single LSV containing two splice junctions, and quantification demonstrates the preferential use of the ric-4a exon in NSM and the ric-4b exon in PVM (Fig. 1C). We use the quantitative data generated by MAJIQ to analyze global splicing patterns across genes and neuron types in the following sections.
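To give an intuition for what a posterior expected usage value represents, the following sketch computes a posterior mean from junction-spanning read counts under a symmetric Dirichlet prior. This is a deliberately simplified stand-in and not MAJIQ's actual model; the function name and the prior value are assumptions. The counts are the pooled junction-spanning reads reported for ric-4 in NSM and PVM in the raw data visualization.

```python
def expected_usage(junction_counts, prior=0.5):
    """Posterior mean usage of each junction in one LSV.

    Assumes a symmetric Dirichlet prior with pseudo-count `prior` per
    junction (an illustrative choice, not the MAJIQ model).
    """
    total = sum(junction_counts.values()) + prior * len(junction_counts)
    return {name: (n + prior) / total for name, n in junction_counts.items()}

# Junction-spanning reads for the two alternative first exons of ric-4.
nsm = expected_usage({"distal (ric-4a)": 292, "proximal (ric-4b)": 18})
pvm = expected_usage({"distal (ric-4a)": 124, "proximal (ric-4b)": 283})

for neuron, usage in (("NSM", nsm), ("PVM", pvm)):
    print(neuron, {name: round(value, 2) for name, value in usage.items()})
```

Because the two junctions are mutually exclusive, the two values sum to one in each neuron; a full model additionally accounts for replicate variability and read-position biases, which this sketch ignores.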

Besides quantifying individual splicing events, it is also useful to visualize alternative events in the context of complete transcripts. To address this question, we used the software package StringTie in quantification mode to analyze transcript levels in individual neuron types48. Given a set of annotated transcripts (we used all transcripts annotated in WormBase), StringTie uses a maximum flow computational approach to estimate the expression level of each transcript. Thus, StringTie output represents not only the relative abundance of the alternative transcripts of a gene, but also the measured level of each transcript in each neuron type (in Transcripts Per Million or TPM units).

For example, in the case of ric-4 (Fig. 1D), StringTie analysis compares the levels of the transcripts ric-4a and ric-4b in NSM and PVM. First, the transcript ric-4a is more common than ric-4b when averaging across all neurons (ric-4a is detected at 134 TPM, ric-4b at 47 TPM on average). Second, examining the relative transcript usage in individual neurons, 90% of the ric-4 expression in NSM is attributed to ric-4a, whereas 80% of the ric-4 expression in PVM is attributed to ric-4b. Comparing the total transcript expression values, ric-4a is expressed at 38 ± 31 TPM in NSM and 60 ± 67 TPM in PVM, whereas ric-4b is expressed at 4 ± 0.9 TPM in NSM and 212 ± 116 TPM in PVM, where the interval given corresponds to the standard deviation across biological replicates from the same neuron type.
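The relation between absolute transcript levels and relative transcript usage can be checked with a short sketch. The function name is hypothetical; the inputs are the mean TPM values quoted above, so the resulting proportions are approximate and may differ slightly from values computed per sample and then averaged.

```python
def relative_usage(tpm_by_transcript):
    """Proportion of a gene's total TPM contributed by each transcript."""
    total = sum(tpm_by_transcript.values())
    if total == 0:
        return {transcript: 0.0 for transcript in tpm_by_transcript}
    return {transcript: tpm / total for transcript, tpm in tpm_by_transcript.items()}

# Mean TPM values for ric-4 transcripts in the two neuron types.
nsm = relative_usage({"ric-4a": 38, "ric-4b": 4})
pvm = relative_usage({"ric-4a": 60, "ric-4b": 212})

print(f"NSM ric-4a: {nsm['ric-4a']:.0%}")
print(f"PVM ric-4b: {pvm['ric-4b']:.0%}")
```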

Finally, a common research need is to determine genes that differ in their splicing patterns between two neurons or sets of neurons. We thus provide an additional tool at www.splicing.cengen.org to interrogate the quantifications performed by MAJIQ. For example, comparing the PVM and NSM neurons yields 49 splice junctions belonging to 27 LSVs in 22 genes with differential splicing. Among these differentially spliced junctions are the two junctions corresponding to the alternative first exon of ric-4, identifying this gene as a good candidate for further investigation with the event-centric tools described above (Fig. 1E). This approach provides an important investigative method to select splicing events potentially relevant to a known phenotype.

Comparison to known instances of alternative splicing

Next, we asked if our analysis aligns with existing data on alternative exon usage in C. elegans neurons. A two-color splicing reporter previously indicated that elp-1/EMAP undergoes differential alternative splicing, with exon 5 skipped in touch neurons49. Indeed, our data confirm exon 5 skipping in the AVM touch neurons (Fig. 2A). In another case, a cassette exon (11.5) in daf-2/IGFR was previously reported to undergo differential alternative splicing in many neuron classes50 (Fig. 2B). Using local quantification (Fig. 2C), we also find differential alternative splicing at exon 11.5, with the splicing patterns we observe in good agreement with previous results. A similar pattern is seen by visual exploration of the raw data (Fig. 2D, red box). It is interesting to observe that despite relatively clear data for exon 11.5, our transcript-level analysis shows that the transcript known to contain this exon (daf-2c) is only modestly enriched in individual neuron types, likely owing to the relatively large number of alternative transcripts of the daf-2 gene44 (Fig. 2E). Together, these results indicate that our data are consistent with existing in vivo observations at single-neuron type resolution.

Fig. 2: Comparison to previous splicing data.

A Exon 5 skipping in elp-1 in AVM neuron. Left: Raw data visualization in the neuron AVM. Right: Local quantification of the two LSVs corresponding to the cassette exon. The violin plots represent the posterior distribution computed by MAJIQ. B-D Cassette exon 11.5 in daf-2 in individual neuron types. B daf-2 gene structure. The cassette exon 11.5 (red box) is unique to the transcript daf-2c. C Local quantification of the upstream (top) and downstream (middle) events flanking the cassette exon 11.5. Left: Schematic representation of splice junctions constituting the local event. Right: Splice junction quantifications in 16 neuron types. Bottom: Inclusion pattern of exon 11.5 from Tomioka et al.50, and agreement with our data. D Raw data visualization of exon 11.5 alternative splicing in 5 neuron types. Top: Gene model. Bottom: Exonic and junction-spanning counts. Right: Exon 11.5 inclusion pattern from Tomioka et al.50 in the same neurons. E Transcript quantification of daf-2 in the same neurons as (C). The transcripts daf-2a, d, e, and f, which do not include exon 11.5, were grouped together (other). Right: Exon 11.5 inclusion pattern from Tomioka et al.50.

Web tools for splicing analysis

To facilitate use by the scientific community, these data and analytical methods are available via a web portal at www.splicing.cengen.org. For raw data visualization, the user can select a gene or genomic region, and also choose the data to display: all individual samples are available, as well as the averaged data for each neuron type and the global data showing mean, maximum and minimum. For each data set displayed, the user can select whether to display the read counts, the exon-spanning reads, or both. In addition, due to our use of interoperable formats, our tracks can be imported to other genome browsers (such as WormBase or UCSC), and tracks generated by other projects can be displayed in our genome browser, allowing simultaneous examination of data from separate sources. For local quantification using MAJIQ, we display the results using VOILA. Finally, for transcript-level quantification, we developed a custom web application to display the results. For a single gene, the application displays both relative transcript usage and absolute transcript expression for all annotated transcripts. Multiple genes can be represented as a heatmap of transcript expression. These three tools offer users complementary levels of interpretation: quantification of transcript usage reflects the underlying biology of RNA processing. However, expression levels of complete transcripts are difficult to infer from short-read data and may be inaccurate (see Discussion). By contrast, local LSV-level quantification more directly reflects our measurements and is thus likely more accurate. For complex alternative transcripts, interpretation can be complicated by the need to consider multiple splice junctions simultaneously. Finally, the browser view does not offer rigorous quantification but can be used to examine the full context of a genomic region, including constitutive exons and non-coding RNAs.

Axon guidance receptor gene unc-40/DCC is differentially spliced in specific neurons

Most gene models of splicing in C. elegans were obtained from sequencing bulk samples. However, if differential alternative splicing occurs in only a small number of cells, these rare splicing patterns might not be detected. To test this idea, we used manual inspection of raw data visualizations to examine well-studied neuronal genes.

Although a single transcript is annotated for the gene unc-40/DCC44, our analysis detected two novel exons, exon 8.5 and exon 14.5 (Fig. 3A). In particular, exon 14.5 is preferentially included in AVM, whereas other neuron types (e.g. AVL, AWA) exclusively express the canonical splice variant, skipping exon 14.5 (Fig. 3B). Using RT-PCR, we validated the presence of exon 14.5-including transcripts in cDNA extracted from whole animals, as well as the conventional skipped transcript (Fig. 3C). Interestingly, the additional exons (8.5, 14.5) do not disrupt the open reading frame, and lead to insertions between known domains of the UNC-40 protein (Fig. 3D). To determine whether the inclusion of exon 14.5 represents a transcript unique to C. elegans, we examined the locus of the orthologs of unc-40 in the closely related nematodes Caenorhabditis briggsae and Caenorhabditis brenneri (Fig. 3E), and examined bulk RNA-Seq data for those species available from WormBase44. We found that C. briggsae displays four annotated transcripts, with cassette exons corresponding to exons 8.5 and 14.5 of C. elegans. On the other hand, C. brenneri presents evidence of unannotated exons corresponding to exons 8.5 and 14.5. This finding suggests that other nematode species express similar exons. The additional exons in Cbr-UNC-40 and Cbn-UNC-40 encode protein sequences with high identity with exons 8.5 and 14.5 of Cel-UNC-40 (Fig. 3F).

Fig. 3: Detection of novel cassette exon in unc-40.

A Alternative expression of a novel cassette exon in unc-40. Top: Schematic of the gene structure, showing the novel alternative exons 8.5 and 14.5 (green) and the annotated constitutive exons (gray). Bottom: Genome browser representation around exon 14.5, displaying the maximum track, the AVM track, and samples collected from whole animals. B Local quantification of exon 14.5 inclusion. The violin plots represent the posterior distribution computed by MAJIQ. C Validation by RT-PCR of the novel exon 14.5 in cDNA from whole animals (N2 wild type). Top: Location of the primers relative to the alternative exon 14.5. Left, the primer pair only results in amplification from a transcript with exon 14.5 included; right, the primer pair flanking exon 14.5 is able to amplify from both transcripts. Arrowheads indicate the expected product sizes, green arrowhead: transcript with exon 14.5 included (left: 754 bp; right: 837 bp); blue arrowhead: transcript with exon 14.5 skipped (582 bp). For each experiment, the amplification was performed on three biological replicates along with a control without reverse transcriptase (RT). Molecular weights reported in base-pairs. Source data are provided as a Source Data file. D Protein structure of UNC-40, indicating the cassette exons positions vs known protein domains. E Structure of the unc-40 orthologous genes in Caenorhabditis briggsae (Cbr-unc-40) and C. brenneri (Cbn-unc-40). Top: Annotated gene models (unannotated features in green). Bottom: RNA-Seq data. Red arrowheads denote orthologs of exon 8.5 and 14.5. F Alignment of the protein sequences of C. elegans UNC-40, C. briggsae UNC-40, and C. brenneri UNC-40 around exon 8.5 orthologs (top) and around exon 14.5 orthologs (bottom).

The expression of unc-40 is necessary for ventral guidance of the AVM axon51,52. Mutation of unc-40 leads to aberrant anterior growth of the AVM axon with a penetrance of 20-40%. To assess whether the additional exon 14.5 is necessary for the function of UNC-40 in AVM, we performed a CRISPR deletion of this exon (Figure S1A), directly joining exons 14 and 15. We did not observe AVM guidance defects following excision of exon 14.5 (Figure S1B). It is possible that this exon plays a role in a function other than axon guidance in the AVM neuron.

Thus, C. elegans unc-40 has previously unannotated alternative transcripts, with conserved sequence and potential functional impact. In general, our analysis of splicing in individual neuron types can identify novel mRNA sequences.

sax-3/ROBO and the homeobox factor ceh-8 have novel alternative first exons

We also identified novel splice variants in the Slit receptor gene sax-3/ROBO. The sax-3 annotation shows two alternative transcripts, differing by 13 bp in exon 11 length (Fig. 4A) and 29 bp in the length of the annotated 5’ UTR (not shown in figure for clarity). We found a novel alternative splice site that shortens exon 9 by 15 bp. In addition, we detected a novel alternative first exon 5.5 (positioned between annotated exons 5 and 6). LSV quantification shows that the annotated alternative splice site in exon 11 is not used in the neurons sequenced here. By contrast, the novel alternative splice site in exon 9 and the alternative first exon 5.5 are both differentially expressed in broad subsets of neuron types in our data set (Fig. 4B, showing AVL and AVM as examples). We confirmed the in vivo expression of both alternative first exons by RT-PCR (Fig. 4C, D). Both of these novel events affect coding potential: the alternative splice site in exon 9 alters the amino acid sequence of the intracellular domain of SAX-3, whereas the alternative first exon 5.5 generates a short isoform (SAX-3S) lacking four of the five Ig domains, but encoding its own signal peptide in frame with the remainder of the protein (Fig. 4E). We examined the locus of the C. briggsae and C. brenneri orthologs in bulk RNA-Seq data, and find a remarkable conservation of the gene structure, including the novel exon 5.5 (Fig. 4F). These orthologous exons 5.5 encode an identical amino acid sequence (Fig. 4G).

Fig. 4: Detection of novel alternative first exon in sax-3.

A Structure of the gene sax-3/ROBO indicating the novel exon 5.5 and alternative 5’ splice site at the end of exon 9 (green), and the annotated alternative 3’ splice site at the start of exon 11 (red). Bottom: Enlarged 3’ end of the locus. B Local quantification of LSVs in the representative neurons AVL and AVM for the novel alternative first exon 5.5 (left), the novel alternative splice site (middle), and the annotated alternative splice site (right). Violin plots represent the posterior distribution computed by MAJIQ. C, D Validation by RT-PCR of the novel exons in cDNA from whole animals (N2 wild type). Alternative first exon 5.5 (C), top: Location of primers relative to the exon. Left, the primer pair only amplifies the annotated transcript including exons 1-5; right, the primer pair only amplifies the novel transcript using exon 5.5 as first exon. Arrowheads: the expected product sizes (307 bp, 286 bp). Alternative 5’ splice site in exon 9 (D), top: Location of the primers relative to the event. Left, the primer pair only amplifies the novel transcript with shorter exon 9; right, the primer pair only amplifies the annotated transcript with longer exon 9. Arrowheads: expected product sizes (242, 248 bp). Three biological replicates along with a control without reverse transcriptase. Molecular weights reported in base-pairs. Source data are provided as a Source Data file. E Protein structure of SAX-3, indicating the impact of the novel splice variants. Top: Overall structure of annotated isoform SAX-3A. Bottom: Overall structure of the novel short isoform starting at novel exon 5.5 (SAX-3S). The position of the novel splice site in exon 9 is indicated. F Structure of the orthologs of sax-3 in C. briggsae (Cbr-sax-3) and C. brenneri (Cbn-sax-3). Top: Gene models. Bottom: RNA-Seq data. Red arrowheads denote the orthologs of exon 5.5. G Protein sequence alignment of the short isoforms of SAX-3 orthologs from C. elegans, C. briggsae, and C. brenneri.

Finally, we identified a novel alternative first exon in the homeobox transcription factor ceh-8 (Figure S2A, B). Interestingly, this transcript would lead to translation of a protein with a truncated homeobox domain (Figure S2C). This finding is consistent with data from recent studies that also show that alternative splicing can lead to partial or complete loss of a DNA-binding domain8,53. The ceh-8 locus is not well conserved in C. briggsae and C. brenneri, precluding direct comparison of alternative transcripts. Together, these examples demonstrate that our approach can detect novel alternative first exons. Such events also contribute to transcript diversity, but are generated by different biological mechanisms than other forms of alternative splicing. While most alternative splicing is performed by the spliceosome, alternative first exons are the result of alternative promoter usage.

Global detection of novel splicing events across neuron types

Given these examples of novel isoform detection, we sought to identify candidate novel splice junctions across our data set. We used STAR to generate a preliminary list of 1,026,619 unannotated splice junctions54. We filtered these junctions using multiple criteria to focus on well-supported novel junctions, for example by requiring high expression relative to neighbor genes (see Methods). We also leveraged the biological replicates in our data set, requiring novel splice junctions to be present in at least half the samples from a single neuron type, minimum of two. This analysis yielded 1722 novel junctions (Supp. Data 2). Attaching novel junctions to annotated genes is a transcript discovery task which is challenging using short reads, and current tools do not reach high accuracy48. However, each novel junction must belong to a gene in its immediate neighborhood; we provide a list of all neighbor genes (Supp. Data 2). We also sought to estimate the number of genes containing novel splicing events, without precise knowledge of which gene each junction belongs to. As multiple junctions may belong to the same novel transcript, we expect the number of affected genes to be less than the number of novel splice junctions. We thus estimated the number of genes containing novel splicing events under three assumptions: (1) minimizing the number of novel junction-containing genes that could explain the observed pattern, (2) maximizing this number, or (3) estimating an average number of novel junction-containing genes by selecting random samples, under the assumption that the novel junctions are uniformly distributed among the genes. Following this procedure, we estimate that approximately 1361 genes (at least 1353 and at most 1368 genes) contain novel junctions detected in our data. Overall, this analysis indicates that many novel splice sites and mRNA isoforms remain to be described, and provides a list of reliable candidates for future study.
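The random-sampling estimate described above (assumption 3) can be sketched as follows. This is one possible reading of the procedure: each novel junction is assigned uniformly at random to one of its neighbor genes, the number of distinct genes is counted, and the count is averaged over many draws. The function name, the number of draws, and the junction and gene identifiers are all hypothetical; the minimum and maximum bounds are not covered here.

```python
import random
from statistics import mean

def estimate_gene_count(neighbors_by_junction, n_draws=1000, seed=0):
    """Average number of distinct genes hit when each junction is assigned
    uniformly at random to one of its candidate neighbor genes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    draws = []
    for _ in range(n_draws):
        chosen = {rng.choice(genes) for genes in neighbors_by_junction.values()}
        draws.append(len(chosen))
    return mean(draws)

# Toy input: three novel junctions and their candidate neighbor genes.
toy = {
    "junction_1": ["geneA"],
    "junction_2": ["geneA", "geneB"],
    "junction_3": ["geneC"],
}
estimate = estimate_gene_count(toy)
```

In the toy input, the answer is either two genes (if junction_2 belongs to geneA) or three (if it belongs to geneB), so the sampled average falls between those two bounds.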

Detection of differential AS between neuron types identifies genes associated with neuronal excitability

Our analysis of known differential alternative splicing events (ric-4, elp-1, daf-2; Figs. 1, 2) and identification of novel events (unc-40, sax-3, ceh-8; Figs. 3, 4) demonstrate that our data set can be used to identify instances of differential alternative splicing (DAS) in the C. elegans nervous system. To identify candidate DAS events across all genes and neuron types, we quantified differential event usage using MAJIQ47. Specifically, within its Local Splicing Variation framework, MAJIQ models the relative usage of splice junctions that share a common source or target exon. Across the 55 neuron types in our data set, we detected 1940 genes displaying DAS. To validate these findings, we compared the genes detected here to a list of 542 genes compiled from the literature (Supp. Data 3, see Methods) and found a large overlap of 461 genes (85%; Fig. 5A). This comparison indicates that the novel instances of DAS we detect are indeed strong candidates.

Fig. 5: Differential AS between neuron types.

A Euler plot of the number of genes presenting differential alternative splicing (DAS) in this analysis, and from a literature review (Supp. Data 3). Source data are provided as a Source Data file. B Proportion of DAS genes within families of neuronally significant genes56. The number of genes in the family is indicated inside each bar. * Significantly enriched, # significantly depleted (one-sided hypergeometric test with Benjamini-Hochberg correction for multiple comparisons). Exact p-values reported in Source Data file. C Proportion of DAS genes per neuron pair, among genes co-expressed in the neurons of this pair. D Number of DAS genes plotted against number of differentially expressed genes for each neuron pair. The line corresponds to a linear fit, the shaded area to the 95% confidence interval. E Estimate of the number of genes in which DAS can be detected relative to the number of neuron types sequenced. Each subsampling was repeated 10 times; the box plots are centered on the median and between the first and third quartiles, the whiskers extend to the furthest value no further than 1.5 times the interquartile range. Source data are provided as a Source Data file.

Next, we sought to explore the function of candidate DAS genes. Gene Ontology analysis55 showed enrichment of terms related to neuronal function (Figure S3A). To investigate the specific role of DAS, we examined major neuronal functional gene classes56. We found that the prevalence of DAS is highly variable by gene class. For example, the majority of potassium channels and voltage-gated calcium channels undergo differential alternative splicing, whereas ribosome subunits and neuropeptide-encoding genes tend to be similarly spliced across neuron types (Fig. 5B). This analysis suggests that DAS is preferentially used to fine tune neuronal excitability, enhancing functional diversity among different neuronal types.
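As an illustration of the enrichment statistics reported for gene families (one-sided hypergeometric test followed by Benjamini-Hochberg correction), the sketch below implements both steps with the Python standard library. The function names, the background size, and the family counts are hypothetical and chosen only for demonstration; testing for depletion would use the lower tail instead.

```python
from math import comb

def hypergeom_upper_tail(k, family_size, das_total, population):
    """P(X >= k): probability of seeing at least k DAS genes in a family of
    `family_size` genes drawn from `population` genes, `das_total` of them DAS."""
    upper = min(family_size, das_total)
    favorable = sum(
        comb(das_total, i) * comb(population - das_total, family_size - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(population, family_size)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up example: 1000 background genes, 200 of them DAS.
families = {"family_A": (10, 7), "family_B": (20, 4), "family_C": (15, 3)}
p_values = [hypergeom_upper_tail(k, size, 200, 1000) for size, k in families.values()]
adjusted = benjamini_hochberg(p_values)
```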

Global patterns and prevalence of DAS

With a list of DAS events in hand (Fig. 5A), we examined global patterns of DAS across neuron types. We performed pairwise comparisons of all neuron types in our data set, and assessed the proportion of DAS genes among the genes co-expressed in both neurons of the pair (Fig. 5C, full table in Source Data file). Clustering these data revealed that a group of 10 ciliated sensory neurons (ASK, ADF, ASG, ASEL, ASER, AFD, BAG, AWA, AIY, AWB) has similar global patterns of AS. In addition, some pairs of neurons with similar functions have similar splicing profiles (DA-VB, DD-VD, AVM-PVM, AVH-AVK) (Figure S3B). Interestingly, another group of 6 neurons comprising I5, LUA, OLQ, OLL, PHA, and PVC also displays similar AS profiles (Fig. 5C), even though they do not share known functional or morphological characteristics. These data suggest that at least in some cases, neurons with similar characteristics adopt similar patterns of DAS, presumably to support specific functional specialization.
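The pairwise quantity used for this comparison, the proportion of DAS genes among the genes co-expressed in both neurons of a pair, can be sketched in Python as below. The data structures and the function name are assumptions made for illustration; the gene labels in the example are placeholders.

```python
from itertools import combinations


def das_proportion(expressed, das_pairs):
    """expressed: dict neuron -> set of expressed genes.
    das_pairs: dict frozenset({n1, n2}) -> set of genes DAS between n1 and n2.
    Returns dict (n1, n2) -> proportion of co-expressed genes that are DAS."""
    result = {}
    for n1, n2 in combinations(sorted(expressed), 2):
        coexpressed = expressed[n1] & expressed[n2]
        if not coexpressed:
            continue  # proportion undefined without shared genes
        das = das_pairs.get(frozenset((n1, n2)), set()) & coexpressed
        result[(n1, n2)] = len(das) / len(coexpressed)
    return result


# Hypothetical toy input with placeholder gene names
expressed = {"ASK": {"a", "b", "c", "d"}, "AFD": {"a", "b", "c"}, "PVC": {"x"}}
das_pairs = {frozenset(("ASK", "AFD")): {"a", "d"}}
proportions = das_proportion(expressed, das_pairs)
```

The resulting table of proportions, one value per neuron pair, is the kind of matrix that can then be clustered to reveal groups of neurons with similar splicing profiles.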

Besides DAS, gene expression patterns are highly correlated with neuron type. In fact, gene expression patterns alone can be used to group single cells into clusters corresponding to individual neuron types34. Given the very strong association between gene expression and neuron type, we wondered if DAS patterns and gene expression patterns are related. In this case, both gene expression and alternative splicing might encode the same information reflecting the underlying structure of neuronal cell identities. To test this model, we compared (for each neuron pair) the most strongly differentially expressed (DE) genes to the most strongly DAS genes, and found limited overlap (Figure S3C). In addition, for each pair of neurons, we compared the number of DE genes to the number of DAS genes and found only a weak correlation (Fig. 5D). Together, this analysis indicates that DAS and gene expression are two largely independent dimensions of neuron type identity57.

We then aimed to determine the number of DAS genes in the entire nervous system. As our dataset covers 55 neuron types out of 119 classes (118 canonical classes, plus ASEL/ASER), we used a downsampling approach to estimate the number of detected DAS genes depending on the number of neuron types sequenced (Fig. 5E). By projection, we estimate that 3192 genes are differentially alternatively spliced within the nervous system (see Methods), corresponding to about one quarter of all genes expressed in neurons.
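The downsampling step can be illustrated with the Python sketch below, which counts the genes detected as DAS in random subsets of neuron types. It assumes that DAS calls are available per neuron pair and that a gene counts as detected when it is DAS in at least one pair within the subset; the function name and inputs are hypothetical. The extrapolation to the full nervous system described in the Methods is not reproduced here.

```python
import random


def subsample_das_counts(das_by_pair, neurons, size, repeats=10, seed=0):
    """For `repeats` random subsets of `size` neuron types, count genes that are
    DAS in at least one pair of neurons within the subset.

    das_by_pair: dict frozenset({n1, n2}) -> set of DAS genes (assumed input)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(repeats):
        subset = set(rng.sample(sorted(neurons), size))
        genes = set()
        for pair, das_genes in das_by_pair.items():
            if pair <= subset:  # both neurons of the pair were sampled
                genes |= das_genes
        counts.append(len(genes))
    return counts
```

Repeating this for increasing subset sizes gives the distribution of detected DAS genes as a function of the number of neuron types sequenced.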

Sequence features of alternative splicing

Alternative splicing can take many forms, with implications for both regulatory mechanisms and biological effects. To assess the representation of different alternative splice types in the C. elegans nervous system, we grouped events into canonical event types (Fig. 6A). We found that alternative splice sites, cassette exons, and alternative first exons are well represented in the genome. By contrast, alternative last exons and coordinated multiple exon skipping are relatively rare. Using the software SUPPA to estimate a binary Percent Spliced-In (bPSI) for each event, we assessed the prevalence of differential alternative splicing within each event type and found that DAS is well-represented among all forms of alternative splicing. To compare the prevalence of neuron-specific versus tissue-specific event regulation, we retrieved published tissue-specific TRAP-Seq data39 and processed it using our pipeline. This analysis showed a significantly smaller number of tissue-regulated compared to neuron-regulated events (Fig. 6A), which may reflect a biologically meaningful increase in the amount of AS in the nervous system4,5, the higher statistical power resulting from the greater number of neuronal samples, or both (see Discussion). Our approach, which relies on existing genome annotations, detected fewer instances of intron retention compared to an independent analysis of the same dataset58. As intron retention cannot be identified through junction-spanning reads and requires careful consideration, we do not further discuss this event type. Overall our analysis indicates that neurons use all available mechanisms to increase their molecular diversity.

Fig. 6: Landscape of alternative splicing event types.

A Number of events by type. Bar length indicates the number of canonical events of each type predicted from genome annotation. Dark gray: events that did not display DAS between neuron types in our dataset; dark red: events DAS in at least one pair of neurons; dark blue: events DAS between tissues, based on published tissue-specific TRAP-Seq data, but not between individual neuron types in our data; light gray: events that could not be measured in either dataset. B Distribution (density plot) of length or conservation score of particular sequences within splicing events, for DAS events (red) vs non-DAS events (gray). Only features with a statistically significant difference between DAS and non-DAS events are represented here. Vertical dashed lines represent the median of each group. Lengths in basepairs (bp). C Histogram of cassette exon lengths, separating DAS exons (red) and non-DAS exons (gray). The vertical dashed bar at 27 bp delimits microexons. D For each exon (each dot), proportion of neuron pairs where the exon is differentially AS between the two neuron types (proportion calculated among the neuron pairs in which the exon-containing gene is expressed in both neurons of the pair). p = 0.0035 (two-sided Wilcoxon test). E Potential impact of nucleotide addition on open reading frame. For alternative splice sites and cassette exons overlapping a CDS, the color indicates whether inclusion of the alternative event results in a nucleotide addition that is a multiple of 3 (lilac) or is not a multiple of 3 (green), separating events DAS between neuron types. Alt. 3’ ss, p = 0.023; alt. 5’ ss, p = 0.053; cassette exon, p = 0.41 (Chi-square test, comparison of the proportion of PFS events between DAS and non-DAS events). F Location of events within the gene body. For alternative splice sites and cassette exons, the color indicates non-PFS (lilac) and PFS events (green). The density curve indicates the distance from the event to the closest extremity (start or end) of the gene. ***: p < 0.001 (two-sided Wilcoxon test, comparison of distance from extremity between PFS and non-PFS events, test statistics reported in Source Data). Source data are provided as a Source Data file.

Next, we explored the usage and specificity of event types by analyzing the distribution of absolute differences in bPSI (\(|\Delta {bPSI}|\)) across neuron types (Figure S4A, full table in Source Data file). For all canonical event types, one-third to two-thirds of the events displayed a large (\(|\Delta {bPSI}| > 0.5\)) change in at least one pair of neurons. The Gini index is a measure of specificity, reflecting the number of neuron pairs displaying a substantial \(|\Delta {bPSI}|\). Most events with large \(\max (|\Delta {bPSI}|)\) did not exhibit a large Gini index, suggesting broad usage of both splice variants.
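As an illustration of how a Gini index captures specificity, the Python sketch below applies the standard Gini coefficient to a list of per-pair \(|\Delta {bPSI}|\) values for one event. Using the textbook formula here is an assumption; the exact computation used in the study is described in its Methods and may differ. The function name is hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0 means the values are evenly spread across neuron pairs (broad usage);
    values near 1 mean the signal is concentrated in few pairs (high specificity)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

Under this definition, an event with a large change in a single neuron pair scores high, whereas an event with similar changes across many pairs scores low even if its maximum change is large.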

Do sequence features also affect differential AS?

Previous work has found a role of sequence features in distinguishing AS events from constitutive splicing15,16,17,18,19. Do sequence features also affect differential AS? To address this question, we examined the association of broad sequence features with differential AS. For each event type, we delineated the genomic regions composing the event locus. For example, for cassette exons, we considered the alternative exon as well as the two flanking introns. For each of these genomic regions, we measured its length, GC content, and conservation score. We compared the resulting values between differentially spliced events vs. events where we did not detect differential AS. In total, this resulted in 72 comparisons (Supp. Data 4), of which 6 were statistically significant (Fig. 6B, Table 1). Strikingly, four of the significant differences were for alternative first exons. Alternative first exons that display differential AS between neuron types have longer exons, a less conserved distal intron, and a longer distal intron. Alternative first exons are features of transcriptional regulation rather than post-transcriptional splicing. These data indicate that sequence features surrounding first exons may be highly variable to support fine-tuning expression of alternative isoforms at the level of transcription.

Table 1 Sequence features presenting a statistically significant difference between DAS events and non-DAS events. Two-sided Wilcoxon tests with Benjamini-Hochberg correction for multiple comparisons

Alternative splicing of microexons

By contrast, cassette exons presenting differential AS in the nervous system appeared shorter on average. This observation is reminiscent of recent findings that microexons, with a length of 27 bp or less, are differentially spliced between C. elegans tissues39 and frequently display neuron-specific inclusion59. We thus focused more closely on the differential splicing of microexons and found that, of the 75 microexons in the annotated genome, 71 are measurable in our dataset. Of those 71, we found that 53 (75%) displayed differential splicing in the nervous system, as opposed to 59% of the larger alternative exons (Fig. 6C, full list of annotated cassette exons available in Source Data file) (p = 0.0136 by proportion test). Furthermore, we asked if microexons showed more differential AS than other exons (Fig. 6D). Indeed, microexons were on average DAS in 11% of the neuron pairs tested, as opposed to 7% for longer exons (p = 0.0035, Wilcoxon test). Correspondingly, these microexons displayed a modest increase in specificity (Figure S4B). Overall, this analysis supports the previously reported importance of microexon inclusion in the nervous system.

Impact of alternative splicing on coding potential

We then sought to examine the potential impact of alternative splicing on the corresponding protein isoforms. We focused on alternative splice sites and cassette exons, restricting our analysis to events overlapping the coding sequence. Among the events present in the genome annotation and intersecting a coding sequence, 21% of alternative 3’ splice sites, 37% of alternative 5’ splice sites, and 24% of cassette exons result in addition of a number of nucleotides that is not a multiple of 3 (potentially frame-shifting, PFS). Further examination of these PFS events reveals their ability to disrupt the coding sequence of the gene, leading to protein isoforms with alternative N- and C-termini (Figure S5). We examined the proportion of PFS events that were also DAS. Alternative splice sites displayed a weak depletion of PFS events among DAS events (Fig. 6E, Table 2), while cassette exons did not show any meaningful difference in proportion of PFS events. We further hypothesized that PFS events resulting in substantial changes to the protein sequence would be subject to negative selection. Indeed, PFS events are concentrated near the 5’ and 3’ ends of the gene, primarily affecting the sequences of the N- or C-terminus of the protein (Fig. 6F).
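The potentially frame-shifting (PFS) criterion used above reduces to a divisibility check on the number of added nucleotides. The short Python sketch below shows it; the function names and example lengths are hypothetical.

```python
def is_potentially_frame_shifting(added_nt):
    """True when the inserted length is not a multiple of 3."""
    return added_nt % 3 != 0


def pfs_proportion(added_lengths):
    """Proportion of events whose inclusion adds a non-multiple-of-3 length."""
    flags = [is_potentially_frame_shifting(n) for n in added_lengths]
    return sum(flags) / len(flags)


# Hypothetical added lengths in nucleotides; 100 and 8 are PFS
example = pfs_proportion([27, 100, 54, 8])
```

The check is only "potential" because an event outside the reading frame, or one compensated by another event in the same transcript, does not necessarily change the protein.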

Table 2 Proportion of PFS events among DAS events. Chi-square tests with Holm correction for multiple comparisons

Overall, our analysis indicates that differential alternative splicing is associated with specific features of the pre-mRNA. For alternative first exons, which are associated with an increase in intron and exon length, these features likely include transcriptional start sites. Cassette exons displaying DAS, by contrast, tend to be shorter than constitutive exons. In this case, shortness may be a result of selection for protein integrity, with only minor insertions or deletions well-tolerated.

Splicing regulatory network

Differential alternative splicing between neuron types is likely regulated in part by differential expression of splice factor (SF) genes. Our data enables the concurrent measurement of DAS and of gene expression in the same samples across 55 neuron types. We reasoned that these measurements might enable elucidation of a splicing regulatory network that links splice factor expression to DAS. To perform this analysis, we first sought a single measurement that could quantify DAS across genes and neuron types. For this purpose, we restricted our analysis to cassette exons—exons which are either included or excluded from the final transcript (e.g., unc-40 cassette exon in Fig. 3). For each cassette exon, we computed an exonic Percent Spliced-In measure (ePSI) which captures the extent to which the exon is included in each individual neuron type (see Methods). We also compiled a list of 239 putative SF genes (Supp. Data 5) and quantified their expression (Supp. Data 6). Our list includes well-studied splice factors, as well as genes with only a speculative role in splicing regulation. Since SF genes themselves are heavily AS, we separately quantified expression of each SF transcript. These two measurements—ePSI and SF expression—constitute the input data for our model.

We constructed a covariance matrix which assesses, for each cassette exon ePSI and each SF transcript, how they covary over all 55 neuronal types in our data. In principle, the inverse of this covariance matrix is a precision matrix corresponding to the splicing regulatory network. However, the covariance matrix cannot be directly inverted due to the underdetermined system of equations that comprises many more covariates (SFs and ePSIs) than observations (neuron samples). Thus, we sought to estimate the precision matrix (Fig. 7A).
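The reason the covariance matrix cannot be inverted directly can be checked on a toy example: with n observations, the sample covariance of p > n covariates has rank at most n - 1. The Python sketch below verifies this with exact arithmetic from the standard library. The data are hypothetical, and the sketch only demonstrates the singularity; it does not implement the regularized precision matrix estimation used in the study.

```python
from fractions import Fraction


def covariance(rows):
    """Sample covariance matrix of `rows` (observations x variables)."""
    n, p = len(rows), len(rows[0])
    means = [sum(Fraction(r[j]) for r in rows) / n for j in range(p)]
    centered = [[Fraction(r[j]) - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]


def rank(matrix):
    """Matrix rank by exact Gaussian elimination on Fractions."""
    m = [row[:] for row in matrix]
    rk = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, len(m)):
            factor = m[r][col] / m[rk][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[rk])]
        rk += 1
    return rk


# Hypothetical toy data: 3 samples (observations) x 5 covariates
data = [
    [1, 4, 2, 8, 5],
    [3, 1, 7, 2, 6],
    [2, 9, 4, 3, 1],
]
cov = covariance(data)
cov_rank = rank(cov)  # at most n - 1 = 2, so the 5 x 5 matrix is singular
```

Sparse estimators such as the graphical lasso sidestep this problem by adding a penalty that makes the estimation well posed.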

Fig. 7: Splicing Regulatory Network.

A Schematic representation of the approach: The splicing and expression measurements in each sample are grouped in a single data matrix, to compute a covariance matrix. The precision matrix is estimated, whose side quadrants represent the weights of a splicing regulatory network linking Splice Factors (SF) to cassette exons. B Comparison of metrics for selected parameters. The four metrics are represented across all methods. The Frobenius loss was inverted such that for all metrics, higher is better. The arrow represents the selected “best” method. C The four metrics plotted against sparsity for a range of penalties. The shaded area highlights the selected penalty of 0.25, corresponding to a sparsity of 97%. The red triangles correspond to values significantly different from the permuted data. D, E Subnetworks centered on exon 5 of C07A12.7 (D) and exon 11.5 of daf-2 (E). Blue diamond: cassette exon, red ellipse: putative splice factor displaying a non-zero weight in the network, green ellipse: splice factor identified from the literature to regulate this exon. Source data are provided as a Source Data file.

To select our method of precision matrix estimation, and optimize the hyperparameters, we used 5-fold cross-validation and computed four metrics (Figure S6A, see Methods). First, the Frobenius norm loss reflects the ability of the method to capture correlations. Second, the Fraction Explained Variance (FEV) of the PSI reconstruction using regression coefficients60 reflects the ability to capture relationships between SF expression and ePSI. Third, we compiled a ground truth dataset of splicing regulatory interactions for comparison to the network structure (Supp. Data 7). Finally, we used a scale-free criterion on the structure of the network61. Using these metrics, we compared the following combinations of approaches. For precision matrix estimation, we considered the glasso62, QUIC63, CLIME64, and SCIO65 algorithms. For input to these algorithms, we considered either the measured ePSI or reconstructed counts. For normalizing the range of these inputs, we considered applying a Z-score or nonparanormal transformation66. Lastly, to handle missing values we considered k-nearest neighbors interpolation or median interpolation. Based on our evaluation metrics in a cross-validation, we chose to use the following methods: glasso as the precision matrix estimator algorithm, ePSI as input, nonparanormal truncated transformation, and k-nearest neighbors imputation (Fig. 7B, arrow, values in Supp. Data 8 and Source Data file), selecting 4-nearest neighbors (Figure S6B).
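Two of the evaluation metrics can be written compactly. The Python sketch below gives textbook definitions of a Frobenius norm loss between two matrices and of the fraction of explained variance; these are assumptions for illustration, as the exact definitions used for model selection are given in the Methods and may differ in detail (for example, in which matrices are compared). The function names are hypothetical.

```python
from math import sqrt


def frobenius_loss(a, b):
    """Frobenius norm of the difference between two equally sized matrices."""
    return sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


def fraction_explained_variance(observed, predicted):
    """1 - residual sum of squares / total sum of squares."""
    mean = sum(observed) / len(observed)
    ss_total = sum((o - mean) ** 2 for o in observed)
    ss_resid = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1 - ss_resid / ss_total
```

In a cross-validation setting, both quantities would be computed on the held-out fold, so that a lower Frobenius loss and a higher explained variance indicate a model that generalizes beyond the samples used for fitting.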

We then used all the training and validation data, together with the selected methods, to train a final model. We selected an optimal penalty for the glasso algorithm and assessed the model’s performance on the as yet unseen test data using a permutation test approach. First, we generated permuted datasets by randomizing the input cassette exon quantifications between samples in the training and validation data. For each permuted dataset, we tested a range of glasso penalty values and recalculated the model based on these unstructured data. We compared the performance metrics of these randomized models to those of the model trained on the unaltered data, at each glasso penalty value. For most metrics, the model fails to appropriately reconstruct the validation set when the input data are unstructured, indicating that the model captures relevant relationships (Figure S6C). We also calculated a permutation test p-value that reflects the relevance of the model for each metric and selected a glasso penalty value that optimized performance (Fig. 7C, Supp. Data 9). Our final model establishes a virtual network relating putative splice factors to splicing of cassette exons (Supp. Data 10), with a sparsity of 97% and 80% of the variance of ePSI explained. We compiled a “ground truth” dataset of genes whose splicing has been shown to change upon mutation of Splice Factors (see Methods). Using this dataset, we find that our splicing model reaches a True Positive Rate of 8.1% and a False Positive Rate of 4.6%. However, these evaluative metrics are limited by the scarcity of ground truth information on AS (see Discussion).
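The two evaluation steps described above, a permutation test p-value and true/false positive rates against a ground truth, can be sketched in Python as follows. The add-one correction in the p-value and the representation of the network as a set of edges are assumptions made for this illustration; the function names and example values are hypothetical.

```python
def permutation_p_value(observed, permuted_scores):
    """Fraction of permuted scores at least as good as the observed one
    (higher = better), with an add-one correction that avoids p = 0."""
    at_least = sum(1 for s in permuted_scores if s >= observed)
    return (at_least + 1) / (len(permuted_scores) + 1)


def rates(predicted, truth, universe):
    """True and false positive rates of predicted edges against a ground truth.

    predicted, truth, universe: sets of (splice_factor, target) edges, where
    `universe` holds every edge that could be evaluated."""
    positives = truth & universe
    negatives = universe - truth
    tpr = len(predicted & positives) / len(positives)
    fpr = len(predicted & negatives) / len(negatives)
    return tpr, fpr


# Hypothetical scores: one of four permuted models matches the real one
p = permutation_p_value(0.8, [0.1, 0.2, 0.9, 0.3])
```

A sparse network necessarily recovers only a fraction of known interactions, so the true positive rate is best read alongside the false positive rate rather than on its own.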

One application of our splicing network is discovery of potential splice factors. To this end, we examined the putative splice factors with the most central position in the network, ordered by number of connections to cassette exons. We found that many of the most central genes are indeed splice factors known to act in C. elegans neurons (Table 3). For example, mbl-1/Muscleblind67, the hnRNP hrpa-150, and the CELF family gene unc-7568,69 are all known splice factors in C. elegans neurons. Our analysis indicates that these key factors likely regulate differential alternative splicing across genes and neuron types, at least of cassette exons. Other central nodes are genes not previously known to function in neuronal splicing. For example, we identified the CELF family gene etr-1 as a key splice factor. etr-1 has known roles as a splice factor in muscle70,71, and our recent data from single-cell RNA-Seq suggests expression in a restricted set of neurons34. Together, these data suggest that etr-1 might play a novel role in regulating neuronal DAS, similar to its known role in muscle. Our analysis also placed the transcription factor sma-9 and melo-1/periphilin 1 in a central position. Neither sma-9 nor melo-1 has an established role in splicing regulation. Thus, central nodes in our network, besides known splice factors, may indicate novel regulators of alternative splicing in neurons.

Table 3 Top 20 putative splice factors with highest network degree
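Ranking putative splice factors by network degree amounts to counting, for each factor, the cassette exons to which it is connected by a non-zero weight. A minimal Python sketch follows, assuming the network is available as a mapping from (splice factor, exon) pairs to weights; the function name, exon labels, and weights are hypothetical.

```python
def rank_by_degree(weights):
    """weights: dict (splice_factor, exon) -> weight from the SF-exon quadrant
    of the precision matrix. Returns (splice_factor, degree) pairs sorted by
    decreasing degree; factors without any non-zero weight are omitted."""
    degree = {}
    for (sf, exon), w in weights.items():
        if w != 0:
            degree[sf] = degree.get(sf, 0) + 1
    return sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical weights for three putative splice factors
toy = {
    ("unc-75", "exonA"): 0.3,
    ("unc-75", "exonB"): -0.1,
    ("mbl-1", "exonA"): 0.2,
    ("etr-1", "exonA"): 0.0,
}
ranking = rank_by_degree(toy)
```

Both positive and negative weights count toward the degree here, since either sign indicates a conditional dependence between splice factor expression and exon inclusion.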

A second application of our network is to analyze the regulation of specific cassette exons. We examined the subnetwork of putative SFs connected to C07A12.7 exon 5 (Fig. 7D) and daf-2 exon 11.5 (Fig. 7E). For C07A12.7, the known regulation by unc-7569 was correctly detected. In addition, 12 other putative interactors appeared in the network: acin-1, aly-2, C16C10.4, etr-1, exos-4.1, F59A7.8, fubl-3, plrg-1, ruvb-2, sftb-2, sma-9, and uaf-2. Similarly, for daf-2 exon 11.5, we detected the known regulation by ptb-1, rsp-2, and unc-75 (we did not detect regulation by asd-1, exc-7, hrpa-1, hrpf-1, and rsp-8)50,72. We obtained an additional 17 predicted interactors: C16C10.4, cpsf-1, etr-1, hrpk-1, lsm-7, mbl-1, melo-1, moa-2, pab-1, pes-4, prpf-4, rnp-6, sma-9, snrp-40.1, sqd-1, srrt-1, and uaf-1. In addition, we examined the subnetwork of putative splice factors connected to unc-16. We quantified 6 separate events, corresponding to 4 exons (two pairs of events correspond to the same exon with differing flanking introns) in the gene unc-16, represented as individual nodes in the subnetwork (Figure S6D). We identified the known regulation by unc-75 and exc-7, but did not detect regulation by prp-4069,73,74. In addition, we found 58 network connections that do not correspond to known regulatory interactions. Thus, our splicing network can generate detailed hypotheses about the regulation of specific splicing events.

Discussion

In this study, we present an atlas of alternative splicing (AS) in the C. elegans nervous system, at single neuron type resolution, across 55 neuron types. We develop a toolset of analytic approaches and a website to facilitate their use by the research community. We show that our approach yields results in agreement with existing data, and identifies multiple new examples of alternative splicing in genes with key roles in neuronal function. Systematic quantification reveals broad patterns of differential AS, particularly affecting ion channels, shaping the landscape of neuronal mRNAs separately from gene expression. We provide a broad description of the C. elegans neuronal alternative transcriptome, and observe that microexons are notably differentially alternatively spliced (DAS) in the nervous system. Finally, we compute a splicing regulatory network to formulate new hypotheses on splice factor regulation of differential alternative splicing, focused on cassette exons.

With the wider availability of single-cell RNA-Seq methods, large efforts have been made in recent years to establish atlases of gene expression in C. elegans31,32,33,34,35,36,37. However, these sequencing approaches often present a strong 3’ end bias, which makes them unsuitable for AS analysis. A more restricted set of studies has focused on AS. In C. elegans, these studies have been performed mostly at the level of whole animals, or through comparison between tissues39,40,41. Our data are broadly consistent with these previous analyses, for example confirming the reported rarity of alternative last exons in the C. elegans transcriptome (Fig. 6A), the orthogonality of gene expression and alternative splicing (Fig. 5D, Figure S3C), and an increased differential splicing of shorter cassette exons (Fig. 6C, D)39.

While our analysis largely aligns with other work, two apparent discrepancies are our global counts of intron retention and alternative first exons. For intron retention, we found relatively fewer instances than a parallel analysis of the same data58. A key underlying factor is our use of the tool SUPPA, which only measures intron retention relative to annotated transcripts, and does not detect unannotated instances. For alternative first exons, we found more instances than an analysis of a separate data set39. In this instance, SUPPA treats multiple alternative first exons in the same gene as distinct events, rather than grouping them as a single event. Overall, these discrepancies highlight the difficulty of computationally interpreting alternative splicing at genome scale. For this reason, we encourage systematic inspection of the raw data visualization we provide, to guide investigation of splicing at any specific locus.

For further comparison, we processed the previously published TRAP-Seq dataset39 using SUPPA and examined the events detected as DAS between serotonergic neurons and the rest of the nervous system both in our dataset and in the TRAP-Seq dataset (not shown). We found little overall agreement between the two datasets, attributable to several causes. First, as our dataset did not measure the MI serotonergic neuron, restricted but strong event inclusion in this neuron (or exclusion) may be evident in the TRAP-Seq measurements and not in our data. Conversely, not all serotonergic neurons share the same splicing patterns, thus the increased resolution of our dataset can detect additional DAS events. For example, in the case of mvk-1 exon 12, the TRAP-Seq data correctly indicates inclusion in most neurons, but our higher-resolution data detects exclusion in HSN and DA. Finally, only 2 TRAP-Seq replicates are available per tissue, and the sequencing data displays a 3’ bias. This limits statistical power and can preclude correct detection of events far from the 3’ end of the gene. For example, daf-2 exon 13 appears DAS from our data, but, being located 12.5 kb away from the gene 3’ end, the TRAP-Seq data does not provide sufficient coverage to analyze its usage. Thus, while the two datasets complement each other by enabling measurements inside and outside the nervous system, our data allows deeper, higher-resolution interrogation of events within the nervous system. Studies in other organisms have examined AS at the tissue21 or single-cell level in the nervous system5,75,76,77,78,79. A feature of our dataset is its genome-wide scope and its high-resolution analysis of AS across many individual neuronal types. Thus, this analysis complements and extends previous studies.

Our analysis discovered novel, previously unannotated, exons in at least 1353 genes. Interestingly, in the case of unc-40, orthologous exons had been previously annotated in C. briggsae, despite evidence suggesting that Cbr-unc-40 is expressed at a lower level than its C. elegans ortholog44,80. This highlights the noisy nature of gene and transcript annotation, which is highly dependent both on the available data and the algorithms used for transcript reconstruction. Consequently, researchers interested in the study of a particular gene should always consult available sequencing data in C. elegans and related species to identify potential unannotated isoforms.

We used a computational approach to define a Splicing Regulatory network. A challenge in evaluating the performance of such an approach is the availability of a ground truth, which is instrumental both for selecting hyperparameters during model training, and for evaluating the accuracy of predictions. Although we used a compiled list of interactions between splice factors and splice events as one metric (Supp. Data 7), this approach has strong limitations. First, some of the data are not obtained from neurons, and even the neuronal data does not reach the resolution of individual neuron types. Second, we aggregated this data at the level of entire genes rather than individual events. Third, a substantial portion of this data corresponds to genes whose splicing pattern changes as a consequence of SF (Splice Factor) mutation; these changes may reflect a direct regulatory interaction or be an indirect effect of the SF knock out. In contrast to previous attempts that did not evaluate alternative approaches, and either did not justify the initial selection of hyperparameters21,24, or used a single criterion5,22, we adopted an innovative approach to select the hyperparameters of the model using four separate criteria. Our principled tuning method may be beneficial for other models with limited access to ground truth. However, the final step of evaluating the model remains challenging. The True Positive Rate of our final model indicates that only 8% of the regulatory interactions present in the ground truth data are captured by the model. While the ground truth for the selected examples of C07A12.7, daf-2, and unc-16 is more reliable, coming from low throughput studies, these examples also highlight the model’s imperfections. Nevertheless, due to its high sparsity, our model can strongly narrow the candidate space, making it a uniquely valuable tool for hypothesis generation. 
Overall, our final network captures many known interactions between putative splice factors and splice events, and constitutes a powerful tool for the exploration of novel splicing regulatory mechanisms.

Most previous efforts to model the role of splice factors on DAS were performed with RNA-Seq data sets derived either from whole animals24 or from tissue-specific samples21,22,26,29,81,82. Our approach enables profiling the transcriptome at single cell type resolution. However, each sample is composed of thousands of individual cells, at a single time point. Previous work reports that 18%83 or 30%40 of AS events are differentially regulated during development, which cannot be evaluated in our dataset. The recent progress in single-cell RNA-Seq facilitates higher resolution studies in multiple conditions5,20; in particular, current progress in single-cell long-read sequencing offers great opportunities to extend this analysis84,85,86. These single-cell approaches allow for more specificity, capture the heterogeneity within a population of single cells of the same type, and potentially allow for whole-transcript analysis, promising more comprehensive, higher-quality atlases in the near future.

Our approach has cataloged DAS in single neuron types for almost half the canonical neuron types in the C. elegans nervous system. Our data indicate that alternative splicing affects the function of key neuronal genes, and reveal substantial novel splicing diversity. These splicing events might control subtle, cell-specific alterations of neuronal form and function that are not accessible by broader forms of genome regulation. Thus, we expect studies of gene function to be informed by these data about differential alternative splicing in specific neuron types. In addition, from the perspective of splicing itself, we use the diversity we have discovered to model regulatory mechanisms that mediate the control of differential alternative splicing.

Methods

FACS isolation and sequencing

For single-cell type bulk RNA-Seq, C. elegans strains expressing a fluorescent protein or combination of fluorescent proteins in a single neuron type were dissociated and sorted into Trizol as described previously34,42,87. RNA was extracted and sequencing libraries were prepared using the Ovation® SoLo® RNA-Seq library preparation kit, yielding even coverage along the gene body, as described previously43. Libraries were sequenced with an Illumina HiSeq 2500 or NovaSeq 6000 (Supp. Data 1). The dataset covers 211 samples corresponding to 55 neuron types, and an additional 8 control samples from whole animal sorts. All neuron types were sequenced in 3-6 replicates, except ADF and M4 (1 replicate each), and OLL and PVD (2 replicates each). Four samples failed quality control and were excluded from subsequent analyses.

Following trimming, the RNA-Seq reads were aligned to the C. elegans genome (Wormbase WS289) using STAR 2.7.7a54 with option --outFilterMatchNminOverLread 0.3 (all other settings left to default). Deduplication was performed using UMI-tools88. The pipeline code is available at: https://github.com/cei/bulk_align.

RT-PCR

For RT-PCR, mixed stage N2 C. elegans were grown following standard methods89. Plates were washed with M9 buffer, and 100 µL of worm suspension was added to 400 µL of Trizol and immediately frozen in liquid nitrogen. Samples were stored at −80 °C. RNA was extracted with chloroform in Phase Lock Gel Heavy tubes (QuantaBio), treated with DNase I (Thermo Fisher), and purified with the Macherey-Nagel Nucleospin kit following the manufacturer’s instructions. Finally, cDNA was synthesized using the Affinity Script Multiple Temperature cDNA Synthesis kit (Agilent), following the manufacturer’s instructions with oligo-dT primers. For each sample, an additional tube was prepared with identical composition, adding water in place of reverse transcriptase. The resulting cDNA was stored at −20 °C. RT-PCR reactions were performed with Phusion polymerase (New England Biolabs), following the manufacturer’s protocol. The primers used are listed in Table S1.

CRISPR excision of unc-40 exon 14.5

To knock out exon 14.5 of unc-40, we injected wild type N2 animals with a Cas9 mix using two crRNAs to create two double-strand breaks near the splice sites of exons 14 and 15, and a single-stranded repair template. The corresponding sequences are listed in Table S1. The deletion was confirmed by sequencing, and the resulting strain was crossed with the fluorescent marker for touch receptor neurons. Young adult individuals were examined under a Zeiss Axioplan 2 microscope, imaging approximately 20 individuals of each genotype on 3 separate days. The strains used are listed in Table S2.

Discovery of novel splice junctions

STAR produces junction files, providing a list of splice junctions detected in the processed sample, along with the measured count, annotation status, and other information54. We only considered novel junctions (not present in the annotation) that were flanked by canonical splice site motifs. In addition, we only considered splice junctions supported by reads with 12 bp overhang on each side of the junction (STAR’s default value for --outSJfilterOverhangMin). We defined the neighborhood of a splice junction as the set of genes within 60 bp of either splice site, regardless of the strand. We then retained, in each sample, the junctions fulfilling the following criteria:

  • Junction no longer than 1000 bp

  • At least 2 supporting reads (uniquely mapped, with at least 12 bp overhang following STAR default)

  • Not in the neighborhood of an rRNA gene

  • In the neighborhood of a protein-coding gene, long-non-coding RNA, or pseudogene

  • Has at least 20% as many reads as the most highly detected splice junction from the neighbor genes
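The per-sample criteria above can be sketched as a single predicate. This is a minimal illustration, not the pipeline code: the `Junction` record, its field names, and the biotype labels are hypothetical, the neighborhood lookup (genes within 60 bp) is assumed to have been done upstream, and the five criteria are assumed to apply jointly.

```python
from dataclasses import dataclass, field

# Hypothetical record for one novel junction in one sample; the field names
# are illustrative and do not correspond to columns of STAR's junction files.
@dataclass
class Junction:
    length_bp: int                  # junction length
    unique_reads: int               # uniquely mapped reads, >= 12 bp overhang
    neighbor_biotypes: set = field(default_factory=set)  # biotypes of neighbor genes
    max_neighbor_junction_reads: int = 0  # top junction count among neighbor genes

# Assumed biotype labels for the genes a junction is allowed to neighbor.
ALLOWED_BIOTYPES = {"protein_coding", "lncRNA", "pseudogene"}

def passes_sample_filters(j: Junction) -> bool:
    """Return True if the junction meets all five per-sample criteria."""
    return (
        j.length_bp <= 1000
        and j.unique_reads >= 2
        and "rRNA" not in j.neighbor_biotypes
        and bool(j.neighbor_biotypes & ALLOWED_BIOTYPES)
        and j.unique_reads >= 0.2 * j.max_neighbor_junction_reads
    )
```

For instance, a 500 bp junction with 10 reads next to a protein-coding gene whose best junction has 40 reads passes (10 is at least 20% of 40), whereas the same junction with 5 reads does not.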

After processing each sample with the above filters, we aggregated the junctions across samples and conducted a second round of filtering. We kept novel splice junctions that were detected in at least half the samples from a single neuron type (with a minimum of two samples from a single neuron type). This analysis identified 1722 novel junctions robustly expressed in our dataset. Reliably attributing each junction to a novel transcript is challenging with available methods48. Instead, we only attempted to estimate the total number of genes containing novel junctions, without determining their precise identity. To this end, we used three approaches under different assumptions.

First, we determined the minimal number of novel junction-containing genes that could explain the observed pattern by formulating an integer programming problem. For \(S={{\mathrm{1722}}}\) novel junctions and \(G={{\mathrm{1430}}}\) genes that are neighbors to one of these junctions, we denote the neighbors of junction \(s\) as the set of genes \({N}^{s}\). Then, we define the binary variables \({({x}_{s,g})}_{s\in [1,S],g\in [1,G]}\) for each combination of a splice junction \(s\) and a gene \(g\), with \({x}_{s,g}=1\) if the splice junction \(s\) is not part of the gene \(g\), and \({x}_{s,g}=0\) if the splice junction is in the gene. Thus, since each splice junction is part of a single gene, we can define the following constraint for each splice junction:

$$\forall s\in \left[1,S\right],\,{\sum}_{g\in {N}^{s}}(1-{x}_{s,g})=1$$
(1)

Since \(({x}_{s,g})\) are binary, this is equivalent to having \({x}_{s,g}=1\) for all neighboring genes of \(s\) except one, i.e. a junction only belongs to a single gene. Further, we define the objective function as the total number of genes that do not have any novel splice junction. To this end, we define a new set of variables \({y}_{g}={\prod }_{\{s|g\in {N}^{s}\}}{x}_{s,g}\) such that gene \(g\) contains a splice junction iff \({y}_{g}=0\). The objective function is thus the sum of all \({y}_{g}\) variables. Since the \({y}_{g}\) are products, the objective function can be linearized into a sum by introducing additional constraints90:

$$\begin{array}{c}{y}_{g}\le {x}_{s,g},\,\forall {s|g}\in {N}^{s}\\ {y}_{g}\ge {\sum }_{s}{x}_{s,g}-\left(n-1\right)\end{array}$$
(2)

Where \(n\) is the number of \({x}_{s,g}\) variables in the product \({y}_{g}\). Having formulated this integer programming problem, we used the R package lpSolve to maximize the objective function, thus minimizing the number of novel junction-containing genes while ensuring that each junction is counted in a single gene. Conversely, we also used lpSolve to minimize the objective function, thus maximizing the number of junction-containing genes while ensuring that each junction is counted in a single gene. This provides bounds on the total number of novel junction-containing genes.
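To make concrete what the two optimizations bound, the following sketch enumerates every possible assignment on a toy example. It is not the integer program solved with lpSolve: exhaustive enumeration is only feasible for a handful of junctions, and the junction and gene names are hypothetical.

```python
from itertools import product

def gene_count_bounds(neighbors):
    """Return (min, max) number of distinct junction-containing genes over all
    ways of assigning each junction to exactly one of its neighboring genes.

    neighbors: dict mapping a junction id to the list of its neighbor genes.
    Brute force, exponential in the number of junctions: toy inputs only.
    """
    junctions = list(neighbors)
    counts = [
        len(set(assignment))
        for assignment in product(*(neighbors[s] for s in junctions))
    ]
    return min(counts), max(counts)

# Hypothetical toy neighborhood: j1 can only belong to geneA, while j2 and j3
# are ambiguous between two genes each.
toy = {
    "j1": ["geneA"],
    "j2": ["geneA", "geneB"],
    "j3": ["geneB", "geneC"],
}
lower, upper = gene_count_bounds(toy)
```

In this toy case the lower bound is 2 genes (for example, j1 and j2 in geneA, j3 in geneB) and the upper bound is 3 genes (one junction per gene).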

To obtain a single, average, number of novel junction-containing genes, we further used a resampling approach. We consider the set of genes neighboring each novel junction. Under the assumption that the novel junctions are uniformly distributed among the genes, we randomly selected a single gene for each junction, and evaluated the total number of genes with novel junctions for this sample. We repeated this procedure 1000 times to estimate an average number of novel junction-containing genes. The complete code, along with an illustrative toy example, is available at: https://gicengenproject/cei/novel_junctions.
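The resampling procedure can be sketched as follows. This is an illustration under the stated uniformity assumption, read here as each junction picking one of its neighbor genes with equal probability; the function name, the seed handling, and the toy input are hypothetical and are not taken from the repository.

```python
import random

def mean_genes_by_resampling(neighbors, n_iter=1000, seed=0):
    """Average number of distinct junction-containing genes over random
    assignments of each junction to one of its neighbor genes.

    neighbors: dict mapping a junction id to the list of its neighbor genes.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    totals = []
    for _ in range(n_iter):
        # Draw one gene per junction, then count the distinct genes drawn.
        chosen = {rng.choice(genes) for genes in neighbors.values()}
        totals.append(len(chosen))
    return sum(totals) / n_iter

# Hypothetical toy neighborhood.
toy_neighbors = {
    "j1": ["geneA"],
    "j2": ["geneA", "geneB"],
    "j3": ["geneB", "geneC"],
}
average = mean_genes_by_resampling(toy_neighbors)
```

By construction, the average falls between the minimal and maximal numbers of junction-containing genes for the same neighborhoods.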

DAS with MAJIQ

Local quantification of AS and the analysis of differential AS were performed with MAJIQ 2.347. We built a configuration file using the reference annotation from Wormbase (WS289), strandness forward, and one experiment per neuron type (grouping the biological replicates by experiment). We subsequently ran majiq build with parameter --min-experiments 2 (i.e. a splice junction is retained if it passes filters in at least two replicates from the same neuron), keeping the other options at their default values. We performed mPSI quantification, and delta mPSI quantification between each pair of neuron types with the default parameters.

For the subsequent analysis, we grouped the resulting delta mPSI files from all neuron pairs, obtaining 16,379,082 individual comparisons (for a given LSV in a given pair of neurons). We filtered comparisons to retain only those where the LSV-containing gene was expressed in both neurons of the pair, using the threshold “3” we previously defined based on single-cell RNA-Seq34, resulting in 8,783,582 comparisons (corresponding to 3787 measurable genes). We define a comparison as DAS if the “probability of not changing” (computed by MAJIQ deltapsi) is lower than 0.05, and the “probability of changing” is higher than 0.5, corresponding to 928,985 comparisons. Finally, we define a gene as DAS if it contains at least one DAS comparison, yielding 1940 DAS genes.
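The two-threshold definition of DAS can be written as a short predicate. This is a minimal sketch, assuming each comparison is available as a record carrying the two probabilities reported by MAJIQ deltapsi; the `gene` key and the example values are hypothetical.

```python
def is_das(prob_non_changing: float, prob_changing: float) -> bool:
    """A comparison is DAS if both probability thresholds are met."""
    return prob_non_changing < 0.05 and prob_changing > 0.5

def das_genes(comparisons):
    """Return the set of genes with at least one DAS comparison.

    comparisons: iterable of dicts with keys 'gene' (hypothetical name),
    'probability_non_changing' and 'probability_changing'.
    """
    return {
        row["gene"]
        for row in comparisons
        if is_das(row["probability_non_changing"], row["probability_changing"])
    }

# Hypothetical comparisons: geneA has one DAS comparison, geneB has none.
example = [
    {"gene": "geneA", "probability_non_changing": 0.01, "probability_changing": 0.90},
    {"gene": "geneA", "probability_non_changing": 0.60, "probability_changing": 0.10},
    {"gene": "geneB", "probability_non_changing": 0.04, "probability_changing": 0.40},
]
```

Here geneB is not counted: its comparison passes the first threshold but not the second, which shows that the two conditions are not redundant.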

To predict genes expected to display DAS in the nervous system, we compiled a list of 11 studies reporting individual genes91,92,93,94,95,96,97 or performing transcriptomic analyses in mutant backgrounds disrupting AS in neurons67,74,98,99. This resulted in a list of 759 genes (Supp. Data 3), of which 542 are measurable in our dataset (expressed in the neurons sequenced and quantified by MAJIQ). Gene Ontology analysis was performed using a background list of 10,312 genes expressed in at least two neurons sequenced here (as per the threshold above).

To compare the DAS genes to differentially expressed (DE) genes, the DE genes were obtained from integration of single-cell RNA-Seq data with this dataset, as described in ref. 42. For each neuron pair, the DE genes were ordered by absolute fold change and the genes with the 100 highest values were selected. The DAS genes were ordered by absolute delta mPSI of their highest LSV, and the 100 genes containing the highest values were selected. Out of 595 neuron pairs, we could select exactly 100 top DAS genes for 432 pairs; 15 pairs had 101 genes (because of a tie in the highest delta mPSI), and 148 pairs did not have 100 DAS genes. We only represented pairs where we could select 100 top DAS genes.

To predict the total number of DAS genes in the nervous system, we randomly selected between 2 and 55 neuron types among those sequenced, and estimated the number of DAS genes that could be detected. We repeated the procedure 10 times for each number of neurons. We then performed a linear regression of the number of DAS genes detected vs the logarithm-transformed number of neurons subsampled, yielding the relationship \({N}_{{genes}}=-302+731\cdot \log ({N}_{{neurons}})\). The total number of DAS genes for 119 neuron types is then 3192. We applied the same procedure to estimate the number of genes expressed in the subsampled neurons (above threshold “3” as above). We obtained the relationship \({N}_{{genes}}=4756+2127\cdot \log ({N}_{{neurons}})\), and estimate a total of 14,920 genes expressed in the C. elegans nervous system.
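As a worked check of the extrapolation, the two fitted relationships can be evaluated at 119 neuron types. The sketch below assumes the logarithm is the natural logarithm, which reproduces the reported total of 3192 DAS genes. Because only the rounded coefficients printed in the text are used, the second estimate comes out within a few genes of the reported 14,920 rather than matching it exactly.

```python
import math

def extrapolate(intercept: float, slope: float, n_neurons: int) -> float:
    """Evaluate N_genes = intercept + slope * ln(N_neurons)."""
    return intercept + slope * math.log(n_neurons)

# Coefficients as printed in the text (rounded), evaluated at 119 neuron types.
das_total = extrapolate(-302, 731, 119)
expressed_total = extrapolate(4756, 2127, 119)

print(round(das_total))        # DAS genes extrapolated to 119 neuron types
print(round(expressed_total))  # expressed genes extrapolated to 119 neuron types
```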

Transcript quantification with StringTie

For the transcript-level quantification, we used StringTie 2.2.1 using the annotation from Wormbase (WS289), without novel transcript discovery. We applied it to the aligned reads from STAR, following deduplication. Code available at: https://github.com/cengenproject/stringtie_quantif.

Website

The output of STAR was used to generate browser tracks. First, the junction counts generated by STAR were processed, and the number of uniquely mapped reads was kept as junction count. Junctions longer than 25,000 bp, and junctions with a count lower than 3 reads were filtered out. Junction counts for individual samples were combined into neuron type average, and global sum, minimum, and maximum. The global tracks underwent a second filtering, requiring 13 and 21 reads for the maximum and sum tracks respectively. The tracks were exported to bed format using the R package rtracklayer100. Second, the bam files generated by STAR were used to generate the coverage tracks using custom code, and exported to bigWig format with rtracklayer. The individual tracks can be used in JBrowse246 or downloaded from the website. All code is available at: https://github.com/cengenproject/splicing_browser.

To explore the local splicing quantification in an event-centric manner, the results of the MAJIQ analysis (see above) were loaded in VOILA according to the authors’ instructions (see https://majiq.biociphers.org). To enable exploration in a neuron-centric manner, we further developed a custom R Shiny application operating on the tsv files generated by MAJIQ deltapsi. For comparison of a pair of neurons, the user can directly apply a threshold on the estimates of “probability_changing” and “probability_non_changing” reported by MAJIQ. For comparison of two sets of neurons, the application selects the estimated mPSI of each event in each neuron of both sets, performs a t-test between these two sets, and reports the p-value corrected for multiple comparisons by the Benjamini-Hochberg method. The assumptions of the t-test may or may not be met depending on the exact sets of neurons chosen, so there is no guarantee that the FDR is appropriately controlled. We observed that this approach robustly selects events of interest in the majority of use cases. However, we caution against using it to draw conclusions about arbitrary sets of neurons, and added a warning to this effect in the application. The source code of this application is available at: https://github.com/cengenproject/das_by_neuron.
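The multiple-comparison step can be illustrated in isolation. The sketch below implements the standard Benjamini-Hochberg adjustment on a list of p-values; it is not the application's code (which is written in R), the t-test itself is omitted, and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values, one per splicing event.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

Each adjusted value is the raw p-value scaled by the number of tests over its rank, capped at 1 and forced to be non-decreasing with rank.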

For the transcript-level quantification, the quantifications from StringTie (see above) were loaded in a custom R Shiny application, source code available at: https://github.com/cengenproject/isoform_compare/.

Binary DAS with SUPPA2

We used SUPPA 2.3101 according to the documentation. We generated all local event types (SE, SS, MX, RI, FL) with --boundary S based on the genome annotation (Wormbase WS289). We then quantified the event bPSI by running psiPerEvent using the StringTie quantifications (see above) as expression file and a threshold of 5 TPMs. Finally, we split the resulting TPM and bPSI files by neuron type, before computing delta bPSI for each pair of neurons, using the settings --method empirical --combination --gene-correction and a threshold of 0.3 bPSI. We considered an event differentially AS in a neuron pair if it displays a p-value lower than 0.1 (p-values corrected for multiple comparisons by SUPPA with the -gc option) and a delta bPSI higher than 0.3.

To explore sequence features of DAS events, we used the R package GenomicFeatures102 to delineate the genomic regions of interest, the package Biostrings to extract the sequence and calculate its GC content, and the PhastCons conservation score track downloaded from UCSC103. The features were compared using a Wilcoxon test followed by Benjamini-Hochberg correction for multiple comparisons, and we applied a threshold of 0.05 to consider a comparison significant. To explore the impact of AS events on protein sequence, we only considered cassette exons and alternative 3’ and 5’ splice sites. We used the function join_overlap_inner_directed in the plyranges R package to determine if the added sequence (the cassette exon, or the overhang in case of alternative splice sites) overlaps with the CDS from the Wormbase WS289 annotation. We then computed the length of the added sequence and determined if it is a multiple of 3.

For microexons, we focused on cassette exons. We compared the number of exons with DAS using a two-proportion test with Yates’ continuity correction. To compare the proportion of neuron pairs with DAS, we only analyzed exons detectable in at least 10 neuron pairs (787 out of 913 cassette exons). We used a Wilcoxon test to compare the proportions. Code available at https://github.com/cengenproject/suppa_events.

Reanalysis of the TRAP-Seq dataset

We downloaded the published TRAP-Seq data39 from the Short Read Archive database (GSE106374; we processed the 10 samples SRR6238092-SRR6238101 and did not re-process the input controls) and aligned them to the C. elegans genome (Wormbase WS289) using STAR 2.7.11a with option --outFilterMatchNminOverLread 0.3.

Tissue-level coverage and junction genome browser tracks for each of the 5 tissue types analyzed were generated as described above by aggregating the corresponding 2 biological replicates. Transcript-level quantification with StringTie was performed as described above and used as input for SUPPA splice event quantification.

All code is available at: https://github.com/cengenproject/reanalysis_tissue

Splicing regulatory network

Quantification of ePSI

Cassette events were extracted from the genome annotation using suppa generateEvents101. With an approach adapted from ref. 104, we then used bedtools and grep on the STAR output to count the number of inclusion reads \({N}_{i}\) covering the alternative exon, and the number of exclusion reads \({N}_{e}\) spanning the alternative splice junction. We then computed exonic Percent Spliced-In (ePSI) from normalized read counts \(\overline{{N}_{i}}\) and \(\overline{{N}_{e}}\) based on exon length \({L}_{{exon}}\) and read length \({l}_{{read}}\):

$$\begin{array}{c}\overline{{N}_{i}}=\,\frac{{N}_{i}}{{L}_{{exon}}+{l}_{{read}}-1}\\ \overline{{N}_{e}}=\frac{{N}_{e}}{{l}_{{read}}-1}\\ {{{\rm{PSI}}}}=\,\frac{\overline{{N}_{i}}}{\overline{{N}_{i}}+\overline{{N}_{e}}}\end{array}$$
(3)
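Equation (3) translates directly into a short function. This is a minimal sketch and not the repository code; the function name is hypothetical, and returning `None` when no inclusion or exclusion read is observed is an assumption of this illustration.

```python
def epsi(n_incl: int, n_excl: int, exon_length: int, read_length: int):
    """Exonic Percent Spliced-In from inclusion and exclusion read counts.

    Inclusion reads are normalized by the number of positions at which a read
    can overlap the exon, exclusion reads by the number of positions at which
    a read can span the skipping junction.
    """
    norm_incl = n_incl / (exon_length + read_length - 1)
    norm_excl = n_excl / (read_length - 1)
    total = norm_incl + norm_excl
    if total == 0:
        return None  # no informative reads: ePSI is undefined (assumption)
    return norm_incl / total

# Hypothetical example: 100 bp exon, 101 bp reads.
# 200 inclusion reads / 200 positions = 1.0; 100 exclusion reads / 100 = 1.0.
value = epsi(200, 100, exon_length=100, read_length=101)
```

In this example the raw counts differ twofold, but the normalized counts are equal, so the exon is estimated to be included half of the time.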

All relevant code is stored in the Github repository https://github.com/cengenproject/quantif_exon_skipping.

Before use in the model, we removed measurements of a cassette exon in a neuron type if the exon-containing gene is not expressed in that neuron type, based on a thresholding integrating this dataset with single-cell RNA-Seq42. Additionally, we only considered neuron types for which we had 3 or more biological replicates. We also filtered the cassette exons, keeping only events covered by more than 20 reads, measured in more than 70 samples from 23 neuron types, and presenting differential splicing between neuron types (standard deviation above 0.3).

Quantification of putative splice factor transcripts

We compiled a list of putative splice factors, available in Supp. Data 5. The transcripts are quantified using StringTie48, without novel transcript discovery (see above). StringTie gives quantifications in Transcripts Per Million (TPM), which undergo a log10 transformation with a pseudocount of 1 before further processing.

Precision matrix estimation

Here we describe the procedure to select our network construction method (Figure S6A). We build a data matrix where the 127 rows correspond to samples and the 730 columns correspond to cassette exon ePSI (172 events) or splice factor log-TPM (558 transcripts). We perform a first split: 30% of the rows (39 samples) are kept as a testing set. The other 70% of samples undergo 5-fold cross-validation: each fold contains 17 or 18 samples, the training is performed on 4 folds, the validation on the held-out fold. As the splicing of a cassette exon in a neuron type can only be meaningfully measured if the gene containing that exon is expressed in that neuron, the ePSI matrix contains missing values that are first imputed (using the column median or k-nearest neighbors). The training matrices \({{{{\rm{SE}}}}}_{{{{\rm{train}}}}}\) (containing the skipped exons) and \({{{{\rm{SF}}}}}_{{{{\rm{train}}}}}\) (containing the splice factors) are then transformed (using Z-score or NPN transformation), and a covariance matrix \({S}_{{{{\rm{train}}}}}\) is computed from the transformed values. We store the variables used for transformation (e.g. the mean and standard deviation for Z-scoring and distribution quantiles for the NPN method) for inverting the transformations later to yield predictions in the original data range. For permutation tests, the \({{{{\rm{SE}}}}}_{{{{\rm{train}}}}}\) values are randomized within an event (i.e. within a column) after transformation, but before computing the covariance matrix. The covariance matrix is then used to estimate the precision matrix \({\widehat{\varOmega }}_{{{{\rm{train}}}}}\) (using glasso, QUIC, CLIME, or SCIO), which is inverted to recover the estimated covariance matrix \({\hat{S}}_{{{{\rm{train}}}}}\).

Separately, the validation fold matrices \({{{{\rm{SE}}}}}_{{{{\rm{valid}}}}}\) and \({{{{\rm{SF}}}}}_{{{{\rm{valid}}}}}\) are transformed re-using the same parameters as the training folds, to compute the covariance matrix \({S}_{{{{\rm{valid}}}}}\). From the estimated precision matrix, following ref. 60, we extract the quadrants \({\widehat{\varOmega }}_{21}\) (with the splice factors as rows and the cassette exons as columns), and \({\widehat{\varOmega }}_{11}\) (with the cassette exons as rows and columns) and compute \(W=\,{\widehat{\varOmega }}_{21}\,{\widehat{\varOmega }}_{11}^{-1}\), the matrix of regression coefficients. The splicing measurements in the validation set can then be estimated from the splice factors in the validation set and the precision matrix learned from the training set following:

$${\widehat{{{{\rm{SE}}}}}}_{{{{\rm{valid}}}}}={W}^{t}\cdot {{{\rm{S}}}}{{{{\rm{F}}}}}_{{{{\rm{valid}}}}}$$
(4)

Finally, we invert the transformation of \({\widehat{{{{\rm{SE}}}}}}_{{{{\rm{valid}}}}}\) using the stored transformation variables (e.g. the mean and standard deviation for Z-scoring and distribution quantiles for the NPN method) to get back to the initial scale.

Model components

We tried several approaches to develop an optimal model. The data matrix \({{{{\rm{SE}}}}}_{{{{\rm{train}}}}}\) can be constructed from ePSIs, a ratio of inclusion and exclusion counts. We reasoned that a model may have a better performance when directly estimating the inclusion and exclusion counts rather than the ratio. Thus we reconstructed counts by multiplying the ePSI with the total count for that exon, and used either ePSIs or reconstructed counts as the columns of \({{{{\rm{SE}}}}}_{{{{\rm{train}}}}}\).

As our downstream methods are incompatible with the presence of missing values, we need to remove them from \({{{{\rm{SE}}}}}_{{{{\rm{train}}}}}\). We used median imputation, where the missing value is replaced by the median of the column (i.e. the median of the cassette exon across samples). Alternatively, we used a k-nearest neighbors imputation implemented by the R package impute.
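Median imputation can be sketched as follows. This is an illustration only: the matrix is represented as a list of rows with `None` marking a missing ePSI, which is an assumption of this sketch, and the k-nearest neighbors alternative is not shown.

```python
from statistics import median

def impute_column_median(matrix):
    """Replace each missing value (None) by the median of its column.

    matrix: list of rows (samples), each a list of values (cassette exons).
    Returns a new matrix; the input is left unchanged. A column with no
    observed value raises statistics.StatisticsError.
    """
    n_cols = len(matrix[0])
    medians = [
        median(row[j] for row in matrix if row[j] is not None)
        for j in range(n_cols)
    ]
    return [
        [medians[j] if value is None else value for j, value in enumerate(row)]
        for row in matrix
    ]

# Hypothetical ePSI matrix: 3 samples (rows) by 2 cassette exons (columns).
epsi_matrix = [
    [0.25, None],
    [0.75, 1.0],
    [None, 0.5],
]
imputed = impute_column_median(epsi_matrix)
```

The missing entry of the first exon is replaced by 0.5 (median of 0.25 and 0.75) and that of the second exon by 0.75 (median of 1.0 and 0.5).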

ePSIs (or reconstructed counts) and log-TPMs follow very different distributions, which would distort the covariance computed by simply concatenating them. In addition, they do not follow a normal distribution, making them inappropriate for the downstream algorithms. We thus standardized the training data either with a Z-score transformation, or with nonparanormal transformations66. As our cross-validation procedure requires that we transform the validation set using the parameters from the training set, and that we invert the transformation to obtain values in the original scale, we implemented these transformations in the R package projectNPN, available at https://github.com/cengenproject/projectNPN.
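The requirement that the validation set be transformed with the training parameters, and that predictions be mapped back to the original scale, can be illustrated with the Z-score case. This sketch does not reproduce the interface of projectNPN: the function names are hypothetical, the nonparanormal transformation is not shown, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def fit_zscore(train_column):
    """Learn the transformation parameters from the training data only."""
    return {"mean": mean(train_column), "sd": stdev(train_column)}

def transform(column, params):
    """Apply the stored training parameters to any column (train or validation)."""
    return [(value - params["mean"]) / params["sd"] for value in column]

def inverse_transform(z_column, params):
    """Map transformed values back to the original data range."""
    return [z * params["sd"] + params["mean"] for z in z_column]

# Hypothetical values for one variable: parameters come from the training fold
# and are re-used, unchanged, on the validation fold.
train = [1.0, 2.0, 3.0]
valid = [4.0]
params = fit_zscore(train)
valid_z = transform(valid, params)
valid_back = inverse_transform(valid_z, params)
```

Fitting the parameters on the training fold alone keeps the validation fold from influencing the model.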

Finally, the estimation of the precision matrix can also be performed with several implementations. We used the R packages glasso, QUIC, FLARE (implementing CLIME) and SCIO.

Metrics definitions

First, we compare the covariance matrix measured in the validation set \({S}_{{valid}}\) to the covariance matrix estimated from the training set \({\hat{S}}_{{train}}\), obtained by inverting the estimated precision matrix. We focus on the quadrant containing the covariance between the skipped exons and the splice factors, as we are interested in the ability of our model to capture this relationship. We then compute the Frobenius loss as \({||}{S}_{{{{\rm{valid}}}},12}-{\hat{S}}_{{{{\rm{train}}}},12}|{|}_{F}\) where || ||F represents the Frobenius norm.
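For reference, the Frobenius norm of a difference of two matrices is the square root of the sum of squared element-wise differences. A minimal sketch, with matrices represented as lists of rows and hypothetical example values:

```python
import math

def frobenius_loss(a, b):
    """Frobenius norm of (a - b) for two matrices of identical shape."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )

# Hypothetical 2 x 2 quadrants: the difference is [[0, 0], [3, 4]].
s_valid_12 = [[1.0, 2.0], [3.0, 4.0]]
s_train_12 = [[1.0, 2.0], [0.0, 0.0]]
loss = frobenius_loss(s_valid_12, s_train_12)
```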

Further, we compare the skipping values measured in the validation set \({{{{\rm{SE}}}}}_{{{{\rm{valid}}}}}\) to the values \({\widehat{{{{\rm{SE}}}}}}_{{{{\rm{valid}}}}}\) estimated from the precision matrix. To this end, we first compute the residuals: \({resid}={\widehat{{{{\rm{SE}}}}}}_{{valid}}-\,{{{{\rm{SE}}}}}_{{{{\rm{valid}}}}}\). We then estimate, for each event \(e\), the sum of squared residuals: \({{{{\rm{SS}}}}}_{{{{\rm{err}}}},e}=\sum {{{{\rm{resid}}}}}_{e}^{2}\) and the total sum of squares \({{{{\rm{SS}}}}}_{{{{\rm{tot}}}},{{{\rm{e}}}}}=\sum {\left({{{{\rm{SE}}}}}_{{{{\rm{valid}}}},{{{\rm{e}}}}}-\overline{{{{{\rm{SE}}}}}_{{{{\rm{valid}}}},{{{\rm{e}}}}}}\right)}^{2}\) and define the fraction of explained variance as \({{{{\rm{FEV}}}}}_{e}=1-\frac{{{{{\rm{SS}}}}}_{{{{\rm{err}}}},{{{\rm{e}}}}}}{{{{{\rm{SS}}}}}_{{{{\rm{tot}}}},{{{\rm{e}}}}}}\). We truncate this value at 0 and average it across events to obtain the mean fraction of explained variance.
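The mean fraction of explained variance can be sketched as follows. This is an illustration with hypothetical function names, assuming each event is given as a list of values across validation samples and that its measured values are not all identical (otherwise the total sum of squares is zero).

```python
def fraction_explained_variance(measured, predicted):
    """FEV for one event, truncated at 0.

    measured, predicted: lists of values across the validation samples.
    """
    mean_measured = sum(measured) / len(measured)
    ss_err = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return max(0.0, 1 - ss_err / ss_tot)

def mean_fev(measured_by_event, predicted_by_event):
    """Average the truncated per-event FEV across events."""
    fevs = [
        fraction_explained_variance(m, p)
        for m, p in zip(measured_by_event, predicted_by_event)
    ]
    return sum(fevs) / len(fevs)

# Hypothetical example with two events: the first is predicted perfectly
# (FEV = 1), the second is predicted in reverse order (1 - 8/2 = -3, truncated to 0).
measured = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]
predicted = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
score = mean_fev(measured, predicted)
```

The truncation prevents a few badly predicted events, whose untruncated FEV can be arbitrarily negative, from dominating the average.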

To evaluate the biological relevance of the network edges, we extract the quadrant \({\hat{\varOmega }}_{21}\) of the precision matrix (i.e. the adjacency matrix), binarize it (taking an edge for any non-zero entry), and compare it to a ground truth dataset (Supp. Data 7). We compiled this dataset by a review of the literature, considering that a regulatory interaction between a splice factor and a cassette exon is “true” if a change in splicing was detected upon mutation of the splice factor. Note that this dataset suffers from several limitations, notably these interactions do not necessarily take place in the neurons we sequenced here, and these interactions may correspond to different splicing events within the same target gene. Thus, while we expect a better model to obtain a better match with this data, we do not expect a perfect match. We compute the True Positive Rate (TPR) as the fraction of interactions present in the ground truth that are captured by the model, and the False Positive Rate (FPR) as the fraction of interactions that are absent from the ground truth but reported by the model. As both TPR and FPR decrease with increased sparsity, we report the ratio TPR/FPR.
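The TPR and FPR computation can be sketched with sets of (splice factor, exon) pairs. This is an illustration, not the repository code: the FPR is computed here in the usual sense, as the fraction of candidate pairs absent from the ground truth that the model nevertheless reports, and the set of candidate pairs as well as all names are hypothetical.

```python
def tpr_fpr(predicted, truth, all_pairs):
    """Return (TPR, FPR) for a binarized adjacency matrix.

    predicted: set of (splice_factor, exon) pairs with a non-zero entry.
    truth: set of pairs reported in the ground truth.
    all_pairs: set of all candidate pairs that can be evaluated.
    """
    positives = truth & all_pairs
    negatives = all_pairs - truth
    tpr = len(predicted & positives) / len(positives)
    fpr = len(predicted & negatives) / len(negatives)
    return tpr, fpr

# Hypothetical example: 2 splice factors, 2 exons.
all_pairs = {("sfA", "exon1"), ("sfA", "exon2"), ("sfB", "exon1"), ("sfB", "exon2")}
truth = {("sfA", "exon1"), ("sfB", "exon2")}
predicted = {("sfA", "exon1"), ("sfA", "exon2")}
tpr, fpr = tpr_fpr(predicted, truth, all_pairs)
ratio = tpr / fpr  # undefined when the model reports no false positive
```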

Finally, we seek to constrain the structure of the network. A very sparse network, where each splice factor has at most a single target, or a very dense network, where many splice factors have many targets, would be hard to interpret and likely not capture biologically meaningful interactions. As proposed by ref. 61, we use an approximate scale-free topology criterion. For each splice factor (node in the network), we compute the degree of the node k, and count the number of nodes with the same degree p(k). We then fit a linear regression between \(\log (k)\) and \(\log (p\left(k\right))\), and use the coefficient of determination R2 as a criterion. A high R2 suggests that a power law can describe the node degrees, and that the network is scale-free.
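The scale-free criterion can be sketched as follows. This illustration assumes that nodes with degree zero are left out (their logarithm is undefined) and uses the fact that, for a simple linear regression, R2 equals the squared Pearson correlation; at least two distinct degrees with differing counts are required. The function name and the example degrees are hypothetical.

```python
import math
from collections import Counter

def scale_free_r2(degrees):
    """R2 of the linear fit of log(p(k)) against log(k).

    degrees: list with the degree k of each splice factor node.
    p(k) is the number of nodes having degree k.
    """
    counts = Counter(k for k in degrees if k > 0)
    xs = [math.log(k) for k in counts]
    ys = [math.log(n) for n in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Hypothetical degrees following an exact power law p(k) = 8 / k:
# 8 nodes of degree 1, 4 of degree 2, 2 of degree 4, 1 of degree 8.
degrees = [1] * 8 + [2] * 4 + [4] * 2 + [8]
r2 = scale_free_r2(degrees)
```

With these degrees the points lie on a line of slope −1 in log-log space, so R2 is 1 up to floating-point error.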

All code related to the network modeling is available at: https://github.com/cengenproject/regression_exon_skipping.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.