Background & Summary

Goats are an essential component of global food systems, with a long history of domestication1. As highly adaptable and economically valuable livestock, goats’ muscle development and function have a direct impact on livestock production efficiency and meat quality2,3. Goat meat is rich in high-quality proteins, highly nutritious, and easily digestible, and it offers health benefits such as enhanced immunity and improved metabolism4,5. The longissimus dorsi and biceps femoris muscles play pivotal roles not only in locomotion and weight support but also as key indicators of overall meat quality and nutritional value6,7,8. Moreover, these muscles are crucial for the genetic improvement of meat quality traits in both adult goats and their offspring9,10,11. Studies on fetal muscle development provide valuable insights into early nutritional interventions and breeding strategies, facilitating early optimization of meat quality traits9,12,13,14.

Current research on the longissimus dorsi and biceps femoris muscles in goats has predominantly focused on their anatomical features, muscle development, and genetic basis15,16. However, the lack of long-read, full-length transcriptomic data has hindered studies on gene expression profiles, transcriptional regulation, and alternative splicing mechanisms in these muscles. This limitation prevents the establishment of precise correlations between genetic mechanisms and phenotypic traits17.

Oxford Nanopore Technologies (ONT) full-length transcriptome sequencing is a third-generation single-molecule sequencing technology that measures changes in ionic current as DNA or RNA molecules pass through protein nanopores18. Unlike fluorescence-based sequencing techniques, ONT sequencing does not rely on nucleotide-incorporating enzymes19. This approach enables the accurate identification and analysis of alternative splicing, gene fusions, and novel isoforms, as well as the precise quantification of transcript expression levels20. The ONT platform can generate ultra-long reads of up to 50 kb, allowing for the direct acquisition of complete mRNA sequences21, including 5′ and 3′ untranslated regions (UTRs) and poly-A tails, without fragmentation. This capability facilitates the accurate identification of structural variations, including intron retention and exon skipping22. ONT full-length transcriptome sequencing has been widely applied in transcriptional studies of mouse neural cells23, monkeypox virus-host interactions24, and caprine research, such as chromosome-level genome assembly of cashmere goats25, the impact of solid diets on rumen microbiota and epithelium26, and studies on goat uterine and ovarian tissues27. Transcriptomic data from the longissimus dorsi and biceps femoris muscles have also been used to investigate genetic factors influencing meat quality traits28. However, studies investigating alternative splicing and transcriptional regulatory mechanisms in goat muscles remain limited. The lack of third-generation ONT full-length transcriptome data for goat muscles constrains the comprehensive analysis of complex splicing variants, gene structures, and transcriptional regulatory mechanisms. This limitation hinders a thorough understanding of the genetic basis of meat quality and impedes the development of precise breeding strategies.

Hu Tian goats, a potential Chinese local meat goat breed, are a newly discovered livestock resource in the mountainous areas of Xiangxiang City, Hunan Province, China. They are known for their excellent meat quality and unique flavor compounds. However, the genetic mechanisms underlying their meat quality traits remain unclear. The fetal stage represents the initial phase of an organism’s development, and the maternal environment during pregnancy directly influences fetal growth and development. Compared to studying pregnant does or their fetuses separately, investigating both together provides a more comprehensive perspective on the spatiotemporal dynamics of gene expression during growth and development. Therefore, in this study, we performed ONT full-length transcriptome sequencing of the longissimus dorsi and biceps femoris muscles from pregnant goats and their fetuses at 90 days of gestation (Fig. 1). A total of 169,768,069 valid reads were obtained, of which 127,954,092 were aligned to the reference genome, with an average N50 value of 1042.29 (Fig. 2a). We identified 58,092 full-length transcripts, of which 50,338 were annotated using the NR, UniProt, GO, Pathway, Pfam, and KEGG databases. Additionally, 89,468 alternative splicing (AS) events spanning seven types were identified, and 5,538 potential lncRNAs were predicted in the two muscle tissues using three methods. The generated third-generation full-length transcriptome data not only enrich the transcriptomic database for goats but also lay a solid foundation for subsequent studies on gene function, marker-assisted breeding, and genetic improvement. By sharing this dataset, we aim to advance the academic understanding of goat muscle development and its genetic regulation while contributing to the sustainable development of the livestock industry.

Fig. 1

The flowchart illustrates an overview of the research design, which is divided into two parts: sequencing and data analysis.

Fig. 2

Sequencing Quality Assessment. (a) Distribution of full-length sequences after redundancy removal. The x-axis denotes the read length, while the y-axis represents the number of reads within the corresponding length range. The dashed line indicates the N50 length. (b) Comparative analysis of transcript identification results between all detected transcripts and reference transcripts. The red bars indicate the number of identified genes, whereas the blue bars denote the number of identified transcripts. (c) Heatmap illustrating Pearson correlation coefficients between samples based on all detected transcripts. Both the x-axis and y-axis represent individual samples, and the color intensity reflects the magnitude of correlation between pairs of samples. A redder color signifies stronger correlation. A correlation coefficient approaching 1 indicates greater similarity in expression patterns between samples.
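For readers less familiar with the N50 statistic marked by the dashed line in panel (a), the following minimal Python sketch shows one standard way to compute it: sort the lengths in descending order and report the length at which the running total first reaches half of all bases. The function name and the toy lengths are illustrative assumptions and are not taken from the dataset or from the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    The N50 is the length L such that sequences of length >= L
    together contain at least half of the total number of bases.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy lengths (hypothetical, not taken from the dataset).
example_lengths = [200, 400, 600, 800, 1000]
example_n50 = n50(example_lengths)
```

With the toy lengths above, the total is 3,000 bases, and the two longest sequences (1,000 and 800) already cover more than half of it, so the N50 is 800.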

Methods

Ethical declaration

The experimental protocol and procedures were approved by the animal protection and utilization committee of Hunan Agricultural University (protocol number: HAU ACC 2022120). All animal treatments and experiments were performed according to the recommendations of the guidelines for ethical review of animal welfare in the national standards of the People’s Republic of China (151). These activities did not require any specific permissions and did not involve any endangered or protected species.

Sample Collection and RNA Preparation

All animals in this study (adult female Hu Tian goats and their fetuses) were sourced from the Hu Tian Goat Breeding Farm in Xiangtan City. The goats were raised under normal conditions with ad libitum access to water and feed. Six adult does with similar body weights (22.37 ± 4.93 kg) at 720 ± 30 days old were selected and slaughtered at 90 days of pregnancy. Prior to slaughter, the goats were fasted for 12 hours but allowed free access to water. Trained personnel conducted anesthesia, exsanguination, and skinning in accordance with standard commercial procedures and ethical guidelines at the Hu Tian goat slaughterhouse. To investigate the differences in gene regulation between distinct muscles of fetuses and pregnant does, two different tissues (longissimus dorsi and biceps femoris muscles) were collected, flash-frozen in liquid nitrogen, and stored at −80 °C until RNA extraction. Total RNA was extracted from the tissues using TRK Lysis Buffer (R6834 Total RNA Kit I) according to the manufacturer’s protocol. RNA concentrations were initially assessed using a NanoDrop One spectrophotometer (NanoDrop Technologies, Wilmington, DE), and precise quantification was performed using a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). All procedures related to the goats used in this study were in accordance with the standards of the Laboratory Animal Guidelines for the Ethical Review of Animal Welfare and were approved by the Committee on the Ethics of Animal Experiments of Hunan Agricultural University (HAU ACC 2022120).

Library construction

RNA was extracted from two different tissues of 12 goats and pooled by tissue type to construct a combined library for full-length transcriptome sequencing. Equal amounts of RNA (5 μg per tissue) from each tissue were mixed and sequenced using the ONT platform. Specifically, Oligo(dT12-18) primers were used to reverse transcribe the target mRNA. Full-length cDNA was amplified with low-cycle PCR, sequencing adapters (with motor proteins) were added, and the library was loaded onto R9.4 sequencing flow cells for 48–72 hours of sequencing on the PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK).

Format conversion and data filtering

The raw data generated from Nanopore sequencing were stored in pod5/fast5 format, preserving all original sequencing signals. Basecalling was conducted using Dorado software (parameter: --no-trim) to convert the raw data into fastq format sequence files.

To enhance the reliability of downstream analyses, the raw sequencing data were filtered during the basecalling process. Based on the average quality score of sequencing reads, the data were classified as “pass” (Q ≥ 7) and “fail” (Q < 7). Only the “pass” data, regarded as high-quality sequences, were retained for subsequent analyses.
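The pass/fail split can be illustrated with a short Python sketch. It assumes, as one common convention for Nanopore reads, that the average quality of a read is obtained by averaging the per-base error probabilities and converting the mean back to the Phred scale; the text does not state how the basecaller computes this average, so the convention, the function names, and the example quality strings are all assumptions made for illustration.

```python
import math


def mean_read_quality(phred_scores):
    """Average per-base error probabilities, then convert back to Phred."""
    if not phred_scores:
        raise ValueError("read has no quality scores")
    mean_error = sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
    return -10 * math.log10(mean_error)


def classify_read(quality_string, threshold=7.0, offset=33):
    """Label a read 'pass' (mean Q >= threshold) or 'fail' (mean Q < threshold).

    quality_string is the FASTQ quality line, assumed to be Phred+33 encoded.
    """
    scores = [ord(ch) - offset for ch in quality_string]
    return "pass" if mean_read_quality(scores) >= threshold else "fail"


# Hypothetical quality strings: '+' encodes Q10, '%' encodes Q4,
# 'I' encodes Q40 and '#' encodes Q2.
labels = {q: classify_read(q) for q in ["++++", "%%%%", "I#"]}
```

Under this convention a single very poor base pulls the read average down sharply: the two-base read with Q40 and Q2 has a mean quality of about 5 and is labelled "fail", even though the arithmetic mean of its scores is 21.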

Alignment to the reference genome and consistent sequence identification

The Pinfish software (version 0.1.0; default parameters) was employed to rapidly construct a non-redundant transcript set from full-length sequences. Initially, minimap229 (version 2.17-r941; parameters: -ax splice -uf -k14) was used to align full-length sequences to the reference genome, producing BAM files. Subsequently, the spliced_bam2gff program converted the BAM files into GFF files. The cluster_gff, collapse_partials, and polish_clusters programs were then used for clustering, deduplication, and error correction, yielding a set of high-confidence sequences. Alignment statistics were computed using Samtools30 (version 1.11; parameter: flagstat).
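The alignment and statistics steps can be sketched as follows. The Python functions below only assemble the command lines, using the minimap2 parameters and the Samtools subcommand reported above; the file names are hypothetical placeholders, and the conversion of the minimap2 output to sorted BAM is not shown because the text does not specify how it was done. The mapping-rate helper is plain arithmetic applied to the read counts quoted in the Background & Summary.

```python
def minimap2_command(reference_fasta, reads_fastq):
    """Spliced alignment with the parameters reported in the text."""
    return ["minimap2", "-ax", "splice", "-uf", "-k14",
            reference_fasta, reads_fastq]


def flagstat_command(bam_path):
    """Alignment summary statistics with samtools flagstat."""
    return ["samtools", "flagstat", bam_path]


def mapping_rate(aligned_reads, total_reads):
    """Percentage of reads aligned to the reference genome."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100 * aligned_reads / total_reads


# Hypothetical file names, for illustration only.
align_cmd = minimap2_command("goat_reference.fa", "full_length_reads.fq")
stats_cmd = flagstat_command("full_length_reads.sorted.bam")

# Read counts quoted in the Background & Summary.
overall_rate = round(mapping_rate(127_954_092, 169_768_069), 2)
```

With the reported totals, the overall alignment rate works out to about 75.37%. The command lists can be passed to `subprocess.run` once the tools are installed.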

Full-length sequencing and construction of non-redundant transcripts

Pychopper (version 2.4.0; parameters: -Q 7 -z 50) was employed to identify, orient, and trim full-length Nanopore cDNA sequences, as well as to correct fused sequences. Full-length sequences were extracted from high-quality sequencing data, with adapters, barcodes, and primer sequences removed during the identification process. The extracted full-length sequences were subsequently filtered using NanoFilt (version 2.8.0; parameters: -q 7 -l 50) to generate the final full-length sequences for downstream analysis.

The high-confidence sequences were aligned to the reference genome, followed by transcript reconstruction using StringTie31 (version 2.1.4; parameters: --conservative -L -R). The reconstructed transcripts were subsequently merged using StringTie in merge mode (default parameters), yielding a non-redundant transcript dataset (Fig. 2b,c).

Transcript Annotation

To obtain comprehensive functional insights into the non-redundant transcripts, functional annotation was performed using the R package clusterProfiler32. Functional annotation was conducted based on the following databases: Nr33, Pfam34, Uniprot35, KEGG36, GO37, KOG38, COG39, and PATHWAY (Table 2). The Gene Ontology (GO) database characterizes gene functions in three domains: Cellular Component (CC), Molecular Function (MF), and Biological Process (BP). The KOG database, established based on the phylogenetic relationships of eukaryotes, facilitates the orthologous classification of transcripts. Transcripts annotated using the KOG database were categorized into 26 functional groups according to their respective KOG classifications.

Table 1 Summary of the sequencing results obtained from ONT full-length transcriptome sequencing. The effective data were classified based on the average quality score of the sequencing reads into “pass” (Q ≥ 7) and “fail” (Q < 7), with only the “pass” portion being retained for further analysis.
Table 2 Statistics of Full-Length Transcript Annotations.

Prediction of Long Non-Coding RNAs (lncRNAs)

Long non-coding RNAs (lncRNAs) are RNA molecules exceeding 200 nucleotides in length that lack protein-coding potential. The coding potential of newly identified transcripts was predicted using CNCI40 (version 2.0; default parameters), CPC241 (standalone_python3 v1.0.1), and PLEK42 software. CPC2 predicts coding potential by aligning transcripts against known protein databases using BLAST and employs a support vector machine (SVM) classifier to evaluate coding likelihood based on biological features (Fig. 4a).

CNCI effectively differentiates coding from non-coding sequences by analyzing the frequency of adjacent trinucleotides and is capable of processing incomplete and antisense transcripts.

PLEK classifies transcripts as coding or non-coding based on k-mer composition within sequences.
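One way to combine the three predictions is sketched below in Python. It assumes that a transcript is kept as a candidate lncRNA only when all three tools call it non-coding and its length exceeds 200 nucleotides; the text reports that three methods were used and shows their overlap as a Venn diagram, but it does not state the exact combination rule, so the intersection is an assumption, as are the function name and the toy transcript identifiers.

```python
def candidate_lncrnas(transcript_lengths, noncoding_calls, min_length=200):
    """Return transcripts called non-coding by every tool and longer than min_length.

    transcript_lengths: dict mapping transcript ID -> length in nucleotides
    noncoding_calls:    dict mapping tool name -> iterable of transcript IDs
                        that the tool classified as non-coding
    """
    if not noncoding_calls:
        raise ValueError("at least one tool's predictions are required")
    consensus = set.intersection(*(set(ids) for ids in noncoding_calls.values()))
    return {tid for tid in consensus if transcript_lengths.get(tid, 0) > min_length}


# Toy input (hypothetical IDs and lengths, not from the dataset).
lengths = {"t1": 1500, "t2": 180, "t3": 900, "t4": 2500}
calls = {
    "CNCI": {"t1", "t2", "t3"},
    "CPC2": {"t1", "t2", "t4"},
    "PLEK": {"t1", "t2", "t3", "t4"},
}
lncrna_ids = candidate_lncrnas(lengths, calls)
```

In this toy case only "t1" survives: "t2" is called non-coding by all three tools but is shorter than the 200-nucleotide cutoff, while "t3" and "t4" are each rejected by one tool.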

Alternative Splicing (AS) Event Analysis

Non-redundant transcripts were utilized to predict alternative splicing (AS) events. SUPPA243 (parameters: --boundary S -f ioe -e SE SS MX RI FL) was employed to identify AS types in each sample. Seven distinct AS event types were identified: alternative 5′ splice sites (A5), alternative 3′ splice sites (A3), skipped exons (SE), alternative first exons (AF), alternative last exons (AL), mutually exclusive exons (MX), and retained introns (RI) (Fig. 4d).

Percent spliced-in (PSI) values were computed for each AS event based on the gene annotation GTF file and transcript expression levels measured in TPM. Differential AS events between groups were identified by computing ΔPSI (the difference in PSI values between groups) and conducting statistical tests to determine p-values. The occurrence of AS events across transcripts was subsequently analyzed and summarized.
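The PSI calculation can be illustrated with a minimal Python sketch. It follows the usual transcript-based definition, in which the PSI of an event is the summed TPM of the transcripts that contain the inclusion form divided by the summed TPM of all transcripts that define the event. The function names, transcript identifiers, and TPM values are hypothetical, and the statistical test used to assign p-values is not reproduced here.

```python
def psi(tpm, inclusion_transcripts, all_transcripts):
    """PSI = TPM of inclusion-form transcripts / TPM of all event transcripts.

    Returns None when the event's transcripts are not expressed.
    """
    total = sum(tpm.get(t, 0.0) for t in all_transcripts)
    if total == 0:
        return None
    included = sum(tpm.get(t, 0.0) for t in inclusion_transcripts)
    return included / total


def delta_psi(psi_group_a, psi_group_b):
    """Difference in PSI between two groups (group b minus group a)."""
    if psi_group_a is None or psi_group_b is None:
        return None
    return psi_group_b - psi_group_a


# Toy event with two isoforms; only iso1 carries the inclusion form.
maternal_tpm = {"iso1": 30.0, "iso2": 10.0}
fetal_tpm = {"iso1": 10.0, "iso2": 30.0}
event_inclusion = ["iso1"]
event_all = ["iso1", "iso2"]

psi_maternal = psi(maternal_tpm, event_inclusion, event_all)
psi_fetal = psi(fetal_tpm, event_inclusion, event_all)
dpsi = delta_psi(psi_maternal, psi_fetal)
```

In this toy example the inclusion form accounts for 75% of expression in the maternal sample and 25% in the fetal sample, giving a ΔPSI of −0.5.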

Data Records

The raw full-length data were deposited in the NCBI Sequence Read Archive (SRA) under accession number SRP531789. The full-length transcript RNA-Seq data used for correction were deposited in the SRA under accession numbers SRR30617348, SRR30617349, SRR30617350, SRR30617351, SRR30617352, SRR30617353, SRR30617354, SRR30617355, SRR30617356, SRR30617357, SRR30617358, SRR30617359, SRR30617360, SRR30617361, SRR30617362, SRR30617363, SRR30617364, SRR30617365, SRR30617366, SRR30617367, SRR30617368, SRR30617369, SRR30617370, and SRR30617371 (ref. 44).

Technical Validation

Quality control and alignment of sequencing data

Quality control of Oxford Nanopore Technologies (ONT) raw sequencing data was conducted based on sequencing quality scores, with a default threshold of Q7. Reads at or above the threshold (Q ≥ 7) were classified as “pass”, whereas those below the threshold were designated as “fail”. Following sequencing, the data were quantified, and high-quality sequencing reads were retained for downstream analyses (Table 1).

Prediction and Analysis of Novel Transcripts

Non-redundant transcripts were compared against annotated genomic transcripts using gffcompare (version 0.12.1; parameters: -R -C -K -M) to identify previously unannotated transcripts and genes, thereby refining existing genome annotations. Coding sequence (CDS) regions within the newly identified transcripts were subsequently predicted (Table 3).
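As an illustration of how such a comparison can be summarized, the Python sketch below tallies the `class_code` attribute that gffcompare attaches to each transcript in its annotated GTF output (for example, "=" for an exact intron-chain match, "j" for a transcript sharing at least one splice junction with a reference transcript, and "u" for an intergenic transcript). Which class codes were counted as novel in this study is not stated in the text, so the sketch only counts codes; the function name and the GTF lines are hypothetical.

```python
import re
from collections import Counter

CLASS_CODE = re.compile(r'class_code "([^"]+)"')


def count_class_codes(gtf_lines):
    """Count gffcompare class codes on transcript records of a GTF file."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = CLASS_CODE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical GTF records, for illustration only.
example_gtf = [
    '# toy annotated GTF',
    'chr1\tStringTie\ttranscript\t100\t900\t.\t+\t.\ttranscript_id "T1"; gene_id "G1"; class_code "=";',
    'chr1\tStringTie\ttranscript\t100\t950\t.\t+\t.\ttranscript_id "T2"; gene_id "G1"; class_code "j";',
    'chr1\tStringTie\texon\t100\t300\t.\t+\t.\ttranscript_id "T2"; gene_id "G1";',
    'chr2\tStringTie\ttranscript\t500\t1500\t.\t-\t.\ttranscript_id "T3"; gene_id "G2"; class_code "u";',
    'chr2\tStringTie\ttranscript\t5000\t7000\t.\t-\t.\ttranscript_id "T4"; gene_id "G3"; class_code "j";',
]
class_code_counts = count_class_codes(example_gtf)
```

Only `transcript` records are counted, so exon lines and comment lines in the file do not inflate the totals.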

Table 3 Novel transcripts and genes identified in the samples using gffcompare, with the predicted coding sequences (CDS).

Transcript annotation quality and prediction of LncRNAs

Functional annotation was performed using multiple reference databases to characterize the identified transcripts. Among the analyzed sequences, 50,338 full-length transcripts were annotated based on the NR, UniProt, GO, KEGG, Pfam, TF, and Pathway databases (Table 2). The NR database provided the highest number of annotations, covering 43,295 sequences (86%), followed by the GO database with 37,256 sequences (74%) and the Pfam database with 33,920 sequences (67.38%). In the GO database, 37,256 sequences were assigned to three primary GO categories: Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). The top three most enriched GO terms were “regulation of DNA-templated transcription,” “proteolysis,” and “positive regulation of cell population proliferation,” collectively accounting for 1.82% of all enriched terms (Fig. 3). LncRNA identification was conducted using three computational tools: CNCI, CPC2, and PLEK. A total of 5,538 putative lncRNAs were identified, further expanding the transcriptome dataset (Fig. 4a).
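The coverage percentages quoted above can be reproduced with simple arithmetic, taking the 50,338 annotated transcripts as the denominator. The short Python sketch below uses only the counts given in the text; the variable and function names are illustrative.

```python
annotated_total = 50_338
per_database = {"NR": 43_295, "GO": 37_256, "Pfam": 33_920}


def coverage_percent(count, total):
    """Share of annotated transcripts covered by one database, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * count / total


coverage = {
    name: round(coverage_percent(count, annotated_total), 2)
    for name, count in per_database.items()
}
```

The resulting values (86.01%, 74.01%, and 67.38%) agree with the rounded figures reported for NR, GO, and Pfam.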

Fig. 3

Gene Ontology (GO) annotations were categorized into three functional domains: Cellular Component (CC), Biological Process (BP), and Molecular Function (MF). The top ten most enriched terms were selected for each category based on the number of associated genes. The x-axis denotes GO functional terms, while the y-axis represents the number of transcripts mapped to each term.

Fig. 4

LncRNA Prediction Venn Diagram, New Transcripts and Genes Prediction Analysis, and Distribution Venn Diagram of Splicing Events. (a) Venn diagram for the prediction of coding potential of newly identified transcripts using CNCI, CPC2, and PLEK. (b) Distribution plot of the CDS lengths for the newly identified transcripts. The x-axis represents the CDS length of new transcripts, and the y-axis represents the number of new transcripts within that length range, with the red dashed line indicating the N50 length. (c) Distribution plot showing the number of different types of new transcripts in the samples. (d) Statistical plot for the different types of alternative splicing (AS) events in each sample. SE: Skipped Exon, MX: Mutually Exclusive Exons, A5: 5′ Alternative Splicing, A3: 3′ Alternative Splicing, RI: Retained Intron, AF: First Exon Alternative Splicing, AL: Last Exon Alternative Splicing.

Quality control of Alternative Splicing (AS) events

A total of 89,468 alternative splicing (AS) events were identified across all samples, covering seven distinct AS types (Fig. 4b,c). The most prevalent AS type was skipped exons (SE), comprising 30.66% of all detected events. The remaining AS types included alternative 3′ splice sites (A3, 18.16%), alternative 5′ splice sites (A5, 14.17%), alternative first exons (AF, 25.63%), alternative last exons (AL, 5.35%), mutually exclusive exons (MX, 1.87%), and retained introns (RI, 4.05%) (Fig. 4d).
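To relate these percentages to event counts, the Python sketch below back-calculates approximate numbers of events per AS type from the total of 89,468. The exact per-type counts are not given in the text, so the derived counts are approximations rather than reported values. Decimal arithmetic is used so that the two-decimal percentages are handled exactly; the variable names are illustrative.

```python
from decimal import Decimal

total_events = 89_468
as_percent = {
    "SE": Decimal("30.66"),
    "AF": Decimal("25.63"),
    "A3": Decimal("18.16"),
    "A5": Decimal("14.17"),
    "AL": Decimal("5.35"),
    "RI": Decimal("4.05"),
    "MX": Decimal("1.87"),
}

# Sum of the percentages exactly as listed in the text.
percent_sum = sum(as_percent.values())

# Approximate event counts implied by each percentage (rounded to integers).
approx_counts = {
    event_type: int((total_events * share / 100).to_integral_value())
    for event_type, share in as_percent.items()
}
```

As listed, the seven percentages sum to 99.89%, so the back-calculated counts (for example, roughly 27,431 skipped-exon events) should be read as estimates.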