Background & Summary

The Gayal, also known as Mithun (Bos frontalis), is a rare semi-wild Bovine species found in the harsh environments of Indo-China1,2. In China, the Gayal is predominantly located in the narrow valleys of the Dulong and Nujiang Rivers, as well as the adjacent mountainous areas of Yunnan Province, where it is referred to as ‘Dulong cattle’3. The Gayal has a chromosome complement of 2n = 58, which is exactly intermediate between that of domesticated cattle (Bos taurus, 2n = 60) and wild gaur (Bos gaurus, 2n = 56)4,5,6. This unique animal is larger in body size than local counterparts and is characterized by distinctive ‘white stockings’ on its stout legs7. Additionally, the Gayal exhibits prominent beef-type characteristics and has high-quality carcass attributes, particularly with a Longissimus dorsi muscle fiber diameter that is nearly half the size of that found in most other Bos species8,9,10. Because of the isolation of its habitats and a range of ecological and sociopolitical challenges, the Gayal is one of the least studied ungulates11. It is classified as a threatened species by the International Union for Conservation of Nature (IUCN). There is still significant uncertainty about its origins, and information on the transcriptome architecture of the Gayal is notably sparse. Addressing this knowledge gap is crucial for gaining a comprehensive understanding of its genetic and molecular characteristics, which is essential for devising effective conservation and management strategies.

Transcriptomic analysis is a powerful tool for uncovering gene expression, function, and regulatory mechanisms12,13,14. Recent transcriptomic studies of Bos species, including the Gayal, yak (Bos grunniens), and cattle, have provided insights into their gene expression profiles, metabolic pathways, and adaptive characteristics15. Such information not only enhances our understanding of how these species survive in harsh environments but also reveals their unique adaptations to specific ecological niches. Moreover, comparative analysis of the Gayal transcriptome with that of related species can elucidate its evolutionary history and genetic relationships, thereby providing a scientific basis for conservation and restoration efforts.

In this study, samples were collected from 10 organs and tissues of the Gayal (heart, liver, spleen, lung, kidney, rumen, abomasum, duodenum, ileum, and rectum), followed by an integrated PacBio Iso-seq and RNA-seq analysis. This resulted in the identification of 30,760 full-length transcripts, with lengths ranging from 363 bp to 7,157 bp. Bioinformatics analysis showed that 15,783 and 14,064 of these transcript sequences matched entries in the NCBI non-redundant protein (NR) and Swiss-Prot databases, respectively. Further annotation indicated that 14,690, 13,609, and 11,406 transcripts had corresponding entries in the Kyoto Encyclopedia of Genes and Genomes (KEGG), euKaryotic Ortholog Groups (KOG), and Gene Ontology (GO) databases, respectively. Additionally, long non-coding RNAs (lncRNAs) were identified across all sampled tissues of the Gayal. These full-length transcripts and gene expression profiles constitute a rich repository of genetic information, providing a foundation for functional genomics studies and for deeper insights into the molecular mechanisms of the Gayal.

Methods

Ethics statement

All Gayal handling and experimental procedures in this study were approved by the Animal Care and Use Committee of Yunnan Agricultural University (Approval Code: 202307018, Approval Date: 3 July 2023).

Collection of samples and preparation of RNA

A 3-year-old male Gayal was sacrificed and samples were collected at the Yunnan Provincial Dulong cattle conservation and breeding base of Fenghuang Mountain, Lushui City, China. The tissue samples, including heart, liver, spleen, lung, kidney, rumen, abomasum, duodenum, ileum, and rectum, were promptly preserved in liquid nitrogen. RNA was extracted from each tissue using TRIzol reagent (Invitrogen) according to the manufacturer’s protocol. To remove genomic DNA, the extracted RNA samples were first added to a DNase I buffer, followed by the addition of an appropriate amount of DNase I enzyme, and were digested at 37 °C for 30 minutes. DNase I stop solution was then added to inactivate the DNase I enzyme. Finally, the RNA samples were purified and assessed to ensure the absence of residual DNA. Initial evaluation of RNA concentration and purity was performed using a Nanodrop 2000, while agarose gel electrophoresis was employed to check for genomic contamination, purity, and RNA integrity. Subsequently, the RNA integrity number was determined using the Agilent 2100 platform.

PacBio library construction and sequencing

For PacBio library construction, RNA samples from 10 different organs and tissues were mixed in equal proportions. The library preparation and sequencing were performed at Wuhan Frasergen Gene Biotechnology Co., Ltd (Wuhan, China) using the PacBio Sequel II platform with a total of 6 SMRT cells. Because the 3′ ends of eukaryotic mRNAs carry a poly-A tail, Oligo dT primers were annealed to the poly-A tail to synthesize cDNA via reverse transcription. Primers were then added to the ends of the full-length cDNA synthesized through reverse transcription. The full-length cDNA was amplified by PCR, and the products were purified using PB magnetic beads to remove cDNA fragments smaller than 1 kb. The ends were repaired and SMRTbell adapters were ligated. Unligated fragments were digested with exonucleases, and the sequencing library was purified using PB magnetic beads. After library construction, quantification was performed using Qubit 3.0, and the library size was assessed using an Agilent 2100. Sequencing was conducted only if the library size met the expected criteria. The raw data obtained from the sequencer were filtered and processed using SMRTlink 10.0 software, with the parameter --minLength = 50 to exclude sequences shorter than 50 bp. The data remaining after this filtering step were considered valid.

Illumina mRNA library construction and sequencing

Total RNA was extracted from 10 different organs and tissues of the Gayal using TRIzol reagent (Invitrogen, CA, USA), and the purity and integrity of the RNA were monitored using a NanoDrop 2000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and the Bioanalyzer 2100 system (Agilent Technologies, CA, USA). A 1.5% agarose gel was used to assess RNA contamination. mRNA was then isolated using Oligo(dT) magnetic beads and purified according to the manufacturer’s instructions, followed by library construction using the VAHTS Universal V6 RNA-seq Library Kit for MGI (Vazyme, Nanjing, China). The quality and size of the libraries were evaluated using a Qubit 2.0 fluorometer (Life Technologies, Carlsbad, CA, USA) and the Agilent 2100 system (Agilent Technologies, CA, USA). Subsequently, sequencing was performed on an MGI-SEQ 2000 platform by the Wuhan Frasergen Bioinformatics Co., Ltd. (Wuhan, China). The RNA-seq experiments were performed with three biological replicates.

PacBio ISO-seq data processing

As shown in Table 1, a total of 32,505,744 subreads were generated from PacBio ISO-seq. The pbccs tool (https://github.com/nlhepler/pbccs) was employed to generate Circular Consensus Sequences (CCS) from subreads. To retain low-frequency transcripts and maximize data recovery, the “--all” parameter was used, resulting in 629,246 CCS reads, which were converted to FASTQ format for quality assessment using FastQC. Subsequent processing with the lima tool (https://lima.how) removed adapter sequences, primers, and barcode tags, yielding 366,942 full-length non-chimeric (FLNC) reads with an average length of 1,899 bp. The FLNC reads were further refined using the IsoSeq3 refine module to eliminate poly-A tails and chimeric sequences, producing a high-confidence dataset suitable for clustering. To generate consistent transcript clusters, the refined FLNC reads were processed using the IsoSeq3 cluster module, which removed redundancies and aggregated similar sequences. The resulting clustered transcripts underwent error correction with LoRDEC16 (https://gitlab.inria.fr/serge.lordec/lordec), a tool that utilizes high-accuracy short-read sequencing data for base-level corrections. After correction, the dataset was reduced to 30,760 high-quality isoforms with an average length of 1,904 bp and an accuracy exceeding 99% (Fig. 1A). Because this full-length transcript dataset was generated through clustering and deduplication, it is non-redundant, and each sequence or cluster is treated as representing a distinct gene. This is consistent with the general definition of a Unigene, and the non-redundant sequences provide a foundation for downstream analyses.
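The summary figures quoted above (number of isoforms, average length) are the kind of statistic that can be recomputed directly from a FASTA file. The following is a minimal sketch, assuming plain FASTA input; the function names and the toy records are illustrative and are not part of the authors' pipeline.

```python
from statistics import mean


def fasta_lengths(lines):
    """Return {record_id: sequence_length} for FASTA-formatted lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def summarize(lengths):
    """Count, minimum, maximum, and mean length over all records."""
    values = list(lengths.values())
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


# Toy input (hypothetical records, not real Gayal isoforms).
example = [">iso1 demo", "ACGTACGT", "ACGT", ">iso2", "ACGTAC"]
stats = summarize(fasta_lengths(example))
print(stats)
```

To apply this to a real file, pass an open file handle in place of the toy list, for example `fasta_lengths(open("isoforms.fa"))`, where the file name is a placeholder.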

Table 1 Statistics of ISO-sequencing in the Gayal.
Fig. 1

Isoform analysis and annotation of the Gayal’s ISO-seq data. (A) Matrix layout for all intersections of seven databases sorted by size. Dark circles in the matrix indicate sets that are part of the intersection. (B) Venn diagram analysis of ISO-seq isoforms with NR, KOG, COG, Swiss-Prot, EggNOG, GO, and KEGG databases. (C) KOG functional analysis of ISO-seq isoforms.

Function annotation of unigenes

Functional annotation of the Gayal Unigenes, a non-redundant full-length transcript dataset obtained through deduplication or clustering, was conducted through detailed comparisons with seven databases (NR, NT, Swiss-Prot, GO, KEGG, KOG, and COG). For the NT database analysis, BLAST software was used with an E-value cutoff of <1e-10. Annotation for NR, KOG, COG, and Swiss-Prot was performed using the Diamond BLASTX method17, with the same E-value threshold of <1e-10. GO annotations for the Gayal transcriptome were derived from NR database alignments using the Diamond BLASTX tool. The BLAST output was processed to extract UniProt IDs, which were subsequently mapped to GO terms using the UniProt idmapping.tb.gz file. A custom Python script, Uniprot2GO_annotated.py, was utilized to match UniProt IDs with GO terms based on the mapping file. The script outputs a comprehensive association of query sequences with GO terms, providing functional insights into the transcriptome. KEGG functional classification was comprehensively analyzed using the KofamKOALA method (https://www.genome.jp/tools/kofamkoala/).
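The ID-to-GO mapping step described above can be sketched as follows. This is an illustration only and not the authors' Uniprot2GO_annotated.py script: the column positions of the mapping file (`acc_col`, `go_col`), the semicolon-separated GO field, and the toy accessions are all assumptions that would need to be checked against the actual idmapping file.

```python
from collections import defaultdict


def load_go_map(rows, acc_col=0, go_col=1):
    """Build {uniprot_accession: [GO terms]} from tab-split mapping rows.

    acc_col and go_col are assumed column positions, not taken from the paper.
    """
    go_map = {}
    for row in rows:
        if len(row) <= max(acc_col, go_col):
            continue  # skip short or malformed rows
        terms = [t.strip() for t in row[go_col].split(";") if t.strip()]
        if terms:
            go_map[row[acc_col]] = terms
    return go_map


def annotate(hits, go_map):
    """Map each query to the GO terms of its hit accessions.

    hits is an iterable of (query_id, uniprot_accession) pairs,
    e.g. parsed from tabular BLASTX output.
    """
    annotated = defaultdict(set)
    for query, accession in hits:
        annotated[query].update(go_map.get(accession, []))
    return {q: sorted(terms) for q, terms in annotated.items() if terms}


# Toy data: accessions and transcript IDs are made up for illustration.
mapping_rows = [
    ["ACC0001", "GO:0005524; GO:0006412"],
    ["ACC0002", ""],
    ["ACC0003", "GO:0005524"],
]
hits = [("transcript_1", "ACC0001"), ("transcript_2", "ACC0002"),
        ("transcript_3", "ACC0003"), ("transcript_3", "ACC0001")]

go_map = load_go_map(mapping_rows)
result = annotate(hits, go_map)
print(result)
```

Queries whose hits carry no GO terms are dropped from the output, so the number of keys in `result` corresponds to the count of GO-annotated transcripts.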

Quality control of annotation

To ensure the quality of annotations, the full-length transcripts were checked against multiple reference databases. Matches were found for 26,758 (86.99%) transcripts in the NR database, 23,629 (76.82%) in KOG, 24,064 (78.23%) in Swiss-Prot, 25,161 (81.80%) in EggNOG, and 6,118 (19.89%) in COG. Additionally, 17,256 (56.10%) and 18,825 (61.20%) transcripts were annotated in the KEGG and GO databases, respectively (Fig. 1B). Further functional classification of the isoforms was performed using the KOG database (Fig. 1C). Analysis of GO ontology and KEGG pathways was conducted using TBtools-II18 to identify annotated sequences and explore active biological pathways in the Gayal (Fig. 2). For example, KEGG analysis revealed that Unigenes are primarily involved in six pathway categories: metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, and human diseases. Additionally, the coding sequences of the Gayal were predicted and analyzed using TransDecoder software.
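The percentages above follow from dividing each annotation count by the 30,760 full-length transcripts. A short check, using only the counts reported in the text (the variable names are illustrative):

```python
# Total number of full-length transcripts reported for the Gayal.
TOTAL_TRANSCRIPTS = 30760

# Annotated transcript counts per database, as reported in the text.
annotated_counts = {
    "NR": 26758,
    "KOG": 23629,
    "Swiss-Prot": 24064,
    "EggNOG": 25161,
    "COG": 6118,
    "KEGG": 17256,
    "GO": 18825,
}

# Percentage of all transcripts annotated in each database, to two decimals.
percent_annotated = {
    db: round(100 * count / TOTAL_TRANSCRIPTS, 2)
    for db, count in annotated_counts.items()
}

for db, pct in percent_annotated.items():
    print(f"{db}: {annotated_counts[db]} ({pct:.2f}%)")
```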

Fig. 2

GO ontology and KEGG pathways analysis for the annotated sequences of the Gayal.

In the sequence similarity-based clustering analysis (Fig. 3A), we utilized protein sequences from 9 Bovine species, including the Gayal. All alignments were performed with an E-value threshold of <1e−5. A total of 22,963 genes were identified in the Gayal, and multiple homologous gene clusters were shared with the other 8 Bovine species. Subsequently, a phylogenetic tree was constructed using maximum likelihood methods with OrthoFinder19 (Fig. 3B). The analysis showed that the Gayal has a close phylogenetic relationship with Bos javanicus. These phylogenetic relationships align with the known taxonomy and evolutionary positions of these species.

Fig. 3

Cluster analysis of gene families and phylogenetic analysis. (A) Cluster analysis of gene families of 9 Bovine species. (B) Phylogenetic analysis of 9 Bovine species through a rooted STAG species phylogenetic tree using the maximum likelihood method.

Identification of open reading frames (ORFs) and long non-coding RNAs (lncRNAs)

ORFs were predicted in the Gayal transcripts using the TransDecoder v2.0.1 software package (https://transdecoder.github.io/). Transcripts with complete ORFs, along with 5′ and 3′ untranslated regions, were identified as full-length transcripts. lncRNAs play crucial roles in the regulation of growth and development in many Bovine species. Potential lncRNAs were predicted using CNCI (Coding-Non-Coding Index)20, CPC2 (Coding Potential Calculator 2)21, CPAT (Coding Potential Assessment Tool)22, and PLEK (Prediction of Long Non-coding RNAs)23 software tools.

For CNCI, the CNCI.py script was run with the “ve” model for vertebrate species using the following command: python /path/to/CNCI.py -f /path/to/transcripts.fa -o CNCI_output -p 8 -m ve.

For CPC2, the command was: python /path/to/CPC2.py -i /path/to/transcripts.fa -o CPC2_output.

For CPAT, the following command was used: python /path/to/cpat.py -g /path/to/transcripts.fa -d /path/to/Mouse_logitModel.RData -x /path/to/Mouse_Hexamer.tab -o CPAT_output.

For PLEK, the command was: python /path/to/PLEK.py -fasta /path/to/transcripts.fa -out PLEK_output -thread 20. Figure 4 presents a Venn diagram analysis of the lncRNA predictions from these four software tools, revealing a total of 17,565 identified lncRNAs.
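A Venn-style comparison of four prediction sets reduces to set operations on transcript IDs. The sketch below shows one way to do this; the transcript IDs are toy values, and the text does not state whether the reported total corresponds to the union or the intersection of the four tools, so both are computed here.

```python
from collections import Counter


def summarize_predictions(predictions):
    """Compare lncRNA calls from several tools.

    predictions maps tool name -> set of transcript IDs called non-coding.
    Returns (IDs called by every tool, IDs called by at least one tool,
    Counter of how many tools support each ID).
    """
    support = Counter(tid for ids in predictions.values() for tid in ids)
    shared_by_all = set.intersection(*predictions.values())
    called_by_any = set.union(*predictions.values())
    return shared_by_all, called_by_any, support


# Toy predictions (hypothetical transcript IDs, not real results).
predictions = {
    "CNCI": {"t1", "t2", "t3", "t5"},
    "CPC2": {"t1", "t2", "t4", "t5"},
    "CPAT": {"t1", "t2", "t3", "t4"},
    "PLEK": {"t1", "t2", "t5", "t6"},
}

shared_by_all, called_by_any, support = summarize_predictions(predictions)
print(sorted(shared_by_all))
print(len(called_by_any))
print(support["t5"])
```

The `support` counter also allows intermediate criteria, such as keeping transcripts called by at least three of the four tools.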

Fig. 4

(A) Venn diagram analysis of predicted lncRNAs based on ISO-seq transcripts. (B) Number of lncRNAs predicted by each of the four tools.

Identification of differentially expressed genes across tissues

Gene expression levels across the 10 organs and tissues of the Gayal were first quantified as RPKM (Reads Per Kilobase of transcript per Million mapped reads) values, and correlations among samples were analyzed. To identify Differentially Expressed Genes (DEGs), we applied a log-fold change (log FC) threshold of >2 or <-2, with a false discovery rate (FDR) <0.001 and a p-value < 0.005 to ensure statistical significance.
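RPKM normalizes a raw read count by transcript length (in kilobases) and by sequencing depth (in millions of mapped reads). A minimal sketch of the standard formula, with illustrative numbers that are not taken from the dataset:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = read_count * 1e9 / (transcript_length_bp * total_mapped_reads)
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Illustrative values: 500 reads on a 2,000 bp transcript
# in a library with 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)
```

With these inputs, 500 reads over 2 kb gives 250 reads per kilobase, and dividing by 10 million mapped reads (10 in units of millions) gives 25 RPKM.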

First, we performed quality control on the raw RNA-seq data using FastQC, MultiQC, and Trimmomatic. We used FastQC for initial quality assessment and Trimmomatic for trimming low-quality reads and adapters from the raw data. For paired-end sequencing data, Trimmomatic was used with parameters to perform adapter clipping, quality trimming, and filtering of short reads. The cleaned reads from the 10 tissue samples were aligned to the reference isoforms generated from PacBio ISO-seq data using Bowtie2. After alignment, we performed gene expression quantification using featureCounts, followed by DEG analysis using DESeq2 (or edgeR), which is widely used for RNA-seq data analysis. The resulting DEG list was filtered based on the aforementioned thresholds for log-fold change, FDR, and p-value to identify genes with significant differential expression across the tissues. Subsequent correlation analysis (Fig. 5A) and gene expression profile examination in 10 organs and tissues (Fig. 5B) provided insights into the dynamic mRNA expression of the Gayal.
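The final filtering step can be sketched as a simple predicate over a results table. The thresholds are those stated in the text; the field names (`logFC`, `FDR`, `PValue`) and the toy rows are assumptions for illustration and do not reflect the actual output columns of the authors' analysis.

```python
def is_deg(log_fc, fdr, p_value, fc_cut=2.0, fdr_cut=0.001, p_cut=0.005):
    """True if a gene passes all three thresholds stated in the text:
    log FC > 2 or < -2, FDR < 0.001, and p-value < 0.005."""
    return abs(log_fc) > fc_cut and fdr < fdr_cut and p_value < p_cut


# Toy results table (hypothetical genes and statistics).
results = [
    {"gene": "geneA", "logFC": 3.1, "FDR": 1e-5, "PValue": 1e-7},
    {"gene": "geneB", "logFC": -2.6, "FDR": 5e-4, "PValue": 1e-5},
    {"gene": "geneC", "logFC": 1.4, "FDR": 1e-6, "PValue": 1e-8},   # fold change too small
    {"gene": "geneD", "logFC": 4.0, "FDR": 0.02, "PValue": 0.001},  # FDR too high
]

degs = [r["gene"] for r in results if is_deg(r["logFC"], r["FDR"], r["PValue"])]
print(degs)
```

Using the absolute value of the log-fold change captures both up-regulated and down-regulated genes with one comparison.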

Fig. 5

(A) Gene expression profile in 10 organs and tissues (heart, liver, spleen, lung, kidney, rumen, abomasum, duodenum, ileum, and rectum) in the Gayal. (B) Heatmap analysis of the differentially expressed mRNAs in 10 organs and tissues.

Cluster analysis of gene families and phylogenetic analysis

In the gene family clustering analysis, we utilized protein sequences from 9 Bovine species: the Gayal and eight others (Bos taurus, Bos indicus, Bos indicus × Bos taurus, Bos grunniens, Bos javanicus, Bubalus bubalis, Ovis aries, Capra hircus). Redundant and alternatively spliced sequences were removed, retaining only the longest transcript of each gene. In OrthoFinder, sequence similarity searches were performed using the “-S diamond” option, and a phylogenetic tree was constructed based on maximum likelihood methods using the “-M msa -T iqtree” parameters, resulting in a STAG (Species Tree inference from All Genes) phylogenetic tree.
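Keeping only the longest transcript per gene is a common preprocessing step before orthology inference. The sketch below illustrates it under the assumption that a gene-to-transcript mapping is available; the record layout, IDs, and the input directory name are hypothetical. The OrthoFinder flags in the final command are the ones quoted in the text, with `-f` pointing to a directory of per-species protein FASTA files; the command is only assembled here, not executed.

```python
def longest_per_gene(records):
    """Keep the longest protein sequence for each gene.

    records is an iterable of (gene_id, transcript_id, protein_sequence).
    Returns {gene_id: (transcript_id, protein_sequence)}.
    Ties are resolved in favor of the first record seen.
    """
    best = {}
    for gene_id, transcript_id, sequence in records:
        if gene_id not in best or len(sequence) > len(best[gene_id][1]):
            best[gene_id] = (transcript_id, sequence)
    return best


# Toy records (made-up genes and peptide sequences).
records = [
    ("gene1", "gene1.t1", "MKTAYIAK"),
    ("gene1", "gene1.t2", "MKTAYIAKQRQISFVK"),
    ("gene2", "gene2.t1", "MSLLTEVE"),
]

kept = longest_per_gene(records)
print({gene: tid for gene, (tid, _) in kept.items()})

# Command line corresponding to the options named in the text.
# "proteomes/" is a placeholder directory name.
orthofinder_cmd = ["orthofinder", "-f", "proteomes/",
                   "-S", "diamond", "-M", "msa", "-T", "iqtree"]
print(" ".join(orthofinder_cmd))
```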

Data Records

In this study, all raw RNA-seq data for the Gayal have been deposited in the Sequence Read Archive (SRA) at the National Center for Biotechnology Information, under the accession numbers SRR30184652, SRR30184649, SRR30184653, SRR30184654, SRR30184655, SRR30184656, SRR30184657, SRR30184658, SRR30184659, SRR30184651, SRR30184632, SRR30184631, SRR30184648, SRR30184629, SRR30184647, SRR30184646, SRR30184645, SRR30184644, SRR30184643, SRR30184642, SRR30184641, SRR30184650, SRR30184640, SRR30184639, SRR30184638, SRR30184637, SRR30184636, SRR30184635, SRR30184630, SRR30184634, and SRR30184633 (refs. 24–53). All assemblies (assembled transcriptome data), transcriptome annotation, and read count files have been uploaded to the Zenodo database and can be accessed at https://doi.org/10.5281/zenodo.14859252. The data of all the analyses described in the methods have been deposited in the GEO repository under the series accession number GSE289623, with the sample accession numbers GSM8795261, GSM8795262, GSM8795263, GSM8795264, GSM8795265, GSM8795266, GSM8795267, GSM8795268, GSM8795269, GSM8795270, GSM8795271, GSM8795272, GSM8795273, GSM8795274, GSM8795275, GSM8795276, GSM8795277, GSM8795278, GSM8795279, GSM8795280, GSM8795281, GSM8795282, GSM8795283, GSM8795284, GSM8795285, GSM8795286, GSM8795287, GSM8795288, GSM8795289, and GSM8795290 (refs. 54–84).

Technical Validation

To reveal the global mRNA expression profile of the Gayal, sequencing was performed on samples from 10 organs and tissues (heart, liver, spleen, lung, kidney, rumen, abomasum, duodenum, ileum, rectum). The quality of the RNA-seq cleaned reads was assessed using FastQC, showing the average Q20 and Q30 values were 99.04% and 97.06%, respectively. The GC content distribution of Gayal tissue samples exhibited a normal pattern, indicating no sequencing data contamination (Fig. 6).
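Q20 and Q30 are the percentages of bases with a Phred quality score of at least 20 and 30, respectively. The following sketch computes them from FASTQ quality strings, assuming the common Phred+33 encoding; the encoding and the toy quality string are assumptions, and FastQC itself was the tool used in the study.

```python
def q_fractions(quality_strings, offset=33):
    """Return (Q20 fraction, Q30 fraction) over all bases.

    quality_strings is an iterable of FASTQ quality lines;
    offset=33 assumes Phred+33 encoding.
    """
    total = q20 = q30 = 0
    for qual in quality_strings:
        for ch in qual:
            score = ord(ch) - offset
            total += 1
            if score >= 20:
                q20 += 1
            if score >= 30:
                q30 += 1
    if total == 0:
        raise ValueError("no bases supplied")
    return q20 / total, q30 / total


# Toy quality string: 'I' encodes Q40, '5' encodes Q20, '+' encodes Q10.
q20_frac, q30_frac = q_fractions(["IIII5+"])
print(f"Q20: {100 * q20_frac:.2f}%  Q30: {100 * q30_frac:.2f}%")
```

In the toy string, five of six bases reach Q20 and four of six reach Q30.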

Fig. 6

The sequence quality of the Gayal RNA-seq cleaned reads from 10 organs and tissues with three biological replicates.