Background & Summary

The Tibetan antelope (Pantholops hodgsonii, also known as the chiru) is a quintessential plateau-dwelling ruminant endemic to the Qinghai-Tibetan Plateau (QTP) in China, where it is widely distributed at altitudes ranging from 3,250 to 5,500 m1,2. It belongs to the order Artiodactyla, family Bovidae, subfamily Caprinae, and genus Pantholops1. During the 1980s and 1990s, the Tibetan antelope suffered from illegal commercial hunting for its fine wool and fur, resulting in a 90% population decline compared to the 1950s3,4. The species was listed in Appendix I of the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) in 1976 and as an endangered species on the IUCN Red List of Threatened Species in 20005,6. Notably, after over 30 years of continuous conservation, the population of the Tibetan antelope has recovered to more than 200,000, leading to its reclassification as near threatened on the IUCN Red List7. However, threats to the survival of the Tibetan antelope, such as human development, global warming, and grassland degradation caused by human activities, still exist, making sustained conservation efforts crucial8. Since the publication of the first scaffold-level genome of the Tibetan antelope in 2013, which consisted of 15,058 scaffolds and had a scaffold N50 of 2.8 Mb (accession number: GCA_000400835.1)9, no additional genome assemblies for this species have appeared in publicly available online databases. In general, the short reads generated by next-generation sequencing (NGS) technology cannot accurately and completely cover complex genomic regions such as repeats, so genomes assembled from them have lower completeness and contiguity. As a result, NGS-based genomes provide only limited information on the distribution and evolution of functional gene families and of repeats such as transposable elements (TEs), as well as on genomic structural variation and epigenetic modifications.

With advances in third-generation sequencing technologies such as PacBio HiFi and Oxford Nanopore sequencing, it is now feasible to generate sequencing reads with higher accuracy, integrity, and continuity10,11,12. Additionally, High-throughput Chromosome Conformation Capture (Hi-C) technology enables contigs from the assembly to be anchored to chromosomes13. In this study, we assembled a high-quality chromosome-level reference genome of the Tibetan antelope, which contains 29 autosomes, one X chromosome, a partial Y chromosome, one mitochondrial genome, and 601 unanchored scaffolds, with a total of 28,330 protein-coding genes. The scaffold N50, Benchmarking Universal Single-Copy Orthologs (BUSCO) completeness, and quality value (QV) were 92.23 Mb, 98.2%, and 70.14, respectively. Our data and results not only contribute to the genetic conservation of the Tibetan antelope but also provide a valuable resource for genetic, ecological, and evolutionary research within the subfamily Caprinae.

Methods

Sampling and sequencing

Skeletal muscle tissues were collected from a freshly deceased adult male Tibetan antelope at Hoh Xil National Nature Reserve (35°54′ N, 91°92′E; 4,830 m a.s.l.) in Qinghai Province, China. The tissues were promptly stored at −80 °C for subsequent genomic DNA extraction and sequencing. A multi-platform sequencing approach involving PacBio, DNBSEQ, and Hi-C was employed to assemble a chromosome-level reference genome of the Tibetan antelope.

For HiFi sequencing, genomic DNA was extracted from the tissue sample following the protocol of the Blood/Tissue Culture DNA Midi Kit (QIAGEN)14. The integrity and quality of the DNA were assessed using agarose gel electrophoresis and the Qubit 2.0 Fluorometer (Thermo Fisher Scientific). A SMRTbell target-size library was constructed for sequencing according to PacBio’s standard protocol (Pacific Biosciences, CA, USA) using either 13 kb or 16 kb preparation solutions. Sequencing was performed on a PacBio Sequel II sequencing platform using a Sequel II Sequencing Kit 2.0 at Nextomics15. This produced a total of 107.31 Gb of HiFi long pass reads, with a coverage of approximately 35.77× (Table 1).

Table 1 Statistical information of PacBio HiFi and DNBSEQ sequencing data.

For Hi-C sequencing, a high-throughput chromatin conformation capture (Hi-C) library was prepared through formaldehyde crosslinking, MboI enzyme digestion, biotin labeling, ligation, and purification16. The qualified libraries (with insert sizes ranging from 200 to 600 bp) were subsequently subjected to sequencing on a DNBSEQ-T7 platform, yielding a total of 405.62 Gb of 150-bp paired-end reads, which covered approximately 135.20× of the genome.

For the genome survey, a paired-end library with an insert size of 350 bp was constructed and sequenced on the DNBSEQ-T7 sequencing platform, following standard procedures (MGI Tech Co., Ltd., Shenzhen, China) at Compass Agritechnology Co., Ltd. (Beijing, China). This generated 478 Gb of 150-bp paired-end reads, providing a coverage of approximately 159.33× (Table 1).
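The coverage figures quoted for the three libraries follow from dividing each yield by a genome size. The sketch below reproduces that arithmetic. The yields come from the text; the 3.0 Gb genome size is an assumption inferred from the reported yield-to-coverage ratios, not a value stated by the authors.

```python
# Approximate sequencing depth = total sequenced bases / genome size.
# ASSUMPTION: a 3.0 Gb genome size, inferred from the reported ratios
# (e.g. 107.31 Gb / 35.77x); the text does not state the divisor.
ASSUMED_GENOME_SIZE_GB = 3.0

# Yields in Gb as reported in the text.
yields_gb = {
    "PacBio HiFi": 107.31,
    "Hi-C (DNBSEQ-T7)": 405.62,
    "Survey WGS (DNBSEQ-T7)": 478.0,
}


def depth(yield_gb: float, genome_size_gb: float = ASSUMED_GENOME_SIZE_GB) -> float:
    """Return the mean fold-coverage for a given yield and genome size."""
    return yield_gb / genome_size_gb


depths = {name: depth(gb) for name, gb in yields_gb.items()}

for name, fold in depths.items():
    print(f"{name}: {fold:.2f}x")
```

Using the survey-based estimate of 2.79 Gb or the 3.13 Gb assembly size instead would shift every depth proportionally.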

Genome survey and assembly

The genome size and heterozygosity of the Tibetan antelope were estimated from the DNBSEQ-T7 paired-end reads by K-mer analysis (k = 20) using Jellyfish v2.2.10 and GenomeScope v1.017,18 with parameters “-k 20 -m 1000000” (Fig. 1A,B). The estimated haploid genome size of the Tibetan antelope is 2.79 Gb, with a heterozygosity of 0.29% (Fig. 1A). Subsequently, the HiFi and Hi-C reads were used for contig assembly with hifiasm v0.19.8 (r603) under default parameters19,20, yielding a 3.13 Gb primary assembly of 562 contigs with a contig N50 of 84.65 Mb (Table 2). The contigs in the primary assembly were then anchored onto chromosomes using yahs v1.121 and juicer_tools v1.19.0222. First, the contigs were indexed using samtools v1.19.223, and the Hi-C reads were mapped to the contigs using chromap v0.2.6 (r486)24 with parameters “-r $contigs -x contigs.index -1 $Hi-C.R1reads -2 $Hi-C.R2reads -o aligned.sam --preset hic --remove-pcr-duplicates --SAM -t 200”. The aligned SAM files were then converted to BAM and BED formats using samtools. Using yahs with default parameters, interaction signals between contigs were identified from the Hi-C data, and the contigs were sorted, pruned, and optimized according to signal strength. Finally, the Hi-C contact matrix was generated with juicer_tools and manually adjusted in Juicebox v1.11.0822. A total of 2.71 Gb (86.56%) of contig sequence was anchored to 30 chromosomes, a number consistent with previous karyotype studies25, giving a scaffold N50 of 92.23 Mb (Fig. 1C and Table 2). The sex chromosomes (X and partial Y) were identified by mapping the 30 chromosomes and the remaining unanchored contigs to the genomes of closely related species using minimap2 v2.28 (r1209)26 (Fig. 2A). However, 509 scaffolds, most of them repeats, could not be anchored to the chromosomes. According to previous research, this may be caused by lower mapping accuracy in the complex repeats near the centromeric regions of the bovine acrocentric chromosomes27. Additionally, the completeness and quality value (QV) of the assembled genome were evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.5.0 and merqury v1.3, respectively28,29. Against mammalia_odb10, 98.2% of BUSCOs were complete, 0.8% were fragmented, and 1.0% were missing. Moreover, the average QV of the assembly is 70.14, indicating high base-level accuracy. Together, these results show that a high-quality chromosome-level reference genome of the Tibetan antelope has been established.
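Two of the metrics reported here, N50 and QV, have simple definitions that can be sketched in a few lines. The code below is illustrative only and is not the authors' pipeline. The toy length list is hypothetical, and the QV conversion assumes the usual Phred scaling, QV = -10 log10(error rate).

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be a non-empty list of positive numbers")


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


# Hypothetical contig lengths, used only to exercise the function.
toy_lengths = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_lengths)

# QV reported in the text for the final assembly.
assembly_error_rate = qv_to_error_rate(70.14)

print(toy_n50)
print(f"{assembly_error_rate:.2e}")
```

Under this scaling, a QV of 70.14 corresponds to roughly one expected base error per ten million bases.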

Fig. 1

The genome survey, heat map of the Hi-C contact matrix, and functional annotation of the protein-coding genes in the Tibetan antelope genome. (A) K-mer analysis of the Tibetan antelope genome. (B) The adult male Tibetan antelope. (C) Heat map of the Hi-C contact matrix of the Tibetan antelope genome assembly. (D) UpSet bar plot of the functional annotation of the protein-coding genes in the Tibetan antelope genome. The horizontal bars on the left represent the number of genes annotated in each database, and the vertical bars on the right represent the number of genes shared among the five databases. The pie chart on the right shows the NR database annotation of the genes that remained unannotated after annotation with the EggNOG database. Blue represents genes with annotation and orange represents genes without annotation.

Table 2 Statistics of the assembly in our study (NWIPB_Pahodg_1.0) and the one released in 2013 (PHO1.0).

Fig. 2

Collinearity of the assembled Tibetan antelope genome. (A) Collinearity between homologous chromosomes in Tibetan antelope, takin, goat, argali, and sheep. (B) Overview of the assembled Tibetan antelope genome. From the outer to the inner layers: (a) sequencing depth of Hi-C data, (b) sequencing depth of WGS data, (c) repeat density, (d) GC content, (e) gene density, (f) chromosomes, and (g) syntenic block links among the chromosomes. Syntenic block links between different chromosomes are colored orange.

Genome repeats annotation

De novo libraries of transposable elements (TEs) and repeats for the Tibetan antelope were generated using RepeatModeler v2.0.5 and the EDTA v2.1.0 pipeline, respectively, with default settings30,31. The two libraries were then merged and used for genome masking with RepeatMasker v4.1.2, with rmblastn v2.14.1 as the search engine32,33. These analyses revealed approximately 1.65 Gb of repeat sequences, constituting 52.47% of the whole genome assembly. Specifically, short interspersed nuclear elements (SINEs), long interspersed nuclear elements (LINEs), long terminal repeats (LTRs), DNA transposons, and other unclassified repeats accounted for 0.61%, 17.62%, 9.86%, 23.90%, and 0.46% of the assembly, respectively.
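As a quick consistency check, the per-class percentages can be summed and compared with the reported overall repeat content. The sketch below uses only the figures quoted in the text. Reading the per-class values as fractions of the whole assembly is an assumption, supported by their sum being close to the 52.47% total.

```python
# Per-class repeat percentages as reported in the text.
# ASSUMPTION: each value is a percentage of the whole assembly.
repeat_classes_pct = {
    "SINE": 0.61,
    "LINE": 17.62,
    "LTR": 9.86,
    "DNA transposon": 23.90,
    "Unclassified": 0.46,
}

REPORTED_TOTAL_PCT = 52.47  # overall repeat content reported in the text

class_sum_pct = sum(repeat_classes_pct.values())
difference_pct = REPORTED_TOTAL_PCT - class_sum_pct

print(f"sum of classes: {class_sum_pct:.2f}%")
print(f"difference from reported total: {difference_pct:.2f} percentage points")
```

The small residual may reflect rounding or repeat categories not itemised in the text.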

Genome structure prediction and functional annotation of protein-coding genes

A homology-based approach was employed to predict protein-coding genes. High-quality genomes and annotation files of eight ruminants, namely cattle (Bos taurus), buffalo (Bubalus bubalis), domestic yak (Bos grunniens), Chinese forest musk deer (Moschus berezovskii), sheep (Ovis aries), goat (Capra hircus), argali (Ovis ammon), and takin (Budorcas taxicolor), were downloaded from the NCBI GenBank database. Using these genomes and annotations, protein-coding genes in the masked Tibetan antelope genome were predicted using GeMoMa v1.9 with mmseqs as the search engine34. In addition, the protein sequences of the eight related species described above were aligned to the soft-masked Tibetan antelope genome with miniprot, and the protein-coding genes were annotated by GALBA using a de novo prediction strategy35. The protein-coding genes in the soft-masked genome were also annotated by Helixer, which is based on deep neural networks36. Finally, the annotation files generated by the three strategies were combined using EVidenceModeler37. A total of 28,330 protein-coding genes were predicted, with BUSCO results indicating 97.4% complete, 0.7% fragmented, and 1.9% missing BUSCOs. The functions of these protein-coding genes were then annotated using EggNOG-mapper v2.1.12 with the EggNOG database (http://eggnog5.embl.de/#/app/downloads, Mammalia, 40674)38,39. The longest representative transcripts were identified and translated into protein sequences using TBtools-II v2.08340. These protein sequences were then mapped to the EggNOG database (which includes GO, KEGG_Knum, KEGG_pathway, KEGG_EC, and pfam.domain databases) using the blastp method in diamond v2.1.7.161 with default parameters41. In total, 26,938 genes (95.08%) matched database entries (Fig. 1D). The remaining 1,392 genes were mapped to the NR database (http://www.ncbi.nlm.nih.gov/protein), with 736 matching database entries (Fig. 1D). In summary, the functions of 27,674 protein-coding genes (97.68%) were successfully annotated. In addition, an overview of the genome, including the syntenic block links among the chromosomes, GC content, and gene density, was analyzed and visualized using MCScanX42 and Advanced Circos43 in TBtools-II (Fig. 2B).
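The annotation totals quoted above can be retraced with simple arithmetic. The sketch below uses only the counts given in the text; the variable names are illustrative.

```python
# Counts reported in the text.
total_genes = 28_330   # predicted protein-coding genes
eggnog_hits = 26_938   # genes matching EggNOG database entries
nr_hits = 736          # remaining genes matching NR database entries

# Genes left without an EggNOG match, which were then searched against NR.
remaining_after_eggnog = total_genes - eggnog_hits

# Genes with a functional annotation from either database.
annotated_genes = eggnog_hits + nr_hits
annotated_pct = 100 * annotated_genes / total_genes

# Genes with no match in either database.
unannotated_genes = total_genes - annotated_genes

print(remaining_after_eggnog, annotated_genes, unannotated_genes)
print(f"{annotated_pct:.2f}%")
```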

Data Records

The genome assembly and the genomic sequencing data, including the DNBSEQ-T7 paired-end reads used for the genome survey, the DNBSEQ-T7 Hi-C reads, and the PacBio HiFi reads, were deposited in the Sequence Read Archive database of NCBI under BioProject ID PRJNA1099927. The accession number of the DNBSEQ-T7 sequencing data is SRR2875987744. The accession numbers of the Hi-C sequencing data are SRR28759867-SRR28759873 and SRR28759878-SRR2875989444. The accession numbers of the PacBio HiFi sequencing data are SRR28759874-SRR2875987644. The accession number of the final chromosome-level assembly of the Tibetan antelope is GCA_040182635.145. The annotation results for repeat sequences, gene structure, and functional prediction were deposited in the Figshare database46.

Technical Validation

Agarose gel electrophoresis was used to confirm the absence of total RNA and to check the fragment size of the purified DNA molecules. The concentration was measured using the Qubit 2.0 Fluorometer (Thermo Fisher Scientific, MA, USA). The main bands of the genomic DNA fragments were over 20 kb, and the 260/280 ratio measured on the Nanodrop ND-1000 DNA spectrophotometer (LabTech, Corinth, MS, USA) was 1.81.

The HiFi raw reads sequenced on the PacBio Sequel II were cleaned by CCS (v4.0.0) with default parameters (Table 1). A total of 405.63 Gb of Hi-C raw reads sequenced on the DNBSEQ platform were cleaned by removing low-quality reads and adaptors using SOAPnuke with default parameters, yielding 403.67 Gb of clean reads. Low-quality reads and adaptors in the DNBSEQ paired-end raw reads used for the genome survey were removed by fastp v0.23.4 with parameters “-W 5 -M 20 -q 15 -5 -u 40 -n 0 -l 75 -w 4” (Table 1).
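To make the fastp parameter string easier to read, the sketch below assembles the same options into a command line and annotates each flag. It is a hedged illustration rather than the authors' script: the input and output file names are hypothetical, and the command is only built and printed, not executed.

```python
import shlex


def build_fastp_command(in_r1, in_r2, out_r1, out_r2):
    """Build a fastp argument list using the filtering options quoted in the text."""
    return [
        "fastp",
        "-i", in_r1, "-I", in_r2,    # paired-end inputs (file names are hypothetical)
        "-o", out_r1, "-O", out_r2,  # paired-end outputs (file names are hypothetical)
        "-W", "5",    # sliding-window size for quality trimming
        "-M", "20",   # mean quality a window must reach
        "-q", "15",   # Phred quality at or above which a base counts as qualified
        "-5",         # enable sliding-window trimming from the 5' end
        "-u", "40",   # maximum percentage of unqualified bases allowed per read
        "-n", "0",    # maximum number of N bases allowed per read
        "-l", "75",   # discard reads shorter than this length after trimming
        "-w", "4",    # number of worker threads
    ]


cmd = build_fastp_command(
    "survey_R1.fq.gz", "survey_R2.fq.gz",
    "survey_clean_R1.fq.gz", "survey_clean_R2.fq.gz",
)
print(shlex.join(cmd))
```

Passing the arguments as a list (for example to `subprocess.run(cmd, check=True)`) avoids shell quoting issues if file paths contain spaces.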