Background & Summary

The Tibetan Plateau is the highest plateau on Earth and is also known as the ‘Third Pole’1. Its southernmost and highest region is the Himalayan mountain chain, which has an average width of 200–350 km and a length of 2400 km. Here, the extremely cold and dry climate, combined with low oxygen levels, has given rise to plants2 and animals3, including sheep, with excellent environmental adaptability, forming a unique ecosystem. The term Tibetan sheep broadly refers to all indigenous sheep breeds from the Tibetan Plateau region, which are mainly divided into highland, valley, and oula types4,5. As a highland-type Tibetan sheep breed, the Gamba sheep has been shaped by long-term natural and artificial selection in the northern Himalayas6 and was recognized as a national-level breed in 2022. The breed is renowned for its uniquely flavoured mutton7, which serves as a source of food and livelihood for local people. Despite its excellent adaptation to the harsh high-altitude environment, the breed’s low production efficiency still restricts economic development8,9. Moreover, the lack of systematic genetic characterization has significantly hindered functional gene dissection and genetic improvement in Gamba sheep.

Despite this, research on Tibetan sheep breeds has expanded, focusing on two main topics: genetic evolution and environmental adaptability. For example, Li et al. investigated their high-altitude adaptation10, while Sha et al. focused on ruminal microbiota-host gene interaction11. Based on multi-omics analyses, Han et al. found that the BMPR1B gene affects litter size in Tibetan sheep12, while Sun et al. investigated the genetic diversity and population structure of 11 Tibetan sheep populations13. The existing resequencing data of Tibetan sheep in public databases mainly stem from the hinterland of the Tibetan Plateau. As the national breed raised in the highest-altitude county in China, Gamba sheep hold significant value for studies of livestock genetic evolution and environmental adaptability. However, whole-genome sequence data for the breed have so far been lacking.

Next-generation sequencing technology enables high-throughput whole-genome sequencing (WGS), which has wide applications in domestic animals, including tracing a species’ evolutionary trajectory or domestication history14,15,16,17, assessing species-level biodiversity18, and identifying genes associated with economic traits19,20,21. Genomic analyses of samples from diverse ecological zones can elucidate the genetic underpinnings of local environmental adaptation. Beneficial genotypes identified in this way can then be applied in breeding programs to improve productive breeds for superior performance under local environmental conditions.

Here we present new WGS data from 301 indigenous Tibetan sheep, encompassing 12.3 Tb of raw sequence data. These samples were sourced from the Himalayan region, specifically from five townships in Gamba County, Tibet Autonomous Region. Given the extreme environmental conditions of the sampling sites, with grazing areas situated at altitudes of up to 5000 meters, this dataset constitutes the largest and highest-altitude WGS dataset generated from Tibetan sheep. The average sequencing depth of 13.8X ensures statistical power for genomic analyses. The sequencing data were aligned to the Ovis aries reference genome ARS-UI_Ramb_v3.0 (ref. 22; https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_016772045.2/). After variant calling and variant filtration, Variant Call Format (VCF) files comprising a total of 39,718,985 single nucleotide polymorphisms (SNPs) and 5,275,473 insertions/deletions (InDels) were obtained.

The data generated from this study are expected to aid in (1) assessing genetic diversity to evaluate breed variability (e.g., calculating metrics such as nucleotide diversity and observed/expected heterozygosity); (2) identifying signatures of positive selection (e.g., comparing Gamba sheep with other Tibetan sheep); (3) exploring genome-environment associations to study environmental adaptation (e.g., comparing high-altitude and low-altitude sheep breeds); (4) exploring genome-trait associations for phenotypic and production traits (e.g., conducting genome-wide association studies); (5) checking genetic variants in regions of interest (e.g., examining candidate genes or QTLs associated with production traits); (6) constructing the pan-genome of Tibetan sheep breeds; (7) developing SNP genotyping chips for use in breeding programs; and (8) studying the domestication histories of sheep in conjunction with sequencing data from other countries/regions. Overall, the large-scale whole-genome sequencing of Tibetan sheep from the Himalayan region provides a valuable resource for understanding the genetics of Tibetan sheep and enriching global sheep genome sequence databases.
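As an illustration of use case (1), the per-site diversity metrics named there can be sketched as follows. This is a minimal sketch, not the authors' analysis: it assumes biallelic SNPs in diploid samples with genotypes already decoded to alternate-allele counts (0, 1, 2, or None for missing), and the function name `site_diversity` is hypothetical.

```python
from typing import Dict, Optional, Sequence


def site_diversity(genotypes: Sequence[Optional[int]]) -> Dict[str, float]:
    """Per-site diversity for one biallelic SNP.

    genotypes: alternate-allele count per diploid sample (0, 1 or 2);
    None marks a missing genotype and is ignored.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        raise ValueError("no called genotypes at this site")

    chromosomes = 2 * n
    p_alt = sum(called) / chromosomes           # alternate allele frequency
    h_obs = sum(1 for g in called if g == 1) / n  # observed heterozygosity
    h_exp = 2 * p_alt * (1 - p_alt)             # expected under Hardy-Weinberg
    # Per-site nucleotide diversity with the usual finite-sample correction.
    pi = h_exp * chromosomes / (chromosomes - 1)
    return {"alt_freq": p_alt, "h_obs": h_obs, "h_exp": h_exp, "pi": pi}


if __name__ == "__main__":
    print(site_diversity([0, 1, 1, 2, None]))
```

Averaging `pi` over all sites in a window (including invariant sites, which contribute zero) gives a windowed nucleotide diversity estimate.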

Methods

Sample collection

This experiment was approved by the Animal Care and Use Committee of Southwest University (No. IACUC-20240920-09). A total of 301 Gamba sheep aged 1–5 years (no breeding records were available, so age was determined by the experimenter based on body size and tooth wear) were selected from five towns (Gamba Town, n = 20; Longzhong Town, n = 100; Changlong Town, n = 52; Zhike Town, n = 35; Kongma Town, n = 94) of Gamba County, Tibet, China (Fig. 1a,b, Supplementary Table S1). All sheep were grazed naturally and supplemented with dry hay in barns. From each sheep, 3–5 mL of jugular venous blood was sampled by a team of trained personnel. Blood samples were anticoagulated with EDTA and stored at −20 °C for subsequent processing.

Fig. 1

Tibetan sheep WGS sampling locations and study workflow. (a) The five sampling sites were situated in the Himalayan mountainous region, specifically on the border between China’s Gamba and India’s Sikkim. (b) The Gamba sheep flock from which samples were collected was housed in pens at an altitude of 4700 meters; during grazing, however, the flock can reach elevations as high as 4800–5000 meters. (c) Workflow: Blood samples were collected from 301 Gamba sheep and subjected to whole-genome sequencing, generating a total of 12.3 Tb of sequencing data. The genome processing pipeline encompassed quality control, read mapping, variant calling, variant filtering, and variant annotation.

Whole genome sequencing

The 301 blood samples were prepared for DNA extraction using the genomic DNA extraction kit (QT-1001, IGENEBOOK, China). Before library construction, the concentration, integrity, and purity of each genomic DNA sample were assessed using 1.5% agarose gel electrophoresis and a NanoDrop spectrophotometer (Thermo Scientific, USA). Qualified genomic DNA samples were randomly broken into fragments of about 150 bp, end-repaired, and ligated to adapters, followed by purification with magnetic beads and PCR amplification using the KAPA HiFi HotStart DNA Polymerase (Kapa Biosystems, USA). Then, the libraries were denatured to single-stranded DNA, circularized, digested to linear DNA, and quantified using a Qubit Fluorometer (Thermo Scientific, USA). The qualified DNA library was sequenced by the IGENEBOOK company (Wuhan, China) using the DNBSEQ-T7 platform.

Genomic alignment and variant calling

This study employed a genomic analysis pipeline comprising quality control, read mapping, variant calling, variant filtering, and variant annotation (Fig. 1c). The raw FASTQ files were quality-controlled using fastp (v0.23.4; ref. 23), and sequence alignment and variant detection were performed using the Sentieon Genomics software (v202308; refs. 24,25). In brief, the clean reads were aligned to the sheep reference genome (ARS-UI_Ramb_v3.0; ref. 22, https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_016772045.2/) using bwa (v0.7.17; ref. 26). The BAM files were sorted and duplicates were marked using the Picard package (v2.25, https://broadinstitute.github.io/picard). The mapping rate and genome coverage were calculated using Samtools (v1.13; ref. 27) and PanDepth (v2.25; ref. 28), respectively. The Sentieon Haplotyper module was used to call variants for each sample independently, generating one genomic Variant Call Format (gVCF) file per sample. Joint variant calling was then carried out on the 301 gVCF files with the Sentieon GVCFtyper module to create a single multi-sample VCF file. The SNP and InDel variants were extracted and hard-filtered using the SelectVariants module in GATK (v4.1.8.1; ref. 29), and were then further filtered using VCFtools (v0.1.16; ref. 30).
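The hard-filtering step can be illustrated with a small sketch that evaluates threshold rules against the INFO column of a VCF record. The thresholds the authors applied are not stated in this section, so the values below are an assumption: they are the generic SNP thresholds suggested in GATK's hard-filtering documentation. The names `SNP_FAIL_RULES`, `parse_info`, and `failed_filters` are hypothetical, and skipping annotations that are absent from a record is a choice made for this sketch.

```python
from typing import Callable, Dict, List

# ASSUMPTION: generic GATK-documented SNP thresholds, not necessarily those
# used for this dataset. Each rule returns True when the record FAILS.
SNP_FAIL_RULES: Dict[str, Callable[[float], bool]] = {
    "QD": lambda v: v < 2.0,               # Quality by Depth
    "FS": lambda v: v > 60.0,              # Fisher Strand
    "SOR": lambda v: v > 3.0,              # Strand Odds Ratio
    "MQ": lambda v: v < 40.0,              # Mapping Quality
    "MQRankSum": lambda v: v < -12.5,      # Mapping Quality Rank Sum Test
    "ReadPosRankSum": lambda v: v < -8.0,  # Read Position Rank Sum Test
}


def parse_info(info: str) -> Dict[str, float]:
    """Extract single-valued numeric annotations from a VCF INFO string."""
    values: Dict[str, float] = {}
    for item in info.split(";"):
        if "=" not in item:
            continue  # flag-type entries carry no value
        key, raw = item.split("=", 1)
        try:
            values[key] = float(raw)
        except ValueError:
            pass  # multi-valued or non-numeric entries are ignored
    return values


def failed_filters(info: str, rules=SNP_FAIL_RULES) -> List[str]:
    """Return the names of the rules a record fails (empty list = pass)."""
    ann = parse_info(info)
    return [name for name, fails in rules.items()
            if name in ann and fails(ann[name])]


if __name__ == "__main__":
    print(failed_filters("AC=2;QD=1.5;FS=3.2;MQ=60.0;SOR=0.9"))
```

Keeping the rules in a dictionary makes it straightforward to swap in a separate InDel rule set, which typically uses different cut-offs.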

Variant annotation and genetic kinship analysis

The filtered SNPs and InDels were functionally annotated using the snpEff software (v5.1; ref. 31), and the numbers of variants located in intronic, untranslated, upstream, downstream, and intergenic regions were calculated. The variant counts and cumulative proportions were calculated to examine variant depth as an indicator of variant quality. The kinship matrix was constructed from genome-wide SNPs using the GEMMA software (v0.98.5; ref. 32) and visualized using the R package heatmap (v1.0.12). The kinship result was used to check whether the samples were independent of each other, as an indicator of sample quality.

Data Records

The raw whole-genome sequencing data of the 301 sheep, in FASTQ format, have been deposited in the Genome Sequence Archive33 on the China National Center for Bioinformation (CNCB) platform under accession number CRA024483 (ref. 34; https://ngdc.cncb.ac.cn/gsa/browse/CRA024483). The variation data generated, including the final SNP.vcf and InDel.vcf files, were deposited in the Genome Variation Map35 on the CNCB platform under accession number GVM001013 (ref. 36; https://ngdc.cncb.ac.cn/gvm/getProjectDetail?project=GVM001013).

Technical Validation

Quality control for sequencing data

High-throughput sequencing of the 301 Gamba sheep yielded 12,289.3 Gb of raw data, with 188 to 593 million reads and a sequencing yield (number of bases generated) of 28.2 Gb to 88.9 Gb per sample. As shown in Table 1 and Fig. 2a, the sequencing depth of the samples varied from 9.5X to 30.5X (averaging 13.8X), with an average GC content of 43.0%. Across samples, 96.0–99.3% of the bases had a Phred-scaled quality score of at least 20 (Q20, sequencing error rate < 0.01) and 88.2–96.8% had a score of at least 30 (Q30, sequencing error rate < 0.001), indicating a high expected base-calling accuracy. The quality reports (Fig. 2b,c) confirmed the overall high quality scores of the sequencing reads. As shown in Fig. 2d,e, the genome coverage and properly mapped rate of the sequence reads against the sheep reference genome (ARS-UI_Ramb_v3.0) averaged 99.6% (range 99.4–99.7%) and 94.0% (range 78.2–98.8%), respectively. Together, these indicators confirmed the high quality of the sequencing data.
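The Q20 and Q30 thresholds follow from the Phred definition, in which a quality score Q corresponds to an error probability of 10^(−Q/10). The sketch below shows that conversion and how a Q20/Q30 fraction could be computed from one FASTQ quality string. It assumes the common Phred+33 ASCII encoding, and the function names are hypothetical rather than part of fastp or any other tool used here.

```python
def phred_to_error(q: float) -> float:
    """Convert a Phred quality score to the expected base-call error rate."""
    return 10 ** (-q / 10)


def fraction_at_least(quality_string: str, threshold: int, offset: int = 33) -> float:
    """Fraction of bases in a FASTQ quality string with score >= threshold.

    ASSUMPTION: Phred+33 encoding (offset=33).
    """
    if not quality_string:
        raise ValueError("empty quality string")
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(1 for s in scores if s >= threshold) / len(scores)


if __name__ == "__main__":
    print(phred_to_error(20), phred_to_error(30))
    # 'I' encodes Q40, '5' encodes Q20, '+' encodes Q10 under Phred+33.
    print(fraction_at_least("II5+", 20), fraction_at_least("II5+", 30))
```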

Table 1 Sequencing data statistics of 301 Tibetan sheep.
Fig. 2

Statistics of the high-throughput sequencing data of 301 Tibetan sheep and their alignment to the sheep reference genome. (a) Boxplots showing the sequencing yield, reads, depth, phred quality scores (Q20 and Q30), and GC content of 301 indigenous Tibetan sheep. (b) Mean quality value across each base position in a 150 bp read. (c) The sequence quality score plot shows that the quality scores of almost all reads fell between 35 and 40. (d) The genomic coverage of sequencing reads on the 26 autosomes of the sheep reference genome. (e) The mapping rate of sequencing reads to the sheep reference genome. Each circle or line represents one sample of the 301 Tibetan sheep dataset.

Quality control of SNP and InDel data

Joint genotyping of all samples originally identified 48,691,893 SNPs and 7,449,956 InDels. To ensure variant quality and minimize false positives, variant filtration was performed using the GATK software29. A series of statistical metrics, including Mapping Quality (MQ), Quality by Depth (QD), Fisher Strand (FS), Strand Odds Ratio (SOR), Mapping Quality Rank Sum Test (MQRankSum), and Read Position Rank Sum Test (ReadPosRankSum), were used to evaluate variant quality. These metrics assess coverage depth and alignment quality at variant positions and detect strand bias, collectively filtering out potential false positives and ensuring accurate variant calling. Variants whose missing genotype rate across all samples exceeded 10% were then removed, and a final set of 39,718,985 SNPs and 5,275,473 InDels was retained (Fig. 3a,b).
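The missing-genotype criterion can be sketched for a single VCF data line as follows. This is an illustrative sketch rather than the authors' command: it assumes tab-separated VCF records with a GT key in the FORMAT column, counts partially missing calls such as `./1` as missing, and uses the hypothetical helper names `missing_rate` and `keep_variant`. VCFtools offers a comparable site-level filter through its `--max-missing` option.

```python
def missing_rate(vcf_line: str) -> float:
    """Fraction of samples with a missing genotype in one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    gt_index = format_keys.index("GT")
    samples = fields[9:]
    if not samples:
        raise ValueError("record has no sample columns")

    missing = 0
    for sample in samples:
        parts = sample.split(":")
        gt = parts[gt_index] if gt_index < len(parts) else "."
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:  # './.', '.|.', '.' and partial calls like './1'
            missing += 1
    return missing / len(samples)


def keep_variant(vcf_line: str, max_missing_rate: float = 0.10) -> bool:
    """Keep a variant unless its missing rate exceeds the threshold."""
    return missing_rate(vcf_line) <= max_missing_rate


if __name__ == "__main__":
    record = "\t".join(["1", "100", ".", "A", "G", "50", "PASS", ".",
                        "GT:DP", "0/1:12", "./.:0", "1/1:9", "0/0:15"])
    print(missing_rate(record), keep_variant(record))
```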

Fig. 3

Statistics of variant depth and annotation in WGS data from 301 Tibetan sheep. After variant annotation, the proportions of SNPs (a) and InDels (b) in each genomic location category were summarized. The cumulative distribution statistics demonstrated the high quality of the SNP (c) and InDel (d) data. Each line represents one sample of the 301 Tibetan sheep dataset.

Summary statistics of SNPs and InDels

High-quality variants were distributed across the genome with an average density of 1 SNP every 67 bases and 1 InDel every 456 bases (Table 2). Although SNPs were approximately seven-fold more numerous than InDels, the two variant types showed similar distributions across annotation classes. As shown in Fig. 3a,b, most SNPs and InDels were located in intronic regions, accounting for about 70% of the total, while only 1.7% of SNPs and 1.2% of InDels were located in exons, and approximately 2% of variants were located in untranslated regions (UTRs). The arithmetic means of the variant counts for SNPs and InDels were 11.8 and 11.6, respectively (Fig. 3c,d), and the cumulative depth distribution plots illustrated the high quality of the identified SNPs and InDels.

Table 2 Statistics of the final SNPs and InDels identified from each chromosome in the sheep genome.

Genetic kinship of all samples

Based on genome-wide SNPs, the relatedness coefficient was calculated for every pair of animals. Figure 4a shows the heatmap of the kinship matrix. No animals were directly related to one another, with 90% and 95% of the relatedness coefficients lower than 0.011 and 0.022, respectively (Fig. 4b). The genetic kinship analysis therefore demonstrated the independence of the samples and the high quality of the samples selected for the whole-genome resequencing dataset.
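To make the kinship computation concrete, the sketch below builds a centered relatedness matrix, K = (1/p) Σ (x_s − mean(x_s))(x_s − mean(x_s))ᵀ over p SNPs, and reports a quantile of the pairwise (off-diagonal) coefficients. It is a sketch under assumptions: the section does not state which GEMMA relatedness option was used, so the centered form is assumed, genotypes are taken to be complete alternate-allele counts, and the function names are hypothetical. The pure-Python loops are for illustration only and would be far too slow for millions of SNPs.

```python
import math
from typing import List, Sequence


def centered_kinship(genotypes: Sequence[Sequence[float]]) -> List[List[float]]:
    """Centered relatedness matrix from a SNP-by-sample genotype table.

    genotypes[s][j] is the alternate-allele count of sample j at SNP s.
    ASSUMPTION: no missing genotypes.
    """
    n_snps = len(genotypes)
    n_samples = len(genotypes[0])
    kin = [[0.0] * n_samples for _ in range(n_samples)]
    for snp in genotypes:
        mean = sum(snp) / n_samples
        centered = [g - mean for g in snp]
        for i in range(n_samples):
            for j in range(i, n_samples):
                kin[i][j] += centered[i] * centered[j]
    for i in range(n_samples):
        for j in range(i, n_samples):
            kin[i][j] /= n_snps
            kin[j][i] = kin[i][j]  # matrix is symmetric
    return kin


def pairwise_quantile(kin: Sequence[Sequence[float]], q: float) -> float:
    """Nearest-rank quantile of the off-diagonal (pairwise) coefficients."""
    n = len(kin)
    values = sorted(kin[i][j] for i in range(n) for j in range(i + 1, n))
    index = max(0, math.ceil(q * len(values)) - 1)
    return values[index]


if __name__ == "__main__":
    toy = [[0, 1, 2], [2, 1, 0]]  # two SNPs, three samples
    k = centered_kinship(toy)
    print(k, pairwise_quantile(k, 0.9))
```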

Fig. 4

Genetic kinship analysis of 301 Tibetan sheep WGS data indicated satisfactory sample independence. (a) Genetic relationship matrix of 301 sheep based on genome-wide SNPs; (b) The histogram frequency distribution of the pairwise kinship coefficients.

Usage Notes

This large-scale whole-genome sequencing dataset was derived from Gamba sheep, a national breed from the Himalayan region. Tibetan sheep is a general term that refers broadly to all indigenous sheep breeds from the Tibetan Plateau, and as one of these breeds, Gamba sheep significantly enrich the global genomic resources of sheep. The study released both raw sequencing data and processed variant files. Notably, the choice of reference genome can affect variant quality and downstream analysis to some extent37,38. The variants presented here were called against the reference genome of the Rambouillet breed (ARS-UI_Ramb_v3.0; ref. 22), although a Tibetan sheep reference genome, CAU_O.aries_1.0 (ref. 39; https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_017524585.1/), was previously assembled by Prof. Li’s team. It may be more appropriate to align data to the CAU_O.aries_1.0 reference when conducting studies focused only on Tibetan sheep. This year, Prof. Li’s team also published T2T-sheep1.0 (ref. 40; https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_040805955.1/), the first telomere-to-telomere (T2T) genome of the renowned Hu sheep breed. Compared to traditional reference genomes, the T2T reference genome fills gaps in repetitive sequences, particularly in complex regions such as centromeres and subtelomeres. These genome versions provide greater flexibility for addressing diverse research objectives related to Tibetan sheep in future studies.