Background & Summary

As one of the three major primitive sheep breeds in China, Tibetan sheep (Ovis aries) have lived on the Qinghai-Tibet Plateau for thousands of years, uniquely adapted to its high-altitude, cold, and strong ultraviolet conditions. Their excellent survival traits also include coarse feed tolerance, strong disease resistance, and robust foraging ability. The Tibetan sheep breed, shaped by long-term natural selection and artificial breeding, serves as a vital livelihood resource for local farmers and herders, and contributes to the sustainable growth and high-quality development of the pastoral economy.

Sheep, which are among the earliest domesticated animals, have maintained a close relationship with humans, especially nomadic people. Research indicates that Tibetan sheep originated from ancient northern Chinese sheep around 3,100 years ago1, diverging approximately 2,000–2,600 years ago2. A small group of Tibetan sheep continued to expand southwestward, reaching central Tibet about 1,300 years ago1, while the remaining populations settled across various regions of Qinghai, gradually adapting to local geographical conditions and evolving into distinct breeds. Statistics show that China’s Tibetan sheep population stands at 32.5 million head, accounting for 11% of the total sheep population3,4. However, because of environmental constraints, the Tibetan breed has long been trapped in a cycle of low-level development and low-efficiency production. Despite the publication of the Tibetan sheep genome sequence5,6, most research remains focused on high-altitude adaptability, mutton quality, and nutrient metabolism7,8,9. Omics studies have revealed that significant convergent evolution of the EPAS1 and EGLN1 genes has contributed to the breed’s adaptability to high-altitude environments, and have identified additional novel adaptive genes10,11,12. Plateau adaptability is a complex, polygenic trait13,14,15, with distinct local adaptation mechanisms observed among different Tibetan sheep populations and subtypes inhabiting varying altitudes. Furthermore, the lack of phenotypic and genomic data has hindered efforts to genetically improve key economic traits in Tibetan sheep. To bridge productivity gaps and accelerate breeding progress, it is imperative that we elucidate the genetic mechanisms underlying the formation of important economic traits in these sheep.

Whole-genome sequencing (WGS) has become a standard tool in livestock genetic breeding research, enabling the detection of genome-wide single nucleotide polymorphisms (SNPs), insertions and deletions (InDels), copy number variations (CNVs), and structural variations (SVs). This approach enables the identification of causal variations related to growth, reproduction, adaptability, and disease resistance. Here, we provide WGS data from 220 Tibetan sheep across 11 populations spanning an altitudinal gradient of 2,887 m to 4,643 m, marking the most comprehensive collection of whole-genome sequences from this breed to date. After aligning the sequencing data with the Tibetan sheep reference genome, a total of 21,099,381 high-quality SNPs were identified. We anticipate that this dataset will play an important role in assessing genetic diversity, gene flow, and regions of positive selection, as well as identifying candidate genes associated with economic traits in the Tibetan sheep population.

Methods

Sample collection

All animal experiments were performed under the guidance of ethical regulations from the Institutional Animal Care and Use Committee of Lanzhou Institute of Husbandry and Pharmaceutical Science, Chinese Academy of Agricultural Sciences (Approval No. NKMYD201805; Approval Date: 18 October 2018). For this WGS analysis, we selected 11 Tibetan sheep populations that are grazed year-round in different agroclimatic zones, representing the diverse environments of the Qinghai-Tibet Plateau (Table 1). Twenty unrelated adult samples were collected from each population, with 5 mL of blood drawn from the jugular vein before morning grazing and stored in EDTA tubes at −20 °C.

Table 1 Details of Tibetan sheep populations.

DNA extraction and quality control

The blood samples were thawed at room temperature for 30 min, and genomic DNA was extracted using a blood genomic DNA extraction kit (TIANGEN, Beijing, China), in accordance with the manufacturer’s instructions. Agarose gel (1%) electrophoresis was used to detect DNA degradation and contamination in the samples. DNA purity was assessed using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, MA, USA), and DNA concentrations were determined using a Qubit® 3.0 Fluorometer (Invitrogen, CA, USA). Qualifying DNA samples were sent to Guangzhou GeneDenovo Biotechnology Co., Ltd. (Guangzhou, China) for WGS.

Library preparation and sequencing

Following the manufacturer’s instructions, sequencing libraries for all samples were generated using the library construction kit from Illumina (CA, USA). Appropriate amounts of DNA were enzymatically fragmented into short segments, end-repaired, and dA-tailed prior to ligation of sequencing adapters. The DNA fragments were purified using AMPure XP beads (Merck, Shanghai, China), and fragments in the range of 300–400 bp were selected for PCR amplification. The size and concentration of the libraries were measured using a Qubit® 3.0 Fluorometer and an Agilent 2100 Bioanalyzer (Agilent, CA, USA). The effective concentration of each library was accurately quantified using the Bio-Rad CFX96 Real-Time PCR Detection System (Bio-Rad, CA, USA). Libraries that passed quality control were sequenced on the HiSeq X10 PE150 platform (Illumina, CA, USA).

Sequence data pre-processing and mapping

Raw image data obtained from sequencing were converted into raw sequencing reads through base calling, and the results were stored in FASTQ file format. The fastp software (v0.23.4) was used for quality control of the sequencing data, including filtering low-quality reads, trimming low-quality bases from the 3’ end, and removing adapter sequences. Additionally, statistics on quality score distribution, GC content, error rate distribution, and N content were generated.

High-quality filtered reads were aligned to the Tibetan sheep reference genome (GCA_017524585.1) using the BWA-MEM algorithm (v0.7.17-r1188). The resulting Binary Alignment Map (BAM) files were sorted using Samtools (v1.17), and PCR duplicate reads were marked using the MarkDuplicates module in the Genome Analysis Toolkit (GATK, v4.5.0.0).

Variant calling, filtering and annotation

As shown in Fig. 1, variant calling and filtering were performed using GATK (v4.5.0.0). For SNP calling, Genomic Variant Call Format (GVCF) files were generated using the HaplotypeCaller module with the “-ERC GVCF” option. To improve scalability and accelerate joint genotyping, GVCF files were consolidated into a GenomicsDB datastore. Next, the GenotypeGVCFs module was used for joint genotyping to produce population-level Variant Call Format (VCF) files. Biallelic SNPs were obtained using the SelectVariants module with the “-select-type SNP” and “--restrict-alleles-to BIALLELIC” parameters. To reduce false-positive SNPs, we applied the VariantFiltration module with the following quality control parameters: --filter-name “QD_filter” -filter “QD < 2.0”; --filter-name “FS_filter” -filter “FS > 60.0”; --filter-name “MQ_filter” -filter “MQ < 40.0”; --filter-name “SOR_filter” -filter “SOR > 3.0”; --filter-name “MQRankSum_filter” -filter “MQRankSum < -12.5”; --filter-name “ReadPosRankSum_filter” -filter “ReadPosRankSum < -8.0”; and --cluster-size 3 --missing-values-evaluate-as-failing. Additional SNP filtering was performed using PLINK (v1.9): SNPs with a call rate < 0.1 or a minor allele frequency (MAF) < 0.01 were excluded, and only SNPs on autosomes were retained. The remaining SNPs were annotated based on genomic position using ANNOVAR.
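The hard-filter thresholds listed above can be read as a set of per-site predicates. The following Python sketch is illustrative only: it does not call GATK, the function name `failing_filters` and the dictionary layout are hypothetical, and the cluster filter is omitted. It treats a missing annotation as failing, which is how this sketch interprets the intent of the missing-values option.

```python
import operator

# Thresholds copied from the VariantFiltration settings described in the text.
# Each entry: filter name -> (INFO annotation, comparison, threshold).
HARD_FILTERS = {
    "QD_filter": ("QD", operator.lt, 2.0),
    "FS_filter": ("FS", operator.gt, 60.0),
    "MQ_filter": ("MQ", operator.lt, 40.0),
    "SOR_filter": ("SOR", operator.gt, 3.0),
    "MQRankSum_filter": ("MQRankSum", operator.lt, -12.5),
    "ReadPosRankSum_filter": ("ReadPosRankSum", operator.lt, -8.0),
}


def failing_filters(info):
    """Return the names of the hard filters that a site fails.

    `info` is a hypothetical dict of INFO annotations for one SNP.
    A missing annotation counts as failing (an assumption of this sketch).
    """
    failed = []
    for name, (key, compare, threshold) in HARD_FILTERS.items():
        value = info.get(key)
        if value is None or compare(value, threshold):
            failed.append(name)
    return failed


# Made-up annotation values, for illustration only.
good_site = {"QD": 25.1, "FS": 1.2, "MQ": 60.0, "SOR": 0.7,
             "MQRankSum": 0.0, "ReadPosRankSum": 0.3}
low_qd_site = dict(good_site, QD=1.5)

print(failing_filters(good_site))    # no filter is triggered
print(failing_filters(low_qd_site))  # only the QD filter is triggered
```

A site is retained only when the returned list is empty, which corresponds to a PASS record after filtering.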

Fig. 1 Overview of the sequence alignment, variant calling and variant filtration process.

SNP validation

To assess the accuracy of identified SNPs, variants were compared against public datasets from the Database of Single Nucleotide Polymorphisms (dbSNP; https://ftp.ncbi.nih.gov/snp/organisms/archive/sheep_9940/VCF/) and the iSheep database (https://ngdc.cncb.ac.cn/isheep/download). To convert the physical coordinates of these public datasets to the reference genome used in this study, we employed the LiftOver tool from the University of California Santa Cruz16. The overlap between the variants identified in this study and the public datasets was then calculated to determine the proportion of newly discovered SNPs.
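Once all datasets share the same coordinate system, the overlap calculation reduces to set operations. The sketch below is a minimal illustration, assuming each SNP is keyed by chromosome, position, reference allele and alternate allele; the text does not specify the matching key, and the function name `novel_fraction` and the toy records are hypothetical.

```python
def novel_fraction(study_snps, *public_sets):
    """Return (fraction, set) of study SNPs absent from every public set.

    Each SNP is a (chromosome, position, ref, alt) tuple, assumed to be
    expressed on the same reference genome after coordinate conversion.
    """
    known = set().union(*public_sets)
    novel = study_snps - known
    return len(novel) / len(study_snps), novel


# Toy records for illustration only; they are not real variants.
study = {("1", 100, "A", "G"), ("1", 250, "C", "T"),
         ("2", 40, "G", "A"), ("3", 77, "T", "C")}
dbsnp_like = {("1", 100, "A", "G"), ("2", 40, "G", "A")}
isheep_like = {("1", 250, "C", "T"), ("5", 9, "A", "C")}

fraction, novel = novel_fraction(study, dbsnp_like, isheep_like)
print(f"novel SNPs: {len(novel)} ({fraction:.1%})")
```

Matching on position alone, rather than on position plus alleles, would classify fewer SNPs as novel, so the choice of key should be stated alongside any reported proportion.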

Data Records

Whole-genome sequence data (FASTQ format) from the 220 Tibetan sheep samples representing the 11 populations analyzed here have been deposited in the NCBI Sequence Read Archive (SRA) and have been assigned BioProject accession number PRJNA1138910 (https://www.ncbi.nlm.nih.gov/sra/SRP527227)17. The final VCF files have been deposited in the European Variation Archive (EVA) under accession number PRJEB10094218.

Technical Validation

Quality control of sequencing data

Quality control of raw WGS data is the foundation for ensuring accuracy and reliability in downstream analysis19. For each individual, we obtained 13.78–24.29 Gb of sequenced bases (average: 16.75 Gb), with 85.32%–94.27% (average: 91.86%) of the bases having a Phred-scaled quality score of at least 30 (Fig. 2, Table 2). The sequencing depth ranged from 5.10X to 9.00X (average: 6.20X). Increasing the sequencing depth improves genome coverage and variant detection rates, and high-depth resequencing (30X) is considered the “gold standard”20. However, when funding is limited, sequencing fewer samples at high depth may not provide adequate detection of all genetic variations19,21. Indeed, a WGS study of pigs found that sequencing depths below 4X resulted in more false-positive variants, indicating that 4X is the lower limit for sequencing quality22. In this study, the average sequencing depth was 6.20X, which is above the 4X threshold. Additionally, with low-depth sequencing, a larger sample size reduces the false-positive rate in variant detection23; thus, our sample size of 220 individuals was considered sufficient to accurately identify genetic variations in the Tibetan sheep genome. Although high-depth sequencing can yield more information per individual, recent studies suggest that low-depth sequencing is a more effective approach for large sheep populations24,25,26.
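The relationship between sequencing yield and depth can be checked with simple arithmetic: mean depth is approximately the number of sequenced bases divided by the genome size. The sketch below assumes a genome size of about 2.70 Gb, a value inferred here from the ratio of the reported yields to the reported depths and not stated in the text; the function name is hypothetical.

```python
# Assumption: genome size inferred from reported yield / reported depth.
ASSUMED_GENOME_SIZE_GB = 2.70


def mean_depth(bases_gb, genome_size_gb=ASSUMED_GENOME_SIZE_GB):
    """Approximate mean sequencing depth (X) from total sequenced bases."""
    return bases_gb / genome_size_gb


# Yields reported in the text (Gb per individual).
for label, bases_gb in [("minimum", 13.78), ("average", 16.75), ("maximum", 24.29)]:
    print(f"{label}: {mean_depth(bases_gb):.2f}X")
```

Under this assumption, the three yields correspond to roughly 5.10X, 6.20X and 9.00X, in line with the depths reported above.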

Fig. 2 Boxplots showing the average raw base (A), raw Q30 (B) and sequencing depth (C) for Tibetan sheep samples.

Table 2 Summary of whole genome resequencing data.

Fastp enables rapid data preprocessing and quality control of high-throughput sequencing data27. As shown in Fig. 3, MultiQC was used to integrate fastp results and generate quality reports. The duplicate read rate serves as an indicator of the quality of sequencing data, with lower rates indicating better data quality. In this study, the average duplicate read rate was 22.84% and the average unique read rate was 77.16% (Fig. 3A). However, distinct peaks were observed in certain regions of some individual samples, possibly attributable to PCR over-amplification during sequencing28. Figure 3B shows the average quality score for each base position, which was maintained at ~35, indicating very high sequencing quality. Similarly, the per-sequence quality score was consistently ~35, further demonstrating the high quality of the sequencing data (Fig. 3C). The GC content across all samples showed a stable distribution (average: 44.60%, Fig. 3D), suggesting no exogenous genome contamination during sequencing29.

Fig. 3 Quality control metrics from FastQC analysis of sequencing data. (A) Unique and duplicated sequence counts. (B) Mean quality score at each base position. (C) Per sequence quality score. (D) Per sequence GC content.

Quality control of SNP data

Using the HaplotypeCaller function in GATK30, a total of 235,803,940 raw SNPs were identified in the 11 Tibetan sheep populations. After low-quality SNPs were excluded using the VariantFiltration function in GATK, 213,867,729 SNPs remained. Finally, SNPs with a minor allele frequency (MAF) < 0.05 or a missing rate > 10% were removed, leaving 21,099,381 SNPs for subsequent analyses. As shown in Fig. 4, we identified ~4.58 million SNPs (21.704%) that have not been reported previously in dbSNP (https://ftp.ncbi.nih.gov/snp/organisms/archive/sheep_9940/VCF/) or the iSheep database (https://ngdc.cncb.ac.cn/isheep/download), which could be due to prior underrepresentation of the sheep breeds studied here. Analysis of SNP counts by mutation type across populations showed the G:C → A:T mutation to be the most prevalent (Fig. 5).
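The frequency and missingness criteria can be illustrated per site. The following Python sketch is not the PLINK implementation: it assumes genotypes coded as the number of alternate alleles (0, 1 or 2) with `None` for a missing call, and the function names and default thresholds (taken from this paragraph) are illustrative.

```python
def site_stats(genotypes):
    """Return (missing_rate, maf) for one biallelic SNP.

    `genotypes` holds alternate-allele counts (0, 1, 2) or None if missing.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return missing_rate, 0.0
    alt_freq = sum(called) / (2 * len(called))
    return missing_rate, min(alt_freq, 1 - alt_freq)


def keep_site(genotypes, min_maf=0.05, max_missing=0.10):
    """Keep a SNP unless its MAF is too low or its missing rate too high."""
    missing_rate, maf = site_stats(genotypes)
    return maf >= min_maf and missing_rate <= max_missing


# Toy genotype vectors for ten animals, for illustration only.
common = [0, 1, 1, 2, 0, 1, 0, 2, 1, 0]
rare = [0] * 10
sparse = [0, 1, None, None, 2, 1, 0, 1, 0, 1]

for name, gts in [("common", common), ("rare", rare), ("sparse", sparse)]:
    print(name, site_stats(gts), keep_site(gts))
```

In this toy input, the monomorphic site is dropped by the MAF criterion and the site with two missing calls out of ten is dropped by the missingness criterion.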

Fig. 4 Venn diagrams for novel variants detected in 11 Tibetan sheep populations.

Fig. 5 Statistics for the SNP number of different mutation types.

Variant detection accuracy can be assessed by the ratio of transitions (Ti) to transversions (Tv)31,32. If all substitutions were equally likely, Ti/Tv would be expected to be 0.5, because each base can undergo one transition but two transversions; however, this is rarely observed. A typical Ti/Tv ratio for whole-genome analysis is ~2.0–2.1, while novel variants generally show a ratio of ~1.5. In this study, the observed Ti/Tv ratio was 2.56, suggesting high SNP calling accuracy; ratios exceeding 4 may indicate artifacts33.
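The 0.5 baseline follows directly from enumerating the twelve possible single-base substitutions. The Python sketch below classifies substitutions and computes the ratio; the function names are illustrative and the input is simply a list of (reference, alternate) base pairs.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Ti/Tv ratio for an iterable of (ref, alt) pairs with ref != alt."""
    snps = list(snps)
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    if tv == 0:
        raise ValueError("no transversions: Ti/Tv is undefined")
    return ti / tv


# All 12 ordered substitutions: 4 transitions and 8 transversions.
all_substitutions = list(permutations("ACGT", 2))
print(ti_tv_ratio(all_substitutions))  # 0.5
```

Observed ratios well above 0.5 therefore reflect the higher rate at which transitions arise in real genomes, whereas random calling errors tend to pull the ratio towards 0.5.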

In the quality control of genomic data, the heterozygous-to-homozygous (Het/Hom) ratio is used to assess the genetic diversity of samples. Under the assumption of Hardy-Weinberg equilibrium, the expected Het/Hom ratio in human genomic data is 2.0 (ref. 29). In this study, the Het/Hom ratio was 0.92, possibly reflecting inbreeding in Tibetan sheep, which increases homozygosity. Consistent with this interpretation, a genomic evaluation of inbreeding coefficients in the 11 populations revealed severe inbreeding in some34, suggesting that the low ratio reflects population history rather than poor sequence quality.

The SNP density can reflect both the genetic diversity of samples and the distribution of variations in the genome. In this study, there was an average of one SNP every 125.62 bp, with the highest densities observed on the sex chromosomes and more uniform distribution on the autosomes (Table 3). To understand the functions of these SNPs, annotation was performed using the ANNOVAR software35. Most SNPs were found to be distributed in intronic and intergenic regions (Table 4).

Table 3 Summary statistics of SNPs in each chromosome.
Table 4 Annotation result for SNPs in the Tibetan sheep populations.

Polymorphism information content (PIC) is an indicator used to measure the polymorphism of genetic markers, reflecting the diversity of alleles at a locus. Our analysis showed the highest PIC value in the Tao sheep (TS) group and the lowest in the Zashijia sheep (ZSJ) group (Fig. 6A). Nucleotide diversity (π) measures the degree of nucleotide variation, defined as the average number of nucleotide differences per site between two sequences drawn from a population. In our analysis, the Tianjun white Tibetan sheep (WT) group had the highest π value, while the Zashijia sheep (ZSJ) group had the lowest (Fig. 6B).
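For a biallelic SNP, both statistics can be written in terms of the allele frequency. The sketch below uses the standard textbook forms, PIC = 1 − (p² + q²) − 2p²q² and per-site π = 2pq·n/(n − 1) for n sampled chromosomes; the text does not state which software or estimator was used, so these formulas and the function names are assumptions for illustration.

```python
def pic_biallelic(p):
    """Polymorphism information content for a biallelic locus.

    `p` is the frequency of one allele; q = 1 - p is the other.
    """
    q = 1.0 - p
    return 1.0 - (p**2 + q**2) - 2.0 * p**2 * q**2


def pi_site(alt_count, n_chromosomes):
    """Unbiased per-site nucleotide diversity for a biallelic SNP.

    `alt_count` is the number of alternate alleles among `n_chromosomes`.
    """
    p = alt_count / n_chromosomes
    return 2.0 * p * (1.0 - p) * n_chromosomes / (n_chromosomes - 1)


# Twenty diploid animals per population give 40 sampled chromosomes.
n = 40
for alt_count in (2, 10, 20):
    p = alt_count / n
    print(f"p={p:.2f}  PIC={pic_biallelic(p):.4f}  pi={pi_site(alt_count, n):.4f}")
```

Both quantities peak at p = 0.5 for a biallelic site (PIC = 0.375), so population-level means are driven by how many SNPs segregate at intermediate frequencies.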

Fig. 6 Estimation of genomic PIC (A) and π (B) based on SNPs of 11 Tibetan sheep populations. Each bar represents a Tibetan sheep population, and the data are presented as mean ± standard deviation.