Background & Summary

Sheep body size is closely correlated with meat production1, fat deposition2, and reproductive performance3. As global meat consumption continues to increase, larger body size brings higher profits to sheep farmers4. Hu sheep have excellent reproductive performance, along with other advantages such as rapid early growth and development, low fat content and high carcass yield5. In recent years, increased demand for mutton and decreased demand for sheepskin have created a need to upgrade the Hu sheep industry6. However, Hu sheep are smaller in stature than other meat sheep breeds and therefore require improvement. Sheep body size is influenced by various factors, especially genetics7. Whole genome sequencing reveals a large number of genetic variants, which helps to explore the molecular genetic mechanisms underlying body size traits in Hu sheep.

Most studies on economic traits in sheep have focused only on single nucleotide polymorphisms (SNPs) and neglected structural variations (SVs). Compared to SNPs, SVs have a more direct impact on phenotype and can explain more complex genetic variation8. Currently, research on sheep SVs is mainly focused on evolution and development9,10,11, and few studies have systematically analyzed the effects of SVs on economic traits in sheep. Therefore, the effects of SVs on body size traits in Hu sheep need to be further investigated. Quantitative trait loci (QTL) and selection signatures have been widely used in the study of livestock traits. QTL, genome-wide association study (GWAS) and selection signature analyses complement each other to identify candidate loci more accurately12. In a previous study, we explored the evolutionary history of Hu sheep in conjunction with other Mongolian sheep breeds6.

In this study, we conducted whole genome sequencing on 300 Hu sheep with an average sequencing depth of 16.51X, obtaining a total of 9.53 T of high-quality data. Five body size traits were recorded: body weight (BW), body length (BL), body height (BH), chest circumference (CC) and cannon bone circumference (CBC). Using these traits, we performed separate GWAS based on SNPs and on SVs. Furthermore, we analyzed the selection signatures of SNPs and SVs in the Hu sheep population. This dataset contributes to a more comprehensive understanding of genetic variation in Hu sheep and provides new perspectives on the conservation of Hu sheep genetic resources.

Methods

Ethics statement

The animal study protocol was reviewed and approved by the Institutional Animal Care and Use Committee of Zhejiang University (ZJU25331).

Hu sheep sample collection

All Hu sheep used in this study (n = 300) were obtained from Yihui Ecological Agriculture Co., Huzhou City, Zhejiang Province, China, and were raised under the same conditions. Each phenotype was measured by the same person to minimize measurement error. BW was measured with an electronic scale. BL was the straight-line distance from the anterior end of the scapula to the posterior end of the sciatic tuberosity; BH was the vertical distance from the highest point of the withers to the ground; CC was the girth of the thorax measured at the posterior end of the scapula; CBC was the circumference of the tibial third of the left forelimb. Table 1 summarizes the statistics for these five traits. Blood samples (2 mL) were collected for DNA extraction.

Table 1 Statistical information of five body size traits.

Library construction and sequencing

DNA was extracted using the magnetic bead method, and the DNA samples were tested for integrity and purity before being accurately quantified. Only DNA samples that passed these tests were used for library construction. Libraries that passed quality testing were then sequenced, and the raw data were used in the subsequent analyses.

Identification of SNPs

Before bioinformatics analysis, the raw data were filtered and quality controlled. Quality control of raw reads was performed with fastp13 to obtain clean reads. The reference genome was ARS-UI_Lamb_v2.014 (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_016772045.1/). BWA15 was used to align the clean reads to the reference genome and samtools16 was used for indexing, producing BAM files. Duplicates were removed from the BAM files using GATK software17. Genome coverage and sequencing depth were calculated for all samples from the BAM files. Subsequently, the SNPs were filtered using PLINK18 with the following command: PLINK --allow-extra-chr --bfile test --chr 1-26 --chr-set 95 --maf 0.05 --geno 0.1 --hwe 0.000001 --out vcf --recode vcf-iid --snps-only just-acgt. Beagle19 was used to impute missing SNP genotypes.
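As an illustration only, the two simplest thresholds in the PLINK command above (--maf 0.05 and --geno 0.1) can be sketched in Python for a single variant. The function name passes_snp_qc and the genotype encoding (alternate-allele counts 0, 1 or 2, with None for a missing call) are assumptions made for this sketch and are not part of the published pipeline; the Hardy-Weinberg filter (--hwe) is deliberately omitted.

```python
from typing import Optional, Sequence


def passes_snp_qc(genotypes: Sequence[Optional[int]],
                  maf_min: float = 0.05,
                  max_missing: float = 0.1) -> bool:
    """Apply a missing-rate and minor-allele-frequency filter to one variant.

    genotypes: alternate-allele count per sample (0, 1 or 2), None if missing.
    This encoding is an assumption of the sketch, not PLINK's file format.
    """
    n_samples = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if n_samples == 0 or not called:
        return False

    # Fraction of samples without a genotype call (compare with --geno).
    missing_rate = (n_samples - len(called)) / n_samples
    if missing_rate > max_missing:
        return False

    # Minor allele frequency among called genotypes (compare with --maf).
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1.0 - alt_freq)
    return maf >= maf_min
```

For example, a variant with one heterozygote among 20 fully genotyped animals has a MAF of 0.025 and would be removed under these thresholds.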

Identification of SVs

Six tools were used to identify SVs: Delly (1.2.6)20, Dysgu (1.6.2)21, GRIDSS2 (2.13.2)22, Manta (1.6.0)23, Wham (1.8.0)24, and Smoove (0.2.8)25. All identified insertions (INS) and deletions (DEL) were clustered and combined. To ensure accurate identification, only SVs supported by more than three tools were retained. Based on the SV breakpoint records, only INSs with fully resolved sequences were retained. The retained INSs and DELs formed the candidate SV set. A pan-genome graph was built from this candidate set, and SV genotyping was performed on all samples with GraphTyper226 software. The genotyped SVs were then filtered to retain those with MAF > 0.01 and a missing rate < 0.3. Beagle19 was used for genotype phasing of SVs with the default parameters.
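The rule of keeping only SVs supported by more than three tools can be sketched as follows. The text does not specify how calls were clustered, so the greedy position-based clustering, the 500 bp tolerance, and the names SVCall and consensus_svs are all assumptions chosen for illustration rather than the authors' procedure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SVCall:
    caller: str   # name of the tool that produced the call
    chrom: str
    pos: int      # breakpoint start position in bp
    svtype: str   # "DEL" or "INS"


def consensus_svs(calls: List[SVCall], max_dist: int = 500,
                  min_callers: int = 4) -> List[List[SVCall]]:
    """Return clusters of calls supported by at least `min_callers` tools.

    min_callers=4 encodes "more than three tools"; max_dist is a
    hypothetical clustering tolerance, not a value from the paper.
    """
    # Only calls of the same type on the same chromosome may cluster.
    groups: Dict[Tuple[str, str], List[SVCall]] = {}
    for call in calls:
        groups.setdefault((call.chrom, call.svtype), []).append(call)

    kept: List[List[SVCall]] = []
    for key in sorted(groups):
        clusters: List[List[SVCall]] = []
        for call in sorted(groups[key], key=lambda c: c.pos):
            # Extend the current cluster if the call is close to its last member.
            if clusters and call.pos - clusters[-1][-1].pos <= max_dist:
                clusters[-1].append(call)
            else:
                clusters.append([call])
        # Count distinct tools, so repeated calls from one tool count once.
        kept.extend(c for c in clusters
                    if len({m.caller for m in c}) >= min_callers)
    return kept
```

Counting distinct tool names rather than raw calls prevents a single tool that reports the same event twice from inflating the support.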

GWAS based on SNPs and SVs

The rMVP27 software was used for GWAS. Because the mixed linear model27 accounts for kinship and population structure, we used this model to conduct GWAS. The model is as follows:

$$y=X\beta +{Z}_{k}{\gamma }_{k}+{\rm{\xi }}+{\rm{e}}$$

where \(y\) is the phenotype vector; \(X\beta \) represents the fixed effects, including population structure, sex, birth year and season, and age at measurement; \({Z}_{k}{\gamma }_{k}\) is the effect of the marker being tested; \(\xi \sim N\left(0,K{\phi }^{2}\right)\) is the polygenic effect, where \(K\) is the marker-inferred kinship matrix; and \(e \sim N\left(0,I{\sigma }^{2}\right)\) is the residual effect. Manhattan and Q-Q plots were made using CMplot software27. The significance threshold for the SNP-GWAS was determined with a permutation test (1000 permutations). The threshold for the SV-GWAS was set at the top 5% of −log10(p-value) values10.
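A minimal sketch of the top-5% rule for the SV-GWAS threshold is given below. The function name top_fraction_threshold and the exact rank convention (the threshold is the smallest score among the top 5% of markers) are assumptions for illustration; the original analysis may have used a different quantile definition.

```python
import math
from typing import Sequence


def top_fraction_threshold(p_values: Sequence[float],
                           fraction: float = 0.05) -> float:
    """Return the -log10(p) cutoff that marks the top `fraction` of markers.

    Markers whose -log10(p) is at or above the returned value fall in the
    top fraction. At least one marker is always retained.
    """
    if not p_values:
        raise ValueError("p_values must not be empty")
    scores = sorted((-math.log10(p) for p in p_values), reverse=True)
    k = max(1, int(len(scores) * fraction))
    return scores[k - 1]
```

With 20 markers, for instance, the cutoff is simply the single largest −log10(p-value), because 5% of 20 is one marker.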

The Animal QTL Database28 (http://www.animalgenome.org/QTLdb) was used for QTL annotation. The QTL enrichment and gene annotation were performed using the GALLO R package29. Functional annotation of candidate genes was performed in the Herbivore Transcriptome Information Resource Database30 (https://yanglab.hzau.edu.cn/HTIRDB#/).

Selection signature analysis

Integrated haplotype score (IHS) analysis was conducted using selscan software31. PLINK was used for runs of homozygosity (ROH) detection. The following parameters were used for the SNP ROH: PLINK --homozyg-window-threshold 0.05 --homozyg-het 1 --homozyg-window-missing 5 --homozyg-snp 50 --homozyg-kb 500 --homozyg-window-het 1 --homozyg-gap 100 --homozyg-density 50 --homozyg-window-snp 50; the following parameters were used for the SV ROH: PLINK --homozyg-window-missing 5 --homozyg-snp 25 --homozyg --homozyg-gap 1000 --homozyg-het 1 --homozyg-density 150 --homozyg-window-threshold 0.05 --homozyg-window-snp 25 --homozyg-window-het 1 --homozyg-kb 1000. Runs of heterozygosity (ROHet) were identified and analyzed using the detectRUNS R package32. SNP ROHet used the following parameters: minSNP = 10, maxGap = 10^6, minLengthBps = 50000, maxOppRun = 3, maxMissRun = 2; SV ROHet used the following parameters: minSNP = 10, maxGap = 10^6, minLengthBps = 200000, maxOppRun = 3, maxMissRun = 2. The top 0.1% of SNP or SV occurrences were used as the hotspot regions for selection signatures.
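The hotspot definition (the top 0.1% of SNP or SV occurrences) can be sketched for one chromosome as follows. The function names roh_occurrence and hotspot_markers, the input layout (sorted marker positions plus a pooled list of run intervals from all animals), and the handling of ties are assumptions of this sketch and not the output format of PLINK or detectRUNS.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate
from typing import List, Sequence, Tuple


def roh_occurrence(marker_positions: Sequence[int],
                   runs: Sequence[Tuple[int, int]]) -> List[int]:
    """Count, for each marker, how many runs (start, end) cover it.

    marker_positions must be sorted in ascending order; run bounds are
    inclusive. A difference array keeps the pass linear in markers.
    """
    diff = [0] * (len(marker_positions) + 1)
    for start, end in runs:
        lo = bisect_left(marker_positions, start)   # first covered marker
        hi = bisect_right(marker_positions, end)    # one past the last
        diff[lo] += 1
        diff[hi] -= 1
    return list(accumulate(diff[:-1]))


def hotspot_markers(marker_positions: Sequence[int], counts: Sequence[int],
                    top_fraction: float = 0.001) -> List[int]:
    """Return positions whose occurrence is within the top `top_fraction`.

    Ties at the cutoff are all kept, so slightly more than the nominal
    fraction can be returned.
    """
    if not counts:
        return []
    k = max(1, int(len(counts) * top_fraction))
    cutoff = sorted(counts, reverse=True)[k - 1]
    return [p for p, c in zip(marker_positions, counts) if c >= cutoff]
```

Adjacent hotspot markers would then be merged into regions; that merging step is left out here because the text does not describe it.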

Data Records

The raw data used in this study are available under GSA accession number CRA017832 in the NGDC database33 (https://ngdc.cncb.ac.cn/gsa/browse/CRA017832). The SNP-VCF and SV-VCF files for this study have been deposited under accession number PRJEB94328 in the European Variation Archive (EVA) at EMBL-EBI34.

Technical Validation

Quality control of genomic data

The average sequencing depth across all samples was 16.51X, and the average mapping rate of sequence reads to the reference genome was 97.66%. The Q30 of clean reads ranged from 86.76% to 95.46%, and the GC content ranged from 41.52% to 44.48% (Fig. 1, Table S1). These indicators confirm the high quality of the sequencing data.

Fig. 1

Boxplot showing the sequencing depth, Q30 content and GC content of Hu sheep samples (n = 300).

Quality control of SNPs and SVs

Following quality control, 23,274,312 SNPs were obtained from the 26 autosomes. The density of SNPs within 1 Mb windows is shown in Fig. 2. Furthermore, 64,759 SVs were obtained, including 42,160 DELs and 22,599 INSs (Fig. 3a,b). All SVs were longer than 50 bp, and the largest was a 62,337 bp DEL.

Fig. 2

The number of SNPs within each 1 Mb window.

Fig. 3

The number and type of SVs. The number of SVs within each 1 Mb window (a). The types of SVs (b).