Background & Summary

Goats, one of the earliest domesticated livestock species, are well known for their adaptability and wide geographic distribution, which made them vital to the development of human civilization1,2. Goats are believed to have been domesticated around 11,000 years ago in the Near East3. Since then, selective breeding has produced various goat breeds with distinct characteristics tailored for meat or milk production. Compared with cattle, dairy goats are easier to manage and require lower initial investment, making them particularly favored by small-scale farmers and pastoralists4. Dairy goat farming has become a popular economic activity in developing areas, where many farmers rely on it to support their livelihoods5. Globally, and especially in developing countries, the production and consumption of goat milk bring significant economic and food-related benefits, partly owing to the health advantages of goat milk. Systematic research on genetics and phenotypes shows promise for improving milk productivity, accelerating breeding progress, and optimizing the economic benefits derived from goats.

Goat milk is the most widely consumed milk from farm animals globally, and its products play an important role in economic viability in many parts of the world, especially in developing countries6,7. Asia and Africa are the primary regions for goat distribution and milk production6. Among the 576 goat breeds, the Saanen is known for having the highest milk production performance in the world7. Many local sub-breeds are the result of crossbreeding Saanens with local goats, including the Banat White in Romania, British Saanen, French Saanen, Israeli Saanen, Russian White, Laoshan and Guanzhong Dairy Goats in China, and Yugoslav Saanen6. As of 2020, Saanen goats and their sub-breeds were found in more than eighty countries (http://www.fao.org/dad-is/zh/)8. Although Saanen dairy goats are a significant genetic resource for many local sub-breeds worldwide, their genetic improvement has been hindered by a lack of systematic characterization at the genetic level.

Whole-genome sequencing and variant analysis play a crucial role in understanding the genetic diversity of dairy goat populations and can facilitate genetic improvements for enhanced milk production9,10. A considerable number of studies have used SNP chips to investigate dairy goats. These include studies on Canadian Alpine (833 individuals) and Saanen goats (874 individuals)11, Sudanese breeds such as Nubian (24 individuals), Desert (24 individuals), Taggar (24 individuals), and Nilotic (24 individuals) goats12, mixed-breed dairy goats (2,381 individuals)13, and New Zealand dairy goats (4,840 individuals)14. In contrast, genome sequencing studies specific to dairy goats remain relatively sparse, with notable examples including the French Alpine goat (44 individuals) and French Saanen goat (37 individuals)15, as well as Saanen (5 individuals)16 and Guanzhong dairy goats (20 individuals)17. Whole-genome sequencing data for Saanen goats are likewise scarce18, with only a small amount publicly accessible in the database19. To bridge this gap, this study conducted whole-genome sequencing on Saanen dairy goats, with the objective of unlocking their potential for genetic and breeding research.

Genomic resources for Saanen goats remain scarce in public databases. This study presents a whole-genome sequencing dataset for 298 Saanen dairy goats, comprising ~18 Tb of raw sequence data. It is by far the largest whole-genome sequencing dataset for dairy goats available in public data resources. By mapping the sequencing data against the updated Saanen goat reference genome (genome assembly ASM4283598v1, https://www.ncbi.nlm.nih.gov/datasets/genome/?bioproject=PRJNA1085880)20, we identified over 14 million SNPs and 1.34 million insertions-deletions (InDels) across chromosomes 1–29. Sequencing was performed at high depth (average 14.6 X), increasing the power and resolution of genomic analyses. To ensure the accuracy of the genetic analysis, stringent experimental and quality control processes were employed. Here, we describe the entire procedure, from raw data to generation of the final variant call format (VCF) file, including the quality control measures applied along the way. The dataset helps fill gaps in the genomic resources of Saanen goats and has various research applications, including mutation detection, exploration of genomic structure and function, and inference of genetic relationships among populations, migration history, and gene flow patterns. Furthermore, it facilitates the identification of candidate genes associated with production traits and the development of SNP genotyping arrays tailored for dairy goat breed identification and breeding purposes. This dataset is therefore a valuable addition to global dairy goat genomic databases and can play a crucial role in studying goat domestication history and population genetics.

Methods

Sample collection

Based on pedigree and other records, 298 healthy, non-littermate female Saanen dairy goats aged 2–3 years were selected. Blood samples were collected from these 298 goats on two farms in Zhejiang and Shaanxi provinces, China. From each individual, 3 mL of blood was drawn from the jugular vein and stored in anticoagulant tubes at −20 °C. All procedures associated with the dairy goats used in this study were approved by the Animal Use Committee of Zhejiang University (No. ZJU20250120).

DNA extraction and quality control

The workflow from sample collection to variant filtering is depicted in Fig. 1. DNA was extracted from blood samples using the CWE9600 Magbead Blood DNA Kit (Cowin Biotech Co., Jiangsu, China), which uses a magnetic bead-based method. DNA quality control included assessing degradation and integrity by 1.5% agarose gel electrophoresis (Biowest agarose, Spain) and evaluating purity and concentration with a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, Waltham, USA).

Fig. 1

General workflow for sample quality control, data processing, and variant filtering. The process is consistent with the variant calling scheme recommended by GATK.

Library construction and sequencing

A universal library was constructed using the NadPrep® DNA library construction kit (Nanodigmbio, Nanjing, China) and the Bioruptor Pico (Diagenode, Belgium), which randomly shears DNA into fragments of approximately 300–350 bp. The fragmented DNA was then subjected to end repair, A-tailing, and adapter ligation. DNA fragments of around 300–350 bp were selected using NadPrep® SP Beads (Nanodigmbio, China). The selected fragments were amplified by PCR and purified with NadPrep® SP Beads to obtain the sequencing libraries. After library construction, initial quantification was performed using a Qubit 3.0 (Invitrogen, USA), and fragment size was analyzed with the Bioanalyzer® (Agilent, USA). Once the expected fragment sizes were confirmed, sequencing was carried out on a DNBSEQ-T7 sequencer (MGI Tech, Shenzhen, China) using a PE150 sequencing strategy.

Data quality control, mapping, and variant calling

The raw reads were subjected to quality control using FASTP v0.23.4 software21. The following criteria were applied: 1) Removal of reads containing adapters; 2) Discarding paired reads if the percentage of ‘N’ bases exceeded 1% of the read length; 3) Elimination of paired reads if more than 50% of the bases had a quality score (Q) of 5 or lower22,23,24,25. Clean reads were aligned to the Saanen goat reference genome (No.: GCA_042835985.1) using the Burrows-Wheeler Aligner (BWA) v0.7.17 software26. The SAM/BAM files generated from the alignment were processed using SAMtools (v1.10)27,28 to calculate sequencing depth, and duplicates were marked and removed using Picard v2.20.1. Variants were called using GATK (v4.1.5)29, and variant filtration was conducted using the Variant Filtration module of GATK. Specific codes for each mapping and variant calling step can be found in the “Code Availability” section. Lastly, VCFtools v0.1.17 (--max-missing 0.2) was used for final variant quality control, filtering out all SNPs with a missing genotype rate exceeding 20% across the samples. After quality control, approximately 14 million biallelic SNPs and 1.34 million biallelic InDels on autosomes were retained for subsequent analysis.
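
Read-filtering criteria 2 and 3 above can be illustrated with a minimal Python sketch. It is not the FASTP implementation and does not model adapter removal (criterion 1); the function names, the Phred+33 decoding helper, and the tuple layout of a read are assumptions introduced only for illustration.

```python
from typing import List, Sequence, Tuple

Read = Tuple[str, Sequence[int]]  # (bases, per-base Phred scores); hypothetical layout


def decode_phred33(quality_string: str) -> List[int]:
    """Convert a FASTQ quality string (Phred+33 encoding) to integer scores."""
    return [ord(char) - 33 for char in quality_string]


def read_fails(bases: str, quals: Sequence[int],
               max_n_frac: float = 0.01,
               low_q: int = 5,
               max_low_q_frac: float = 0.5) -> bool:
    """Return True if one read violates criterion 2 or criterion 3."""
    if not bases or len(bases) != len(quals):
        return True  # empty or malformed record
    n_frac = bases.upper().count("N") / len(bases)
    low_q_frac = sum(q <= low_q for q in quals) / len(quals)
    return n_frac > max_n_frac or low_q_frac > max_low_q_frac


def keep_pair(read1: Read, read2: Read) -> bool:
    """A pair is kept only if both mates pass; otherwise both are discarded."""
    return not (read_fails(*read1) or read_fails(*read2))


good = ("ACGT" * 25, [30] * 100)
two_percent_n = ("NN" + "A" * 98, [30] * 100)
mostly_low_q = ("A" * 100, [5] * 51 + [30] * 49)

print(keep_pair(good, good))           # True
print(keep_pair(good, two_percent_n))  # False
print(keep_pair(mostly_low_q, good))   # False
```

Because the filter acts on pairs, one failing mate removes both reads, which keeps the two FASTQ files synchronized for paired-end alignment.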

Data Records

The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive30 at the National Genomics Data Center31, China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRA017705)32. The variation data have been deposited in the European Variation Archive (EVA, PRJEB86789)33. The correspondence between the goat IDs in the VCF files and the GSA database is shown in Supplementary Table S1.

Technical Validation

Quality control of sequencing data

For each individual, we obtained between 28.01 and 86.83 Gb of raw sequencing data (Fig. 2). On average, 41.18 Gb of raw bases and 41.07 Gb of clean bases were obtained per individual (Table 1). Approximately 92.3% of the data reached a Phred quality score of 30, which corresponds to a base-call accuracy of 99.9%34,35. The average sequencing depth was 14.64 X (Fig. 2). Additionally, we achieved an effective rate of 99.82% and an average genome alignment rate of 99.9% (Table 1). These metrics demonstrate the high quality of the sequencing data in terms of both volume and quality scores36.
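
The link between a Phred score and base-call accuracy, and between clean bases and depth, can be checked with a short sketch. It uses the standard Phred definition Q = −10·log10(p). The depth relation (bases divided by genome length) is a back-of-envelope approximation only, since reported depth is normally computed from aligned reads, and the function names are introduced here for illustration.

```python
def phred_to_error_probability(q: float) -> float:
    """Standard Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def implied_genome_size_gb(clean_bases_gb: float, mean_depth: float) -> float:
    """Approximate genome length implied by depth = bases / genome length."""
    return clean_bases_gb / mean_depth


error_q30 = phred_to_error_probability(30)
print(f"Q30 error probability: {error_q30:.4f}")  # 0.0010
print(f"Q30 accuracy: {1 - error_q30:.1%}")       # 99.9%

# Mean clean bases (41.07 Gb) and mean depth (14.64 X) as reported above
print(f"Implied genome length: {implied_genome_size_gb(41.07, 14.64):.2f} Gb")  # 2.81 Gb
```

The implied length of roughly 2.8 Gb is only a consistency check on the reported averages, not a measurement of the assembly size.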

Fig. 2

Boxplots of the average sequencing depth, raw bases, and raw Q30 for the Chinese Saanen samples (n = 298).

Table 1 Summary of sequencing data. Values are means across all individuals.

Quality control of SNPs and InDels data

A unified analysis strategy was applied to all samples, identifying a total of 29.8 million raw SNPs and 3.49 million raw InDels in the goat population. Low-quality variants were then filtered using the Variant Filtration module in GATK, as detailed previously37. We applied several statistical metrics, including Mapping Quality (MQ), Quality by Depth (QD), Fisher Strand (FS), and Strand Odds Ratio (SOR), to assess variant quality relative to coverage depth, alignment quality at variant positions, strand bias, and differences between reference and alternate alleles in alignment quality and read position. Together, these parameters progressively removed potential false-positive variants, improving the accuracy of variant calling. The depth distribution of SNPs and the distribution of distances between neighbouring SNPs are illustrated in Fig. 3a,b, respectively. Finally, a total of 14,597,388 SNPs and 1,348,195 InDels were identified. On average, there are approximately 6 SNPs per 1 kb and approximately 1 InDel per 2 kb (Table 2). All SNPs and InDels were annotated. Among SNPs, approximately 0.7% were detected in exonic regions, ~19.4% in intronic regions, ~78.3% in intergenic regions, and ~1.6% in Up/Downstream and other small variants (Table 3). For InDels, the distribution was ~0.2% exonic, ~20.2% intronic, ~77.6% intergenic, and ~2.0% Up/Downstream and other small variants (Table 3). The density distributions of SNPs and InDels are depicted in Fig. 4a,b, respectively. Throughout the entire workflow from sequencing to variant filtering, rigorous methods were applied38,39, and all metrics support the high quality of the sequencing results.
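
The logic of hard filtering on MQ, QD, FS, and SOR can be sketched as follows. The thresholds are an assumption: they are the generic starting points for SNP hard filtering suggested in the GATK documentation, used here as placeholders, and the values actually applied in this study are those given in the Code Availability section and may differ. The dictionary and function names are hypothetical.

```python
from typing import Callable, Dict, List

# ASSUMPTION: generic GATK hard-filter starting points for SNPs, not
# necessarily the thresholds used to produce this dataset.
SNP_FAIL_RULES: Dict[str, Callable[[float], bool]] = {
    "QD": lambda value: value < 2.0,    # low call quality relative to depth
    "FS": lambda value: value > 60.0,   # strand bias (Fisher's exact test)
    "MQ": lambda value: value < 40.0,   # low mapping quality
    "SOR": lambda value: value > 3.0,   # strand bias (odds ratio)
}


def failed_filters(info: Dict[str, float]) -> List[str]:
    """Return the names of the metrics on which a variant fails.

    Annotations absent from `info` are skipped rather than treated as failures.
    """
    return [name for name, fails in SNP_FAIL_RULES.items()
            if name in info and fails(info[name])]


def passes(info: Dict[str, float]) -> bool:
    return not failed_filters(info)


clean_site = {"QD": 25.0, "FS": 1.2, "MQ": 60.0, "SOR": 0.8}
biased_site = {"QD": 1.5, "FS": 75.0, "MQ": 60.0, "SOR": 0.8}

print(passes(clean_site))           # True
print(failed_filters(biased_site))  # ['QD', 'FS']
```

Reporting which metrics failed, rather than a bare pass or fail flag, makes it easier to see which threshold removes most of the raw calls.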

Fig. 3

Distribution of SNP depth (a) and of the distance between neighbouring SNPs (b). Different colours indicate different individuals.

Table 2 Summary statistics of SNPs and InDels in each chromosome.
Table 3 SNPs and InDels across different annotation categories.
Fig. 4

Distribution of SNPs and InDels across the whole genome of 298 Chinese Saanen goats. (a) SNP density statistics across the whole genome. (b) SNP and InDel density statistics across the whole genome.

Usage Notes

In the current study, we used whole-genome sequencing coupled with a range of data analysis methods to identify SNPs and InDels across the Saanen dairy goat genome. We aligned the sequencing data to the Saanen goat reference genome (No.: GCA_042835985.1) to ensure the reliability of the generated variant data. Our approach prioritized stringent variant filtering criteria, excluding variants with minor allele frequencies below 0.5%. Although rare and low-frequency variants are often overlooked, they hold significant promise for elucidating the genetic architecture of complex traits in human populations and in diverse plant and animal species. Despite this potential, systematic investigations of these variants remain relatively limited40,41. Notably, our study focused exclusively on autosomal variants and did not analyze variants on sex chromosomes, copy number variations (CNVs), or structural variations (SVs)42. Future investigations could examine these variants in detail to better understand the heritability of traits, thereby uncovering novel genomic loci contributing to phenotypic diversity and disease susceptibility43.
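
For users who wish to re-apply or relax the frequency and missingness filters, the per-site calculation can be sketched as below. This is an illustrative sketch, not the authors' pipeline: it assumes biallelic sites in diploid animals, with each genotype coded as an alternate-allele count (0, 1, or 2) or None when missing, and the function names are hypothetical. The default thresholds are the 0.5% minor allele frequency and 20% missing-genotype limits stated in this paper.

```python
from typing import Optional, Sequence, Tuple


def site_stats(genotypes: Sequence[Optional[int]]) -> Tuple[float, float]:
    """Return (missing_rate, minor_allele_frequency) for one biallelic site."""
    if not genotypes:
        raise ValueError("at least one sample is required")
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return missing_rate, 0.0
    alt_freq = sum(called) / (2 * len(called))  # two alleles per diploid sample
    return missing_rate, min(alt_freq, 1 - alt_freq)


def keep_site(genotypes: Sequence[Optional[int]],
              min_maf: float = 0.005,
              max_missing_rate: float = 0.2) -> bool:
    """Keep a site only if it is common enough and genotyped in enough samples."""
    missing_rate, maf = site_stats(genotypes)
    return missing_rate <= max_missing_rate and maf >= min_maf


n_samples = 298
singleton = [1] + [0] * (n_samples - 1)           # 1 of 596 alleles
three_copies = [1, 1, 1] + [0] * (n_samples - 3)  # 3 of 596 alleles
sparse = [None] * 100 + [1] * 198                 # about 34% missing

print(keep_site(singleton))     # False
print(keep_site(three_copies))  # True
print(keep_site(sparse))        # False
```

With 298 fully genotyped goats (596 alleles), a 0.5% threshold means the minor allele must be observed at least three times. When reproducing the missingness filter with VCFtools, note that its documentation defines `--max-missing` on a 0 to 1 scale where 1 allows no missing data, so the value passed should be checked against the intended limit on missing genotypes.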