Background & Summary

Goats are among the earliest domesticated ruminants. Archaeological and genetic findings have demonstrated that domestic goats were derived from the wild goat, known as the bezoar (Capra aegagrus), in the Fertile Crescent region of Western Asia during the Neolithic Age, approximately 10,000 years ago1,2,3. Goats hold a significant place in many cultures around the world, often being used in various religious and traditional ceremonies4. They are raised worldwide and can adapt to harsh environments such as deserts and semi-deserts5. Goats provide a reliable source of cashmere, meat, milk and skin. They also play a crucial role in sustainable farming and food production in many parts of the world. According to a report by the Food and Agriculture Organization of the United Nations (FAO; www.fao.org/home/en), there are over 1.15 billion goats worldwide, with more than 580 distinct goat breeds. Of the world’s goat population, 94.93% is in Asia and Africa (FAO, 2022), mainly distributed in developing and underdeveloped areas such as China, India, Iran, and Afghanistan.

Cashmere, with its unique qualities, has become a high-end and precious spinning raw material, earning it the reputation of ‘soft gold’ and ‘fiber gem’6. The Inner Mongolian Arbas Cashmere goat is a world-class dual-purpose breed producing cashmere and meat. It was formed through natural selection and artificial breeding and possesses excellent traits such as high cashmere yield, fine cashmere quality, and large body size. Its cashmere was honored with the ‘Chaigneau’ Quality Wool Award by the European Fiber Committee in 1985. Moreover, the Inner Mongolian Arbas goat is an important paternal origin of most of China’s goat breeds, its export is strictly prohibited, and it is classified as a first-class protected species of genetic resources in China.

To date, five high-quality goat genome sequences have been published, but the genome of the cashmere goat is lacking7,8,9,10,11. Among these, the chromosomal-level genome assembled from a male Saanen dairy goat is of the highest quality7. This genome identified approximately two-thirds (20/29) of the proximal and distal centromeric and telomeric repeats of the autosomes. Nonetheless, some autosomes and the X and Y chromosomes in the Saanen genome remain incomplete. The main challenge in these unresolved regions of the genome is the inability to fully assemble highly repetitive sequences12.

In this study, we performed de novo sequencing of a cashmere goat, specifically an economically valuable Inner Mongolian Arbas White Cashmere buck, by integrating 72.89 Gb of PacBio high-fidelity (HiFi) sequences, 299.91 Gb of Oxford Nanopore Technologies (ONT) ultra-long data, 314.29 Gb of high-throughput chromosome conformation capture (Hi-C) data and 226.97 Gb of paired-end sequences (Tables 1, 2). We used a recently published, enhanced four-step assembly strategy13 to generate an assembly of 2.76 Gb with a contig N50 of 95.22 Mb (Table 3). The assembled sequences were anchored onto 29 autosomes and both sex chromosomes (X and Y) (Fig. 1a). Apart from chromosome X, which consists of three contigs, each chromosome consists of a single contig (Fig. 1b). The length of the Y chromosome increased from the previously reported 9.60 Mb in the Saanen genome to 11.92 Mb. We identified 35 telomeric structures located in the distal-end sequences of the 29 chromosomes, enriched with 6-bp repeat units (TTAGGG/CCCTAA) (Fig. 1b). The genome comprises 1,333.29 Mb of repetitive sequences, constituting 48.26% of total genome bases, and a total of 22,480 protein-coding genes, 11,248 microRNA (miRNA), 34,806 transfer RNA (tRNA), 756 ribosomal RNA (rRNA), and 2,003 small nuclear RNA (snRNA) genes. Evaluation of assembly quality shows that our new assembly has superior continuity, completeness, and accuracy compared to other goat genomes (Table 3). This is the first cashmere goat genome, providing an optimal genomic reference for studying the genetic basis of economically important cashmere traits in goats.

Table 1 Summary of paired-end sequencing data.
Table 2 Summary of long-reads sequencing data.
Table 3 Global genome comparisons between cashmere goat and other goats.
Fig. 1

Characterization of the cashmere goat genome. (a) Chromatin interactions in each chromosome at a resolution of 5 Mb. The dark red dots show the high probability of interaction, and the light dots show the low probability of interaction. (b) Distribution of telomere sequences (6-bp-unit repeats) and gap regions on each chromosome.

Methods

Ethics statement

In this study, we collected whole blood in strict accordance with the International Guiding Principles for Biomedical Research Involving Animals. This procedure was approved by the Special Committee on Scientific Research and Academic Ethics at Inner Mongolia Agricultural University, which is responsible for the approval of Biomedical Research Ethics [Approval No. (2020) 056]. These activities did not require any specific permissions and did not involve any endangered or protected species.

Sample collection

This experiment was conducted at the Inner Mongolia Yiwei White Cashmere Goat Co., Ltd. (Ordos, Inner Mongolia, China). A healthy adult male goat (3 years old), with the best production performance, records, and phenotypic observations within the population, was selected. We collected 10 ml of venous blood from the cashmere goat using EDTA anticoagulant blood collection tubes. The sample was gently inverted by hand ten times to ensure thorough mixing of the anticoagulant and blood. It was then divided into Corning 2 ml freezer tubes. Following rapid freezing in liquid nitrogen, the sample was stored in a −80 °C freezer for subsequent experiments.

For DNA extraction, 500–1000 µL of whole blood was added to a preheated (56 °C) 2 ml Eppendorf tube containing 1 ml of lysis buffer (100 µL of 20 mg/mL Proteinase K and 100 µL of 20% SDS), and incubated at 56 °C for 60–120 minutes. After cooling to room temperature, the tube was centrifuged at maximum speed for 10 minutes to collect the supernatant. The supernatant was transferred to a new 2.0 ml tube, mixed with an equal volume of phenol/chloroform/isoamyl alcohol (25:24:1), and centrifuged at maximum speed for 10 minutes. The supernatant was then transferred to a new 1.5 ml tube and a 2/3 volume of isopropyl alcohol was added (with a 1/10 volume of 3 M sodium acetate if necessary). The mixture was inverted at least 3 times and left at −20 °C for 2 hours for precipitation. After centrifuging the tube at 18,213 × g for 10 minutes, the DNA pellet was washed with 1 ml of 75% ethanol. The pellet was collected again by centrifuging at maximum speed for 5 minutes at room temperature, and the supernatant was completely removed. The DNA pellet was air-dried in a biosafety cabinet for a few minutes and then dissolved in 25–100 µL of TE Buffer. The DNA concentration was measured using a Qubit Fluorometer, and the sample integrity and purity were assessed by agarose gel electrophoresis.

Paired-end library preparation, sequencing and quality control

Between 1 and 1.5 μg of genomic DNA was randomly fragmented by Covaris, after which fragments with an average size of 200–400 bp were selected using the Agencourt AMPure XP-Medium kit. These selected fragments underwent end-repair, 3′ adenylation, adapter ligation, and PCR amplification. The products were then recovered using the AxyPrep Mag PCR Clean-Up Kit. The double-stranded PCR products were heat denatured and circularized by the splint oligo sequence in the MGIEasy Circularization Module (CAT#1000005260, MGI). The resulting single-strand circular DNA (ssCir DNA) formed the final library, which was checked by quality control. The qualified libraries were subsequently sequenced on the DNBSEQ-T7 platform, resulting in a total of 246.45 Gb of raw sequences (Table 1). After filtering with the default parameters of fastp software (v0.23.4)14, the remaining 226.97 Gb (92.10%) of high-quality data was used for genome survey and genome assessment analyses.

Nanopore library construction and sequencing

We extracted DNA from 500 μL of whole blood using the Qiagen DNeasy kit in accordance with the manufacturer’s guidelines. The DNA was eluted into 50 μL and subsequently concentrated to approximately 25 ng/μL using a Zymo DNA Clean and Concentrator Kit, resulting in a final elution volume of roughly 50 μL post-concentration. Nanopore sequencing libraries were prepared using a 1D genomic ligation kit (SQK-LSK108), following the manufacturer’s instructions but with minor modifications: the dA-tailing and FFPE repair steps were combined by using 46.5 μL of input DNA, 0.5 μL NAD+, 3.5 μL Ultra II EndPrep buffer and FFPE DNA repair buffer, and 3.0 μL of Ultra II EndPrep Enzyme and FFPE Repair Mix, resulting in a total reaction volume of 60 μL. The subsequent thermocycler conditions were adjusted to 60 minutes at 20 °C and 30 minutes at 65 °C. The remainder of the protocol was executed as per the manufacturer’s instructions. Approximately 15 μL of the resultant library was loaded onto a PromethION with an R9.4.1 flowcell and run for 48 hours using MinKNOW version 2.0. Fastq files were derived from raw Nanopore data using Albacore (v2.3.1). A total of 6,976,996 passed reads (299.91 Gb), with an average read length of 42,985 bp and a read length N50 of 80,219 bp, were obtained for further genome assembly (Table 2).
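To make the read-length statistics reported above concrete, the following minimal Python sketch shows how an N50 value is defined and computed from a list of lengths. The function name and the toy lengths are illustrative assumptions and are not part of the sequencing pipeline; real input would be the lengths of all passed reads (or, for an assembly, of all contigs).

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical toy read lengths (bp), for illustration only.
toy_lengths = [100, 80, 60, 40, 20]
toy_mean = sum(toy_lengths) / len(toy_lengths)
toy_n50 = n50(toy_lengths)
print(f"mean = {toy_mean:.0f} bp, N50 = {toy_n50} bp")
```

Because N50 weights each sequence by its length, it is typically larger than the mean when the length distribution has a long tail, which is consistent with the ONT ultra-long data reported here (mean 42,985 bp versus N50 80,219 bp).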

PacBio HiFi library preparation and sequencing

The SMRTbell library was prepared according to the PacBio protocol. Briefly, over 5 μg of sheared and concentrated genomic DNA was processed using g-TUBEs (Covaris, USA) to achieve the desired fragment size for the library. Single-strand overhangs were removed, and the DNA fragments were repaired, end-polished, and ligated with stem-loop adapters. Fragments that failed to ligate were removed by exonuclease, and target fragments were selected using the BluePippin system (Sage Science, USA). The resulting SMRTbell library was purified using AMPure PB beads, and the fragment sizes were verified using an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). Consequently, two SMRTbell libraries, each approximately 40 kb in size, were sequenced using PacBio RSII equipment. A total of 4,303,402 HiFi reads were obtained, with an average read length of 16,939 bp and a read length N50 of 17,091 bp, which were then used for further genome assembly (Table 2).

Hi-C library preparation and sequencing

First, genomic DNA from blood was fixed with formaldehyde to preserve chromatin conformation and digested with the restriction enzyme DpnII, and the 5′ overhangs were repaired using biotinylated residues. After in situ ligation of the blunt-end fragments, the isolated DNA was reverse cross-linked, purified, and filtered to retain biotin-containing fragments. Subsequently, DNA end repair, adaptor ligation, and PCR amplification were performed. Finally, a library was constructed using the NEBNext Ultra II DNA Library Prep Kit for Illumina (NEB) according to the manufacturer’s instructions and sequenced on a NovaSeq platform to generate paired-end reads of 150 bp in length. A total of 314.29 Gb of clean data was obtained for further genome assembly (Table 1).

RNA library preparation and sequencing

RNA was extracted from five tissues (heart, kidney, lung, liver and spleen) of the buck used for genome assembly and used to construct mRNA-seq libraries. For full-length transcriptome sequencing, the RNA from the five tissues was pooled into a single library. The pooled RNA was initially tested using the following methods: 1) agarose gel electrophoresis was used to exclude degraded and contaminated RNA; 2) the purity of the RNA (OD260/280 ratio) was checked using Nanodrop; 3) the Qubit system was used for precise quantification of RNA concentration; 4) the integrity of the RNA was assessed using an Agilent 2100. After passing the quality check, ~3 μg of pooled RNA was reverse transcribed into cDNA using the Clontech SMARTer PCR cDNA Synthesis Kit (Takara Biotechnology, Dalian, China), which was then amplified to generate double-stranded cDNA. The BluePippin™ Size Selection System (Sage Science, Beverly, MA, USA) was used to select cDNA fragments of <4 kb and >4 kb. For each SMRTbell library, ~1 μg of cDNA was used for construction with the Pacific Biosciences SMRTbell Template Prep Kit (PacBio, CA, USA) according to the user manual. Finally, the SMRT cells were sequenced on the PacBio Sequel platform. This process generated 130,252,514 subreads, amounting to a total of 321.81 Gb, with a read N50 of 2,651 bp. Following circular consensus sequence calling, primer removal and demultiplexing, as well as refining and clustering for parallel polishing, a total of 121,692 high-quality consensus transcript sequences were derived from the primary PacBio BAM file of full-length transcriptome sequencing. These sequences were then used for further genome annotation.

The mRNA-seq libraries for short-read transcriptome sequencing were constructed using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, USA), with all procedures following the manufacturer’s recommendations. All libraries were sequenced on an Illumina NovaSeq 6000 platform in PE-150 mode. Following the removal of low-quality reads and adaptor sequences by fastp (v0.23.4)14, we obtained a total of 31.29 Gb of high-quality data for the five tissues of the cashmere goat (Table 4).

Table 4 The summary of five RNA-seq data of the cashmere goat.

Genome survey

Before proceeding with genome assembly, we conducted a k-mer analysis using high-quality paired-end reads to estimate the genome size and heterozygosity. Specifically, the paired-end reads were analyzed through a 17-mer frequency distribution using the KMC software15 with the parameter “-k17 -ci1 -cs1000000”. This process generated spectrum data containing a total of 94,711,895,031 k-mers (Table 5), which were subsequently analyzed using FindGSE software16. The cashmere goat genome size was estimated using the formula: G = Knum/Kdepth, where G represents the genome size, Knum is the total number of 17-mers, and Kdepth denotes the 17-mer depth. As a result, we estimated the genome size to be 2.79 Gb, with a heterozygosity rate of 0.40% (Table 5).
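The relation between total k-mer count, k-mer depth and genome size described above can be illustrated with a short Python sketch. This is a simplified peak-based estimate under stated assumptions, not the model fitting performed by FindGSE: the function name, the `min_depth` threshold used to skip low-depth error k-mers, and the toy histogram are all hypothetical. The last lines only rearrange the figures reported in this section to show the k-mer depth they imply.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as K_num / K_depth.

    histogram maps a k-mer depth to the number of distinct k-mers
    observed at that depth. Depths below min_depth are treated as
    sequencing errors and ignored (an assumption of this sketch).
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # K_depth: depth of the main (homozygous) peak of the distribution.
    k_depth = max(solid, key=solid.get)
    # K_num: total number of k-mer occurrences, i.e. sum of depth * count.
    k_num = sum(d * c for d, c in solid.items())
    return k_num / k_depth, k_depth


# Hypothetical toy histogram: an error peak at depth 1-2, main peak at 10.
toy_histogram = {1: 1000, 2: 200, 9: 10, 10: 50, 11: 10}
toy_size, toy_depth = estimate_genome_size(toy_histogram)
print(f"toy genome size = {toy_size:.0f}, peak depth = {toy_depth}")

# Figures reported in the text: total 17-mers and estimated genome size.
reported_k_num = 94_711_895_031
reported_genome_size = 2.79e9
implied_depth = reported_k_num / reported_genome_size
print(f"implied 17-mer depth = {implied_depth:.1f}")
```

The implied depth of roughly 34 is simply the reported k-mer total divided by the reported genome size; the actual depth value used by the authors is given in Table 5 rather than in the text.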

Table 5 The result of the K-mer analysis.

Genome assembly using an improved four-step assembly strategy

We adopted the previously improved assembly strategy13 to generate a complete assembly for the cashmere goat. Our assembly pipeline consists of four main steps. Firstly, the ONT ultra-long reads were used to assemble initial contigs by applying the ‘correct-then-assemble’ strategy in the package NextDenovo (v2.5.2; https://github.com/Nextomics/NextDenovo) with the parameters ‘read_cutoff = 1k, seed_cutoff = 32k, blocksize = 3 g’. Subsequently, the initial contigs were corrected with the paired-end reads using the ‘best’ algorithm module in the package NextPolish (v1.4.1)17. Consequently, we obtained the ContigV1 genome assembly with a total of 222 contigs and a contig N50 of 84.04 Mb (Table 6). Secondly, we applied the single-end mode in Bowtie2 software to map the Hi-C data onto the preceding ContigV1 assembly18. After discarding the invalid self-ligated and unligated fragments within the uniquely mapped pairs using the HiCUP pipeline (version 0.8.0)19, the valid interaction pairs were used to compute the linkage frequency among all contigs, and the linked contigs were clustered by applying an agglomerative hierarchical clustering algorithm20. This process resulted in 31 linkage groups. We then used the nucmer program in the MUMmer4 package21 to generate synteny results between the sequences of the 31 linkage groups and the chromosomal sequences of the Saanen goat (NCBI Accession Number: GCA_015443085.1). This step enabled us to further correct the contigs within the cashmere goat linkage groups associated with the X and Y sex chromosomes. Thirdly, we mapped the HiFi reads to the sequences of the 31 linkage groups using Minimap2 (ref. 22), and extracted the best-aligned reads for each linkage group. Following this, each set of classified mapped reads was used for local assembly using the Hifiasm package (0.19.5-r587) with default parameters23. This process generated a new set of contigs, which we named the ContigV2 assembly (Table 6). The ContigV2 assembly comprises 33 contigs, with a contig N50 of 95.22 Mb. This local assembly strategy can effectively circumvent the false overlap relationships that may be induced by the genome’s repetitive sequences during the construction of the string graph in the assembly process24. Finally, in the fourth step, the contigs in the ContigV2 assembly were clustered, ordered and oriented using the ALLHiC algorithm25, and anchored onto 31 chromosomes. Any placement and orientation errors that displayed distinct chromatin interaction patterns were manually rectified. We identified the telomere regions in all chromosomes by searching for 6-bp repeat units (TTAGGG/CCCTAA).
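The telomere search described above can be sketched in Python as a scan for tandem runs of the 6-bp units TTAGGG and CCCTAA. This is a minimal illustration rather than the procedure actually used for the assembly: the function name, the `min_repeats` threshold and the toy sequence are assumptions introduced here, and the text does not state which tool or minimum run length was applied.

```python
import re


def find_telomere_runs(sequence, min_repeats=4):
    """Return (start, end, unit) for each tandem run of TTAGGG or CCCTAA
    with at least min_repeats consecutive copies (0-based, end-exclusive)."""
    pattern = re.compile(
        r"(?:TTAGGG){%d,}|(?:CCCTAA){%d,}" % (min_repeats, min_repeats)
    )
    runs = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group()[:6]  # first 6 bp identify the repeat unit
        runs.append((match.start(), match.end(), unit))
    return runs


# Hypothetical toy chromosome: CCCTAA run at the start, TTAGGG run at the end.
toy_chromosome = "CCCTAA" * 5 + "ACGTACGTAC" + "TTAGGG" * 4
telomere_runs = find_telomere_runs(toy_chromosome)
for start, end, unit in telomere_runs:
    print(f"{unit} run at {start}-{end} ({(end - start) // 6} copies)")
```

On a real assembly one would typically restrict the scan to a window at each chromosome end and report a telomere only when a run is found there; the window size would be a further assumption.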

Table 6 Assembly summary in different steps when using the improved assembly pipeline.

Genome assessment

We employed three distinct methods to assess the quality of the assembled genome. To assess assembly accuracy, we used the k-mer-based approach implemented in Merqury26 to calculate the quality value (QV) with a k-mer size of 21, using the paired-end reads. To assess assembly completeness, we conducted a BUSCO analysis by searching against the 9,226 BUSCOs of mammalia_odb10 (version 5.4.2)27, and realigned the paired-end reads to the assembled genome using BWA software28 to calculate the realignment ratio and coverage depth.
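For readers unfamiliar with the quality value, the Python sketch below shows how a Phred-scaled QV relates to a per-base error rate, and how a k-mer-based estimate of the kind implemented in Merqury can be derived from the number of assembly k-mers that are absent from the reads. The function names and the toy counts are illustrative assumptions; the k-mer survival formula follows our understanding of the Merqury publication and should be checked against that tool before reuse.

```python
import math


def error_rate_from_qv(qv):
    """Convert a Phred-scaled quality value into a per-base error rate."""
    return 10 ** (-qv / 10)


def qv_from_kmers(asm_only_kmers, asm_total_kmers, k=21):
    """Estimate QV from k-mers found only in the assembly (not in the reads).

    Assumes a k-mer is correct only if all k of its bases are correct, so
    the per-base accuracy is the k-th root of the shared k-mer fraction.
    """
    if asm_only_kmers == 0:
        return float("inf")
    shared_fraction = 1 - asm_only_kmers / asm_total_kmers
    per_base_error = 1 - shared_fraction ** (1 / k)
    return -10 * math.log10(per_base_error)


# Hypothetical toy counts: 21 unsupported k-mers among one million.
toy_qv = qv_from_kmers(asm_only_kmers=21, asm_total_kmers=1_000_000, k=21)
print(f"toy QV = {toy_qv:.1f}")

# QV40 corresponds to one expected error per 10,000 bases.
print(f"error rate at QV40 = {error_rate_from_qv(40):.1e}")
```

Under this scale, each increase of 10 in QV corresponds to a tenfold reduction in the expected per-base error rate.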

Annotation of repetitive sequences, protein-coding gene and noncoding RNA gene structure

We employed both homologous searching and ab initio prediction methods to annotate the repetitive sequences within the cashmere goat genome. We first used LTR_FINDER (v1.0.7)29, PILER (v3.3.0)30, RepeatScout (v1.0.5)31, and RepeatModeler (v1.0.8)32 for the de novo construction of candidate libraries of repetitive elements within the goat genome. Subsequently, the de novo libraries of repeat sequences, in conjunction with the Repbase database, were used to search the cashmere goat assembly for repetitive sequence annotation using RepeatMasker (v4.0.5)33. Additionally, RepeatProteinMask (v4.0.5) with default parameters was used to predict transposable elements based on the RepeatPeps database. We combined these results to identify 1,333.29 Mb (48.26%) of repetitive sequences in the cashmere goat assembly (Table 7).

Table 7 The summary of the repetitive sequences in the cashmere goat genome.

We used homology-, de novo-, and transcriptome-based approaches to predict the protein-coding genes within the cashmere goat genome. For homology-based gene prediction, we aligned the protein sequences from five mammalian genomes, namely Homo sapiens (GCF_000001405.39), Mus musculus (GCF_000001635.27), Bos taurus (GCF_000003205.7), Ovis aries (GCF_016772045.1) and Capra hircus (GCF_001704415.1), against the cashmere goat genome using TBLASTN (version 2.2.29+) with an e-value cut-off of 1e-5 (ref. 34). All remaining blast hits were concatenated by Solar software (version 0.9.6). We extracted the corresponding genomic region, including 1,000 bp upstream and downstream of each candidate gene, to predict the precise gene structure using wise2 (v2.4.1)35. The resulting predictions were designated as the ‘Homology set’. For transcriptome-based prediction, RNA-seq data were assembled into transcript sequences with Trinity (v2.1.1)36. We aligned the transcript sequences against the cashmere goat genome using the Program to Assemble Spliced Alignment (PASA)37, where effective alignments were clustered based on genome mapping location and assembled into gene structures. The gene models created by PASA were labeled as the PASA Trinity set. In addition, RNA-seq reads were directly mapped to the cashmere goat genome using TopHat (v2.0.13)38, and the mapped reads were assembled into gene models (Cufflinks-set) by Cufflinks (v2.1.1)39. For de novo gene prediction, we employed Augustus (v2.5.5)40, GeneID (v1.4)41, GeneScan (v1.0)42, GlimmerHMM (v3.0.1)43, and SNAP (version 2013-11-29)44 to predict genes in the repeat-masked genome. Specific parameters of Augustus, SNAP, and GlimmerHMM were trained with the gene models in the PASA Trinity set. Finally, all gene models from the above sets were integrated by EVidenceModeler (v1.1.1), with the following weights assigned to each type of evidence: PASA-T-set > Homology-set = Cufflinks-set > Augustus > GeneID = SNAP = GlimmerHMM = GeneScan. Additionally, we filtered out genes that encoded fewer than 50 amino acids, were only supported by ab initio evidence, and had an expression value of less than 1. As a result, we obtained 22,480 protein-coding genes in the cashmere goat genome (Table 8).

Table 8 The summary of gene structure in the cashmere goat genome.

We annotated the function of protein-coding genes within the cashmere goat genome by homology searches against the SwissProt45, KEGG pathway46, NR (from NCBI), and InterPro databases. Pfam domain and Gene Ontology (GO) information was obtained from the InterPro database and predicted using the InterProScan tool47, which is based on conserved protein domains and functional sites. For the other databases, we used BLASTP with an e-value cut-off of 1e-4 (ref. 34). Consequently, we found that 99.12% of the protein-coding genes were supported by functional databases (Table 9).

Table 9 The summary of gene function annotation.

We predicted the gene structures of noncoding RNAs in the cashmere goat genome. Specifically, we used the tRNAscan-SE tool (v1.3.1) to predict tRNAs48. We predicted ribosomal RNA (rRNA) sequences by searching against the invertebrate rRNA database using BLAST with an E-value cut-off of 1e-10 (ref. 49). Additionally, we annotated small nuclear and nucleolar RNAs, as well as miRNAs, using Infernal (v1.1rc4) based on the Rfam database50. As a result, we identified a total of 11,248 microRNA (miRNA), 34,806 transfer RNA (tRNA), 756 ribosomal RNA (rRNA), and 2,003 small nuclear RNA (snRNA) genes (Table 10). The abundance of these different categories of noncoding RNA in the cashmere goat genome is similar to that in the Saanen goat genome.

Table 10 The summary of noncoding RNA genes.

Technical Validation

The assessment of the cashmere goat assembly

The final genome size is close to the size estimated from k-mer analysis (2.79 Gb) (Fig. 2a and Table 5). Our assembled goat genome exhibits excellent completeness, as evidenced by the mapping of 99.97% of the short reads, which cover 99.51% of the genome, and by the recovery of 95.9% of BUSCOs (Benchmarking Universal Single-Copy Orthologs)27 among the 9,226 conserved mammalian genes of the mammalia_odb10 database (Tables 3, 11). Our BUSCO results surpass the average BUSCO values of approximately 60 recently published vertebrate genomes13,51,52,53. Furthermore, using a reference-free, k-mer-based approach, we estimated a high assembly quality value (QV) of 43.68 (Table 3), exceeding the Vertebrate Genome Project (VGP) standard of QV40 (refs. 26,54) and indicating high accuracy of our assembly.

Fig. 2

Genome survey and assembly quality in cashmere goat. (a) The frequency distribution of the 17-mers. (b) Genome contiguity of the cashmere goat (red line) compared to two other chromosome-level goat assemblies (broken lines). X-axis shows contig N50 and N90 values. Horizontal black dashed lines indicate cumulative contig size, with combined lengths shown as a percentage (X%) of total genome length.

Table 11 The summary of the assembly completeness assessment for the cashmere goat genome.

Comparison between the cashmere goat and other published goat genomes

Compared to the published goat genomes7, our new assembly has the best contiguity, completeness, and accuracy (Table 3 and Fig. 2b). We used the LASTZ program55 to analyze the one-to-one synteny blocks (>1 Kb) between our assembled genome and the Saanen goat genome, the second-best assembly. We found that 95% of the cashmere goat assembly is syntenic to 99% of the Saanen genome assembly (Fig. 3a), and that the cashmere and Saanen goat assemblies contain a total of 142.26 Mb and 19.19 Mb of unique sequences, respectively (Table 12). Remarkably, unique sequences greater than 1 Mb in length totaled 110.52 Mb, accounting for the majority (77.83%), almost all of which were distributed around the proximal telomeres of 18 chromosomes (Fig. 3a). Among these, the longest unique sequence is 17 Mb and lies on chromosome 17. We mapped the three kinds of sequencing data (HiFi, ONT ultra-long and paired-end reads) from the assembled cashmere goat onto its assembly, and found that these unique sequences were almost as well covered by the sequencing data as neighboring regions (Fig. 3b). This supports the accuracy of the assembled sequences.

Fig. 3

Comparisons between the cashmere and Saanen goat genomes. (a) Chord diagram depicting genome synteny. (b) The read coverage distribution in the breakpoint regions near the unique sequences of the cashmere goat genome.

Table 12 Genomic syntenic result of the cashmere goat genome against Saanen_v1 genome.

Data Records

The raw data, including PacBio, ONT, Hi-C, paired-end sequencing, and RNA-seq data, have been deposited into the NCBI Sequence Read Archive (SRA) under accession codes SRR28823973 - SRR28823996 and SRR30599194 - SRR30599199 (ref. 56). The genome assembly has been deposited in the GenBank database under accession number GCA_040822015.1 (ref. 57). The genome sequences and annotation files (i.e., the GFF file and the FASTA files recording coding sequences and protein sequences) of the cashmere goat are available at Figshare (https://doi.org/10.6084/m9.figshare.25697928.v1)58.