Chromosome-level genome assembly of the cashmere goat

Wang, Zhiying; Lv, Qi; Li, Wenze; Huang, Wanlong; Gong, Gao; Yan, Xiaochun; Liu, Baichuan; Chen, Oljibilig; Wang, Na; Zhang, Yanjun; Wang, Ruijun; Li, Jinquan; Tian, Shilin; Su, Rui

doi:10.1038/s41597-024-03932-7

Data Descriptor
Open access
Published: 09 October 2024

Zhiying Wang^1,2,3^na1,
Qi Lv^1,2,3^na1,
Wenze Li^1,2,3^na1,
Wanlong Huang⁴^na1,
Gao Gong ORCID: orcid.org/0000-0003-4655-5434^1,2,3,
Xiaochun Yan^1,2,3,
Baichuan Liu⁵,
Oljibilig Chen⁵,
Na Wang⁵,
Yanjun Zhang^1,2,3,
Ruijun Wang^1,2,3,
Jinquan Li^1,2,3,
Shilin Tian ORCID: orcid.org/0000-0001-8958-1806⁴ &
…
Rui Su^1,2,3

Scientific Data volume 11, Article number: 1107 (2024)

Abstract

The goat, an early domesticated ruminant, is a reliable source of cashmere, meat and milk in global agricultural production. Despite this, the genome of cashmere-rich goats has yet to be characterized. Here, we assembled the nearly complete genome of a cashmere goat from a highly economically valuable Inner Mongolian Cashmere buck, utilizing a combination of PacBio HiFi, ONT ultra-long reads, and Hi-C technologies. The size of this genome is 2.76 Gb, with a contig N50 of 95.22 Mb. All assembled sequences were anchored onto 29 autosomes and both sex chromosomes, with only two gaps present on the X chromosome. We identified 1,333.29 Mb (48.26%) of repetitive sequences and predicted 22,480 protein-coding genes. Assembly quality assessment of the genome demonstrated that our assembled cashmere goat genome surpasses the continuity, completeness, and accuracy of other published goat genomes. Taken together, we provided the first cashmere goat assembly, bridging the gap in the genome of important economic breeds of domestic goats, and providing a valuable reference resource for goat genetics and genome research.

Background & Summary

Goats are among the earliest domesticated ruminants. Archaeological and genetic findings have demonstrated that domestic goats were derived from the wild goat, known as the bezoar (Capra aegagrus), in the Fertile Crescent region of Western Asia during the Neolithic Age, approximately 10,000 years ago^1,2,3. Goats hold a significant place in many cultures around the world and are often used in religious and traditional ceremonies⁴. They are raised worldwide and can adapt to harsh conditions such as desert and semi-desert environments⁵. Goats provide a reliable source of cashmere, meat, milk and skin, and they play a crucial role in sustainable farming and food production in many parts of the world. According to the Food and Agriculture Organization of the United Nations (FAO; www.fao.org/home/en), there are over 1.15 billion goats worldwide, representing more than 580 distinct breeds. Asia and Africa hold 94.93% of the world’s goat population (FAO, 2022), mainly in developing and underdeveloped areas such as China, India, Iran, and Afghanistan.

Cashmere, with its unique qualities, has become a high-end and precious spinning raw material, earning it the reputation of ‘soft gold’ and ‘fiber gem’⁶. The Inner Mongolian Arbas Cashmere goat is a world-class dual-purpose breed producing cashmere and meat, formed through natural selection and artificial breeding, possessing excellent traits such as high cashmere yield, fine cashmere quality, and large body size. The cashmere produced by it was honored with the ‘Chaigneau’ Quality Wool Award by the European Fiber Committee in 1985. Moreover, the Inner Mongolian Arbas goat is the important paternal origin of most of China’s goat breeds, its export is strictly prohibited, and it is classified as a first-class protected species of genetic resources in China.

To date, five high-quality goat genome sequences have been published^7,8,9,10,11, but none is from a cashmere goat. Among these, the chromosome-level genome assembled from a male Saanen dairy goat is of the highest quality⁷. That assembly identified the proximal and distal centromeric and telomeric repeats for approximately two-thirds (20/29) of the autosomes. Nonetheless, some autosomes and the X and Y chromosomes in the Saanen genome remain incomplete. The main obstacle in these unresolved regions is the difficulty of fully assembling highly repetitive sequences¹².

In this study, we performed de novo sequencing of a cashmere goat, specifically a highly economically valuable Inner Mongolian Arbas White Cashmere buck, by integrating 72.89 Gb of PacBio high-fidelity (HiFi) sequences, 299.91 Gb of Oxford Nanopore Technologies (ONT) ultra-long data, 314.29 Gb of high-throughput chromosome conformation capture (Hi-C) data and 226.97 Gb of paired-end sequences (Tables 1, 2). We used a recently published, enhanced four-step assembly strategy¹³ to generate an assembly of 2.76 Gb with a contig N50 of 95.22 Mb (Table 3). The assembled sequences were anchored onto 29 autosomes and both sex chromosomes (X and Y) (Fig. 1a). Apart from chromosome X, which consists of three contigs, each chromosome is represented by a single, exceptionally large contig (Fig. 1b). The length of the Y chromosome increased from the 9.60 Mb previously reported for the Saanen genome to 11.92 Mb. We identified 35 telomeric structures, enriched with 6-bp repeat units (TTAGGG/CCCTAA), in the distal-end sequences of the 29 chromosomes (Fig. 1b). The genome comprises 1,333.29 Mb of repetitive sequences, constituting 48.26% of the total genome bases, and a total of 22,480 protein-coding genes, 11,248 microRNA (miRNA), 34,806 transfer RNA (tRNA), 756 ribosomal RNA (rRNA), and 2,003 small nuclear RNA (snRNA) genes. Evaluation of assembly quality shows that our new assembly has superior continuity, completeness, and accuracy compared with other goat genomes (Table 3). This is the first cashmere goat genome, providing an optimal genomic reference for studying the genetic basis of economically important cashmere traits in goats.

Table 1 Summary of paired-end sequencing data.

Table 2 Summary of long-reads sequencing data.

Table 3 Global genome comparisons between cashmere goat and other goats.

Methods

Ethics statement

In this study, we collected whole blood in strict accordance with the International Guiding Principles for Biomedical Research Involving Animals. This procedure was approved by the Special Committee on Scientific Research and Academic Ethics at Inner Mongolia Agricultural University, which is responsible for the approval of Biomedical Research Ethics [Approval No. (2020) 056]. These activities did not require any specific permissions and did not involve any endangered or protected species.

Sample collection

This experiment was conducted at the Inner Mongolia Yiwei White Cashmere Goat Co., Ltd. (Ordos, Inner Mongolia, China). A healthy adult male goat (3 years old), which had the best production performance, records, and phenotypic observations within the population, was selected. We collected 10 ml of venous blood from the cashmere goat using EDTA anticoagulant tubes. The sample was gently inverted by hand ten times to ensure thorough mixing of the anticoagulant and blood. It was then divided into Corning 2 ml freezer tubes. Following rapid freezing in liquid nitrogen, the sample was stored in a −80 °C freezer for subsequent experiments.

For DNA extraction, 500–1000 µL of whole blood was added to a preheated (56 °C) 2 ml Eppendorf tube containing 1 ml of lysis buffer (100 µL of 20 mg/mL Proteinase K and 100 µL of 20% SDS), and incubated at 56 °C for 60–120 minutes. After cooling to room temperature, the tube was centrifuged at maximum speed for 10 minutes to collect the supernatant. The supernatant was transferred to a new 2.0 ml tube, mixed with an equal volume of phenol/chloroform/isoamyl alcohol (25:24:1), and centrifuged at maximum speed for 10 minutes. The supernatant was then transferred to a new 1.5 ml tube and a two-thirds volume of isopropyl alcohol was added (with a one-tenth volume of 3 M sodium acetate if necessary). The mixture was inverted at least 3 times and left at −20 °C for 2 hours for precipitation. After centrifuging the tube at 18,213 × g for 10 minutes, the DNA pellet was washed with 1 ml of 75% ethanol. The pellet was collected again by centrifugation at maximum speed for 5 minutes at room temperature, and the supernatant was completely removed. The DNA pellet was air-dried in a biosafety cabinet for a few minutes and then dissolved in 25–100 µL of TE Buffer. The DNA concentration was measured using a Qubit Fluorometer, and the sample integrity and purity were assessed by agarose gel electrophoresis.

Paired-end library preparation, sequencing and quality control

Between 1 and 1.5 μg of genomic DNA was randomly fragmented using a Covaris instrument, after which fragments with an average size of 200–400 bp were selected using the Agencourt AMPure XP-Medium kit. The selected fragments underwent end-repair, 3′ adenylation, adapter ligation, and PCR amplification. The products were then recovered using the AxyPrep Mag PCR Clean-Up Kit. The double-stranded PCR products were heat-denatured and circularized with the splint oligo sequence in the MGIEasy Circularization Module (CAT#1000005260, MGI). The resulting single-strand circular DNA (ssCir DNA) formed the final library, which was checked by quality control. The qualified libraries were subsequently sequenced on the DNBSEQ-T7 platform, yielding a total of 246.45 Gb of raw sequences (Table 1). After filtering with the default parameters of fastp (v0.23.4)¹⁴, the remaining 226.97 Gb (92.10%) of high-quality data was used for the genome survey and genome assessment analyses.

Nanopore library construction and sequencing

We extracted DNA from 500 μL of whole blood using the Qiagen DNeasy kit in accordance with the manufacturer’s guidelines. The DNA was eluted into 50 μL and subsequently concentrated to approximately 25 ng/μL using a Zymo DNA Clean and Concentrator Kit, resulting in a final elution volume of roughly 50 μL post-concentration. Nanopore sequencing libraries were prepared using a 1D genomic ligation kit (SQK-LSK108), following the manufacturer’s instructions with minor modifications: the dA-tailing and FFPE repair steps were combined by using 46.5 μL of input DNA, 0.5 μL of NAD+, 3.5 μL each of Ultra II EndPrep buffer and FFPE DNA repair buffer, and 3.0 μL each of Ultra II EndPrep Enzyme and FFPE Repair Mix, for a total reaction volume of 60 μL. The subsequent thermocycler conditions were adjusted to 60 minutes at 20 °C and 30 minutes at 65 °C. The remainder of the protocol followed the manufacturer’s instructions. Approximately 15 μL of the resulting library was loaded onto a PromethION with an R9.4.1 flowcell and run for 48 hours using MinKNOW version 2.0. Fastq files were derived from the raw Nanopore data using Albacore (v2.3.1). A total of 6,976,996 passed reads (299.91 Gb), with an average read length of 42,985 bp and a read length N50 of 80,219 bp, were obtained for further genome assembly (Table 2).
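The two read statistics reported above (average read length and read length N50) can be illustrated with a minimal Python sketch. The function names and the toy read lengths are hypothetical and are not part of the original pipeline. N50 is computed under its usual definition: the length at which reads of that length or longer contain at least half of all sequenced bases.

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half of
        # all bases is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


def mean_length(lengths):
    """Return the arithmetic mean length."""
    return sum(lengths) / len(lengths)


# Hypothetical toy read lengths in bp (not data from this study).
toy_reads = [100, 80, 60, 40, 20]
print(n50(toy_reads), mean_length(toy_reads))
```

The same calculation applies to the contig N50 values quoted for the assemblies, with contig lengths in place of read lengths.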

Pacbio HiFi library preparation and sequencing

The SMRTbell library was prepared according to the PacBio protocol. Briefly, over 5 μg of concentrated genomic DNA was sheared using g-TUBEs (Covaris, USA) to achieve the desired fragment size for the library. Single-strand overhangs were removed, and the DNA fragments were repaired, end-polished, and ligated with stem-loop adapters. Fragments that failed to ligate were removed by exonuclease, and target fragments were selected using the BluePippin system (Sage Science, USA). The resulting SMRTbell library was purified using AMPure PB beads, and the fragment sizes were verified using an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). Two SMRTbell libraries, each approximately 40 kb in size, were then sequenced using PacBio RSII equipment. A total of 4,303,402 HiFi reads were obtained, with an average read length of 16,939 bp and a read length N50 of 17,091 bp, and were used for further genome assembly (Table 2).

Hi-C library preparation and sequencing

First, genomic DNA from blood was fixed with formaldehyde to preserve chromatin conformation and digested with the restriction enzyme DpnII, and the 5′ overhangs were repaired using biotinylated residues. After in situ ligation of the blunt-end fragments, the isolated DNA was reverse cross-linked, purified, and filtered to retain biotin-containing fragments. Subsequently, DNA end repair, adaptor ligation, and PCR amplification were performed. Finally, a library was constructed using the NEBNext Ultra II DNA Library Prep Kit for Illumina (NEB) according to the manufacturer’s instructions and sequenced on a NovaSeq platform to generate paired-end reads of 150 bp in length. A total of 314.29 Gb of clean data was obtained for further genome assembly (Table 1).

RNA library preparation and sequencing

RNA was extracted from five tissues (heart, kidney, lung, liver and spleen) of the buck used for genome assembly and was used for both full-length and short-read transcriptome sequencing. For full-length transcriptome sequencing, we constructed a pooled library. The pooled RNA was first checked as follows: 1) agarose gel electrophoresis was used to exclude degraded and contaminated RNA; 2) the purity of the RNA (OD260/280 ratio) was checked using a Nanodrop; 3) the Qubit system was used for precise quantification of RNA concentration; 4) the integrity of the RNA was assessed using an Agilent 2100. After passing the quality check, ~3 μg of pooled RNA was reverse-transcribed into cDNA using the Clontech SMARTer PCR cDNA Synthesis Kit (Takara Biotechnology, Dalian, China), which was then amplified to generate double-stranded cDNA. The BluePippin™ Size Selection System (Sage Science, Beverly, MA, USA) was used to select cDNA fragments of <4 kb and >4 kb. For each SMRTbell library, ~1 μg of cDNA was used for construction with the Pacific Biosciences SMRTbell Template Prep Kit (PacBio, CA, USA) according to the user manual. Finally, the SMRT cells were sequenced on the PacBio Sequel platform. This process generated 130,252,514 subreads, amounting to a total of 321.81 Gb, with a read N50 of 2,651 bp. Following circular consensus sequence calling, primer removal and demultiplexing, and refining and clustering for parallel polishing, a total of 121,692 high-quality consensus transcript sequences were derived from the primary PacBio BAM file of full-length transcriptome sequencing. These sequences were then used for genome annotation.

The mRNA-seq libraries for short-read transcriptome sequencing were constructed using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, USA), following the manufacturer’s recommendations. All libraries were sequenced on an Illumina NovaSeq 6000 platform in PE-150 mode. After removing low-quality reads and adaptor sequences with fastp (v0.23.4)¹⁴, we obtained a total of 31.29 Gb of high-quality data for the five tissues of the cashmere goat (Table 4).

Table 4 The summary of five RNA-seq data of the cashmere goat.

Genome survey

Before proceeding with genome assembly, we conducted a k-mer analysis using the high-quality paired-end reads to estimate the genome size and heterozygosity. Specifically, we computed the 17-mer frequency distribution of the paired-end reads using the KMC software¹⁵ with the parameters “-k17 -ci1 -cs1000000”. This produced spectrum data containing a total of 94,711,895,031 k-mers (Table 5), which were subsequently analyzed using the FindGSE software¹⁶. The cashmere goat genome size was estimated using the formula G = K_num/K_depth, where G is the genome size, K_num is the total number of 17-mers, and K_depth is the 17-mer depth. As a result, we estimated the genome size to be 2.79 Gb, with a heterozygosity rate of 0.40% (Table 5).
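The formula G = K_num/K_depth can be sketched in a few lines of Python. This is only an illustration of the simple peak-based calculation, not of the model fitting that FindGSE performs. The histogram format, the function name and the `min_depth` cut-off used to skip low-multiplicity (likely erroneous) k-mers when locating the main peak are all assumptions made for this example.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as G = K_num / K_depth.

    histogram maps k-mer multiplicity -> number of distinct k-mers
    observed at that multiplicity (hypothetical input format).
    """
    # K_num: total number of k-mer occurrences in the reads.
    k_num = sum(depth * count for depth, count in histogram.items())
    # K_depth: multiplicity of the main peak, ignoring the low-depth
    # region that is usually dominated by sequencing errors.
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no multiplicities at or above min_depth")
    k_depth = max(solid, key=solid.get)
    return k_num / k_depth, k_depth


# Hypothetical toy histogram (not data from this study).
toy_histogram = {1: 1000, 2: 100, 9: 50, 10: 200, 11: 50}
size, depth = estimate_genome_size(toy_histogram)
print(size, depth)
```

As a back-of-the-envelope check on the reported values, dividing the 94,711,895,031 17-mers by the estimated 2.79 Gb genome size implies a 17-mer depth of roughly 34.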

Table 5 The result of the K-mer analysis.

Genome assembly using an improved four-step assembly strategy

We adopted a previously improved assembly strategy¹³ to generate a complete assembly for the cashmere goat. The pipeline consists of four main steps. First, the ONT ultra-long reads were assembled into initial contigs using the ‘correct-then-assemble’ strategy in NextDenovo (v2.5.2; https://github.com/Nextomics/NextDenovo) with the parameters ‘read_cutoff = 1k, seed_cutoff = 32k, blocksize = 3 g’. The initial contigs were then corrected with the paired-end reads using the ‘best’ algorithm module in NextPolish v1.4.1¹⁷. This yielded the ContigV1 assembly, with 222 contigs and a contig N50 of 84.04 Mb (Table 6). Second, we used the single-end mode of Bowtie2 to map the Hi-C data onto the ContigV1 assembly¹⁸. After discarding invalid self-ligated and unligated fragments among the uniquely mapped pairs with the HiCUP pipeline (version 0.8.0)¹⁹, we used the valid interaction pairs to compute the linkage frequency among all contigs and clustered the linked contigs with an agglomerative hierarchical clustering algorithm²⁰. This process resulted in 31 linkage groups. We then used the nucmer program in the MUMmer4 package²¹ to compute synteny between the sequences of the 31 linkage groups and the chromosomal sequences of the Saanen goat (NCBI Accession Number: GCA_015443085.1), which enabled us to further correct the contigs in the linkage groups associated with the X and Y sex chromosomes. Third, we mapped the HiFi reads to the sequences of the 31 linkage groups using Minimap2²² and extracted the best-aligned reads for each linkage group. Each set of reads was then assembled locally using Hifiasm (0.19.5-r587) with default parameters²³. This produced a new set of contigs, the ContigV2 assembly (Table 6), which comprises 33 contigs with a contig N50 of 95.22 Mb. This local assembly strategy effectively avoids the false overlaps that repetitive sequences can induce during construction of the string graph in the assembly process²⁴. Finally, the contigs in the ContigV2 assembly were clustered, ordered and oriented using the ALLHiC algorithm²⁵ and anchored onto 31 chromosomes. Placement and orientation errors that showed distinct chromatin interaction patterns were corrected manually. We identified the telomere regions in all chromosomes by searching for 6-bp repeat units (TTAGGG/CCCTAA).
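The telomere search described at the end of this section can be sketched as a scan of each chromosome end for tandem runs of the 6-bp units. The text does not state which tool, window size or minimum copy number was used, so the function name, the 10-kb window and the threshold of four consecutive copies below are assumptions chosen for illustration only.

```python
import re


def find_telomeric_ends(sequence, window=10_000, min_copies=4):
    """Report whether each end of a sequence carries a telomeric run.

    A run is min_copies or more consecutive TTAGGG or CCCTAA units
    found within `window` bases of the sequence start or end.
    """
    pattern = re.compile(
        r"(?:TTAGGG){%d,}|(?:CCCTAA){%d,}" % (min_copies, min_copies)
    )
    seq = sequence.upper()
    return {
        "start": bool(pattern.search(seq[:window])),
        "end": bool(pattern.search(seq[-window:])),
    }


# Hypothetical toy sequence: five CCCTAA copies at the start and
# only two TTAGGG copies at the end.
toy_chromosome = "CCCTAA" * 5 + "ACGT" * 50 + "TTAGGG" * 2
print(find_telomeric_ends(toy_chromosome, window=40))
```

In practice such a scan would be run over every chromosome in the assembly FASTA, and the number of ends reporting a run would be counted.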

Table 6 Assembly summary in different steps when using the improved assembly pipeline.

Genome assessment

We assessed the quality of the assembled genome in three ways. For assembly accuracy, we used the k-mer-based approach implemented in Merqury²⁶ to calculate the quality value (QV) with a k-mer size of 21, using the paired-end reads. For assembly completeness, we ran BUSCO (version 5.4.2)²⁷ against the 9,226 BUSCOs of mammalia_odb10, and we realigned the paired-end reads to the assembled genome using BWA²⁸ to calculate the realignment ratio and coverage depth.
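For readers unfamiliar with the k-mer-based QV, the calculation can be sketched as follows. It follows the definition given in the Merqury publication: the per-base error rate is derived from the fraction of assembly k-mers that are absent from the read set, and QV is the Phred-scaled error rate. The function and argument names are hypothetical, and this sketch is not a substitute for running Merqury itself.

```python
import math


def kmer_qv(asm_only_kmers, total_asm_kmers, k=21):
    """Phred-scaled consensus quality from k-mer counts.

    asm_only_kmers: assembly k-mers not found in the read set.
    total_asm_kmers: all k-mers in the assembly.
    """
    if total_asm_kmers <= 0 or not 0 <= asm_only_kmers <= total_asm_kmers:
        raise ValueError("invalid k-mer counts")
    # Probability that a single base is correct, assuming a k-mer is
    # read-supported only if all k of its bases are correct.
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    error_rate = 1 - p_base_correct
    if error_rate == 0:
        return math.inf
    return -10 * math.log10(error_rate)


def qv_to_error_rate(qv):
    """Convert a Phred-scaled QV back to a per-base error rate."""
    return 10 ** (-qv / 10)


# Per-base error rates implied by QV 40 and QV 43.68.
print(qv_to_error_rate(40), qv_to_error_rate(43.68))
```

On this scale, QV 40 corresponds to one error per 10,000 bases, and a QV of 43.68 corresponds to a per-base error rate of roughly 4.3 × 10⁻⁵.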

Annotation of repetitive sequences, protein-coding gene and noncoding RNA gene structure

We used both homology-based searching and ab initio prediction to annotate the repeated sequences in the cashmere goat genome. Specifically, we first used LTR_FINDER v1.0.7²⁹, PILER v3.3.0³⁰, RepeatScout v1.0.5³¹, and RepeatModeler v1.0.8³² for the de novo construction of candidate libraries of repetitive elements in the goat genome. The de novo repeat libraries, together with the Repbase database, were then used to annotate repeated sequences in the cashmere goat assembly with RepeatMasker (v4.0.5)³³. Additionally, RepeatProteinMask (v4.0.5) with default parameters was used to predict transposable elements based on the RepeatPeps database. Combining these results, we identified 1,333.29 Mb (48.26%) of repeat sequences in the cashmere goat assembly (Table 7).

Table 7 The summary of the repetitive sequences in the cashmere goat genome.

We used homology-, de novo-, and transcriptome-based approaches to predict the protein-coding genes in the cashmere goat genome. For homology-based gene prediction, we aligned the protein sequences from five mammalian genomes, namely Homo sapiens (GCF_000001405.39), Mus musculus (GCF_000001635.27), Bos taurus (GCF_000003205.7), Ovis aries (GCF_016772045.1) and Capra hircus (GCF_001704415.1), against the cashmere goat genome using TBLASTN (version 2.2.29+) with an e-value cut-off of 1e-5³⁴. The BLAST hits passing this cut-off were concatenated with Solar (version 0.9.6). We extracted the corresponding genomic region, including 1,000 bp upstream and downstream of each candidate gene, to predict the precise gene structure using wise2 (v2.4.1)³⁵. The resulting predictions were designated the ‘Homology set’. For transcriptome-based prediction, RNA-seq data were assembled into transcript sequences with Trinity (v2.1.1)³⁶. We aligned the transcript sequences against the cashmere goat genome using the Program to Assemble Spliced Alignments (PASA)³⁷, in which effective alignments were clustered by genome mapping location and assembled into gene structures. The gene models created by PASA were labeled the PASA Trinity set (PASA-T-set). In addition, RNA-seq reads were mapped directly to the cashmere goat genome using TopHat (v2.0.13)³⁸, and the mapped reads were assembled into gene models (Cufflinks-set) by Cufflinks (v2.1.1)³⁹. For de novo gene prediction, we used Augustus (v2.5.5)⁴⁰, GeneID (v1.4)⁴¹, Genscan (v1.0)⁴², GlimmerHMM (v3.0.1)⁴³, and SNAP (version 2013-11-29)⁴⁴ to predict genes in the repeat-masked genome. Specific parameters of Augustus, SNAP, and GlimmerHMM were trained with the gene models in the PASA Trinity set. Finally, all gene models from the above sets were integrated by EVidenceModeler (v1.1.1), with the following weights assigned to each type of evidence: PASA-T-set > Homology-set = Cufflinks-set > Augustus > GeneID = SNAP = GlimmerHMM = Genscan. Additionally, we filtered out genes that were shorter than 50 amino acids, supported only by ab initio evidence, and had an expression value of less than 1. As a result, we obtained 22,480 protein-coding genes in the cashmere goat genome (Table 8).

Table 8 The summary of gene structure in the cashmere goat genome.

We annotated the functions of protein-coding genes in the cashmere goat genome by homology searches against the SwissProt⁴⁵, KEGG pathway⁴⁶, NR (from NCBI), and InterPro databases. Pfam domain and Gene Ontology (GO) information was obtained from the InterPro database using the InterProScan tool⁴⁷, which predicts function from conserved protein domains and functional sites. For the other databases, we used BLASTP with an e-value cut-off of 1e-4³⁴. Overall, 99.12% of the protein-coding genes were supported by at least one functional database (Table 9).

Table 9 The summary of gene function annotation.

We predicted the gene structures of noncoding RNAs in the cashmere goat genome. Specifically, we used tRNAscan-SE (v1.3.1) to predict tRNAs⁴⁸. We predicted ribosomal RNA (rRNA) sequences by searching against the invertebrate rRNA database using BLAST with an E-value cut-off of 1e-10⁴⁹. Additionally, we annotated small nuclear and nucleolar RNAs, as well as miRNAs, using Infernal (v1.1rc4) based on the Rfam database⁵⁰. As a result, we identified a total of 11,248 microRNA (miRNA), 34,806 transfer RNA (tRNA), 756 ribosomal RNA (rRNA), and 2,003 small nuclear RNA (snRNA) genes (Table 10). The abundance of these categories of noncoding RNA in the cashmere goat genome is similar to that in the Saanen goat genome.

Table 10 The summary of noncoding RNA genes.

Technical Validation

The assessment of the cashmere goat assembly

The final genome size closely matches the estimate (2.79 Gb) from the k-mer analysis (Fig. 2a and Table 5). Our assembled goat genome exhibits excellent completeness, as evidenced by the mapping of 99.97% of the short reads, which cover 99.51% of the genome, and by the recovery of 95.9% of the 9,226 conserved mammalian BUSCOs (Benchmarking Universal Single-Copy Orthologs)²⁷ from the mammalia_odb10 database (Tables 3, 11). These BUSCO results surpass the average BUSCO values of approximately 60 recently published vertebrate genomes^13,51,52,53. Furthermore, using a reference-free, k-mer-based approach, we estimated a high assembly quality value (QV) of 43.68 (Table 3), which exceeds the Vertebrate Genome Project (VGP) standard of QV40^26,54 and indicates high accuracy of our assembly.

Table 11 The summary of the assembly completeness assessment for the cashmere goat genome.

Comparison between the cashmere goat and other published goat genomes

Compared with the published goat genomes⁷, our new assembly has the best contiguity, completeness, and accuracy (Table 3 and Fig. 2b). We used the LASTZ program⁵⁵ to identify one-to-one synteny blocks (>1 kb) between our assembled genome and the Saanen goat genome, and found that 95% of the cashmere goat assembly is syntenic with 99% of the Saanen assembly, the second-best goat assembly (Fig. 3a). We also found a total of 142.26 Mb and 19.19 Mb of unique sequences in the cashmere and Saanen goat assemblies, respectively (Table 12). Remarkably, unique sequences longer than 1 Mb totaled 110.52 Mb, accounting for the majority (77.83%), and almost all of them were distributed around the proximal telomeres of 18 chromosomes (Fig. 3a). The longest unique sequence, 17 Mb, is on chromosome 17. We mapped the three kinds of sequencing data (HiFi, ONT ultra-long and paired-end reads) from the cashmere goat onto its assembly and found that these unique sequences were covered almost as well as neighboring regions (Fig. 3b), which supports the accuracy of the assembled sequences.

Table 12 Genomic syntenic result of the cashmere goat genome against Saanen_v1 genome.

Data Records

The raw data, including PacBio, ONT, Hi-C, paired-end sequencing, and RNA-seq data, have been deposited in the NCBI Sequence Read Archive (SRA) under accession codes SRR28823973–SRR28823996 and SRR30599194–SRR30599199⁵⁶. The genome assembly has been deposited in the GenBank database under accession number GCA_040822015.1⁵⁷. The genome sequences and annotation files (i.e., the GFF file and the FASTA files recording coding sequences and protein sequences) of the cashmere goat are available at Figshare (https://doi.org/10.6084/m9.figshare.25697928.v1)⁵⁸.

Code availability

All commands and pipelines used in data processing were executed according to the manual and protocols of the corresponding bioinformatics software.

References

1. Zeder, M. A. & Hesse, B. The initial domestication of goats (Capra hircus) in the Zagros mountains 10,000 years ago. Science 287, 2254–7 (2000).
2. Daly, K. G. et al. Ancient goat genomes reveal mosaic domestication in the Fertile Crescent. Science 361, 85–88 (2018).
3. Zheng, Z. et al. The origin of domestication genes in goats. Sci Adv 6, eaaz5216 (2020).
4. Hatziminaoglou, Y. & Boyazoglu, J. The goat in ancient civilisations: from the Fertile Crescent to the Aegean Sea. Small Ruminant Research 51, 123–129 (2004).
5. MacHugh, D. E. & Bradley, D. G. Livestock genetic origins: goats buck the trend. Proc Natl Acad Sci USA 98, 5382–4 (2001).
6. Gong, G. et al. Identification of Genes Related to Hair Follicle Cycle Development in Inner Mongolia Cashmere Goat by WGCNA. Front Vet Sci 9, 894380 (2022).
7. Li, R. et al. A near complete genome for goat genetic and genomic research. Genet Sel Evol 53, 74 (2021).
8. Dong, Y. et al. Sequencing and automated whole-genome optical mapping of the genome of a domestic goat (Capra hircus). Nat Biotechnol 31, 135–41 (2013).
9. Du, X. et al. An update of the goat genome assembly using dense radiation hybrid maps allows detailed analysis of evolutionary rearrangements in Bovidae. BMC Genomics 15, 625 (2014).
10. Bickhart, D. M. et al. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet 49, 643–650 (2017).
11. Siddiki, A. Z. et al. The genome of the Black Bengal goat (Capra hircus). BMC Res Notes 12, 362 (2019).
12. Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 (2022).
13. Tian, S. et al. Comparative analyses of bat genomes identify distinct evolution of immunity in Old World fruit bats. Sci Adv 9, eadd0141 (2023).
14. Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
15. Deorowicz, S., Kokot, M., Grabowski, S. & Debudaj-Grabysz, A. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31, 1569–76 (2015).
16. Sun, H., Ding, J., Piednoel, M. & Schneeberger, K. findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics 34, 550–557 (2018).
17. Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255 (2020).
18. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357–9 (2012).
19. Wingett, S. et al. HiCUP: pipeline for mapping and processing Hi-C data. F1000Res 4, 1310 (2015).
20. Li, D. et al. Population genomics identifies patterns of genetic diversity and selection in chicken. BMC Genomics 20, 263 (2019).
21. Marcais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol 14, e1005944 (2018).
22. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
23. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 170–175 (2021).
24. Myers, E. W. The fragment assembly string graph. Bioinformatics 21(Suppl 2), ii79–85 (2005).
25. Zhang, X., Zhang, S., Zhao, Q., Ming, R. & Tang, H. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat Plants 5, 833–845 (2019).
26. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245 (2020).
27. Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–2 (2015).
28. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–95 (2010).
29. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res 35, W265–8 (2007).
30. Edgar, R. C. & Myers, E. W. PILER: identification and classification of genomic repeats. Bioinformatics 21(Suppl 1), i152–8 (2005).
31. Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21(Suppl 1), i351–8 (2005).
32. Smit, A. & Hubley, R. RepeatModeler Open-1.0. Available from http://www.repeatmasker.org (2008).
33. Smit, A., Hubley, R. & Green, P. RepeatMasker Open-4.0. 2013–2015. (2015).
34. Mount, D. W. Using the basic local alignment search tool (BLAST). CSH Protoc 2007, pdb top17 (2007).
35. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res 14, 988–95 (2004).
36. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644–52 (2011).
Article CAS PubMed PubMed Central Google Scholar
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31, 5654–5666 (2003).
Article CAS PubMed PubMed Central Google Scholar
Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14, R36 (2013).
Article PubMed PubMed Central Google Scholar
Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7, 562–78 (2012).
Article CAS PubMed PubMed Central Google Scholar
Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19 (Suppl 2), ii215–25 (2003).
Article PubMed Google Scholar
Guigo, R. Assembling genes from predicted exons in linear time with dynamic programming. J Comput Biol 5, 681–702 (1998).
Article CAS PubMed Google Scholar
Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J Mol Biol 268, 78–94 (1997).
Article CAS PubMed Google Scholar
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–9 (2004).
Article CAS PubMed Google Scholar
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
Article PubMed PubMed Central Google Scholar
UniProt Consortium, T. UniProt: the universal protein knowledgebase. Nucleic Acids Res 46, 2699 (2018).
Article PubMed PubMed Central Google Scholar
Kanehisa, M. et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res 42, D199–205 (2014).
Article CAS PubMed Google Scholar
Quevillon, E. et al. InterProScan: protein domains identifier. Nucleic Acids Res 33, W116–20 (2005).
Article CAS PubMed PubMed Central Google Scholar
Schattner, P., Brooks, A. N. & Lowe, T. M. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res 33, W686–9 (2005).
Article CAS PubMed PubMed Central Google Scholar
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J Mol Biol 215, 403–10 (1990).
Article CAS PubMed Google Scholar
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–5 (2013).
Article CAS PubMed PubMed Central Google Scholar
Shao, Y. et al. Phylogenomic analyses provide insights into primate evolution. Science 380, 913–924 (2023).
Article ADS CAS PubMed Google Scholar
Jebb, D. et al. Six reference-quality genomes reveal evolution of bat adaptations. Nature 583, 578–584 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Peng, C. et al. Large-scale snake genome analyses provide insights into vertebrate development. Cell 186, 2959–2976 e22 (2023).
Article PubMed Google Scholar
Editorial, N.B. A reference standard for genome biology. Nat Biotechnol 36, 1121 (2018).
Harris, R. S. Improved pairwise alignment of genomic DNA. (2007).
Wang, Z. et al. NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRP504456 (2024).
Wang, Z. et al. Genbank https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1104404 (2024).
Wang, Z. et al. Genome annotated files for the reference genome of cashmere goat. Figshare. https://doi.org/10.6084/m9.figshare.25697928.v1 (2024).

Download references

Acknowledgements

This work was financially supported by the National Key Research and Development Program of China (2022YFE0113300, 2022YFD1300204, 2022YFD1300201), the Central Government Guides Local Science and Technology Development Fund Projects (2022ZY0211), the Inner Mongolia Agricultural University Outstanding Youth Science Fund Cultivation Project (BR230304), the China Agriculture Research System of MOF and MARA (No. CARS-39), the Scientific and Technological Program of Inner Mongolia Autonomous Region (2023KYPT0021), the Key Discipline Key Laboratory Project (NNDWTCS-2023059), the Higher Education Reform and Development Project (NNDWTCS-2023058), and the Beijing Nova Program (Z211100002121022 and 20230484446).

Author information

These authors contributed equally: Zhiying Wang, Qi Lv, Wenze Li, Wanlong Huang

Authors and Affiliations

College of Animal Science, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia Autonomous Region, 010018, China
Zhiying Wang, Qi Lv, Wenze Li, Gao Gong, Xiaochun Yan, Yanjun Zhang, Ruijun Wang, Jinquan Li & Rui Su
Sino-Arabian Joint Laboratory of Sheep and Goat Germplasm Innovation, Hohhot, Inner Mongolia Autonomous Region, 010018, China
Zhiying Wang, Qi Lv, Wenze Li, Gao Gong, Xiaochun Yan, Yanjun Zhang, Ruijun Wang, Jinquan Li & Rui Su
Inner Mongolia Key Laboratory of Sheep & Goat Genetics Breeding and Reproduction, Hohhot, Inner Mongolia Autonomous Region, 010018, China
Zhiying Wang, Qi Lv, Wenze Li, Gao Gong, Xiaochun Yan, Yanjun Zhang, Ruijun Wang, Jinquan Li & Rui Su
Novogene Bioinformatics Institute, Beijing, 100015, China
Wanlong Huang & Shilin Tian
Inner Mongolia Yiwei White Cashmere Goat Co., Ltd, Ordos, Inner Mongolia Autonomous Region, 017000, China
Baichuan Liu, Oljibilig Chen & Na Wang

Contributions

Zhiying Wang, Qi Lv and Rui Su conceived the research project. Wenze Li and Wanlong Huang performed the data analyses and wrote the manuscript. Gao Gong and Xiaochun Yan drew the figures. Baichuan Liu, Oljibilig Chen and Na Wang provided the samples. Yanjun Zhang, Ruijun Wang and Jinquan Li assisted in developing the concept of the manuscript. Shilin Tian checked the experimental design, the data analysis results, and the writing of the paper. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Shilin Tian or Rui Su.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Wang, Z., Lv, Q., Li, W. et al. Chromosome-level genome assembly of the cashmere goat. Sci Data 11, 1107 (2024). https://doi.org/10.1038/s41597-024-03932-7

Received: 29 April 2024
Accepted: 24 September 2024
Published: 09 October 2024
DOI: https://doi.org/10.1038/s41597-024-03932-7
