Abstract
The goat, an early domesticated ruminant, is a reliable source of cashmere, meat and milk in global agricultural production. Despite this, the genome of cashmere-rich goats has yet to be characterized. Here, we assembled the nearly complete genome of a cashmere goat from a highly economically valuable Inner Mongolian Cashmere buck, utilizing a combination of PacBio HiFi, ONT ultra-long reads, and Hi-C technologies. The size of this genome is 2.76 Gb, with a contig N50 of 95.22 Mb. All assembled sequences were anchored onto 29 autosomes and both sex chromosomes, with only two gaps present on the X chromosome. We identified 1,333.29 Mb (48.26%) of repetitive sequences and predicted 22,480 protein-coding genes. Assembly quality assessment of the genome demonstrated that our assembled cashmere goat genome surpasses the continuity, completeness, and accuracy of other published goat genomes. Taken together, we provided the first cashmere goat assembly, bridging the gap in the genome of important economic breeds of domestic goats, and providing a valuable reference resource for goat genetics and genome research.
Background & Summary
Goats are among the earliest domesticated ruminants. Archaeological and genetic findings have demonstrated that domestic goats were derived from the wild goat, known as the bezoar (Capra aegagrus), in the Fertile Crescent region of Western Asia during the Neolithic Age, approximately 10,000 years ago1,2,3. Goats hold a significant place in many cultures around the world, often being used in various religious and traditional ceremonies4. They are widely raised globally, capable of adapting to harsh environments such as deserts and semi-desert environments5. Goats provide a reliable source of cashmere, meat, milk and skin. They also play a crucial role in sustainable farming and food production in many parts of the world. According to a report by the Food and Agriculture Organization of the United Nations (FAO; www.fao.org/home/en), there are over 1.15 billion goats worldwide, with more than 580 distinct goat breeds. Of these, 94.93% of the world’s goat population is in Asia and Africa (FAO, 2022), mainly distributed in developing and underdeveloped areas such as China, India, Iran, and Afghanistan.
Cashmere, with its unique qualities, has become a high-end and precious spinning raw material, earning it the reputation of ‘soft gold’ and ‘fiber gem’6. The Inner Mongolian Arbas Cashmere goat is a world-class dual-purpose breed producing cashmere and meat, formed through natural selection and artificial breeding, possessing excellent traits such as high cashmere yield, fine cashmere quality, and large body size. The cashmere produced by it was honored with the ‘Chaigneau’ Quality Wool Award by the European Fiber Committee in 1985. Moreover, the Inner Mongolian Arbas goat is the important paternal origin of most of China’s goat breeds, its export is strictly prohibited, and it is classified as a first-class protected species of genetic resources in China.
To date, five high-quality goat genome sequences have been published, but the genome of the cashmere goat is lacking7,8,9,10,11. Among these, the chromosome-level genome assembled from a male Saanen dairy goat is of the highest quality7. This genome identified approximately two-thirds (20/29) of the proximal and distal centromeric and telomeric repeats of the autosomes. Nonetheless, some autosomes and the X and Y chromosomes in the Saanen genome remain incomplete. The main challenge in these unresolved regions of the genome is the inability to fully assemble highly repetitive sequences12.
In this study, we undertook the de novo sequencing of a cashmere goat, specifically a highly economically valuable Inner Mongolian Arbas White Cashmere buck, by integrating 72.89 Gb of PacBio high-fidelity (HiFi) sequences, 299.91 Gb of Oxford Nanopore Technologies (ONT) ultra-long data, 314.29 Gb of high-throughput chromosome conformation capture (Hi-C) data and 226.97 Gb of paired-end sequences (Tables 1, 2). We utilized a recently published, enhanced four-step assembly strategy13 to generate an assembled sequence of 2.76 Gb, with a contig N50 of 95.22 Mb (Table 3). The assembled sequences were anchored onto 29 autosomes and both sex chromosomes (X and Y) (Fig. 1a). Apart from chromosome X, which consists of three contigs, each chromosome consists of a single contig (Fig. 1b). The length of the Y chromosome has increased from the previously reported 9.60 Mb in the Saanen genome to 11.92 Mb. We identified 35 telomeric structures located in the distal-end sequences of the 29 chromosomes, enriched with 6-bp repeat units (TTAGGG/CCCTAA) (Fig. 1b). This genome comprises 1,333.29 Mb of repetitive sequences, which constitute 48.26% of total genome bases, and a total of 22,480 protein-coding genes, 11,248 microRNA (miRNA), 34,806 transfer RNA (tRNA), 756 ribosomal RNA (rRNA), and 2,003 small nuclear RNA (snRNA) genes. Evaluation of assembly quality reveals that our new assembly demonstrates superior continuity, completeness, and accuracy compared to other goat genomes (Table 3). This is the first cashmere goat genome, providing an optimal genomic reference for studying the genetic basis of economically important cashmere traits in goats.
Fig. 1 Characterization of the cashmere goat genome. (a) Chromatin interactions in each chromosome at a resolution of 5 Mb. The dark red dots show a high probability of interaction, and the light dots show a low probability of interaction. (b) Distribution of telomere sequences (6-bp-unit repeats) and gap regions on each chromosome.
Methods
Ethics statement
In this study, we collected whole blood in strict accordance with the International Guiding Principles for Biomedical Research Involving Animals. This procedure was approved by the Special Committee on Scientific Research and Academic Ethics at Inner Mongolia Agricultural University, which is responsible for the approval of Biomedical Research Ethics [Approval No. (2020) 056]. These activities did not require any specific permissions and did not involve any endangered or protected species.
Sample collection
This experiment was conducted at the Inner Mongolia Yiwei White Cashmere Goat Co., Ltd. (Ordos, Inner Mongolia, China). A healthy adult male goat (3 years old) with the best production performance, records, and phenotype within the population was selected. We collected 10 ml of venous blood from the cashmere goat using EDTA anticoagulant blood collection tubes. The sample was gently mixed by hand ten times to ensure thorough mixing of the anticoagulant and blood. It was then divided into Corning 2 ml freezer tubes. Following rapid freezing in liquid nitrogen, the samples were stored in a −80 °C freezer for subsequent experiments.
For DNA extraction, 500–1000 µl of whole blood was added to a preheated (56 °C) 2 ml Eppendorf tube containing 1 ml of lysis buffer (100 µL of 20 mg/mL Proteinase K and 100 µL of 20% SDS), and incubated at 56 °C for 60–120 minutes. After cooling to room temperature, the tube was centrifuged at maximum speed for 10 minutes to collect the supernatant. The supernatant was transferred to a new 2.0 ml tube, mixed with an equal volume of phenol/chloroform/isoamyl alcohol (25:24:1), and centrifuged at maximum speed for 10 minutes. The supernatant was then transferred to a new 1.5 ml tube and a two-thirds volume of isopropyl alcohol was added (with a one-tenth volume of 3 M sodium acetate if necessary). The mixture was inverted at least 3 times and left at −20 °C for 2 hours for precipitation. After centrifuging the tube at 18,213 × g for 10 minutes, the DNA pellet was washed with 1 ml of 75% ethanol. The pellet was collected again by centrifuging at maximum speed for 5 minutes at room temperature, and the supernatant was completely removed. The DNA pellet was air-dried in a biosafety cabinet for a few minutes and then dissolved in 25 µL to 100 µL of TE Buffer. The DNA concentration was measured using a Qubit Fluorometer, and the sample integrity and purity were assessed by agarose gel electrophoresis.
Paired-end library preparation, sequencing and quality control
Between 1 and 1.5 μg of genomic DNA was randomly fragmented by Covaris, after which fragments with an average size of 200–400 bp were selected using the Agencourt AMPure XP-Medium kit. These selected fragments underwent end-repair, 3′ adenylation, adapter ligation, and PCR amplification. The products were then recovered using the AxyPrep Mag PCR Clean-Up Kit. The double-stranded PCR products were heat denatured and circularized by the splint oligo sequence in the MGIEasy Circularization Module (CAT#1000005260, MGI). The single-strand circular DNA (ssCir DNA) formed the final library and was checked by quality control. The qualified libraries were subsequently sequenced on the DNBSEQ-T7 platform, resulting in a total of 246.45 Gb of raw sequences (Table 1). After filtering with the default parameters of fastp software (v0.23.4)14, the remaining 226.97 Gb (92.10%) of high-quality data was used for genome survey and genome assessment analyses.
Nanopore library construction and sequencing
We extracted DNA from 500 μL of whole blood using the Qiagen DNeasy kit in accordance with the manufacturer’s guidelines. The DNA was eluted into 50 μL and subsequently concentrated to approximately 25 ng/μL using a Zymo DNA Clean and Concentrator Kit, resulting in a final elution volume of roughly 50 μL post-concentration. Nanopore sequencing libraries were prepared using a 1D genomic ligation kit (SQK-LSK108), following the manufacturer’s instructions but with minor modifications: the dA-tailing and FFPE repair steps were combined by using 46.5 μL of input DNA, 0.5 μL NAD+, 3.5 μL Ultra II EndPrep buffer and FFPE DNA repair buffer, and 3.0 μL of Ultra II EndPrep Enzyme and FFPE Repair Mix, resulting in a total reaction volume of 60 μL. The subsequent thermocycler conditions were adjusted to 60 minutes at 20 °C and 30 minutes at 65 °C. The remainder of the protocol was executed as per the manufacturer’s instructions. Approximately 15 μL of the resultant library was loaded onto a PromethION with an R9.4.1 flowcell and run for 48 hours using MinKNOW version 2.0. Fastq files were derived from raw Nanopore data using Albacore (v2.3.1). A total of 6,976,996 approved reads (299.91 Gb), with an average read length of 42,985 bp and a read length N50 of 80,219 bp, were obtained for further genome assembly (Table 2).
Pacbio HiFi library preparation and sequencing
The SMRTbell library was prepared according to the PacBio protocol. Briefly, over 5 μg of sheared and concentrated genomic DNA was processed using g-TUBEs (Covaris, USA) to achieve the desired fragment size for the library. Single-strand overhangs were removed, and the DNA fragments were repaired, end-polished, and ligated with stem-loop adapters. Fragments that failed to ligate were removed by exonuclease, and target fragments were selected using the BluePippin system (Sage Science, USA). The resulting SMRTbell library was purified using AMPure PB beads, and the fragment sizes were verified using an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). Two SMRTbell libraries, each approximately 40 kb in size, were then sequenced using PacBio RSII equipment. A total of 4,303,402 HiFi reads were obtained, with an average read length of 16,939 bp and a read length N50 of 17,091 bp, which were then used for further genome assembly (Table 2).
Hi-C library preparation and sequencing
First, chromatin conformation in blood cells was fixed with formaldehyde, the genomic DNA was digested with the restriction enzyme DpnII, and the resulting 5′ overhangs were repaired using biotinylated residues. After in situ ligation of the blunt-end fragments, the isolated DNA was reverse cross-linked, purified, and filtered to retain biotin-containing fragments. Subsequently, DNA end repair, adaptor ligation, and PCR amplification were performed. Finally, a library was constructed using the NEBNext Ultra II DNA Library Prep Kit for Illumina (NEB) according to the manufacturer’s instructions and sequenced on a NovaSeq platform to generate short paired-end reads of 150 bp in length. A total of 314.29 Gb of clean data was obtained for further genome assembly (Table 1).
RNA library preparation and sequencing
RNA extracted from five tissues (heart, kidney, lung, liver and spleen) of the buck used for genome assembly was used to construct mRNA-seq libraries. For full-length transcriptome sequencing, we constructed a pooled library from the RNA of these five tissues. The pooled RNA was first checked as follows: (1) agarose gel electrophoresis was used to exclude degraded and contaminated RNA; (2) RNA purity (OD260/280 ratio) was checked using a Nanodrop; (3) RNA concentration was quantified precisely using the Qubit system; and (4) RNA integrity was assessed using an Agilent 2100. After passing the quality check, ~3 μg of pooled RNA was reverse transcribed into cDNA using the Clontech SMARTer PCR cDNA Synthesis Kit (Takara Biotechnology, Dalian, China), which was then amplified to generate double-stranded cDNA. The BluePippin™ Size Selection System (Sage Science, Beverly, MA, USA) was used to select cDNA fragments of <4 kb and >4 kb. For each SMRTbell library, ~1 μg of cDNA was used for construction with the Pacific Biosciences SMRTbell Template Prep Kit (PacBio, CA, USA) according to the user manual. Finally, the SMRT cells were sequenced on the PacBio Sequel platform. This process generated 130,252,514 subreads, amounting to a total of 321.81 Gb, with a read N50 of 2,651 bp. Following circular consensus sequence calling, primer removal and demultiplexing, and refining and clustering for parallel polishing, a total of 121,692 high-quality consensus transcript sequences were derived from the primary PacBio BAM file of full-length transcriptome sequencing. These sequences were then used for further genome annotation.
The mRNA-seq libraries for short-read transcriptome sequencing were constructed using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, USA), with all procedures strictly adhering to the manufacturer’s recommendations. All libraries were sequenced on an Illumina NovaSeq6000 platform in PE-150 mode. Following the removal of low-quality reads and adaptor sequences by fastp (v0.23.4)14, we obtained a total of 31.29 Gb of high-quality data for five tissues of the cashmere goat (Table 4).
Genome survey
Before proceeding with genome assembly, we conducted a k-mer analysis using high-quality paired-end reads to estimate the genome size and heterozygosity. Specifically, the paired-end reads were analyzed through a 17-mer frequency distribution using the KMC software15 with the parameter “-k17 -ci1 -cs1000000”. This process generated spectrum data containing a total of 94,711,895,031 k-mers (Table 5), which were subsequently analyzed using FindGSE software16. The cashmere goat genome size was estimated using the formula: G = Knum/Kdepth, where G represents the genome size, Knum is the total number of 17-mers, and Kdepth denotes the 17-mer depth. As a result, we estimated the genome size to be 2.79 Gb, with a heterozygosity rate of 0.40% (Table 5).
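The relation G = Knum/Kdepth can be illustrated with a minimal Python sketch. It is not the FindGSE model used in this study: it simply takes a k-mer depth histogram, treats the most frequent depth above a low-depth cut-off as Kdepth, and divides the total k-mer count by it. The function name, the `min_depth` cut-off and the toy histogram are assumptions made for illustration only.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as Knum / Kdepth from a k-mer depth histogram.

    histogram maps a k-mer depth to the number of distinct k-mers observed
    at that depth. Depths below min_depth are ignored when locating the
    main peak, on the assumption that they are dominated by sequencing
    errors (a simplification made in this sketch).
    """
    # Knum: total number of k-mer occurrences across all depths.
    k_num = sum(depth * count for depth, count in histogram.items())
    # Kdepth: depth at which the main peak of the distribution sits.
    candidates = {d: c for d, c in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no depth at or above min_depth in histogram")
    k_depth = max(candidates, key=candidates.get)
    return k_num / k_depth, k_num, k_depth


# Toy histogram (not data from this study): an error tail at depths 1-2
# and a main peak centred on depth 30.
toy = {1: 1000, 2: 200, 28: 50, 29: 80, 30: 100, 31: 80, 32: 50}
size, k_num, k_depth = estimate_genome_size(toy)
```

Dedicated tools such as FindGSE fit a distribution to the histogram and handle heterozygous and repeat peaks explicitly, so their estimates can differ from this simple peak-based ratio.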
Genome assembly using an improved four-step assembly strategy
We adopted the previously published improved assembly strategy13 to generate a complete assembly for the cashmere goat. Our assembly pipeline consists of four main steps. Firstly, the ONT ultra-long reads were used to assemble initial contigs by applying the ‘correct-then-assemble’ strategy in NextDenovo (v2.5.2; https://github.com/Nextomics/NextDenovo) with the parameters ‘read_cutoff = 1k, seed_cutoff = 32k, blocksize = 3 g’. The initial contigs were then corrected with the paired-end reads using the ‘best’ algorithm module in NextPolish v1.4.1 (ref. 17). This yielded the ContigV1 genome assembly, with a total of 222 contigs and a contig N50 of 84.04 Mb (Table 6). Secondly, we applied the single-end mode in Bowtie2 to map the Hi-C data onto the ContigV1 assembly18. After discarding the invalid self-ligated and unligated fragments within the uniquely mapped pairs using the HiCUP pipeline (version 0.8.0)19, the valid interaction pairs were used to compute the linkage frequency among all contigs, and the linked contigs were clustered with an agglomerative hierarchical clustering algorithm20. This process resulted in 31 linkage groups. We then used the nucmer program in the MUMmer4 package21 to generate synteny results between the sequences of the 31 linkage groups and the chromosomal sequences of the Saanen goat (NCBI Accession Number: GCA_015443085.1). This step enabled us to further correct the contigs within the cashmere goat linkage groups associated with the X and Y sex chromosomes. Thirdly, we mapped the HiFi reads to the sequences of the 31 linkage groups using Minimap2 (ref. 22), and extracted the best-aligned reads for each linkage group. Each set of classified mapped reads was then used for local assembly using the Hifiasm package (0.19.5-r587) with default parameters23. This process generated a new set of contigs, which we named the ContigV2 assembly (Table 6).
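The clustering of contigs by Hi-C linkage frequency in the second step can be sketched as follows. This is a minimal illustration of average-linkage agglomerative clustering, not the implementation used in this study: the function name, the input layout and the toy link counts are assumptions, and real pipelines typically normalize link counts (for example by contig length or restriction-site number) before clustering.

```python
from itertools import combinations


def cluster_contigs(links, contigs, n_groups):
    """Average-linkage agglomerative clustering on Hi-C link counts.

    links maps frozenset({contig_a, contig_b}) to the number of valid
    Hi-C pairs joining the two contigs; missing pairs count as zero.
    Clusters are merged greedily until n_groups remain.
    """
    clusters = [{c} for c in contigs]

    def avg_links(a, b):
        # Mean link count over all contig pairs spanning the two clusters.
        total = sum(links.get(frozenset((x, y)), 0) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_groups:
        # Pick the pair of clusters with the highest average link count.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_links(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] |= clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Toy example (not data from this study): two strongly linked pairs.
toy_links = {
    frozenset(("a", "b")): 100,
    frozenset(("c", "d")): 80,
    frozenset(("a", "c")): 2,
    frozenset(("b", "d")): 1,
}
groups = cluster_contigs(toy_links, ["a", "b", "c", "d"], n_groups=2)
```

In the toy example, contigs a and b share many links and end up in one group, while c and d form the other; the weak a-c and b-d links are not enough to join the groups.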
The ContigV2 assembly comprises 33 contigs, with a contig N50 of 95.22 Mb. This local assembly strategy can effectively circumvent the false overlap relationships that may be induced by the genome’s repetitive sequences during the construction of the string graph in the assembly process24. Ultimately, the contigs in ContigV2 assembly were clustered, ordered and oriented using the ALLHiC algorithm25, and anchored into 31 chromosomes. Any placement and orientation errors that displayed distinct chromatin interaction patterns were manually rectified. We identified the telomere regions in all chromosomes by searching for 6-bp repeat units (TTAGGG/CCCTAA).
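A search for the 6-bp telomeric repeat units can be sketched in a few lines of Python. The sketch below is an illustration only and is not the procedure used in this study: the function name and the `min_repeats` threshold are assumptions, and the example sequence is synthetic.

```python
import re

TELOMERE_UNITS = ("TTAGGG", "CCCTAA")


def find_telomere_runs(sequence, min_repeats=4):
    """Locate tandem runs of telomeric 6-bp units in a sequence.

    Returns a sorted list of (start, end, unit, n_repeats) tuples with
    0-based, end-exclusive coordinates. min_repeats is an illustrative
    threshold, not a value taken from the study.
    """
    seq = sequence.upper()
    runs = []
    for unit in TELOMERE_UNITS:
        # Match min_repeats or more back-to-back copies of the unit.
        pattern = re.compile("(?:%s){%d,}" % (unit, min_repeats))
        for match in pattern.finditer(seq):
            n_repeats = (match.end() - match.start()) // len(unit)
            runs.append((match.start(), match.end(), unit, n_repeats))
    return sorted(runs)


# Synthetic chromosome: CCCTAA repeats at one end, TTAGGG at the other.
toy_chromosome = "CCCTAA" * 5 + "ACGT" * 10 + "TTAGGG" * 6
runs = find_telomere_runs(toy_chromosome)
```

Restricting the search to windows at each chromosome end, rather than scanning the whole sequence, would separate terminal telomeres from interstitial telomere-like repeats.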
Genome assessment
We employed three distinct methods to assess the quality of the assembled genome. To assess assembly accuracy, we used the k-mer-based approach implemented in Merqury26 to calculate the quality value (QV) with a k-mer size of 21, using the paired-end reads. To assess assembly completeness, we conducted BUSCO analysis by searching against the 9,226 BUSCOs of mammalia_odb10 (version 5.4.2)27, and realigned the paired-end reads to the assembled genome using BWA software28 to calculate the realignment ratio and coverage depth.
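The k-mer-based QV can be illustrated with a short Python sketch. It follows the relation described for Merqury, in which the per-base error rate is estimated from the fraction of assembly k-mers that are absent from the read set; the function names and the example counts are assumptions for illustration, and the sketch does not reproduce Merqury itself.

```python
import math


def kmer_based_qv(asm_only_kmers, total_asm_kmers, k=21):
    """Phred-scaled consensus quality from k-mer counts.

    asm_only_kmers: k-mers found in the assembly but not in the reads.
    total_asm_kmers: all k-mers found in the assembly.
    """
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers observed
    # Probability that a single base is correct, given that a k-mer is
    # supported only if all k of its bases are correct.
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    error_rate = 1 - p_base_correct
    return -10 * math.log10(error_rate)


def qv_to_error_rate(qv):
    """Convert a Phred-scaled QV back to a per-base error rate."""
    return 10 ** (-qv / 10)


# Hypothetical counts, for illustration only.
example_qv = kmer_based_qv(asm_only_kmers=2_000_000, total_asm_kmers=2_500_000_000)
vgp_threshold_error = qv_to_error_rate(40)  # 1 error per 10,000 bases
```

On this scale, a QV of 40 corresponds to one error per 10,000 bases, and each additional 10 units reduces the estimated error rate tenfold.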
Annotation of repetitive sequences, protein-coding gene and noncoding RNA gene structure
We employed both homology-based searching and ab initio prediction to annotate the repetitive sequences within the cashmere goat genome. We first used LTR_FINDER v1.0.7 (ref. 29), PILER v3.3.0 (ref. 30), RepeatScout v1.0.5 (ref. 31), and RepeatModeler v1.0.8 (ref. 32) for the de novo construction of candidate libraries of repetitive elements within the goat genome. Subsequently, the de novo libraries of repeat sequences, in conjunction with the Repbase database, were used to search the cashmere goat assembly for repeat annotation using RepeatMasker (v4.0.5)33. Additionally, RepeatProteinMask (v4.0.5) with default parameters was used to predict transposable elements based on the RepeatPeps database. We combined these results to identify 1,333.29 Mb (48.26%) of repetitive sequences in the cashmere goat assembly (Table 7).
We used homology-, de novo-, and transcriptome-based approaches to predict the protein-coding genes within the cashmere goat genome. For homology-based gene prediction, we aligned the protein sequences from five mammalian genomes, namely Homo sapiens (GCF_000001405.39), Mus musculus (GCF_000001635.27), Bos taurus (GCF_000003205.7), Ovis aries (GCF_016772045.1) and Capra hircus (GCF_001704415.1), against the cashmere goat genome using TBLASTN (version 2.2.29+) with an e-value cut-off of 1e-5 (ref. 34). All remaining BLAST hits were concatenated by Solar software (version 0.9.6). We extracted the corresponding genomic region, including 1,000 bp upstream and downstream of each candidate gene, to predict the precise gene structure using wise2 (v2.4.1)35. The resulting predictions were designated as the ‘Homology set.’ For transcriptome-based prediction, RNA-seq data were assembled into transcript sequences with Trinity (v2.1.1)36. We aligned the transcript sequences against the cashmere goat genome using the Program to Assemble Spliced Alignments (PASA)37, where effective alignments were clustered based on genome mapping location and assembled into gene structures. The gene models created by PASA were labeled as the PASA Trinity set. In addition, RNA-seq reads were directly mapped to the cashmere goat genome using TopHat (v2.0.13)38, and the mapped reads were assembled into gene models (Cufflinks-set) by Cufflinks (v2.1.1)39. For de novo gene prediction, we employed Augustus (v2.5.5)40, GeneID (v1.4)41, Genscan (v1.0)42, GlimmerHMM (v3.0.1)43, and SNAP (version 2013-11-29)44 to predict genes in the repeat-masked genome. Specific parameters of Augustus, SNAP, and GlimmerHMM were trained with the gene models in the PASA Trinity set.
Finally, all gene models from the above sets were integrated by EVidenceModeler (v1.1.1), with the following weights assigned to each type of evidence: PASA-T-set > Homology-set = Cufflinks-set > Augustus > GeneID = SNAP = GlimmerHMM = Genscan. Additionally, we filtered out genes that were less than 50 amino acids in length, only supported by ab initio evidence, and with an expression value of less than 1. As a result, we obtained 22,480 protein-coding genes in the cashmere goat genome (Table 8).
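The post-integration filter can be sketched as a simple predicate over gene models. The sketch below reads the three criteria as jointly required (a gene is removed only if it is shorter than 50 amino acids, supported solely by ab initio predictors, and has an expression value below 1); this reading is an assumption, as are the class, field and label names, which are hypothetical and not taken from the study's pipeline.

```python
from dataclasses import dataclass

# Labels for the ab initio predictors; the label strings are hypothetical.
AB_INITIO = frozenset({"augustus", "geneid", "genscan", "glimmerhmm", "snap"})


@dataclass
class GeneModel:
    gene_id: str
    protein_length: int   # length in amino acids
    evidence: frozenset   # sources supporting the model
    expression: float     # expression value (unit not stated in the text)


def is_low_confidence(gene, min_aa=50, min_expr=1.0):
    """True if the gene meets all three removal criteria (assumed reading)."""
    only_ab_initio = gene.evidence <= AB_INITIO
    return (
        gene.protein_length < min_aa
        and only_ab_initio
        and gene.expression < min_expr
    )


def filter_genes(genes):
    """Keep every gene model that is not flagged as low confidence."""
    return [g for g in genes if not is_low_confidence(g)]


# Toy gene models, for illustration only.
toy_genes = [
    GeneModel("g1", 40, frozenset({"snap"}), 0.2),              # removed
    GeneModel("g2", 40, frozenset({"snap", "homology"}), 0.2),  # kept
    GeneModel("g3", 300, frozenset({"augustus"}), 0.0),         # kept
]
kept = filter_genes(toy_genes)
```

If the criteria were instead applied independently, replacing `and` with `or` in `is_low_confidence` would give the stricter filter.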
We annotated the functions of protein-coding genes within the cashmere goat genome by homology searches against the SwissProt45, KEGG pathway46, NR (from NCBI), and InterPro databases. Pfam domain and Gene Ontology (GO) information was obtained from the InterPro database using the InterProScan tool47, which predicts function from conserved protein domains and functional sites. For the other databases, we used BLASTP with an e-value cut-off of 1e-4 (ref. 34). Consequently, we found that 99.12% of the protein-coding genes were supported by functional databases (Table 9).
We predicted the gene structures of noncoding RNAs in the cashmere goat genome. Specifically, we used the tRNAscan-SE tool (v1.3.1) to predict tRNAs48. We predicted ribosomal RNA (rRNA) sequences by searching against the invertebrate rRNA database using BLAST with an E-value cut-off of 1e-10 (ref. 49). Additionally, we annotated small nuclear and nucleolar RNAs, as well as miRNAs, using Infernal (v1.1rc4) based on the Rfam database50. As a result, we identified a total of 11,248 microRNA (miRNA), 34,806 transfer RNA (tRNA), 756 ribosomal RNA (rRNA), and 2,003 small nuclear RNA (snRNA) genes (Table 10). The abundance of these different categories of noncoding RNA in the cashmere goat genome is similar to that in the Saanen goat genome.
Technical Validation
The assessment of the cashmere goat assembly
The final genome size (2.76 Gb) is closely aligned with the estimate (2.79 Gb) from k-mer analysis (Fig. 2a and Table 5). Our assembled goat genome exhibits excellent completeness, as evidenced by the realignment of 99.97% of short reads, covering 99.51% of the genome, and the recovery of 95.9% of BUSCOs (Benchmarking Universal Single-Copy Orthologs)27 among the 9,226 conserved mammalian genes from the mammalia_odb10 database (Tables 3, 11). Our BUSCO results surpass the average BUSCO values of the approximately 60 most recently published vertebrate genomes13,51,52,53. Furthermore, we used a reference-free, k-mer-based approach and estimated a high assembly quality value (QV) of 43.68 (Table 3), exceeding the Vertebrate Genome Project (VGP) standard of QV40 (refs. 26, 54) and indicating high accuracy in our assembly.
Fig. 2 Genome survey and assembly quality in cashmere goat. (a) The frequency distribution of the 17-mers. (b) Genome contiguity of the cashmere goat (red line) compared to two other chromosome-level goat assemblies (broken lines). X-axis shows contig N50 and N90 values. Horizontal black dashed lines indicate cumulative contig size, with combined lengths shown as a percentage (X%) of total genome length.
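The contig N50 and N90 values used throughout to describe contiguity can be computed as in the following Python sketch. The function name and the toy contig lengths are assumptions for illustration; the lengths are not taken from any assembly in this study.

```python
def nx(lengths, x=50):
    """Return the Nx value for a collection of contig lengths.

    Nx is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least x% of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths in Mb, for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20]
n50 = nx(toy_lengths, 50)
n90 = nx(toy_lengths, 90)
```

For the toy lengths, the total is 290 Mb; the two longest contigs already cover half of it, so N50 is 70 Mb, whereas five contigs are needed to reach 90%, giving an N90 of 30 Mb.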
The comparison between the cashmere goat and other published goat genomes
Compared to the published goat genomes7, our new assembly possesses the best contiguity, completeness, and accuracy (Table 3 and Fig. 2b). We used the LASTZ program55 to analyze the one-to-one synteny blocks (>1 kb) between our assembled genome and the Saanen goat genome, and found that 95% of the cashmere goat assembly was syntenic with 99% of the Saanen assembly, the second-best goat genome assembly (Fig. 3a). We also identified a total of 142.26 Mb and 19.19 Mb of unique sequences in the Cashmere and Saanen goat assemblies, respectively (Table 12). Remarkably, unique sequences greater than 1 Mb in length totaled 110.52 Mb, accounting for the majority (77.83%), almost all of which were distributed around the proximal telomeres of 18 chromosomes (Fig. 3a). Among these, the longest unique sequence is 17 Mb and is located on chromosome 17. We mapped the three kinds of sequencing data (HiFi, ONT ultra-long and paired-end reads) from the assembled cashmere goat onto its assembly, and found that these unique sequences were covered almost as well as neighbouring regions (Fig. 3b). This supports the accuracy of the assembled sequences.
Data Records
The raw data, including PacBio, ONT, Hi-C, paired-end sequencing, and RNA-seq data, have been deposited into the NCBI Sequence Read Archive (SRA) under accession codes SRR28823973 - SRR28823996 and SRR30599194 - SRR30599199 (ref. 56). The genome assembly has been deposited in the GenBank database under accession number GCA_040822015.1 (ref. 57). The genome sequences and annotation files (i.e., the GFF file and the FASTA files recording coding sequences and protein sequences) of the cashmere goat are available at Figshare (https://doi.org/10.6084/m9.figshare.25697928.v1)58.
Code availability
All commands and pipelines used in data processing were executed according to the manual and protocols of the corresponding bioinformatics software.
References
Zeder, M. A. & Hesse, B. The initial domestication of goats (Capra hircus) in the Zagros mountains 10,000 years ago. Science 287, 2254–7 (2000).
Daly, K. G. et al. Ancient goat genomes reveal mosaic domestication in the Fertile Crescent. Science 361, 85–88 (2018).
Zheng, Z. et al. The origin of domestication genes in goats. Sci Adv 6, eaaz5216 (2020).
Hatziminaoglou, Y. & Boyazoglu, J. The goat in ancient civilisations: from the Fertile Crescent to the Aegean Sea. Small Ruminant Research 51, 123–129 (2004).
MacHugh, D. E. & Bradley, D. G. Livestock genetic origins: goats buck the trend. Proc Natl Acad Sci USA 98, 5382–4 (2001).
Gong, G. et al. Identification of Genes Related to Hair Follicle Cycle Development in Inner Mongolia Cashmere Goat by WGCNA. Front Vet Sci 9, 894380 (2022).
Li, R. et al. A near complete genome for goat genetic and genomic research. Genet Sel Evol 53, 74 (2021).
Dong, Y. et al. Sequencing and automated whole-genome optical mapping of the genome of a domestic goat (Capra hircus). Nat Biotechnol 31, 135–41 (2013).
Du, X. et al. An update of the goat genome assembly using dense radiation hybrid maps allows detailed analysis of evolutionary rearrangements in Bovidae. BMC Genomics 15, 625 (2014).
Bickhart, D. M. et al. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet 49, 643–650 (2017).
Siddiki, A. Z. et al. The genome of the Black Bengal goat (Capra hircus). BMC Res Notes 12, 362 (2019).
Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 (2022).
Tian, S. et al. Comparative analyses of bat genomes identify distinct evolution of immunity in Old World fruit bats. Sci Adv 9, eadd0141 (2023).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Deorowicz, S., Kokot, M., Grabowski, S. & Debudaj-Grabysz, A. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31, 1569–76 (2015).
Sun, H., Ding, J., Piednoel, M. & Schneeberger, K. findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics 34, 550–557 (2018).
Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255 (2020).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357–9 (2012).
Wingett, S. et al. HiCUP: pipeline for mapping and processing Hi-C data. F1000Res 4, 1310 (2015).
Li, D. et al. Population genomics identifies patterns of genetic diversity and selection in chicken. BMC Genomics 20, 263 (2019).
Marcais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol 14, e1005944 (2018).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 170–175 (2021).
Myers, E. W. The fragment assembly string graph. Bioinformatics 21(Suppl 2), ii79–85 (2005).
Zhang, X., Zhang, S., Zhao, Q., Ming, R. & Tang, H. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat Plants 5, 833–845 (2019).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245 (2020).
Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–2 (2015).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–95 (2010).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res 35, W265–8 (2007).
Edgar, R. C. & Myers, E. W. PILER: identification and classification of genomic repeats. Bioinformatics 21(Suppl 1), i152–8 (2005).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21(Suppl 1), i351–8 (2005).
Smit, A. & Hubley, R. RepeatModeler Open-1.0. Available from http://www.repeatmasker.org (2008).
Smit, A., Hubley, R. & Green, P. RepeatMasker Open-4.0. 2013–2015. (2015).
Mount, D. W. Using the basic local alignment search tool (BLAST). CSH Protoc 2007, pdb top17 (2007).
Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res 14, 988–95 (2004).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644–52 (2011).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31, 5654–5666 (2003).
Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14, R36 (2013).
Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7, 562–78 (2012).
Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19 (Suppl 2), ii215–25 (2003).
Guigo, R. Assembling genes from predicted exons in linear time with dynamic programming. J Comput Biol 5, 681–702 (1998).
Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J Mol Biol 268, 78–94 (1997).
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–9 (2004).
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
UniProt Consortium, T. UniProt: the universal protein knowledgebase. Nucleic Acids Res 46, 2699 (2018).
Kanehisa, M. et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res 42, D199–205 (2014).
Quevillon, E. et al. InterProScan: protein domains identifier. Nucleic Acids Res 33, W116–20 (2005).
Schattner, P., Brooks, A. N. & Lowe, T. M. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res 33, W686–9 (2005).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J Mol Biol 215, 403–10 (1990).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–5 (2013).
Shao, Y. et al. Phylogenomic analyses provide insights into primate evolution. Science 380, 913–924 (2023).
Jebb, D. et al. Six reference-quality genomes reveal evolution of bat adaptations. Nature 583, 578–584 (2020).
Peng, C. et al. Large-scale snake genome analyses provide insights into vertebrate development. Cell 186, 2959–2976.e22 (2023).
Editorial, N.B. A reference standard for genome biology. Nat Biotechnol 36, 1121 (2018).
Harris, R. S. Improved pairwise alignment of genomic DNA. (2007).
Wang, Z. et al. NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRP504456 (2024).
Wang, Z. et al. GenBank. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1104404 (2024).
Wang, Z. et al. Genome annotated files for the reference genome of cashmere goat. Figshare. https://doi.org/10.6084/m9.figshare.25697928.v1 (2024).
Acknowledgements
This work was financially supported by the National Key Research and Development Program of China (2022YFE0113300, 2022YFD1300204, 2022YFD1300201), the Central Government Guides Local Science and Technology Development Fund Projects (2022ZY0211), the Inner Mongolia Agricultural University Outstanding Youth Science Fund Cultivation Project (BR230304), the China Agriculture Research System of MOF and MARA (No. CARS-39), the Scientific and Technological Program of Inner Mongolia Autonomous Region (2023KYPT0021), the Key Discipline Key Laboratory Project (NNDWTCS-2023059), the Higher Education Reform and Development Project (NNDWTCS-2023058), and the Beijing Nova Program (Z211100002121022 and 20230484446).
Author information
Contributions
Zhiying Wang, Qi Lv and Su Rui: conceived the research project. Wenze Li and Wanlong Huang: performed the data analyses and wrote the manuscript. Gao Gong and Xiaochun Yan: drew the figures in this manuscript. Baichuan Liu, Oljibilig Chen and Na Wang: provided the samples. Yanjun Zhang, Ruijun Wang and Jinquan Li: assisted in developing the concept of the manuscript. Shilin Tian: checked the experimental design, the data analysis results, and the writing of the paper. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, Z., Lv, Q., Li, W. et al. Chromosome-level genome assembly of the cashmere goat. Sci Data 11, 1107 (2024). https://doi.org/10.1038/s41597-024-03932-7