Chromosome-level genome assembly of Huai pig (Sus scrofa)

Du, Heng; Lu, Shiyu; Huang, Qianqian; Zhou, Lei; Liu, Jian-Feng

doi:10.1038/s41597-024-03921-w

Data Descriptor
Open access
Published: 02 October 2024

Heng Du^1,2,
Shiyu Lu^1,2,
Qianqian Huang^1,2,
Lei Zhou^1,2 &
Jian-Feng Liu ORCID: orcid.org/0000-0002-5766-7864^1,2

Scientific Data volume 11, Article number: 1072 (2024)

Abstract

Although advances in long-read sequencing technology and genome assembly techniques have facilitated the study of genomes, little is known about the genomes of unique Chinese indigenous breeds, including the Huai pig. The Huai pig is an ancient domestic pig breed that is well documented for its redder meat color and high forage tolerance compared with European domestic pigs. In the present study, we sequenced and assembled the Huai pig genome using PacBio, Hi-C, and Illumina sequencing technologies. The final, highly contiguous chromosome-level Huai pig genome spans 2.53 Gb with a scaffold N50 of 138.92 Mb. The Benchmarking Universal Single-Copy Orthologs (BUSCO) completeness score for the assembled genome was 95.33%. In total, 23,389 protein-coding genes were annotated in the Huai pig genome, along with 45.87% repetitive sequences. Overall, this study provides new foundational resources for future genetic research on Chinese domestic pigs.

Background & Summary

The pig (Sus scrofa) is a crucial livestock species that supplies staple protein to humans and serves as an important biomedical model owing to its anatomical and physiological similarities to humans. Belonging to the Suidae family, S. scrofa (wild boars and domestic pigs) is the only species that has spread across multiple continents¹, and it was domesticated by humans approximately 9–10 thousand years ago (kya)². The Huai pig is an important Chinese domestic pig, recorded in the Compendium of Materia Medica. It is an ancient breed that has been prevalent in northern Jiangsu Province, China, for about 2,000 years³. Huai pigs are well documented for their high meat quality, redder meat color, high forage tolerance, and lower growth rate compared with European domestic pigs^4,5,6. A series of genetic studies have been conducted to dissect the characteristics of Huai pigs at the molecular level. For example, a transcriptome study revealed significant differences in meat quality and muscle fiber content between the muscles of Huai pigs and Duroc pigs and identified related candidate genes⁷.

Genomic data are a powerful tool for explaining the characteristics of distinct pig breeds. The recent pig reference genome, Sscrofa11.1, has significantly contributed to our understanding of the genetic basis of distinct phenotypes and the evolutionary processes involved in porcine domestication⁸. To address the limited diversity in the reference genome⁹, several studies have assembled pig genomes of different breeds, including the Meishan pig¹⁰, Ningxiang pig¹¹, and others^12,13. However, a high-quality genome assembly of the Huai pig is still lacking, and there is a strong demand for a chromosome-level genome for this breed.

In this study, we assembled the first chromosome-level Huai pig genome using PacBio long reads, Illumina short reads, and Hi-C data (Table 1). The genome size of the Huai pig was estimated to be approximately 2.56 Gb according to the k-mer analysis of 197.78 Gb (79.11×) of Illumina reads (Fig. 1b). The final genome assembly had a size of 2,533,275,462 bp, comprising 2,044 contigs with an N50 size of 11.37 Mb. After chromosome-level anchoring, 2.43 Gb (96.05%) of the assembled contigs were anchored onto 20 chromosomes (Fig. 1c), with a scaffold N50 of 138.92 Mb (Table 2). In addition, we annotated 23,389 protein-coding genes in this assembly with a mean of 8.70 exons per gene (Fig. 2a, Table 3). Four types of non-coding RNAs, namely transfer RNAs (tRNAs), ribosomal RNAs (rRNAs), microRNAs (miRNAs), and small nuclear RNAs (snRNAs), were also identified in the Huai pig assembly (Fig. 2b). The repetitive elements in the Huai pig assembly were annotated as well, and 45.87% of the assembly (about 1.17 Gb) was classified as repetitive sequence (Table 3). Among all repeat elements, long interspersed nuclear elements (LINEs) were the most abundant, accounting for 20.67% of the entire genome (Figure S1). Our research offers a versatile resource applicable to pig breeding and a foundation for the future exploration of the genetic mechanisms of porcine traits.

Table 1 The detailed information on sequencing data of Huai pig.

Table 2 Statistics of the Huai pig assembly and the reference genome assembly of the pig (Sscrofa11.1).

Table 3 Statistics of the Huai pig genome annotation.

Methods

Sample collection

A male Huai pig from Nanjing, Jiangsu Province, China (31.5267°N, 120.5875°E), was selected for de novo assembly. Seven tissues (heart, liver, spleen, lung, kidney, muscle, and adipose) were collected from this individual, immediately frozen in liquid nitrogen, and stored at −80 °C until RNA extraction. Blood samples were collected for DNA extraction. All animal experiments were performed in accordance with the ethical regulations of the Institutional Animal Care and Use Committee (IACUC) at the China Agricultural University (Beijing, People’s Republic of China; Approval No. AW60604202-1-1).

DNA isolation and sequencing for genomes

Genomic DNA was extracted from whole blood using the DNeasy Blood & Tissue Kit (QIAGEN, Hilden, Germany). For long-read sequencing, four SMRTbell libraries were constructed using a Pacific Biosciences SMRTbell Template Prep Kit (Pacific Biosciences, Menlo Park, California, USA). Libraries were evaluated using an Agilent 4200 Bioanalyzer (Agilent Technologies, Santa Clara, California, USA). After size selection, the constructed libraries were sequenced on a Pacific Biosciences Sequel II platform (Pacific Biosciences, Menlo Park, California, USA). For short-read sequencing, a paired-end library with an insert size of ~300 bp was constructed using the TruSeq Nano DNA Sample Preparation Kit (Illumina, San Diego, California, USA). In total, 197.78 Gb of 150 bp paired-end reads were generated on an Illumina HiSeq 2000 platform. These reads were used to estimate the genome size of the Huai pig and to refine the assembly.

Approximately 10 mL of blood collected from the same Huai pig was used for the Hi-C experiment. The blood was first crosslinked in a 2% formaldehyde solution for 15 min, and the reaction was halted by the addition of glycine. After the nuclei were isolated, the chromatin was digested with MboI. The sticky ends of the digested fragments were biotinylated, and the fragments were then diluted and randomly ligated¹⁴. Subsequently, the biotin-labeled DNA fragments were sheared by ultrasound, followed by blunt-end repair and A-tailing. Adapters were then ligated to the DNA fragments, and polymerase chain reaction (PCR) amplification was performed to construct the Hi-C library. After quality control, the Hi-C library was sequenced on an Illumina paired-end sequencing platform with 2 × 150 bp reads.

Transcriptome sequencing

Total RNA was extracted from each tissue sample using the TRIzol-based RNA extraction kit (Invitrogen, Carlsbad, CA, USA). RNA degradation and contamination were monitored using 1% agarose gel electrophoresis. The total RNA concentration was quantified using a Qubit RNA Assay Kit on a Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, United States). RNA sequencing libraries with insert sizes ranging from 250 to 350 bp were prepared using Kapa RiboErase (Roche, Basel, Switzerland). Subsequently, all libraries were sequenced on an Illumina NovaSeq 6000 S4 platform, following the manufacturer’s instructions to obtain transcriptome profiles.

Genome size estimation and de novo assembly

Before de novo assembly, we estimated the genome size of Huai pigs using the k-mer method. Adapters and low-quality reads (base quality [Q] values < 20) in the 197.78 Gb Illumina paired-end reads were removed and trimmed using TrimGalore (v0.6.1)¹⁵. These high-quality reads were subjected to 17-mer frequency distribution analysis using Jellyfish (v2.3.0)¹⁶. The k-mer depth distribution computed using Jellyfish exhibited an explicit peak depth. Subsequently, the genome size of Huai pigs was calculated using the following formula: genome size = K-num/K-depth, where K-num represents the total number of k-mers, and K-depth corresponds to the highest k-mer frequency.
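The formula above can be illustrated with a short Python sketch. It assumes a two-column k-mer histogram (depth, then the number of distinct k-mers observed at that depth), which is the layout we understand Jellyfish histogram output to follow. The function names, the `min_depth` cutoff used to skip the low-depth error peak, and the toy numbers are illustrative assumptions and are not taken from the authors' pipeline.

```python
from typing import Iterable, List, Tuple


def parse_histogram(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Parse 'depth count' pairs, one per line, skipping blank or malformed lines."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        pairs.append((int(fields[0]), int(fields[1])))
    return pairs


def estimate_genome_size(histogram: List[Tuple[int, int]],
                         min_depth: int = 5) -> Tuple[int, float]:
    """Return (K-depth, genome size), with genome size = K-num / K-depth.

    K-num is the total number of k-mers (sum of depth * count).
    K-depth is the depth with the highest count, searched only at
    depths >= min_depth. min_depth is a hypothetical cutoff chosen here
    so that the error peak at very low depth is not mistaken for the
    main peak; the source does not state such a cutoff.
    """
    k_num = sum(depth * count for depth, count in histogram)
    candidates = [(d, c) for d, c in histogram if d >= min_depth]
    if not candidates:
        raise ValueError("no histogram bins at or above min_depth")
    k_depth = max(candidates, key=lambda pair: pair[1])[0]
    return k_depth, k_num / k_depth


if __name__ == "__main__":
    # Toy histogram, not real data: an error peak at depth 1 and a main peak at depth 20.
    toy = parse_histogram(["1 1000", "2 200", "10 50", "20 300", "30 50"])
    peak, size = estimate_genome_size(toy)
    print(f"peak depth = {peak}, estimated genome size = {size} bp")
```

On real data the histogram would be read from a file, and the peak would normally be checked against a plot of the depth distribution before the estimate is trusted.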

PacBio subreads were used for de novo genome assembly with Falcon (v2018.03.12)¹⁷. The primary assembly was polished using Pilon (v1.23)¹⁸ with the filtered Illumina paired-end reads described above, and two rounds of iterative error correction were conducted to ensure assembly accuracy, yielding highly accurate contigs. Over 100× Hi-C reads were then used to connect the primary contigs and construct a pseudo-chromosome-level genome. After adapter sequences and low-quality bases were removed, these reads were aligned to the primary genome assembly using the aln and sampe commands of the Burrows-Wheeler Aligner (BWA v0.7.17)¹⁹. The alignment results and the assembled contigs were used as inputs for LACHESIS (https://github.com/shendurelab/LACHESIS)²⁰, with the cluster number set to 20, to anchor the contigs to pseudo-chromosomes. The chromosome names of the Huai pig genome were also determined by LACHESIS based on the alignment between the Sscrofa11.1²¹ reference pig genome (S. scrofa) and the Huai pig genome, which was produced with blastn (v2.10.1+)²². Subsequently, the chromosome-level genome was manually optimized using JuiceBox (v2.20.00)²³. The PacBio subreads were then corrected using LoRDEC (v0.9)²⁴ with the Illumina paired-end reads of the same sample, and the chromosome-level genome was gap-filled using TGS-GapCloser (v1.2.1)²⁵ with the corrected PacBio long reads.
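Contiguity of an assembly like this one is summarized by the contig and scaffold N50. As a minimal sketch of how that statistic is usually defined (the length of the sequence at which the cumulative length, counted from the longest sequence downward, first reaches half of the total), the following Python function may help; the function name and the toy lengths are illustrative and do not come from the study.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a set of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches at least
    half of the summed length of all sequences.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Toy contig lengths in bp, not real data.
    print(n50([2, 3, 4, 5, 6]))
```

In practice the lengths would be read from the assembly FASTA (or its index) rather than typed in by hand.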

Genome quality assessment

To assess the completeness and accuracy of the newly assembled Huai pig genome, we conducted the following validations. First, we mapped the whole-genome sequencing short reads of the same Huai pig to the genome using BWA to estimate the single-base accuracy of the assembly. In addition, the quality value (QV) score of each chromosome was assessed with the short reads using Merqury (v1.3)²⁶. The CEGMA (v2-2.5)²⁷ pipeline was also run against the new assembly with the parameter “--mam”. BUSCO (v5.0.0)²⁸, with the lineage dataset mammalia_odb10 (creation date: 2019-11-20), was employed to assess the completeness of the assembled genome. Furthermore, 1,341,928 pig EST sequences were downloaded from the UCSC database²⁹ and aligned to the Huai pig genome using Minimap2 (v2.17)³⁰.

Repetitive landscape and genome annotation

Homology-based and de novo methods were applied to repeat annotation. Tandem Repeats Finder (TRF, v4.09)³¹ and RepeatModeler (v2.0.1)³² were used to generate the de novo repeat library for the Huai pig genome, which comprised tandem and interspersed repeats. This de novo repeat library, together with the Repbase³³ library, was used for the homology search of repeats through RepeatMasker (v4.1.2, https://www.repeatmasker.org/).

Gene prediction was conducted on the repeat-masked genome by combining three independent approaches: ab initio prediction, homology-based prediction, and transcriptome-based prediction. For ab initio gene prediction, BRAKER2 (v2.1.6)³⁴ and GlimmerHMM (v3.0.4)³⁵ were used with their default parameters. For homology-based prediction, protein sequences from human (Homo sapiens)³⁶, mouse (Mus musculus)³⁷, cow (Bos taurus)³⁸, sheep (Ovis aries)³⁹, and Sscrofa11.1 were used, and the prediction was conducted with GeMoMa (v1.9)⁴⁰. For transcriptome-based prediction, RNA-Seq data were aligned to the Huai pig assembly with HISAT2 (v2.2.1)⁴¹ using default parameters. StringTie (v2.1.6)⁴² and TransDecoder (v5.5.0; https://github.com/TransDecoder/TransDecoder) were used to assemble the transcripts and convert the candidate coding regions into gene models. In parallel, the RNA-Seq data were de novo assembled with Trinity (v2.1.1)⁴³, and PASA (v2.5.3)⁴⁴ was employed to predict the gene structure. Finally, the gene models predicted through the three approaches were combined by EvidenceModeler (v2.1.0)⁴⁵ into a non-redundant set of gene structures. Protein-coding genes were functionally annotated using six datasets: GO_Annotation, KEGG_Annotation, KOG_Annotation, Swiss-Prot_Annotation, TrEMBL_Annotation, and NR_Annotation.

The tRNAs were predicted by tRNAscan-SE (v2.0.9)⁴⁶, while the rRNA fragments were detected by barrnap (v0.9, https://github.com/tseemann/barrnap). The miRNAs and snRNAs were identified by searching the Rfam database (release 14.10) using INFERNAL (v1.1.4)⁴⁷.

Genome collinearity analysis and validation of structural variants (SVs) in the Huai pig genome

To verify the quality of the Huai pig genome, six public chromosome-level pig genomes (Sscrofa11.1, USMARC⁴⁸, Ningxiang⁴⁹, Meishan⁵⁰, Bama miniature⁵¹, and Diannan Small-ear pig⁵²) were used for the collinearity analysis. MCScanX⁵³ was used to identify collinear blocks, and the genome collinearity graph was generated using jcvi⁵⁴.

Simultaneously, to validate the differences between the Huai pig genome and other pig genomes, the Huai pig genome and five other pig genomes (USMARC, Ningxiang, Meishan, Bama miniature, and Diannan Small-ear pig) were aligned to the Sscrofa11.1 reference genome, and four methods were applied to identify SVs: Assemblytics (v1.2.1)⁵⁵, smartie-sv⁵⁶, SVMU⁵⁷, and SyRI (v1.3)⁵⁸. Specifically, the Assemblytics and SVMU pipelines were run on nucmer (v4.0.0rc1)⁵⁹ alignments (-c 1000 --maxgap=500). Pairwise genome alignments generated with Minimap2 served as inputs for SyRI. For insertions and deletions, we merged the four result sets using SURVIVOR (v1.0.7)⁶⁰ with the parameters “1000 3 1 0 0 50” and retained candidate insertions and deletions supported by at least three methods. For inversions, we only considered the results detected by both SyRI and SVMU.
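The idea behind the "supported by at least three methods" filter can be sketched in a few lines of Python. This is a deliberately simplified illustration and not SURVIVOR's actual algorithm: calls of the same type on the same chromosome are chained into a cluster when neighbouring breakpoints lie within a maximum distance, and a cluster is kept only if enough distinct callers contribute to it. The defaults mirror our reading of the parameter string quoted above (maximum distance 1000 bp, minimum support 3, same SV type required, minimum size 50 bp); the `SVCall` class, the function name, and the toy calls are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SVCall:
    caller: str   # name of the method that produced the call
    chrom: str
    pos: int      # breakpoint position on the reference
    svtype: str   # e.g. "INS" or "DEL"
    length: int   # SV length in bp


def _supported(cluster: List[SVCall], min_callers: int) -> bool:
    """True if the cluster contains calls from at least min_callers distinct methods."""
    return len({call.caller for call in cluster}) >= min_callers


def consensus_calls(calls: Iterable[SVCall], max_dist: int = 1000,
                    min_callers: int = 3, min_len: int = 50) -> List[List[SVCall]]:
    """Greedy single-linkage clustering of calls, keeping well-supported clusters.

    Simplified illustration only: real merging tools handle strands,
    end coordinates, and size-dependent distances that are ignored here.
    """
    eligible = sorted((c for c in calls if c.length >= min_len),
                      key=lambda c: (c.chrom, c.svtype, c.pos))
    kept: List[List[SVCall]] = []
    cluster: List[SVCall] = []
    for call in eligible:
        same_group = (cluster
                      and (call.chrom, call.svtype) == (cluster[-1].chrom, cluster[-1].svtype)
                      and call.pos - cluster[-1].pos <= max_dist)
        if same_group:
            cluster.append(call)
        else:
            if _supported(cluster, min_callers):
                kept.append(cluster)
            cluster = [call]
    if _supported(cluster, min_callers):
        kept.append(cluster)
    return kept


if __name__ == "__main__":
    # Toy calls, not real data.
    toy = [
        SVCall("Assemblytics", "chr1", 1000, "INS", 300),
        SVCall("SVMU", "chr1", 1200, "INS", 310),
        SVCall("SyRI", "chr1", 1500, "INS", 295),
        SVCall("Assemblytics", "chr1", 50000, "DEL", 100),
        SVCall("smartie-sv", "chr1", 50100, "DEL", 100),
    ]
    for group in consensus_calls(toy):
        print(group[0].chrom, group[0].svtype, [c.caller for c in group])
```

In the toy input, the insertion near position 1000 is reported by three callers and survives, while the deletion near position 50000 has only two supporting callers and is dropped.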

The 300 bp Huai pig-specific insertion identified in the ENPP5 gene was validated by PCR of amplicon(s) spanning 655–900 bp of gDNA flanking the insert, including the breakpoint between the gDNA and the insert. Primers that hybridized to the gDNA flanking the insert were designed using Primer3 (https://sourceforge.net/projects/primer3/). Amplification was performed using 2 × EasyTaq® PCR SuperMix (AS111-12). Each PCR reaction contained 1 μL of each primer (10 μM), 1 μL of genomic DNA (about 80 ng), 12.5 μL of 2 × EasyTaq PCR SuperMix, and 1 μL of ddH₂O. Thermocycling was run for 30 cycles with an annealing temperature of 58 °C and an extension time of one minute. PCR products of the predicted size were identified by agarose gel electrophoresis in different pig breeds that were homozygous or heterozygous for the insert.

Data Records

The assembled genome has been deposited at NCBI GenBank under the accession number JBGKAQ000000000⁶¹. The raw sequencing data of this genome and the RNA-Seq data of seven tissues are available at NCBI SRA under the project PRJNA1147173⁶². The genome and the raw sequencing data are also publicly accessible in the GSA database (https://ngdc.cncb.ac.cn/gsa/) under the accession number PRJCA024381⁶³. Additionally, files containing the protein-coding gene annotation, non-coding RNA prediction, and repeat annotation of the Huai pig have been deposited in the Figshare⁶⁴ database. The datasets supporting the genome collinearity analysis and the validation of genomic variants can also be accessed in the Figshare⁶⁴ database.

Technical Validation

Various methods were applied to evaluate the completeness and accuracy of the Huai pig assembly. First, assessment of the Huai pig genome using Merqury²⁶ revealed a consensus quality score of 32.86, equivalent to a base accuracy of 99.95%. Evaluation using CEGMA indicated that 91.13% of the 248 full-length genes in the core gene set were predicted. In addition, approximately 95.33% (8,795 of 9,226) of the single-copy orthologous genes in the “mammalia_odb10” data set were identified in our assembled genome (Table 4), similar to the Sscrofa11.1 reference genome. Furthermore, we aligned Illumina short reads (~79.11×) from the same individual to this assembly, resulting in a mapping rate and genome coverage of 98.59% and 99.38%, respectively. Finally, 1,341,928 pig EST sequences were downloaded from the UCSC database and aligned to the Huai pig genome; 93.59% of the EST sequences (coverage rate > 90%) matched the Huai pig genome. These results indicate that the Huai pig genome assembly is of high quality. The final predicted gene set comprised 23,389 protein-coding genes, and the functional analysis revealed that 92.96% of the predicted genes were annotated in at least one of the six public databases (Table 3). The gene features in the Huai pig genome also showed length distributions for coding sequences, genes, exons, and introns similar to those of Sscrofa11.1 (Fig. 2c).
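Two of the figures above follow from simple arithmetic, which a short Python sketch makes explicit. A consensus quality value is a Phred-scaled error rate, so the base accuracy is 1 − 10^(−QV/10), and the BUSCO percentage is the ratio of identified to total orthologs. The function names are illustrative; the input numbers are the ones reported in the text.

```python
def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy (0 to 1)."""
    return 1.0 - 10 ** (-qv / 10.0)


def percent(part: int, whole: int) -> float:
    """Return part / whole as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


if __name__ == "__main__":
    # QV 32.86 as reported for the assembly.
    print(f"base accuracy: {100 * qv_to_accuracy(32.86):.2f}%")
    # 8,795 of 9,226 mammalia_odb10 orthologs as reported.
    print(f"BUSCO identified: {percent(8795, 9226):.2f}%")
```

Both values round to the percentages quoted in the paragraph (99.95% and 95.33%).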

Table 4 The BUSCO results of the Sscrofa11.1 and Huai pig assembly.

In addition, the Huai pig genome shows strong collinearity with the Sscrofa11.1 reference genome and other public chromosome-level pig genomes. A total of 3,239 insertions and 1,400 deletions may be specific to Huai pigs (Figure S4). Notably, a 300 bp insertion was located in the first CDS of the ENPP5 gene (Fig. 3b) and was validated by PCR.

Code availability

No specific code was used in this study. The data analyses adhered to the manuals and protocols offered by the creators of the corresponding bioinformatics tools, the parameter settings of which were outlined in the methods section.

References

Groenen, M. A. M. A decade of pig genome sequencing: a window on pig domestication and evolution. Genet. Sel. Evol. 48 (2016).
Frantz, L. et al. The Evolution of Suidae. Annu. Rev. Anim. Biosci. 4, 61–85 (2016).
Wang, X. et al. Genetic Evaluation and Population Structure of Jiangsu Native Pigs in China Revealed by SINE Insertion Polymorphisms. Animals 12, 1345 (2022).
Liu, H. et al. Genome-Wide Association Study and FST Analysis Reveal Four Quantitative Trait Loci and Six Candidate Genes for Meat Color in Pigs. Front. Genet. 13 (2022).
Cheng, P. Livestock Breeds of China. (Food and Agriculture Organization of the United Nations, Rome, 1985).
Yeqiu, Z. et al. Effects of rice bran source high fibre diet on growth performance and intestine function of Suhuai pigs. J. Nanjing Agric. Univ. (2016).
Li, X. et al. Transcriptomic Profiling of Meat Quality Traits of Skeletal Muscles of the Chinese Indigenous Huai Pig and Duroc Pig. Genes 14, 1548 (2023).
Warr, A. et al. An improved pig reference genome sequence to enable pig genetics and genomics research. Gigascience 9 (2020).
Sherman, R. M. et al. Assembly of a pan-genome from deep sequencing of 910 humans of African descent. Nat. Genet. 51, 30–+ (2019).
Zhou, R. et al. The Meishan pig genome reveals structural variation-mediated gene expression and phenotypic divergence underlying Asian pig domestication. Mol. Ecol. Resour. 21, 2077–2092 (2021).
Ma, H. M. et al. Long-read assembly of the Chinese indigenous Ningxiang pig genome and identification of genetic variations in fat metabolism among different breeds. Mol. Ecol. Resour. (2021).
Zhang, L. et al. Development and Genome Sequencing of a Laboratory-Inbred Miniature Pig Facilitates Study of Human Diabetic Disease. Iscience 19, 162‐+ (2019).
Zhang, Y. et al. The genome of the naturally evolved obesity-prone Ossabaw miniature pig. iScience 24 (2021).
Wang, M. et al. Evolutionary dynamics of 3D genome architecture following polyploidization in cotton. Nat. Plants 4, 90–97 (2018).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10–12 (2011).
Marcais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–70 (2011).
Chin, C. S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 13, 1050–1054 (2016).
Walker, B. J. et al. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. Plos One 9 (2014).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–+ (2013).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_000003025.6 (2017).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
Durand, N. C. et al. Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell Syst. 3, 99–101 (2016).
Salmela, L. & Rivals, E. LoRDEC: accurate and efficient long read error correction. Bioinformatics 30, 3506–14 (2014).
Xu, M. et al. TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads. GigaScience 9, giaa094 (2020).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
Parra, G., Bradnam, K., Ning, Z., Keane, T. & Korf, I. Assessing the gene space in draft genomes. Nucleic Acids Res. 37, 289–297 (2009).
Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Karolchik, D. et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32, D493–D496 (2004).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27, 573–80 (1999).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. 117, 9451–9457 (2020).
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 1–6 (2015).
Bruna, T., Hoff, K. J., Lomsadze, A., Stanke, M. & Borodovsky, M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform 3, lqaa108 (2021).
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_000001405.29 (2022).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_000001635.9 (2020).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_002263795.2 (2018).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_002742125.1 (2017).
Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data. Methods Mol. Biol. Clifton NJ 1962, 161–177 (2019).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512 (2013).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 9 (2008).
Chan, P. P. & Lowe, T. M. tRNAscan-SE: Searching for tRNA genes in genomic sequences. Methods Mol. Biol. Clifton NJ 1962, 1 (2019).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_002844635.1 (2017).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_020567905.1 (2021).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_017957985.1 (2021).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_007644095.1 (2019).
National Genomics Data Center https://ngdc.cncb.ac.cn/gwh/Assembly/1052/show (2020).
Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49 (2012).
Tang, H. et al. JCVI: A versatile toolkit for comparative genomics analysis. iMeta 3, e211 (2024).
Nattestad, M. & Schatz, M. C. Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics 32, 3021–3023 (2016).
Kronenberg, Z. N. et al. High-resolution comparative analysis of great ape genomes. Science 360 (2018).
Chakraborty, M., Emerson, J. J., Macdonald, S. J. & Long, A. D. Structural variants exhibit widespread allelic heterogeneity and shape variation in complex traits. Nat. Commun. 10, 4872 (2019).
Goel, M., Sun, H., Jiao, W.-B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 20, 1–13 (2019).
Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLOS Comput. Biol. 14, e1005944 (2018).
Jeffares, D. C. et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat. Commun. 8 (2017).
NCBI GenBank https://identifiers.org/ncbi/insdc:JBGKAQ000000000 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP526475 (2024).
National Genomics Data Center https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA024381 (2024).
Du, H. & Liu, J.-F. The chromosomal-level genome represents the gene evolution and genetic variants in the Huai pig. Figshare https://doi.org/10.6084/m9.figshare.25804891.v2 (2024).

Acknowledgements

This work was financially supported by the National Key Research and Development Program of China (2021YFD1200801), the Earmarked Fund for China Agriculture Research System (No. CARS-pig-35), the National Natural Science Foundation of China (32302708), the Science and Technology Program of Guizhou Province (Qian Kehe Support [2022] Key 032), and the 2115 Talent Development Program of China Agricultural University. We would like to thank the High-performance Computing Platform of China Agricultural University for computing support.

Author information

Authors and Affiliations

State Key Laboratory of Animal Biotech Breeding; College of Animal Science and Technology, China Agricultural University, Beijing, 100193, China
Heng Du, Shiyu Lu, Qianqian Huang, Lei Zhou & Jian-Feng Liu
Frontiers Science Center for Molecular Design Breeding (MOE), China Agricultural University, Beijing, 100193, China
Heng Du, Shiyu Lu, Qianqian Huang, Lei Zhou & Jian-Feng Liu

Contributions

J-F.L. conceived and designed the experiments. H.D. designed the analytical strategy and performed the analyses. S.L. participated in the PCR experiment and revised the manuscript. Q.H. assisted in writing the manuscript. L.Z. supervised the work and reviewed the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jian-Feng Liu.

Ethics declarations

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Du, H., Lu, S., Huang, Q. et al. Chromosome-level genome assembly of Huai pig (Sus scrofa). Sci Data 11, 1072 (2024). https://doi.org/10.1038/s41597-024-03921-w

Received: 21 May 2024
Accepted: 23 September 2024
Published: 02 October 2024
DOI: https://doi.org/10.1038/s41597-024-03921-w