Background & Summary

The pig (Sus scrofa) is a crucial livestock species that supplies staple protein to humans and serves as an important biomedical model owing to its anatomical and physiological similarities to humans. Belonging to the Suidae family, S. scrofa (wild boars and domestic pigs) is the only species that has spread across multiple continents1 and was domesticated by humans 9-10 thousand years ago (kya)2. The Huai pig is an important Chinese domestic breed recorded in the Compendium of Materia Medica. It is an ancient breed that has been prevalent in northern Jiangsu Province, China, for about 2 thousand years3. Huai pigs are well documented for their high meat quality, redder meat color, high forage tolerance, and lower growth rate compared with European domestic pigs4,5,6. A series of genetic studies have been conducted to dissect the characteristics of Huai pigs at the molecular level. For example, a transcriptome study revealed significant differences in meat quality and muscle fiber content between the muscles of Huai pigs and Duroc pigs and identified related candidate genes7.

Genomic data are a powerful tool for explaining the characteristics of distinct pig breeds. The recent pig reference genome, Sscrofa11.1, has significantly contributed to our understanding of the genetic basis of distinct phenotypes and of the evolutionary processes involved in porcine domestication8. To address the limited diversity in the reference genome9, several studies have assembled pig genomes of different breeds, including the Meishan pig10, Ningxiang pig11, and others12,13. However, a high-quality genome assembly of the Huai pig is still lacking, and there is a strong demand for a chromosome-level genome for this breed.

In this study, we combined PacBio long reads, Illumina short reads, and Hi-C data to assemble the first chromosome-level Huai pig genome (Table 1). The genome size of the Huai pig was estimated to be approximately 2.56 Gb according to the k-mer analysis of 197.78 Gb (79.11×) Illumina reads (Fig. 1b). The final genome assembly had a size of 2,533,275,462 bp, comprising 2,044 contigs with an N50 size of 11.37 Mb. After chromosome-level anchoring, 2.43 Gb (96.05%) of the assembled contigs were anchored onto 20 chromosomes (Fig. 1c), with a scaffold N50 of 138.92 Mb (Table 2). In addition, we annotated 23,389 protein-coding genes in this assembly, with a mean of 8.70 exons per gene (Fig. 2a, Table 3). Four types of non-coding RNAs, namely transfer RNAs (tRNAs), ribosomal RNAs (rRNAs), microRNAs (miRNAs), and small nuclear RNAs (snRNAs), were also identified in the Huai pig assembly (Fig. 2b). Repetitive elements were annotated as well, and 45.87% of the assembly (about 1.17 Gb) was classified as repetitive sequence (Table 3). Among all repeat elements, long interspersed nuclear elements (LINEs) were the most abundant, accounting for 20.67% of the entire genome (Figure S1). Our research offers a versatile resource applicable to pig breeding and a foundation for the future exploration of the genetic mechanisms of porcine traits.

Table 1 The detailed information on sequencing data of Huai pig.
Fig. 1

The genome assembly of Huai pig. (a) Workflow for the genome assembly of Huai pig. (b) The frequency distribution of k-mers for the Huai pig genome (k = 17). (c) Hi-C contact map of the chromosome-level Huai pig assembly at 500-kb resolution. The color gradient in the accompanying bar represents the contact density, which transitions from red (high density) to green (low density) within the plot. (d) QV scores of chromosomes in the Huai pig assembly.

Table 2 Statistics of the Huai pig assembly and the reference genome assembly of the pig (Sscrofa11.1).
Fig. 2

Genome annotation of Huai pig assembly. (a) Distribution of features in the Huai pig genome. From the outer to the inner circle, the tracks represent the GC content, number of transposable elements, and number of genes in 500-kb nonoverlapping windows. (b) Statistics for the annotated non-coding RNAs of the Huai pig assembly. (c) Comparison of gene features between the Huai pig assembly and Sscrofa11.1. Gene features include the lengths of genes, CDS, exons, and introns. (d) Sequence divergence of repetitive elements in the Huai pig assembly.

Table 3 Statistics of the Huai pig genome annotation.

Methods

Sample collection

A male Huai pig from Nanjing, Jiangsu Province, China (31.5267°N, 120.5875°E), was sampled for de novo assembly. Seven tissues from this individual (heart, liver, spleen, lung, kidney, muscle, and adipose) were collected, immediately frozen in liquid nitrogen, and stored at −80 °C until RNA extraction. Blood samples were collected for DNA extraction. All animal experiments were performed in accordance with the ethical regulations of the Institutional Animal Care and Use Committee (IACUC) at the China Agricultural University (Beijing, People’s Republic of China; Approval No. AW60604202-1-1).

DNA isolation and sequencing for genomes

Genomic DNA was extracted from whole blood using the DNeasy Blood & Tissue Kit (QIAGEN, Hilden, Germany). For long-read sequencing, four SMRTbell libraries were constructed using a Pacific Biosciences SMRTbell Template Prep Kit (Pacific Biosciences, Menlo Park, California, USA). Libraries were evaluated using an Agilent 4200 Bioanalyzer (Agilent Technologies, Santa Clara, California, USA). After size selection, the constructed libraries were sequenced on a Pacific Biosciences Sequel II platform (Pacific Biosciences, Menlo Park, California, USA). For short-read sequencing, a paired-end library with an insert size of ~300 bp was constructed using the TruSeq Nano DNA Sample Preparation Kit (Illumina, San Diego, California, USA). In total, 197.78 Gb of 150-bp paired-end reads were generated using an Illumina HiSeq 2000 platform. These reads were used to estimate the genome size of the Huai pig and to polish the assembly.

Approximately 10 mL of blood collected from the same Huai pig was used for the Hi-C experiment. Blood was first crosslinked in a 2% formaldehyde solution for 15 min, and the reaction was quenched by the addition of glycine. After isolating the nuclei, the chromatin was digested with MboI. The sticky ends of the digested fragments were biotinylated, and the fragments were then diluted and randomly ligated14. Subsequently, biotin-labeled DNA fragments were sheared by sonication, followed by blunt-end repair and A-tailing. Adapters were then ligated to the DNA fragments, and polymerase chain reaction (PCR) amplification was performed to generate the Hi-C library. After quality control, the Hi-C library was sequenced on an Illumina paired-end sequencing platform with 2 × 150 bp reads.

Transcriptome sequencing

Total RNA was extracted from each tissue sample using the TRIzol-based RNA extraction kit (Invitrogen, Carlsbad, CA, USA). RNA degradation and contamination were monitored using 1% agarose gel electrophoresis. The total RNA concentration was quantified using a Qubit RNA Assay Kit on a Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, United States). RNA sequencing libraries with insert sizes ranging from 250 to 350 bp were prepared using Kapa RiboErase (Roche, Basel, Switzerland). Subsequently, all libraries were sequenced on an Illumina NovaSeq 6000 S4 platform, following the manufacturer’s instructions to obtain transcriptome profiles.

Genome size estimation and de novo assembly

Before de novo assembly, we estimated the genome size of the Huai pig using the k-mer method. Adapters and low-quality sequences (base quality [Q] values < 20) in the 197.78 Gb of Illumina paired-end reads were trimmed or removed using TrimGalore (v0.6.1)15. The resulting high-quality reads were subjected to 17-mer frequency distribution analysis using Jellyfish (v2.3.0)16. The k-mer depth distribution computed using Jellyfish exhibited a distinct peak. Subsequently, the genome size of the Huai pig was calculated using the following formula: genome size = K-num/K-depth, where K-num represents the total number of k-mers, and K-depth is the k-mer depth at the main peak of the frequency distribution.
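The formula genome size = K-num/K-depth can be illustrated with a minimal Python sketch. It is not the script used in this study: the two-column "depth count" input layout, the function names, the toy histogram, and the `min_depth` cutoff used to skip the low-depth error peak are all assumptions made for illustration.

```python
from typing import Dict

# Toy two-column histogram ("depth count"); the values are made up.
TOY_HISTOGRAM = "1 1000\n2 200\n9 50\n10 100\n11 50\n"


def parse_histogram(text: str) -> Dict[int, int]:
    """Parse 'depth count' lines into a {depth: distinct k-mer count} mapping."""
    histogram: Dict[int, int] = {}
    for line in text.splitlines():
        if line.strip():
            depth, count = line.split()
            histogram[int(depth)] = int(count)
    return histogram


def estimate_genome_size(histogram: Dict[int, int], min_depth: int = 5) -> float:
    """Return K-num / K-depth for a k-mer depth histogram.

    min_depth is a hypothetical cutoff: depths below it are ignored when
    locating the main peak, because sequencing errors inflate them.
    """
    # K-num: total k-mer occurrences, i.e. sum of depth x distinct k-mers.
    k_num = sum(depth * count for depth, count in histogram.items())
    # K-depth: the depth with the most distinct k-mers at or above the cutoff.
    candidates = {d: c for d, c in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no histogram entries at or above min_depth")
    k_depth = max(candidates, key=candidates.get)
    return k_num / k_depth


toy_size = estimate_genome_size(parse_histogram(TOY_HISTOGRAM))
```

In the toy input, the main peak lies at depth 10 and the total k-mer count is 3,400, so the sketch returns an estimate of 340 bases. Real histograms usually need a data-dependent choice of the error cutoff.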

PacBio subreads were used to perform de novo genome assembly with Falcon (v2018.03.12)17. The primary assembly was polished using Pilon (v1.23)18 with the aforementioned filtered Illumina paired-end reads. Two rounds of iterative error correction were conducted to ensure assembly accuracy, yielding highly accurate contigs. Hi-C reads at over 100× coverage were used to connect the primary contigs and construct a pseudo-chromosome-level genome. After removing adapter sequences and low-quality bases, these reads were aligned to the primary genome assembly using the aln and sampe commands of the Burrows-Wheeler Aligner (BWA v0.7.17)19. The alignment results and the contigs from the assembly were used as inputs for LACHESIS (https://github.com/shendurelab/LACHESIS)20, with the cluster number set to 20, to anchor the contigs to pseudo-chromosomes. Chromosome names in the Huai pig genome were also assigned by LACHESIS based on the alignment between the Sscrofa11.1 reference pig genome21 (S. scrofa) and the Huai pig genome, which was generated using blastn (v2.10.1+)22. Subsequently, the chromosome-level genome was manually optimized using JuiceBox (v2.20.00)23. The PacBio subreads were then corrected using LORDEC (v0.9)24 with the Illumina paired-end reads of the same sample, and the chromosome-level genome was gap-filled using TGS-GapCloser (v1.2.1)25 with the corrected PacBio long reads.
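The contiguity of the contigs and scaffolds produced by this pipeline is summarized by the N50 statistic reported for the assembly. A minimal Python sketch of the standard N50 definition follows; the function name and the toy lengths are illustrative and are not taken from the study's own scripts.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: the running sum always reaches the total")


# Toy contig lengths (made up): total 20, so N50 is reached at the second contig.
toy_n50 = n50([2, 3, 4, 5, 6])
```

Applied to the lengths of all contigs or all scaffolds in an assembly FASTA file, the same function would give the contig N50 or scaffold N50, respectively.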

Genome quality assessment

To assess the completeness and accuracy of the newly assembled Huai pig genome, we conducted the following validations. First, we mapped the whole-genome sequencing short reads of the same Huai pig against the genome using BWA to estimate the single-base accuracy of the assembly. In addition, the quality value (QV) score of each chromosome was assessed with short reads using Merqury (v1.3)26. The CEGMA (v2-2.5)27 pipeline was also run against the new assembly with the parameter “--mam”. BUSCO (v5.0.0)28, with the lineage dataset mammalia_odb10 (creation date: 2019-11-20), was employed to assess the completeness of the generated genome. Furthermore, 1,341,928 pig EST sequences were downloaded from the UCSC database29 and aligned to the Huai pig genome using Minimap2 (v2.17)30.

Repetitive landscape and genome annotation

Homology-based and de novo methods were applied to repeat annotation. Tandem Repeats Finder (TRF, v4.09)31 and RepeatModeler (v2.0.1)32 were used to generate the de novo repeat library for the Huai pig genome, which comprised tandem and interspersed repeats. This de novo repeat library, together with the Repbase33 library, was used for the homology search of repeats through RepeatMasker (v4.1.2, https://www.repeatmasker.org/).

Gene prediction was conducted on the repeat-masked genome by combining three independent approaches: ab initio prediction, homology-based prediction, and transcriptome-based prediction. For ab initio gene prediction, BRAKER2 (v2.1.6)34 and GlimmerHMM (v3.0.4)35 were used with their default parameters. For homology-based prediction, protein sequences from human (Homo sapiens)36, mouse (Mus musculus)37, cow (Bos taurus)38, sheep (Ovis aries)39, and Sscrofa11.1 were used, and the prediction was conducted with GeMoMa (v1.9)40. For transcriptome-based prediction, RNA-Seq data were aligned to the Huai pig assembly using HISAT2 (v2.2.1)41 with default parameters. StringTie (v2.1.6)42 and TransDecoder (v5.5.0; https://github.com/TransDecoder/TransDecoder) were used to assemble the transcripts and convert the candidate coding regions into gene models. In parallel, the RNA-Seq data were de novo assembled using Trinity (v2.1.1)43, and PASA (v2.5.3)44 was employed to predict the gene structure. Finally, the gene models predicted through the three aforementioned approaches were combined by EvidenceModeler (v2.1.0)45 into a non-redundant set of gene structures. Protein-coding genes were functionally annotated against six databases: GO, KEGG, KOG, Swiss-Prot, TrEMBL, and NR.

The tRNAs were predicted by tRNAscan-SE (v2.0.9)46, while the rRNA fragments were detected by barrnap (v0.9, https://github.com/tseemann/barrnap). The miRNAs and snRNAs were identified by searching the Rfam database (release 14.10) using INFERNAL (v1.1.4)47.

Genome collinearity analysis and validation of structural variants (SVs) in the Huai pig genome

To verify the quality of the Huai pig genome, six public chromosome-level pig genomes (Sscrofa11.1, USMARC48, Ningxiang49, Meishan50, Bama miniature51, and Diannan Small-ear pig52) were used to conduct the collinearity analysis. MCScanX53 was used to identify collinear blocks, and the genome collinearity graph was generated using jcvi54.

In parallel, to validate the differences between the Huai pig genome and other pig genomes, the Huai pig genome and five other pig genomes (USMARC, Ningxiang, Meishan, Bama miniature, and Diannan Small-ear pig) were aligned to the Sscrofa11.1 reference genome, and four methods were applied to identify SVs: Assemblytics (v1.2.1)55, smartie-sv56, SVMU57, and SyRI (v1.3)58. Specifically, the Assemblytics and SVMU pipelines were run on the nucmer (v4.0.0rc1)59 alignments (-c 1000 --maxgap=500). Alignment pairs were extracted from any pair of genomes based on Minimap2 to serve as inputs for SyRI. For insertions and deletions, we merged the four result sets using SURVIVOR (v1.0.7)60 with the parameters “1000 3 1 0 0 50” and retained candidate insertions and deletions supported by at least three methods. For inversions, we only considered those detected by both SyRI and SVMU.
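The "supported by at least three methods" criterion can be illustrated with a simplified Python sketch. It is not SURVIVOR's algorithm: it only mimics the intent of three of the merge parameters (a 1,000-bp maximum breakpoint distance, a minimum of three supporting callers, and a 50-bp minimum SV size) by single-linkage clustering of calls of the same type on the same chromosome. The `SVCall` record, the function name, and the example calls are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class SVCall(NamedTuple):
    caller: str   # name of the method that produced the call
    chrom: str
    pos: int      # breakpoint position on the reference
    svtype: str   # "INS" or "DEL"
    length: int


def consensus_calls(calls: List[SVCall], max_dist: int = 1000,
                    min_callers: int = 3, min_size: int = 50) -> List[Dict]:
    """Cluster nearby calls of one type and keep clusters with enough callers."""
    groups = defaultdict(list)
    for call in calls:
        if call.length >= min_size:
            groups[(call.chrom, call.svtype)].append(call)

    consensus: List[Dict] = []

    def emit(cluster: List[SVCall]) -> None:
        callers = sorted({c.caller for c in cluster})
        if len(callers) >= min_callers:
            consensus.append({"chrom": cluster[0].chrom,
                              "svtype": cluster[0].svtype,
                              "pos": cluster[0].pos,
                              "callers": callers})

    for key in sorted(groups):
        group = sorted(groups[key], key=lambda c: c.pos)
        cluster = [group[0]]
        for call in group[1:]:
            # Single linkage: extend the cluster while the gap stays small.
            if call.pos - cluster[-1].pos <= max_dist:
                cluster.append(call)
            else:
                emit(cluster)
                cluster = [call]
        emit(cluster)
    return consensus


# Hypothetical calls: only the chr1 insertion near 10 kb has three callers.
example_calls = [
    SVCall("Assemblytics", "chr1", 10000, "INS", 300),
    SVCall("smartie-sv", "chr1", 10050, "INS", 300),
    SVCall("SVMU", "chr1", 10100, "INS", 298),
    SVCall("SyRI", "chr1", 500000, "INS", 120),
    SVCall("Assemblytics", "chr2", 2000, "DEL", 80),
    SVCall("SVMU", "chr2", 2100, "DEL", 80),
]
example_consensus = consensus_calls(example_calls)
```

With these example calls, only the chr1 insertion cluster is retained; the chr2 deletion is dropped because just two callers support it.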

The 300-bp Huai pig-specific insertion identified in the ENPP5 gene was validated by PCR with amplicon(s) of 655–900 bp that spanned the gDNA flanking the insert and the breakpoints between the gDNA and the insert. Primers that hybridized to the gDNA flanking the insert were designed using Primer3 software (https://sourceforge.net/projects/primer3/). Amplification was performed using 2 × EasyTaq® PCR SuperMix (AS111-12). Each PCR reaction contained 1 μL of each primer (10 μM), 1 μL of genomic DNA (about 80 ng), 12.5 μL of 2 × EasyTaq PCR SuperMix, and 1 μL of ddH2O. Thermocycling was performed for 30 cycles with an annealing temperature of 58 °C and an extension time of one minute. PCR products of the predicted sizes were identified by agarose gel electrophoresis in different pig breeds that were homozygous or heterozygous for the insert.

Data Records

The assembled genome has been deposited at NCBI GenBank61 with the accession number JBGKAQ000000000. The raw sequencing data of this genome and the RNA-Seq data of seven tissues are available at NCBI SRA62 under the project PRJNA1147173. The genome and the raw sequencing data are also publicly accessible in the GSA database63 (https://ngdc.cncb.ac.cn/gsa/) with the accession number PRJCA024381. Additionally, files containing the protein-coding gene annotation, non-coding RNA prediction, and repeat annotation of the Huai pig have been deposited in the Figshare64 database. The datasets supporting the genome collinearity analysis and the validation of genomic variants can also be accessed in the Figshare64 database.

Technical Validation

Several methods were applied to evaluate the completeness and accuracy of the Huai pig assembly. First, assessment using Merqury26 revealed a consensus quality value (QV) of 32.86, equivalent to a base accuracy of 99.95%. Evaluation using the CEGMA software indicated that 91.13% of the 248 full-length genes in the core gene set were predicted. In addition, approximately 95.33% (8,795 of 9,226) of the single-copy orthologous genes in the “mammalia_odb10” dataset were identified in our assembled genome (Table 4), similar to the Sscrofa11.1 reference genome. Furthermore, we aligned Illumina short reads (~79.11×) from the same individual against this assembly, resulting in a mapping rate and genome coverage of 98.59% and 99.38%, respectively. Finally, 1,341,928 pig EST sequences were downloaded from the UCSC database and aligned to the Huai pig genome; 93.59% of the EST sequences (coverage rate > 90%) matched the Huai pig genome. These results indicate that the Huai pig genome assembly is of high quality. The final predicted gene set comprised 23,389 protein-coding genes, and functional analysis revealed that 92.96% of the predicted genes were annotated in at least one of the six public databases (Table 3). The gene features in the Huai pig genome also showed length distributions for coding sequences, genes, exons, and introns similar to those of Sscrofa11.1 (Fig. 2c).
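The correspondence between the reported QV and base accuracy follows from the Phred-scale definition, QV = −10 · log10(error rate). The short Python sketch below reproduces that arithmetic and the BUSCO percentage from the figures quoted above; the function names are illustrative.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy."""
    error_rate = 10 ** (-qv / 10)
    return 1.0 - error_rate


def accuracy_to_qv(accuracy: float) -> float:
    """Convert per-base accuracy (between 0 and 1, exclusive) to a Phred-scaled QV."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must lie strictly between 0 and 1")
    return -10 * math.log10(1.0 - accuracy)


# Figures quoted in the text: QV 32.86 and 8,795 of 9,226 BUSCO genes.
accuracy_percent = round(qv_to_accuracy(32.86) * 100, 2)
busco_percent = round(8795 / 9226 * 100, 2)
```

Under this definition, a QV of 32.86 corresponds to roughly one error per 1,900 bases, which rounds to the 99.95% accuracy stated above.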

Table 4 The BUSCO results of the Sscrofa11.1 and Huai pig assembly.

In addition, the Huai pig genome demonstrates strong collinearity with the Sscrofa11.1 reference genome and other public chromosome-level pig genomes. A total of 3,239 insertions and 1,400 deletions may be specific to Huai pigs (Figure S4). Notably, a 300-bp insertion was located in the first CDS of the ENPP5 gene (Fig. 3b) and validated by PCR.

Fig. 3

Genome collinearity analysis and validation of Huai pig-specific SVs. (a) Collinear blocks identified between the Huai pig genome and the genomes of other pig breeds. (b) Insertion in the ENPP5 gene. The left diagram shows the structure of the ENPP5 gene and the 300-bp insertion located in its CDS. The right diagram shows the PCR results for this insertion. The numbers 1 to 8 represent different breeds: Duroc, Landrace, Yorkshire, Bama miniature, Diannan Small-ear, Ningxiang, Meishan, and Huai pigs.