Background & Summary

Pigs (Sus scrofa), initially domesticated from wild boars independently in the Near East and China approximately 10,000 years ago1,2,3,4, have evolved into indispensable contributors to anthropogenic systems through their diverse roles. Nutritionally, they supply essential proteins, vitamins, and trace minerals critical for human dietary requirements5. Economically, industrialized swine husbandry drives agricultural economies, generating substantial revenue streams and employment opportunities5. Ecologically, swine have an exceptional capacity for sustainable nutrient cycling5. In translational medicine, the anatomical and physiological homology of pigs and humans has established pigs as pivotal preclinical models6. Culturally, pigs hold profound symbolic significance in Eastern traditions, especially in Chinese cultural practices5. European and Asian pigs exhibit different phenotypic and genomic characteristics7. The current pig reference genome, Sscrofa 11.1, was assembled from a Duroc Jersey pig; it is one of the most contiguous assemblies available and plays a pivotal role in mining germplasm-related genes, thereby serving as an indispensable foundation for strategic genetic improvement initiatives in swine populations8. However, reliance on Sscrofa 11.1 for evolutionary and population genetic analyses of Asian pig populations may result in the incomplete detection of Asian variation, compromising the comprehensive characterization of their germplasm.

The Anqing Six-end-white pig (Fig. 1), a representative native breed of the dual-purpose meat-lard type, is primarily distributed in mountainous areas including Taihu, Wangjiang, and Susong counties in China. In 2007, there were nearly 600 Anqing Six-end-white pigs; data from 2019 show that the population has reached 4,617. Two national-level conservation farms have been established to protect the breed. The breed is highly valued by local people, not only for providing meat protein and increasing farmers’ income but also for its place in local culture and history. In our prior investigations of the Anqing Six-end-white pig, several candidate genes were identified that play important roles in regulating meat quality traits, lipid metabolism, and fat deposition9,10,11,12. However, the genetic basis of these characteristics in the Anqing Six-end-white pig has not yet been fully elucidated. A high-quality reference genome of a Chinese indigenous pig breed is a powerful tool for elucidating the characteristics of native breeds. Several high-quality pig genomes have been assembled, including Huai pig13, Chenghua pig14, Bamei pig15, and others16,17. However, a high-quality genome assembly of the Anqing Six-end-white pig is still lacking, which greatly limits the elucidation of its germplasm characteristics and its protection at the genomic level.

Fig. 1 The Anqing Six-end-white pig.

In this study, we assembled the first chromosome-level Anqing Six-end-white pig genome by combining short reads, PacBio HiFi (high-fidelity) reads, and Hi-C (high-throughput chromosome conformation capture) sequencing data. The genome size and heterozygosity of the Anqing Six-end-white pig were estimated to be 2.4 Gb and 0.61%, respectively, according to 17-mer analysis of 129.62 Gb (57x) of short reads. De novo assembly with 229.34 Gb (95x) of HiFi reads yielded a 2.69 Gb genome composed of 62 contigs, with a contig N50 of 90.48 Mb and a GC content of 42.64%. The final genome assembly (2.66 Gb) was anchored to 20 chromosomes (18 autosomes plus the X and Y chromosomes) through Hi-C-assisted scaffolding, with a scaffold N50 of 143.10 Mb. There were 23 gaps in the final assembled genome. The assembly captured 38 telomeres and 20 centromeres. Repeat annotation identified 1.16 Gb of repeat sequences (approximately 43.52% of the assembled genome). A total of 20,809 protein-coding genes were identified in our assembly, which harbored 36,142 transcripts with an average of 9.48 exons per gene. Non-coding RNA annotation identified 848 miRNAs, 4,544 tRNAs, 253 rRNAs, and 2,156 snRNAs. This study establishes a robust scientific foundation for the conservation, selective breeding, and exploration of superior genetic traits in the Anqing Six-end-white pig, offering critical genomic insights to safeguard germplasm resources and enhance agricultural sustainability.

Methods

Sample collection

A one-year-old male Anqing Six-end-white pig from the national Anqing Six-end-white pig conservation farm in Anqing city, Anhui province, China (30°19′N, 116°33′E) was used for genome sequencing (short reads, PacBio HiFi, and Hi-C) and assembly. Forty-two tissue samples from three one-year-old male Anqing Six-end-white pigs were collected, rapidly frozen in liquid nitrogen, and stored at −80°C for RNA sequencing; the tissues included the heart, liver, spleen, lung, kidney, stomach, longissimus dorsi muscle, adipose, duodenum, jejunum, ileum, cecum, colon, and rectum. The pigs used in this study were healthy, and no genetic defects were observed in them or their parents. This study was conducted in accordance with the guidelines of, and was approved by, the Animal Care Committee of the Anhui Academy of Agricultural Sciences (Hefei, People’s Republic of China; Approval No. AAAS2023-4).

Sequencing

Genomic DNA was extracted from the blood sample using a standard phenol–chloroform method and assessed using a 0.5% agarose gel and Nanodrop spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA).

For short-read sequencing, the DNA was fragmented, size-selected, end-repaired, A-tailed, ligated to paired-end adaptors, PCR-amplified, and sequenced on the BGISEQ DNBseq-T7 sequencing platform at OneMore Technology Co., Ltd. (Wuhan, China). In total, 141.65 Gb of 150 bp paired-end reads were generated. Because the raw sequencing data may contain adapter sequences, low-quality bases, and undetermined bases, Fastp18 (v0.23.2) was used to filter them out, and 129.62 Gb of clean data (Table 1) were retained to estimate the genome size of the Anqing Six-end-white pig.

Table 1 Sequencing data used for the Anqing Six-end-white pig genome assembly, including sequencing type, sequencing platform, sequenced tissue, average read length, total bases, and sequencing coverage.

For SMRT sequencing library construction, genomic DNA was sheared, end-repaired, and ligated to SMRTbell adapters after exonuclease-mediated overhang removal. DNA damage was repaired prior to size selection (BluePippin system). Post-ligation exonuclease digestion eliminated the unligated fragments. Library quality was verified using Qubit quantification and Fragment Analyzer sizing. Sequencing generated 12.39 million CCS reads (229.34 Gb, Table 1).

Hi-C libraries were constructed using 10 mL of blood from a male Anqing Six-end-white pig. Chromatin was crosslinked (paraformaldehyde), digested (restriction enzyme), and ligated to preserve spatial interactions. After blunt-end repair with biotinylated dNTPs, crosslinks were reversed and the DNA was sheared (300–500 bp). Biotin-tagged fragments were enriched using streptavidin beads, followed by Illumina adapter ligation and PCR optimization. Library quality was verified by fluorometry (Qubit), fragment analysis (Bioanalyzer), and qPCR. Sequencing generated 257 Gb of raw data, which were filtered to 249 Gb of clean data (Table 1) with Fastp v0.23.2.

Total RNA was extracted from each tissue using Trizol reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer’s instructions. Quantity and purity were analysed using a Bioanalyzer 2100 and RNA 6000 Nano Labchip Kit (Agilent, Palo Alto, CA, USA). Only RNA samples with suitable electrophoresis results (28S/18S ≥ 1.0) and an RNA integrity number (RIN) ≥ 7.5 were analysed further. The RNA-seq libraries were sequenced on an Illumina NovaSeq 6000, which generated 150-bp paired-end reads. An average of 10.97 Gb of data was obtained for each tissue (Table 1). These RNA-seq data were used for whole-genome protein-coding gene prediction.

Genome size estimation and de novo genome assembly

Prior to genome assembly, key genomic features such as genome size and heterozygosity were estimated computationally using the k-mer method. The 129.62 Gb of clean data were subjected to a 17-mer frequency distribution analysis using GCE19 (v1.0.0) (Fig. 2). The genome size of the Anqing Six-end-white pig was estimated as G = K-num / K-depth, where K-num is the total number of 17-mers, K-depth is the k-mer depth, and G is the genome size. To limit confounding effects on the genome size estimate, k-mers with a depth of 1 were treated as sequencing errors, and the derived error rate was used to refine the calculation: Revised G = G × (1 − Error Rate). The genome size and heterozygosity of the Anqing Six-end-white pig were estimated to be 2.4 Gb and 0.61%, respectively.
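The arithmetic described above can be illustrated with a short Python sketch that estimates genome size from a k-mer depth histogram. This is not the GCE implementation: the function name, the dictionary input format, the use of the most frequent depth above 1 as the k-mer depth, and the definition of the error rate as the share of k-mer occurrences at depth 1 are all assumptions made for illustration.

```python
from typing import Dict


def estimate_genome_size(histogram: Dict[int, int]) -> Dict[str, float]:
    """Estimate genome size from a k-mer depth histogram.

    `histogram` maps a k-mer depth to the number of distinct k-mers
    observed at that depth (hypothetical input format).
    """
    # K-num: total number of k-mer occurrences across all depths.
    k_num = sum(depth * count for depth, count in histogram.items())

    # K-depth: assumed here to be the most frequent depth, ignoring depth 1,
    # which is dominated by sequencing errors.
    k_depth = max((d for d in histogram if d > 1), key=lambda d: histogram[d])

    raw_size = k_num / k_depth

    # Assumption: the error rate is the fraction of k-mer occurrences at depth 1.
    error_rate = histogram.get(1, 0) / k_num

    return {
        "k_depth": float(k_depth),
        "raw_size": raw_size,
        "error_rate": error_rate,
        "revised_size": raw_size * (1 - error_rate),
    }


if __name__ == "__main__":
    toy_histogram = {1: 100, 9: 10, 10: 80, 11: 10}
    print(estimate_genome_size(toy_histogram))
```

In practice, dedicated tools fit a statistical model to the whole distribution rather than relying on a single peak, so this sketch only mirrors the two formulas given in the text.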

Fig. 2 The frequency distribution of k-mers for the Anqing Six-end-white pig genome (k = 17).

De novo assembly of the Anqing Six-end-white pig genome was performed using PacBio HiFi data with HiFiasm20 (v0.19.6). The HiFiasm assembly pipeline comprises three critical phases: (1) error correction and haplotype phasing of raw reads; (2) iterative construction of assembly graphs; and (3) refinement to generate chromosome-level contigs. To eliminate heterozygosity-induced redundancy, purge_haplotigs21 (v1.0.4) was applied to filter redundant contigs by integrating read coverage patterns and pairwise sequence alignment metrics. Contaminant sequences were systematically identified and removed by alignment against the NT database (v2023.07.04, https://ftp.ncbi.nlm.nih.gov/blast/db/) using BLASTN22 (v2.11.0+). The resulting assembly was 2.69 Gb in size and composed of 62 contigs, with a contig N50 of 90.48 Mb and a GC content of 42.64% (Table 2).
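For readers unfamiliar with the summary statistics reported here, the following minimal Python sketch shows how a contig N50 and a GC content are conventionally computed. The function names are illustrative and the toy inputs are not taken from the assembly; this is not the software used to produce Table 2.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return ordered[-1]  # unreachable for positive lengths; kept for safety


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / (gc + at)


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))      # toy contig lengths
    print(gc_content("GGCCAATTNN"))  # N bases are ignored
```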

Table 2 Assembly statistics for the Anqing Six-end-white pig.

Hi-C assisted scaffolding

Following quality control of the Hi-C data, the validated reads were processed through a hierarchical assembly pipeline: (1) data integrity was assessed using HiCUP23 (v0.7.2); (2) alignment to the draft genome was performed with Juicer24 (v1.5.6); and (3) chromosome-scale scaffolding was achieved through iterative integration of 3d-DNA (v180.922, https://github.com/theaidenlab/3d-dna) and HapHiC25 (v1.0.2), with manual curation guided by JuiceBox26 (v1.11.08) visualization of contact matrices. The final genome assembly was 2.66 Gb, with a contig N50 of 90.48 Mb and a scaffold N50 of 143.10 Mb (Table 2). There were 23 gaps in the final assembled genome. The chromosomal anchoring efficiency was 98.80%. A heatmap of the interactions between the chromosomes was drawn (Fig. 3). A comparison between the Anqing Six-end-white pig assembly and Sscrofa 11.1 is shown in Table 3.
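Two of the figures reported for the scaffolded assembly, the number of gaps and the chromosomal anchoring efficiency, can be derived from sequence data as sketched below. The sketch assumes that gaps are represented as runs of N in the scaffold sequences and that anchoring efficiency is the share of assembled bases placed on chromosomes; both are common conventions but are assumptions here, and the function names are hypothetical.

```python
import re
from typing import Iterable


def count_gaps(sequence: str) -> int:
    """Count gaps, taken here to be maximal runs of N (upper or lower case).

    Note that a run of N at the very start or end of a sequence is also counted.
    """
    return len(re.findall(r"[Nn]+", sequence))


def anchoring_rate(chromosome_lengths: Iterable[int], total_length: int) -> float:
    """Return the percentage of assembled bases anchored to chromosomes."""
    if total_length <= 0:
        raise ValueError("total_length must be positive")
    return 100.0 * sum(chromosome_lengths) / total_length


if __name__ == "__main__":
    toy_scaffold = "ACGTNNNNACGTNNAC"
    print(count_gaps(toy_scaffold))        # two runs of N
    print(anchoring_rate([50, 30], 100))   # toy lengths, not the real assembly
```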

Fig. 3 The Hi-C heatmap of chromosome interactions in the Anqing Six-end-white pig at a 1 Mb resolution. The color from light to dark indicates an increase in interaction strength.

Table 3 The genome comparison of Anqing Six-end-white pig and Sscrofa 11.1.

Repetitive element identification

Repetitive sequence annotation was performed through an integrated pipeline combining three complementary strategies: (1) de novo prediction of tandem repeats using TRF27 (v4.09); (2) homology-based identification via RepeatMasker28 (v open-4.0.9) and RepeatProteinMask against the RepBase database (v20181026, http://www.girinst.org/repbase); and (3) de novo prediction, in which a custom repeat library was constructed by integrating RepeatModeler29 (v open-1.0.11) and LTR_FINDER_parallel30 (v1.0.7) and was subsequently supplied to RepeatMasker (v open-4.0.9) to systematically annotate repetitive elements across the genome. Consensus annotations were generated by merging non-redundant predictions from all approaches. A total of 1.16 Gb of repeat sequences (approximately 43.52% of the assembled genome) were identified, of which 99.8% were classified as known repeat sequences (Table 4, Fig. 4).
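Merging predictions from several tools into a non-redundant set is, at its core, an interval-union problem. The Python sketch below illustrates the idea for a single sequence, assuming repeats are given as half-open (start, end) coordinate pairs; it is a simplified illustration with hypothetical function names and does not reproduce how the cited tools combine their outputs.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open coordinates: [start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals on one sequence."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def repeat_fraction(intervals: List[Interval], sequence_length: int) -> float:
    """Return the fraction of the sequence covered by at least one repeat."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / sequence_length


if __name__ == "__main__":
    # Toy predictions from different tools on a 100 bp sequence.
    predictions = [(0, 10), (5, 15), (20, 30)]
    print(merge_intervals(predictions))
    print(repeat_fraction(predictions, 100))
```

Counting each base once, rather than summing the lengths reported by each tool, is what keeps the genome-wide repeat percentage from being inflated by overlapping predictions.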

Table 4 Detailed information on repetitive sequences in the Anqing Six-end-white pig genome.

Fig. 4 Schematic illustration of the genetic features of the Anqing Six-end-white pig genome in 500 kb non-overlapping windows. (a) GC content distribution. (b) Gene density distribution. (c) Total repeat density distribution. (d) LTR density distribution. (e) LINE density distribution. (f) DNA-TE density distribution.

Protein-coding gene prediction

Gene annotation was performed through an integrative multi-evidence framework: (1) ab initio predictions were generated using Augustus31 (v3.3.2) and Genscan32 under default parameters; (2) homology-based inference integrated cross-species protein alignments from Homo sapiens (GCF_000001405.40_GRCh38.p14), Mus musculus (GCF_000001635.27_GRCm39), Phacochoerus africanus (GCF_016906955.1_ROS_Pafr_v1), and Sus scrofa (GCF_000003025.6_Sscrofa11.1) through miniprot33 (v0.11-r234); and (3) transcriptomic evidence was incorporated through HISAT234 (v2.1.0)-guided RNA-seq alignments, followed by transcript assembly with StringTie35 (v1.3.5) and ORF prediction via TransDecoder (v5.5.0, https://github.com/TransDecoder/TransDecoder). BUSCO36 (Benchmarking Universal Single-Copy Orthologs, v5.7.1) was used for predictions based on Augustus (v3.3.2) and lineage-specific ortholog libraries (laurasiatheria_odb10). MAKER237 (v2.31.10) was used to integrate preliminary gene models into a consensus set. Final refinement was performed using HiFAP (developed in-house). In total, 20,809 protein-coding genes were identified, which harbored 36,142 transcripts with an average of 9.48 exons per gene (Table 5).

Table 5 Detailed information on the protein-coding genes of the Anqing Six-end-white pig.

Gene function annotation

Protein-coding genes were functionally analyzed using ten datasets: NR_Annotation, SwissProt_Annotation, TrEMBL_Annotation, KOG_Annotation, TF_Annotation, InterPro_Annotation, GO_Annotation, KEGG_ALL_Annotation, KEGG_KO_Annotation, and Pfam_Annotation. Functional annotation was performed through dual complementary strategies: (1) sequence similarity analysis using diamond38 blastp against TrEMBL39, SwissProt39, NR40, KOG (https://ftp.ncbi.nih.gov/pub/COG/KOG/), COG41, GO42, and KEGG43 databases, with KOBAS44 (v3.0) mapping KEGG orthologs to metabolic pathways; and (2) domain/motif profiling through InterProScan45 (v5.61–93.0) integrating InterPro46 databases to obtain protein conserved sequences, motifs, domains and other information. Hmmscan from the HMMER3 (v3.3.1) software was used to annotate conserved sequence information, such as transcription factors, Pfam, and motif based on multiple sequence alignment and the hidden Markov model47. A total of 20,638 genes (99.18% of predicted genes) were annotated (Table 6).

Table 6 The gene function annotation of Anqing Six-end-white pig assembly.

Annotation of non-coding RNAs (ncRNA)

Non-coding RNA annotation employed approaches specific to each RNA class, based on molecular features: (1) tRNAs were identified using tRNAscan-SE48; (2) rRNAs were detected via BLASTN against conserved ribosomal sequences from phylogenetically proximal species (Sus scrofa and Phacochoerus africanus); and (3) miRNAs and snRNAs were annotated with INFERNAL49 (v1.1.4) by scanning against the Rfam50 (v14.8) database. The predicted non-coding genes included 848 miRNAs, 4,544 tRNAs, 253 rRNAs, and 2,156 snRNAs in the Anqing Six-end-white pig genome (Table 7).

Table 7 The annotation of non-coding RNAs in the Anqing Six-end-white pig assembly.

Detecting telomeric and centromeric regions

Telomeric regions were systematically identified across the Anqing Six-end-white pig genome using quarTeT51 (v1.1.4), which detects the consensus telomeric repeat motif AACCCT. Of the 20 chromosomes, 18 exhibited telomeric repeats at both termini, whereas two showed telomeric repeats at a single end (Table 8). Based on the repeat annotation, regions with a high density of repeats were selected as candidate centromere inputs, and srf52 (https://github.com/lh3/srf) was used for centromere prediction. The srf software identifies multiple candidate centromeric regions and repeat types, from which the predicted centromere region was selected (Table 8). The telomeres and centromeres are visualized in Fig. 5.
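The principle behind telomere detection, searching chromosome ends for tandem copies of the AACCCT motif or its reverse complement, can be sketched in a few lines of Python. This is an illustration only and not the quarTeT algorithm: the function names, the terminal window size, and the minimum copy number are hypothetical choices.

```python
import re
from typing import Tuple

TELOMERE_MOTIF = "AACCCT"  # consensus motif named in the text


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_tandem_run(region: str, motif: str) -> int:
    """Return the largest number of consecutive motif copies in the region."""
    runs = re.findall(f"(?:{motif})+", region)
    return max((len(run) // len(motif) for run in runs), default=0)


def has_telomere(sequence: str, window: int = 1000,
                 min_copies: int = 4) -> Tuple[bool, bool]:
    """Report whether the start and end of a chromosome carry telomeric repeats.

    `window` (terminal bases inspected) and `min_copies` (tandem copies
    required) are illustrative thresholds, not published parameters.
    """
    seq = sequence.upper()
    rc_motif = reverse_complement(TELOMERE_MOTIF)

    def check(region: str) -> bool:
        copies = max(longest_tandem_run(region, TELOMERE_MOTIF),
                     longest_tandem_run(region, rc_motif))
        return copies >= min_copies

    return check(seq[:window]), check(seq[-window:])


if __name__ == "__main__":
    toy_chromosome = "AACCCT" * 5 + "ACGT" * 50 + "AGGGTT" * 5
    print(has_telomere(toy_chromosome, window=40))
```

Applied to every chromosome, a check of this kind yields a count of telomere-bearing ends comparable to the per-chromosome summary in Table 8.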

Table 8 The telomeric and centromeric regions of Anqing Six-end-white pig genome.

Fig. 5 Overview of the telomeres and centromeres. The upper part of each chromosome represents the overall length of the chromosome and the location of the assembled telomeres and centromeres. The lower part represents the distribution of the assembled contigs along the corresponding positions of the chromosome.

Data Records

The assembled genome has been deposited at NCBI GenBank under the accession number GCA_050231125.153. The PacBio HiFi, Hi-C, short-read, and RNA-seq data were submitted to the Genome Sequence Archive of the China National Center for Bioinformation (https://ngdc.cncb.ac.cn/gsa/) under the project PRJCA03772254,55. The accession numbers are CRR1702913 and CRR1702914 for Hi-C, CRR1702915 for short reads, CRR1702916 and CRR1702917 for PacBio HiFi, and CRR1702918 to CRR1702959 for RNA-seq. Additionally, files containing the protein-coding gene annotation, non-coding RNA prediction, and repeat annotation of the Anqing Six-end-white pig have been deposited in the Figshare database56 (https://doi.org/10.6084/m9.figshare.28943891).

Technical Validation

Multiple evaluation metrics were used to assess the quality and robustness of the Anqing Six-end-white pig genome assembly. First, the short reads, PacBio HiFi reads, and RNA-seq data were mapped to the genome with bwa57 (v0.7.12-r1039), minimap258 (v2.24-r1122), and HISAT2, respectively. Alignment rates were 99.14% for short reads, 99.99% for PacBio HiFi reads, and 89.43% for RNA-seq data. Second, following short-read alignment, variants (SNPs and InDels) were identified using SAMtools59 (v1.9), Picard (v1.124), and GATK60 (v4.4.0.0). The homozygous SNP rate, homozygous InDel rate, heterozygous SNP rate, and heterozygous InDel rate were 0.001%, 0.001%, 0.346%, and 0.069%, respectively. The exceptionally low homozygous variant rates indicate high assembly accuracy. Third, Merqury61 (v1.3) was employed to assess the quality value (QV) of the Anqing Six-end-white pig genome using short reads and PacBio HiFi reads. The QVs based on short reads and PacBio HiFi reads were 49.3113 and 72.6645, respectively. Fourth, BUSCO36 (v5.7.1) and Compleasm62 (v0.2.6) analyses were conducted to evaluate the completeness of our assembly using the laurasiatheria_odb10 lineage dataset. BUSCO identified 12,071 (98.67%) of the 12,234 conserved single-copy genes as complete in our assembly, of which 12,034 were single-copy and 37 were duplicated; a further 78 were fragmented (Table 9). Compleasm identified 12,211 (99.81%) of the 12,234 conserved single-copy genes as complete, of which 12,193 were single-copy and 18 were duplicated; a further 9 were fragmented (Table 10). Overall, these findings indicate the high quality of the genome assembly.
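To put the QV and completeness figures in context, the short Python sketch below converts a Phred-scaled quality value into a per-base error probability (QV = −10 · log10 of the error probability) and recomputes the completeness percentages from the reported counts. The function names are illustrative; the calculation of missing genes assumes that every gene is either complete, fragmented, or missing.

```python
from typing import Dict


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def completeness_summary(single: int, duplicated: int,
                         fragmented: int, total: int) -> Dict[str, float]:
    """Summarize ortholog completeness counts.

    Complete = single-copy + duplicated; missing is derived as the remainder
    (an assumption that the three categories partition the gene set).
    """
    complete = single + duplicated
    return {
        "complete": complete,
        "complete_pct": round(100 * complete / total, 2),
        "missing": total - complete - fragmented,
    }


if __name__ == "__main__":
    # QVs reported in the text for short reads and HiFi reads.
    for qv in (49.3113, 72.6645):
        print(f"QV {qv}: error probability {qv_to_error_rate(qv):.3e}")
    # Counts reported in the text for BUSCO and Compleasm.
    print(completeness_summary(12034, 37, 78, 12234))
    print(completeness_summary(12193, 18, 9, 12234))
```

A QV near 50 corresponds to roughly one expected error per 100 kb, which is why QV values in this range are commonly read as evidence of high base-level accuracy.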

Table 9 The BUSCO results of the Anqing Six-end-white pig assembly.
Table 10 The Compleasm results of the Anqing Six-end-white pig assembly.