Background & Summary

Miniature pigs play a critical role as a source of protein for humans and represent a promising alternative solution to the shortage of human organs for transplantation. Their relatively small body size, physiological similarities to humans and potential for genetic manipulation make them ideal candidates for xenotransplantation1,2. The breakthrough of knocking out the PERV gene in Bama miniature pigs3, along with the development of 13-gene edited miniature pigs in 20214, represents the pinnacle of genetic engineering to create suitable organ donor pigs. Since 2022, several commercialized pig-to-human heart and kidney xenotransplantations have been successfully performed in the United States5,6,7,8, marking significant progress toward clinical application. Despite these advances, which have attracted worldwide attention, concerns remain regarding the large size of commercial pigs and the cross-match compatibility between pig cells and human cells9.

A further step in the use of miniature pigs as organ donors was achieved on May 17, 2024, when a genetically engineered liver from the Diannan miniature pig (DN) was successfully transplanted into a 71-year-old Chinese patient suffering from liver cancer10. The Banna miniature inbred pig (BN), developed by inbreeding the DN, a breed native to southwest China, offers the inherent advantages of low individual variation and close anatomical and physiological resemblance to humans1,11,12. With more than 40 years of inbreeding history, the BN pigs provide a valuable model for biomedical research and xenotransplantation1,11,12,13. Nevertheless, there is a dearth of genomic studies on BN pigs, and the lack of a high-quality genome for BN hampers their widespread use as a biomedical model and makes it more difficult to fully comprehend their genetic architecture.

Recent developments in bioinformatics tools and sequencing technologies, along with decreasing sequencing costs, have significantly advanced porcine genomics research. Here, we performed a genome assembly of the BN by integrating high-throughput chromosome conformation capture (Hi-C) in conjunction with PacBio high-fidelity (HiFi) data. We obtained approximately 92 Gb of HiFi clean reads with an average length of 14.00 Kb, achieving a sequencing depth of 36.80× (based on a genome size of 2.5 Gb), and 142 Gb (56.80×) of Hi-C clean reads. This resulted in a contig-level genome of 296 contigs with a contig N50 of 94.41 Mb. After scaffolding and final assembly, we produced a high-quality chromosomal genome of 2.66 Gb with 44 scaffolds, and a scaffold N50 of 143.60 Mb (Sscrofa11.1, 138.97 Mb). The final assembled genome size of BN was comparable to published porcine reference genomes (2.5 Gb of Sscrofa11.1, Duroc) and DN (2.65 Gb), with an anchoring rate of 97.59%, and strong collinearity with the Duroc genome. Additionally, the final BN genome had a higher scaffold N90 value and longer average and maximum scaffold lengths compared to Duroc and DN. Using the BUSCO scoring method, 96.3% of the 9,226 core genes in the mammalia_odb10 dataset were completely assembled in the BN genome.

We detected 1.29 Gb of repeat sequences, which account for 48.49% of the genome, of which 99.63% were classified as known repeat elements. The annotation of the BN genome predicted a total of 19,756 protein-coding genes (PCGs), slightly fewer than those identified in Duroc (22,063 PCGs) and DN (21,447 PCGs). Evaluation using OMArk revealed that 99.58% of the annotated genes were complete, of which 93.62% showed consistent lineage placement, thus demonstrating high assembly completeness and annotation accuracy. In addition, 7207 non-coding RNAs (ncRNAs) were annotated in the BN genome. We also determined the mitochondrial (MT) genome of BN with a length of 16,711 bp and annotated 37 genes. This high-quality chromosome-level assembly of the BN genome establishes a robust foundation for understanding inbreeding and advancing biomedical and xenotransplantation research.

Methods

Sample collection and sequencing

The Ethics Committee of Yunnan Agricultural University approved the sampling and animal experimental procedures (Approval No. 202105009). A 33-day-old male fetal sample (0028A3) from the A3 family of BN was collected from the BN farm in Jinghong City, Xishuangbanna Prefecture, Yunnan Province, China. The collected tissues were rapidly frozen in liquid nitrogen and then stored at −80 °C until use. Whole genomic DNA was isolated from muscle by the standard phenol-chloroform protocol.

The extracted genomic DNA was purified, concentrated, and quantified using Nanodrop 1000 and Qubit assays before being prepared into a SMRTbell library. The library was sequenced on the PacBio Sequel® II system in continuous long read mode according to the official protocols at Novogene (Tianjin, China). Highly accurate HiFi reads were then generated from the raw sequencing data by Circular Consensus Sequencing (CCS) with the default parameters of SMRT Link v8.0 (https://github.com/PacificBiosciences/pbcommand).

The Hi-C library was prepared by cell cross-linking, tissue lysis, chromatin digestion with the enzyme MboI, proximal chromatin DNA ligation, biotin labeling, and DNA purification. The ends of sheared fragments with insert sizes ranging from 300 to 500 base pairs (bp) were repaired and specifically enriched. The Hi-C sequencing libraries were then amplified, quality controlled, and sequenced on the Illumina NovaSeq 6000 platform at Novogene (Tianjin, China). After sequencing, adapter sequences were trimmed, and low-quality paired-end reads were removed using fastp v0.22.014 with default settings.

Genome assembly

For the contig assembly of the BN genome, we obtained 92 Gb (36.80×, estimated from a genome size of 2.5 Gb) of PacBio HiFi clean reads, with a read N50 of 14.00 Kb (an average read length of 13.99 Kb and a longest read of 42.35 Kb), and 142 Gb (56.80×) of Hi-C clean reads. These data were assembled using the Hi-C integrated assembly approach of Hifiasm v0.16.1-r37515 with default settings except for “-t 128 -r 9” (Fig. 1a). The preliminary contig-level assembly contained 296 contigs with a contig N50 of 94.41 Mb, a longest contig of 204.73 Mb, and a total size of 2.66 Gb (Table 1). For scaffolding, the Hi-C clean reads were aligned to the contig-level assembly using BWA v0.7.1816 with default parameters, and the resulting SAM files were converted to BAM using SAMtools v1.1817 to obtain uniquely mapped paired-end reads. Subsequently, we used YaHS v1.2a.118 with “-e GATC” to generate a scaffold-level draft genome with an N50 of 142.98 Mb.
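The depth values quoted above follow from dividing the total clean bases by the assumed genome size. A minimal Python sketch of that arithmetic is given here for illustration only; the function name is hypothetical, and the 2.5 Gb genome size is the estimate stated in the text.

```python
def sequencing_depth(total_bases_gb: float, genome_size_gb: float) -> float:
    """Return the mean sequencing depth (fold coverage)."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


GENOME_SIZE_GB = 2.5  # genome size estimate used in the text

hifi_depth = sequencing_depth(92, GENOME_SIZE_GB)   # PacBio HiFi clean reads
hic_depth = sequencing_depth(142, GENOME_SIZE_GB)   # Hi-C clean reads

print(f"HiFi depth: {hifi_depth:.2f}x")
print(f"Hi-C depth: {hic_depth:.2f}x")
```

With the reported inputs, the two values round to 36.80× and 56.80×, matching the figures given for the HiFi and Hi-C data.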

Fig. 1

Chromosome-level genome assembly of the Banna miniature inbred pig (BN). (a) The genome assembly workflow. (b) Mitochondrial (MT) genome features (window size: 50 bp). For the circular map, the tracks from outside to inside indicate: (i) distribution of MT genes, where colors represent different blocks of features and arrows indicate the direction of gene transcription; (ii) the BLAST comparisons of BN with Duroc (Sscrofa 11.1, NC_000845.1); (iii) GC content; (iv) GC skew. (c) Chromosome Hi-C interaction heatmap of BN. Blocks represent chromosome groups and the color bar represents the intensity of interaction from orange (low) to dark red (high). (d) K-mer (21-mer) frequency distribution curve.

Table 1 Summary statistics for the BN assembly.

For the MT genome, the MT sequence (NC_000845.1) of the reference Sscrofa11.1 was retrieved from the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000003025.6/). We employed MitoHiFi v3.2.119 for MT assembly based on the draft scaffolds with default parameters (except for “-t 32 -d -o 2”), yielding a final MT genome with a length of 16,711 bp, which had 37 genes (two rRNAs, 22 tRNAs, and 13 protein-coding genes), and GC content of 39.59% (Fig. 1b). The MT genome map was visualized by CGView20.
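For readers unfamiliar with the GC tracks in Fig. 1b, the sketch below shows the conventional definitions of GC content and windowed GC skew, (G - C)/(G + C). It is an illustration only: the exact windowing applied by CGView is not described here, the use of non-overlapping 50 bp windows is an assumption, and the toy sequences and function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    acgt = g + c + seq.count("A") + seq.count("T")
    return 100.0 * (g + c) / acgt if acgt else 0.0


def gc_skew_windows(seq: str, window: int = 50) -> list[float]:
    """GC skew, (G - C) / (G + C), in consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows without any G or C are given a skew of 0.0.
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


# Hypothetical toy sequences, not taken from the BN mitochondrial genome.
print(gc_content("GGCCAATT"))
print(gc_skew_windows("GGGCAAAA", window=4))
```

Applied to the assembled MT sequence, `gc_content` would be expected to return a value close to the reported 39.59%, although small differences could arise from how ambiguous bases are handled.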

For final assembly, we employed RaGoo21 (parameters: “-t 128 -b -C”) and manual corrections to accurately anchor scaffolds to 20 chromosomes by addressing large-scale inversions and translocations. We generated a Hi-C heatmap with strong interactions within 20 chromosome-level scaffolds using HiC-Pro v3.1.022 and visualized it by EndHiC23 (Fig. 1c). Finally, we obtained a high-quality chromosome-level genome consisting of 44 scaffolds, including 20 chromosomes with an anchoring rate of 97.59%, with a size of 2.66 Gb, an N50 length of 143.60 Mb, an N95 length of 58.41 Mb, longest scaffold of 292.93 Mb, and a GC content of 42.57% (Fig. 2a–c, Tables 1 and 2). Of the 9,226 BUSCO groups, we identified 8,880 complete BUSCOs (96.3%), including 8,845 complete and single-copy BUSCOs (95.9%), and 35 complete and duplicated BUSCOs (0.4%). Fragmented BUSCOs and missing BUSCOs were 72 (0.8%) and 274 (2.9%), respectively.
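The N50 and N95 values reported here are Nx statistics: the length of the shortest sequence in the smallest set of longest sequences that together cover at least x% of the assembly. The following Python sketch illustrates that definition on hypothetical lengths; it is not the tool used to produce Table 1.

```python
def nx(lengths: list[int], x: int = 50) -> int:
    """Return the Nx statistic for a list of sequence lengths."""
    if not lengths:
        raise ValueError("no sequence lengths supplied")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    return min(lengths)  # only reached when x > 100


# Hypothetical scaffold lengths (arbitrary units), total = 30.
toy_lengths = [10, 8, 6, 4, 2]
print("N50:", nx(toy_lengths, 50))
print("N90:", nx(toy_lengths, 90))
```

In the toy example, the two longest sequences (10 + 8) are the first to reach half of the total, so the N50 is 8; four sequences are needed to reach 90%, giving an N90 of 4.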

Fig. 2

Overview of the Banna miniature inbred pig (BN) genome assembly. (a) The snail plot shows metrics of the BN genome including the total length, BUSCO completeness score, and base composition. (b) Cumulative length distribution versus cumulative count of the final genome. (c) Distribution of GC content and coverage depth of HiFi reads across the final genome.

Table 2 Genome assembly statistics of BN as compared to its close relative DN and Duroc.

Assessment of the genome assembly

First, Merqury v.1.324 was applied to estimate the consensus quality value (QV) and the k-mer completeness score. The optimal k-mer length was determined to be 21, and the 21-mer dataset was generated from the HiFi clean reads using Meryl v 1.3.024. We then ran Merqury on the 21-mer dataset and the primary assembly, and found a k-mer completeness of 98.37% and a k-mer-based QV of 71.64, demonstrating high completeness and accuracy (Fig. 1d). Second, we evaluated the BUSCO completeness using BUSCO v5.5.025,26 with the mammalia_odb10 database. The assembled contigs exhibited 96.2% completeness for conserved genes, while the final scaffolds achieved 96.3% with a total length of 2.66 Gb and a GC content of 42.57% (Fig. 2, Table 1). In the Hi-C heatmap, there was a well-organized interaction pattern within the chromosomal regions, indicating that the final scaffolds were assembled without apparent large-scale structural errors (Fig. 1c). Additionally, the final scaffold-level assembly achieved the highest BUSCO completeness, along with the longest N50, N90, N95, average, and maximum scaffold lengths, compared to the Duroc and DN assemblies (Table 1).
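The consensus QV is a Phred-scaled value, so it can be converted to an approximate per-base error probability with E = 10^(-QV/10). The short sketch below applies this standard relationship to the reported QV; the function names are illustrative and are not part of Merqury.

```python
import math


def error_rate_from_qv(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def qv_from_error_rate(error_rate: float) -> float:
    """Convert a per-base error probability back to a Phred-scaled QV."""
    if error_rate <= 0:
        raise ValueError("error rate must be positive")
    return -10 * math.log10(error_rate)


bn_error = error_rate_from_qv(71.64)  # QV reported for the BN assembly
print(f"Per-base error probability: {bn_error:.2e}")
print(f"Approx. one error every {1 / bn_error / 1e6:.1f} Mb")
```

Under this conversion, a QV of 71.64 corresponds to fewer than one consensus error per 10 Mb, which is the sense in which the assembly is described as highly accurate.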

Repeat element annotation

The repetitive elements of the BN genome were annotated by the automated Earl Grey TE annotation pipeline v4.1.027 with “-g genome -s ‘sus_scrofa’ -r ‘sus_scrofa’ -o OutDir -d yes -t 128”, configured with RepeatMasker v4.1.528 with Dfam v3.729, RepeatModeler v2.0.530, RepeatScout v1.0.631, Tandem Repeats Finder v4.0932, RECON v1.0833, LTR_retriever v2.9.434, and LTR_FINDER35. In total, 1.29 Gb (48.49%) of the final chromosome-level genome was masked as repeats. Of these, retroelements were the major type, occupying 36.98% (982.26 Mb) of the genome, with long interspersed nuclear elements (LINEs) being the most abundant at 29.98% (796.33 Mb), followed by long terminal repeats (LTRs) at 5.41% (143.68 Mb), short interspersed nuclear elements (SINEs) at 1.59% (42.24 Mb), and Penelope elements at <0.01% (14.18 Kb). In addition, DNA transposons accounted for 2.30% (61.01 Mb), rolling circle elements for 0.01% (0.29 Mb), unclassified elements for 0.37% (9.81 Mb), and other elements for 8.84% (234.79 Mb) (Fig. 3a).
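As a quick consistency check on the repeat summary, the per-class lengths quoted above can be summed and compared with the reported subtotals. The sketch below uses only the values given in the text; the dictionary names are illustrative, and 1 Mb is taken as 1,000 Kb.

```python
# Lengths in Mb as reported in the text (Penelope: 14.18 Kb = 0.01418 Mb).
retroelements_mb = {
    "LINE": 796.33,
    "LTR": 143.68,
    "SINE": 42.24,
    "Penelope": 0.01418,
}
other_classes_mb = {
    "DNA transposon": 61.01,
    "Rolling circle": 0.29,
    "Unclassified": 9.81,
    "Other": 234.79,
}

retro_total_mb = sum(retroelements_mb.values())
repeat_total_mb = retro_total_mb + sum(other_classes_mb.values())

print(f"Retroelements: {retro_total_mb:.2f} Mb")
print(f"All repeats:   {repeat_total_mb / 1000:.2f} Gb")
```

The sums reproduce the reported 982.26 Mb of retroelements and 1.29 Gb of total repeats, so the class-level figures are internally consistent.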

Fig. 3

Genome annotation of the Banna miniature inbred pig (BN) assembly. (a) Divergence plot of different families of repetitive elements for BN. (b) The OMArk score of the proteome and the statistics of protein-coding genes (PCGs) in BN, Duroc (Sscrofa 11.1), Diannan miniature pig (DN). BNa and BNb were annotated using LiftOn and Braker3 with GeMoMa, respectively. (c) Annotation statistics of non-coding RNAs (ncRNAs) in BN.

Protein-coding genes prediction

For gene prediction, we used LiftOn v1.0.236, an annotation lift-over tool well suited to genomes of the same species, with the parameters “-g ref.GFF -o genome.gff3 -copies -sc 0.95 genome ref.fasta -t 96”. We identified 23,853 PCGs based on the latest release of the Sscrofa 11.1.112 GFF3 file from Ensembl (https://www.ensembl.org/Sus_scrofa/Info/Index), with an average gene length of 44,630.63 bp (Fig. 3b). The average numbers of coding sequences (CDSs, mean length: 159.36 bp) and exons (mean length: 272.65 bp) per gene were 21.15 and 22.68, respectively. We then assessed the completeness of the predicted protein sequences using OMArk v0.3.037. The results demonstrated 99.16% completeness (n: 13,050 OMArk), consisting of 42.04% single-copy, 57.12% duplicated, and only 0.84% missing conserved genes, achieving the same level of completeness as the reference genome (Fig. 3b). The proteome consistency assessment showed high consistency, with 97.64% consistent lineage placement, 1.16% inconsistent lineage placement, 1.2% unknown, and no contamination.
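Summary statistics such as the mean gene, CDS, and exon lengths can be derived directly from the start and end columns of a GFF3 file. The sketch below is a minimal illustration of that calculation, not the script used in this study; the function name and the toy records are hypothetical, and real annotation files may use additional feature types (for example, separate types for non-coding genes).

```python
def mean_feature_length(gff3_lines, feature_type="gene"):
    """Mean length (bp) of all features of one type in GFF3-formatted lines."""
    lengths = []
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        start, end = int(cols[3]), int(cols[4])
        lengths.append(end - start + 1)  # GFF3 coordinates are 1-based, inclusive
    if not lengths:
        raise ValueError(f"no '{feature_type}' features found")
    return sum(lengths) / len(lengths)


# Hypothetical toy records, not taken from the BN annotation.
toy_gff3 = [
    "##gff-version 3",
    "chr1\ttoy\tgene\t101\t200\t.\t+\t.\tID=g1",
    "chr1\ttoy\tCDS\t101\t160\t.\t+\t0\tParent=g1",
    "chr1\ttoy\tgene\t301\t600\t.\t-\t.\tID=g2",
]
print(mean_feature_length(toy_gff3, "gene"))
print(mean_feature_length(toy_gff3, "CDS"))
```

For a real file, the same function could be called on an open file handle, since it only iterates over lines.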

To better reveal the unique genomic characteristics of the BN pig, we further predicted gene models using a combined pipeline with Braker3 v3.0.838 and Gene Model Mapper v1.939 (GeMoMa; Fig. 1a). First, we prepared RNA-seq clean reads by using fastp to process 100 raw transcriptome datasets from various tissues of Sus scrofa, which were downloaded from the NCBI database (Table S1). These reads were then aligned to the genome using HISAT2 v2.2.140 and sorted with SAMtools. Next, we used the RNA-seq data aligned to the target genome along with a self-curated protein dataset for gene prediction and identification in Braker3. The protein dataset, used for homology-based annotation, was integrated from all non-redundant vertebrate protein sequences in OrthoDB v1141, Artiodactyla proteins (accessed from NCBI on March 29, 2024), and pig reference proteins (GCF_000003025.6). Seven reference genomes were then selected: Pig (GCF_000003025.6), Human (GCF_000001405.40), Mouse (GCF_000001635.27), Cattle (GCF_002263795.3), Horse (GCF_002863925.1), Goat (GCF_001704415.2), and Dog (GCF_011100685.1). These reference genomes and their associated annotations, the Braker3 results, and all mapped RNA-seq BAM data were used to improve the annotation with GeMoMa. As a result, we annotated 19,756 protein-coding genes (PCGs) with an average length of 10.26 Kb, of which 19,271 genes had at least one untranslated region (UTR) identified (Fig. 3b, Table 1). Functional annotation using the online eggNOG-mapper v2.1.1242 revealed that 19,460 genes were successfully annotated with functional information. Moreover, OMArk evaluation showed that 99.58% of the annotated genes were complete, with 93.62% showing consistent lineage placement, underscoring the completeness of the assembly and the accuracy of the annotation.

Non-coding RNAs annotation

The microRNA (miRNA), small nuclear RNA (snRNA), small RNA (sRNA), long non-coding RNA (lncRNA), transfer RNA (tRNA) and ribosomal RNA (rRNA) genes were annotated by Infernal v1.1.43343 with the Rfam database. In total, we predicted 7207 non-coding RNAs (ncRNAs), including 911 miRNAs, 2145 snRNAs, nine sRNAs, 174 lncRNAs, 3013 tRNAs, 453 rRNAs, four frameshift elements, 21 IRESs, ten ribozymes, and 467 other ncRNAs (Fig. 3c).
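The per-class ncRNA counts can be tallied to confirm that they add up to the reported total. The following sketch uses only the counts quoted above; the dictionary name and the helper variable are illustrative.

```python
# ncRNA counts per class as reported in the text.
ncrna_counts = {
    "miRNA": 911,
    "snRNA": 2145,
    "sRNA": 9,
    "lncRNA": 174,
    "tRNA": 3013,
    "rRNA": 453,
    "frameshift element": 4,
    "IRES": 21,
    "ribozyme": 10,
    "other": 467,
}

total_ncrnas = sum(ncrna_counts.values())
print(f"Total ncRNAs: {total_ncrnas}")

# Share of each class, largest first.
for name, count in sorted(ncrna_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>18}: {count:5d} ({100 * count / total_ncrnas:.1f}%)")
```

The class counts sum to 7207, in agreement with the total stated for the BN genome, with tRNAs and snRNAs forming the two largest classes.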

Genome collinearity analysis

Genomic collinearity analyses of the CDSs were conducted between BN (contigs, draft scaffolds, final scaffolds) and the reference genome Sscrofa 11.1 using JCVI v1.4.1544 with the LiftOn annotation results (Fig. 4a–c). We observed that as the assembly of the BN genome progressed from contigs to final scaffolds, its clustering, continuity, and collinearity improved markedly, resulting in a high-quality genome at the chromosome level. Furthermore, genome synteny assessment of BN and the reference was performed using MUMmer v4.0.0 with default settings45 (Fig. 4d).

Fig. 4

Genome collinearity between the Banna miniature inbred pig (BN) and Duroc (Sscrofa 11.1). (a–c) Comparison of collinearity in coding sequences (CDSs) between the contigs, draft scaffolds, and chromosome-level genome of BN and Duroc (Sscrofa 11.1). (d) Genome collinearity between BN and Duroc. Purple and blue dots indicate localized forward and reverse alignments, respectively.

To validate the high collinearity between BN and Duroc, we used the genomes of four other pig breeds with genomic annotations (DN46, Cross-bred47, Ossabaw48, and Nanchukmacdon (NCMD)49) for further collinearity comparison (Fig. 5). The results revealed that BN showed the strongest collinearity with Duroc, followed by NCMD, Cross-bred, Ossabaw, and DN pigs, supporting the chromosome-scale quality of our final scaffold assembly. Overall, the strong collinearity found in both the CDSs and the whole genome underscored the high quality of the BN sequencing and assembly. These results support the accuracy of the genome assembly and annotation.

Fig. 5

Collinearity analysis between the Banna miniature inbred pig (BN) and five other domesticated pig breeds: DN, Cross-bred, Nanchukmacdon (NCMD), Ossabaw, and Duroc (Sscrofa 11.1).

Data Records

The PacBio clean reads (92 Gb; 36.80×) and the Hi-C clean reads (142 Gb; 56.80×) have been deposited in the Genome Sequence Archive (GSA)50 under the accession number CRA017752 with the run accession numbers CRR1232679-CRR123268351. The final chromosome assembly has been deposited in the Genome Warehouse (GWH)52 of the National Genomics Data Center (NGDC) under the BioProject number PRJCA025149 with the accession number GWHEUVB00000000.153 and NCBI GenBank under the accession JBGNFW00000000054. The annotations of the genome have been deposited in the Science Data Bank55 and FigShare56.

Technical Validation

The completeness and accuracy of the BN genome assembly and annotation were evaluated by multiple methods. With respect to the genome assembly, the final genome had a total size of 2.66 Gb, a contig N50 of 94.41 Mb, and a scaffold N50 of 143.60 Mb, which was higher than the 138.97 Mb for Duroc and 137.35 Mb for DN, indicating superior continuity, and the BUSCO analysis using the “mammalia_odb10” dataset showed 96.3% completeness (Fig. 2, Table 1). Our final genome also had the longest total length of non-N bases, the fewest scaffolds, and the highest anchoring rate (97.59%) compared with Duroc (97.34%) and DN (90.45%) (Table 2). The Hi-C contact map for our BN genome assembly revealed 20 scaffolds at the chromosome level, indicating a high accuracy of the genome assembly (Fig. 1c). To evaluate the completeness and consistency of the annotation, OMArk analysis was conducted using Artiodactyla as the ancestral clade. Compared to Duroc (99.2% completeness and 98.35% consistency) and DN (92.67% completeness and 95.11% consistency), the annotated protein-coding gene models of BN showed similar or higher completeness (99.16% and 99.58%), together with consistency of 97.64% and 93.62% in gene classification and structure, using LiftOn and Braker3 with GeMoMa annotation, respectively (Fig. 3b). This indicates that the gene content was not only complete but also of high overall quality. We utilized JCVI for CDS collinearity analysis to compare the contigs, draft scaffolds, and final assembly of BN with Duroc, demonstrating strong alignment at the chromosome level (Fig. 4a–c). We also employed MUMmer for whole-genome collinearity analysis, which supported the high accuracy of our assembly (Fig. 4d). Furthermore, collinearity analysis with four other publicly available porcine genomes with PCG annotations revealed that BN had the highest synteny and consistency with the reference genome (Fig. 5).
Collectively, these metrics demonstrate the high quality of the BN genome assembly and annotation.