Abstract
The Banna miniature inbred pig (BN) is an intensively inbred line for biomedical research and xenotransplantation due to its low individual variation and stable genetic background. Although it is originated from the Diannan miniature pig (DN), substantial genetic changes have actually occurred. However, the lack of a BN reference genome has limited studies on the complete genomic architecture and utilization as a biomedical model. Here, we present a high-quality genome for BN using PacBio HiFi and Hi-C sequencing technologies, with a total length of 2.66 Gb, a scaffold N50 of 143.60 Mb, and 97.59% of the sequences anchored to chromosomes. Its BUSCO score is 96.30%, higher than porcine reference assembly and DN. The genome contains 48.49% of repeats, 19,756 protein-coding genes, and 7,207 non-coding RNAs according to our annotation. The OMArk score shows a proteome completeness and consistency of 99.58% and 93.62%, respectively. These findings indicate that the chromosome-scale genome of BN provides a valuable resource for studying the genetic basis of inbreeding, facilitating further research and clinical applications.
Similar content being viewed by others
Background & Summary
Miniature pigs play a critical role as a source of protein for humans, and represent a promising alternative solution to the shortage of human organs for transplantation. Their relatively small body size, physiological similarities to humans and potential for genetic manipulation make them ideal candidates for xenotransplantation1,2. The breakthrough of knocking out the PERV gene in Bama miniature pigs3, along with the development of 13-gene edited miniature pigs in 20214, represents the pinnacle of genetic engineering to create suitable organ donor pigs. Since 2022, several commercialized pig-to-human heart and kidney xenotransplantations have been successfully performed in the United States5,6,7,8, marking significant progress toward clinical application. Despite the advances that have attracted worldwide attention, concerns remain here regarding the large size of commercial pigs, and the cross-match compatibility between pig cells and human cells9.
A step forward in the use of miniature pigs as organ donors has been achieved on May 17, 2024, when a genetically engineered liver from the Diannan miniature pig (DN) was successfully transplanted into a 71-year-old Chinese patient suffering from liver cancer10. The Banna miniature inbred pig (BN), developed from the inbreeding of the DN, a breed native to southwest China, offers inherent advantages for low individual variation along with its close resemblance of anatomical and physiological characteristics to humans1,11,12. With more than 40 years of inbreeding history, the BN pigs provide a valuable model for biomedical research and xenotransplantation1,11,12,13. Nevertheless, there is a dearth of genomic studies on BN pigs, and the lack of a high-quality genome for BN hampers their widespread use as a biological biomedical model and makes it more difficult to fully comprehend their genetic architecture.
Recent developments in bioinformatics tools and sequencing technologies, along with decreasing sequencing costs, have significantly advanced porcine genomics research. Here, we performed a genome assembly of the BN by integrating high-throughput chromosome conformation capture (Hi-C) in conjunction with PacBio high-fidelity (HiFi) data. We obtained approximately 92 Gb of HiFi clean reads with an average length of 14.00 Kb, achieving a sequencing depth of 36.80× (based on a genome size of 2.5 Gb), and 142 Gb (56.80×) of Hi-C clean reads. This resulted in a contig-level genome of 296 contigs with a contig N50 of 94.41 Mb. After scaffolding and final assembly, we produced a high-quality chromosomal genome of 2.66 Gb with 44 scaffolds, and a scaffold N50 of 143.60 Mb (Sscrofa11.1, 138.97 Mb). The final assembled genome size of BN was comparable to published porcine reference genomes (2.5 Gb of Sscrofa11.1, Duroc) and DN (2.65 Gb), with an anchoring rate of 97.59%, and strong collinearity with the Duroc genome. Additionally, the final BN genome had a higher scaffold N90 value and longer average and maximum scaffold lengths compared to Duroc and DN. Using the BUSCO scoring method, 96.3% of the 9,226 core genes in the mammalia_odb10 dataset were completely assembled in the BN genome.
We detected 1.29 Gb of repeat sequences which account for 48.49% of the genome, of which 99.63% were classified as known repeat elements. The annotation of the BN genome predicted a total of 19,756 protein-coding genes (PCGs), which was slightly lower than those already identified in Duroc (22,063 PCGs) and DN (21,447 PCGs). Evaluation using OMArk revealed that 99.58% of the annotated genes were complete, of which 93.62% showed consistent lineage placement, thus demonstrating high assembly completeness and annotation accuracy. In addition, 7207 non-coding RNAs (ncRNAs) were annotated in the BN genome. We also determined the mitochondrial (MT) genome of BN with a length of 16,711 bp and annotated 37 genes. This high-quality chromosome assembly of the BN genome establishes a robust cornerstone for understanding inbreeding and advancing research.
Methods
Sample collection and sequencing
The Ethics Committee of Yunnan Agricultural University approved our sampling pipeline for animal experimental procedures (Approval No. 202105009). A 33-day old male fetal sample (0028A3) from the A3 family of BN was collected from the BN farm in Jinghong City, Xishuangbanna Prefecture, Yunnan Province, China. The collected tissues were rapidly frozen in liquid nitrogen, and then stored at −80 °C until use. Whole genomic DNA was isolated from muscle by the standard phenol-chloroform protocol.
The extracted genomic DNA was purified, concentrated, and quantified using Nanodrop 1000 and Qubit assays before being prepared into the SMRTbell library. The library was sequenced on the PacBio Sequel® II systems utilizing continuous long read mode according to the official protocols at Novogene (Tianjin, China). Raw sequencing data was generated using the third-generation Circular Consensus Sequencing (CCS) default parameters of SMRT Link v8.0 (https://github.com/PacificBiosciences/pbcommand) to produce highly accurate HiFi reads.
The Hi-C library was prepared by cell cross-linking, tissue lysis, chromatin digestion with the enzyme MboI, proximal chromatin DNA ligation, biotin labeling, and DNA purification. The ends of sheared fragments with insert sizes ranging from 300 to 500 base pairs (bp) were repaired and specifically enriched. The Hi-C sequencing libraries were then amplified, quality controlled, and sequenced on the Illumina NovaSeq. 6000 platform at Novogene (Tianjin, China). After sequencing, adapter sequences were trimmed, and low-quality paired-end reads were removed using the fastp v0.22.014 with default settings.
Genome assembly
For the contig assembly of the BN genome, we obtained 92 Gb (36.80×, estimated by genome size of 2.5 Gb) of PacBio HiFi clean reads, with a read N50 of 14.00 Kb (the average read length of 13.99 kb and the longest read of 42.35 Kb), and 142 Gb (56.80×) of Hi-C clean reads. These data were assembled using the Hi-C integrated assembly approach of Hifiasm v0.16.1-r37515 with default settings except for “-t 128 -r 9” (Fig. 1a). The preliminary assembly genome at contig-level contained 296 contigs with a contig N50 of 94.41 Mb, the longest contig of 204.73 Mb, and a total size of 2.66 Gb (Table 1). For scaffolding, the Hi-C clean reads were aligned to the contig-level assembly using Bwa v0.7.1816 with default parameters, and the SAM conversion to BAM using SAMtools v1.1817, which generated uniquely mapped paired-end reads. Subsequently, we used YaHS v1.2a.118 with “-e GATC” to generate a scaffold-level draft genome with an N50 of 142.98 Mb.
Chromosome-level genome assembly of the Banna miniature inbred pig (BN). (a) The genome assembly workflow. (b) Mitochondrial (MT) genome features (window size: 50 bp). For the circular map, the tracks from outside to inside indicate: (i) distribution of MT genes, colors represent different blocks of features and arrows indicate the direction of gene transcription; (ii) the BLAST comparisons of BN with Duroc (Sscrofa 11.1, NC_000845.1); (iii) GC content; (iv) GC skew. (c) Chromosome Hi-C interaction heatmap of BN. Blocks represent chromosome group and color bar represent intensity of interaction from orange (low) to dark red (high). (d) K-mer (21-mer) frequency distribution curve.
For the MT genome, the MT sequence (NC_000845.1) of the reference Sscrofa11.1 was retrieved from the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000003025.6/). We employed MitoHiFi v3.2.119 for MT assembly based on the draft scaffolds with default parameters (except for “-t 32 -d -o 2”), yielding a final MT genome with a length of 16,711 bp, which had 37 genes (two rRNAs, 22 tRNAs, and 13 protein-coding genes), and GC content of 39.59% (Fig. 1b). The MT genome map was visualized by CGView20.
For final assembly, we employed RaGoo21 (parameters: “-t 128 -b -C”) and manual corrections to accurately anchor scaffolds to 20 chromosomes by addressing large-scale inversions and translocations. We generated a Hi-C heatmap with strong interactions within 20 chromosome-level scaffolds using HiC-Pro v3.1.022 and visualized it by EndHiC23 (Fig. 1c). Finally, we obtained a high-quality chromosome-level genome consisting of 44 scaffolds, including 20 chromosomes with an anchoring rate of 97.59%, with a size of 2.66 Gb, an N50 length of 143.60 Mb, an N95 length of 58.41 Mb, longest scaffold of 292.93 Mb, and a GC content of 42.57% (Fig. 2a–c, Tables 1 and 2). Of the 9,226 BUSCO groups, we identified 8,880 complete BUSCOs (96.3%), including 8,845 complete and single-copy BUSCOs (95.9%), and 35 complete and duplicated BUSCOs (0.4%). Fragmented BUSCOs and missing BUSCOs were 72 (0.8%) and 274 (2.9%), respectively.
Overview of the Banna miniature inbred pig (BN) genome assembly. (a) The snail plot shows metrics of the BN genome including the total length, BUSCO completeness score, and base composition. (b) Cumulative length distribution versus cumulative count of the final genome. (c) Distribution of GC content and coverage depth of HiFi reads across the final genome.
Assessment of the genome assembly
Firstly, Merqury v.1.324 was applied to estimate the consensus quality value (QV) and the k-mer completeness score. The optimal k-mer length was first determined as 21. The 21-kmer dataset was generated from the HiFi clean reads using the Meryl v 1.3.024. We then evaluated Merqury’s QV based on the 21-kmer dataset and the primary assembly, and found a k-mer completeness of 98.37 and a k-mer-based QV of 71.64, demonstrating a high completeness and accuracy (Fig. 1d). Secondly, we evaluated the BUSCO completeness using BUSCO v5.5.025,26 with the mammalia_odb10 database. The assembled contigs exhibited 96.2% completeness for conserved genes, while the final scaffolds achieved 96.3% with a total length of 2.66 Gb and a GC content of 42.57% (Fig. 2, Table 1). In the Hi-C heatmap, there was a well-organized interaction pattern within the chromosomal region, indicating that the final scaffolds were assembled without apparent large-scale structural errors (Fig. 1c). Additionally, the final assembly in scaffolds achieved the highest completeness in BUSCO assessments, along with the longest N50, N90, N95, average, and maximum scaffold lengths compared to the Duroc and DN assemblies. (Table 1).
Repeat element annotation
The repetitive elements of the BN genome were annotated by the automated Earl Grey TE annotation pipeline v4.1.027 with “-g genome -s ‘sus_scrofa’ -r ‘sus_scrofa’ -o OutDir -d yes -t 128”, configured with RepeatMasker v4.1.528 with Dfam v3.729, RepeatModeler v2.0.530, RepeatScout v1.0.631, Tandem Repeat Finder v4.0932, RECON v1.0833, LTR _retriever v2.9.434, and LTR_FINDER35. In total, 1.29 Gb (48.49%) of the final chromosome-level genome was masked as repeats. Of these, retroelements were the major type, occupying 36.98% (982.26 Mb) of the genome, with long interspersed repetitive elements (LINEs) being the most abundant at 29.98% (796.33 Mb), followed by long terminal repeats (LTRs) at 5.41% (143.68 Mb), short interspersed repetitive elements (SINEs) at 1.59% (42.24 Mb), and penelope elements at 0.00% (14.18 Kb). In addition, DNA transposons accounted for 2.30% (61.01 Mb), rolling circle elements for 0.01% (0.29 Mb), unclassified elements for 0.37% (9.81 Mb), and other elements for 8.84% (234.79 Mb) (Fig. 3a).
Genome annotation of the Banna miniature inbred pig (BN) assembly. (a) Divergence plot of different families of repetitive elements for BN. (b) The OMArk score of the proteome and the statistics of protein-coding genes (PCGs) in BN, Duroc (Sscrofa 11.1), Diannan miniature pig (DN). BNa and BNb were annotated using LiftOn and Braker3 with GeMoMa, respectively. (c) Annotation statistics of non-coding RNAs (ncRNAs) in BN.
Protein-coding genes prediction
For the gene prediction, LiftOn v1.0.236 with parameters for “-g ref.GFF -o genome.gff3 -copies -sc 0.95 genome ref.fasta -t 96” was used, which is an excellent genome annotation tool within the same species. We identified 23,853 PCGs based on the latest release of the Sscrofa 11.1.112 GFF3 file from Ensembl (https://www.ensembl.org/Sus_scrofa/Info/Index), with an average gene length of 44,630.63 bp (Fig. 3b). The average number of coding sequences (CDSs, mean length: 159.36 bp), and exons (mean length: 272.65 bp) in each gene was 21.15 and 22.68, respectively. We then assessed the predicted protein gene sequences for BUSCO completeness using OMArk v0.3.037. The results demonstrated 99.16% completeness (n: 13,050 OMArk), consisting of 42.04% single-copy BUSCOs, 57.12% duplicated BUSCOs, and only 0.84% missing BUSCOs, achieving the same level of completeness as the reference genome (Fig. 3b). The proteome consistent assessment showed high consistency with 97.64% consistent lineage placement, 1.16% inconsistent lineage placement, 1.2% unknown, and no contamination.
To better reveal the unique genomic characteristics of the BN pig, we further predicted gene models using a combined pipeline with Braker3 v3.0.838 and Gene Model Mapper v1.939 (GeMoMa; Fig. 1a). First, we prepared RNA-seq clean reads by using fastp to process 100 transcriptome raw datasets from various tissues of Sus scrofa, which were downloaded from the NCBI database (Table S1). These reads were then aligned to the genome using hisat2 v2.2.140 and sorted with SAMtools. Next, we used the RNA-seq data aligned to the target genome along with a self-curated protein dataset for gene prediction and identification in Braker3. The protein dataset, used for homology-based annotation, was integrated from all non-redundant vertebrate protein sequences in OrthoDB v1141, Artiodactyla proteins (accessed from NCBI on March 29, 2024), and pig reference proteins (GCF_000003025.6). Seven reference genomes were then selected: Pig (GCF_000003025.6), Human (GCF_000001405.40), Mouse (GCF_000001635.27), Cattle (GCF_002263795.3), Horse (GCF_002863925.1), Goat (GCF_001704415.2), and Dog (GCF_011100685.1). These reference genomes and their associated annotations, the Braker3 results, and all mapped RNA-seq BAM data were used to improve the annotation with GeMoMa. As a result, we annotated 19,756 protein-coding genes (PCGs) with an average length of 10.26 Kb, of which 19,271 genes had at least one untranslated region (UTR) identified (Fig. 3b, Table 1). Functional annotation using the online eggNOG-mapper v2.1.1242 revealed that 19,460 genes were successfully annotated with functional information. Moreover, OMArk evaluation showed that 99.58% of the annotated genes were intact, with 93.62% maintaining consistent lineage positions, underscoring the completeness of the assembly and the accuracy of the annotation.
Non-coding RNAs annotation
The microRNA (miRNA), small nucleolar RNA (snRNA), small RNA (sRNA), long non-coding RNA (lncRNA), transfer RNA (tRNA) and ribosomal RNA (rRNA) were annotated by Infernal v1.1.43343 with Rfam database. In total, we predicted 7207 non-coding RNAs (ncRNAs), including 911 miRNAs, 2145 snRNAs, nine sRNAs, 174 lncRNAs, 3013 tRNAs, 453 rRNAs, four frameshift elements, 21 IRESs, ten ribozymes, and 467 other ncRNAs (Fig. 3c).
Genome collinearity analysis
Genomic collinearity analyses of the CDSs were conducted between BN (contigs, draft scaffolds, final scaffolds) and the reference genome Sscrofa 11.1 using JCVI v1.4.1544 with the LiftOn annotation results (Fig. 4a–c). We observed that as the assembly quality of the BN genome improved, its clustering, continuity, and collinearity improved significantly, resulting in a high-quality genome at the chromosome level. Furthermore, genome synteny assessment of BN and reference was performed using the MUMmer v4.0.0 with default settings45 (Fig. 4d).
Genome collinearity between the Banna miniature inbred pig (BN) and Duroc (Sscorfa 11.1). (a–c) Comparison of collinearity in coding sequences (CDSs) between contigs, scaffolds, the chromosome-level genome of BN and Duroc (Sscorfa 11.1). (d) Genome collinearity between BN and Duroc. Purple and blue dots indicate localized forward and reverse alignments, respectively.
To validate the high collinearity between BN and Duroc, we used the genomes of four other pig breeds (including DN46, Cross-bred47, Ossabaw48, and Nanchukmacdon (NCMD)49) with genomic annotations for further collinearity comparison (Fig. 5). The results revealed that the strongest collinearity of BN with Duroc, followed by NCMD, Cross-bred, Ossabaw, and DN pigs, which indicated a high-quality chromosome-scale pig genome of our final scaffold assembly. Overall, the strong collinearity found in the CDSs, and the genome collinearity underscored the high quality of the BN sequencing and assembly. These results confirmed the accuracy of the genome assembly and annotation.
Data Records
The PacBio clean reads (92 Gb; 36.80×) and the Hi-C clean reads (142 Gb; 56.80×) have been deposited in the Genome Sequence Archive (GSA)50 under the accession number CRA017752 with the run accession numbers CRR1232679-CRR123268351. The final chromosome assembly has been deposited in the Genome Warehouse (GWH)52 of the National Genomics Data Center (NGDC) under the BioProject number PRJCA025149 with the accession number GWHEUVB00000000.153 and NCBI GenBank under the accession JBGNFW00000000054. The annotations of the genome have been deposited in the Science Data Bank55 and FigShare56.
Technical Validation
The completeness and accuracy of the BN genome assembly and annotation were evaluated by multiple methods. With respect to the completeness of the genome assembly, the BUSCO analysis using the “mammal_odb10” dataset showed that the total genome size was 2.66 Gb, with a contig N50 of 94.41 Mb, and a scaffold N50 of 143.60 Mb, which was clearly higher than the 138.97 Mb for Duroc and 137.35 Mb for DN, showcasing superior continuity and quality (Fig. 2, Table 1). Our final genome also boasted the longest total length of non-N bases, the fewest number of scaffolds, and the highest anchoring rate (97.59%) compared with Duroc (97.34%) and DN (90.45%) (Table 2). The contact map of Hi-C interaction for our BN genome assembly revealed 20 scaffolds at the chromosome level, indicating a high accuracy of the genome assembly (Fig. 1c). To evaluate the completeness and consistency of the annotation, OMArk analysis was conducted using the Artiodactyls as the ancestral clade. As expected, compared to Duroc (99.2% completeness and 98.35% consistency) and DN (92.67% completeness and 95.11% consistency), the annotated protein-coding gene models of BN were more complete (99.16% and 99.58%) and more consistent (97.64% and 93.62% in gene classification and structure) using either LiftOn or Braker3 with GeMoMa annotation (Fig. 3b). This indicates that the gene content was not only complete but also of very high overall quality. We utilized JCVI for CDSs collinearity analysis to compare the contigs, draft scaffolds, and final assembly of BN with Duroc, demonstrating strong alignment at the chromosome level (Fig. 4a–c). We also employed MUMmer for whole-genome collinearity analysis, which confirmed the high accuracy of our assembly quality (Fig. 4d). Furthermore, collinearity analysis with other four other publicly available porcine genomes with PCGs annotation revealed that BN had the highest synteny and consistency with the reference genome (Fig. 5). Collectively, these metrics demonstrated that the assembly outcomes of BN were meticulously executed, indicating superior assembly and annotation quality.
Code availability
All software and pipelines used in this study were performed according to the guidelines and protocols of the respective published bioinformatics tools. Software versions and parameters are clearly described in the Methods section. Default parameters were used unless explicitly stated. No custom code was generated for these analyses.
References
Wei, H. J. et al. Comparison of the efficiency of Banna miniature inbred pig somatic cell nuclear transfer among different donor cells. PLoS One 8, e57728, https://doi.org/10.1371/journal.pone.0057728 (2013).
Zhang, L. et al. Development and Genome Sequencing of a Laboratory-Inbred Miniature Pig Facilitates Study of Human Diabetic Disease. iScience 19, 162–176, https://doi.org/10.1016/j.isci.2019.07.025 (2019).
Niu, D. et al. Inactivation of porcine endogenous retrovirus in pigs using CRISPR-Cas9. Science 357, 1303–1307, https://doi.org/10.1126/science.aan4187 (2017).
Yue, Y. N. et al. Extensive germline genome engineering in pigs. Nat Biomed Eng 5, 134–143, https://doi.org/10.1038/s41551-020-00613-9 (2021).
Pan, W. et al. Cellular dynamics in pig-to-human kidney xenotransplantation. Med, https://doi.org/10.1016/j.medj.2024.05.003 (2024).
Schmauch, E. et al. Integrative multi-omics profiling in human decedents receiving pig heart xenografts. Nat Med 30, 1448–1460, https://doi.org/10.1038/s41591-024-02972-1 (2024).
Griffith, B. P. et al. Genetically Modified Porcine-to-Human Cardiac Xenotransplantation. N Engl J Med 387, 35–44, https://doi.org/10.1056/NEJMoa2201422 (2022).
Montgomery, R. A. et al. Results of Two Cases of Pig-to-Human Kidney Xenotransplantation. N Engl J Med 386, 1889–1898, https://doi.org/10.1056/NEJMoa2120238 (2022).
Cheung, M. D. et al. Spatiotemporal immune atlas of a clinical-grade gene-edited pig-to-human kidney xenotransplant. Nat Commun 15, 3140, https://doi.org/10.1038/s41467-024-47454-7 (2024).
Smriti, M. First pig-to-human liver transplant recipient “doing very well”. Nature 630, 18–18, https://doi.org/10.1038/d41586-024-01613-4 (2024).
Liu, Z. et al. Long- and short-read RNA sequencing from five reproductive organs of boar. Sci Data 10, 678, https://doi.org/10.1038/s41597-023-02595-0 (2023).
Wang, P. et al. Transcriptomic analysis of testis and epididymis tissues from Banna mini-pig inbred line boars with single-molecule long-read sequencing. †. Biol Reprod 108, 465–478, https://doi.org/10.1093/biolre/ioac216 (2023).
Sun, X. et al. Study of renal function matching between Banna Minipig Inbred line and human. Transplant Proc 36, 2488–9, https://doi.org/10.1016/j.transproceed.2004.07.060 (2004).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890, https://doi.org/10.1093/bioinformatics/bty560 (2018).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–51, https://doi.org/10.1093/bioinformatics/btu356 (2014).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–9, https://doi.org/10.1093/bioinformatics/btp352 (2009).
Zhou, C., McCarthy, S. A. & Durbin, R. YaHS: yet another Hi-C scaffolding tool. Bioinformatics 39 https://doi.org/10.1093/bioinformatics/btac808 (2023).
Uliano-Silva, M. et al. MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads. BMC Bioinf. 24, 288, https://doi.org/10.1186/s12859-023-05385-y (2023).
Stothard, P. & Wishart, D. S. Circular genome visualization and exploration using CGView. Bioinformatics 21, 537–9, https://doi.org/10.1093/bioinformatics/bti054 (2005).
Alonge, M. et al. RaGOO: fast and accurate reference-guided scaffolding of draft genomes. Genome Biol 20, 224, https://doi.org/10.1186/s13059-019-1829-6 (2019).
Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol 16, 259, https://doi.org/10.1186/s13059-015-0831-x (2015).
Wang, S. et al. EndHiC: assemble large contigs into chromosome-level scaffolds using the Hi-C links from contig ends. BMC Bioinf. 23, 528, https://doi.org/10.1186/s12859-022-05087-x (2022).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245, https://doi.org/10.1186/s13059-020-02134-9 (2020).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–2, https://doi.org/10.1093/bioinformatics/btv351 (2015).
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol 38, 4647–4654, https://doi.org/10.1093/molbev/msab199 (2021).
Baril, T., Galbraith, J. & Hayward, A. Earl Grey: A Fully Automated User-Friendly Transposable Element Annotation and Analysis Pipeline. Mol Biol Evol 41, https://doi.org/10.1093/molbev/msae068 (2024).
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics Chapter 4, 4.10.1–4.10.14, https://doi.org/10.1002/0471250953.bi0410s25 (2009).
Storer, J., Hubley, R., Rosen, J., Wheeler, T. J. & Smit, A. F. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob DNA 12, 2, https://doi.org/10.1186/s13100-020-00230-y (2021).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci USA 117, 9451–9457, https://doi.org/10.1073/pnas.1921046117 (2020).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21(Suppl 1), i351–8, https://doi.org/10.1093/bioinformatics/bti1018 (2005).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27, 573–80, https://doi.org/10.1093/nar/27.2.573 (1999).
Bao, Z. & Eddy, S. R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res 12, 1269–76, https://doi.org/10.1101/gr.88502 (2002).
Ou, S. & Jiang, N. LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons. Plant Physiol 176, 1410–1422, https://doi.org/10.1104/pp.17.01310 (2018).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res 35, W265–8, https://doi.org/10.1093/nar/gkm286 (2007).
Chao, K. H. et al. Combining DNA and protein alignments to improve genome annotation with LiftOn. bioRxiv, https://doi.org/10.1101/2024.05.16.593026 (2024).
Nevers, Y. et al. Quality assessment of gene repertoire annotations with OMArk. Nat Biotechnol, https://doi.org/10.1038/s41587-024-02147-w (2024).
Gabriel, L. et al. BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA. bioRxiv https://doi.org/10.1101/2023.06.10.544449 (2024).
Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data. Methods Mol Biol 1962, 161–177, https://doi.org/10.1007/978-1-4939-9173-0_9 (2019).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915, https://doi.org/10.1038/s41587-019-0201-4 (2019).
Kuznetsov, D. et al. OrthoDB v11: annotation of orthologs in the widest sampling of organismal diversity. Nucleic Acids Res 51, D445–d451, https://doi.org/10.1093/nar/gkac998 (2023).
Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol 38, 5825–5829, https://doi.org/10.1093/molbev/msab293 (2021).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–5, https://doi.org/10.1093/bioinformatics/btt509 (2013).
Tang, H., Bowers, J. E., Wang, X., Ming, R., Alam, M. & Paterson, A. H. Synteny and collinearity in plant genomes. Science 320, 486–8, https://doi.org/10.1126/science.1153917 (2008).
Marçais, G., Delcher, A. L., Phillippy, A. M., Coston, R., Salzberg, S. L. & Zimin, A. MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol 14, e1005944, https://doi.org/10.1371/journal.pcbi.1005944 (2018).
Xie, H. B. et al. African Suid Genomes Provide Insights into the Local Adaptation to Diverse African Environments. Mol. Biol. Evol. 39 https://doi.org/10.1093/molbev/msac256 (2022).
Warr, A. et al. An improved pig reference genome sequence to enable pig genetics and genomics research. GigaScience 9 https://doi.org/10.1093/gigascience/giaa051 (2020).
Zhang, Y., Fan, G., Liu, X., Skovgaard, K., Sturek, M. & Heegaard, P. M. H. The genome of the naturally evolved obesity-prone Ossabaw miniature pig. iScience 24, 103081, https://doi.org/10.1016/j.isci.2021.103081 (2021).
Kwon, D. et al. A chromosome-level genome assembly of the Korean crossbred pig Nanchukmacdon (Sus scrofa). Sci Data 10, 761, https://doi.org/10.1038/s41597-023-02661-7 (2023).
Chen, T. et al. The Genome Sequence Archive Family: Toward Explosive Data Growth and Diverse Data Types. Genomics Proteomics Bioinformatics 19, 578–583, https://doi.org/10.1016/j.gpb.2021.08.001 (2021).
Genome Sequence Archive (GSA) https://ngdc.cncb.ac.cn/gsa/browse/CRA017752 (2024).
Chen, M. et al. Genome Warehouse: A Public Repository Housing Genome-scale Data. Genomics Proteomics Bioinformatics 19, 584–589, https://doi.org/10.1016/j.gpb.2021.04.001 (2021).
Genome Warehouse https://ngdc.cncb.ac.cn/gwh/Assembly/85936/show (2024).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_041937265.1 (2024).
Chen, H. M. & Wei, H. J. Annotation of the genome for the Banna miniature inbred pig. Science Data Bank https://doi.org/10.57760/sciencedb.10833 (2024).
Chen, H. M. & Wei, H. J. The annotation of the genome for the Banna miniature inbred pig. Figshare https://doi.org/10.6084/m9.figshare.26537974.v1 (2024).
Acknowledgements
This work was supported by grants from the Major Science and Technology Project of Yunnan Province (202102AA100054, 202102AA310047), the Yunnan Fundamental Research Projects (202301AW070012), Yunnan Province (202305AH340006). We are also grateful to all individuals and institutions who have supported this project.
Author information
Authors and Affiliations
Contributions
H.-J.W., M.-S.W. and H.-Y.Z., conceived this research. K.-X.X., H.Z., D.-L.J., M.-J.L., M.-A.J. and P.W., collected the samples, and contributed to the experiments. H.-M.C., C.Y., S.S. and Z.-X.L., performed the analysis, and G.-Y.P., prepared the data and software. Y.-Z.Z., developed and maintained the inbred lines. H.-M.C., drafted the manuscript. All authors read and approved the final version of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, HM., Xu, KX., Yan, C. et al. A chromosome-scale reference genome of the Banna miniature inbred pig. Sci Data 11, 1345 (2024). https://doi.org/10.1038/s41597-024-04201-3
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41597-024-04201-3