A high-quality chromosome-level genome assembly and annotation of Anqing Six-end-white pig (Sus scrofa)

Zhang, Wei; Zhou, Mei; Liu, Linqing; Lu, Miao; Su, Shiguang; Wang, Chonglong

doi:10.1038/s41597-025-05945-2

Download PDF

Data Descriptor
Open access
Published: 20 October 2025

A high-quality chromosome-level genome assembly and annotation of Anqing Six-end-white pig (Sus scrofa)

Wei Zhang ORCID: orcid.org/0000-0001-7673-1924^1,2,3,4,
Mei Zhou^1,2,3,4,
Linqing Liu^1,2,3,4,
Miao Lu^1,2,3,
Shiguang Su^1,2,3,4 &
…
Chonglong Wang^1,2,3,4

Scientific Data volume 12, Article number: 1656 (2025) Cite this article

2111 Accesses
7 Altmetric
Metrics details

Subjects

Abstract

The Anqing Six-end-white pig, a precious Chinese autochthonous breed, is mainly distributed in the Anhui Province, China. It is well documented for its excellent meat quality and crude-feed tolerance. However, the current lack of high-quality Anqing Six-end-white pig genome assembly poses a significant limitation for elucidating its germplasm characteristics and implementing genomic-level conservation. In this study, we assembled a high-quality chromosome-level genome; the genome size was 2.66 Gb with contig N50 = 90.48 Mb and scaffold N50 = 143.10 Mb. There were 23 gaps in the final assembled genome. A total of 1.16 Gb repeat sequences were identified and accounting for 43.52% of the genome. A total of 20,809 protein-coding genes were identified, and 99.18% of these genes were annotated. The predicted non-coding genes included 848 miRNAs, 4544 tRNA, 253 rRNA, and 2156 snRNAs in the genome. Genome completeness was 98.67% using BUSCO and 99.81% using Compleasm. This chromosome-level assembly provides a robust scientific foundation for the conservation, breeding, and exploration of genetic traits in the Anqing Six-end-white pig.

Chromosome-level genome assembly of Huai pig (Sus scrofa)

Article Open access 02 October 2024

A chromosome-level genome assembly of the Korean minipig (Sus scrofa)

Article Open access 03 August 2024

A chromosome-level genome assembly reveals genomic characteristics of the American mink (Neogale vison)

Article Open access 16 December 2022

Background & Summary

Pig (Sus scrofa), initially domesticated from wild boars independently in the Near East and China approximately 10,000 years ago^1,2,3,4, have evolved into indispensable contributors to anthropogenic systems through their diverse roles. Nutritionally, they supply essential proteins, vitamins, and trace minerals critical for human dietary requirements⁵. Economically, industrialized swine husbandry drives agricultural economies, thus generating substantial revenue streams and employment opportunities⁵. Ecologically, swine has an exceptional capacity for sustainable nutrient cycling⁵. In translational medicine, the anatomical and physiological homology of pigs and humans has established them as pivotal preclinical models⁶. Culturally, pig hold profound symbolic significance in Eastern traditions, especially in Chinese cultural practices⁵. European and Asian pig exhibit different phenotypic and genomic characteristics⁷. The current pig genome, Sscrofa 11.1, was assembled using the Duroc Jersey pig, which was a most continuous assemblies and provide a pivotal role in mining the germplasm-related genes, thereby serving as an indispensable foundation for strategic genetic improvement initiatives in swine populations⁸. However, reliance on the pig genome Sscrofa 11.1 for evolutionary and population genetic analyses of Asian pig populations may result in the incomplete detection of Asian variations, hence, compromising the comprehensive characterization of their germplasm characteristics.

Anqing Six-end-white pig (Fig. 1), a representative native pig breed with a dual-purpose meat-lard type, is primarily distributed in surrounding mountainous areas including Taihu, Wangjiang, and Susong counties in China. In 2007, there were nearly 600 Anqing Six-white pigs. Recent data from 2019 show that the population has reached 4,617. Two national-level conservation farms were established to protect Anqing Six-white pigs. This breed is deeply loved by the local people, not only for providing meat protein and increasing farmers’ income but also for their position in local culture and history. In our prior investigations of Anqing Six-end-white pig, several candidate genes were identified that play important roles in regulating meat quality traits, lipid metabolism, and fat deposition^9,10,11,12. However, the genetic basis of the characteristics in Anqing Six-end-white pig, has not yet been fully elucidated. The high-quality reference genome of Chinese indigenous pig breeds is a powerful tool to elucidate the characteristics of native breeds. Several high-quality pig genomes have been assembled, including Huai pig¹³, Chenghua pig¹⁴, Bamei pig¹⁵, and others^16,17. However, a high-quality genome assembly of Anqing Six-end-white pig is still lacking, which greatly limits the elucidate of their germplasm characteristics and protection at the genomic level.

In this study, we assembled the first chromosome-level Anqing Six-end-white pig genome by combining short reads, PacBio HiFi (high fidelity) reads, and Hi-C (High-throughput chromosome conformation capture) sequencing data. The genome size and heterozygosity of Anqing Six-end-white pig were estimated to be 2.4 Gb and 0.61% according to 17-mer analysis of 129.62 Gb (57x) short reads. After de novo assembly with 229.34 Gb (95x) HiFi reads, the genome size was 2.69 Gb, composed of 62 contigs, and the contig N50 was 90.48 Mb with a 42.64% GC content. The final assembly genome (2.66 Gb) was anchored to 20 chromosomes (18 autosomes plus one X and Y), with scaffold N50 = 143.10 Mb through Hi-C assisted scaffolding. There were 23 gaps in the final assembled genome. The Anqing Six-end-white pig assembly captured 38 telomeres and 20 centromeres. Repeat sequences in the Anqing Six-end-white pig genome were annotated, and a 1.16 Gb repeat sequences (approximately 43.52% of the assembled genome) were identified. A total of 20,809 protein-coding genes were identified in our assembly, which harbored 36,142 transcripts with an average of 9.48 exons per gene. Non-coding RNAs in the assembly were annotated, and 848 miRNA, 4544 tRNA, 253 rRNA, and 2156 snRNA were identified. This study established a robust scientific foundation for the conservation, selective breeding, and exploration of superior genetic traits in the Anqing Six-end-white pig, offering critical genomic insights to safeguard germplasm resources and enhance agricultural sustainability.

Methods

Sample collection

A one-year-old male Anqing Six-end-white pig from the national Anqing Six-end-white pig conservation farm in Anqing city, Anhui province, China (30°19′N, 116°33′E) was used for genome sequencing (short reads, PacBio HiFi, and Hi-C) and assembly. Forty-two tissues of three one-year-old male Anqing Six-end-white pig were collected and rapidly frozen in liquid nitrogen and stored at −80°C for RNA sequencing, including the heart, liver, spleen, lung, kidney, stomach, longissimus dorsi muscle, adipose, duodenum, jejunum, ileum, cecum, colon and rectum. The pigs used in this study were healthy, and no genetic defects were observed in it or its parents. This study was conducted in accordance with and was approved by the Animal Care Committee of the Anhui Academy of Agricultural Sciences (Hefei, People’s Republic of China; Approval No. AAAS2023-4).

Sequencing

Genomic DNA was extracted from the blood sample using a standard phenol–chloroform method and assessed using a 0.5% agarose gel and Nanodrop spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA).

For short read sequencing, the DNA were processed with fragmented, size selected, end-repaired, A-tailed, ligated to paired-end adaptors, PCR amplificated, and sequenced on the BGISEQ DNBseq-T7 sequencing platform at OneMore Technology Co., Ltd. (Wuhan, China). In total, 141.65 Gb 150 bp paired-end reads were generated. Given that the original sequencing data may contain adapter sequences, low-quality bases, and undetected bases, Fastp¹⁸ (v0.23.2) software was used to filter the above information and 129.62 Gb data (Table 1) were retained to estimate the genome size of Anqing Six-end-white pig.

Table 1 Sequencing data used for the Anqing Six-end-white pig genome assembly, which include the sequencing types, sequencing platform, sequencing tissue, average read length, total bases and sequencing coverage.

Full size table

For SMRT sequencing library construction, genomic DNA was sheared, end-repaired, and ligated to SMRTbell adapters after exonuclease-mediated overhang removal. DNA damage was repaired prior to size selection (BluePippin system). Post-ligation exonuclease digestion eliminated the unligated fragments. Library quality was verified suing Qubit quantification and Fragment Analyzer sizing. Sequencing generated 12.39 million CCS reads (229.34 Gb, Table 1).

Hi-C libraries were constructed using 10 mL blood from a male Anqing Six-white pig. Chromatin was crosslinked (paraformaldehyde), digested (restriction enzyme), and ligated to preserve spatial interactions. After blunt-end repair with biotinylated dNTPs, crosslinks were reversed and the DNA was sheared (300–500 bp). Biotin-tagged fragments were enriched using streptavidin beads, followed by Illumina adapter ligation and PCR optimization. Library quality was verified by fluorometry (Qubit), fragment analysis (Bioanalyzer), and qPCR. Sequencing generated 257 Gb raw data, which were filtered to 249 Gb clean data (Table 1) with Fastp v0.23.2.

Total RNA was extracted from each tissue using Trizol reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer’s instructions. Quantity and purity were analysed using a Bioanalyzer 2100 and RNA 6000 Nano Labchip Kit (Agilent, Palo Alto, CA, USA). Only RNA samples with suitable RNA electrophoresis results (28 S/18 S ≥ 1.0) and RNA integrity number (RIN) ≥ 7.5 could be analysed further. The RNA-seq libraries were sequenced on an Illumina NovaSeq 6000, which generated 150-bp paired-end reads. An average of 10.97 Gb for each tissue were obtained (Table 1). These RNA-seq data were used for whole genome protein-coding gene prediction.

Genome size estimation and De novo genome assembly

Prior to genome assembly, key genomic features such as genome size and heterozygosity rate composition could be computationally estimated using the K-mer method. The 129.62 Gb clean data were subjected to a 17-mer frequency distribution analysis using GCE¹⁹ (v1.0.0) software (Fig. 2). The following equation was used to estimate the genome size of Anqing Six-end-white pig: G = K-mer number/K-mer Depth (where K-num is the total number of 17-mers, and K-depth denotes the K-mer depth, and G represents the genome size). To account for the confounding effects of genome heterozygosity on genome size estimation, K-mer exhibiting a depth of 1 were identified as sequencing errors. The derived error rate was used to refine the genome size calculation. The formula used was Revised Gsize = G x (1-Error Rate). The genome size and heterozygosity of Anqing Six-end-white pig were estimated to be 2.4 Gb and 0.61%.

De novo assembly of Anqing Six-end-white pig was performed using PacBio HiFi data and HiFiasm²⁰ (v0.19.6) software. The HiFiasm assembly pipeline was constructed in three critical phases: (1) error correction and haplotype phasing of raw reads; (2) iterative construction of assembly graphs; and (3) refinement to generate chromosome-level contigs. To eliminate heterozygosity-induced redundancy, the purge_haplotigs²¹ (v1.0.4) software was applied to filter misassembled contigs by integrating the read coverage patterns and pairwise sequence alignment metrics. Contaminant sequences were systematically identified and removed by alignment against the NT database (v2023.07.04, https://ftp.ncbi.nlm.nih.gov/blast/db/) using BLASTN²² (v2.11.0+). The genome size was 2.69 Gb, composed of 62 contigs, and the contig N50 was 90.48 Mb with a GC content of 42.64% (Table 2).

Table 2 Assembly statistics for the Anqing Six-end-white pig.

Full size table

Hi-C assisted scaffolding

Following quality control of Hi-C data, the validated reads were processed through a hierarchical assembly pipeline: (1) data integrity was assessed using HiCUP²³ (v0.7.2); (2) alignment to the draft genome was performed with Juicer²⁴ (v1.5.6); and (3) chromosome-scale scaffolding was achieved through iterative integration of 3d-DNA (v180.922, https://github.com/theaidenlab/3d-dna) and HapHiC²⁵ (v1.0.2), with manual curation guided by JuiceBox²⁶ (v1.11.08) visualization of contact matrices. The final genome assembly was 2.66 Gb with contig N50 = 90.48 Mb and scaffold N50 = 143.10 Mb (Table 2). There were 23 gaps in the final assembled genome. The chromosomal anchoring efficiency was 98.80%. A heatmap was drawn for the interactions between the chromosomes (Fig. 3). The comparison between Anqing Six-end-white pig assembled genome and Sscrofa 11.1 was shown in Table 3.

Table 3 The genome comparison of Anqing Six-end-white pig and Sscrofa 11.1.

Full size table

Repetitive element identification

Repetitive sequence annotation was performed through an integrated pipeline combining three complementary strategies: (1) de novo prediction of tandem repeats using TRF²⁷ (v4.09) software; (2) homology-based identification via RepeatMasker²⁸ (v open-4.0.9) and RepeatProteinMask against the RepBase database (v20181026, http://www.girinst.org/repbase); and (3) a custom repeat library was constructed by integrating RepeatModeler²⁹ (v open-1.0.11) and LTR_FINDER_parallel³⁰ (v1.0.7), and was subsequently applied RepeatMasker (v open-4.0.9) to systematically annotate repetitive elements across the genome by de novo predictions. Consensus annotations were generated by merging non-redundant predictions from all approaches. A total of 1.16 Gb repeat sequences (approximately 43.52% of the assembled genome) were identified, of which 99.8% were classified as known repeat sequences (Table 4) Fig. 4.

Table 4 The detail information of repetitive sequences in the Anqing Six-end-white pig genome.

Full size table

Protein-coding genes prediction

Gene annotation was performed through an integrative multi-evidence framework: (1) ab initio predictions were generated using Augustus³¹ (v3.3.2) and Genscan³² under default parameters; (2) homology-based inference integrated cross-species protein alignments from Homo sapiens (GCF_000001405.40_GRCh38.p14), Mus musculus (GCF_000001635.27_GRCm39), Phacochoerus africanus (GCF_016906955.1_ROS_Pafr_v1), and Sus scrofa (GCF_000003025.6_Sscrofa11.1) through miniprot³³ (v0.11-r234); and (3) transcriptomic evidence was incorporated through HISAT2³⁴ (v2.1.0)-guided RNA-seq alignments, followed by transcript assembly with StringTie³⁵ (v1.3.5) and ORF prediction via TransDecoder (v5.5.0, https://github.com/TransDecoder/TransDecoder). BUSCO³⁶ (Benchmarking Universal Single-Copy Orthologs, v5.7.1) was used for predictions based on Augustus (v3.3.2) and lineage-specific ortholog libraries (laurasiatheria_odb10). MAKER2³⁷ (v2.31.10) was used to integrate preliminary gene models into a consensus set. Final refinement was performed using HiFAP (developed in-house). In total, 20,809 protein-coding genes were identified, which harbored 36,142 transcripts with an average of 9.48 exons per gene (Table 5).

Table 5 The detail information of protein-coding gene of Anqing Six-end-white pig.

Full size table

Gene function annotation

Protein-coding genes were functionally analyzed using ten datasets: NR_Annotation, SwissProt_Annotation, TrEMBL_Annotation, KOG_Annotation, TF_Annotation, InterPro_Annotation, GO_Annotation, KEGG_ALL_Annotation, KEGG_KO_Annotation, and Pfam_Annotation. Functional annotation was performed through dual complementary strategies: (1) sequence similarity analysis using diamond³⁸ blastp against TrEMBL³⁹, SwissProt³⁹, NR⁴⁰, KOG (https://ftp.ncbi.nih.gov/pub/COG/KOG/), COG⁴¹, GO⁴², and KEGG⁴³ databases, with KOBAS⁴⁴ (v3.0) mapping KEGG orthologs to metabolic pathways; and (2) domain/motif profiling through InterProScan⁴⁵ (v5.61–93.0) integrating InterPro⁴⁶ databases to obtain protein conserved sequences, motifs, domains and other information. Hmmscan from the HMMER3 (v3.3.1) software was used to annotate conserved sequence information, such as transcription factors, Pfam, and motif based on multiple sequence alignment and the hidden Markov model⁴⁷. A total of 20,638 genes (99.18% of predicted genes) were annotated (Table 6).

Table 6 The gene function annotation of Anqing Six-end-white pig assembly.

Full size table

Annotation of non-coding RNAs (ncRNA)

RNA annotation employed species-specific approaches based on molecular features: (1) tRNA were identified using tRNAscan-SE⁴⁸; (2) rRNA were detected via BLASTN against conserved ribosomal sequences from phylogenetically proximal species (Sus scrofa and Phacochoerus africanus); and (3) miRNAs and snRNAs were annotated with INFERNAL⁴⁹ (v1.1.4) by scanning against the Rfam⁵⁰ (v14.8) database. The predicted non-coding genes included 848 miRNAs, 4544 tRNA, 253 rRNA, and 2156 snRNAs in the Anqing Six-end-white pig genome (Table 7).

Table 7 The annotation of non-coding RNAs in the Anqing Six-end-white pig assembly.

Full size table

Detecting telomeric and centromeric regions

Telomeric regions were systematically identified across the Anqing Six-end-white pig genome using quarTeT⁵¹ software (version 1.1.4), which detects the consensus telomeric repeat motif AACCCT. Among all chromosomes, 18 exhibited telomeric repeats at both termini, whereas two chromosomes showed terminal repeats at a single end (Table 8). According to the results of repeat annotation in the genome annotation, regions with high repeat distribution were selected as candidate centromere inputs, and srf⁵² software ((https://github.com/lh3/srf)) was used for centromere prediction. The srf software identifies multiple centromere candidate regions and multiple repeat types, and then selects the candidate regions to obtain the predicted centromere region (Table 8). Visualization of porcine telomeres and centromeres is shown in Fig. 5.

Table 8 The telomeric and centromeric regions of Anqing Six-end-white pig genome.

Full size table

Data Records

The assembled genome has been deposited at NCBI GenBank under the accession number GCA_050231125.1⁵³. The PacBio HiFi, Hi-C, short reads, and RNA-seq data were submitted to the Genome Sequence Archive of the China National Center for Bioinformation (https://ngdc.cncb.ac.cn/gsa/) under the project PRJCA037722^54,55. The accession number of CRR1702913 and CRR1702914 for Hi-C. The accession number of CRR1702915 for short reads. The accession number of CRR1702916 and CRR1702917 for PacBio HiFi. The accession number of from CRR1702918 to CRR1702959 for RNA-seq. Additionally, files containing the protein-coding gene annotation, non-coding RNA prediction, and repeat annotation of Anqing Six-end-white pig have been deposited in the Figshare database⁵⁶ (https://doi.org/10.6084/m9.figshare.28943891).

Technical Validation

Multiple evaluation metrics were utilized to assess the quality and robustness of the Anqing Six-end-white pig assembly genome. Firstly, the short reads, PacBio HiFi reads, and RNA-seq data were mapped to the genome with bwa⁵⁷ (v0.7.12-r1039), minimap2⁵⁸ (v2.24-r1122), and Hisat2, respectively. Alignment rates were 99.14% for short-read data, 99.99% for PacBio HiFi reads, and 89.43% for RNA-seq data. Secondly, following short-read alignment, variants (SNPs and InDels) were identified using SAMtools⁵⁹ (v1.9), Picard (v1.124), and GATK⁶⁰ (v4.4.0.0). The homozygous SNP rate (0.001%), homozygous InDel rate (0.001%), heterozygous SNP rate (0.346%), and heterozygous InDel rate (0.069%). The exceptionally low homozygous rates indicate high assembly accuracy. Thirdly, The Merqury⁶¹ software (v1.3) was employed to assess the quality values (QV) of Anqing Six-end-white pig genome using a combination of short-read and PacBio HiFi reads. The results revealed that the QV based on short-read and PacBio HiFi reads were 49.3113 and 72.6645 respectively. Fourthly, BUSCO³⁶ (v5.7.1) and Compleasm⁶² (v0.2.6) analyses were conducted to evaluate the completeness of our assembly by laurasiatheria_odb10. BUSCO analyses revealed that 12,071(98.67%) of the 12,234 conserved single-copy genes in our assembly, of which 12034 were single, 37 were duplicated, and 78 were fragmented matches (Table 9). Compleasm analyses revealed that 12,211(99.81%) of the 12,234 conserved single-copy genes in our assembly, of which 12193 were single, 18 were duplicated, and 9 were fragmented matches (Table 10). Overall, these findings indicate the high quality of the genome assembly.

Table 9 The BUSCO results of the Anqing Six-end-white pig assembly.

Full size table

Table 10 The Compleasm results of the Anqing Six-end-white pig assembly.

Full size table

Code availability

The software versions, settings and parameters used are described below. No custom code was used during this study for the curation and/or validation.

Fastp v0.23.2: -l 50

GCE v1.0.0: -k 17

HiFiasm v0.19.6: default

purge_haplotigs v1.0.4: default

BLASTN v2.11.0+: -evalue 0.00001 -max_hsps 1

HiCUP v0.7.2: default

Juicer v1.5.6: default

3d-DNA v180.922: -r 0

HapHiC v1.0.2: --NM 3

JuiceBox v1.11.08: default

TRF v4.09: 2 7 7 80 10 50 2000 -d -h

RepeatMasker v open-4.0.9: -nolow -no_is -norna

RepeatModeler v open-1.0.11: default

LTR_FINDER_parallel v1.0.7: default

Miniprot v0.11-r234: --gff-only -O 11 -E 1 -F 23 -C 1 -B 5 -G 200000 -j 1

HISAT2 v2.1.0: --dta

StringTie v1.3.5: default

TransDecoder v5.5.0: default

BUSCO v5.7.1: --miniprot -m genome -f --offline

MAKER2 v2.31.10: max_dna_len = 3000000 min_contig = 10000 pred_flank = 500 min_protein = 30

HMMER3 v3.3.1: hmmsearch -E 1e-05 --domE 1e-05

Diamond v2.0.14: --evalue 1e-05

tRNAscan-Se v1.3.1: -q

Rfam v14.8: cmscan --rfam –nohmmonly

quarTeT v1.1.4: -c animal

bwa v0.7.12-r1039: default

minimap2 v2.24-r1122: default

SAMtools v1.9: default

Picard v1.124: default

GATK v4.4.0.0: default

References

Ai, H. et al. Adaptation and possible ancient interspecies introgression in pigs identified by whole-genome sequencing. Nature genetics 47, 217–225, https://doi.org/10.1038/ng.3199 (2015).
Article CAS PubMed Google Scholar
Frantz, L. A. F. et al. Evidence of long-term gene flow and selection during domestication from analyses of Eurasian wild and domestic pig genomes. Nature genetics 47, 1141–1148, https://doi.org/10.1038/ng.3394 (2015).
Article CAS PubMed Google Scholar
Groenen, M. A. M. et al. Analyses of pig genomes provide insight into porcine demography and evolution. Nature 491, 393–398, https://doi.org/10.1038/nature11622 (2012).
Article ADS CAS PubMed PubMed Central Google Scholar
Larson, G. et al. Worldwide phylogeography of wild boar reveals multiple centers of pig domestication. Science 307, 1618–1621, https://doi.org/10.1126/science.1106927 (2005).
Article ADS CAS PubMed Google Scholar
China National Commission of Animal Genetic Resource. Animal Genetic Resource in China. Pigs; China Agriculture Press: Beijing, China, (2011).
Swindle, M. M., Makin, A., Herron, A. J., Clubb, F. J. Jr & Frazier, K. S. Swine as models in biomedical research and toxicology testing. Veterinary pathology 49(2), 344–356, https://doi.org/10.1177/0300985811402846 (2012).
Article CAS PubMed Google Scholar
Li, M. et al. Comprehensive variation discovery and recovery of missing sequence in the pig genome using multiple de novo assemblies. Genome research 27, 865–874, https://doi.org/10.1101/gr.207456.116 (2017).
Article CAS PubMed PubMed Central Google Scholar
Warr, A. et al. An improved pig reference genome sequence to enable pig genetics and genomics research. Gigascience 9, https://doi.org/10.1093/gigascience/giaa051 (2020).
Wang, Y. L. et al. RNA sequencing analysis of the longissimus dorsi to identify candidate genes underlying the intramuscular fat content in Anqing Six-end-white pigs. Animal genetics 54(3), 315–327, https://doi.org/10.1111/age.13308 (2023).
Article CAS PubMed Google Scholar
Zhang, W. et al. Identification of Signatures of Selection by Whole-Genome Resequencing of a Chinese Native Pig. Frontiers in genetics 11, 566255, https://doi.org/10.3389/fgene.2020.566255 (2020).
Article CAS PubMed PubMed Central Google Scholar
Zhang, X. D. et al. Association between ADSL, GARS-AIRS-GART, DGAT1, and DECR1 expression levels and pork meat quality traits. Genetics and molecular research 14(4), 14823–30, https://doi.org/10.4238/2015.November.18.47 (2015).
Article CAS PubMed Google Scholar
Hu, H. et al. Comparative analysis of meat sensory quality, antioxidant status, growth hormone and orexin between Anqingliubai and Yorkshire pigs. J. Appl. Anim. Res 47, 357–361, https://doi.org/10.1080/09712119.2019.1643729 (2019).
Article ADS CAS Google Scholar
Du, H. et al. Chromosome-level genome assembly of Huai pig (Sus scrofa). Scientific data 11(1), 1072, https://doi.org/10.1038/s41597-024-03921-w (2024).
Article CAS PubMed PubMed Central Google Scholar
Wang, Y. et al. A chromosome-level genome of Chenghua pig provides new insights into the domestication and local adaptation of pigs. International journal of biological macromolecules 270, 131796, https://doi.org/10.1016/j.ijbiomac.2024.131796 (2024).
Article CAS PubMed Google Scholar
Li, D. et al. Pangenome and genome variation analyses of pigs unveil genomic facets for their adaptation and agronomic characteristics. iMeta 3(6), e257, https://doi.org/10.1002/imt2.257 (2024).
Article CAS PubMed PubMed Central Google Scholar
Ma, H. et al. Long-read assembly of the Chinese indigenous Ningxiang pig genome and identification of genetic variations in fat metabolism among different breeds. Molecular ecology resources 22(4), 1508–1520, https://doi.org/10.1111/1755-0998.13550 (2022).
Article CAS PubMed Google Scholar
Zhou, R. et al. The Meishan pig genome reveals structural variation-mediated gene expression and phenotypic divergence underlying Asian pig domestication. Molecular ecology resources 21(6), 2077–2092, https://doi.org/10.1111/1755-0998.13396 (2021).
Article CAS PubMed Google Scholar
Chen, S. et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34(17), i884–i890, https://doi.org/10.1093/bioinformatics/bty560 (2018).
Article CAS PubMed PubMed Central Google Scholar
Liu, B. et al. Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. Quantitative Biology 35, 62–67, https://doi.org/10.1016/S0925-4005(96)02015-1 (2013).
Article Google Scholar
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature methods 18(2), 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Roach, M. J., Schmidt, S. A. & Borneman, A. R. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC bioinformatics 19(1), 460, https://doi.org/10.1186/s12859-018-2485-7 (2018).
Article CAS PubMed PubMed Central Google Scholar
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of molecular biology 215(3), 403–410, https://doi.org/10.1016/S0022-2836(05)80360-2 (1990).
Article CAS PubMed Google Scholar
Wingett, S. et al. HiCUP: pipeline for mapping and processing Hi-C data. F1000Research 4, 1310, https://doi.org/10.12688/f1000research.7334.1 (2015).
Article CAS PubMed PubMed Central Google Scholar
Durand, N. C. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell systems 3(1), 95–8, https://doi.org/10.1016/j.cels.2016.07.002 (2016).
Article CAS PubMed PubMed Central Google Scholar
Zeng, X. et al. Chromosome-level scaffolding of haplotype-resolved assemblies using Hi-C data without reference genomes. Nature plants 10(8), 1184–1200, https://doi.org/10.1038/s41477-024-01755-3 (2024).
Article CAS PubMed Google Scholar
Durand, N. C. et al. Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell systems 3(1), 99–101, https://doi.org/10.1016/j.cels.2015.07.012 (2016).
Article CAS PubMed PubMed Central Google Scholar
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research 27(2), 573–580, https://doi.org/10.1093/nar/27.2.573 (1999).
Article CAS PubMed PubMed Central Google Scholar
Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. [http://repeatmasker.org] (1996)
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358, https://doi.org/10.1093/bioinformatics/bti1018 (2005).
Article CAS PubMed Google Scholar
Ou, S. & Jiang, N. LTR_FINDER_parallel: parallelization of LTR_FINDER enabling rapid identification of long terminal repeat retrotransposons. Mobile DNA 10, 48, https://doi.org/10.1186/s13100-019-0193-0 (2019).
Article CAS PubMed PubMed Central Google Scholar
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic acids research 34, W435–W439, https://doi.org/10.1093/nar/gkl200 (2006).
Article CAS PubMed PubMed Central Google Scholar
Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. Journal of molecular biology 268(1), 78–94, https://doi.org/10.1006/jmbi.1997.0951 (1997).
Article CAS PubMed Google Scholar
Slater, G. S. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC bioinformatics 6, 31, https://doi.org/10.1186/1471-2105-6-31 (2005).
Article CAS PubMed PubMed Central Google Scholar
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature biotechnology 37(8), 907–915, https://doi.org/10.1038/s41587-019-0201-4 (2019).
Article CAS PubMed PubMed Central Google Scholar
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature biotechnology 33(3), 290–295, https://doi.org/10.1038/nbt.3122 (2015).
Article CAS PubMed PubMed Central Google Scholar
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular biology and evolution 38(10), 4647–4654, https://doi.org/10.1093/molbev/msab199 (2021).
Article CAS PubMed PubMed Central Google Scholar
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC bioinformatics 12, 491, https://doi.org/10.1186/1471-2105-12-491 (2011).
Article PubMed PubMed Central Google Scholar
Buchfink, B., Reuter, K. & Drost, H. G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature methods 18(4), 366–368, https://doi.org/10.1038/s41592-021-01101-x (2021).
Article CAS PubMed PubMed Central Google Scholar
Coudert, E. et al. Annotation of biologically relevant ligands in UniProtKB using ChEBI. Bioinformatics 39(1), btac793, https://doi.org/10.1093/bioinformatics/btac793 (2023).
Article CAS PubMed Google Scholar
Sayers, E. W. et al. Database resources of the national center for biotechnology information. Nucleic acids research 49(D1), D10–D17, https://doi.org/10.1093/nar/gkaa892 (2021).
Article CAS PubMed Google Scholar
Galperin, M. Y. et al. COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic acids research 49(D1), D274–D281, https://doi.org/10.1093/nar/gkaa1018 (2021).
Article CAS PubMed Google Scholar
Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 25(1), 25–29, https://doi.org/10.1038/75556 (2000).
Article CAS PubMed PubMed Central Google Scholar
Ogata, H. et al. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic acids research 27(1), 29–34, https://doi.org/10.1093/nar/27.1.29 (1999).
Article CAS PubMed PubMed Central Google Scholar
Bu, D. et al. KOBAS-i: intelligent prioritization and exploratory visualization of biological functions for gene enrichment analysis. Nucleic acids research 49(W1), W317–W325, https://doi.org/10.1093/nar/gkab447 (2021).
Article CAS PubMed PubMed Central Google Scholar
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30(9), 1236–1240, https://doi.org/10.1093/bioinformatics/btu031 (2014).
Article CAS PubMed PubMed Central Google Scholar
Paysan-Lafosse, T. et al. InterPro in 2022. Nucleic acids research 51(D1), D418–D427, https://doi.org/10.1093/nar/gkac993 (2023).
Article CAS PubMed Google Scholar
Eddy, S. R. Profile hidden Markov models. Bioinformatics. 14(9), 755–763, https://doi.org/10.1093/bioinformatics/14.9.755 (1998).
Article CAS PubMed Google Scholar
Chan, P. P., Lin, B. Y., Mak, A. J. & Lowe, T. M. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic acids research 49(16), 9077–9096, https://doi.org/10.1093/nar/gkab688 (2021).
Article CAS PubMed PubMed Central Google Scholar
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935, https://doi.org/10.1093/bioinformatics/btt509 (2013).
Article CAS PubMed PubMed Central Google Scholar
Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A. & Eddy, S. R. & Bateman, A. Rfam: annotating non-coding RNAs in complete genomes. Nucleic acids research 33, D121–D124, https://doi.org/10.1093/nar/gki081 (2005).
Article CAS PubMed Google Scholar
Lin, Y. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Horticulture research 10(8), uhad127, https://doi.org/10.1093/hr/uhad127 (2023).
Article PubMed PubMed Central Google Scholar
Zhang, Y., Chu, J., Cheng, H. & Li, H. De novo reconstruction of satellite repeat units from sequence data. Genome research 33(11), 1994–2001, https://doi.org/10.1101/gr.278005.123 (2023).
Article CAS PubMed PubMed Central Google Scholar
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_050231125.1 (2025).
NGDC Genome Sequence Archive. https://ngdc.cncb.ac.cn/gsa/browse/CRA024101 (2025).
NGDC Genome Warehouse. https://ngdc.cncb.ac.cn/gwh/Assembly/93577/show (2025).
Zhang, W. et al. A high-quality Chromosome-level genome assembly and annotation of Anqing Six-end-white pig (Sus scrofa). Figshare https://doi.org/10.6084/m9.figshare.28943891 (2025).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14), 1754–1760, https://doi.org/10.1093/bioinformatics/btp324 (2009).
Article CAS PubMed PubMed Central Google Scholar
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100, https://doi.org/10.1093/bioinformatics/bty191 (2018).
Article CAS PubMed PubMed Central Google Scholar
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16), 2078–9, https://doi.org/10.1093/bioinformatics/btp352 (2009).
Article CAS PubMed PubMed Central Google Scholar
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research 20(9), 1297–1303, https://doi.org/10.1101/gr.107524.110 (2010).
Article CAS PubMed PubMed Central Google Scholar
Rhie, A. et al. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21, 245, https://doi.org/10.1186/s13059-020-02134-9 (2020).
Article CAS PubMed PubMed Central Google Scholar
Huang, N. & Li, H. Compleasm: a faster and more accurate reimplementation of BUSCO. Bioinformatics 39(10), btad595, https://doi.org/10.1093/bioinformatics/btad595 (2023).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

This work was financially supported by the grants from Anhui Provincial Financial Agricultural Germplasm Resources Protection and Utilization Fund Project, The Special Fund for Anhui Agricultural Research System (AHCYJSTX-05-12, AHCYJSTX-05-23), Anhui Provincial Science and Technology Mission Project, Anhui Province Academic and Technical Leader Candidate Project (No.2022H300), 2025 Institutional Research Program of Anhui Academy of Agricultural Sciences (2025YL043). We thank Wuhan Onemore-tech Co. Ltd. for their kindly helps in data analysis.

Author information

Authors and Affiliations

Institute of Animal Husbandry and Veterinary Medicine, Anhui Academy of Agricultural Sciences, Hefei, 230031, China
Wei Zhang, Mei Zhou, Linqing Liu, Miao Lu, Shiguang Su & Chonglong Wang
Anhui Provincial Breeding Pig Genetic Evaluation Center, Hefei, 230031, China
Wei Zhang, Mei Zhou, Linqing Liu, Miao Lu, Shiguang Su & Chonglong Wang
Key Laboratory of Pig Molecular Quantitative Genetics of Anhui Academy of Agricultural Sciences, Hefei, 230031, China
Wei Zhang, Mei Zhou, Linqing Liu, Miao Lu, Shiguang Su & Chonglong Wang
Anhui Provincial Key Laboratory of Livestock and Poultry Product Safety, Hefei, 230031, China
Wei Zhang, Mei Zhou, Linqing Liu, Shiguang Su & Chonglong Wang

Authors

Wei Zhang
View author publications
Search author on:PubMed Google Scholar
Mei Zhou
View author publications
Search author on:PubMed Google Scholar
Linqing Liu
View author publications
Search author on:PubMed Google Scholar
Miao Lu
View author publications
Search author on:PubMed Google Scholar
Shiguang Su
View author publications
Search author on:PubMed Google Scholar
Chonglong Wang
View author publications
Search author on:PubMed Google Scholar

Contributions

C.-L.W. and W.Z. conceived and designed the experiments; W.Z. designed the analytical strategy and performed analysis processes. M.Z., S.-G.S. and L.-Q.L. performed the sampling; M.L. extracted the genomic DNA and RNA; The initial draft of the manuscript was authored by W.Z., who incorporated detailed feedback from all the authors. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Shiguang Su or Chonglong Wang.

Ethics declarations

Competing interests

We declare that we do not have any commercial or associative interests that represent conflicts of interest in connection with the submitted work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Zhang, W., Zhou, M., Liu, L. et al. A high-quality chromosome-level genome assembly and annotation of Anqing Six-end-white pig (Sus scrofa). Sci Data 12, 1656 (2025). https://doi.org/10.1038/s41597-025-05945-2

Download citation

Received: 12 May 2025
Accepted: 03 September 2025
Published: 20 October 2025
Version of record: 20 October 2025
DOI: https://doi.org/10.1038/s41597-025-05945-2