Introduction

Over millennia, natural selection and artificial breeding have led to significant phenotypic diversification, with over 700 domestic pig breeds globally, mainly in Asia and Europe1,2. Regional pig breeds with distinct traits emerged through long-term socio-cultural evolution3. Head morphology, a key trait linked to domestication and genetics, influences behaviors like foraging, feeding, and aggression. Chinese indigenous pigs are vital genetic resources due to their environmental adaptability and distinct phenotypes. According to China’s National Germplasm Center (https://cdad-is.org.cn/), over twenty native breeds exist, such as Neijiang, Hang, and Wannan Black pigs, all showing significant head morphology differences. Yet, scientific studies are scarce.

The Neijiang pig, one of the typical representatives of Chinese local pig breeds, originates from the Neijiang region in Sichuan Province and has a domestication history of ~1800 years4. The region’s subtropical climate, agricultural practices, and traditional breeding culture together have shaped its unique evolutionary path. The Neijiang pig is known for its high fertility, strong stress resistance, remarkable adaptability, and distinct morphology, making it highly valuable for conservation and sustainable utilization5. However, modern commercial breeding has drastically reduced its population, threatening genetic diversity. Although swine genomics has made significant progress, the absence of a breed-specific, complete reference genome continues to hinder the effective use of these local genetic resources.

The current mainstream pig reference genome (Sscrofa11.1)6 is renowned for its high contiguity, but its genetic foundation is derived from the European Duroc breed, which exhibits approximately 7% genomic divergence from Asian pig breeds7. This discrepancy significantly limits its application in analyzing genetic variations in indigenous Chinese pigs, especially those related to specific adaptive variations. The integration of third-generation sequencing technologies (such as PacBio High-Fidelity (HiFi) and Oxford Nanopore ultra-long reads) with chromatin conformation capture (Hi-C) has enabled the achievement of T2T genome assemblies8,9,10,11,12,13. HiFi reads combine long read lengths (>20 kb) with high accuracy (>99.9%), allowing precise resolution of complex repeat units, while Nanopore ultra-long reads (>100 kb) span large structural variation regions. Their synergistic use substantially reduces assembly fragmentation14. These technologies are expected to achieve complete genome assemblies for indigenous pig breeds, fully resolving centromeres, telomeres, and repetitive gene clusters, thereby establishing a paradigm for precise genetic resource characterization and functional genomic studies.

In this study, we constructed the nearly gap-free NJP-T2T by integrating PacBio HiFi reads, Nanopore ultra-long reads, and Hi-C sequencing data. This assembly enabled systematic characterization of centromeric and telomeric architectures, while comparative genomic analyses revealed unique features distinguishing the Neijiang pig genome. Population genetic analyses delineated the genetic structure and inbreeding patterns within conservation populations. Through GWAS and selection signature analyses, we identified candidate genes associated with reproductive performance and head morphology. The NJP-T2T genome serves as both a high-precision molecular toolkit for swine genetic resource utilization and a critical foundation for deciphering adaptive evolution mechanisms in East Asian pig breeds and advancing molecular breeding strategies.

Results

NJP-T2T genome assembly and annotation

A Neijiang pig, one of China’s most iconic indigenous breeds (Fig. 1a), was selected from the National Neijiang Pig Genetic Resources Conservation Farm for genome assembly. PacBio HiFi sequencing generated 2,984.26 Gb of raw data, from which 176.13 Gb of high-quality reads (68× coverage, N50 = 18 kb) were retained after filtering (Supplementary Data 1). Nanopore sequencing yielded 462.65 Gb of ultra-long reads (179× coverage), including 33.95 Gb >100 kb (Supplementary Fig. 1 and Supplementary Data 2). Hi-C data (396.57 Gb, ~152× coverage) were additionally acquired (Supplementary Data 3).

Fig. 1: NJP-T2T genome assembly.
Fig. 1: NJP-T2T genome assembly.
Full size image

a Neijiang pig. b Heat map of interactive intensity between chromosome sequences of Neijiang pig by Hi-C. c The genome circos plot of NJP-T2T, displaying the genomic features of NJP-T2T. From innermost to outermost, it represents the chromosome length of NJP-T2T, genes, GC content, dispersed repeats, tandem repeats, Class 1 retrotransposons, Class 2 DNA transposons, tandem repeats, and synteny. d BUSCO scores of genome assembly and annotation. e: Genome synteny between NJP-T2T and Sscrofa11.1.

The initial contig-level assembly was constructed using HiFi sequencing data, resulting in 161 contigs (total length 2696.18 Mb, N50 = 90.71 Mb). Meanwhile, an Oxford Nanopore Technologies (ONT) assembly was generated using Nanopore ultra-long reads, producing 844 contigs (total length 2523.41 Mb, N50 = 59.65 Mb; Supplementary Data 4). Integration of Hi-C chromatin conformation capture data (Supplementary Data 5) anchored 96.69% of the HiFi sequences to 19 pseudochromosomes, including 18 autosomes and one X chromosome (Supplementary Data 6). In the heatmap, intrachromosomal interaction signals along the diagonal regions were markedly stronger than interchromosomal signals (Fig. 1b), confirming the accuracy of the assembly. Gap closure using the ONT assembly achieved a near complete telomere-to-telomere assembly, with only one 500 bp gap remaining on chromosome 16 (Fig. 1c). The final genome size was approximately 2.55 GB (including only sequences anchored to autosomes and the X chromosome), with contig and scaffold N50 values both reaching 142.12 Mb (Supplementary Data 7).

Repetitive elements were annotated across 42.26% of the genome, comprising 19.33% long interspersed nuclear elements (LINEs), 1.53% short interspersed nuclear elements (SINEs), 4.41% long terminal repeats (LTRs), and 8.60% satellite repeats (Fig. 1c, Supplementary Data 8). Protein-coding genes were predicted through integrative annotation (Supplementary Fig. 2), identifying 23,375 high-confidence genes. Functional annotation was achieved for 95.79% of genes (Supplementary Fig. 3 and Supplementary Data 9), with enrichment observed in intracellular components (cellular anatomical entities), molecular binding/catalytic activities, and carbohydrate utilization pathways (Supplementary Fig. 4). Among all protein-coding genes, the previously unresolved regions (PURs) regions contained 1199 genes, including 1040 with known functional annotations and 159 without annotation (Supplementary Data 10). These genes are distributed across all chromosomes, with relatively high densities observed in the posterior regions of the p-arms of chromosomes 2 and 7(Supplementary Fig. 5). Non-coding RNA annotation revealed 4421 tRNAs, 632 rRNAs, and 954 miRNAs (Supplementary Data 11), alongside 173 pseudogenes averaging 5700 bp.

The quality of genome assembly and annotation was evaluated through sequencing read coverage, mapping rates, consensus quality value (QV), Benchmark of Universal Single-Copy Orthologs (BUCSO) scores, and genomic synteny. HiFi and ultra-long ONT reads were mapped back to NJP-T2T to validate assembly accuracy, which found that all reads were uniformly distributed across nearly all chromosomes (Supplementary Fig. 6), with mapping rates reaching 100% and 99.99%. Consensus quality value (QV) was determined using a Kmer-based method to assess assembly accuracy, showing QV values of 54.25–78.07 across all chromosomes (mean = 62.38; Supplementary Data 12). The BUCSO score indicated that 97.90% of the benchmark universal single-copy orthologs were annotated (Fig. 1d). Additionally, high synteny was observed between NJP-T2T and the NCBI pig reference genome Sscrofa11.1, these results demonstrate the high quality and reliability of the NJP-T2T genome (Fig. 1e).

Characterization of centromeres and telomeres

Centromeres play essential roles in eukaryotic cell division15, yet their complete resolution in porcine genomes has been hindered by highly repetitive sequences. In this study, a two-step strategy was implemented: preliminary localization of centromeres through regional genomic features, followed by precise demarcation using centromeric repeat monomers. The centromeric regions were characterized by the absence of gene clusters, elevated GC content, methylation enrichment, and dense tandem repeat coverage. The sequences demonstrated remarkably uniform complexity in terms of entropy. Notably, previously documented satellite DNA should be observed in the centromeric regions (Supplementary Data 13). Through these defining characteristics, the centromeric positions in the NJP-T2T genome were preliminarily mapped, and a comprehensive centromeric feature landscape was established (Fig. 2a). The analysis revealed dense accumulation of tandem repeats in centromeric regions, while Class I retrotransposons and Class II DNA transposons exhibited minimal representation. Subsequently, nine major centromeric repeat units (CRUs) were identified within the preliminary centromere regions (Supplementary Data 14). CRU1 (335 bp), CRU2 (1679 bp), CRU3 (336 bp), CRU4 (672 bp), and CRU5 (336 bp) corresponded to Mc1 satellite sequences predominantly enriched in metacentric chromosomes (Chr1-12). In contrast, CRU6 (1342 bp), CRU7 (907 bp), CRU8 (2242 bp), and CRU9 (1344 bp) were identified as Ac2 satellite sequences specifically localized to acrocentric chromosomes (Chr13-18). Centromere positions were definitively mapped based on CRU distributions (Supplementary Data 15), with Chr1-12 classified as metacentric (arm ratio: 1.11-3.51; Fig. 2b). The longest metacentric centromere was observed in Chr1 (8.62 Mb), while the shortest was detected in Chr5 (1.54 Mb). Chr13-18 were categorized as acrocentric, with the longest centromere spanning 16.5 Mb in Chr16. ChrX is a standard metacentric chromosome, with an arm ratio of a uniform 1.08. Tandem repeat sequences are enriched along the centromeric position on the chromosome, with different distribution patterns across various chromosomes (Fig. 2c). In addition, our assembly detected telomere structures at the ends of all chromosomes (Supplementary Data 16).

Fig. 2: Centromere Region Identification and Features.
Fig. 2: Centromere Region Identification and Features.
Full size image

a Genomic features at the centromere of Neijiang pig. From top to bottom, these include gene density, GC content, centromere satellite sequence alignment results, 5mC methylation, tandem repeats, Class 1 retrotransposons, Class 2 DNA transposons, sequence entropy distribution, and centromere sequence identity heatmap. b The position and arm ratio of Neijiang pig centromeres. c Composition of satellite sequences in the centromere region length of NJP-T2T.

Assembly improvement of NJP-T2T

Figure 3a illustrates the structural distinctions between the NJP-T2T and Sscrofa11.1 genomes. By aligning the NJP-T2T assembly with the Sscrofa11.1 reference genome, we identified a set of PURs that correspond to sequences either absent or misassembled in the earlier reference. Compared with Sscrofa11.1, the NJP-T2T assembly was extended by 181.35 Mb in autosomes and the X chromosome, with 532 gaps closed and the telomeric regions of the Neijiang pig resolved for the first time (Supplementary Fig. 7). In addition, structural errors present in Sscrofa11.1 were corrected in NJP-T2T (Supplementary Data 17). Further sequence characterization for PURs revealed that these additional regions are enriched for centromeric satellite sequences (43.21%), Segmental duplications (SDs) (14.21%), and dispersed repeats (13.26%), with 3.51% overlapping both centromeric satellites and segmental duplications, and 25.81% remaining unclassified (Fig. 3b). Taken together, the NJP-T2T assembly represents one of the most contiguous and robust swine genomes currently available. The remarkably low number of gaps and the presence of exceptionally long contiguous sequences provide a highly valuable, high-resolution genomic resource for the Neijiang pig (Supplementary Data 18).

Fig. 3: Improvement of the NJP-T2T Genome.
Fig. 3: Improvement of the NJP-T2T Genome.
Full size image

a Differences in details between the NJP-T2T and Sscrofa11.1 genomes. b Sequence characteristics of previously unresolved regions. c Comparison of DNA resequencing data mapping rates. d Comparison of correctly mapped resequencing data. e Comparison of base error rates in resequencing data. f Comparison of RNA resequencing data mapping rates.

To evaluate the practical utility of NJP-T2T, we downloaded whole-genome sequencing (WGS) data for 400 individuals from public databases, including 200 indigenous pigs and 200 exotic pigs across four breeds, with sequencing depths ranging from 8.5× to 38.27× (Supplementary Data 19). When aligned to NJP-T2T and Sscrofa11.1, significantly higher mapping rates were observed for NJP-T2T in both indigenous (99.21% for NJP-T2T and 98.64% for Sscrofa11.1) and exotic pigs (99.18% for NJP-T2T and 98.84% for Sscrofa11.1) (Fig. 3c). The properly mapped read rates further demonstrated NJP-T2T’s superiority, achieving 97.44% in indigenous and 96.39% in exotic pigs, whereas Sscrofa11.1 showed lower rates of 95.86% and 95.91%, respectively (Fig. 3d). This indicates enhanced accuracy in identifying functional genomic regions and reduced misalignment events.

Base error rate analysis revealed that NJP-T2T exhibited a significantly lower error rate in indigenous populations (0.98% for NJP-T2T and 1.27% for Sscrofa11.1) (Fig. 3e), confirming its higher assembly accuracy for local breeds. However, a marginally elevated error rate was observed in European pigs (1.11% for NJP-T2T and 0.90% for Sscrofa11.1), likely attributable to greater genetic divergence between breeds.

RNA-seq data (Supplementary Data 20) alignment showed improved mapping rates for NJP-T2T (90.26% for indigenous pigs using NJP-T2T and 89.47% using Sscrofa11.1; 90.67% for exotic pigs using NJP-T2T and 89.85% using Sscrofa11.1) (Fig. 3f), underscoring its utility in transcriptomic studies. Structural variant (SV) detection using short-read data from three individuals per group demonstrated NJP-T2T’s increased sensitivity, with higher counts of deletions/insertions (Del/Ins) and inversions/translocations (Inv/Tra) identified compared to Sscrofa11.1 (Supplementary Fig. 8).

In summary, the NJP-T2T genome demonstrates superior accuracy and applicability in both DNA and RNA sequencing analyses, particularly for genetic studies of indigenous pig breeds. These experimental validations establish NJP-T2T as a comprehensive and reliable genomic resource for advancing swine genetics research and breeding optimization.

Gene family analysis

Gene family clustering analysis involving Large White, Landrace, Duroc, Berkshire, African Warthog, Jinhua, Bamei, Tibetan, and Rongchang pigs identified 14,376 core gene families shared across these breeds (Fig. 4a). These shared gene families form the genetic foundation for essential biological functions and physiological processes. Their universal presence reflects a high degree of evolutionary conservation during critical stages of development, establishing a solid framework for comparative genomic investigations among diverse pig lineages. Strikingly, 75 gene families were found to be unique to the Neijiang pig genome, the highest number of breed-specific families identified in this analysis (Fig. 4a). This notable enrichment is likely driven by both the exceptional completeness and accuracy of the NJP-T2T genome assembly, as well as the distinctive genetic background and adaptive traits of the Neijiang breed. Together, these unique gene families provide important insights into the genetic distinctiveness and potential functional specialization of the Neijiang pig.

Fig. 4: Gene family analysis results.
Fig. 4: Gene family analysis results.
Full size image

a Gene family clustering analysis of Large White, Landrace, Duroc, Berkshire, Neijiang pig, Mei shan pig, Jinhua pig, Baimei pig, Tibetan pig, and Rongchang pig. b Summary of homologous genes. c Distribution of Neijiang pig-specific genes on chromosomes. d GO functional enrichment analysis of Neijiang pig-specific genes. e KEGG pathway enrichment analysis of Neijiang pig-specific genes. f Unique gene families of Neijiang pig, Jinhua pig, Baimei pig, Tibetan pig, and Rongchang pig.

Comparative genomic analysis revealed a substantial number of multicopy orthologous genes in the NJP-T2T genome (Fig. 4b). Notably, 318 genes were found to be unique to the Neijiang pig, which were absent in all other breeds (Supplementary Data 21), underscoring the genomic distinctiveness of the Neijiang breed. These genes were distributed across all chromosomes of the NJP-T2T genome, with dense clustering observed on the short arms (p-arms) of chromosomes 7 and 10, as well as on chromosome 13 (Fig. 4c), suggesting potential functional or adaptive specialization in these regions. Functional enrichment analysis demonstrated that these genes are involved in a wide range of processes, from basic cellular components and metabolic pathways (Fig. 4d) to complex physiological systems such as sensory perception, immunity, and aging regulation (Fig. 4e). The enrichment profiles indicate that Neijiang pigs possess genetic advantages in environmental adaptation, health maintenance, and developmental regulation. In particular, the enrichment of sensory-related genes is closely associated with their enhanced environmental perception, such as olfactory sensitivity. Direct comparison with other indigenous breeds (Jinhua, Bamei, Tibetan, Rongchang) identified 803 Neijiang-specific gene families encompassing 1330 genes, exclusively present in the Neijiang genome (Fig. 4f). Functional analysis of these families revealed profiles comparable with those in Fig. 4, further evidencing Neijiang pigs’ unique adaptive specializations(Supplementary Fig. 9).

Population genetic structure and inbreeding analysis of Neijiang pigs

Based on the genomic data of the Chinese National Neijiang Pig Genetic Resources Conservation Farm, the genetic diversity, population structure, and inbreeding level of the population were analyzed. The results of genetic diversity analysis (Supplementary Data 22) show that the average frequency of minor alleles in this population is 0.126, which is moderately low, indicating that the population retains some genetic variation, but the major allele dominates. The effective number of alleles is 1.286, suggesting that alleles are unevenly distributed across many loci, yet the population still possesses some level of diversity. The observed heterozygosity and expected heterozygosity are 0.205 and 0.206, respectively (Supplementary Data 23), reflecting moderate genetic variation both within individuals and the overall population. Taken together, the Neijiang pig population exhibits a moderate level of genetic diversity, providing a solid foundation for conservation and sustainable use, highlighting its genetic value as a local breed.

Principal component analysis (PCA) of 115 individuals from the Neijiang pig population reveals that the first three principal components (PC1, PC2, and PC3) explain 37.93% of the genetic variation. The PCA scatter plot (Fig. 5a) shows that the population forms a relatively uniform cluster, reflecting that there is no obvious subpopulation differentiation within the Neijiang pig population. While the population possesses some genetic diversity, the level of diversity is not particularly rich. The phylogenetic tree, constructed based on genetic distance (Supplementary Fig. 10), divides the Neijiang pig population into six distinct lineages. Furthermore, the kinship matrix (Fig. 5b) constructed from the same dataset further reveals the genetic structure of the population: most individuals have distant kinship, indicating diversity in the overall genetic background; however, there are also closer relationships between some individuals, suggesting the possible presence of local inbreeding.

Fig. 5: Population structure and inbreeding analysis of Neijiang Pig.
Fig. 5: Population structure and inbreeding analysis of Neijiang Pig.
Full size image

a PCA analysis of Neijiang pig. b Phylogenetic relationship of Neijiang pig population. c ROH length distribution. d Distribution of ROH on different chromosomes. e SROH-NROH scatter plot.

Inbreeding coefficients for the Neijiang pig population were calculated using the Plink (v1.90) software based on genomic data, with an average value of 0.119, indicating a moderately low inbreeding risk. A total of 1126 runs of homozygosity (ROHs) were detected in the analysis (Supplementary Data 24), with an average length of 790,724 bp, reflecting the commonality of short ROHs in the population. 64.2% of the ROHs are distributed within the shorter range of 0.5 to 0.8 Mb (Fig. 5c), likely resulting from distant historical inbreeding or random homozygosity in large populations, indicating that the population faces some homozygosity pressure while still retaining moderate genetic diversity. Figure 5d shows that ROHs are distributed across different chromosomal regions but are unevenly spread, with chromosomes 7 (263 ROHs) and 13 (206 ROHs) exhibiting significantly higher numbers compared to other chromosomes (e.g., chromosomes 11 and 18 show almost no ROHs), reflects the effects of local selection or genetic drift. The sum of ROH lengths and number of ROHs (SROH-NROH) scatter plot (Fig. 5e) reflects the individual variations in genetic diversity and inbreeding levels, with each point representing an individual. Individuals in the high N (ROH count) and high S (total ROH length) region exhibit higher inbreeding levels, while those in the low N and low S region show better genetic diversity, indicating that the population overall exhibits a certain degree of inbreeding variation. Overall, the Neijiang pig population has a relatively low level of inbreeding, but the high ROH count and length in some individuals suggest a certain inbreeding risk. The population maintains a moderate level of genetic diversity, although short ROHs dominate, and overall homozygosity still exerts some pressure on diversity.

GWAS of Reproductive Traits in Neijiang Pigs

This study conducted a GWAS using 115 individuals from the Chinese National Neijiang Pig Genetic Resources Conservation Farm (Sichuan, China) with well-documented phenotypic records of reproductive traits, aiming to identify key genetic loci influencing these traits in Neijiang pigs. Descriptive statistics of four parity records (total number born, TNB; number born alive, NBA) are provided in Supplementary Data S25. Histograms of reproductive traits exhibited approximately normal distributions with high inter-parity variability (Fig. 6a). Imputation of low-coverage sequencing data was performed using the STITCH-Beagle pipeline, yielding 19,785,303 SNPs with an imputation accuracy of 0.85 (Fig. 6b). Assessment of the genomic inflation factor (0.996 < λGC < 1.034) and QQ plots (Fig. 6c-d) indicated no substantial inflation, suggesting that the GWAS results were unlikely to be driven by systematic bias.

Fig. 6: Genome-wide association analysis of reproductive traits.
Fig. 6: Genome-wide association analysis of reproductive traits.
Full size image

a Distribution of reproductive traits in Neijiang pig. b Filled SNP density. c Manhattan plot and QQ plot of total litter size GWAS. d Manhattan plot and QQ plot of live birth GWAS.

The GWAS revealed significant genetic heterogeneity across parities, with the first parity analysis detecting only one significant locus (Supplementary Data 26) lacking direct gene associations (Fig. 6c), while no significant loci were identified in the fourth parity, potentially due to limited sample size, data resolution, or environmental confounders. In contrast, analyses of the second and third parities detected multiple significant loci on chromosomes 11 and 8, respectively (Fig. 6).

For NBA in the second parity, 11 significant loci were mapped to chromosome 11, including SHISA2 (rs511232), ATP8B1 (rs511340), CDK8 (rs511480), GPR12 (rs511692), and MTUS2 (rs513144). Notably, the CDK8 locus (rs511480) showed significance in both NBA and TNB analyses. Previous studies have reported that CDK8 is required for embryo implantation in mice16, and functional validation following GWAS has also demonstrated its role in granulosa cell proliferation in pigs17. Furthermore, SMG1 is essential for embryogenesis and early embryonic development18. Taken together, these results provide evidence for deciphering the genetic characteristics underlying the reproductive performance of Neijiang pigs and identify potential loci as well as their associated genes. These molecular markers may offer support for future breeding strategies.

Selective characteristics of head morphology in Neijiang pigs

Neijiang pigs exhibit the characteristic head shapes typical of Chinese indigenous pig breeds, particularly the Shizitou-type (SZT) and Erfangtou-type (EFT) (Fig. 7). SZT Neijiang pigs have very short snouts, prominent facial wrinkles, and generally short head shapes. In contrast, EFT pigs possess longer snouts, fewer facial wrinkles, and overall longer head shapes4. These distinctive head types are not only key phenotypic markers of the breed but also reflect adaptive evolutionary processes. To elucidate the genetic basis of head differentiation in Neijiang pigs, we performed selection signature analysis using high-depth (>100×) resequencing data from individuals with divergent extreme head morphotypes. Investigating the genetic underpinnings of these morphotypes is critical for understanding their environmental adaptation and phenotypic evolution.

Fig. 7: Selection signal analysis of Neijiang pig’s special head shape.
Fig. 7: Selection signal analysis of Neijiang pig’s special head shape.
Full size image

a PCA validation of Neijiang pig special head shape grouping. b Phylogenetic tree and feature display of individuals with special head shapes in Neijiang pig. c π value comparison between SZT and EFT populations. d ROD comparison between SZT and EFT populations. e Fixation index of differentiation between SZT and EFT populations. f Selected features of OR gene family. g RT-qPCR of selected OR genes (n = 3 biologically independent animal experiments). Error bars represent the mean ± SD from three independent experiments. ** indicates p < 0.001; * indicates p < 0.05; ns indicates no statistically significant difference.

We first validated our grouping of eight pigs through PCA. The PCA clearly segregated the samples into two clusters (Fig. 7a), concordant with phenotypic classifications. To further resolve phylogenetic relationships, we constructed a phylogenetic tree using a Duroc pig as an outgroup. The tree revealed distinct branches for SZT and EFT (Fig. 7b), confirming the robustness of our grouping strategy.

Genetic divergence between the two morphotypes was assessed using sliding-window nucleotide diversity (π) analysis. While π distributions were largely overlapping, significant divergence was observed on chromosomes 1, 6, 9, and 13 (Fig. 7c), suggesting selective pressures in these regions. ROD (ratio of observed to expected differences) analysis identified additional selection signals on chromosomes 1, 4, 8, and 13 (Fig. 7d). Fixed differentiation index (Fst) analysis further highlighted multiple regions under selection (Fig. 7e). Integrating Fst and ROD thresholds (top 5%), we identified 148 high-confidence selection windows containing 301 candidate genes (Supplementary Data 27). Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment revealed significant associations with sensory perception, immune response, and environmental adaptation (Supplementary Fig. 11), implicating these processes in head differentiation. Notably, a selection window on chromosome 1 contains a gene cluster consisting of five olfactory receptor (OR) genes (OR5L1, OR5D14, OR4S2, OR140, OR4C11), which are not included in the current NCBI pig reference genome assembly (Fig. 7f). To verify the involvement of these olfactory receptor (OR) genes in head morphology differentiation of Neijiang pigs, we analyzed their expression patterns in the SZT and EFT groups using RT-qPCR. The results showed that OR5L1 exhibited significantly higher relative expression in the EFT group compared to the SZT group (*, P < 0.05); OR5D14 showed no significant difference in expression between SZT and EFT (ns); whereas OR4S2, OR140, and OR4C11 displayed markedly higher expression in EFT than in SZT (**, P < 0.01) (Fig. 7g, Supplementary Data 28). These findings confirm that these previously unassembled OR genes exhibit transcriptional activity in Neijiang pigs and further support their functional role in shaping the head features of Neijiang pigs.

Discussion

T2T genome assembly has proven to be a powerful tool for elucidating the genetic basis of various species19. For the Neijiang pig genome, the T2T assembly generated a near-complete sequence, with only a 500 bp gap remaining on chromosome 16. We compiled the chromosome-anchored sequence lengths of four existing pig T2T genomes: NJP-T2T (2.61 Gbp), T2T_Mpig1.0 (2.61 Gbp), T2T_RCpig1.0 (2.63 Gbp)20, and JH-T2T (2.57 Gbp)21. The genome size of NJP-T2T is similar to those of T2T_Mpig1.0, T2T_RCpig1.0, and JH-T2T. Differences in genome size may be attributed to breed variation, differences in telomere and centromere lengths of the assembled individuals, as well as assembly methodologies. Compared to other representative indigenous pig genomes, it represents one of the most contiguous and reliable assemblies currently available20,21,22,23. Moreover, compared to the NCBI pig reference genome (Sscrofa11.1), the NJP-T2T assembly demonstrates higher alignment accuracy and a lower error rate in variant detection. This genome exhibits broad applicability in genetic analyses of Chinese indigenous pigs, particularly the Neijiang pig.

Centromeric regions are essential for chromosome segregation and stability, but their highly repetitive sequences pose significant challenges for analysis24. Currently, commonly used approaches for centromere detection include software-based predictions and ChIP-seq25,26,27. Software-based prediction typically localizes centromeres by enriching for tandem repeat sequences. Although this approach is convenient, its precision is often limited. ChIP-seq, combined with CENH3 or CENP antibodies, offers more accurate centromere mapping, but species-specific antibodies are unavailable for many organisms. For example, pigs lack a CENP antibody. In our study, we selected a rabbit-derived anti-CENPA antibody (ab217622, Abcam, Cambridge, UK), which shares 96% homology with Neijiang pig CENPA, for ChIP-seq analysis. However, Western blotting failed to effectively detect CENPA in Neijiang pig. Additionally, the centromere localization results from tools such as quaraTET were inconsistent with prior studies in pigs. Ultimately, we preliminarily located centromeric regions based on their characteristic genomic features and confirmed their accuracy by aligning with previously validated pig centromeric satellite sequences. Subsequently, we performed clustering analysis of repeat monomers28, successfully identifying and mapping the centromeric regions of the Neijiang pig genome for the first time.

As one of the representative indigenous pig breeds in China, Neijiang pigs are known for their exceptional meat quality, reproductive performance (relative to other local breeds), and environmental adaptability29. However, with the advancement of modern farming practices, their genetic diversity and resources are at risk of loss and endangerment. Genetic diversity analysis reveals that the Neijiang pig population exhibits moderate genetic variation, with a predominant presence of fixed alleles, though some diversity is still retained. However, certain individuals show a higher number of ROH, suggesting potential local inbreeding risks. These results indicate that the breed is susceptible to genetic drift, underscoring the need to control inbreeding within the limited gene pool to ensure the sustainable development of the population.

In the GWAS analysis, we detected several significant loci associated with reproductive traits in the second and third parities, primarily on chromosomes 11 and 8. These loci involved candidate genes such as CDK8 and SMG1, which are known to participate in critical biological processes, including embryonic development and cell fate regulation16,17,18, suggesting that they may contribute to the genetic regulation of reproductive traits in Neijiang pigs. In the analyses of both the number born alive and the total number born, CDK8 and SMG1 showed potential associations. Further queries in the PigGTEx database showed that these genes also exhibited associations with reproductive indicators such as teat number and uterine capacity, providing external evidence in support of our GWAS findings. It should be noted, however, that the statistical power of this study is limited by the relatively small sample size, and these results will require validation in larger populations and through integration with transcriptomic or other multi-omics data. Nevertheless, our findings provide foundational insights into the genetic basis of reproductive performance in Neijiang pigs and highlight candidate genes that may serve as valuable references for future molecular breeding research.

In the historical and evolutionary context of Neijiang pigs, significant head differentiation reflects the development of breed characteristics, mainly including two groups: “SZT” and “EFT.” The SZT type shows a lion-like head with a full face and short snout, while the EFT type has a longer snout, similar to common local pigs. Due to environmental changes and neglect, the SZT phenotype has gradually disappeared. Field investigations of several conservation farms and breeding bases revealed that only five individuals currently show clear SZT features, making them rare materials for studying the genetic mechanism of head morphology in this breed.

The Neijiang pig population has been influenced by environmental changes and feeding practices throughout its history. The breed can be traced back to the Han Dynasty. During the Ming Dynasty, the expansion of sugarcane cultivation increased feed availability, reduced the pigs’ reliance on olfaction, and helped retain cranial traits such as the SZT type. However, with social changes and shifts in breeding practices, traditional free-range feeding was restored, increasing reliance on olfaction and leading to a rise in the proportion of the EFT group.

To investigate the genetic mechanisms underlying head variation in Neijiang pigs, we performed PCA and phylogenetic analysis. The results revealed significant genetic differences between the SZT and EFT groups, providing genetic evidence for head-type differentiation. Selection signal analysis showed that head-type variation is closely related to gene pathways involved in sensory function and environmental adaptation. Notably, an unassembled cluster of OR genes, absent from Sscrofa11.1, was identified on chromosome 1, with stronger selective pressure observed in the EFT population for this region. Real-time quantitative PCR analysis further confirmed significant transcriptional activity of these previously unassembled OR genes in the olfactory epithelium of Neijiang pigs. Moreover, the expression levels of OR5L1, OR4S2, OR140, and OR4C11 were significantly higher in the EFT group compared to the SZT group. This transcriptional evidence reinforces the functional role of OR genes in shaping head traits in Neijiang pigs and suggests that differences in olfactory capacity may be an important factor driving changes in head morphology. In addition, gene family analysis revealed that Neijiang pig-specific gene families are also enriched in sensory-related functions, further supporting the uniqueness of this breed in sensory traits.

In summary, this study not only generated the first high-quality T2T reference genome of the Neijiang pig, but also provided the first systematic characterization of its centromeric regions, offering a technical framework for studying complex genome structures in pigs. Based on this genomic platform, we achieved high-resolution detection of genetic variation and structural evaluation in conservation populations. Combined with GWAS analysis, we precisely identified key genomic regions associated with reproductive traits, filling a gap in genetic research on Neijiang pigs in this area. These findings provide a theoretical foundation for the conservation and utilization of Neijiang pig germplasm resources and establish a methodological paradigm for exploring the genetic mechanisms of Chinese indigenous pig breeds and developing precision breeding strategies. Future research will focus on the co-evolutionary mechanisms between head morphology and olfactory perception, and will integrate multi-omics approaches to further explore the structural and functional relationships of centromeric regions, paving reliable paths for the innovative use of local pig genetic resources.

Materials and methods

Ethics approval

All experimental procedures in this study were approved by the Institutional Review Board of Sichuan Agricultural University (IRB14044) and conducted in accordance with the ethical guidelines and regulations established by the Institutional Animal Care and Use Committee. The study was performed under permit number DKY-B20220358. The ethical review and approval covered all samples and treatments reported in this paper. We have complied with all relevant ethical regulations for animal use.

Sample collection and sequencing

A 12-month-old female Neijiang pig was selected from the standardized breeding population at the Chinese National Neijiang Pig Genetic Resources Conservation Farm (Neijiang, Sichuan, China). The longissimus dorsi muscle was collected within 30 min post-slaughter for DNA sequencing to support T2T genome assembly and subsequent analyses. In addition, to facilitate gene annotation, tissue samples from seven organs of one 12-month-old female Neijiang pig were collected for RNA-seq, and tissue samples from six organs of another 12-month-old female Neijiang pig were collected for Iso-seq (Isoform Sequencing). All pigs in this study were humanely slaughtered according to the Chinese National Standard GB/T 22569-2008, the Technical Specification for the Humane Slaughter of Pigs. The procedure involved immediately rendering the pigs unconscious via electrical stunning, followed by exsanguination through sticking within 15 seconds after stunning.

To analyze the population structure and inbreeding level of Neijiang pigs and further investigate their reproductive performance, a total of 115 individuals with complete breeding records were selected from the National Neijiang Pig Genetic Resources Conservation Farm (Neijiang, China). Ear tissue samples were collected and subjected to whole-genome DNA sequencing. The resulting data were used for population structure analysis, inbreeding coefficient estimation, and GWAS.

The Shizitou-type (SZT) Neijiang pig represents a nearly extinct subpopulation within the breed. To define the extreme morphological differences between SZT and Erfangtou-type (EFT) individuals, classification criteria were established based on typical head features: individuals with snouts nearly level with the forehead and with prominently raised forehead wrinkles were defined as extreme SZT; individuals with snouts clearly protruding and misaligned with the forehead, and with few or no forehead wrinkles, were defined as extreme EFT. Based on these criteria, eight Neijiang pigs with markedly different head morphologies were selected from the conservation population and subjected to high-depth whole-genome sequencing (100×) to investigate the genetic basis underlying head phenotype variation.

Library preparation and sequencing

High-molecular-weight DNA was extracted from muscle tissue using a long-fragment column-based method for constructing PacBio HiFi, Hi-C, and ONT libraries (Table 1).

Table 1 Sequencing data statistics

For PacBio HiFi sequencing, DNA was sheared to 15–20 kb fragments using a Megaruptor system, followed by damage repair and enzymatic processing to generate SMRTbell libraries. Sequencing was performed on the PacBio Sequel II platform with five SMRT Cells, yielding high-quality circular consensus sequencing (CCS) data after filtration.

For Hi-C sequencing employed in chromosome-level scaffolding, muscle tissue was crosslinked with 1% formaldehyde (Sigma) and resuspended in lysis buffer. Chromatin was fragmented using the MboI restriction enzyme, followed by biotin labeling with T4 DNA ligase (Enzymatics). The biotinylated DNA fragments were captured using streptavidin-coated magnetic beads (Thermo Fisher Scientific) and sequenced on one lane of the DNBseq platform.

For ONT sequencing aimed at gap filling, high-molecular-weight DNA was processed as above. DNA fragments longer than 200 kb were selected using the BluePippin automated size selection system (Sage Science) and processed with the Ligation Sequencing 1D kit. Sequencing was conducted on the Nanopore platform at BGI.lFor RNA-seq that supported genome annotation, total RNA was extracted from seven tissues (heart, liver, spleen, lung, kidney, intestines, and muscle) using TRIzol reagent (Invitrogen). mRNA libraries were constructed using the Optimal Dual-mode mRNA Library Prep Kit (BGI-Shenzhen, China) and sequenced on the Illumina NovaSeq 6000 platform

For full-length transcriptome (Iso-seq) analysis also used in annotation, pooled total RNA from six tissues (heart, liver, spleen, lung, kidney, and muscle) was used to prepare SMRTbell libraries with two size-selected fractions (0–5 kb and 4.5–10 kb). Libraries passing quality control were sequenced on the PacBio Sequel II platform to capture full-length isoform diversity.

For resequencing applied in population genetic analyses, genomic DNA was extracted from 123 muscle samples via the cetyltrimethylammonium bromide (CTAB) method. Libraries for 115 individuals with breeding trait records were prepared using the TruSeq Nano DNA HT Kit (Illumina) and sequenced at approximately 1× coverage on the Illumina NovaSeq 6000 platform for population genomic analysis and GWAS. Separately, libraries for the remaining eight individuals with distinctive head morphology were processed with the BGI Optimal DNA Library Prep Kit and subjected to >100× deep resequencing on the DNBSEQ platform for selection sweep analysis.

Genome assembly and evaluation

Primary contig-level genome assembly was performed using Hifiasm (v0.16.1-r375)30 with PacBio HiFi reads preprocessed by CCS v6.2 (Pacific Biosciences, CA, USA) under parameters: --minPasses 3 --min-rg 0.99 --min-length 500. Hi-C reads were processed to remove adapters and low-quality sequences, then aligned to the contig-level assembly using Juicer (v1.5.6)31 to generate valid chromatin interaction pairs. Chromosome-level scaffolding was achieved with 3D-DNA (v180922)32, leveraging Hi-C-derived genomic proximity signals. To resolve residual gaps, Nanopore ultra-long reads were filtered by Nanopore_IOC.pl (BGI, Shenzhen, China; parameters: -minlen 5000 -mingv 7), error-corrected with NECAT (v2020)33, and integrated into the assembly using LR_gapcloser (v3) for gap closure (Supplementary Fig. 12).

The final gap-free genome assembly was rigorously evaluated through multiple metrics. Read mapping rates and coverage depths were calculated by aligning PacBio HiFi and ONT reads to the assembly using Minimap2 (v2.24)34. Continuity was assessed via contig and scaffold N50 statistics. Base-level accuracy was quantified using Merqury (v1.3)35 with a k-mer size of 21 to generate quality value (QV) scores. Completeness was validated through BUSCO (v5.4.7)36 analysis against the Tetrapoda and Vertebrata databases.

Repetitive sequence annotation

The annotation process began with de novo prediction using RepeatModeler2 (v2.0.1)37, which integrates outputs from RECON (v1.0.8)38 and RepeatScout (v1.0.6)39. Predicted repeats were classified via RepeatClassifier with reference to the Dfam (v3.5) database. Long terminal repeat (LTR) retrotransposons were specifically annotated using LTR_retriever (v2.9.0)40, which combines predictions from LTRharvest (v1.5.10)41 and LTR_FINDER (v1.07)42.

A custom repeat library was constructed by merging de novo predictions with known repeat entries from Dfam, followed by deduplication. This library was then used as input for RepeatMasker (v4.1.2)43 to systematically annotate transposable elements (TEs) across the genome (Supplementary Fig. 13 and 14a).

Tandem repeats were identified through complementary approaches: microsatellites were predicted using MISA (v2.1)44, while larger tandem repeats were annotated with Tandem Repeat Finder (TRF, v4.09)45 under parameters: 2 7 7 80 10 50 500 -d -h.

SDs were identified using BISER (v1.4)46. To ensure reliability, candidate SDs were retained only if they satisfied the following criteria: (1) sequence identity greater than 90%; (2) alignment gaps covering no more than 50% of the sequence; (3) aligned length of at least 1 kb; and (4) satellite repeats, as annotated by RepeatMasker, comprising no more than 70% of the sequence.

Protein-coding gene annotation and evaluation

Protein-coding genes were annotated through an integrated approach combining de novo prediction, homology-based inference, and transcriptome evidence (Supplementary Fig. 14). De novo gene structures were predicted using Augustus (v3.1.0)47 and SNAP (2006-07-28 release)48, leveraging intrinsic genomic features. Homology-based predictions were generated with GeMoMa (v1.7)49 through cross-species alignment of orthologous sequences. Transcriptome-guided annotation utilized two complementary strategies: (1) Illumina RNA-seq reads were aligned to the genome via HISAT2 (v2.1.0)50, assembled into transcripts with StringTie (v2.1.4)51, and processed by GeneMarkS-T (v5.1)52 for gene prediction; (2) full-length Iso-seq transcripts assembled through RNA-Bloom (v2.0.0)53 were integrated with PASA (v2.4.1)54 to model isoform diversity. Additionally, long-read transcripts were aligned to the genome using GMAP (2020-06-30 release)55, followed by splice junction refinement with PASA.

Predictions from all three approaches (de novo, homology-based, and transcriptome-guided) were consolidated using EvidenceModeler (EVM, v1.1.1)56 and further polished with PASA. The final gene set was functionally annotated against multiple databases, including NR, eggNOG (v5.0)57, Gene Ontology (GO), KEGG58, TrEMBL, KOG, SWISS-PROT59, and Pfam (v35.0)60.

Non-coding RNA and pseudogene annotation

Non-coding RNAs were systematically annotated using a combination of tool-based predictions and database alignments. Transfer RNAs (tRNAs) were identified using tRNAscan-SE (v1.3.1)61, while ribosomal RNAs (rRNAs) were predicted via barrnap (v0.9). Small non-coding RNAs, including microRNAs (miRNAs), small nucleolar RNAs (snoRNAs), and small nuclear RNAs (snRNAs), were annotated by aligning sequences to the Rfam database (v14.9)62 with Infernal (v1.1)63.

Pseudogenes, defined as genomic loci homologous to functional genes but rendered non-functional by frameshift mutations or premature stop codons, were identified through a two-step pipeline: (1) GenBlastA (v1.0.4)64 was used to detect homologous sequences in the genome after masking functional gene loci; (2) putative pseudogenes were validated by screening for disruptive mutations (e.g., premature termination codons, indels) using GeneWise (v2.4.1)65. Protein domain motifs were annotated with InterProScan (v5.34-73.0)66.

Centromere and telomere identification

Centromeric regions were identified based on their distinct genomic signatures, including minimal gene density, reduced interspersed repeat content, stable sequence entropy, elevated tandem repeat density, GC-rich sequences, and cytosine methylation patterns. Initial localization of centromeres was performed by aligning NCBI-curated porcine centromeric repeats to the NJP-T2T genome using BLAST (v2.15.0)67 with stringent thresholds (identity ≥90%, e-value < 1e-10), followed by calculating coverage within 20 kb sliding windows. Sequence complexity was assessed in 20 kb windows using NeSSie68, while genomic features such as gene density, tandem/inverted repeats, and GC content were quantified using prior annotation data. Cytosine methylation (5mC) profiles were generated from ONT fast5 data through Nanopolish (v0.14.0)69 and Minimap234 alignments, with methylation coverage evaluated in 20 kb windows via BEDTools (v2.31.1)70. Putative centromeric regions were further analyzed by constructing k-mer libraries with KMC71 (parameters: -fm -k151 -ci10 -cs100000), followed by identification of centromeric repeat monomers using SRF72 based on k-mer frequency distributions. Final centromere boundaries were determined by integrating chromosomal architectural features and repeat unit patterns. Telomeres were annotated using quarTeT (v1.2.4)73 with species-specific parameters (-c animal -m 10).

Comparative analysis with NCBI reference genome

Given that the NJP-T2T genome was derived from a female individual and therefore lacks the Y chromosome, comparative analyses were restricted to autosomes and the X chromosome. Structural comparisons between NJP-T2T and Sscrofa11.1 were performed by first aligning the genomes using Minimap234, followed by structural annotation and visualization through syri (v1.6.3)74. PURs were detected by aligning the NJP-T2T assembly to the Sscrofa11.1 reference using Winnowmap (v2.03)75, converting the alignments into PAF format with paftools.js, and extracting unresolved intervals with BEDtools (v2.31.1)70.

Resequencing data from 200 individuals (encompassing four indigenous and four exotic pig breeds) were obtained from public repositories. RNA-seq datasets from five additional indigenous and five exotic breeds (10 individuals per breed) were included, combining publicly available data with newly generated sequences from this study. Resequencing reads were aligned to the reference genome using BWA (0.7.17-r1188)76, with mapping statistics (mapping rate, base error rate, proper pair rate) calculated via SAMtools (v1.9)77. SV detection was implemented through a pipeline integrating discordant read pairs and split reads extracted from BWA alignments after duplicate removal with Picard. SVs were identified using Lumpy-sv (v0.2.13)78 and genotyped with svtyper (v0.7.1)79. RNA-seq reads were aligned to the genome using HISAT2 (v2.2.1)80, with alignment metrics systematically quantified.

Gene family analysis

Genome assemblies and GFF3 annotation files for Bamei pig (GCA_001700235.1), Jinhua pig (GCA_001700295.1), Meishan pig (GCA_001700195.1), Tibetan pig (GCA_000472085.2), Rongchang pig (GCA_001700155.1), Berkshire (GCA_001700575.1), Duroc (GCA_000003025.6), Landrace (GCA_001700215.1), and Large White pig (GCA_001700135.1) were obtained from public repositories (Ensembl). Gene family identification was performed using OrthoFinder v2.5.481 by integrating the NJP-T2T genome assembly with datasets from the 10 aforementioned species. Unannotated genes in NJP-T2T were functionally characterized via emapper (v2.1.9) against the EggNOG database (parameters: -m mmseqs -evalue 1e-5), which provides comprehensive functional annotations including Gene Ontology (GO) terms, KEGG pathways, and COG categories. The chromosomal distribution of unassigned genes was visualized using the RIdeogram82 package with a sliding window size of 1000 kb.

Variant detection

Raw sequencing data were first quality-assessed using FastQC (v0.11.9)83, followed by adapter trimming and low-quality sequence removal with Trimmomatic (v0.39)84 to generate high-quality clean reads. Clean reads were aligned to NJP-T2T using the BWA-MEM algorithm in BWA (0.7.17-r1188)76. Resulting BAM files were sorted with SAMtools (v1.9)77 and deduplicated using Picard’s MarkDuplicates module (parameter: REMOVE_DUPLICATES=true). The deduplicated BAM files were then used for variant detection with GATK (v4.1.9.0)85. Single nucleotide polymorphism (SNP) loci were extracted using SelectVariants (-select-type SNP) and filtered via VariantFiltration with thresholds: QD < 2.0 | | MQ < 40.0 | | FS > 60.0 | | SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0. Sites with >10% missing genotypes were removed using VCFtools (v0.1.17)86, yielding a variant set for population genetic analysis. Further filtering excluded loci with minor allele frequency (MAF) < 0.05, retaining high-quality biallelic SNPs for genome-wide association and selective sweep analyses.

Population genetic analysis

Genetic diversity, inbreeding levels, and runs of homozygosity (ROH) within the Neijiang pig conservation population were analyzed using PLINK (v1.90)87. SNP loci were filtered to exclude sites with missing rates >10% or significant deviations from Hardy-Weinberg equilibrium (HWE; p < 1e-6). Allele frequency spectra were generated via PLINK’s --freq parameter, with effective allele numbers (Ne) calculated as 1/(p² + q²), where p and q represent allele frequencies. Observed heterozygosity (Ho), expected heterozygosity (He), and inbreeding coefficients (F) were derived using --het.

Based on genome-wide SNP data, a multiple sequence alignment file in PHYLIP format was generated. Genetic distances were calculated using the dnadist program in the PHYLIP package (v3.697) with the default substitution model. A neighbor-joining (NJ) phylogenetic tree was then constructed using the neighbor program. The resulting tree was exported in Newick format and visualized using MEGA1188. Genetic relatedness matrices were generated via GCTA (v1.94.1)89 to assess kinship patterns. ROH segments were identified with PLINK parameters: --homozyg --homozyg-density 50 --homozyg-gap 100 --homozyg-kb 500 --homozyg-snp 50 --homozyg-window-het 1 --homozyg-window-snp 50 --homozyg-window-threshold 0.05.

Genome-wide association study

To enhance genotype accuracy from low-coverage sequencing data, the STITCH-Beagle imputation pipeline was implemented. The STITCH algorithm (v1.7.1)90, optimized for genotype imputation using low-depth data, was first applied with parameters K = 10 --nGen=320 --method = “diploid” --nIterations=40 to perform self-imputation on variant data from 115 phenotypically well-characterized Neijiang pigs, achieving an imputation accuracy of DR2 = 0.85. Subsequently, Beagle (v5.4)91 performed reference-based imputation using a dedicated reference panel of eight deeply sequenced Neijiang pig genomes (>100× coverage).

Imputed data were filtered through PLINK87 using thresholds --geno 0.1 --maf 0.05 --indep-pairwise 50 10 0.3 to retain high-confidence SNPs for analysis. In the GWAS, phenotype data was first corrected by fitting a linear model that controlled for fixed effects, including Farrowing year and Farrowing season. The residuals from this model were used directly as the corrected phenotype data for subsequent GWAS analysis. GWAS analysis was performed using the EMMAX (v.20120210)92, employing a mixed linear model (MLM) for association analysis. The model is expressed as:

$$y=X\beta +{G}_{\gamma }+{Zu}+\epsilon$$

where \(y\) is the phenotype vector, \(X\) is the fixed effect design matrix, \(\beta\) is the fixed effect parameters, \(G\) is the genotype matrix, \(\gamma\) is the genotype effect parameters, \(Z\) is the random effect matrix, representing the kinship matrix, \(u\) is the random effect, and \(\epsilon\) is the error term. The kinship matrix \(Z\) was used to account for population structure, and genome-wide significance thresholds were defined as 1/n, where n represents the effective number of independent SNPs.

Selective signatures of head morphology in Neijiang pigs

Genome-wide scans for selection signatures were performed using VCFtools (v0.1.17)86. Nucleotide diversity (π) was calculated in 100 kb sliding windows with 10 kb step size (--window-pi 100000 --window-pi-step 10000). The ratio of nucleotide diversity (ROD) between two phenotypic groups was derived from π values. Population differentiation was assessed through fixation index (FST) analysis using VCFtools parameters --fst-window-size 100000 --fst-window-step 10000.

Genomic regions under selection were identified by intersecting the top 5% of both FST and ROD values. Adjacent candidate windows were merged using BEDtools (v2.31.1)70 merge command. Genes overlapping these regions by at least 10% (assessed via BEDtools intersect -F 0.1) were retained as candidate selected genes. Gene structure visualization was implemented using GSDS (v2.0)93.

RT-qPCR assay

Three individuals each from the extreme head shape EFT group and the SZT group were selected for the collection of olfactory epithelium tissue. Total RNA was extracted from the cells using Trizol reagent (Invitrogen Corporation, Grand Island, NY, USA) following the manufacturer’s instructions. The extracted RNA was reverse-transcribed into cDNA using the PrimeScript RT reagent Kit (RR037Q) from Takara Bio, according to the recommended protocol provided with the kit.

Real-time fluorescence quantification of the synthesized cDNA was conducted using the Hieff UNICON® Universal Blue qPCR SYBR Green Master Mix kit (NextGen). The reaction mixtures were prepared, mixed, and centrifuged following the kit instructions. Fluorescence signal detection and quantification were performed on a BIO-RAD real-time PCR instrument. The primer sequences used for each gene in the qPCR analysis are listed in Supplementary Data 29.

Statistics and reproducibility

Statistical analyses and reproducibility information for this study are summarized as follows. Population genetic parameters and selective sweeps were analyzed using PLINK and VCFtools. Genome-wide association studies (GWAS) were performed using EMMAX under a mixed linear model, with the genome-wide significance threshold set at p < 1/n (where nn is the number of effective independent SNPs). Gene expression validation by RT-qPCR (three biological replicates per group) was assessed using a two-tailed Student’s t-test (p < 0.05). All raw sequencing data generated in this study have been deposited in public repositories. Specific sample sizes are provided in the respective methodological sections: de novo genome assembly (n = 1), population/GWAS cohort (n = 115), and selection sweep cohort (n = 8).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.