Introduction

Prokaryotes, as unicellular microorganisms, demonstrate extraordinary adaptability across diverse ecosystems. This adaptability can largely be attributed to key evolutionary mechanisms such as horizontal gene transfer (HGT), mutations, and genetic drift1. These processes are the principal drivers of prokaryotic evolution, continuously introducing novel genetic variations into the gene pool and thus promoting diversity at both the population and species levels2. Prokaryotic pan-genome analysis is a systematic method for identifying and characterizing all genes within a specific species. By examining the distribution patterns and functional diversity of gene families, researchers can gain insights into the evolutionary trajectories and adaptive strategies of species from a population perspective. With ongoing advancements in theoretical frameworks and technological innovations, pan-genome analysis has become a vital approach for studying the genomic dynamics of prokaryotes.

Three key developmental trends are transforming prokaryotic pan-genome research, marking a significant leap forward in our understanding of microbial genomic dynamics. Firstly, the rapid advancement of sequencing technologies, along with an explosion of large-scale datasets, has revolutionized the field. The number of analyzed strains has grown from just a few dozen in early studies to thousands today3,4,5. This remarkable growth in scale requires innovative analytical methodologies and heightened efficiency in processing the vast amount of data. Secondly, pan-genome analysis is shifting its focus from the localized examination of core genes to a holistic exploration of the entire pan-genome6,7,8,9. This evolution presents new challenges, such as accurately identifying paralogous genes resulting from recent duplication events and reliably distinguishing shell and cloud gene clusters. Addressing these challenges necessitates advanced analytical techniques capable of navigating this complex landscape. Finally, as we delve deeper into the intricacies of the pan-genome, the scope of research is expanding beyond the simple partitioning of homologous genes. There is now a robust focus on uncovering the evolutionary dynamics of gene families10,11,12,13. This broader approach highlights the pressing need for comprehensive analytical outputs and enhanced post-inference capabilities to grasp the evolutionary tales these genes truly tell. These collective trends signal a pivotal transformation in pan-genome research, steering us toward more extensive data integration, refined analytical methodologies, and the establishment of robust computational models14. Embracing these advancements not only enriches our scientific inquiry but also lays the foundation for groundbreaking discoveries in prokaryotic genomics.

Over the past two decades, a wide variety of pan-genome analysis methods have been developed, which fall broadly into three categories: reference-based, phylogeny-based, and graph-based. Reference-based methods, such as those built on eggNOG and COG, rely on established orthologous gene databases to identify orthologous genes by aligning genomic sequences with homologous gene annotations in these databases14,15,16. These highly efficient methods are well-suited for analyzing genomes with well-annotated reference data. In contrast, phylogeny-based and graph-based methods employ de novo approaches, focusing on gene family homology or positional characteristics. Phylogeny-based methods classify orthologous gene clusters using sequence similarity and phylogenetic information. These methods often employ techniques such as bidirectional best hits (BBH) or phylogeny-based scoring methods. By constructing phylogenetic trees, they aim to reconstruct the evolutionary trajectories of genes17,18. Graph-based methods, on the other hand, focus on gene collinearity and the conservation of gene neighborhoods (CGN), which refers to the preservation of gene order across genomes. By creating graph structures to represent relationships across different genomes, these methods enable the rapid and accurate identification of orthologous gene clusters19,20,21,22,23.

Current methods for analyzing gene clusters have made progress, but they still face some challenges. Reference-based methods depend on existing annotated datasets, making them less effective for studying new species. They primarily focus on sequence homology and often overlook other structural features, potentially leading to confusion between similar genes or paralogs. Phylogeny-based methods can track the origins of gene duplications through complex phylogenetic trees, but this process can be time-consuming and not ideal for large-scale analyses. Graph-based methods, while computationally efficient, often struggle with accuracy in clustering non-core gene groups, such as mobile genetic elements, and have difficulty adapting to high genomic variability among strains. Most tools primarily provide qualitative descriptions of gene clusters, lacking in quantitative characterization of gene relationships and attributes. This limitation restricts our understanding of orthologous gene functions and their evolution. Overall, there is a clear need for improved methods to enhance our understanding of gene functions and their evolutionary dynamics.

In 2011, we launched the pan-genome analysis pipeline (PGAP), a widely used workflow for analyzing genomes based on gene homology, designed for dozens of strains16. With the dramatic increase in the number of available prokaryotic genomes, there is growing demand for pan-genome analysis at a much larger scale.

Now, we present PGAP2, an enhanced method for pan-genome analysis that accommodates thousands of genomes. It offers comprehensive workflows and visualization tools to help users interpret the properties of the input strains. Validation with simulated and carefully curated datasets shows that PGAP2 consistently outperforms other methods in stability and robustness, even under high genomic diversity. To showcase its real-world application, we analyzed the pan-genome of 2794 zoonotic Streptococcus suis strains. This study highlights PGAP2’s effectiveness in handling diverse prokaryotic populations and its potential to advance prokaryotic genomics research.

Results

Overview of the method

The workflow of PGAP2 can be broadly divided into four successive steps: data reading, quality control, homologous gene partitioning, and postprocessing analysis (Fig. 1a).

Fig. 1: Overview of the PGAP2 workflow.

a The flowchart depicts four sequential stages of the PGAP2 pipeline, arranged from top to bottom: data preprocessing, quality control, gene clustering, and postprocessing analysis. b The core algorithm of PGAP2 begins by constructing an identity network and a synteny map, which serve as the foundational data structures for ortholog inference. Following regional refinement, PGAP2 iteratively merges nodes based on gene cluster diversity, connectivity, and Bidirectional Best Hit (BBH) criteria. The abbreviation “sp.” in this figure refers to “species”.

PGAP2 is compatible with various input formats

PGAP2 accepts four types of input data: GFF3, genome FASTA, GBFF, and GFF3 with annotations and genomic sequences. The last type, produced by genome annotation tools such as Prokka, combines a GFF3 annotation with its corresponding nucleotide sequence. PGAP2 can accept a mix of input formats and identifies the format of each file from its suffix. After reading and validating all the data, PGAP2 organizes the input into a structured binary file to facilitate checkpointed execution and downstream analysis.
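Suffix-based format detection of this kind can be sketched in Python as follows. This is a minimal illustration, not PGAP2's implementation: the suffix table, the function name detect_format, and the handling of a trailing .gz are all assumptions made for the example. Note that a suffix alone cannot distinguish a plain GFF3 file from a GFF3 file with an embedded sequence, so a real tool would need to inspect the file content for that case.

```python
from pathlib import Path

# Hypothetical suffix-to-format table (an assumption for illustration only;
# the suffixes actually recognized by PGAP2 may differ).
SUFFIX_TO_FORMAT = {
    ".gff": "gff3",
    ".gff3": "gff3",
    ".fa": "fasta",
    ".fna": "fasta",
    ".fasta": "fasta",
    ".gbff": "gbff",
    ".gbk": "gbff",
}


def detect_format(path: str) -> str:
    """Return an input-format label based only on the file suffix."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Ignore a trailing compression suffix such as ".gz" (assumed convention).
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes:
        raise ValueError(f"cannot detect format, no suffix: {path}")
    fmt = SUFFIX_TO_FORMAT.get(suffixes[-1])
    if fmt is None:
        raise ValueError(f"unrecognized suffix {suffixes[-1]!r}: {path}")
    return fmt


def group_inputs(paths):
    """Group a mixed list of input files by detected format."""
    groups = {}
    for path in paths:
        groups.setdefault(detect_format(path), []).append(path)
    return groups
```

Grouping a mixed input directory this way lets each format be handed to its own parser before the records are merged into a single structure.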

Perform quality control and generate feature visualization reports

If no specific strain is designated, PGAP2 selects a representative genome based on gene similarity across strains. PGAP2 then evaluates outliers using two methods. The first method is based on average nucleotide identity (ANI): if a strain’s ANI to the representative genome falls below a certain threshold (e.g., 95%), it is classified as an outlier. The second method compares the number of unique genes in the strain with those in other strains: a strain with a higher number of unique genes is more likely to be classified as an outlier. Additionally, PGAP2 generates interactive HTML and vector plots to visualize features such as codon usage, genome composition, gene count, and gene completeness, helping users assess input data quality (Supplementary Figs. 1 and 2).
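The two outlier checks can be sketched as follows. The ANI rule follows the text directly; the unique-gene rule is not specified in detail, so the cutoff used here (more than unique_fold times the median of the other strains) is a hypothetical stand-in, as are the function and parameter names.

```python
from statistics import median


def flag_outliers(ani_to_rep, unique_genes, ani_threshold=0.95, unique_fold=3.0):
    """Flag strains as potential outliers.

    ani_to_rep:   dict strain -> ANI (0..1) to the representative genome
    unique_genes: dict strain -> number of strain-specific genes
    The unique-gene cutoff (unique_fold x median of the other strains)
    is an assumption for illustration, not the rule used by PGAP2.
    """
    flags = {}
    for strain, ani in ani_to_rep.items():
        reasons = []
        # Check 1: ANI to the representative genome below the threshold.
        if ani < ani_threshold:
            reasons.append("low_ani")
        # Check 2: far more unique genes than the other strains.
        others = [n for s, n in unique_genes.items() if s != strain]
        if others and unique_genes[strain] > unique_fold * max(median(others), 1):
            reasons.append("many_unique_genes")
        if reasons:
            flags[strain] = reasons
    return flags
```

Returning the reason alongside each flagged strain keeps the two criteria separate, so a user can decide whether to drop a strain that fails only one of them.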

Infer orthologs through fine-grained feature analysis under a dual-level regional restriction strategy

The process of inferring orthologous genes in PGAP2 can be divided into three key steps: data abstraction, feature analysis, and result dumping (Fig. 1b).

In the first step, PGAP2 organizes the required data into two distinct networks: a gene identity network and a gene synteny network. In the identity network, edges represent the degree of similarity between genes, while in the synteny network, edges connect adjacent genes, that is, genes that are one position apart. PGAP2 then splits gene clusters that contain redundant genes within the same strain, using conserved gene neighborhoods (CGN) to ensure that the graph remains acyclic (Supplementary Note 1). It then calculates a diversity score from the updated networks; this score is used to evaluate the conservation level of orthologous genes in subsequent analyses.

In the second step, PGAP2 performs orthologous gene inference by traversing all subgraphs in the identity network. This process is divided into two main stages: regional refinement and feature analysis. In each iteration, PGAP2 applies a dual-level regional restriction strategy, evaluating gene clusters only within a predefined identity and synteny range. This targeted approach significantly reduces search complexity by focusing on a confined radius, enabling more detailed and efficient analysis of features within these clusters (Supplementary Note 2). The reliability of orthologous gene clusters is evaluated using three criteria: 1) gene diversity, 2) gene connectivity, and 3) the bidirectional best hit (BBH) criterion, which is applied to duplicate genes within the same strain. The detailed algorithms used for these assessments are outlined in the Methods section and Supplementary Note 3. Once merged, gene clusters are updated in the synteny network, and the procedure iterates until no remaining clusters meet the merging criteria.

Finally, PGAP2 merges nodes with exceptionally high sequence identity, which often arise from recent duplication events driven by horizontal gene transfer or insertion sequences. The properties of the orthologous gene clusters are then output, including the average identity, minimum identity, average variance, and uniqueness to other clusters, providing insights into genome dynamics. Further details of the graph algorithm are described in the Supplementary Methods.
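To make the two data structures and the BBH criterion concrete, the following Python sketch builds a toy identity network, a toy synteny edge set, and a set of bidirectional best hits. It is a simplified illustration under stated assumptions: the function names, the dictionary-based inputs, and the tie-breaking behavior are hypothetical, and the sketch does not reproduce the regional refinement or merging logic of PGAP2.

```python
from collections import defaultdict


def synteny_edges(gene_order):
    """Edges between genes that are one position apart on a contig."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def identity_network(pairwise_identity, min_identity):
    """Weighted adjacency dict keeping only pairs at or above min_identity.

    pairwise_identity: dict (gene_a, gene_b) -> identity in 0..1
    """
    graph = defaultdict(dict)
    for (a, b), ident in pairwise_identity.items():
        if ident >= min_identity:
            graph[a][b] = ident
            graph[b][a] = ident
    return dict(graph)


def bidirectional_best_hits(pairwise_identity, genes_a, genes_b):
    """Pairs (a, b) where each gene is the other's best-scoring hit."""

    def score(a, b):
        return pairwise_identity.get((a, b), pairwise_identity.get((b, a), 0.0))

    # Best hit of every gene in one set against the other set
    # (ties resolve to the first candidate, an arbitrary choice).
    best_ab = {a: max(genes_b, key=lambda b: score(a, b)) for a in genes_a}
    best_ba = {b: max(genes_a, key=lambda a: score(a, b)) for b in genes_b}
    return {
        (a, b)
        for a, b in best_ab.items()
        if best_ba[b] == a and score(a, b) > 0.0
    }
```

In this toy form, the identity network answers "which genes look alike" and the synteny edges answer "which genes sit next to each other"; the BBH check is what separates a true ortholog pair from a pair in which one gene merely resembles a duplicated copy.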

Record pan-genome profile and generate visualization reports

Similar to preprocessing, the postprocessing module generates interactive visualizations in both HTML and vector formats. These visualizations display the rarefaction curve, statistics of homologous gene clusters, and quantitative results of orthologous gene clusters. PGAP2 employs the distance-guided (DG) construction algorithm, initially proposed in PanGP13, to construct the pan-genome profile. Finally, by integrating multiple software tools, PGAP2 provides comprehensive workflows, which include sequence extraction, single-copy phylogenetic tree construction, and bacterial population clustering. This integration ensures a seamless experience for users (Supplementary Figs. 3–6).

Methods’ performance on the simulated dataset

To assess PGAP2’s robustness, we evaluated its accuracy using different thresholds for orthologs and paralogs, simulating variations in species diversity. We compared PGAP2 against five state-of-the-art tools (Roary, Panaroo, PanTa, PPanGGOLiN, and PEPPAN), all tested using their respective default parameters. The thresholds for orthologs were adjusted from 0.99 to 0.91, while those for paralogs ranged from 0.9 to 0.6. This approach resulted in 16 combinations of datasets.

We used the adjusted Rand index (ARI), which measures the agreement between the predicted and reference gene-cluster assignments, to evaluate the accuracy of the software tools. The results show that all the software tools achieved an average accuracy of over 90%. PGAP2 outperformed the other tools in most gradient combinations (14 out of 16), achieving an average accuracy of 0.9997. This exceeds that of the second-best tool, PEPPAN, which recorded an accuracy of 0.9990 (Fig. 2a and Supplementary Table 1). These results illustrate PGAP2’s robustness in simulated extreme diversity scenarios, demonstrating its ability to maintain high accuracy even under challenging conditions. These findings suggest that PGAP2 is particularly well-suited for analyzing highly diverse strains, where high precision is critical for accurate downstream analyses and biological interpretation. Furthermore, robustness tests using simulated incomplete annotations demonstrated that PGAP2 maintained over 99% clustering accuracy, even with significant gene loss and fragmentation. This further validates its reliability across various low-quality input conditions (Supplementary Fig. 7).
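For readers unfamiliar with the metric, the ARI between two clusterings of the same genes can be computed with the standard pair-counting formula, sketched below in plain Python. The function name and the list-based inputs are choices made for this example; how the benchmark encoded its cluster assignments before computing the ARI is not shown here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two label lists over the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    # Contingency counts: items sharing both a true and a predicted label.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    total_pairs = comb(n, 2)
    expected = sum_true * sum_pred / total_pairs if total_pairs else 0.0
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both clusterings place everything together.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index is 1 for identical partitions regardless of how the clusters are named, close to 0 for random assignments, and can be negative when agreement is worse than chance.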

Fig. 2: Performance comparison of various software on simulated data.

a The accuracy of PGAP2 and other software was assessed using the Adjusted Rand Index (ARI). The dataset comprises 16 combinations formed by varying orthologous thresholds (from 0.99 to 0.91) and paralogous thresholds (from 0.9 to 0.6). The software with the highest accuracy is marked with an asterisk (*). b shows the accuracy of each software across different cluster categories. c presents the accuracy of PGAP2 on orthologous and paralogous gene clusters, respectively. d summarizes the error types generated by each software: False merges refer to the incorrect merging of multiple clusters, and False splits refer to the incorrect splitting of a single cluster. e provides a runtime comparison of each software across varying numbers of genomes (from 50 to 1000), tested with 8 threads and default parameters. For panels b–d, the boxplots represent the first, second (median), and third quartiles; whiskers represent the most extreme data points within 1.5× IQR of the quartiles; and black dots indicate outliers. All sample sizes are n = 16, corresponding to the datasets described in panel (a).

By categorizing gene clusters from different perspectives, we further evaluated the accuracy (or error rate) of each software from three dimensions to demonstrate that PGAP2 does not exhibit any specific error bias.

First, we assessed the accuracy of PGAP2 across different categories of gene families, including strict core, core, shell, and cloud (Fig. 2b, Supplementary Table 2). Here, strict core genes are defined as those present in 100% of genomes; core genes are present in ≥95% of genomes; shell genes are found in 15% to <95% of genomes; and cloud genes are rare, occurring in <15% of genomes. The results showed that PGAP2 achieved the highest accuracy across all gene family types, with an average accuracy of 99.62% ± 0.32%. Notably, other software tools showed significantly lower accuracy in shell gene families, likely due to the high diversity in shell clusters. In contrast, PGAP2 maintained an error rate of only 0.77% in shell gene families, significantly outperforming the others, which ranged from 4.96% to 30.39%. This improvement can be attributed to PGAP2’s dynamic parameter planning, which enables it to maintain high accuracy even in highly diverse gene families.
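The frequency thresholds above translate directly into a small classifier. In the sketch below, the function name and the label strings are illustrative choices; because strict core clusters also satisfy the ≥95% core criterion, the function returns the most specific label.

```python
def classify_cluster(n_present, n_genomes):
    """Assign a pan-genome category from the number of genomes carrying a cluster.

    Thresholds follow the text: strict core = 100%, core >= 95%,
    shell = 15% to < 95%, cloud < 15%.
    """
    if not 0 < n_present <= n_genomes:
        raise ValueError("n_present must be between 1 and n_genomes")
    if n_present == n_genomes:
        return "strict_core"
    frequency = n_present / n_genomes
    if frequency >= 0.95:
        return "core"
    if frequency >= 0.15:
        return "shell"
    return "cloud"
```

For a 50-genome dataset, for example, a cluster found in 48 genomes (96%) is labeled core, whereas one found in 47 genomes (94%) falls into the shell category.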

Second, we classified gene families based on their evolutionary origin into orthologous and paralogous types. As expected, nearly all software tools achieved high accuracy in identifying orthologous genes, ranging from 90.82% to 99.91% on average, with PGAP2 reaching the highest at 99.97%. However, notable differences were observed in the identification of paralogous genes. PGAP2, PEPPAN, and PPanGGOLiN performed significantly better than the remaining three tools, which can be explained by their respective optimizations: PEPPAN uses phylogenetic tree-based methods, PPanGGOLiN applies maximum likelihood estimation, and PGAP2 incorporates dynamic parameter planning (Fig. 2c, Supplementary Tables 3 and 4).

Finally, we analyzed the types of errors made by each software and classified them into two categories: false merges, where two or more clusters are incorrectly merged into a single cluster, and false splits, where a single cluster is incorrectly split into multiple clusters. The results revealed that the main error types varied across software. For example, the primary errors in Panaroo and Roary were false splits, while PanTa, PEPPAN, and PPanGGOLiN were more prone to false merges. PGAP2 showed mean counts of 20.1 for false merges and 23 for false splits, indicating no significant error bias. Interestingly, although PPanGGOLiN achieved the highest accuracy in identifying paralogous clusters, it appeared to cause more false merges in orthologous gene clusters, suggesting a bias toward this type of error (Fig. 2d, Supplementary Tables 5 and 6).
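One simple way to count these two error types, given a reference and a predicted assignment of genes to clusters, is sketched below. The counting rule is an assumption chosen for illustration (each predicted cluster spanning several reference clusters counts as one false merge, and each reference cluster spread over several predicted clusters counts as one false split); the exact rule used in the benchmark may differ.

```python
from collections import defaultdict


def count_merge_split_errors(true_cluster, pred_cluster):
    """Count false merges and false splits.

    true_cluster, pred_cluster: dicts mapping gene ID -> cluster ID,
    defined over the same set of genes.
    Returns (false_merges, false_splits).
    """
    pred_to_true = defaultdict(set)
    true_to_pred = defaultdict(set)
    for gene, true_id in true_cluster.items():
        pred_id = pred_cluster[gene]
        pred_to_true[pred_id].add(true_id)
        true_to_pred[true_id].add(pred_id)
    # A predicted cluster that mixes several reference clusters is a false merge.
    false_merges = sum(1 for ids in pred_to_true.values() if len(ids) > 1)
    # A reference cluster scattered over several predictions is a false split.
    false_splits = sum(1 for ids in true_to_pred.values() if len(ids) > 1)
    return false_merges, false_splits
```

Reporting the two counts separately, rather than a single error total, is what exposes the tool-specific biases discussed above.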

Computation time

We generated 11 additional gradient-simulated datasets to assess the runtime of PGAP2 and competing tools. The number of genomes ranged from 50 to 1000, with an average of 4500 coding genes per genome. All software was tested using 8 threads with default parameters (Fig. 2e, Supplementary Tables 7 and 8). PGAP2 demonstrated near-linear scalability, with no significant computational bottlenecks as the dataset size increased. On the largest dataset, PGAP2 completed the analysis in approximately 6 min, nearly twice as fast as the second-fastest tool, PanTa, and 23 times faster than Roary, while requiring only one-third of the memory used by Roary. These findings suggest that PGAP2 maintains high accuracy and is well-equipped to handle the demands of large-scale prokaryotic pan-genome analyses.

To further evaluate the scalability of PGAP2 with real-world datasets, we chose three prokaryotic species with the most complete genomes available in the NCBI RefSeq database: Klebsiella pneumoniae (25,437 genomes), Staphylococcus aureus (18,411 genomes), and Escherichia coli (44,284 genomes). Using 32 threads and default settings, PGAP2 successfully conducted pan-genome analyses for all three datasets. The runtimes were about 6 h for S. aureus, 12 h for K. pneumoniae, and 119 h for E. coli, with peak memory usage reaching up to 886 GB. These results demonstrate that PGAP2 can handle some of the largest collections of prokaryotic genomes currently available (Supplementary Fig. 8).

Benchmark datasets construction for pan-genome analysis

To further assess the accuracy of PGAP2 using real-world data, we manually created a benchmark pan-genome dataset composed of 50 strains of Escherichia coli, representing eight major phylogroups (Supplementary Table 9). The workflow for constructing this dataset is illustrated in Fig. 3a, and detailed quality control criteria are provided in the Methods section. Briefly, we selected representative strains based on their genome distances. Raw annotations were manually curated and merged based on semantic similarity, while homologous gene families were defined by integrating functional annotations from SwissProt. Homologs were further refined using synteny information, followed by phylogenetic tree construction and BBH-based paralog identification. Compared with the raw cluster dataset, the benchmark dataset incorporates several improvements: pseudogenes and short sequences were removed, clusters that needed merging or splitting were corrected, and all instances of paralogous genes were resolved. The Cohen’s kappa score between the raw and benchmark datasets is 0.6713 (Supplementary Table 10), indicating a substantial level of agreement and reflecting the changes made through manual curation. Ultimately, the benchmark dataset includes 224,798 genes organized into 14,360 homologous gene clusters, which consist of 3063 core gene families, 2494 shell gene families, and 8803 cloud gene families (Supplementary Table 11 and Supplementary Data 1).
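Cohen’s kappa corrects the observed agreement between two labelings for the agreement expected by chance. The sketch below computes it for two label lists over the same items; how the raw and curated cluster assignments were encoded as comparable labels in this study is not specified here, so the inputs are purely illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two categorical labelings of the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelings use a single identical label
    return (observed - expected) / (1.0 - expected)
```

On the commonly used Landis and Koch scale, values between 0.61 and 0.80 are described as substantial agreement, which is the range the reported score of 0.6713 falls into.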

Fig. 3: The data profiling of the Escherichia coli benchmark.

a The construction of the benchmark includes six steps: merging clusters based on annotations (Steps 1 and 2), splitting paralogs based on collinearity (Step 3), classifying and correcting clusters based on best-matching (Steps 4 and 5), and multiple verifications and manual curation to generate the complete version (Step 6). b The average protein sequence similarity across various types of clusters, with sample sizes: n(total) = 14,360, n(strict core) = 2331, n(soft core) = 732, n(shell) = 2494, n(cloud) = 8803. c The average semantic similarity of domain annotations across various types of clusters. The violin plots show the kernel density estimate of the semantic similarity distribution, where the width represents the relative frequency within each cluster type and diamond symbols represent the mean. The sample size includes only the clusters with annotations: n(total) = 9872, n(strict core) = 2250, n(soft core) = 672, n(shell) = 2011, n(cloud) = 4939. d Average gene identity of nucleic acid sequences across various types of clusters, with the same sample sizes as in b. e A paired Wilcoxon signed-rank test with no adjustment for multiple comparisons indicates that the average genetic distance within clusters is significantly smaller than the genetic distance to the nearest cluster (p < 2.22e-16). The sample size includes only the clusters that have a nearest neighboring cluster: n = 13,860. Inner Dist represents the average genetic distance within the cluster, Outer Dist represents the genetic distance between the cluster and the nearest cluster, and “****” indicates a significant difference with p < 0.0001. In b, d, and e, the box plots depict the median (central line), 25th and 75th percentiles (box bounds), outliers (gray points), and mean (diamond symbols).

The pan-genome profile illustrates the open nature of the E. coli pan-genome (Supplementary Fig. 3C). Core genes are mainly associated with essential conserved functions, including translation, ribosomal structure, biogenesis, and energy production and conversion. Shell genes are involved in functions related to transcription, cell motility, intracellular transport, and secretion, while cloud genes are enriched in functions related to transcription, replication, recombination, and repair, with a significant proportion of genes of unknown function (Supplementary Fig. 10). These functional patterns are consistent with previous observations24,25,26.

We constructed a phylogenetic tree using curated single-copy core clusters, which revealed that all strains grouped into their respective phylogroups (Supplementary Fig. 11). To further validate the reliability of our dataset, we examined it from four perspectives: the sequence similarity among proteins within the clusters (Fig. 3b), the consistency of functional annotations (Fig. 3c), the average genetic distance among genes within the clusters (Fig. 3d), and the difference in genetic distances between intra-cluster genes and the nearest inter-cluster genes (Fig. 3e). Our results demonstrated that the average protein sequence similarity within all clusters exceeded 70%, and the consistency of functional annotations for the annotated gene families was also above 70%. Furthermore, the average genetic distance among genes within clusters was less than 0.3, which is significantly smaller than the distance to the nearest neighboring cluster (two-sided Wilcoxon signed-rank test, p < 2.22e-16). These findings highlight the accuracy and reliability of our curated dataset.

Methods’ performance on the Escherichia coli dataset

The performance of each pan-genome analysis method on the curated Escherichia coli dataset is illustrated in Fig. 4a and c. All methods achieved an accuracy of over 0.98. We evaluated three sensitivity modes in PGAP2 (soft, mid, and strict), which define the connectivity thresholds used in gene clustering. The strict mode requires a fully connected graph, wherein all gene pairs must exceed the homology threshold. This mode is suitable for identifying conserved genes across different strains. In contrast, the soft mode only requires minimal connectivity (one edge), making it more appropriate for identifying highly variable homologous gene clusters across species. The testing of these modes yielded accuracies of 0.9932, 0.9933, and 0.9932, respectively, all of which exceed the accuracy of the second-best method, Panaroo, which achieved an accuracy of 0.9917 (Fig. 4a and Supplementary Table 12). Regarding core orthologous gene clusters, Fig. 4c shows that Panaroo identified the highest number, with 3072 clusters, followed closely by PPanGGOLiN with 3064 and PGAP2 with 3063. Among these methods, PGAP2’s result most closely matched the gold standard. Conversely, PanTa identified the fewest clusters, totaling 2900 (Supplementary Table 13).
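The difference between the strict and soft modes can be expressed as a connectivity test applied when a candidate gene is considered for a cluster. The sketch below covers only these two modes, because the rule for the mid mode is not described in this section; the function name, the frozenset-keyed identity table, and the way the test is applied to a single candidate gene are assumptions for illustration.

```python
def can_join(candidate, cluster, identity, threshold, mode):
    """Decide whether a candidate gene may join a cluster.

    identity: dict frozenset({gene_a, gene_b}) -> identity in 0..1
    mode "strict": the candidate must be linked to every member,
                   which keeps the cluster fully connected.
    mode "soft":   a single link to any member is enough.
    """
    if not cluster:
        raise ValueError("cluster must contain at least one gene")
    links = [
        identity.get(frozenset((candidate, member)), 0.0) > threshold
        for member in cluster
    ]
    if mode == "strict":
        return all(links)
    if mode == "soft":
        return any(links)
    raise ValueError(f"unsupported mode in this sketch: {mode}")
```

Under the strict rule a divergent gene is rejected as soon as it fails to match any one member, whereas the soft rule lets similarity chain through intermediate members, which is why it suits highly variable gene families.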

Fig. 4: Comparison of pan-genome predictions with curated datasets.

a The accuracy and false counts for each software were evaluated using a manually curated dataset consisting of 50 E. coli genomes. b The accuracy and false counts for each software were evaluated using a manually curated dataset of 15 Salmonella genomes. c The distribution of clusters within pan-genome groups derived from the E. coli dataset. d The distribution of clusters within pan-genome groups derived from the Salmonella dataset. The terms “soft”, “mid”, and “strict” refer to the different levels of clustering strictness employed in PGAP2.

We also examined misclassified genes and gene clusters (Fig. 4a). Depending on the sensitivity mode used, PGAP2 misclassified approximately 1500 clusters, significantly fewer than the average of 3,321.0 ± 1,512.18 clusters misclassified by other tools. PGAP2 misclassified only six strict core clusters, compared to ten identified by Panaroo. This is notably lower than the overall average of 79.8 ± 38.92 clusters (Supplementary Table 14).

Methods’ performance on the Salmonella enterica benchmark dataset

To address potential biases related to the dataset or strain preferences, we re-evaluated PGAP2 using a different gold standard dataset that included 15 Salmonella enterica strains (Fig. 4b, d, and Supplementary Table 15). As on the E. coli benchmark, all methods demonstrated an overall accuracy exceeding 0.98. PGAP2 achieved the highest accuracy at 0.9964, surpassing the second-best method, Roary, which reached an accuracy of 0.9949 (Supplementary Tables 16 and 17).

Notably, Roary and PanTa performed comparatively better on Salmonella but showed lower performance on Escherichia coli, suggesting that these methods may have species-specific thresholds or parameter sensitivities. This observation is consistent with their behavior on simulated datasets and highlights a potential bias in these tools when applied across diverse taxa. Regarding error rates, PGAP2 yielded the fewest misclassified genes and gene clusters. Specifically, it reported 1854 misclassified genes, compared to 2203 for Roary. PGAP2 also misclassified only 425 gene clusters, while PEPPAN and PPanGGOLiN misclassified 489 clusters each (Supplementary Table 18).

Taken together, our results demonstrate that PGAP2 maintains high accuracy across species and does not exhibit species-specific performance bias. Its consistent superiority across independent benchmarks confirms its robustness and reliability for pan-genome inference in diverse prokaryotic datasets.

A pan-genome analysis for Streptococcus suis

PGAP2 is capable of rapidly generating high-quality pan-genome profiles from thousands of strains. To demonstrate its scalability and robustness on a high-volume dataset, we analyzed all 2794 strains of Streptococcus suis, a zoonotic pathogen known for its extensive genetic diversity and large gene pool, using data available from the NCBI RefSeq database.

PGAP2 completed the pan-genome analysis of 2794 S. suis strains within 20 minutes using 32 threads. A total of 17,646 orthologous gene clusters were identified, comprising 1395 core genes (7.90%), 6.98% shell genes, and 85.12% cloud genes (Fig. 5a). This number is smaller than the 29,738 gene clusters identified in 1204 S. suis strains using Roary27. This discrepancy may stem from Roary’s clustering strategy, which is prone to over-segmenting gene clusters with high sequence diversity, especially those annotated as shell or cloud genes, thereby introducing false-split errors, as demonstrated in our benchmarking evaluation (Fig. 2d). Core genes represented an average of 70.67% of all coding genes in each strain, slightly higher than reported in previous studies, such as 66.5% by Zhang et al.28 and 63% by Xia et al.29. Among the analyzed strains, S. suis YB51 Colony29 exhibited the highest proportion of core genes at 85.42%, while S. suis SS/UPM/MY/F001 had the lowest at 55.95%. These differences are likely due to the varying number of coding genes present in their genomes (Supplementary Fig. 12). The pan-genome growth curve follows a sub-linear power-law trend, indicating that as more genomes are included, new genes continue to be discovered but at a diminishing rate. The positive exponent confirms an expanding pan-genome (Fig. 5b), supporting the idea of an open pan-genome, which is consistent with previous studies27,28,29,30.
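A power-law growth curve of the form N(n) = kappa * n^gamma can be fitted by ordinary least squares on the log-transformed values, as in the sketch below. This is a generic illustration of the model class named in the text, not the fitting routine used by PGAP2; the function name and the choice of a log-log linear fit are assumptions. An exponent between 0 and 1 corresponds to the sub-linear but still growing (open) behavior described above.

```python
from math import exp, log


def fit_power_law(n_genomes, pan_sizes):
    """Fit pan_size = kappa * n ** gamma by least squares in log-log space.

    Returns (kappa, gamma).
    """
    if len(n_genomes) != len(pan_sizes) or len(n_genomes) < 2:
        raise ValueError("need at least two paired observations")
    xs = [log(n) for n in n_genomes]
    ys = [log(p) for p in pan_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                      # slope = growth exponent
    kappa = exp(mean_y - gamma * mean_x)   # intercept back-transformed
    return kappa, gamma


def is_open_pan_genome(gamma):
    """Open pan-genome under this model: positive, sub-linear exponent."""
    return 0.0 < gamma < 1.0
```

In practice the input points would come from the rarefaction curve, that is, the mean pan-genome size observed at each sampled number of genomes.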

Fig. 5: Pan-genome analysis of 2794 Streptococcus suis strains using PGAP2.

a Pie chart showing the proportion of core, shell, and cloud gene clusters; accompanying histogram illustrates the number of clusters as a function of strain frequency. b Rarefaction curves for the pan-genome and core genome. c Distribution of four quantitative parameters (Uniqueness, Minimum identity, Mean identity, and Variance) calculated by PGAP2 for all gene clusters. d Similarity relationships among homologous gene clusters. Unique clusters are defined as those without detectable homology to any other clusters, whereas coherent clusters exhibit sequence similarity exceeding the predefined gene family threshold. The bar chart depicts the distribution of these two cluster types across different pan-genome categories. e Intra-cluster similarity among homologous genes. Conserved clusters are defined as those with sequence identity above the conservation threshold, indicating strong internal consistency. Variable clusters meet the gene family threshold but fall below the conservation threshold, indicating elevated diversity within the cluster.

Quantitative attributes of Streptococcus suis pan-genome profile

PGAP2 provides quantitative metrics for each gene cluster based on four parameters: Uniqueness (Uni), Minimum identity (Min), Mean identity (Mean), and Identity variance (Var) (Fig. 5c). Uni describes inter-cluster relationships, whereas Min, Mean, and Var reflect intra-cluster sequence similarity. Uni quantifies the minimum distance between an ortholog and other clusters, with lower values indicating greater separation. When Uni exceeds a given threshold (0.7 by default), it defines a unique cluster, indicating that there is insufficient homology for classification within an established gene family. Our analysis revealed that over two-thirds of homologous gene clusters are unique, with most originating from core and cloud clusters. In contrast, shell clusters tend to exhibit higher homology (Fig. 5d). This finding suggests that cloud genes are predominantly influenced by horizontal gene transfer in S. suis, which contributes to its genetic diversity.

Min, Mean, and Var are key to classifying gene clusters into conserved and variable categories. Conserved clusters, where both Min and Mean exceed a threshold, show high sequence consistency, likely due to purifying selection (Fig. 5e). In contrast, variable clusters exhibit increased sequence diversity: they meet the gene family threshold, but either Min or Mean falls below the conservation threshold. This variation may arise from evolutionary forces such as genetic drift and mutations. Notably, soft core and shell genes demonstrate the greatest homolog diversity, likely owing to their stability within populations, making them more susceptible to environmental selection pressures (Supplementary Fig. 13).
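The two classifications described above can be sketched as a pair of threshold tests. This is a minimal illustration, not PGAP2's implementation: the names `ClusterMetrics` and `classify_cluster` are hypothetical, and the conservation threshold of 0.9 is an assumed placeholder, since only the 0.7 default for Uni is stated in the text.

```python
from dataclasses import dataclass
from typing import Tuple

UNI_THRESHOLD = 0.7            # default stated in the text
CONSERVATION_THRESHOLD = 0.9   # assumed placeholder, not taken from PGAP2


@dataclass
class ClusterMetrics:
    uni: float      # Uni: inter-cluster uniqueness
    min_id: float   # Min: minimum identity within the cluster
    mean_id: float  # Mean: mean identity within the cluster
    var_id: float   # Var: identity variance within the cluster


def classify_cluster(m: ClusterMetrics) -> Tuple[str, str]:
    """Return (inter-cluster label, intra-cluster label) for one cluster."""
    # Unique clusters exceed the Uni threshold; all others are coherent.
    inter = "unique" if m.uni > UNI_THRESHOLD else "coherent"
    # Conserved clusters require both Min and Mean above the threshold.
    conserved = (m.min_id > CONSERVATION_THRESHOLD
                 and m.mean_id > CONSERVATION_THRESHOLD)
    intra = "conserved" if conserved else "variable"
    return inter, intra
```

For example, a cluster with Uni = 0.5, Min = 0.75, and Mean = 0.95 would be labeled coherent and variable under these assumed thresholds.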

The enrichment analysis of high-variance clusters indicated that these homologs are preferentially associated with pathways involved in environmental interactions (Supplementary Table 19). The most significant enrichment was observed in quorum sensing, a bacterial communication mechanism that allows cells to sense population density via autoinducer molecules. This process regulates collective behaviors such as biofilm formation and virulence factor expression31,32,33. Other enriched pathways include peptidoglycan biosynthesis34,35 and several pathways related to carbohydrate metabolism, such as starch and sucrose metabolism, galactose metabolism, and the phosphotransferase system. These pathways likely reflect strain-specific adaptations in carbon source utilization and preferences across different environments35,36.

Distribution of paralogous genes across pan-genome groups

Paralogous genes play a fundamental role in genome evolution and the functional diversification of organisms. They provide genetic redundancy, which allows for the development of new gene functions through the processes of duplication and divergence37,38. To investigate this further, we introduced the concept of a paralogous ratio, which is defined as the average number of paralogous genes per strain within a homologous gene cluster. A higher paralogous ratio indicates that these genes tend to cluster within specific strains, while a lower value suggests a more uniform distribution across all strains.
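The definition leaves some room for interpretation, so the following is only one possible reading, stated as an assumption: paralogous genes are taken to be the copies found in strains that carry more than one member of the cluster, and the ratio is their count divided by the number of such strains. The function name `paralogous_ratio` and the input layout are hypothetical.

```python
from collections import Counter
from typing import Dict


def paralogous_ratio(gene_to_strain: Dict[str, str]) -> float:
    """Average number of paralogous copies per paralog-carrying strain.

    gene_to_strain maps each gene ID in one homologous cluster to its strain.
    Assumption: a strain contributes paralogs only if it holds >1 copy.
    """
    copies_per_strain = Counter(gene_to_strain.values())
    multi_copy = [n for n in copies_per_strain.values() if n > 1]
    if not multi_copy:
        return 0.0  # the cluster contains no paralogs
    return sum(multi_copy) / len(multi_copy)
```

Under this reading, a cluster with three copies in strain A, one in strain B, and two in strain C has a ratio of (3 + 2) / 2 = 2.5.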

We investigated the distribution of paralogous genes in Streptococcus suis and found distinct patterns across different pan-genome groups (Supplementary Fig. 14A). Core gene clusters generally exhibit an even distribution of paralogous genes across various strains. In contrast, accessory gene clusters are more likely to contain multiple copies within individual strains, which reflects a statistically significant difference. Interestingly, although paralogous gene clusters are most prevalent in the cloud category, they account for only 6.75% of all cloud gene clusters. In comparison, paralogous gene clusters account for 51.54% (719 out of 1395) of all core gene clusters, with the highest proportion found in strict core clusters at 60.76% (79 out of 130) (Supplementary Fig. 14B). These findings suggest that additional copies of core gene clusters provide essential redundancy for critical functions, thereby enhancing the survival and robustness of S. suis under varying conditions39.

Discussion

In the past two decades, significant development trends have reshaped prokaryotic pan-genome research40,41,42,43,44. These trends highlight the growing need for advanced analytical methods and computational models to handle increasingly large datasets, understand complex genomic dynamics, and uncover the evolutionary narratives contained within the pan-genome. Although state-of-the-art methods were effective in earlier studies, they have struggled to keep up with the rapid advancements in sequencing technologies and the explosion of large-scale datasets.

PGAP2 is designed to address the challenges posed by the rapid growth in dataset size. Unlike full-graph search strategies such as Roary22 and PanTa20, PGAP2 employs a dual-level regional restriction mechanism. This approach limits the search for orthologous genes to a confined radius, greatly reducing unnecessary computations. Doing so not only accelerates the search process but also allows enough time for detailed feature analysis, thereby enhancing overall accuracy. As a result, PGAP2 can efficiently process large-scale genomic data. In our case study, PGAP2 successfully inferred the pan-genome of 2794 Streptococcus suis strains in just 20 minutes using 28 threads, outperforming previous methods by a significant margin. This demonstrates that large-scale pan-genome analysis is not only feasible in the high-throughput computing era but also highly efficient.

Most current methods for inferring complex gene families primarily rely on gene identity and synteny within strains, which are effective for conserved homologs like core genes. However, these criteria often fall short when dealing with more intricate gene families. Tree-based methods, such as PEPPAN17 and PanX18, can analyze phylogenetic relationships to infer complex gene families. Still, they tend to be time-consuming and are unsuitable for large-scale genomic analyses. PGAP2 addresses these challenges by introducing a fine-grained feature analysis mechanism that considers strain diversity, connectivity, and bidirectional best-hit (BBH) criteria in the identification of orthologous genes. These internal features help to better constrain the inference of complex gene clusters, resulting in a more accurate and reliable process.

Moreover, PGAP2 introduces a suite of parameters that quantitatively describe the properties of homologs, allowing researchers to capture the evolutionary dynamics of gene families in a more detailed and intuitive way. This focus on quantitative analysis is essential for understanding the forces driving prokaryotic evolution, and PGAP2’s ability to quantify these relationships sets it apart from other methods. Furthermore, building on the foundation of PGAP16, PGAP2 provides seamless workflows for preprocessing and post-processing (Supplementary Fig. 16), along with visualization tools to aid users in exploring and understanding the underlying dynamics of pan-genomes.

PGAP2 has several current limitations. First, it does not include non-coding genes in pan-genome clustering because its pipeline relies on protein-level similarity, which non-coding elements typically lack. Second, PGAP2 is sensitive to the quality of genome assemblies. It performs poorly on fragmented or incomplete assemblies, such as metagenome-assembled genomes (MAGs), where significant portions of the sequences may be missing. This lack of continuity disrupts gene order and weakens synteny signals, which are crucial for accurately distinguishing orthologs from paralogs in PGAP2. As a result, such inputs can significantly compromise the accuracy of clustering. Additionally, PGAP2 relies on both protein similarity and gene synteny among genomes. When these assumptions are violated, for example in analyses that span a genus with evolutionarily distant species, clustering accuracy may decline, and results should be interpreted cautiously (Supplementary Fig. 15). In such cases, further biological validation is recommended. Lastly, the fine-grained feature analysis used by PGAP2 is computationally intensive. While region-restricted alignment strategies help avoid unnecessary comparisons, runtime can still increase significantly for very large datasets, especially those with extensive paralogy and fragmentation. For example, analyzing the complete E. coli dataset (over 40,000 genomes) led to a considerable increase in both runtime and memory usage, indicating a practical scalability limit with the current configurations.

Methods

Escherichia coli strain selection for manual curation

A total of 170 complete genomes with well-defined subtype annotations were downloaded from the NCBI RefSeq database. The genome of the representative strain Escherichia coli K-12 substr. MG1655 was selected as the reference genome. Genome distances between the reference genome and other genomes were calculated using Mash v2.3 with default parameters45. A phylogenetic tree based on Mash distances was constructed using Mashtree v1.2.046, and the tree shape was adjusted with iTOL v6.9.147, as shown in Supplementary Fig. 9. Representative strains were selected based on the genome distance gradient and their positions within the tree. In total, 50 Escherichia coli strains, including the reference strain K-12 substr. MG1655 and another representative strain O157:H7 str. Sakai, were selected as the initial dataset for the gold standard. Details of the data curation process are provided in Fig. 3a.

All CDS sequences were extracted from genome sequences and gff annotation files by in-house scripts. All pseudogenes and CDS sequences with protein sequence lengths less than 20 were deleted. Each sequence was renamed in the format A_B_C to generate unique sequence identification numbers, where A represents the genome number, B represents the chromosome and plasmid number, and C represents the sequence number.
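The filtering and renaming step can be sketched as follows. The in-house scripts are not shown in the text, so this is only an illustration under stated assumptions: the function name `filter_and_rename` and the input tuple layout are hypothetical, and the sequence number C is assumed to be counted per replicon after filtering.

```python
from typing import Dict, Iterable, Tuple


def filter_and_rename(
    records: Iterable[Tuple[int, str, bool]],
    genome_no: int,
    min_len: int = 20,
) -> Dict[str, str]:
    """Drop pseudogenes and short proteins, then assign A_B_C identifiers.

    records: (replicon_no, protein_sequence, is_pseudogene) in genome order.
    A = genome number, B = chromosome/plasmid number, C = sequence number
    (assumed here to restart at 1 for each replicon).
    """
    renamed: Dict[str, str] = {}
    counters: Dict[int, int] = {}
    for replicon_no, protein, is_pseudo in records:
        # Remove pseudogenes and proteins shorter than 20 residues.
        if is_pseudo or len(protein) < min_len:
            continue
        counters[replicon_no] = counters.get(replicon_no, 0) + 1
        renamed[f"{genome_no}_{replicon_no}_{counters[replicon_no]}"] = protein
    return renamed
```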

Gene family classification of Escherichia coli benchmark

We curated orthologous gene families using a six-step cascade process, as outlined in Fig. 3a and Supplementary Methods. In step 1, we merged genes with identical annotations and calculated the semantic similarity of gene products extracted from annotation files using the SequenceMatcher class from the Python standard library module difflib. Gene families with an annotation similarity greater than 90% were manually curated to confirm whether they represented the same protein. If they did, we combined them into the same gene family. In step 2, to address annotation differences caused by inconsistent protein names (e.g., ribonucleoside hydrolase 1 and pyrimidine-specific ribonucleoside hydrolase RihA, which are the same protein), we used BLASTp v2.15.048 to re-annotate all protein sequences against the Swiss-Prot database49, setting thresholds of identity ≥70%, coverage ≥70%, and e-value ≤ 1e-5. Gene families with consistent Swiss-Prot annotations were then merged.
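The annotation comparison in step 1 can be illustrated with the standard library class `difflib.SequenceMatcher`. The helper names and the lowercasing step are assumptions made for this sketch; the text does not describe how product names were normalized.

```python
from difflib import SequenceMatcher


def annotation_similarity(product_a: str, product_b: str) -> float:
    """Similarity ratio in [0, 1] between two gene product names."""
    # Lowercasing is an assumed normalization, not described in the text.
    return SequenceMatcher(None, product_a.lower(), product_b.lower()).ratio()


def needs_manual_review(product_a: str, product_b: str,
                        threshold: float = 0.9) -> bool:
    """Flag non-identical annotations whose similarity exceeds 90%."""
    return (product_a != product_b
            and annotation_similarity(product_a, product_b) > threshold)
```

The pair cited in step 2, "ribonucleoside hydrolase 1" and "pyrimidine-specific ribonucleoside hydrolase RihA", scores well below 0.9 with this measure, which is consistent with the need for the Swiss-Prot re-annotation. Conversely, "50S ribosomal protein L2" and "50S ribosomal protein L20" exceed 0.9 despite being different proteins, which illustrates why highly similar annotations were checked manually rather than merged automatically.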

While homologous annotation effectively identifies types of gene families, it fails to adequately distinguish between paralogous genes because it does not take the location of genes within the genome into account. To address this, step 3 introduced a genomic context score, which approximates the degree of collinearity between genome segments by comparing the number of homologous genes within the five upstream and five downstream genes surrounding each target gene. Additionally, phylogenetic trees were constructed for all assumed paralogous gene clusters using MAFFT v7.31050 and FastTree v2.1.1051 with default parameters, and ETE3 v3.1.352 was used to obtain more accurate evolutionary relationships. The classification rules were as follows: If a gene was not located on the same genome as its nearest gene, paralogous genes were distinguished based on the genomic context score. In cases where the genomic context scores were consistent, paralogs were separated according to the principle of nearest distance. This process resulted in the homologous gene family classification established in step 3.
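A genomic context score of this kind can be sketched as the number of gene families shared between the flanking windows of two target genes. The exact scoring formula is not given in the text, so the normalization by window size, the treatment of replicons as linear, and the function names below are all assumptions.

```python
from collections import Counter
from typing import List, Sequence


def flanking_families(family_order: Sequence[str], index: int,
                      window: int = 5) -> List[str]:
    """Family labels of up to `window` genes on each side of the target."""
    # Assumption: the replicon is treated as linear (no circular wrap-around).
    left = list(family_order[max(0, index - window):index])
    right = list(family_order[index + 1:index + 1 + window])
    return left + right


def genomic_context_score(order_a: Sequence[str], idx_a: int,
                          order_b: Sequence[str], idx_b: int,
                          window: int = 5) -> float:
    """Fraction of flanking positions whose gene family is shared."""
    a = Counter(flanking_families(order_a, idx_a, window))
    b = Counter(flanking_families(order_b, idx_b, window))
    shared = sum((a & b).values())  # multiset intersection of family labels
    return shared / (2 * window)
```

Two genes whose ten neighbors belong to the same families score 1.0, whereas a gene copy inserted into an unrelated neighborhood scores close to 0, which is the signal used to separate paralogs.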

Next, we focused on unknown proteins, typically annotated as hypothetical proteins, that homologous annotations could not classify. The best matching genes from all strains were searched for these proteins rather than relying on known databases. The bidirectional best hit (BBH) of a gene was defined based on the following criteria: identity ≥70%, coverage ≥70%, and e-value ≤ 1e-5. If the BBH had been correctly classified, the corresponding hypothetical protein was classified into the gene family of the BBH. If no BBH was found, the hypothetical protein was classified into a new family. In cases where the BBH was also a hypothetical protein, no classification was made at that stage. Finally, any unclassified putative proteins were grouped using MCL v14-13753, applying the same bidirectional thresholds of identity ≥70%, coverage ≥70%, and e-value ≤ 1e-5. This approach ensured a clear differentiation between clusters, culminating in step 4.
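The BBH test described above can be sketched from tabular alignment results. The thresholds are those given in the text; ranking hits by bit score, the `Hit` record layout, and the function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float  # percent identity
    coverage: float  # percent coverage
    evalue: float
    bitscore: float


def passes_thresholds(hit: Hit) -> bool:
    """Identity >= 70%, coverage >= 70%, and e-value <= 1e-5."""
    return hit.identity >= 70 and hit.coverage >= 70 and hit.evalue <= 1e-5


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Best qualifying subject per query, ranked by bit score (assumed)."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.query == hit.subject or not passes_thresholds(hit):
            continue
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return {query: hit.subject for query, hit in best.items()}


def bidirectional_best_hit(gene: str, hits: Iterable[Hit]) -> Optional[str]:
    """Return the BBH partner of `gene`, or None if no reciprocal hit exists."""
    best = best_hits(hits)
    partner = best.get(gene)
    if partner is not None and best.get(partner) == gene:
        return partner
    return None
```

A hypothetical protein with no partner returned by `bidirectional_best_hit` would, following the rules above, be placed in a new family.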

Additionally, we performed a consistency check based on BBH for all classified gene clusters. This was done to verify that all BBH pairs belonged to the same gene family and to identify any proteins that may have been misclassified due to not meeting the BBH criteria. The misclassified genes were extracted and reclassified using MCL v14-137 with the same bidirectional thresholds to form step 5.

We finally addressed special genes, such as Insertion Sequences (IS) and prophage proteins, which are mobile elements that often appear randomly in genomes and do not conform to traditional homologous classification rules54,55. These genes were manually classified using a combination of genome context score, sequence similarity, and annotation information. For genes that remained ambiguous, we ensured the accuracy of our dataset by removing them, so that only correctly classified genes were retained. This process led to the creation of the final version of the gene family.

Quality assessment of Escherichia coli benchmark dataset

Benchmarking the inference of orthologs typically involves several key factors, such as testing for functional consistency, inconsistencies in the species tree, conservation of gene neighborhoods, and validation against gold standard gene trees56,57. Therefore, the reliability of the benchmark was verified by the sequence similarity of proteins, the consistency of domain annotation, the average genetic distance of genes, and the association between the nearest genetic distance and the average genetic distance. First, we classified the types of gene clusters in our benchmark. We then used BLASTp to calculate the pairwise protein sequence similarity of genes in each gene family and the average sequence similarity within the cluster. Next, the proteins in each gene family were annotated by InterProScan v5.6358. We calculated the semantic similarity within each family based on the annotated domains, employing the Python package scikit-learn to compute cosine similarity. We aligned the nucleic acid sequences within each gene family using MAFFT v7.31059. Next, we used FastTree v2.1.1060 to build the gene tree and ETE3 v3.1.352 to calculate the genetic distance between each pair of genes. Following this, we extracted representative sequences from each gene family for sequence alignment and phylogenetic tree construction. We also calculated the genetic distance between gene families and compared the average genetic distance within a gene family with the distance to its nearest gene family to determine whether there was a significant difference.

Characteristic analysis of Escherichia coli benchmark dataset

We used PanGP v1.0.113 to calculate the pan-genome size of E. coli, including the core genome size and the number of new genes. This was done through nonlinear fitting based on the gene presence/absence matrix of the E. coli benchmark. To annotate all proteins and identify the functional characteristics of gene clusters, we used eggNOG-mapper v2.1.12 with the DIAMOND alignment algorithm, referencing the eggNOG database v5.0.2. Next, we constructed an OrgDb annotation package with the R package AnnotationForge v1.40.2 and performed GO enrichment analyses for gene clusters using clusterProfiler v4.6.261 with P-value < 0.05. Finally, we visualized the COG annotations and GO terms for core genes, shell genes, and cloud genes using R scripts.

Salmonella enterica benchmark dataset generation

The Salmonella enterica dataset was initially generated by Sean-Paul Nuccio and Andreas J. Bäumler62. The strains were downloaded from the NCBI genome database, with accession numbers as follows: LT2: GCF_000006945.1, SL254: GCF_000016045.1, SL476: GCF_000020705.1, CVM19633: GCF_000020745.1, SL483: GCF_000020885.1, P125109: GCF_000009505.1, SPB7: GCF_000018705.1, CT18: GCF_000195995.1, Ty2: GCF_000007545.1, ATCC-9150: GCF_000011885.1, AKU-12601: GCF_000026565.1, CT-02021853: GCF_000020925.1, 287-91: GCF_000009525.1, SC-B67: GCF_000008105.1, RKS4594: GCF_000018385.1, with the corresponding database links provided in Supplementary Table 15. We generated a standardized homologous gene matrix from the supplementary table provided in their research. Only complete genes from the supplementary tables were chosen as test genes to account for differences in initial gene filtering criteria among pan-genome software tools (Supplementary Data 2).

Representative strain and outlier selection

Identifying outliers is a critical step in PGAP2 preprocessing. It is achieved through two distinct screening methods, one based on homologous genes and one based on ANI similarity.

First, Homologous Gene-Based Screening: PGAP2 uses CD-HIT63 or MMseqs264 to eliminate redundancy and select representative sequences. These sequences are then aligned in an all-against-all manner to create a connectivity graph of the representative sequences. The distance between strains is represented by the strains found within their respective connected subgraphs. PGAP2 identifies all connected subgraphs that contain only a single strain and counts how many times each strain appears within these subgraphs. It calculates the third quartile (Q3) and interquartile range (IQR) of these counts, with the upper bound defined as Q3 + 1.5 × IQR. Strains that exceed this upper bound are classified as outliers.
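The upper-bound rule above can be sketched with the Python standard library. The quartile method is not specified in the text, so the `inclusive` method of `statistics.quantiles` is an assumption, as are the function name and the input layout (a mapping from strain to the number of single-strain subgraphs it appears in).

```python
import statistics
from typing import Dict, Set


def singleton_outliers(singleton_counts: Dict[str, int]) -> Set[str]:
    """Strains whose single-strain subgraph count exceeds Q3 + 1.5 * IQR."""
    values = list(singleton_counts.values())
    if len(values) < 2:
        return set()  # quartiles are undefined for fewer than two strains
    # The quartile interpolation method is an assumption of this sketch.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper_bound = q3 + 1.5 * (q3 - q1)
    return {strain for strain, count in singleton_counts.items()
            if count > upper_bound}
```

For instance, with counts of 9 to 13 for five strains and 80 for a sixth, the upper bound is 16.5 and only the sixth strain is flagged.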

Second, ANI Similarity-Based Screening: Average Nucleotide Identity (ANI) values between genomes are determined using FastANI v1.165 with the default settings. To ensure reliable estimates, there must be at least 150 kb of homologous genome sequences present between the two genomes being compared. Strains with ANI values below 95% compared to a representative genome are classified as outliers. Users have the option to specify the representative genome, or it can be automatically selected by PGAP2. When PGAP2 selects a representative genome, it identifies all connected subgraphs that contain more than half of the total strains and counts how often each strain occurs within them. The strain that appears most frequently is designated as the representative genome.
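The representative selection and the ANI cutoff can be sketched as follows. This illustration assumes that each connected subgraph is summarized as the set of strains it contains, that a strain is counted once per subgraph, and that ties are broken alphabetically; none of these details are specified in the text, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def pick_representative(subgraphs: Iterable[Set[str]], n_strains: int) -> str:
    """Strain occurring most often in subgraphs covering > half of all strains."""
    counts: Counter = Counter()
    for strains in subgraphs:
        if len(strains) > n_strains / 2:
            counts.update(strains)  # each strain counted once per subgraph
    if not counts:
        raise ValueError("no subgraph contains more than half of the strains")
    top = max(counts.values())
    # Alphabetical tie-breaking is an assumption for determinism.
    return min(s for s, c in counts.items() if c == top)


def ani_outliers(ani_to_representative: Dict[str, float],
                 threshold: float = 95.0) -> Set[str]:
    """Strains whose ANI to the representative genome is below 95%."""
    return {strain for strain, ani in ani_to_representative.items()
            if ani < threshold}
```

The final outlier set described in the next paragraph would then be the union of the strains returned by the two screening methods.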

The combination of strains identified by these two methods forms the final set of outliers, which is displayed by PGAP2.

The core basis for the division of sub-clusters

PGAP2 processes each initial similarity network generated by MCL, merging nodes that are within a specified distance into sub-clusters. Each node in the sub-cluster represents a unique gene without paralogs, along with its associated members. Nodes within a sub-cluster must meet basic synteny and similarity requirements. In addition, PGAP2 applies three core criteria to determine whether nodes within a sub-cluster should be merged:

1. Cluster Diversity: PGAP2 identifies high-quality initial core gene clusters from the initial connectivity graph. These clusters are defined as those without duplicate genes within the same strain and containing the number of strains that correspond to strict core genes. The algorithm calculates the range of similarities between nodes within these initial clusters and uses the maximum range as a threshold for merging other sub-clusters. Specifically, the similarity difference between any nodes in the candidate sub-clusters must not exceed this threshold.

2. Cluster Connectedness: This metric evaluates the number of edges connecting two candidate nodes for merging. Users can configure the strictness of this criterion using the ‘--sensitivity’ parameter: Strict: all child nodes between two candidate nodes must be connected, forming a fully connected subgraph; Moderate: at least 50% of the child nodes between two candidate nodes must be connected; Soft: at least one pair of child nodes must be connected.

3. Bidirectional Best Hit (BBH): If two nodes are merged into a paralogous gene cluster, the similarity between the resulting paralogous genes must exceed the maximum similarity found among the child nodes of their respective original nodes.

PGAP2 evaluates all pairs of nodes within a sub-cluster iteratively until all nodes are either merged or rejected. This approach is intended to ensure that the resulting clusters are accurate and biologically relevant.
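The Cluster Connectedness criterion can be sketched as a test on the fraction of connected child-node pairs between two candidate nodes. This is a simplified reading, not PGAP2's code: interpreting "50% of the child nodes" as 50% of cross pairs is an assumption, and the function name, the mode strings, and the edge representation are hypothetical.

```python
from itertools import product
from typing import FrozenSet, Sequence, Set

# Required fraction of connected child pairs per mode (assumed interpretation).
REQUIRED_FRACTION = {"strict": 1.0, "moderate": 0.5}


def connected_enough(children_a: Sequence[str], children_b: Sequence[str],
                     edges: Set[FrozenSet[str]],
                     sensitivity: str = "strict") -> bool:
    """Decide whether two candidate nodes are sufficiently connected to merge."""
    pairs = list(product(children_a, children_b))
    if not pairs:
        return False
    linked = sum(1 for a, b in pairs if frozenset((a, b)) in edges)
    if sensitivity == "soft":
        return linked >= 1  # at least one connected pair of child nodes
    return linked >= REQUIRED_FRACTION[sensitivity] * len(pairs)
```

With two child nodes on each side and two of the four cross pairs connected, the merge would be rejected in strict mode but accepted in moderate and soft modes.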

Pan-genome software selection for comparison

To assess the performance of PGAP2, we selected five popular software tools for prokaryotic pan-genome analysis: Panaroo (Version 1.5.0), PanTa (Version 1.0), PEPPAN (Version 1.0.5), PPanGGOLiN (Version 2.1.0), and Roary (Version 3.12.0). These tools were chosen for three reasons: their popularity in the field; their distinct methodological principles (Roary and PanTa rely on conserved gene neighborhood approaches, Panaroo and PPanGGOLiN employ graph-based models, and PEPPAN applies a tree-based phylogenetic framework); and their publication dates, which highlight their relevance and influence in pan-genome research.

Performance evaluation using simulated data

All test datasets were generated using SimPan. The parameter --idenOrtholog was varied, decreasing from 0.99 to 0.9 in steps of 0.02 to simulate different orthologous thresholds. Meanwhile, the --idenParalog parameter was reduced from 0.9 to 0.6 in steps of 0.1 to simulate varying paralogous thresholds. All other parameters were set to their default values. Default parameters were used for all pan-genome software tools tested, except that Panaroo was run in “sensitive” mode and PGAP2 was run with the ‘--ins’ parameter to turn off insertion sequence judgment.

Performance evaluation using gold standard data

When evaluating software performance using the E. coli gold-standard dataset, the paralogous threshold was set to 0.7 for all software tools that incorporated this parameter. For the S. enterica gold-standard dataset, the paralogous threshold was set to 0.9. Additionally, since both gold-standard datasets offered a “fully split paralogous gene clusters” mode as the reference matrix, PGAP2 was configured with the --exhaust_ortho option to ensure that no in-paralogs were included in the output.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.