Fig. 3: The data profiling of the Escherichia coli benchmark. | Nature Communications

From: PGAP2: A comprehensive toolkit for prokaryotic pan-genome analysis based on fine-grained feature networks

a The construction of the benchmark comprises six steps: merging clusters based on annotations (Steps 1 and 2), splitting paralogs based on collinearity (Step 3), classifying and correcting clusters based on best matching (Steps 4 and 5), and multiple verifications and manual curation to generate the complete version (Step 6). b The average protein sequence similarity across cluster types. Sample sizes: n = 14,360 in total (strict core, 2331; soft core, 732; shell, 2494; cloud, 8803). c The average semantic similarity of domain annotations across cluster types. The violin plots show the kernel density estimate of the semantic similarity distribution: the width represents the relative frequency within each cluster type, and diamond symbols represent the mean. Only clusters with annotations are included: n = 9872 in total (strict core, 2250; soft core, 672; shell, 2011; cloud, 4939). d The average nucleotide sequence identity of genes across cluster types, with the same sample sizes as in b. e A paired Wilcoxon signed-rank test, with no adjustment for multiple comparisons, indicates that the average genetic distance within clusters is significantly smaller than the genetic distance to the nearest cluster (p < 2.22e-16). Only clusters that have a nearest cluster are included: n = 13,860. Inner Dist is the average genetic distance within a cluster, Outer Dist is the genetic distance between a cluster and its nearest cluster, and “****” indicates p < 0.0001. In b, d, and e, the box plots depict the median (central line), the 25th and 75th percentiles (box bounds), outliers (gray points), and the mean (diamond symbols).
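The comparison in panel e can be illustrated with a minimal sketch of a one-sided paired Wilcoxon signed-rank test. The caption does not say which software or test variant the authors used, so everything here is an assumption: the function name `paired_wilcoxon_less` is hypothetical, the p-value comes from the normal approximation with a tie correction (reasonable for large samples), and the distances are synthetic values generated only for demonstration, not data from the benchmark.

```python
import math
import random


def paired_wilcoxon_less(x, y):
    """One-sided paired Wilcoxon signed-rank test (H1: x tends to be smaller than y).

    Assumption: normal approximation with a tie correction and no continuity
    correction. Returns (w_plus, z, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    # W+ is the sum of ranks belonging to positive differences (x > y).
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))  # lower tail, P(Z <= z)
    return w_plus, z, p_value


# Synthetic demonstration data (NOT from the paper): each hypothetical cluster
# gets an inner distance and a strictly larger outer distance.
random.seed(0)
inner = [random.uniform(0.00, 0.05) for _ in range(500)]
outer = [d + random.uniform(0.05, 0.50) for d in inner]

w_plus, z, p = paired_wilcoxon_less(inner, outer)
print(f"W+ = {w_plus}, z = {z:.2f}, p = {p:.3g}")
```

Because every synthetic inner distance is smaller than its paired outer distance, no difference is positive and the statistic W+ is zero, which drives the p-value far below conventional thresholds. The bound 2.22e-16 quoted in the caption is close to double-precision machine epsilon, so it may be a software reporting floor rather than an exact p-value; the caption itself does not say.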
