Fig. 5: The alignment quality of SARST2 and the efficiency of its first two filtering steps. | Nature Communications

From: SARST2 high-throughput and resource-efficient protein structure alignment against massive databases

a Pairwise sequence identities and TM-scores for SCOP family-level homologs computed by BLAST, SARST2, Foldseek, and TM-align. From each Qry400 SCOP family, 150 homolog pairs were randomly selected for alignment (60,000 pairs in total). The horizontal axis shows the sequence identity reported by BLAST; the left and right vertical axes show the average sequence identity and the average TM-score, respectively, from the structural alignments. BLAST was configured to detect distant relationships using BLOSUM45, word size 2, and an E-value cutoff of 108. SARST2 exhibited alignment quality comparable to that of Foldseek. Both reported substantially higher identities than BLAST, particularly in the low-homology range, and produced comparable TM-scores across all levels. TM-align yielded higher TM-scores in the low-homology range but much lower structure-alignment-based sequence identities overall, suggesting a weaker ability to distinguish homologs from proteins with incidental structural similarity.

b Distribution of sequence identities computed by the various methods for SCOP family-level homologs. SARST2 showed the highest peak in the sequence identity distribution among all methods. Fr-TM-align gave results nearly identical to those of TM-align and was omitted from panel (a) for clarity.

c Performance of SARST2 using only its first two steps. In SARST1 and iSARST, efficiency was limited by BLAST, which served as their external alignment engine. SARST2 overcame this bottleneck by replacing BLAST with in-house code implementing the proposed diagonal word-matching strategy and SARST structural sequence alignment. Grouped search and ML-based acceleration were disabled in this test. The precision at each recall level was obtained from a full SCOP-2.07 search using the Qry400 query proteins (n = 400).

The data for panels (a) and (b) are provided as a Source data file.
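Panel c reports precision at each recall level. As a rough illustration of how such values can be derived for a single query, the sketch below walks a ranked hit list and records the precision at the first rank where each recall level is reached. It is a minimal sketch under stated assumptions, not the SARST2 evaluation code: the function name `precision_at_recall_levels`, the boolean hit labels, and the choice to skip recall levels that are never reached are all hypothetical. How results were combined across the 400 queries is not specified in the caption and is not modeled here.

```python
from typing import List, Sequence, Tuple


def precision_at_recall_levels(
    ranked_is_homolog: Sequence[bool],
    total_homologs: int,
    recall_levels: Sequence[float],
) -> List[Tuple[float, float]]:
    """Return (recall_level, precision) pairs for one ranked hit list.

    ranked_is_homolog: hits in ranked order, True if the hit is a true homolog.
    total_homologs: number of true homologs of the query in the database.
    Precision is taken at the first rank where recall reaches each level;
    levels that are never reached are skipped (an illustrative choice).
    """
    if total_homologs <= 0:
        raise ValueError("total_homologs must be positive")

    levels = sorted(recall_levels)
    results: List[Tuple[float, float]] = []
    true_positives = 0
    next_level = 0

    for rank, is_homolog in enumerate(ranked_is_homolog, start=1):
        if is_homolog:
            true_positives += 1
        recall = true_positives / total_homologs
        # One hit can satisfy several recall levels at once.
        while next_level < len(levels) and recall >= levels[next_level]:
            results.append((levels[next_level], true_positives / rank))
            next_level += 1

    return results


# Hypothetical ranked list: 5 hits returned, 4 true homologs in the database.
hits = [True, False, True, True, False]
curve = precision_at_recall_levels(hits, total_homologs=4,
                                   recall_levels=[0.25, 0.5, 0.75, 1.0])
```

In this toy input the fourth homolog is never retrieved, so no value is reported for a recall of 1.0.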
