Abstract
The flood of protein structural Big Data is coming. With the belief that biotech researchers deserve powerful analysis engines to overcome the challenge of rapidly increasing computational demands, we are devoted to developing efficient protein structural alignment search algorithms to assist researchers as they push the frontiers of biological sciences and technology. Here, we present SARST2, an algorithm that integrates primary, secondary, and tertiary structural features with evolutionary statistics to perform accurate and rapid alignments. In large-scale benchmarks, SARST2 outperforms state-of-the-art methods in accuracy, while completing AlphaFold Database searches significantly faster and with substantially less memory than BLAST and Foldseek. It employs a filter-and-refine strategy enhanced by machine learning, a diagonal shortcut for word-matching, a weighted contact number-based scoring scheme, and a variable gap penalty based on substitution entropy. SARST2, implemented in Golang as standalone programs available at https://10lab.ceb.nycu.edu.tw/sarst2 and https://github.com/NYCU-10lab/sarst, enables massive database searches using even ordinary personal computers.
Introduction
This work aims to develop an efficient structural alignment search method to support rapid biotechnology research and development in this protein structural Big Data era. Over the past two decades, although structures deposited in the Protein Data Bank (PDB)1 increased rapidly, their expansion lagged behind that of sequence databases. While PDB comprised 100 thousand structures in 2014, the Universal Protein Resource (UniProt) contained 100 million sequences. Since then, the size of UniProt (represented by the UniRef90 dataset) has doubled about every two years2,3. Nevertheless, this growth gap is now closing due to recent astonishing applications of AI in protein structure prediction. In July 2022, Google DeepMind began predicting structures for all known sequences using AlphaFold24,5. To date, AlphaFold DB has released 214 million predicted structures6. Such progress also motivated PDB to incorporate predicted structures from reliable sources, expanding its database from hundreds of thousands to over one million proteins in 2022. How to efficiently identify homologous proteins via structural alignment search has become a major challenge in the face of these massive and rapidly growing structure databases.
Protein structural alignment is more time-consuming than sequence alignment. In the past, when data were limited, the cost was tolerable for most researchers. Now, facing hundreds of millions of structures, existing structural alignment search methods are increasingly inefficient. For reference, searching one query sequence against the current UniRef90 (172 million proteins; 75 GiB disk space) with the widely used sequence alignment search algorithm PSI-BLAST may take an hour with one typical CPU (estimated with a 3.33 GHz Intel Xeon processor based on Table 2 of ref. 3), and this cost may double every two years. Given that sequence alignment searches already consume substantial time and space, large-scale structural studies demand far more. Improving the efficiency of structural alignment may facilitate progress in many fields of protein science and technology, such as structure modeling7,8, proteomics9,10, protein–protein interactomes11,12, and drug discovery13,14.
The 1990s saw notable pairwise structure alignment algorithms, such as DALI15 and CE16. Though accurate, these pioneering algorithms took several seconds to align one pair of structures, which was too long for large-scale research. In the early 2000s, accurate and rapid algorithms that aligned several structures per second were proposed, such as FAST17 and TM-align18. However, they did not support database searching, which had to be implemented as one-against-all pairwise alignments between the query protein and all subject proteins in the target database. In the late 2000s, alignment search algorithms based on structural linear encoding quickly evolved, such as TOPSCAN19, YAKUSA20, 3D-BLAST21, and SARST (Structural similarity search Aided by Ramachandran Sequential Transformation)22. They could screen several thousand proteins per second. Though rapid, their accuracy was substantially lower than that of pairwise algorithms like FAST and TM-align22.
In 2009, we developed the integrated SARST web server (iSARST) with a filter-and-refine strategy22,23. Linear encoding was utilized to discard subject proteins dissimilar to the query from the hit list. Detailed alignment by a third-party refinement engine, e.g., FAST17 or TM-align18, was applied to reorder the hits according to their structural similarities to the query. iSARST ran with high speed while preserving the accuracy of these refinement engines23. However, its efficiency is far from sufficient to handle a database of hundreds of millions of structures. Using one 3.2-GHz CPU, iSARST took 30 s to retrieve 90% of known family-level homologs for one query from a target database of 34 thousand proteins23. At this rate, searching against the AlphaFold DB would take ~52 h. Using multiple CPUs might reduce the time to 10–20 h, but that is still costly.
Recently, versatile alignment methods have been published. MICAN-SQ enabled oligomeric structural alignment24. MADOKA25 and Foldseek26 supported parallel computation. MADOKA linearly encoded structures by the DSSP program27 into secondary structure element (SSE) sequences, which were aligned through dynamic programming (DP). It could compare ~480 pairs of structures per second using 32 Intel i9 CPU processors, equivalent to ~120 h for processing the AlphaFold DB. Foldseek, by applying deep learning through a Vector Quantized Variational Autoencoder, encoded tertiary interactions between residues of a protein into a 3Di structural string26, which substantially boosted alignment search performance. Moreover, by utilizing SIMD (Single Instruction, Multiple Data) instructions of modern CPUs in DP, Foldseek achieved an impressive speed ~1370 times higher than MADOKA. It could search the AlphaFold DB in two hours or less with 32 Intel i9 CPUs (see “Results”).
Reviewing previous works enlightened us that the linear encoding filter-and-refine strategy of iSARST remains promising for developing structural alignment search algorithms for massive databases. However, iSARST had some weaknesses: (1) it was written in PHP (Hypertext Preprocessor), an interpreted programming language slower than compiled languages; (2) it relied on inefficient external calls to third-party programs, including BLAST28 for structural string alignments and FAST/TM-align for structural comparisons; (3) its speedup was limited to a 3-fold increase due to the parallelization efficiency of its alignment engine BLAST; (4) it did not provide a standalone program.
Here, we propose the SARST2 algorithm. This filter-and-refine procedure utilizes amino acid and SSE sequences, linearly-encoded structural strings, atomic geometrics, and evolutionary information to perform structural alignment searches. Accelerated by decision tree (DT) and artificial neural network (ANN) machine learning, several word-matching and structural string comparison filters rapidly discard irrelevant hits. Subsequently, the weighted contact number (WCN)29 and position-specific scoring matrix (PSSM)-derived substitution entropy are applied to align the remaining hits with the query structure in a refined manner. SARST2 has been implemented into a self-contained standalone program in Golang, an efficient parallel computing language developed by Google. Evaluated with information retrieval (IR) using structural homologs determined by the SCOP database30,31, the alignment search accuracy of SARST2 (96.3%) was higher than FAST (95.3%), TM-align (94.1%), and Foldseek (95.9%). Notably, it ran several to thousands of times faster than state-of-the-art structure alignment programs and even outpaced the sequence alignment algorithm BLAST28. Tested with an answers-seeded AlphaFold DB using 32 Intel i9 processors, SARST2 completed a 100% answer-recalled search in 3.4 min using 9.4 GiB memory. Foldseek and BLAST took 18.6 and 52.5 min, using 19.6 and 77.3 GiB memory, respectively. Moreover, storing AlphaFold DB requires 59.7 TiB disk space, but SARST2’s grouped database formatting function reduces the requirement to only 0.5 TiB, compared to 1.7 TiB required by Foldseek. These capabilities allow researchers to search hundreds of millions of structures on ordinary personal computers. We sincerely hope SARST2 will help the community overcome the challenge of computing power for structural Big Data analysis and accelerate advances in biotechnology and medical science.
Results
Overview of the proposed algorithm
As illustrated in Box 1 and Fig. 1, SARST2 employs a filter-and-refine strategy, where coarse and fast filters precede accurate and slow alignment/scoring steps. After obtaining linearly-encoded information of the query and subject structures, four filters eliminate structurally irrelevant proteins from the subject pool. The remaining candidate homologs of the query are aligned and scored by two refinement procedures: (1) a synthesized DP (dynamic programming) step that considers the amino acid type (AAT), SSE, and WCN of each residue; (2) detailed structural similarity scoring by superimposition. The synthesized DP is equipped with a variable gap penalty (VGP) scheme based on the PSSM-derived residue substitution entropy of the query protein (see “Methods”).
First, sequence (seq) and structural information of the query and subject (Sbj) protein structures are extracted and encoded into text strings, according to which the subject structures are grouped, each group having a representative head. Second, four filters gradually remove subject groups, represented by their heads, that are unlikely homologs of the query: (1) a word-matching filter to rapidly count similar AA (amino acid) and SSE (secondary structure element) fragments between the query and subject group heads; (2) a DP (dynamic programming) filter to initially align and score the query and subject group heads by their structural strings; (3) a DP filter to adjust the alignments according to the SSE and AAT (amino acid type) at each residue position; (4) a quick TM-score filter to roughly compute structural similarities. After these filters, the subject pool expands by restoring the group members as individual subject proteins. Third, two refinement procedures produce the final hit list, in which subject proteins are ordered according to their structural similarities to the query. The hit list can be output to the screen or saved as an interactive HTML (hypertext markup language) document with structural superimpositions. Protein structure images in this figure were generated using PyMOL v2.0.6. ASize alignment size, CIF crystallographic information file, PDB Protein Data Bank, RMSD root-mean-square distance, TM template modeling, WCN weighted contact number.
Accuracy evaluations
The accuracies of SARST2 and several excellent alignment algorithms were assessed by information retrieval, a technique for evaluating data searching methods22,32,33,34. Dataset Qry400 (Supplementary Data 1) provided 400 query proteins for the algorithms to retrieve family-level homologs from the SCOP-2.07 target dataset (Supplementary Data 2). IR accuracy measures, recall and precision, were calculated by checking whether the retrieved subjects in the hit list were SCOP30 family-level homologs of the query. If a method supported database searching, it was used in that mode. Otherwise, one-against-all comparisons were performed to rank the subjects according to their similarity scores toward the query.
Despite being a sequence search method, BLAST28 was also evaluated because it was utilized as the structural string alignment engine in the former SARST program (SARST1) and the iSARST server. As shown in Fig. 2, SARST2 outperformed BLAST and other methods across all recall levels. Its average precision was 96.3%, exceeding iSARST (94.4%), FAST (95.3%), Foldseek (95.9%), and other algorithms.
a Results of information retrieval with Qry400 and SCOP-2.07 as the query and target datasets. Every query protein had a known number of family-level structural homologs in the target database. Based on the output hit list, precision measured how early and compactly homologs were retrieved (see “Methods”). Recall monitored the proportion of known homologs retrieved. The vertical axis indicates the average precision over 400 query proteins (n = 400). b Raw IR data. The average precision of the eleven recall levels is a standard accuracy measure for database searching22,32. SARST2 outperformed other methods in accuracy. The data of panel (a) are provided as a Source data file. clusteredDB clustered database, standardDB standard database.
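The eleven-level average precision mentioned above can be illustrated with a minimal sketch. It assumes the standard interpolated definition from the information retrieval literature (precision at recall level r is the best precision observed at any recall ≥ r); the exact definition used in this work is given in “Methods” and may differ in detail. The function name and the toy identifiers are hypothetical.

```python
def eleven_point_average_precision(ranked_hits, homologs):
    """Average interpolated precision over recall levels 0.0, 0.1, ..., 1.0.

    ranked_hits: subject IDs ordered best-first (the hit list).
    homologs:    set of IDs that are true family-level homologs of the query.
    Assumption: standard interpolated precision, not necessarily the exact
    formula of the paper's Methods section.
    """
    total = len(homologs)
    if total == 0:
        raise ValueError("query has no known homologs")

    # (recall, precision) measured each time a true homolog is retrieved.
    points = []
    found = 0
    for rank, hit in enumerate(ranked_hits, start=1):
        if hit in homologs:
            found += 1
            points.append((found / total, found / rank))

    precisions = []
    for level in (i / 10 for i in range(11)):
        # Interpolation: best precision at any recall >= this level.
        reachable = [p for r, p in points if r >= level - 1e-12]
        precisions.append(max(reachable) if reachable else 0.0)
    return sum(precisions) / len(precisions)
```

With this definition, a method that never reaches a given recall level contributes zero precision at that level, so failing to retrieve homologs is penalized as well as retrieving them late.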
Speed evaluations
Each alignment search program we applied provides parameters to adjust how many hits to retrieve. For instance, they all allowed users to set the maximum number of proteins in the hit list. Some also provided quality control cutoffs for discarding poorly scored hits, e.g., the pC-value in SARST2 (see “Methods”) and E-value in BLAST. Setting a smaller cutoff for these parameters would enhance the speed; however, fewer homologs would be retrieved, and the program might fail to reach high recall levels. Therefore, they were carefully tuned in IR experiments. At first, we fixed the hit-list size cutoff to be the target database’s size to ensure the potential to reach 100% recall. The quality control cutoff for each query was raised stepwise from $10^0$ and $10^1$ up to $10^8$ until 100% recall was accomplished. Then, the quality control cutoff was fixed at $10^8$, and the hit-list size cutoff rose from 50, 500, 5000, etc. For programs without such parameters, one-against-all comparisons were conducted.
All the assessed alignment search programs performed better with the quality control cutoff than the hit-list size cutoff (Supplementary Table 1), and their optimal performances are summarized in Fig. 3a. Using one CPU, SARST2 ran >3300 times faster than FAST and TM-align in database searching. Meanwhile, its precision was higher than those of these highly accurate pairwise alignment algorithms. SARST2 also ran faster than other database searching or batch computing methods, i.e., SARST1, iSARST, MADOKA, and Foldseek. Notably, it was even faster than BLAST. Foldseek, impressively faster than BLAST as well, exhibited a similar speed to SARST2 without parallelization. Nevertheless, SARST2 showed a higher speed enhancement when multiple CPUs were available.
a Summary of the non-parallelized running time of the algorithms evaluated by IR using the Qry400 and SCOP-2.07 datasets. Most database search algorithms, like the SARSTs and BLAST, provide quality control and hit-list cutoff parameters for users to decide the amount of retrieved homologs. Such algorithms were evaluated by adjusting one cutoff at a time. For each query, the running time of an algorithm was measured under the minimal cutoff setting that allowed the algorithm to recall all family-level homologs. The right part lists the alignment speeds for algorithms that support one-step pairwise alignment. Although SARST2 was primarily designed for database search, its pairwise alignment was also efficient, e.g., around 12 times faster than the accurate structural alignment method FAST and 2 times faster than the pairwise BLAST sequence alignment program bl2seq. b Speed of database search algorithms that support parallel computation. As the number of CPUs increased, their speeds improved. Each point represents the average of 10 replicate runs using 400 query proteins (n = 400). SARST2 with 32 i9 CPUs reached an average speed of 966,000 alignments per second. A full search of the 145,000-protein SCOP-2.07 took only 0.15 s. The same task took Foldseek and BLAST 1.07 and 3.62 s, respectively. c Relative speed plot showing the degree of enhancement by parallel computing. SARST2 achieved the highest speedup, demonstrating its high scalability. The data of panels (b) and (c) are provided as a Source data file. CPU Central Processing Unit, s second.
To test the efficiency of search algorithms that supported parallelization, we gradually increased the number of processors. Foldseek and BLAST ran fast with 1–2 CPUs, but their speed plateaued when the number was ≥4. The speed increment of SARST2 was remarkable (Fig. 3b, c). Using our 32-CPU machine, its speed could increase 10.6-fold, whereas Foldseek, MADOKA, and BLAST were accelerated by 2.1–3.0, 10.5, and 3.1-fold, respectively. Searching one query structure against SCOP-2.07 to reach 100% recall with 32 i9 CPUs, SARST2 took 0.15 s on average, approximately 7 and 24 times faster than Foldseek and BLAST, respectively.
Most assessed programs supported pairwise alignment without extra database formatting commands. For each of them, we made all-against-all pairwise alignments between the Qry400 and SCOP-2.07 datasets and computed the average alignment time. As shown in Fig. 3a, SARST2 had a pairwise speed >11 times higher than accurate pairwise alignment methods like FAST and TM-align. Compared with other database search methods, SARST2 was 408 and 325 times faster than Foldseek and BLAST. We speculated that these search methods might still perform a database-formatting procedure for a single subject protein before it could be aligned with the query. Before the NCBI BLAST+ version, the old BLAST package provided a pure pairwise alignment program called bl2seq. Without database formatting, bl2seq ran much faster than the modern BLAST program (blastp). Nevertheless, SARST2 was twice as fast as bl2seq. In summary, in both database searching and pairwise alignment, SARST2 outperformed state-of-the-art structural alignment methods and was even faster than the popular BLAST sequence alignment algorithm.
Accuracy-enhancing factors
Combining primary, secondary, and tertiary structural information for alignment was a key contributor to SARST2’s accuracy. SARST2 linearly encoded a protein structure in five forms: (1) the amino acid (AA) sequence, (2) the AAT sequence using a 5-symbol amino acid type alphabet we developed35,36, (3) the 4-symbol SSE sequence from the TM-align alphabet18, (4) the SARST sequence based on Ramachandran codes22, and (5) the WCN values we proposed to measure residue packing density29. These elements were synthesized through a weighted scoring scheme to describe local similarity $S_{ij}$ between residue i from the query and residue j from the subject protein,
where $w_x$ denotes the weight for element x, and $SM_x$ is the scoring matrix for alphabet x. The weights were empirically tuned. For $SM_{AA}$, we used the BLOSUMN provided by EMBOSS37. This synthesized similarity score was then used to perform DP alignment. The Needleman–Wunsch DP was used in this study, though the SARST2 program also supports Smith–Waterman.
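As an illustration of how a weighted per-residue score can drive a global DP alignment, the sketch below combines matrix lookups for symbolic encodings with a numeric WCN term and feeds the result into a Needleman–Wunsch recursion. Several parts are assumptions made for the example only: the dictionary layout, the toy match/mismatch matrices, the WCN term (taken here as one minus the absolute WCN difference), and the constant linear gap penalty. None of these are the tuned weights or matrices of SARST2.

```python
def toy_matrix(alphabet, match=1.0, mismatch=-1.0):
    """Hypothetical stand-in for a real scoring matrix such as SM_AA or SM_SSE."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def synthesized_score(query, subject, i, j, weights, matrices):
    """Weighted sum of per-element similarities for residue pair (i, j).

    query/subject: dicts mapping an element name to its per-residue encoding.
    The WCN term below is an assumed form, used only for illustration.
    """
    score = 0.0
    for name, matrix in matrices.items():
        score += weights[name] * matrix[(query[name][i], subject[name][j])]
    score += weights["WCN"] * (1.0 - abs(query["WCN"][i] - subject["WCN"][j]))
    return score


def needleman_wunsch_score(n, m, pair_score, gap=-1.0):
    """Global alignment score with a constant linear gap penalty (assumption)."""
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = max(prev[j - 1] + pair_score(i - 1, j - 1),  # align i with j
                         prev[j] + gap,                            # gap in subject
                         cur[j - 1] + gap)                         # gap in query
        prev = cur
    return prev[m]


def align(query, subject, weights, matrices, gap=-1.0):
    """Score a query/subject pair with the synthesized per-residue score."""
    n, m = len(query["WCN"]), len(subject["WCN"])
    return needleman_wunsch_score(
        n, m,
        lambda i, j: synthesized_score(query, subject, i, j, weights, matrices),
        gap)
```

Because every element contributes to a single per-residue score, the DP itself stays a plain two-dimensional recursion regardless of how many encodings are combined.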
Among the synthesized elements, WCN exerted a substantial effect on accuracy. As Table 1 shows, disabling the whole synthesized DP reduced precision by 1.9%, while disabling only WCN resulted in a 1.3% drop. Scoring matrices also enhanced accuracy. Many algorithms, like TM-align and MADOKA, aligned structures based on SSE with simple binary scores (e.g., 1 for match, 0 for mismatch). In contrast, SARST2 applied custom scoring matrices for SARST, SSE, and AAT alphabets derived from the BLOSUM algorithm38 (see “Methods”). Replacing $SM_{SSE}$ and $SM_{AAT}$ with binary scores caused precision to drop by 2.9% and 1.1%, respectively.
Variable gap penalty was supposed to improve alignment quality between distantly related homologs by penalizing gaps in SSE regions39. Here, we introduced an additional feature into VGP—suppressing gaps at conserved positions. After the last filtering DP alignment (Fig. 1), SARST2 generated a PSSM for the query protein. The Shannon entropy (H) was then applied to encode the query’s sequence conservation profile (C), as shown below,
where i denotes a residue position, a is one of the 20 AAs, and $p_i(a)$ stands for the frequency of a at position i in the PSSM. High entropy indicates poor conservation due to diverse substitutions. Entropies were normalized by Z-score, then transformed to conservation using β = 0.5 as an empirically selected base. The value of C(i), always ≥0, modulated the gap penalty for residue i in the final synthesized DP alignment. As Table 1 shows, disabling this VGP feature reduced precision by 0.6%.
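The entropy-to-conservation transformation described above can be sketched as follows. The sketch assumes that the Shannon entropy is computed with a base-2 logarithm, that the Z-score uses the population standard deviation over all positions of the query, and that conservation is obtained as β raised to the Z-score, which matches the statements that β is a base and that C(i) is never negative. These details, and the idea of multiplying a base gap penalty by C(i), are assumptions for illustration rather than the published formula.

```python
import math


def shannon_entropy(frequencies):
    """H = -sum p * log2(p) over non-zero frequencies (log base is an assumption)."""
    return -sum(p * math.log2(p) for p in frequencies if p > 0)


def conservation_profile(pssm_frequencies, beta=0.5):
    """Map per-position substitution entropy to a conservation value C(i).

    pssm_frequencies: one frequency vector per query position.
    Assumed form: C(i) = beta ** Z(H(i)), so low-entropy (conserved)
    positions get C > 1 and high-entropy positions get C < 1.
    """
    entropies = [shannon_entropy(col) for col in pssm_frequencies]
    mean = sum(entropies) / len(entropies)
    sd = math.sqrt(sum((h - mean) ** 2 for h in entropies) / len(entropies))
    if sd == 0:
        return [1.0] * len(entropies)  # flat profile: no modulation
    return [beta ** ((h - mean) / sd) for h in entropies]


def variable_gap_penalties(base_penalty, conservation):
    """Hypothetical modulation: scale a base gap penalty by C(i)."""
    return [base_penalty * c for c in conservation]
```

Under these assumptions a gap opened at a conserved position costs more than the base penalty, while a gap at a highly variable position costs less.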
Speed-enhancing factors
The high speed of SARST2 was mainly founded upon the filter-and-refine strategy. Specifically, the earliest word-matching step eliminated most non-homologs to reduce computation time. In SARST2, word-matching was done by a typical query-subject 2D matching table40, not on amino acid sequences but on AAT and SSE sequences. This innovation enabled fast identification of matching words near the diagonal of the table (see “Methods”), a pattern that often signals homologous alignment. Without this diagonal shortcut, SARST2’s running time would increase threefold (see Table 1, with each test repeated five times). Another speed enhancer was the grouped search. During filtering, only group heads were evaluated. Since the number of heads decreased after filtering, and each remaining head propagated its word-matching and DP results to group members, many redundant calculations were avoided. Disabling grouped search doubled the time cost. Machine learning (ML) also accelerated SARST2. A DT filter was used to quickly eliminate irrelevant subjects, while an ANN filter estimated the likelihood of homology and removed low-scoring candidates. Without ML, the running time increased by 5.5-fold.
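To make the idea of restricting word-matching to the neighborhood of the diagonal concrete, the sketch below compares a banded scan with a scan of the full query-subject table. It is only an illustration under stated assumptions: the word length `k`, the band half-width `band`, and the simple counting of identical words are hypothetical choices, and the actual shortcut algorithm is the one described in “Methods”.

```python
def full_table_matches(query, subject, k=3):
    """Count identical k-letter words over every cell of the 2D matching table."""
    count = 0
    for i in range(len(query) - k + 1):
        for j in range(len(subject) - k + 1):
            if query[i:i + k] == subject[j:j + k]:
                count += 1
    return count


def diagonal_word_matches(query, subject, k=3, band=2):
    """Count identical k-letter words only where |i - j| <= band.

    Assumption: a fixed band around the main diagonal; the cost is
    O(len(query) * band) instead of O(len(query) * len(subject)).
    """
    count = 0
    last_start = len(subject) - k
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        for j in range(max(0, i - band), min(last_start, i + band) + 1):
            if subject[j:j + k] == word:
                count += 1
    return count
```

With reduced alphabets such as SSE strings, chance matches far from the diagonal are frequent, so restricting the scan both saves time and removes part of that background.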
Efficiency of data storage
The SARST2 program can be executed with a pre-formatted target database, where structural data are stored economically, especially for residue coordinates. First, only Cα atoms are stored, significantly reducing space. Second, we found that the coordinate range in a PDB/CIF file (–999.999 to 9999.999) could be encoded using 24 bits, which saves more space than a conventional 32/64-bit float representation. Furthermore, SARST2 provides a parameter to round coordinates to one decimal place. These strategies allow users with limited storage resources to manage huge databases. For example, the raw AlphaFoldDB-2022 consumed 59.7 TiB disk space (counting only the CIF-format files), whereas a SARST2-formatted version required only 0.5 TiB, saving 99.2% of the space.
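The 24-bit claim can be checked with a short sketch. A coordinate in the range –999.999 to 9999.999 with three decimals takes one of 10,999,999 distinct values, which is below 2^24 = 16,777,216, so each coordinate fits in three bytes. The offset scheme and big-endian byte order below are assumptions chosen for the example; the actual on-disk layout of a SARST2 database is not specified here.

```python
OFFSET = 999_999          # shifts -999.999 to code 0 (units of 0.001)
MAX_CODE = 10_999_998     # code for 9999.999
assert MAX_CODE < 2 ** 24  # every code fits in 24 bits


def encode_coordinate(value):
    """Pack one coordinate into 3 bytes (hypothetical layout)."""
    code = round(value * 1000) + OFFSET
    if not 0 <= code <= MAX_CODE:
        raise ValueError(f"coordinate {value} outside -999.999..9999.999")
    return code.to_bytes(3, "big")


def decode_coordinate(blob):
    """Inverse of encode_coordinate."""
    return (int.from_bytes(blob, "big") - OFFSET) / 1000


def encode_ca_trace(coords):
    """Pack a list of (x, y, z) C-alpha coordinates: 9 bytes per residue."""
    return b"".join(encode_coordinate(v) for xyz in coords for v in xyz)
```

At nine bytes per residue, a Cα trace is 25% smaller than the same trace stored as three 32-bit floats (12 bytes), before any further rounding or compression.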
Performance tests with the AlphaFold Protein Structure Database
The SARST2 program is user-friendly and flexible. In addition to the query structure, users can specify (1) a pre-formatted target database, (2) a batch of subject files, or (3) folders containing subject files. Formatting a database reduces future search time and disk space. We evaluated SARST2’s large-scale searching and database formatting efficiency using AlphaFold DB through random sampling. Running time, memory usage, and disk consumption were tracked as the subsets increased in size. For comparison, we also tested Foldseek (structure-based) and BLAST (sequence-based) alignment search algorithms.
We applied the Qry400 dataset against target databases constructed by combining SCOP-2.07 family-level homologs with randomly sampled AlphaFoldDB-2022 subsets of increasing size, the largest containing 215 million structures. Every query had known SCOP family-level homologs, allowing us to seed answers into a massive dataset for IR evaluation. As shown in Fig. 4, time and memory usage grew with dataset size. Using 32 i9 CPUs, SARST2 searched the full AlphaFoldDB in 3.4 min, about 5.4× and 15.3× faster than Foldseek and BLAST (× denotes multiplication, i.e., times). Impressively, SARST2 used only 9.4 GiB of memory, while BLAST needed 77.3 GiB. Foldseek, when run with clustered database mode, was also efficient (19.6 GiB), but required 3.0× more memory without enabling that clustering (see Supplementary Table 2).
a Search time increased with dataset size. Using 32 CPUs, SARST2 completed a full-database query (215 million proteins) in 3.4 min, several folds faster than Foldseek and BLAST. Each point represents the average of 3 replicate runs using 400 query proteins (n = 400). b Memory usage of SARST2 and Foldseek (clusteredDB mode) was substantially lower than that of BLAST. c For database formatting, BLAST was 8.2× and 32.7× faster than SARST2 and Foldseek, respectively. Each point represents the average of 3 replicate runs using 400 query proteins (n = 400). Foldseek’s clusteredDB setup required longer preprocessing. See Supplementary Table 2 for the performance of Foldseek with its standard-mode database searching and formatting. d SARST2 and BLAST used memory efficiently during formatting. SARST2 used 1.6× more memory than BLAST, while Foldseek consumed 45.8× more. e SARST2 and Foldseek required 0.5 TiB to store the formatted AlphaFoldDB. However, Foldseek generated many intermediate files, increasing its total disk usage to 3.4× that of SARST2. f AlphaFoldDB subsets were sampled from 1/1024 to 1/1 of the full dataset, with each successive subset being approximately four times larger than the previous one. These results help to estimate how many resources the algorithms will require as they face increasingly large amounts of data in the future. Source data of panels (a)–(e) are provided as a Source data file. DB database, GiB gibibytes, min minute.
SARST2 and Foldseek required more time, memory, and disk space than BLAST to format the AlphaFoldDB subsets, as expected since BLAST only handles sequence data. Nonetheless, for the full AlphaFoldDB, SARST2’s memory usage (3.9 GiB) was close to that of BLAST (2.5 GiB) and far lower than that of Foldseek (113.2 GiB). SARST2’s formatting time was ~10× longer than BLAST’s for datasets >3.3 million proteins, but still less than half of Foldseek’s. The final file sizes of SARST2 and Foldseek databases were similar (~6× larger than BLAST’s), yet Foldseek’s use of temporary files increased peak disk space demand to 3.4× that of SARST2. SARST2 and BLAST did not generate temporary files during formatting.
Discussion
Using WCN29 to measure local structural similarities was an innovation in SARST2. Most structure alignment algorithms instead relied on geometric comparisons. For instance, TM-align performed superimposition after SSE-based DP and quantified local similarities by computing the distance between each residue i from the query and m from the subject protein18. FAST used a pair-graph DP to identify residue pairs with similar distances and evaluated the similarity between 5-residue fragments centered at residues i (i – 2 to i + 2) and m (m – 2 to m + 2), using several distance- and angle-based features from Cα and Cβ atoms17. In contrast, WCN relies solely on distances between Cα atoms. It estimates the packing density of i by summing over neighboring residues j, weighted by the reciprocal square of the i–j Cα distances (see “Methods”). Without superimposition or angle calculation, WCN is computationally efficient. To explain why it improves database searching accuracy, we suppose the key may be that it encodes the long-range interaction between residues. The hypothesis motivating our use of WCN was that similar structures might have similar packing patterns along the polypeptide. To explore it, we tested two other packing density measures, the contact number (CN)41 and closeness42. WCN and closeness outperformed CN in search accuracy (see Supplementary Table 3). Unlike CN, they do not apply a fixed distance cutoff to define neighbors, allowing contributions from distant residues. Long-range interactions are critical in protein structure prediction43,44; they may also be relevant to structure alignment. As most methods assess local similarities in strictly local terms, SARST2 may shed light on the protein structural alignment methodology.
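The difference between WCN and a cutoff-based contact number can be seen in a few lines. The sketch follows the description above, summing the reciprocal squared Cα–Cα distance over all other residues; the cutoff used for CN and the toy coordinates are assumptions for illustration, and any normalization applied in “Methods” is omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_contact_numbers(ca_coords):
    """WCN_i = sum over j != i of 1 / r_ij^2, using C-alpha coordinates only."""
    n = len(ca_coords)
    wcn = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            contribution = 1.0 / squared_distance(ca_coords[i], ca_coords[j])
            wcn[i] += contribution
            wcn[j] += contribution
    return wcn


def contact_numbers(ca_coords, cutoff):
    """CN_i = number of other residues within a fixed distance cutoff (assumed value)."""
    limit = cutoff ** 2
    return [sum(1 for j, other in enumerate(ca_coords)
                if j != i and squared_distance(point, other) <= limit)
            for i, point in enumerate(ca_coords)]
```

In the WCN sum every residue contributes, however distant, whereas CN discards everything beyond the cutoff; this is the property the discussion above associates with capturing long-range packing.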
The high accuracy of SARST2 (Figs. 2, 3) may be attributed to the quality of its alignments. Structural alignment methods are generally more effective than sequence-based approaches in detecting remote homology. As shown in Fig. 5a, for SCOP family-level homologs with low pairwise sequence identities (e.g., 1–20%), SARST2, Foldseek, and TM-align produced alignments with substantially higher sequence identities than BLAST. SARST2 and Foldseek exhibited comparable alignment quality, yielding similarly high identities in these low-homology regions. Figure 5b presents the distribution of pairwise sequence identities across methods. BLAST peaked at ~10% identity, suggesting limited sensitivity to distant relationships. TM-align, with an average IR precision of 94.1%, peaked at ~17%. SARST2 achieved the highest precision (96.3%) and peaked at ~22%, slightly above Foldseek (95.9%; ~21%).
a Pairwise sequence identities and TM-scores for SCOP family-level homologs computed by BLAST, SARST2, Foldseek, and TM-align. From each Qry400 SCOP family, 150 homolog pairs were randomly selected for alignment (total 60,000 pairs). The horizontal axis shows sequence identity by BLAST; the left and right vertical axes indicate average identity and TM-score, respectively, from structural alignments. BLAST was configured to detect distant relationships using BLOSUM45, word size 2, and an E-value cutoff of $10^8$. SARST2 exhibited alignment quality comparable to Foldseek. They reported substantially higher identities than BLAST, particularly in the low-homology range, and produced comparable TM-scores across all levels. TM-align yielded higher TM-scores in the low-homology range but much lower structure-alignment-based sequence identities overall, suggesting a weaker ability to distinguish homologs from proteins with incidental structural similarity. b Distribution of sequence identities computed by various methods for SCOP family-level homologs. SARST2 showed the highest peak in sequence identity distribution across all methods. Fr-TM-align showed results nearly identical to TM-align and was omitted from the plot (a) for clarity. c Performance of SARST2 using only its first two steps. In SARST1 and iSARST, efficiency was limited by BLAST, which served as their external alignment engine. SARST2 overcame this bottleneck by replacing BLAST with in-house code implementing the proposed diagonal word-matching strategy and SARST structural sequence alignment. Grouped search and ML-based acceleration were disabled in this test. The precision at each recall level was obtained from a full SCOP-2.07 search using Qry400 query proteins (n = 400). The data of panels (a) and (b) are provided as a Source data file.
To further evaluate alignment quality, we calculated TM-scores based on the alignments from each method. As shown in Fig. 5a, BLAST produced the lowest TM-scores for homologs with <20% identity. SARST2 and Foldseek showed comparable TM-scores across the full identity range. TM-align, however, yielded higher TM-scores in the low-identity region. This observation may reflect algorithmic differences. Unlike TM-align, which aligns proteins solely based on structural information18, SARST2 and Foldseek incorporate amino acid sequence features into their algorithms26; this design may reduce sensitivity to marginal structural similarities while enhancing selectivity for alignments more consistent with evolutionary relatedness. Importantly, structural similarity alone does not necessarily indicate homology. Marginal or local resemblance can result from convergent evolution or domain-level motifs shared by unrelated proteins45,46. Given that SARST2 yielded higher sequence identities in low-homology regions and demonstrated superior IR precision compared to TM-align in complex database searches (Figs. 2, 3), its alignments may better reflect evolutionarily meaningful relationships. This advantage is likely supported by its use of WCN and PSSM, two features linked to improved recognition of evolutionary relatedness28,29,47,48,49.
SARST1 and iSARST ran BLAST as their structural string alignment engine and were benchmarked against it22,23. Although SARST2 no longer calls BLAST, we included it in comparisons to show improvements. SARST1 and iSARST were slower than BLAST; SARST2 now outperforms it in both speed and precision. These gains largely stem from the first two filtering steps: word-matching and structural sequence alignment. Unlike BLAST, which uses amino acid sequences in both steps, SARST2 applies SSE and AAT alphabets for word-matching and aligns SARST sequences. This structure-based design enables more effective removal of non-homologous subjects. SARST2 further improves efficiency by using longer words derived from reduced SSE and AAT alphabets. This concentrates true matches near the diagonal, leading to the design of a shortcut matching algorithm that substantially limits the search space (see “Methods”). In addition to these algorithmic benefits, the Golang implementation also improves SARST2’s efficiency. As shown in Fig. 5c, even with only the first two filters enabled, SARST2 surpasses BLAST in precision and speed. Further gains from downstream refinement are reported in Supplementary Table 4. Golang is a cross-platform programming language. To facilitate broad accessibility, we compiled the current version of SARST2 (v2.0.30) into executables for Linux, macOS, and Windows, which are provided as Supplementary Software 1, 2, and 3, respectively.
Although the grouped search function doubled the speed of SARST2 (Table 1), it is still relatively rudimentary compared with the database clustering approach in Foldseek. In SARST2, subject proteins were grouped only when they shared the same length and SSE sequence (“Methods”), whereas Foldseek clustered proteins based on alignments. The average group size in SARST2 was only four proteins, which limited the impact of grouping. In future work, we plan to use SARST2 itself to align and cluster subject proteins; larger groups may enable greater speed improvements.
We are now witnessing the singularity of a protein structure Big Bang. During the peer-review of this article, Meta Platforms, Inc. released the ESM Metagenomic Atlas, a database of 772 million predicted structures, ~3.6 times larger than AlphaFold DB. In this rapidly expanding structural landscape, efficient and scalable structural alignment methods become increasingly critical. SARST2 offers a high-throughput solution that may support broad applications in structural biology and beyond. While BLAST is widely used in biological and medical research, SARST2 may offer complementary capabilities in structural analyses. Although the current implementation already exceeds BLAST in accuracy and speed, SARST2 remains under active development, with ongoing efforts to improve machine learning and refine algorithmic parameters. We look forward to seeing SARST2 contribute meaningfully to protein science and biotechnology by helping researchers navigate the challenges of the structural Big Data era. The program and several pre-formatted databases are available at https://10lab.ceb.nycu.edu.tw/sarst2 and https://github.com/NYCU-10lab/sarst (https://doi.org/10.5281/zenodo.17144647).
Methods
Hardware and software environments
The proposed algorithm was developed and evaluated on a personal computer running 64-bit CentOS 7 Linux, equipped with 16 Intel i9 5.2 GHz Hyper-Threading CPU cores (i.e., 32 logical processors), 125 GiB DDR5 RAM, a 2 TiB solid-state disk (SSD), and access to a 166 TiB Network Attached Storage (NAS) configured with RAID6. A 32 GiB swap partition was created on the SSD. The SARST2 programs were written in Golang (v1.22) using the Lo Library. C/C++ (gcc v4.8.5 with the glibc Library), Fortran (gcc v4.8.5 with the libgfortran Library), and PHP (v5.4.16 with the Zend Libraries) were used to compile or execute the software listed below when necessary. The amino acid sequence alignment search program blastp from the NCBI BLAST+ package (v2.11.0+)28 was utilized to represent the BLAST algorithm. In some pairwise experiments, the bl2seq of the classic BLAST package (v2.2.13) was used. Structure alignment programs FAST (v.2005)17, TM-align (release date: 2016 May 21)18, Fr-TM-align (v1.0)50, and MICAN-SQ (release date: 2019 Nov 27)24, and structure alignment search algorithms SARST22 (v1.0-PHP), iSARST23 (v1.5-PHP), MADOKA25 (v1.1.0), and Foldseek (release date: 2024 Jun 27) were applied in this work.
Sources of experimental datasets
Protein structure files were downloaded from the AlphaFold Protein Structure Database6 (AlphaFold DB; https://doi.org/10.1093/nar/gkab1061; accessed on 2022 Dec 31), the Structural Classification of Proteins30,31 (SCOP; v2), and the Protein Data Bank1 (PDB; accessed on 2022 Dec 31). AlphaFold DB contains protein structures predicted by AlphaFold24,5. The dataset we compiled, named AlphaFoldDB-2022, included 214,459,158 structures (available at https://10lab.ceb.nycu.edu.tw/sarst2). No structural predictions were performed in this study. SCOP is regarded as a gold standard for research that depends on well-defined clusters of homologous proteins. Based on automated matching and manual curation, it classifies protein domains hierarchically into classes, folds, superfamilies, and families according to structure, sequence, functional similarities, and evolutionary relationships30,51. A lower level in the SCOP hierarchy indicates a higher degree of homology. For instance, proteins in the same fold possess similar structural architecture and topology; proteins in the same family are closely related homologs with the same functions and clear evidence for their common evolutionary origin30,51.
SCOP homologous data were obtained from the SCOPe server (v2.07)31. After removing proteins whose homology classification came only from automated matching and had not yet been manually verified, 144,879 domain entities were obtained from 4022 families, which belonged to 1229 folds. This dataset is referred to as SCOP-2.07 (Supplementary Data 2). To prepare the materials for machine learning and evaluate the alignment algorithms by IR, we established two query datasets from SCOP-2.07 following these steps: (1) Consider only the families belonging to the major classes: all-alpha, all-beta, alpha/beta, and alpha+beta. (2) Calculate the number of proteins possessed by each family. (3) For each class, randomly select 150 families possessing ≥20 proteins. (4) From each selected family, randomly choose one protein as the representative query. (5) Split the 600 (150 × 4) representative query proteins into two datasets, 200 for training and 400 for testing. They are named Qry200 and Qry400 in this report (see Supplementary Data 1). The CATH protein structure classification database52 was also utilized for independent tests, as described in the subsection “Utilization of machine learning”.
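The five selection steps above can be sketched in a few lines. This is a minimal illustration, not the authors' script: the function name `build_query_sets`, the input layout (a dictionary from family id to class name and member list), and the random shuffle used for the final split are all assumptions made for the example.

```python
import random

MAJOR_CLASSES = ("all-alpha", "all-beta", "alpha/beta", "alpha+beta")

def build_query_sets(families, per_class=150, min_size=20, n_train=200, seed=0):
    """families: dict mapping family id -> (class name, list of protein ids).

    Returns (training queries, testing queries). Hypothetical helper.
    """
    rng = random.Random(seed)
    queries = []
    for cls in MAJOR_CLASSES:
        # Steps 1-3: keep families of this class with enough members, then sample.
        eligible = sorted(fid for fid, (c, members) in families.items()
                          if c == cls and len(members) >= min_size)
        chosen = rng.sample(eligible, per_class)
        # Step 4: one random representative per selected family.
        queries.extend(rng.choice(families[fid][1]) for fid in chosen)
    # Step 5: split the representatives (a random split is assumed here).
    rng.shuffle(queries)
    return queries[:n_train], queries[n_train:]
```

With the defaults, 4 × 150 = 600 representatives are drawn and split 200/400, matching the sizes of Qry200 and Qry400.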
PDB is the primary database for macromolecular structures determined by biophysical techniques such as X-ray crystallography, NMR spectroscopy, and Cryo-electron microscopy. The PDB dataset we prepared, PDB-2022, contained 713,183 experimentally determined structures (i.e., no predicted structures). It is pre-formatted in the SARST2 database format and available at https://10lab.ceb.nycu.edu.tw/sarst2.
All source databases listed above provided structural files in either PDB or mmCIF format. The implemented SARST2 programs supported both types of structural files. From either file type, the SARST2 programs extracted residue and atomic information based only on the ATOM records. Sequence sections such as SEQRES or pdbx_seq_one_letter_code and secondary structure sections such as HELIX or HELX_P were ignored.
Reduction of amino acid codes
Amino acid (AA) code reduction has been shown to enhance the performance of protein circular permutation site prediction and secondary structure prediction36,53. In this study, we encoded amino acid sequences using the standard 22-symbol AA alphabet (20 standard codes plus B for Asn/Asp and Z for Gln/Glu), as well as the reduced 3- to 6-symbol alphabets applied in our prior research35,36,53. The 3-symbol alphabet categorized AAs as hydrophobic (A, V, C, I, L, M, F, W), hydrophilic (N, D, E, K, Q, R), and neutral (G, S, P, T, H, Y)54. The 4-symbol alphabet grouped them as nonpolar (G, A, P, V, I, L, M, F, W), polar (S, T, C, N, Q, Y), acidic (D, E), and basic (K, H, R)55. The 5-symbol alphabet divided AAs into aromatic (F, Y, W), nonpolar aliphatic (G, A, V, I, L, M), polar uncharged (S, P, T, C, N, Q), positively charged (K, H, R), and negatively charged (D, E). The 6-symbol alphabet classified residues as aliphatic (G, A, V, I, L), hydroxyl/sulfur-containing (S, T, C, M), cyclic (P), aromatic (F, Y, W), basic (K, H, R), and acidic (N, D, E, Q)56. Among these, the 5-symbol alphabet showed the best performance in improving the efficiency of structural alignment search.
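As an illustration of the code reduction described above, the 5-symbol grouping can be written as a lookup table. The one-letter group labels (`R`, `A`, `P`, `K`, `D`) and the fallback symbol `X` are hypothetical choices for this sketch; the labels actually used by SARST2 are not given in the text.

```python
# Five groups as listed in the text; the label letters are assumptions.
AAT5_GROUPS = {
    "R": "FYW",     # aromatic
    "A": "GAVILM",  # nonpolar aliphatic
    "P": "SPTCNQ",  # polar uncharged
    "K": "KHR",     # positively charged
    "D": "DE",      # negatively charged
}
AAT5 = {aa: label for label, members in AAT5_GROUPS.items() for aa in members}

def reduce_sequence(seq, table=AAT5, unknown="X"):
    """Map an amino acid sequence onto the reduced alphabet."""
    return "".join(table.get(aa, unknown) for aa in seq.upper())
```

For example, `reduce_sequence("MKFDE")` maps Met to the aliphatic group, Lys to the positive group, Phe to the aromatic group, and Asp and Glu to the negative group.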
Determination of protein secondary structures
Protein backbone local conformations are commonly described using secondary structure types, such as helices and strands. The widely used DSSP (Definition of Secondary Structure of Proteins)27 algorithm classifies protein secondary structures into α-helix (H), 3₁₀-helix (G), π-helix (I), extended strand (E), isolated β-bridge (B), bend (S), hydrogen-bonded turn (T), and loop/coil (C). In many studies, this 8-symbol alphabet is reduced to a 3-symbol version, grouping H, G, and I as H (helices), E and B as E (extended strands), and all others as C. DSSP assigns SSEs based on hydrogen bonding patterns and geometric features such as interatomic distances and angles between specific atomic configurations. TM-align proposed a simpler method for assigning 4-symbol SSEs (helices, strands, turns, and coils) based solely on Cα distances across several consecutive residues18. We tested the 8-/3-symbol DSSP alphabets and the 4-symbol TM-align scheme. The latter yielded the highest database searching efficiency.
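The 8-to-3 reduction of the DSSP alphabet described above is a simple symbol mapping. A minimal sketch follows; the function name `reduce_dssp` is hypothetical, and any symbol outside H, G, I, E, and B (including blanks) is treated as coil.

```python
# H, G, I -> H (helices); E, B -> E (extended strands); everything else -> C.
DSSP8_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}

def reduce_dssp(ss8):
    """Collapse an 8-symbol DSSP string into the 3-symbol alphabet."""
    return "".join(DSSP8_TO_3.get(symbol, "C") for symbol in ss8)
```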
Linear encoding of protein tertiary structures
SARST linearly encodes protein structures using a 23-symbol Ramachandran code alphabet22. This alphabet was defined by partitioning the Ramachandran plot57 of residue backbone torsion angles (ϕ, ψ) into 22 regions. For each residue of interest i, ϕ is defined as the dihedral angle between the planes Cᵢ₋₁-Nᵢ-Cαᵢ and Nᵢ-Cαᵢ-Cᵢ, and ψ is that between Nᵢ-Cαᵢ-Cᵢ and Cαᵢ-Cᵢ-Nᵢ₊₁. N and C denote the backbone nitrogen and carbonyl carbon atoms, respectively. When a residue lacked the atoms necessary to compute either angle, it was encoded as X, the 23rd SARST code.
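The torsion angles that feed the Ramachandran encoding can be computed from four atomic positions each. The sketch below uses the common atan2 formulation with the IUPAC sign convention; it only covers the angle calculation, since the boundaries of the 22 Ramachandran regions are not given here. The function names and the use of `None` for missing atoms are assumptions of this example.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) between planes p0-p1-p2 and p1-p2-p3."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)   # normals of the two planes
    x = _dot(n1, n2)
    y = _dot(b1, _cross(n1, n2)) / math.sqrt(_dot(b1, b1))
    return math.degrees(math.atan2(y, x))

def phi_psi(c_prev, n, ca, c, n_next):
    """Return (phi, psi); an angle is None when a required atom is missing."""
    phi = dihedral(c_prev, n, ca, c) if None not in (c_prev, n, ca, c) else None
    psi = dihedral(n, ca, c, n_next) if None not in (n, ca, c, n_next) else None
    return phi, psi
```

A residue for which either angle comes back as `None` (for instance, the first residue, which has no preceding C atom) would correspond to the X code.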
Grouping of subject structures in the SARST2-format target database
The grouping principle in SARST2 was that subject structures with the same SSE sequence and length were assigned to the same group, with the first member designated as the group head by the formatdb program. Under this strict criterion, the average group size was small. We hypothesized that enlarging the group size could further accelerate database searching. To this end, we reduced the 4-symbol TM-align SSE alphabet to only two symbols: H and E were merged, as were T and C. Subject proteins sharing the same AAT sequence as a group head were also included in that group. These modifications increased the average group size by ~60% for the SCOP-2.07 dataset.
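The SSE-based part of this grouping can be sketched with a dictionary keyed by the SSE string. This is an assumption-laden illustration: the merged labels `S` and `L`, the function name, and the input layout are hypothetical, and the additional AAT-based inclusion rule is not covered.

```python
from collections import defaultdict

# Two-symbol reduction: H and E merged, T and C merged (labels are assumptions).
MERGE2 = str.maketrans({"H": "S", "E": "S", "T": "L", "C": "L"})

def group_subjects(subjects, two_symbol=False):
    """subjects: iterable of (name, sse_sequence) pairs.

    Returns a list of groups; the first member of each group acts as its head.
    Identical key strings necessarily have identical lengths.
    """
    groups = defaultdict(list)
    for name, sse in subjects:
        key = sse.translate(MERGE2) if two_symbol else sse
        groups[key].append(name)
    return list(groups.values())
```

With the strict 4-symbol key, `HHEC` and `EEHT` fall into different groups; with the 2-symbol key they collapse to the same string and share one group, which is how the reduction enlarges groups.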
Scoring matrices
The scoring matrix for SARST sequence alignment, the SARSTSM22, was produced based on the BLOSUM algorithm38, which was also applied here in an iterative manner to generate the scoring matrices for SSE and AAT sequences, as described below.
1. Randomly select a million pairs of family-level homologous proteins from the SCOP database. None of the selected pairs belonged to the SCOP families of the Qry400 proteins.
2. Align the selected homologs by the EMBOSS needle37, a Needleman–Wunsch global alignment program, with BLOSUM45 as the scoring matrix.
3. Transform the aligned amino acid sequences to the required sequence type, say, the SSE sequence. Now, we have well-aligned SSE sequences of the homologs.
4. The scoring matrix is generated using Henikoffs’ BLOSUM formula38:
$${{{\rm{Score}}}}_{{ij}}={f}_{s}\times {\log }_{2}({b}_{{ij}}/{e}_{{ij}})$$(4)
where i and j represent an aligned pair of SSEs, \({b}_{ij}\) and \({e}_{ij}\) are the observed and expected probabilities, respectively, of the occurrence of pair ij, and \({f}_{s}\) is a scaling factor. Following Henikoffs’ empirical setting, \({f}_{s}\) was set to 2.
5. Evaluate the scoring matrix by performing an IR experiment with Qry200 and SCOP-2.07. If the average precision and speed registered by this matrix show an insignificant difference from those by the previous matrix, stop the iteration.
6. Repeat the above procedure from Step 2. Starting from this new iteration, use the SARST2 program and the latest scoring matrix to align the homologs.
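Step 4 of the procedure above can be illustrated with a small log-odds calculation over aligned symbol pairs. The sketch follows the usual BLOSUM-style estimate of expected pair probabilities from marginal symbol frequencies, which is an assumption here; it also skips the sequence clustering and weighting of the full BLOSUM method, and the function name and gap symbol are hypothetical.

```python
import math
from collections import Counter

def log_odds_matrix(aligned_pairs, scale=2.0):
    """aligned_pairs: iterable of (x, y) aligned strings of equal length; '-' is a gap.

    Returns {(a, b): integer score} for every observed unordered pair (a <= b).
    """
    counts = Counter()
    for x, y in aligned_pairs:
        for a, b in zip(x, y):
            if a != "-" and b != "-":
                counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}   # b_ij
    # Marginal probability of each symbol, derived from the pair probabilities.
    p = Counter()
    for (a, b), prob in observed.items():
        if a == b:
            p[a] += prob
        else:
            p[a] += prob / 2
            p[b] += prob / 2
    scores = {}
    for (a, b), b_ij in observed.items():
        e_ij = p[a] * p[b] if a == b else 2 * p[a] * p[b]        # expected e_ij
        scores[(a, b)] = round(scale * math.log2(b_ij / e_ij))
    return scores
```

Pairs that never occur in the alignments receive no entry in this sketch; a production matrix would need a policy for them, such as a fixed minimum score.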
Word-matching strategy
SARST2 utilized code-reduced alphabets rather than large alphabets in word matching. Specifically, it used the 5-symbol AAT35,36 and 4-symbol TM-align SSE18 alphabets instead of the 20-symbol AA and 8-symbol DSSP SSE27 alphabets. The core approach followed the 2D word table design used in the FASTA algorithm40,58. Both query and subject protein sequences were broken into words of w residues. As illustrated in Supplementary Fig. 1, a smaller alphabet allows a larger w for matching, which may improve homology specificity. The math behind this is exemplified below,
1. Take the 5-symbol AAT and 20-symbol AA alphabets for demonstration. For simplicity, let all symbols occur with equal frequency, meaning the occurrence of each AAT code is 20% (i.e., 100% ÷ 5), and that of each AA code is 5% (i.e., 100% ÷ 20).
2. Let nsbj denote the number of residues of the subject protein. Assume nsbj = 401.
3. The theoretical occurrence of an AAT and AA word of w residues is \((20\%)^{w}\) and \((5\%)^{w}\), respectively.
4. Let w = 2, which is used both in the FASTA58 and BLAST28 algorithms. Given a query AAT word, the expected number of its matches in the subject sequence would be \((20\%)^{w}\) × (nsbj − w + 1), which equals 16.0, while that for a given query AA word would be \((5\%)^{w}\) × (nsbj − w + 1), which equals 1.0.
5. Let w be bigger, say 4. The expected number of matches for the AAT word would be ~0.6, while that for the AA word would be almost zero (2.5 × 10⁻³).
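The arithmetic in the steps above can be reproduced directly, under the same simplifying assumption of equally frequent symbols. The function name is hypothetical.

```python
def expected_word_matches(alphabet_size, w, n_sbj):
    """Expected matches of one query word of w residues in a subject of n_sbj residues,
    assuming every symbol of the alphabet is equally likely."""
    return (1.0 / alphabet_size) ** w * (n_sbj - w + 1)

for w in (2, 4):
    aat = expected_word_matches(5, w, 401)   # 5-symbol AAT alphabet
    aa = expected_word_matches(20, w, 401)   # 20-symbol AA alphabet
    print(f"w={w}: AAT {aat:.4f}, AA {aa:.4f}")
```

The values agree with the text: about 16 and 1 matches for w = 2, and about 0.64 and 0.0025 for w = 4.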
Using a larger word size improved speed, as fewer words would be processed. Moreover, we developed an algorithm to quickly identify the matches lying close to or in parallel with the diagonal of the matching table. This shortcut algorithm performed better with a big word size than with small ones, as illustrated in Supplementary Fig. 2 and described below,
1. The original search space of the subject sequence is [1, nsbj].
2. Take the starting word of the query sequence and find the position of its first occurrence in the subject sequence’s search space (denoted by psbj).
3. Cut the subject sequence’s search space to [psbj + k, nsbj], where k is an empirical factor for cutting more of the search space and was set as 1 in this work.
4. Repeat steps 2 to 3 with the next word of the query sequence.
In addition to shrinking the search space, this algorithm significantly accelerated the matching because it attempted only one match on the subject sequence for each query word. The specificity of each match was therefore important: when the word size was small, nonspecific short hits could divert the matching path away from the diagonal. We found that words of 5–7 residues worked well with the 5-symbol AAT and 4-symbol SSE alphabets for most query proteins.
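The shrinking search space of the diagonal shortcut can be sketched with plain string searching. This is a simplified illustration rather than the SARST2 implementation: positions are 0-based, the function names are hypothetical, and leaving the search space unchanged when a query word has no match is an assumption, since that case is not specified above.

```python
def words(seq, w):
    """All overlapping words of w residues, in sequence order."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def shortcut_matches(query, subject, w=5, k=1):
    """Return (query_pos, subject_pos) pairs found by the shrinking-window scan."""
    start = 0          # left edge of the subject search space
    hits = []
    for q_pos, word in enumerate(words(query, w)):
        p = subject.find(word, start)   # first occurrence inside the search space
        if p == -1:
            continue                    # assumption: no match leaves the space unchanged
        hits.append((q_pos, p))
        start = p + k                   # cut the search space to [p + k, end]
    return hits
```

For a query embedded in a subject at offset 2, consecutive query words match at consecutive subject positions, so all hits lie on one diagonal and each word costs a single short search.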
Most alignment methods use match extension to refine scores28,40,58; however, such extension was time-consuming. To save time, SARST2 combined complete word matching (Supplementary Fig. 1) and the proposed diagonal shortcut matching (Supplementary Fig. 2), summing the resulting match scores. A subject with a higher sum should more likely be homologous to the query.
Application of the weighted contact number
We have developed the WCN residue measure to describe the packing density of a protein structure29. In this work, we modified the original formula to include a weighting factor wij, yielding a more generalized form, as shown below,
where i denotes the residue of interest in a protein of n residues, and j represents any other residue in the same protein. The wij is added to weight each residue pair i and j. WCN was typically normalized by the Z-score equation29,35,36, which was also applied in this study.
A residue with many close neighboring residues would have a high WCN. We discovered that high-WCN residues are generally conserved in evolution29, and a circular permutation occurring at such residues would greatly affect the surrounding residues and, therefore, be deleterious to the stability of the permutant35,36. In the present study, we tested (1) wij = 1 as usual, (2) wij = mi, (3) wij = mj, and (4) wij = mi × mj, where m stands for the molecular weight of the amino acid of a residue. Among them, wij = mj outperformed the others and was applied in the SARST2 algorithm.
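A minimal sketch of the weighted contact number follows. It assumes the inverse-square-distance form of the original WCN definition, with each term multiplied by the weight wij = mj, one representative coordinate per residue (for example, Cα), and population Z-score normalization; the function names are hypothetical.

```python
import math

def weighted_contact_numbers(coords, masses=None):
    """coords: one (x, y, z) per residue; masses: weights m_j (defaults to 1 for all).

    WCN_i is taken as the sum over j != i of m_j / r_ij**2 (assumed form).
    """
    n = len(coords)
    masses = masses if masses is not None else [1.0] * n
    wcn = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i != j:
                total += masses[j] / math.dist(coords[i], coords[j]) ** 2
        wcn.append(total)
    return wcn

def z_scores(values):
    """Z-score normalization; returns zeros when all values are identical."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values] if sd > 0 else [0.0] * len(values)
```

Residues surrounded by many close, heavy neighbors receive the highest values, which is the packing-density behavior described above.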
Variable gap penalty
Two gap penalties were applied in every DP alignment step of SARST2, one for gap opening (Go) and the other for extension (Ge). As stated in Results, besides the conventional concept that gaps should be suppressed in regular SSE segments, we proposed that VGP could be implemented by suppressing gaps in conserved residue positions. The PSSM algorithm of PSI-BLAST28 (applied with only one round in this work) and the Shannon information entropy were used to encode the sequence conservation profile (denoted by C; see Eqs. (2) and (3)) of the query protein. The SSE segmentation and conservation information of the residue of interest (i) of the query was integrated into a variable gap penalizing scheme shown below,
where o and e denote opening and extension, respectively, \(g\) is the basic penalty, f is a scaling factor for the conservation score, and W(ssei, l) is the weighting function for the SSE segment with l residues to which i belongs.
In this work, fo and fe were set to the same value, and so were the Wo and We. They can be individually optimized in the future. The \({g}_{{{\rm{o}}}}\) and \({g}_{{{\rm{e}}}}\) for each step are listed in Supplementary Table 5. The Wo and We functions were made for helices, strands, and turns; for coils, their values were set to 1. See also Supplementary Table 5 for details.
Structural similarity score between proteins
This work determines the structural similarity between proteins using the TM-score18. After structural superimposition based on a DP alignment, the TM-score between proteins was computed as follows:
where n denotes the number of residues, qry refers to the query protein, nali means the number of aligned residue pairs between the query and subject proteins, and da is the distance between the ath aligned pair of residues. The d0() is an empirical normalization function that makes the TM-score independent of protein size.
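The quantities named above can be combined as in the widely used TM-align form of the score, sketched here. Two points are assumptions of this example rather than statements from the text: the explicit d0 expression is the standard TM-align one, and the 0.5 Å floor for short proteins is a common safeguard. The search over superpositions is omitted; the function only scores a given set of aligned-pair distances.

```python
def d0(n):
    """Length-dependent distance scale (standard TM-align expression, assumed here)."""
    if n <= 15:
        return 0.5
    return max(0.5, 1.24 * (n - 15) ** (1.0 / 3.0) - 1.8)

def tm_score(distances, n_qry):
    """distances: d_a for each of the n_ali aligned residue pairs (in angstroms).

    Normalized by the query length n_qry, as described in the text.
    """
    scale = d0(n_qry)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / n_qry
```

A perfect superposition of all query residues gives 1.0, and each aligned pair separated by exactly d0 contributes half of its maximum.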
TM-score was initially proposed in TM-align18 and refined in Fr-TM-align50. Since its calculation depends on the alignment between structures, different alignment methods may yield different TM-scores for the same proteins. Based on the pairwise alignment data from Fig. 5b, we analyzed the distribution of TM-scores computed by SARST2, TM-align, and Fr-TM-align. Because the TM-align version we used had already incorporated the TM-score algorithm of Fr-TM-align, the results of TM-align and Fr-TM-align were nearly identical. Supplementary Fig. 3 illustrates that the TM-scores computed by SARST2 were slightly lower than those computed by TM-align, but their distributions were similar. The Pearson correlation coefficient between the TM-scores calculated by the two methods was 0.970. Moreover, comparing the TM-scores SARST2 computed with SCOP family- and superfamily-level homologs, we found that a TM-score ≥ 0.7 may indicate family-level homology between proteins.
Computation of the pC-value, a score quality control factor
In this work, we defined a confidence score (Conf) to describe how likely a hit subject is a structural homolog of the query protein. In addition, the negative logarithm with base two of Conf, i.e., the pC-value, was applied as a quality control factor for eliminating hit subjects with low structural similarities to the query, as shown below.
where S denotes the structural string similarity score computed by the synthesized DP between the subject and query proteins, and S′ is S being normalized. The theoretical minimum of S, Smin, is calculated according to the scoring matrices applied in the synthesized DP such that every query residue is aligned with a residue that produces the lowest score (see Eq. (1)). The theoretical maximum of S, Smax, is calculated by aligning the query structural string to itself. TM-score, defined above, quantifies structural similarity based on structural superimposition. The p-score is a probability score obtained by ANN, determining whether a subject is a family-level structural homolog of the query protein (see the next subsection).
The typical range of S′, TM-score, and p-score was [0, 1]; thus, the Conf score also fell into this range. In the rare cases where Conf was lower than 0 or larger than 1, it was set to 0 or 1, respectively. Both Conf and the pC-value could be used to remove poorly scored subjects, and the SARST2 standalone program implements both a Conf score threshold and a pC-value cutoff. In contrast with the Conf score, the pC-value describes how unlikely a subject is to be a structural homolog of the query protein; a large pC-value implies poor structural similarity between the subject and the query. The pC-value was intended as a quality control parameter analogous to BLAST’s E-value28, for which a high value implies poor sequence alignment quality between proteins. However, it is noteworthy that the E-value has a solid statistical meaning, namely the expected number of hits with similar scores (alignment qualities) that could be found in the target database by chance alone, whereas the pC-value is simply the negative binary logarithm of the confidence score. A pC-value of x means the confidence score of a hit is \(1/{2}^{x}\).
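The relation between the two quality measures is easy to state in code. The sketch takes Conf as given (its own formula is not reproduced here), applies the clamping to [0, 1] described above, and returns infinity for Conf = 0, which is an assumption about how that edge case might be represented.

```python
import math

def pc_value(conf):
    """Negative base-2 logarithm of the confidence score, after clamping to [0, 1]."""
    conf = min(1.0, max(0.0, conf))
    return math.inf if conf == 0.0 else -math.log2(conf)
```

A hit with Conf = 0.25 has a pC-value of 2, so a pC-value cutoff of 2 is equivalent to requiring Conf ≥ 0.25.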
Utilization of machine learning
The integrated machine learning system35 we developed was utilized here. The applied ML algorithms included a decision tree and an artificial neural network with backpropagation, which had been used to predict circular permutations35,36 and secondary structures53,59. Applying ML in SARST2 accelerates filtering by classifying whether a hit subject is a family-level homolog of the query protein, i.e., a binary classification. The features applied in SARST2 fell into five categories: (1) basic properties of the subject, e.g., protein size; (2) relative properties of the subject, e.g., subject-to-query size ratio; (3) similarity scores from word matching; (4) similarity scores from DP alignments; (5) structural similarity scores, e.g., TM-score and root-mean-square distance. As illustrated in Fig. 1, four ML filtering gates were established for these steps: (1) word matching; (2) the initial DP for SARST sequences; (3) the DP for SSE plus AAT sequences; (4) the final TM-scoring. The number of applied features gradually increased across these stages. See Supplementary Table 6 for details.
Every ML model was trained with Qry200 and tested with Qry400. Neither the Qry200 proteins nor their family-level homologs overlapped with Qry400. Each Qry200 query was compared with all structures of the SCOP-2.07 dataset. Family-level homologs of the query were classified as positive cases, while all others were negative cases. The number of positive cases obtained with Qry200 was 24,462, and that of negative cases was 28,951,338. To handle this class imbalance, we applied random sampling to construct multiple balanced subsets and trained an ensemble of small ML models for the binary classification, as follows,
1. Make a training subset by randomly choosing 20,000 cases from the positive set and 20,000 from the negative set.
2. Compute the feature values for each case.
3. Establish a small DT model for this training subset.
   (1) The classification branch of the CART algorithm (Classification And Regression Tree) is applied with Gini’s diversity index60.
   (2) The Gini index is an impurity measure for tree building. If ppos and pneg represent the probabilities of a subject being classified into the positive and negative sets, the index is 1 − ppos² − pneg².
4. Establish a small ANN model for this training subset.
   (1) The ANN model comprises an input layer with ni nodes, two hidden layers, and an output layer with one node, where ni equals the number of features.
   (2) The first and second hidden layers contain ni × 1.5 and \(\sqrt{{n}_{{{\rm{i}}}}\times 1.5}\) nodes.
   (3) In the training process, the output answer for a positive case is 1, and that for a negative case is 0.
5. Repeat Steps 1–4 for m times, establishing m small DT and m small ANN models.
6. In SARST2 database searching, for each hit, use the small DT models as a filter to determine whether it can remain in the subject pool.
   (1) Calculate the feature values of the hit and input them into the small models.
   (2) If any model classifies the hit as a positive case, it remains in the pool and goes to Step 7; otherwise, it is discarded.
7. For each hit remaining in the subject pool, compute its probability score of being a family-level homolog of the query, i.e., the p-score in Eq. (10).
   (1) Input the feature values of the hit into the m small ANN models.
   (2) Average the m output values of the small models to make the final output. This final output is defined as the p-score for the hit.
   (3) The output of each small ANN model is a real number ∈ [0, 1]; hence, the p-score also has a minimum of 0 and a maximum of 1.
8. Based on Eqs. (10)–(12), calculate each hit’s confidence score and pC-value.
9. Update the hit list by sorting the hits according to the confidence score from high to low (meaning, the pC-value from low to high).
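The ensemble logic in the steps above can be sketched without committing to a particular ML library. In this illustration the trained models are represented as plain callables, which is an assumption, as are all function names and the rounding of hidden-layer sizes to whole nodes.

```python
import math
import random

def gini(p_pos):
    """Gini impurity for a two-class node: 1 - p_pos^2 - p_neg^2."""
    p_neg = 1.0 - p_pos
    return 1.0 - p_pos ** 2 - p_neg ** 2

def hidden_layer_sizes(n_features):
    """Node counts of the two hidden layers (rounding is an assumption)."""
    first = n_features * 1.5
    return round(first), round(math.sqrt(first))

def balanced_subset(positives, negatives, size, rng):
    """Step 1: draw equally many positive and negative cases."""
    return rng.sample(positives, size), rng.sample(negatives, size)

def keep_hit(features, dt_models):
    """Step 6: a hit stays in the pool if any small DT model calls it positive."""
    return any(model(features) for model in dt_models)

def p_score(features, ann_models):
    """Step 7: average the outputs (each in [0, 1]) of the small ANN models."""
    outputs = [model(features) for model in ann_models]
    return sum(outputs) / len(outputs)
```

Keeping a hit when any tree votes positive makes the filter permissive, so a homolog is discarded only when every small model rejects it; the averaged ANN output then provides the graded score used for ranking.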
Trained with Qry200 and tested with Qry400, SARST2 outperformed state-of-the-art methods like TM-align, MICAN-SQ, and MADOKA (compare Supplementary Table 7 with Fig. 2). Since the training and testing sets comprised different SCOP families, the evaluation may be considered independent. Nevertheless, we further evaluated SARST2 with more difficult cases. CATH, another structural classification database established through automated matching and manual inspections52, classified proteins hierarchically into Classes, Architecture, Topology, and Homology. SCOP and CATH had large differences in domain definitions and partitioning of the fold space across classification levels. We obtained the 40% identity non-redundant subset from the CATH website, or nrCATH40 in brief, as the independent test dataset (Supplementary Data 3). Moreover, this dataset was highly challenging due to the low sequence identity between homologs. Supplementary Table 7 demonstrates that, despite decreased precision, SARST2 still performed well on nrCATH40. The precision of Foldseek and TM-align decreased to the same level (~72%) under this challenge. Foldseek was the most accurate, with a precision 0.89% higher than SARST2, which slightly outperformed TM-align. SARST2 was the fastest, running 1.37× faster than Foldseek. For reference, the precision of BLAST (~26%) declined much more significantly than that of SARST2, Foldseek, and TM-align. SARST2 and Foldseek were superior in speed, running 2.49× and 1.82× faster than BLAST, respectively. SARST2 remained robust, even under these challenging independent tests. The final SARST2 program was then trained by combining Qry200 and Qry400 into a single query dataset. This final version was used for all other experiments described in this report, such as those shown in Figs. 2–5.
Performance evaluations by information retrieval (IR)
The Qry400 and SCOP-2.07 datasets were used as the query and target datasets in most of our IR experiments. For each query protein in Qry400, a hit list of candidate homologs was retrieved from SCOP-2.07 using the assessed alignment method. The accuracy of the search method was defined by how densely the known SCOP family-level homologs of the query were concentrated near the top of the hit list. Let Ntotal denote the total number of known homologs in the target database, and Nret the number of retrieved homologs counted from the top of the hit list to a specified position Ni. IR parameters, recall (r) and precision (P)34, were computed as follows:
$$r=\frac{{N}_{{\rm{ret}}}}{{N}_{{\rm{total}}}},\qquad P=\frac{{N}_{{\rm{ret}}}}{{N}_{i}}$$
These metrics exhibit an inverse relationship: increasing recall often decreases precision. To more intuitively evaluate the overall accuracy of a method, we computed the eleven-point average precision32 (referred to as “average precision” in this report for simplicity),
$${\rm{Average}}\,{\rm{precision}}=\frac{1}{11}\sum _{k=0}^{10}P(k/10)$$
where P(k/10) is the precision at a recall of k × 10%, and P(0), the precision at 0% recall, was calculated as the precision when the first known homolog is retrieved (i.e., Nret = 1)22,32.
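The IR metrics described above can be computed from a ranked hit list of true/false labels. The sketch reads the precision at each recall level at the first rank where that recall is reached, and assigns zero precision to recall levels that are never reached; both choices are assumptions of this example, as are the function names.

```python
def precision_recall_points(hit_is_homolog, n_total):
    """hit_is_homolog: booleans in ranked order; n_total: known homologs in the database.

    Returns (recall, precision) at every rank where a homolog is retrieved.
    """
    points, n_ret = [], 0
    for rank, is_homolog in enumerate(hit_is_homolog, start=1):
        if is_homolog:
            n_ret += 1
            points.append((n_ret / n_total, n_ret / rank))
    return points

def eleven_point_average_precision(hit_is_homolog, n_total):
    points = precision_recall_points(hit_is_homolog, n_total)
    if not points:
        return 0.0
    precisions = [points[0][1]]          # P(0): precision at the first homolog
    for k in range(1, 11):
        level = k / 10
        reached = [p for r, p in points if r >= level - 1e-12]
        precisions.append(reached[0] if reached else 0.0)
    return sum(precisions) / 11
```

For a hit list `[True, False, True, False]` with two known homologs, the precision is 1.0 up to 50% recall and 2/3 beyond it, giving an average precision of about 0.85.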
Parameter optimization for database searching algorithms
The parameter settings of SARST1 and iSARST followed configurations determined through performance benchmarking in our previous studies22,23, with word size = 3, gap opening and extension penalties = 9 and 2, and the low-complexity filter for query sequences turned off. The parameters of SARST2 were selected based on its IR performance using the Qry200 dataset, with Qry400 used to validate the chosen settings. In the current implementation, the primary user-adjustable parameter influencing speed and accuracy is the word size, which was fixed as 5 for all IR experiments reported in this paper. The main parameters of Foldseek and BLAST were tuned individually by evaluating their IR performance on Qry400, as detailed in Supplementary Table 1. For Foldseek, the settings were: sensitivity = 0.85, word size = 5, and CPU instruction set = AVX2 (Advanced Vector Extensions 2). For BLAST, the settings were: word size = 2, scoring matrix = BLOSUM62, and gap opening and extension penalties = 11 and 2. All algorithms provided user-adjustable quality control cutoffs for the reported hits (e.g., E-value, pC-value, or the number of hits). Foldseek also offered an adjustable clustered database search mode. These options were systematically tested and compared (Supplementary Table 1). The most effective quality control cutoff for SARST2 was the pC-value, while Foldseek and BLAST achieved optimal performance using E-values. In both the SCOP-2.07 dataset (145 thousand proteins; Figs. 2, 3) and the whole AlphaFold DB (214 million proteins; Supplementary Table 2), Foldseek exhibited improved performance when using the clustered database compared to its standard database.
Performance evaluations by pairwise alignment
All the assessed methods supported pairwise alignment without extra preparation steps. FAST, TM-align, and MICAN-SQ were implemented as pairwise programs. The classic BLAST package provides a pairwise program, bl2seq. Fr-TM-align50 and MADOKA programs could make batch computations, but their pairwise alignment modes reported more details. The iSARST web server had no pairwise function. The SARST1 program could not run pairwise alignment unless the target database of a single subject structure was first prepared by a formatdb command. SARST2, Foldseek, and the modern BLAST (blastp) were database search algorithms, but their programs could also perform pairwise alignments without any formatdb step. We evaluated the pairwise alignment performance of several algorithms. The average running time shown in Fig. 3a was calculated from all-against-all pairwise alignments between the Qry400 and SCOP-2.07, totaling ~58 million alignments (400 × 144,879). The identity mapping and distribution data exhibited in Fig. 5a, b were generated from a sampled dataset of homolog pairs to avoid the bias caused by unequal family sizes. A large SCOP family might contain several thousand proteins, potentially dominating the identity distribution compared to small families. The smallest SCOP family of Qry400 had 20 proteins, which could form 190 (20 × 19/2) unique homolog pairs, defining the upper sampling limit per family. We randomly selected 150 pairs from each family, forming a dataset of 60,000 homolog pairs to analyze alignment quality and identity distributions across different methods (Fig. 5).
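The per-family sampling described above can be sketched as follows. The function name and the fixed seed are hypothetical; for very large families, enumerating all pairs before sampling is wasteful, and a production script would more likely sample index pairs directly.

```python
import random
from itertools import combinations

def sample_homolog_pairs(members, n_pairs=150, seed=0):
    """Draw n_pairs distinct unordered pairs from one family's members."""
    pairs = list(combinations(members, 2))   # a family of 20 yields 190 pairs
    return random.Random(seed).sample(pairs, n_pairs)
```

Because the smallest Qry400 family offers only 190 unique pairs, any per-family sample size must stay at or below that number; 150 pairs from each of the 400 families gives the 60,000 pairs used in the analysis.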
The evaluated database search algorithms, SARST2, Foldseek, and BLAST, provided multiple parameters that could affect pairwise alignment results. The same optimal parameter settings used in IR experiments (Supplementary Table 1) were also applied in the pairwise alignment experiments.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Unless otherwise stated, all data supporting the results of this study can be found in the article, supplementary, and source data files. Protein structure data used in this study are available from the SCOPe [https://scop.berkeley.edu/], CATH [https://www.cathdb.info/], and AlphaFold Protein Structure Database [https://alphafold.ebi.ac.uk/] web servers, with accession codes listed in Supplementary Data 1–3. These third-party datasets are publicly accessible without restriction. The numerical source data underlying all graphs in the main Figures (Figs. 2a, 3b, c, 4a–e, and 5a, b) and Supplementary Fig. 3 are compiled in a single Excel file labeled Source Data, with each figure panel presented on a separate worksheet. Source data are provided with this paper.
Code availability
The compiled executables of SARST2, v2.0.30, are provided as Supplementary Software 1–3 for Linux, macOS, and Windows 10+ systems. The source code and executables of the latest version are available at the 10lab website [https://10lab.ceb.nycu.edu.tw/sarst2] and GitHub [https://github.com/NYCU-10lab/sarst]. The GitHub repository has been archived on Zenodo (https://doi.org/10.5281/zenodo.17144647) to ensure long-term accessibility.
References
Rose, P. W. et al. The RCSB Protein Data Bank: views of structural biology for basic and applied research and education. Nucleic Acids Res. 43, D345–D356 (2015).
Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
Juan, S. H., Chen, T. R. & Lo, W. C. A simple strategy to enhance the speed of protein secondary structure prediction without sacrificing accuracy. PLoS ONE 15, e0235153 (2020).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596 (2021).
Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).
Arnold, K., Bordoli, L., Kopp, J. & Schwede, T. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22, 195–201 (2006).
Chen, C. C., Hwang, J. K. & Yang, J. M. (PS)2: protein structure prediction server. Nucleic Acids Res. 34, W152–W157 (2006).
Wang, Z. et al. Protein function annotation with structurally aligned local sites of activity (SALSAs). BMC Bioinform. 14, S13 (2013).
Ruperti, F. et al. Cross-phyla protein annotation by structural prediction and alignment. Genome Biol. 24, 113 (2023).
Vakser, I. A. Protein-protein docking: from interaction to interactome. Biophys. J. 107, 1785–1793 (2014).
Tuncbag, N., Keskin, O., Nussinov, R. & Gursoy, A. Prediction of protein interactions by structural matching: prediction of PPI networks and the effects of mutations on PPIs that combines sequence and structural information. Methods Mol. Biol. 1558, 255–270 (2017).
Lee, H. S. & Im, W. Identification of ligand templates using local structure alignment for structure-based drug design. J. Chem. Inf. Model. 52, 2784–2795 (2012).
Brylinski, M. Local alignment of ligand binding sites in proteins for polypharmacology and drug repositioning. Methods Mol. Biol. 1611, 109–122 (2017).
Holm, L. & Sander, C. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123–138 (1993).
Shindyalov, I. N. & Bourne, P. E. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11, 739–747 (1998).
Zhu, J. & Weng, Z. FAST: a novel protein structure alignment algorithm. Proteins 58, 618–627 (2005).
Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
Martin, A. C. The ups and downs of protein topology; rapid comparison of protein structure. Protein Eng. 13, 829–837 (2000).
Carpentier, M., Brouillet, S. & Pothier, J. YAKUSA: a fast structural database scanning method. Proteins 61, 137–151 (2005).
Yang, J. M. & Tung, C. H. Protein structure database search and evolutionary classification. Nucleic Acids Res. 34, 3646–3659 (2006).
Lo, W. C., Huang, P. J., Chang, C. H. & Lyu, P. C. Protein structural similarity search by Ramachandran codes. BMC Bioinform. 8, 307 (2007).
Lo, W. C., Lee, C. Y., Lee, C. C. & Lyu, P. C. iSARST: an integrated SARST web server for rapid protein structural similarity searches. Nucleic Acids Res. 37, W545–W551 (2009).
Minami, S., Sawada, K., Ota, M. & Chikenji, G. MICAN-SQ: a sequential protein structure alignment program that is applicable to monomers and all types of oligomers. Bioinformatics 34, 3324–3331 (2018).
Deng, L., Zhong, G., Liu, C., Luo, J. & Liu, H. MADOKA: an ultra-fast approach for large-scale protein structure similarity searching. BMC Bioinform. 20, 662 (2019).
van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).
Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Shih, C. H., Chang, C. M., Lin, Y. S., Lo, W. C. & Hwang, J. K. Evolutionary information hidden in a single protein structure. Proteins 80, 1647–1657 (2012).
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540 (1995).
Fox, N. K., Brenner, S. E. & Chandonia, J. M. SCOPe: Structural Classification of Proteins-extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 42, D304–D309 (2014).
Zhang, E. & Zhang, Y. in Encyclopedia of Database Systems (eds Liu, L. & Özsu, M. T.) 981–982 (Springer US, 2009).
Aung, Z. & Tan, K. L. Rapid 3D protein structure database searching using information retrieval techniques. Bioinformatics 20, 1045–1052 (2004).
Manning, C. D., Raghavan, P. & Schütze, H. Introduction to Information Retrieval (Cambridge University Press, 2008).
Lo, W. C. et al. Deciphering the preference and predicting the viability of circular permutations in proteins. PLoS ONE 7, e31791 (2012).
Lo, W. C. et al. CPred: a web server for predicting viable circular permutations in proteins. Nucleic Acids Res. 40, W232–W237 (2012).
Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).
Henikoff, S. & Henikoff, J. G. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA 89, 10915–10919 (1992).
Madhusudhan, M. S., Marti-Renom, M. A., Sanchez, R. & Sali, A. Variable gap penalty for protein sequence-structure alignment. Protein Eng. Des. Sel. 19, 129–133 (2006).
Barton, G. J. in Protein Structure Prediction: A Practical Approach (ed. Sternberg, M. J. E.) 31–63 (IRL Press at Oxford University Press, 1996).
Nishikawa, K. & Ooi, T. Radial locations of amino acid residues in a globular protein: correlation with the sequence. J. Biochem. 100, 1043–1047 (1986).
Amitai, G. et al. Network analysis of protein structures identifies functional residues. J. Mol. Biol. 344, 1135–1146 (2004).
Rost, B. Review: protein secondary structure prediction continues to rise. J. Struct. Biol. 134, 204–218 (2001).
Yang, Y. et al. Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Brief. Bioinform. 19, 482–494 (2018).
Tomii, K., Sawada, Y. & Honda, S. Convergent evolution in structural elements of proteins investigated using cross profile analysis. BMC Bioinform. 13, 11 (2012).
Wright, E. S. Tandem repeats provide evidence for convergent evolution to similar protein structures. Genome Biol. Evol. https://doi.org/10.1093/gbe/evaf013 (2025).
Liu, J. W. et al. On the relationship between residue structural environment and sequence conservation in proteins. Proteins 85, 1713–1723 (2017).
Mohammadi, A., Zahiri, J., Mohammadi, S., Khodarahmi, M. & Arab, S. S. PSSMCOOL: a comprehensive R package for generating evolutionary-based descriptors of protein sequences from PSSM profiles. Biol. Methods Protoc. 7, bpac008 (2022).
Kinjo, A. R. & Nakamura, H. Nature of protein family signatures: insights from singular value analysis of position-specific scoring matrices. PLoS ONE 3, e1963 (2008).
Pandit, S. B. & Skolnick, J. Fr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-score. BMC Bioinform. 9, 531 (2008).
Lo Conte, L. et al. SCOP: a structural classification of proteins database. Nucleic Acids Res. 28, 257–259 (2000).
Orengo, C. A. et al. CATH—a hierarchic classification of protein domain structures. Structure 5, 1093–1108 (1997).
Chen, T. R., Juan, S. H., Huang, Y. W., Lin, Y. C. & Lo, W. C. A secondary structure-based position-specific scoring matrix applied to the improvement in protein secondary structure prediction. PLoS ONE 16, e0255076 (2021).
Pommie, C., Levadoux, S., Sabatier, R., Lefranc, G. & Lefranc, M. P. IMGT standardized criteria for statistical analysis of immunoglobulin V-REGION amino acid properties. J. Mol. Recognit. 17, 17–32 (2004).
Timberlake, K. C. Chemistry. 5th edn (HarperCollins Publishers Inc., 1992).
Mathews, C. K., Van Holde, K. E. & Ahern, K. G. Biochemistry (Benjamin Cummings, 2000).
Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201–239 (1993).
Lipman, D. J. & Pearson, W. R. Rapid and sensitive protein similarity searches. Science 227, 1435–1441 (1985).
Chen, T. R., Lo, C. H., Juan, S. H. & Lo, W. C. The influence of dataset homology and a rigorous evaluation strategy on protein secondary structure prediction. PLoS ONE 16, e0254555 (2021).
Breiman, L. Classification and Regression Trees (Wadsworth International Group, 1984).
Acknowledgements
We thank the National Science and Technology Council (NSTC), Taiwan, for supporting this study through grants MOST 111-2221-E-A49-163-, NSTC 111-2218-E-A49-031-, NSTC 112-2221-E-A49-119-, and NSTC 114-2221-E-A49-147- to W.-C.L., and NSTC 112-2813-C-A49-032-E to S.-W.W. This work was financially supported by the “Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B)” from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan. We acknowledge Sheng-Hung Juan and Yu-Wei Huang for implementing the PSSM algorithm and establishing the evaluation procedures, and Yu-Jhen Huang for assisting in setting up the server room and collecting experimental data. We also thank Ling-Yen Kung, Chun-Kai Wu, Yu-Ru Shen, Jun Jie Lau, and Hao-Yu Peng for testing the SARST2 software package and preparing the multilingual user manuals.
Author information
Contributions
W.-C.L. conceived the project, developed the core library, designed the algorithm, supervised the study, and wrote the manuscript. A.W. participated in algorithm design and helped supervise the study. C.-H.L. and C.-Y.C. contributed to the development of the core library. C.-H.L., C.-Y.C., Y.-J.L., J.-Y.Y., and S.-W.W. contributed to algorithm development, dataset preparation, and performance benchmarking. S.-C.Y. contributed to machine learning refinement and data analysis. J.-Y.Y., Y.-J.L., and S.-W.W. prepared datasets and created the software documentation. All authors helped draft the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Jeffrey Skolnick and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lo, WC., Warshel, A., Lo, CH. et al. SARST2 high-throughput and resource-efficient protein structure alignment against massive databases. Nat Commun 16, 8691 (2025). https://doi.org/10.1038/s41467-025-63757-9
DOI: https://doi.org/10.1038/s41467-025-63757-9