SARST2 high-throughput and resource-efficient protein structure alignment against massive databases

Lo, Wei-Cheng; Warshel, Arieh; Lo, Chia-Hua; Choke, Chia Yee; Li, Yan-Jie; Yen, Shih-Chung; Yang, Jyun-Yi; Weng, Shih-Wen

doi:10.1038/s41467-025-63757-9


Article
Open access
Published: 30 September 2025

Nature Communications volume 16, Article number: 8691 (2025)

Abstract

The flood of protein structural Big Data is coming. With the belief that biotech researchers deserve powerful analysis engines to overcome the challenge of rapidly increasing computational demands, we are devoted to developing efficient protein structural alignment search algorithms to assist researchers as they push the frontiers of biological sciences and technology. Here, we present SARST2, an algorithm that integrates primary, secondary, and tertiary structural features with evolutionary statistics to perform accurate and rapid alignments. In large-scale benchmarks, SARST2 outperforms state-of-the-art methods in accuracy, while completing AlphaFold Database searches significantly faster and with substantially less memory than BLAST and Foldseek. It employs a filter-and-refine strategy enhanced by machine learning, a diagonal shortcut for word-matching, a weighted contact number-based scoring scheme, and a variable gap penalty based on substitution entropy. SARST2, implemented in Golang as standalone programs available at https://10lab.ceb.nycu.edu.tw/sarst2 and https://github.com/NYCU-10lab/sarst, enables massive database searches using even ordinary personal computers.

Introduction

This work aims to develop an efficient structural alignment search method to support rapid biotechnology research and development in this protein structural Big Data era. Over the past two decades, although structures deposited in the Protein Data Bank (PDB)¹ increased rapidly, their expansion lagged behind that of sequence databases. While PDB comprised 100 thousand structures in 2014, the Universal Protein Resource (UniProt) contained 100 million sequences. Since then, the size of UniProt (represented by the UniRef90 dataset) has doubled about every two years^2,3. Nevertheless, this growth gap is now closing due to recent astonishing applications of AI in protein structure prediction. In July 2022, Google DeepMind began predicting structures for all known sequences using AlphaFold2^4,5. To date, AlphaFold DB has released 214 million predicted structures⁶. Such progress also motivated PDB to incorporate predicted structures from reliable sources, expanding its database from hundreds of thousands to over one million proteins in 2022. How to efficiently identify homologous proteins via structural alignment search has become a major challenge in the face of these massive and rapidly growing structure databases.

Protein structural alignment is more time-consuming than sequence alignment. In the past, when data were limited, the cost was tolerable for most researchers. Now, facing hundreds of millions of structures, existing structural alignment search methods are increasingly inefficient. For reference, searching one query sequence against the current UniRef90 (172 million proteins; 75 GiB disk space) with the widely used sequence alignment search algorithm PSI-BLAST may take an hour with one typical CPU (estimated with a 3.33 GHz Intel Xeon processor based on Table 2 of ref. ³), and this cost may double every two years. Given that sequence alignment searches already consume substantial time and space, large-scale structural studies demand far more. Improving the efficiency of structural alignment may facilitate progress in many fields of protein science and technology, such as structure modeling^7,8, proteomics^9,10, protein–protein interactomes^11,12, and drug discovery^13,14.

In the 1990s, notable pairwise structure alignment algorithms emerged, such as DALI¹⁵ and CE¹⁶. Though accurate, these pioneering algorithms needed several seconds per pair of structures, which was too slow for large-scale research. In the early 2000s, accurate and rapid algorithms aligning several structures per second were proposed, such as FAST¹⁷ and TM-align¹⁸. They did not support database searching, which had to be implemented as one-against-all pairwise alignments between the query protein and all subject proteins in the target database. In the late 2000s, alignment search algorithms based on structural linear encoding quickly evolved, such as TOPSCAN¹⁹, YAKUSA²⁰, 3D-BLAST²¹, and SARST (Structural similarity search Aided by Ramachandran Sequential Transformation)²². They could screen several thousand proteins per second. Though rapid, their accuracy was substantially lower than that of pairwise algorithms like FAST and TM-align²².

In 2009, we developed the integrated SARST web server (iSARST) with a filter-and-refine strategy^22,23. Linear encoding was utilized to discard subject proteins dissimilar to the query from the hit list. Detailed alignment by a third-party refinement engine, e.g., FAST¹⁷ or TM-align¹⁸, was applied to reorder the hits according to their structural similarities to the query. The iSARST ran with high speed while preserving the accuracy of these refinement engines²³. However, its efficiency is by far insufficient to handle a database of hundreds of millions of structures. Using one 3.2-GHz CPU, iSARST took 30 s to retrieve 90% of known family-level homologs for one query from a target database of 34 thousand proteins²³. With this performance, searching against the AlphaFold DB would take ~52 h. Using multiple CPUs might reduce the time to 10–20 h, but it is still costly.
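The ~52 h figure above follows from scaling the reported per-query time linearly with database size. The short sketch below reproduces that arithmetic; the function name is hypothetical, and linear scaling is an assumption of the estimate, not a measured property of iSARST.

```python
def extrapolated_hours(seconds_per_query, benchmark_db_size, target_db_size):
    """Scale a per-query search time linearly with the number of subject proteins."""
    scale = target_db_size / benchmark_db_size
    return seconds_per_query * scale / 3600.0

# Figures from the text: 30 s per query against 34 thousand proteins,
# extrapolated to the 214 million structures of AlphaFold DB.
isarst_hours = extrapolated_hours(30.0, 34_000, 214_000_000)
print(f"iSARST on AlphaFold DB: ~{isarst_hours:.0f} h")
```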

Recently, versatile alignment methods have been published. MICAN-SQ enabled oligomeric structural alignment²⁴. MADOKA²⁵ and Foldseek²⁶ supported parallel computation. MADOKA linearly encoded structures by the DSSP program²⁷ into secondary structure element (SSE) sequences, which were aligned through dynamic programming (DP). It could compare ~480 pairs of structures per second using 32 Intel i9 CPU processors, equivalent to ~120 h for processing the AlphaFold DB. Foldseek, by applying deep learning through a Vector Quantized Variational Autoencoder, encoded tertiary interactions between residues of a protein into a 3Di structural string²⁶, which substantially boosted alignment search performance. Moreover, by utilizing SIMD (Single Instruction, Multiple Data) instructions of modern CPUs in DP, Foldseek achieved an impressive speed ~1370 times higher than MADOKA. It could search the AlphaFold DB in two hours or less with 32 Intel i9 CPUs (see “Results”).

Our review of previous work suggested that the linear-encoding filter-and-refine strategy of iSARST remains promising for developing structural alignment search algorithms for massive databases. However, iSARST had some weaknesses: (1) it was written in PHP (Hypertext Preprocessor), an interpreted programming language slower than compiled languages; (2) it relied on inefficient external calls to third-party programs, including BLAST²⁸ for structural string alignments and FAST/TM-align for structural comparisons; (3) its speedup was limited to a 3-fold increase due to the parallelization efficiency of its alignment engine BLAST; (4) it did not provide a standalone program.

Here, we propose the SARST2 algorithm. This filter-and-refine procedure utilizes amino acid and SSE sequences, linearly-encoded structural strings, atomic geometrics, and evolutionary information to perform structural alignment searches. Accelerated by decision tree (DT) and artificial neural network (ANN) machine learning, several word-matching and structural string comparison filters rapidly discard irrelevant hits. Subsequently, the weighted contact number (WCN)²⁹ and position-specific scoring matrix (PSSM)-derived substitution entropy are applied to align the remaining hits with the query structure in a refined manner. SARST2 has been implemented into a self-contained standalone program in Golang, an efficient parallel computing language developed by Google. Evaluated with information retrieval (IR) using structural homologs determined by the SCOP database^30,31, the alignment search accuracy of SARST2 (96.3%) was higher than FAST (95.3%), TM-align (94.1%), and Foldseek (95.9%). Notably, it ran several to thousands of times faster than state-of-the-art structure alignment programs and even outpaced the sequence alignment algorithm BLAST²⁸. Tested with an answers-seeded AlphaFold DB using 32 Intel i9 processors, SARST2 completed a 100% answer-recalled search in 3.4 min using 9.4 GiB memory. Foldseek and BLAST took 18.6 and 52.5 min, using 19.6 and 77.3 GiB memory, respectively. Moreover, storing AlphaFold DB requires 59.7 TiB disk space, but SARST2’s grouped database formatting function reduces the requirement to only 0.5 TiB, compared to 1.7 TiB required by Foldseek. These capabilities allow researchers to search hundreds of millions of structures on ordinary personal computers. We sincerely hope SARST2 will help the community overcome the challenge of computing power for structural Big Data analysis and accelerate advances in biotechnology and medical science.

Results

Overview of the proposed algorithm

As illustrated in Box 1 and Fig. 1, SARST2 employs a filter-and-refine strategy, where coarse and fast filters precede accurate and slow alignment/scoring steps. After obtaining linearly-encoded information of the query and subject structures, four filters eliminate structurally irrelevant proteins from the subject pool. The remaining candidate homologs of the query are aligned and scored by two refinement procedures: (1) a synthesized DP (dynamic programming) step that considers the amino acid type (AAT), SSE, and WCN of each residue; (2) detailed structural similarity scoring by superimposition. The synthesized DP is equipped with a variable gap penalty (VGP) scheme based on the PSSM-derived residue substitution entropy of the query protein (see “Methods”).

**Fig. 1: Flowchart of the SARST2 algorithm.**

Accuracy evaluations

The accuracies of SARST2 and several excellent alignment algorithms were assessed by information retrieval, a technique for evaluating data searching methods^22,32,33,34. Dataset Qry400 (Supplementary Data 1) provided 400 query proteins for the algorithms to retrieve family-level homologs from the SCOP-2.07 target dataset (Supplementary Data 2). IR accuracy measures, recall and precision, were calculated by checking whether the retrieved subjects in the hit list were SCOP³⁰ family-level homologs of the query. If a method supported database searching, it was used in that mode. Otherwise, one-against-all comparisons were performed to rank the subjects according to their similarity scores toward the query.
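As a rough illustration of the IR measures described above, the sketch below computes recall and precision along a ranked hit list and reads off the precision at a chosen recall level. The function names are hypothetical, and this section does not specify how precision was averaged over recall levels, so that step is left out.

```python
def precision_recall_curve(ranked_hits, homologs):
    """Return (recall, precision) after each retrieved hit in ranked order."""
    homologs = set(homologs)
    if not homologs:
        raise ValueError("the query needs at least one known homolog")
    points = []
    true_pos = 0
    for rank, hit in enumerate(ranked_hits, start=1):
        if hit in homologs:
            true_pos += 1
        points.append((true_pos / len(homologs), true_pos / rank))
    return points

def precision_at_recall(points, level):
    """Precision at the first rank where recall reaches `level`, or None if never reached."""
    for recall, precision in points:
        if recall >= level:
            return precision
    return None

# Toy hit list: a, b, c, d are family-level homologs; x and y are not.
curve = precision_recall_curve(["a", "b", "x", "c", "y", "d"], {"a", "b", "c", "d"})
print(precision_at_recall(curve, 0.5), precision_at_recall(curve, 1.0))
```

A method that never retrieves all homologs returns `None` at the 100% level, which corresponds to the text's remark that a program might fail to reach high recall levels.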

Despite being a sequence search method, BLAST²⁸ was also evaluated because it was utilized as the structural string alignment engine in the former SARST program (SARST1) and the iSARST server. As shown in Fig. 2, SARST2 outperformed BLAST and other methods across all recall levels. Its average precision was 96.3%, exceeding iSARST (94.4%), FAST (95.3%), Foldseek (95.9%), and other algorithms.

**Fig. 2: Accuracy of SARST2 and several state-of-the-art alignment algorithms.**

Speed evaluations

Each alignment search program we applied provides parameters to adjust how many hits to retrieve. For instance, they all allowed users to set the maximum number of proteins in the hit list. Some also provided quality control cutoffs for discarding poorly scored hits, e.g., the pC-value in SARST2 (see “Methods”) and E-value in BLAST. Setting a smaller cutoff for these parameters would enhance the speed; however, fewer homologs would be retrieved, and the program might fail to reach high recall levels. Therefore, they were carefully tuned in IR experiments. At first, we fixed the hit-list size cutoff to be the target database’s size to ensure the potential to reach 100% recall. The quality control cutoff for each query gradually increased from 10⁰, 10¹, to 10⁸ until the 100% recall was accomplished. Then, the quality control cutoff was fixed at 10⁸, and the hit-list size cutoff rose from 50, 500, 5000, etc. For programs without such parameters, one-against-all comparisons were conducted.

All the assessed alignment search programs performed better with the quality control cutoff than with the hit-list size cutoff (Supplementary Table 1), and their optimal performances are summarized in Fig. 3a. Using one CPU, SARST2 ran >3300 times faster than FAST and TM-align in database searching. Meanwhile, its precision was higher than that of these highly accurate pairwise alignment algorithms. SARST2 also ran faster than other database searching or batch computing methods, i.e., SARST1, iSARST, MADOKA, and Foldseek. Notably, it was even faster than BLAST. Foldseek, also impressively faster than BLAST, exhibited a speed similar to SARST2 without parallelization. Nevertheless, SARST2 showed a higher speed enhancement when multiple CPUs were available.

**Fig. 3: Speed of SARST2 and several state-of-the-art alignment algorithms.**

To test the efficiency of search algorithms that supported parallelization, we gradually increased the number of processors. Foldseek and BLAST ran fast with 1–2 CPUs, but their speed met a plateau when the number was ≥4. The speed increment of SARST2 was remarkable (Fig. 3b, c). Using our 32-CPU machine, its speed could increase 10.6-fold, whereas Foldseek, MADOKA, and BLAST were accelerated by 2.1–3.0, 10.5, and 3.1-fold, respectively. Searching one query structure against the SCOP-2.07 to reach 100% recall with 32 i9 CPUs, SARST2 took 0.15 s on average, approximately 7 and 24 times faster than Foldseek and BLAST.

Most assessed programs supported pairwise alignment without extra database formatting commands. For each of them, we made all-against-all pairwise alignments between the Qry400 and SCOP-2.07 datasets and computed the average alignment time. As shown in Fig. 3a, SARST2 had a pairwise speed >11 times higher than accurate pairwise alignment methods like FAST and TM-align. Compared with other database search methods, SARST2 was 408 and 325 times faster than Foldseek and BLAST. We speculated that these search methods might still perform a database-formatting procedure for a single subject protein before it could be aligned with the query. Before the NCBI BLAST+ version, the old BLAST package provided a pure pairwise alignment program called bl2seq. Without database formatting, bl2seq ran much faster than the modern BLAST program (blastp). Nevertheless, SARST2 was twice as fast as bl2seq. In summary, in both database searching and pairwise alignment, SARST2 outperformed state-of-the-art structural alignment methods and was even faster than the popular BLAST sequence alignment algorithm.

Accuracy-enhancing factors

Combining primary, secondary, and tertiary structural information for alignment was a key contributor to SARST2’s accuracy. SARST2 linearly encoded a protein structure in five forms: (1) the amino acid (AA) sequence, (2) the AAT sequence using a 5-symbol amino acid type alphabet we developed^35,36, (3) the 4-symbol SSE sequence from the TM-align alphabet¹⁸, (4) the SARST sequence based on Ramachandran codes²², and (5) the WCN values we proposed to measure residue packing density²⁹. These elements were synthesized through a weighted scoring scheme to describe local similarity S_ij between residue i from the query and residue j from the subject protein,

$$\begin{aligned}
S_{ij} ={}& w_{\mathrm{WCN}} \times \left|\mathrm{WCN}_{i}-\mathrm{WCN}_{j}\right| \\
&+ w_{\mathrm{RM}} \times \mathrm{SM}_{\mathrm{SARST}}\left[\mathrm{RM}_{i}\right]\left[\mathrm{RM}_{j}\right] \\
&+ w_{\mathrm{SSE}} \times \mathrm{SM}_{\mathrm{SSE}}\left[\mathrm{SSE}_{i}\right]\left[\mathrm{SSE}_{j}\right] \\
&+ w_{\mathrm{AAT}} \times \mathrm{SM}_{\mathrm{AAT}}\left[\mathrm{AAT}_{i}\right]\left[\mathrm{AAT}_{j}\right] \\
&+ w_{\mathrm{AA}} \times \mathrm{SM}_{\mathrm{AA}}\left[\mathrm{AA}_{i}\right]\left[\mathrm{AA}_{j}\right]
\end{aligned} \tag{1}$$

where w_x denotes the weight for element x, and SM_x is the scoring matrix for alphabet x. The weights were empirically tuned. For SM_AA, we used the BLOSUMN provided by EMBOSS³⁷. This synthesized similarity score was then used to perform DP alignment. The Needleman–Wunsch DP was used in this study, though the SARST2 program also supports Smith–Waterman.
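To make the synthesized scoring concrete, here is a minimal sketch that plugs the local similarity of Eq. (1) into a Needleman–Wunsch recursion. Everything numeric is an assumption for illustration: the weights (including the negative sign on the WCN term, so that a larger WCN difference lowers the score), the +1/-1 placeholder matrix standing in for all the SM_x tables, and the constant gap penalty. The tuned weights and actual matrices are not given in this section, and SARST2's variable gap penalty is not modeled here.

```python
from dataclasses import dataclass

@dataclass
class Residue:
    aa: str     # amino acid
    aat: str    # amino acid type symbol
    sse: str    # secondary structure element symbol
    rm: str     # Ramachandran (SARST) code
    wcn: float  # weighted contact number

# Hypothetical weights; the paper tuned its own empirically.
WEIGHTS = {"wcn": -1.0, "rm": 1.0, "sse": 1.0, "aat": 0.5, "aa": 0.5}

def toy_matrix(a, b):
    """Placeholder for every SM_x: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0

def local_similarity(q, s, w=WEIGHTS, sm=toy_matrix):
    """S_ij of Eq. (1) for query residue q and subject residue s."""
    return (w["wcn"] * abs(q.wcn - s.wcn)
            + w["rm"] * sm(q.rm, s.rm)
            + w["sse"] * sm(q.sse, s.sse)
            + w["aat"] * sm(q.aat, s.aat)
            + w["aa"] * sm(q.aa, s.aa))

def needleman_wunsch(query, subject, gap=-2.0):
    """Global alignment score with a constant linear gap penalty (an assumption)."""
    n, m = len(query), len(subject)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + local_similarity(query[i - 1], subject[j - 1]),
                dp[i - 1][j] + gap,   # gap in the subject
                dp[i][j - 1] + gap,   # gap in the query
            )
    return dp[n][m]
```

The recursion only returns the optimal score; recovering the alignment itself would require a traceback over `dp`, omitted for brevity.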

Among the synthesized elements, WCN exerted a substantial effect on accuracy. As Table 1 shows, disabling the whole synthesized DP reduced precision by 1.9%, while disabling only WCN resulted in a 1.3% drop. Scoring matrices also enhanced accuracy. Many algorithms, like TM-align and MADOKA, aligned structures based on SSE with simple binary scores (e.g., 1 for match, 0 for mismatch). In contrast, SARST2 applied custom scoring matrices for SARST, SSE, and AAT alphabets derived from the BLOSUM algorithm³⁸ (see “Methods”). Replacing SM_SSE and SM_AAT with binary scores caused precision to drop by 2.9% and 1.1%, respectively.

Table 1 Factors affecting the accuracy and speed of SARST2


Variable gap penalty was supposed to improve alignment quality between distantly related homologs by penalizing gaps in SSE regions³⁹. Here, we introduced an additional feature into VGP—suppressing gaps at conserved positions. After the last filtering DP alignment (Fig. 1), SARST2 generated a PSSM for the query protein. The Shannon entropy (H) was then applied to encode the query’s sequence conservation profile (C), as shown below,

$$\mathrm{H}(i) = -\sum_{a} \mathrm{p}_{i}(a) \times \log_{2}\left(\mathrm{p}_{i}(a)\right) \tag{2}$$

$$\mathrm{C}(i) = \beta^{\mathrm{Zscore}(\mathrm{H}(i))}, \quad \beta \in (0,1) \tag{3}$$

where i denotes a residue position, a is one of the 20 AAs, and p_i(a) stands for the frequency of a at position i in the PSSM. High entropy indicates poor conservation due to diverse substitutions. Entropies were normalized by Z-score, then transformed to conservation using β = 0.5 as an empirically selected base. The value of C(i), always >0, modulated the gap penalty for residue i in the final synthesized DP alignment. As Table 1 shows, disabling this VGP feature reduced precision by 0.6%.
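The two formulas can be traced numerically with a short sketch. It assumes that the PSSM is available as per-position frequency vectors, that terms with zero frequency contribute nothing to the entropy (the usual convention), and that the Z-score uses the population standard deviation over the query's positions; none of these details is stated in this section. How C(i) then scales the gap penalty is also not specified here, so the sketch stops at C(i).

```python
import math

def shannon_entropy(freqs):
    """H(i) of Eq. (2); terms with p = 0 are skipped."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)

def conservation_profile(pssm_freqs, beta=0.5):
    """C(i) of Eq. (3) for every position of the query."""
    entropies = [shannon_entropy(column) for column in pssm_freqs]
    mean = sum(entropies) / len(entropies)
    std = math.sqrt(sum((h - mean) ** 2 for h in entropies) / len(entropies))
    if std == 0:
        # All positions equally variable: every Z-score is 0, so C(i) = 1.
        return [1.0] * len(entropies)
    return [beta ** ((h - mean) / std) for h in entropies]

# One fully conserved position and one where all 20 amino acids are equally frequent.
conserved = [1.0] + [0.0] * 19
variable = [0.05] * 20
print(conservation_profile([conserved, variable]))
```

With β = 0.5, a position whose entropy is one standard deviation below the mean gets C = 2, and one a standard deviation above gets C = 0.5, so conserved positions carry the larger value.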

Speed-enhancing factors

The high speed of SARST2 was mainly founded upon the filter-and-refine strategy. Specifically, the earliest word-matching step eliminated most non-homologs to reduce computation time. In SARST2, word-matching was done by a typical query-subject 2D matching table⁴⁰, not on amino acid sequences but on AAT and SSE sequences. This innovation enabled fast identification of matching words near the diagonal of the table (see “Methods”), a pattern that often signals homologous alignment. Without this diagonal shortcut, SARST2’s running time would increase threefold (see Table 1, with each test repeated five times). Another speed enhancer was the grouped search. During filtering, only group heads were evaluated. Since the number of heads decreased after filtering, and each remaining head propagated its word-matching and DP results to group members, many redundant calculations were avoided. Disabling grouped search doubled the time cost. Machine learning (ML) also accelerated SARST2. A DT filter was used to quickly eliminate irrelevant subjects, while an ANN filter estimated the likelihood of homology and removed low-scoring candidates. Without ML, the running time increased by 5.5-fold.
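The idea of looking for word matches only near the diagonal of the query-subject table can be illustrated with a generic banded scan. This is not the SARST2 shortcut itself, whose details are in the paper's Methods and are not reproduced in this section; the word length `k`, the band half-width `band`, and the function name are all hypothetical.

```python
def banded_word_hits(query, subject, k=4, band=8):
    """Count identical k-letter words at cells (i, j) with |i - j| <= band.

    Only cells close to the main diagonal of the 2D matching table are
    visited, so the cost grows with len(query) * band instead of
    len(query) * len(subject).
    """
    hits = 0
    last_j = len(subject) - k  # last valid word start in the subject
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        for j in range(max(0, i - band), min(last_j, i + band) + 1):
            if subject[j:j + k] == word:
                hits += 1
    return hits

# SSE-like strings: the subject is the query shifted by two residues.
query = "HHHHEEEECCCC"
print(banded_word_hits(query, "CC" + query, k=4, band=0))  # diagonal only
print(banded_word_hits(query, "CC" + query, k=4, band=2))  # band covers the shift
```

The trade-off in such a scheme is that a band that is too narrow misses homologs whose alignment starts far from the diagonal, while a wide band gives up the speed benefit.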

Efficiency of data storage

The SARST2 program can be executed with a pre-formatted target database, where structural data are stored economically, especially for residue coordinates. First, only Cα atoms are stored, significantly reducing space. Second, we found that the coordinate range in a PDB/CIF file (–999.999 to 9999.999) could be encoded using 24 bits, which saves more space than a conventional 32/64-bit float representation. Furthermore, SARST2 provides a parameter to round coordinates to one decimal place. These strategies allow users with limited storage resources to manage huge databases. For example, the raw AlphaFoldDB-2022 consumed 59.7 TiB disk space (counting only the CIF-format files), whereas a SARST2-formatted version required only 0.5 TiB, saving 99.2% of the space.
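The 24-bit claim can be checked by counting: with three decimal places, the range –999.999 to 9999.999 holds 10,999,999 distinct values, which exceeds 2²³ = 8,388,608 but fits within 2²⁴ = 16,777,216. The sketch below shows one way to pack a coordinate into three bytes under that observation. The offset scheme, byte order, and function names are assumptions for illustration; the actual SARST2 database layout is not described in this section.

```python
OFFSET = 999_999       # shifts -999.999 to code 0
MAX_CODE = 2**24 - 1   # largest integer that fits in 3 bytes

def encode_coord(x):
    """Pack one coordinate with 3 decimal places into 3 bytes."""
    code = round(x * 1000) + OFFSET
    if not 0 <= code <= MAX_CODE:
        raise ValueError(f"coordinate {x} is outside the encodable range")
    return code.to_bytes(3, "big")

def decode_coord(raw):
    """Recover the coordinate from its 3-byte code."""
    return (int.from_bytes(raw, "big") - OFFSET) / 1000

for value in (-999.999, 12.345, 9999.999):
    packed = encode_coord(value)
    print(value, packed.hex(), decode_coord(packed))
```

Three bytes per coordinate means nine bytes per Cα atom, against twelve for 32-bit floats and twenty-four for 64-bit floats.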

Performance tests with the AlphaFold Protein Structure Database

The SARST2 program is user-friendly and flexible. In addition to the query structure, users can specify (1) a pre-formatted target database, (2) a batch of subject files, or (3) folders containing subject files. Formatting a database reduces future search time and disk space. We evaluated SARST2’s large-scale searching and database formatting efficiency using AlphaFold DB through random sampling. Running time, memory usage, and disk consumption were tracked as the subsets increased in size. For comparison, we also tested Foldseek (structure-based) and BLAST (sequence-based) alignment search algorithms.

We applied the Qry400 dataset against target databases constructed by combining SCOP-2.07 family-level homologs with randomly sampled AlphaFoldDB-2022 subsets of increasing size, the largest containing 215 million structures. Every query had known SCOP family-level homologs, allowing us to seed answers into a massive dataset for IR evaluation. As shown in Fig. 4, time and memory usage grew with dataset size. Using 32 i9 CPUs, SARST2 searched the full AlphaFoldDB in 3.4 min, about 5.4× and 15.3× faster than Foldseek and BLAST (× denotes multiplication, i.e., times). Impressively, SARST2 used only 9.4 GiB of memory, while BLAST needed 77.3 GiB. Foldseek, when run with clustered database mode, was also efficient (19.6 GiB), but required 3.0× more memory without enabling that clustering (see Supplementary Table 2).

**Fig. 4: Performance of SARST2, Foldseek, and BLAST on large-scale AlphaFold DB subsets.**

SARST2 and Foldseek required more time, memory, and disk space than BLAST to format the AlphaFoldDB subsets, which is just as expected since BLAST only handles sequence data. Nonetheless, for the full AlphaFoldDB, SARST2’s memory usage (3.9 GiB) was close to BLAST (2.5 GiB) and far lower than Foldseek (113.2 GiB). SARST2’s formatting time was ~10× longer than BLAST’s for datasets >3.3 million proteins, but still less than half of Foldseek’s. The final file sizes of SARST2 and Foldseek databases were similar (~6× larger than BLAST’s), yet Foldseek’s use of temporary files increased peak disk space demand to 3.4× that of SARST2. SARST2 and BLAST did not generate temporary files during formatting.

Discussion

Using WCN²⁹ to measure local structural similarities was an innovation in SARST2. Most structure alignment algorithms instead relied on geometric comparisons. For instance, TM-align performed superimposition after SSE-based DP and quantified local similarities by computing the distance between each residue i from the query and m from the subject protein¹⁸. FAST used a pair-graph DP to identify residue pairs with similar distances and evaluated the similarity between 5-residue fragments centered at residues i (i – 2 to i + 2) and m (m – 2 to m + 2), using several distance- and angle-based features from Cα and Cβ atoms¹⁷. In contrast, WCN relies solely on distances between Cα atoms. It estimates the packing density of i by summing over neighboring residues j, weighted by the reciprocal square of the i–j Cα distances (see “Methods”). Without superimposition or angle calculation, WCN is computationally efficient. To explain why it improves database searching accuracy, we suppose the key may be that it encodes the long-range interaction between residues. The hypothesis motivating our use of WCN was that similar structures might have similar packing patterns along the polypeptide. To explore it, we tested two other packing density measures, the contact number (CN)⁴¹ and closeness⁴². WCN and closeness outperformed CN in search accuracy (see Supplementary Table 3). Unlike CN, they do not apply a fixed distance cutoff to define neighbors, allowing contributions from distant residues. Long-range interactions are critical in protein structure prediction^43,44; they may also be relevant to structure alignment. As most methods assess local similarities in strictly local terms, SARST2 may shed light on the protein structural alignment methodology.
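The difference between WCN and a cutoff-based contact number can be seen in a few lines. The sketch below computes WCN as the sum of reciprocal squared Cα–Cα distances, as described above, next to a plain contact number; the 8.0 cutoff for CN is a hypothetical value chosen for illustration, and the code assumes no two residues share identical coordinates.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def weighted_contact_numbers(ca_coords):
    """WCN_i = sum over j != i of 1 / d_ij^2, using C-alpha coordinates only."""
    n = len(ca_coords)
    wcn = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            contribution = 1.0 / squared_distance(ca_coords[i], ca_coords[j])
            wcn[i] += contribution
            wcn[j] += contribution
    return wcn

def contact_numbers(ca_coords, cutoff=8.0):
    """CN_i = number of other residues within `cutoff` (hypothetical value)."""
    limit = cutoff ** 2
    return [
        sum(1 for j, q in enumerate(ca_coords)
            if j != i and squared_distance(p, q) <= limit)
        for i, p in enumerate(ca_coords)
    ]

# Three points on a line, one unit apart: the middle one is the most packed.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(weighted_contact_numbers(coords), contact_numbers(coords, cutoff=1.5))
```

In the WCN sum every residue contributes, however far away, whereas CN ignores everything beyond the cutoff; this is the property the paragraph above links to long-range interactions.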

The high accuracy of SARST2 (Figs. 2, 3) may be attributed to the quality of its alignments. Structural alignment methods are generally more effective than sequence-based approaches in detecting remote homology. As shown in Fig. 5a, for SCOP family-level homologs with low pairwise sequence identities (e.g., 1–20%), SARST2, Foldseek, and TM-align produced alignments with substantially higher sequence identities than BLAST. SARST2 and Foldseek exhibited comparable alignment quality, yielding similarly high identities in these low-homology regions. Figure 5b presents the distribution of pairwise sequence identities across methods. BLAST peaked at ~10% identity, suggesting limited sensitivity to distant relationships. TM-align, with an average IR precision of 94.1%, peaked at ~17%. SARST2 achieved the highest precision (96.3%) and peaked at ~22%, slightly above Foldseek (95.9%; ~21%).

**Fig. 5: The alignment quality of SARST2 and the efficiency of its first two filtering steps.**

To further evaluate alignment quality, we calculated TM-scores based on the alignments from each method. As shown in Fig. 5a, BLAST produced the lowest TM-scores for homologs with <20% identity. SARST2 and Foldseek showed comparable TM-scores across the full identity range. TM-align, however, yielded higher TM-scores in the low-identity region. This observation may reflect algorithmic differences. Unlike TM-align, which aligns proteins solely based on structural information¹⁸, SARST2 and Foldseek incorporate amino acid sequence features into their algorithms²⁶; this design may reduce sensitivity to marginal structural similarities while enhancing selectivity for alignments more consistent with evolutionary relatedness. Importantly, structural similarity alone does not necessarily indicate homology. Marginal or local resemblance can result from convergent evolution or domain-level motifs shared by unrelated proteins^45,46. Given that SARST2 yielded higher sequence identities in low-homology regions and demonstrated superior IR precision compared to TM-align in complex database searches (Figs. 2, 3), its alignments may better reflect evolutionarily meaningful relationships. This advantage is likely supported by its use of WCN and PSSM, two features linked to improved recognition of evolutionary relatedness^28,29,47,48,49.

SARST1 and iSARST ran BLAST as their structural string alignment engine and were benchmarked against it^22,23. Although SARST2 no longer calls BLAST, we included it in comparisons to show improvements. SARST1 and iSARST were slower than BLAST; SARST2 now outperforms it in both speed and precision. These gains largely stem from the first two filtering steps: word-matching and structural sequence alignment. Unlike BLAST, which uses amino acid sequences in both steps, SARST2 applies SSE and AAT alphabets for word-matching and aligns SARST sequences. This structure-based design enables more effective removal of non-homologous subjects. SARST2 further improves efficiency by using longer words derived from reduced SSE and AAT alphabets. This concentrates true matches near the diagonal, leading to the design of a shortcut matching algorithm that substantially limits the search space (see “Methods”). In addition to these algorithmic benefits, the Golang implementation also improves SARST2’s efficiency. As shown in Fig. 5c, even with only the first two filters enabled, SARST2 surpasses BLAST in precision and speed. Further gains from downstream refinement are reported in Supplementary Table 4. Golang is a cross-platform programming language. To facilitate broad accessibility, we compiled the current version of SARST2 (v2.0.30) into executables for Linux, macOS, and Windows, which are provided as Supplementary Software 1, 2, and 3, respectively.

Although the grouped search function doubled the speed of SARST2 (Table 1), it is still relatively rudimentary compared with the database clustering approach in Foldseek. In SARST2, subject proteins were grouped just when they shared the same length and SSE sequence (“Methods”), whereas Foldseek clustered proteins based on alignments. The average group size in SARST2 was only four proteins, limiting its impact. In future work, we plan to use SARST2 itself to align and cluster subject proteins. Increasing the group size may enable greater speed improvements.
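The grouping rule described above (same length and same SSE sequence) and the propagation of a group head's verdict to its members can be sketched as follows. The data layout, the choice of the first member as head, and the function names are assumptions; the sketch shows only the bookkeeping, not SARST2's actual filters.

```python
from collections import defaultdict

def group_subjects(sse_by_id):
    """Group subject IDs that share the same length and SSE sequence."""
    groups = defaultdict(list)
    for protein_id, sse in sse_by_id.items():
        groups[(len(sse), sse)].append(protein_id)
    return list(groups.values())

def grouped_filter(sse_by_id, passes_filter):
    """Evaluate `passes_filter` once per group head and reuse the result for members."""
    survivors = []
    for members in group_subjects(sse_by_id):
        head = members[0]  # assumption: the first member acts as the group head
        if passes_filter(sse_by_id[head]):
            survivors.extend(members)
    return survivors

subjects = {"p1": "HHHC", "p2": "HHHC", "p3": "EEEC", "p4": "HHH"}
print(group_subjects(subjects))
print(grouped_filter(subjects, lambda sse: sse.startswith("H")))
```

Here the filter runs three times for four subjects. With an average group size of four, as reported above, the saving per filter is bounded accordingly, which is why larger groups are expected to help.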

We are now witnessing the singularity of a protein structure Big Bang. During the peer-review of this article, Meta Platforms, Inc. released the ESM Metagenomic Atlas, a database of 772 million predicted structures, ~3.6 times larger than AlphaFold DB. In this rapidly expanding structural landscape, efficient and scalable structural alignment methods become increasingly critical. SARST2 offers a high-throughput solution that may support broad applications in structural biology and beyond. While BLAST is widely used in biological and medical research, SARST2 may offer complementary capabilities in structural analyses. Although the current implementation already exceeds BLAST in accuracy and speed, SARST2 remains under active development, with ongoing efforts to improve machine learning and refine algorithmic parameters. We look forward to seeing SARST2 contribute meaningfully to protein science and biotechnology by helping researchers navigate the challenges of the structural Big Data era. The program and several pre-formatted databases are available at https://10lab.ceb.nycu.edu.tw/sarst2 and https://github.com/NYCU-10lab/sarst (https://doi.org/10.5281/zenodo.17144647).

Methods

Hardware and software environments

The proposed algorithm was developed and evaluated on a personal computer running 64-bit CentOS 7 Linux, equipped with 16 Intel i9 5.2 GHz Hyper-Threading CPU cores (i.e., 32 logical processors), 125 GiB DDR5 RAM, a 2 TiB solid-state disk (SSD), and access to a 166 TiB Network Attached Storage (NAS) configured with RAID6. A 32 GiB swap partition was created on the SSD. The SARST2 programs were written in Golang (v1.22) using the Lo Library. C/C++ (gcc v4.8.5 with the glibc Library), Fortran (gcc v4.8.5 with the libgfortran Library), and PHP (v5.4.16 with the Zend Libraries) were used to compile or execute the software listed below when necessary. The amino acid sequence alignment search program blastp from the NCBI BLAST+ package (v2.11.0+)²⁸ was utilized to represent the BLAST algorithm. In some pairwise experiments, the bl2seq of the classic BLAST package (v2.2.13) was used. Structure alignment programs FAST (v.2005)¹⁷, TM-align (release date: 2016 May 21)¹⁸, Fr-TM-align (v1.0)⁵⁰, and MICAN-SQ (release date: 2019 Nov 27)²⁴, and structure alignment search algorithms SARST²² (v1.0-PHP), iSARST²³ (v1.5-PHP), MADOKA²⁵ (v1.1.0), and Foldseek (release date: 2024 Jun 27) were applied in this work.

Sources of experimental datasets

Protein structure files were downloaded from the AlphaFold Protein Structure Database⁶ (AlphaFold DB; https://doi.org/10.1093/nar/gkab1061; accessed on 2022 Dec 31), the Structural Classification of Proteins^30,31 (SCOP; v2), and the Protein Data Bank¹ (PDB; accessed on 2022 Dec 31). AlphaFold DB contains protein structures predicted by AlphaFold2^4,5. The dataset we compiled, named AlphaFoldDB-2022, included 214,459,158 structures (available at https://10lab.ceb.nycu.edu.tw/sarst2). No structural predictions were performed in this study. SCOP is a gold standard for protein research depending on well-defined homologous protein clusters. Based on automated matching and manual curation, it classifies protein domains hierarchically into classes, folds, superfamilies, and families according to structure, sequence, functional similarities, and evolutionary relationships^30,51. A lower level of SCOP classification stands for a higher level of homology. For instance, proteins in the same fold possess similar structural architecture and topology; proteins in the same family are closely related homologs with the same functions and clear evidence for their common evolutionary origin^30,51.

SCOP homologous data were obtained from the SCOPe server (v2.07)³¹. After removing proteins whose homology classifications came only from automated matching and had not yet been manually verified, 144,879 domain entities remained, drawn from 4022 families belonging to 1229 folds. This dataset is referred to as SCOP-2.07 (Supplementary Data 2). To prepare the materials for machine learning and to evaluate the alignment algorithms by IR, we established two query datasets from SCOP-2.07 through the following steps: (1) Consider only the families belonging to the four major classes: all-alpha, all-beta, alpha/beta, and alpha+beta. (2) Count the proteins in each family. (3) For each class, randomly select 150 families possessing ≥20 proteins. (4) From each selected family, randomly choose one protein as the representative query. (5) Split the 600 (150 × 4) representative query proteins into two datasets, 200 for training and 400 for testing. These are named Qry200 and Qry400 in this report (see Supplementary Data 1). The CATH protein structure classification database⁵² was also utilized for independent tests, as described in the subsection “Utilization of machine learning”.

PDB is the primary database for macromolecular structures determined by biophysical techniques such as X-ray crystallography, NMR spectroscopy, and Cryo-electron microscopy. The PDB dataset we prepared, PDB-2022, contained 713,183 experimentally determined structures (i.e., no predicted structures). It is pre-formatted in the SARST2 database format and available at https://10lab.ceb.nycu.edu.tw/sarst2.

All source databases listed above provided structural files in either PDB or mmCIF format. The implemented SARST2 programs supported both types of structural files. From either file type, the SARST2 programs extracted residue and atomic information only based on the ATOM descriptions. Sequence information file sections like SEQRES or pdbx_seq_one_letter_code and secondary structural information sections like HELIX or HELX_P were ignored.

Reduction of amino acid codes

Amino acid (AA) code reduction has been shown to enhance the performance of protein circular permutation site prediction and secondary structure prediction^36,53. In this study, we encoded amino acid sequences using the standard 22-symbol AA alphabet (20 standard codes plus B for Asn/Asp and Z for Gln/Glu), as well as the reduced 3- to 6-symbol alphabets applied in our prior research^35,36,53. The 3-symbol alphabet categorized AAs as hydrophobic (A, V, C, I, L, M, F, W), hydrophilic (N, D, E, K, Q, R), and neutral (G, S, P, T, H, Y)⁵⁴. The 4-symbol alphabet grouped them as nonpolar (G, A, P, V, I, L, M, F, W), polar (S, T, C, N, Q, Y), acidic (D, E), and basic (K, H, R)⁵⁵. The 5-symbol alphabet divided AAs into aromatic (F, Y, W), nonpolar aliphatic (G, A, V, I, L, M), polar uncharged (S, P, T, C, N, Q), positively charged (K, H, R), and negatively charged (D, E). The 6-symbol alphabet classified residues as aliphatic (G, A, V, I, L), hydroxyl/sulfur-containing (S, T, C, M), cyclic (P), aromatic (F, Y, W), basic (K, H, R), and acidic (N, D, E, Q)⁵⁶. Among these, the 5-symbol alphabet showed the best performance in improving the efficiency of structural alignment search.
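As an illustration of the code reduction described above, the following minimal sketch maps a one-letter amino acid sequence onto the 5-symbol alphabet. The group labels (`a`, `n`, `p`, `+`, `-`), the function name, and the handling of non-standard codes are assumptions made for this example; they are not the symbols used inside SARST2.

```python
# 5-symbol grouping as listed in the text. The one-character labels are
# arbitrary placeholders chosen for this sketch, not SARST2's own codes.
GROUPS_5 = {
    "a": "FYW",     # aromatic
    "n": "GAVILM",  # nonpolar aliphatic
    "p": "SPTCNQ",  # polar uncharged
    "+": "KHR",     # positively charged
    "-": "DE",      # negatively charged
}
AA_TO_AAT = {aa: label for label, members in GROUPS_5.items() for aa in members}


def reduce_sequence(seq, unknown="X"):
    """Map an amino acid sequence onto the reduced alphabet.

    Codes outside the 20 standard residues (e.g. B, Z) become `unknown`,
    which is an assumption of this sketch.
    """
    return "".join(AA_TO_AAT.get(aa, unknown) for aa in seq.upper())
```

For example, `reduce_sequence("MKWDE")` gives `"n+a--"` under these placeholder labels.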

Determination of protein secondary structures

Protein backbone local conformations are commonly described using secondary structure types, such as helices and strands. The widely used DSSP (Definition of Secondary Structure of Proteins)²⁷ algorithm classifies protein secondary structures into α-helix (H), 3₁₀-helix (G), π-helix (I), extended strand (E), isolated β-bridge (B), hydrogen-bonded turn (T), bend (S), and loop/coil (C). In many studies, this 8-symbol alphabet is reduced to a 3-symbol version, grouping H, G, and I as H (helices), E and B as E (extended strands), and all others as C. DSSP assigns secondary structure elements (SSEs) based on hydrogen bonding patterns and geometric features such as interatomic distances and angles between specific atomic configurations. TM-align proposed a simpler method for assigning 4-symbol SSEs (helices, strands, turns, and coils) based solely on Cα distances across several consecutive residues¹⁸. We tested the 8-/3-symbol DSSP alphabets and the 4-symbol TM-align scheme. The latter yielded the highest database searching efficiency.
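The 8-to-3 reduction of the DSSP alphabet described above can be written as a simple lookup. This is a minimal sketch; the function name is hypothetical, and any symbol other than H, G, I, E, or B (including a blank) is treated as coil.

```python
# Reduction of the 8-symbol DSSP alphabet to 3 symbols:
# H, G, I -> H (helices); E, B -> E (extended strands); all others -> C.
DSSP8_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(ss8):
    """Convert an 8-symbol DSSP string into the 3-symbol version."""
    return "".join(DSSP8_TO_3.get(symbol, "C") for symbol in ss8)
```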

Linear encoding of protein tertiary structures

SARST linearly encodes protein structures using a 23-symbol Ramachandran code alphabet²². This alphabet was defined by partitioning the Ramachandran plot⁵⁷ of residue backbone torsion angles (ϕ, ψ) into 22 regions. For each residue of interest i, ϕ is defined as the dihedral angle between the planes C_(i-1)-N_i-Cα_i and N_i-Cα_i-C_i, and ψ is that between N_i-Cα_i-C_i and Cα_i-C_i-N_(i+1). N and C denote the backbone amide nitrogen and carbonyl carbon atoms, respectively. When a residue lacked the atoms necessary to compute either angle, it was encoded as X, the 23rd SARST code.
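The torsion angles that feed the Ramachandran encoding can be computed from four atomic positions. The sketch below uses the common atan2 formulation of a dihedral angle; the function names are hypothetical, only missing terminal neighbours are handled, and the partition of the (ϕ, ψ) plane into the 22 SARST regions is not reproduced here because it is not given in this section.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) between planes p0-p1-p2 and p1-p2-p3."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def phi_psi(c_prev, n, ca, c, n_next):
    """Return (phi, psi); an angle is None when its neighbouring atom is missing.

    A residue with a None angle would be encoded as X in the SARST alphabet.
    """
    phi = dihedral(c_prev, n, ca, c) if c_prev is not None else None
    psi = dihedral(n, ca, c, n_next) if n_next is not None else None
    return phi, psi
```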

Grouping of subject structures in the SARST2-format target database

The grouping principle in SARST2 was that subject structures with the same SSE sequence and length were assigned to the same group, with the first member designated as the group head by the formatdb program. Under this strict criterion, the average group size was small. We supposed that enlarging group size could further accelerate database searching. To this end, we reduced the 4-symbol TM-align SSE alphabet to only two symbols: H and E were merged, as were T and C. Subject proteins sharing the same AAT sequence as a group head were also included in that group. These modifications increased the average group size by ~60% for the SCOP-2.07 dataset.
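A minimal sketch of the grouping idea, assuming each subject is given as a name and a 4-symbol SSE string: the SSE string is reduced to two symbols (H and E merged, T and C merged) and used as the group key, so identical keys imply identical length. The function name and the placeholder symbols `a` and `b` are assumptions, and the additional rule based on identical AAT sequences is omitted.

```python
from collections import defaultdict

# Two-symbol reduction of the 4-symbol TM-align SSE alphabet.
# "a" and "b" are placeholder symbols chosen for this sketch.
SSE4_TO_2 = {"H": "a", "E": "a", "T": "b", "C": "b"}


def group_subjects(subjects):
    """Group (name, sse_string) pairs by their reduced SSE sequence.

    The first member appended to each group plays the role of the group head.
    """
    groups = defaultdict(list)
    for name, sse in subjects:
        key = "".join(SSE4_TO_2[s] for s in sse)
        groups[key].append(name)
    return dict(groups)
```

With the inputs `("p1", "HHTC")`, `("p2", "EETC")`, and `("p3", "HHHC")`, the first two subjects fall into one group headed by `p1`, while `p3` forms its own group.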

Scoring matrices

The scoring matrix for SARST sequence alignment, the SARSTSM²², was produced based on the BLOSUM algorithm³⁸, which was also applied here in an iterative manner to generate the scoring matrices for SSE and AAT sequences, as described below.

1. Randomly select a million pairs of family-level homologous proteins from the SCOP database. None of the selected pairs belonged to the SCOP families of the Qry400 proteins.
2. Align the selected homologs with EMBOSS needle³⁷, a Needleman–Wunsch global alignment program, using BLOSUM45 as the scoring matrix.
3. Transform the aligned amino acid sequences into the required sequence type, for example, the SSE sequence. This yields well-aligned SSE sequences of the homologs.
4. Generate the scoring matrix using the Henikoffs’ BLOSUM formula³⁸:
$$\mathrm{Score}_{ij} = f_s \times \log_2 (b_{ij}/e_{ij})$$
(4)
where i and j represent an aligned pair of SSEs, b_ij and e_ij are the observed and expected probabilities, respectively, of the occurrence of pair ij, and f_s is a scaling factor. Following the Henikoffs’ empirical setting, f_s was set to 2.
5. Evaluate the scoring matrix by performing an IR experiment with Qry200 and SCOP-2.07. If the average precision and speed achieved with this matrix differ insignificantly from those achieved with the previous matrix, stop the iteration.
6. Otherwise, repeat the procedure from Step 2. From this new iteration onward, use the SARST2 program and the latest scoring matrix to align the homologs.
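Step 4 of the procedure above can be sketched as follows, assuming the aligned symbol pairs (with gap positions already removed) have been collected into a list. The expected probabilities follow the usual BLOSUM construction, in which each symbol's background frequency is derived from the observed pair frequencies; the function name is hypothetical, scores are rounded to integers, and pairs never observed are simply left out of the result.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs, scale=2.0):
    """Compute Score_ij = scale * log2(b_ij / e_ij) from aligned symbol pairs.

    aligned_pairs: iterable of (symbol_a, symbol_b) taken from alignment columns.
    Returns a dict keyed by the sorted symbol pair.
    """
    counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    total = sum(counts.values())
    observed = {pair: c / total for pair, c in counts.items()}

    # Background frequency of each symbol, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        expected = background[a] * background[b]
        if a != b:
            expected *= 2  # unordered pair can occur in two orientations
        scores[(a, b)] = round(scale * math.log2(freq / expected))
    return scores
```

The default `scale=2.0` corresponds to the f_s value stated in the text.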

Word-matching strategy

SARST2 utilized code-reduced alphabets rather than large alphabets in word matching. Specifically, it used the 5-symbol AAT^35,36 and 4-symbol TM-align SSE¹⁸ alphabets instead of the 20-symbol AA and 8-symbol DSSP SSE²⁷ alphabets. The core approach followed the 2D word table design used in the FASTA algorithm^40,58. Both query and subject protein sequences were broken into words of w residues. As illustrated in Supplementary Fig. 1, a smaller alphabet allows a larger w for matching, which may improve homology specificity. The math behind this is exemplified below,

1. Take the 5-symbol AAT and 20-symbol AA alphabets for demonstration. For simplicity, assume that all symbols occur with equal frequency, so that the occurrence of each AAT code is 20% (i.e., 100% ÷ 5) and that of each AA code is 5% (i.e., 100% ÷ 20).
2. Let n_sbj denote the number of residues of the subject protein. Assume n_sbj = 401.
3. The theoretical occurrence of an AAT word and an AA word of w residues is (20%)^w and (5%)^w, respectively.
4. Let w = 2, which is used in both the FASTA⁵⁸ and BLAST²⁸ algorithms. Given a query AAT word, the expected number of its matches in the subject sequence would be (20%)^w × (n_sbj – w + 1), which equals 16.0, while that for a given query AA word would be (5%)^w × (n_sbj – w + 1), which equals 1.0.
5. Let w be bigger, say 4. The expected number of matches for the AAT word would be ~0.6, while that for the AA word would be almost zero (2.5 × 10^–3).
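The arithmetic in the steps above can be checked with a few lines of code. This sketch assumes, as the text does, that all symbols of an alphabet are equally frequent; the function name is hypothetical.

```python
def expected_matches(alphabet_size, w, n_sbj):
    """Expected number of matches of one query word of length w in a subject
    of n_sbj residues, assuming equally frequent symbols."""
    word_probability = (1.0 / alphabet_size) ** w
    n_positions = n_sbj - w + 1  # number of words in the subject sequence
    return word_probability * n_positions
```

With n_sbj = 401, the function reproduces the values quoted in the list: about 16.0 and 1.0 for w = 2, and about 0.64 and 2.5 × 10^–3 for w = 4, for the 5-symbol and 20-symbol alphabets, respectively.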

Using a larger word size improved speed, as fewer words would be processed. Moreover, we developed an algorithm to quickly identify the matches lying close to or in parallel with the diagonal of the matching table. This shortcut algorithm performed better with a big word size than with small ones, as illustrated in Supplementary Fig. 2 and described below,

1. The original search space of the subject sequence is [1, n_sbj].
2. Take the starting word of the query sequence and find the position of its first occurrence in the subject sequence’s search space (denoted by p_sbj).
3. Cut the subject sequence’s search space to [p_sbj + k, n_sbj], where k is an empirical factor for cutting more of the search space and was set as 1 in this work.
4. Repeat steps 2 to 3 with the next word of the query sequence.
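The four steps above can be sketched as a short loop over query words. This is an illustration under stated assumptions rather than the SARST2 implementation: positions are 0-based, query words overlap with a step of one residue, and when a word has no occurrence in the remaining search space the search space is left unchanged. The function name is hypothetical.

```python
def diagonal_shortcut(query, subject, w=5, k=1):
    """Return a list of (query_pos, subject_pos) word matches.

    Each query word is matched at most once, at its first occurrence in the
    remaining subject search space, which is then cut to start at p_sbj + k.
    """
    matches = []
    start = 0  # left edge of the remaining subject search space
    for q in range(len(query) - w + 1):
        word = query[q:q + w]
        p_sbj = subject.find(word, start)
        if p_sbj == -1:
            continue  # assumption: a miss does not change the search space
        matches.append((q, p_sbj))
        start = p_sbj + k
    return matches
```

For the query `ABCDEFG` and the subject `XXABCDEFGXX` with w = 5, the three query words are matched at subject positions 2, 3, and 4, that is, along a single diagonal.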

In addition to shrinking the search space, this algorithm significantly accelerated the matching because it attempted only one match on the subject sequence for each word from the query. The specificity of each match was therefore important: when the word size was small, nonspecific short hits could divert the matching stream from the diagonal. We found that a word of 5–7 residues worked well with the 5-symbol AAT and 4-symbol SSE alphabets for most query proteins.

Most alignment methods use match extension to refine scores^28,40,58; however, such extension was time-consuming. To save time, SARST2 combined complete word matching (Supplementary Fig. 1) and the proposed diagonal shortcut matching (Supplementary Fig. 2), summing the resulting match scores. A subject with a higher sum should more likely be homologous to the query.

Application of the weighted contact number

We have developed the WCN residue measure to describe the packing density of a protein structure²⁹. In this work, we modified the original formula to include a weighting factor w_ij, yielding a more generalized form, as shown below,

$${{\rm{WCN}}}\left(i\right)={\sum }_{j=1}^{n}\frac{{w}_{{ij}}}{{d}_{{ij}}^{2}}{;\; j}\, \ne\, i$$

(5)

where i denotes the residue of interest in a protein of n residues, j represents any other residue in the same protein, and d_ij is the distance between residues i and j. The factor w_ij weights each residue pair i and j. WCN was typically normalized by the Z-score equation^29,35,36, which was also applied in this study.

A residue with many close neighboring residues would have a high WCN. We discovered that high-WCN residues are generally conserved in evolution²⁹, and a circular permutation occurring at such residues would greatly affect the surrounding residues and, therefore, be deleterious to the stability of the permutant^35,36. In the present study, we tested (1) w_ij = 1 as usual, (2) w_ij = m_i, (3) w_ij = m_j, and (4) w_ij = m_i× m_j, where m stands for the molecular weight of the amino acid of a residue. Among them, w_ij = m_j outperformed the others and was applied in the SARST2 algorithm.
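A minimal sketch of Eq. (5) with the w_ij = m_j weighting, assuming one representative coordinate per residue and a per-residue list of molecular weights. The function names are hypothetical, and the use of the population standard deviation in the Z-score is an assumption of this example.

```python
import math


def weighted_contact_numbers(coords, weights=None):
    """WCN(i) = sum over j != i of w_ij / d_ij^2.

    coords:  list of (x, y, z) tuples, one per residue.
    weights: per-residue values used as w_ij = weights[j]; None means w_ij = 1.
    """
    n = len(coords)
    wcn = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if j == i:
                continue
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            w_ij = 1.0 if weights is None else weights[j]
            total += w_ij / d2
        wcn.append(total)
    return wcn


def z_normalize(values):
    """Z-score normalization (population standard deviation assumed).

    Not defined when all values are identical (zero standard deviation).
    """
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]
```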

Variable gap penalty

Two gap penalties were applied in every DP alignment step of SARST2, one for gap opening (G_o) and the other for extension (G_e). As stated in Results, besides the conventional concept that gaps should be suppressed in regular SSE segments, we proposed that VGP could be implemented by suppressing gaps in conserved residue positions. The PSSM algorithm of PSI-BLAST²⁸ (applied with only one round in this work) and the Shannon information entropy were used to encode the sequence conservation profile (denoted by C; see Eqs. (2) and (3)) of the query protein. The SSE segmentation and conservation information of the residue of interest (i) of the query was integrated into a variable gap penalizing scheme shown below,

$${{{\rm{G}}}}_{{{\rm{o}}}}\left(i\right)={{{\rm{W}}}}_{{{\rm{o}}}}({{sse}}_{i},l)\times ({g}_{{{\rm{o}}}}+{f}_{{{\rm{o}}}}\times {{\rm{C}}}(i))$$

(6)

$${{{\rm{G}}}}_{{{\rm{e}}}}\left(i\right)={{{\rm{W}}}}_{{{\rm{e}}}}({{sse}}_{i},l)\times ({g}_{{{\rm{e}}}}+{f}_{{{\rm{e}}}}\times {{\rm{C}}}(i))$$

(7)

where o and e denote opening and extension, respectively, g is the basic penalty, f is a scaling factor for the conservation score, and W(sse_i, l) is the weighting function for the SSE segment with l residues to which i belongs.

In this work, f_o and f_e were set to the same value, and so were W_o and W_e. They can be individually optimized in the future. The g_o and g_e for each step are listed in Supplementary Table 5. The W_o and W_e functions were made for helices, strands, and turns; for coils, their values were set to 1. See also Supplementary Table 5 for details.
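Eqs. (6) and (7) share one form, which the sketch below expresses as a single function. The actual g values and W functions are given in Supplementary Table 5 and are not reproduced here; `example_weight` and all numeric values in it are purely hypothetical placeholders, apart from the coil weight of 1 stated in the text.

```python
def gap_penalty(sse, seg_len, conservation, g, f, weight_fn):
    """G(i) = W(sse_i, l) * (g + f * C(i)), used for both opening and extension.

    sse:          SSE symbol of the segment containing residue i.
    seg_len:      number of residues l in that segment.
    conservation: conservation score C(i) of residue i.
    """
    return weight_fn(sse, seg_len) * (g + f * conservation)


def example_weight(sse, seg_len):
    """Hypothetical weighting function, for illustration only."""
    if sse == "C":
        return 1.0  # coils: weight fixed at 1, as stated in the text
    # Placeholder: longer regular segments suppress gaps more strongly.
    return 1.0 + 0.1 * min(seg_len, 10)
```

With these placeholder settings, a conserved residue inside a long helix receives a larger penalty than the same residue in a coil, which is the qualitative behavior the scheme is designed to produce.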

Structural similarity score between proteins

This work determines the structural similarity between proteins using the TM-score¹⁸. After structural superimposition based on a DP alignment, the TM-score between proteins was computed as follows:

$$\text{TM-score} = \max \left[ \frac{1}{n_{\mathrm{qry}}} \sum_{a=1}^{n_{\mathrm{ali}}} \frac{1}{1 + \left( \frac{d_a}{\mathrm{d}_0(n_{\mathrm{qry}})} \right)^2} \right]$$

(8)

$$\mathrm{d}_0(n) = 1.24 \times \sqrt[3]{n-15} - 1.8$$

(9)

where n denotes the number of residues, qry refers to the query protein, n_ali means the number of aligned residue pairs between the query and subject proteins, and d_a is the distance between the ath aligned pair of residues. The d₀() is an empirical normalization function that makes the TM-score independent of protein size.
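For a single superposition, Eqs. (8) and (9) reduce to a sum over the distances of the aligned residue pairs. The sketch below evaluates that inner term; the maximization over superpositions in Eq. (8) would then be the maximum of this value over the candidate superpositions. The function names are hypothetical, and the guard for n ≤ 15 is an assumption added because Eq. (9) is not meaningful for such short chains.

```python
def d0(n):
    """Empirical normalization distance of Eq. (9)."""
    if n <= 15:
        raise ValueError("d0 is only evaluated here for n > 15")
    return 1.24 * (n - 15) ** (1.0 / 3.0) - 1.8


def tm_score_for_superposition(distances, n_qry):
    """Bracketed term of Eq. (8) for one superposition.

    distances: distance d_a for each aligned residue pair.
    n_qry:     number of residues in the query protein.
    """
    scale = d0(n_qry)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / n_qry
```

A perfect superposition of all query residues (all distances zero) gives 1.0, and every aligned pair at distance d₀ contributes exactly half of its maximum.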

TM-score was initially proposed in TM-align¹⁸ and refined in Fr-TM-align⁵⁰. Since its calculation depends on the alignment between structures, different alignment methods may yield different TM-scores for the same proteins. Based on the pairwise alignment data from Fig. 5b, we analyzed the distribution of TM-scores computed by SARST2, TM-align, and Fr-TM-align. Because the TM-align version we used had already incorporated the TM-score algorithm of Fr-TM-align, the results of TM-align and Fr-TM-align were nearly identical. Supplementary Fig. 3 illustrates that the TM-scores computed by SARST2 were slightly lower than those computed by TM-align, but their distributions were similar. The Pearson correlation coefficient between the TM-scores calculated by the two methods was 0.970. Moreover, comparing the TM-scores SARST2 computed with SCOP family- and superfamily-level homologs, we found that a TM-score ≥ 0.7 may indicate family-level homology between proteins.

Computation of the pC-value, a score quality control factor

In this work, we defined a confidence score (Conf) to describe how likely a hit subject is a structural homolog of the query protein. In addition, the negative logarithm with base two of Conf, i.e., the pC-value, was applied as a quality control factor for eliminating hit subjects with low structural similarities to the query, as shown below.

$$\mathrm{Conf} = S^{\prime} \times \text{TM-score} \times p\text{-score}$$

(10)

$${S}^{{\prime} }=(S-{S}_{\min })/({S}_{\max }-{S}_{\min })$$

(11)

$$\text{pC-value} = -\log_2 (\mathrm{Conf})$$

(12)

where S denotes the structural string similarity score computed by the synthesized DP between the subject and query proteins, and S′ is S being normalized. The theoretical minimum of S, S_min, is calculated according to the scoring matrices applied in the synthesized DP such that every query residue is aligned with a residue that produces the lowest score (see Eq. (1)). The theoretical maximum of S, S_max, is calculated by aligning the query structural string to itself. TM-score, defined above, quantifies structural similarity based on structural superimposition. The p-score is a probability score obtained by ANN, determining whether a subject is a family-level structural homolog of the query protein (see the next subsection).

The typical range of either S’, TM-score, or p-score was [0, 1]; thus, the Conf score also fell into this range. In some rare cases where Conf was lower than 0 or larger than 1, it was set to 0 or 1, respectively. Both Conf and pC-value could be used to remove poorly scored subjects. We actually implemented a Conf score threshold and a pC-value cutoff in the SARST2 standalone program. In contrast with the Conf score, the pC-value describes how unlikely a subject is a structural homolog of the query protein, where a large pC-value implies a poor structural similarity between the subject and the query. The purpose of the pC-value was to be a quality control parameter analogous to BLAST’s E-value²⁸, for which a high value implies a poor sequence alignment quality between proteins. However, it is noteworthy that the E-value has a solid statistical meaning—the expected number of hits with similar scores (alignment qualities) that could be found in the target database only by chance, whereas the pC-value is simply the negative binary logarithm of the confidence score. A pC-value of x means the confidence score of a hit is 1/2^x.
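Eqs. (10)–(12), together with the clamping of Conf to [0, 1] described above, can be sketched as follows. The function names are hypothetical, and returning infinity for Conf = 0 is an assumption of this example, since the text does not state how that case is reported.

```python
import math


def confidence(s, s_min, s_max, tm_score, p_score):
    """Conf = S' * TM-score * p-score, with S' = (S - S_min) / (S_max - S_min).

    The result is clamped to the range [0, 1].
    """
    s_norm = (s - s_min) / (s_max - s_min)
    conf = s_norm * tm_score * p_score
    return min(1.0, max(0.0, conf))


def pc_value(conf):
    """pC-value = -log2(Conf); infinity is returned for Conf = 0 (assumption)."""
    return math.inf if conf == 0 else -math.log2(conf)
```

For instance, S′ = 0.5, TM-score = 0.5, and p-score = 0.5 give Conf = 0.125 = 1/2^3 and hence a pC-value of 3.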

Utilization of machine learning

The integrated machine learning system³⁵ we developed was utilized here. The applied ML algorithms included a decision tree and an artificial neural network with backpropagation, which had been used to predict circular permutations^35,36 and secondary structures^53,59. Applying ML in SARST2 accelerates filtering by classifying whether a hit subject is a family-level homolog of the query protein, i.e., a binary classification. The features applied in SARST2 fell into five categories: (1) basic properties of the subject, e.g., protein size; (2) relative properties of the subject, e.g., subject-to-query size ratio; (3) similarity scores from word matching; (4) similarity scores from DP alignments; (5) structural similarity scores, e.g., TM-score and root-mean-square distance. As illustrated in Fig. 1, four ML filtering gates were established for these steps: (1) word matching; (2) the initial DP for SARST sequences; (3) the DP for SSE plus AAT sequences; (4) the final TM-scoring. The number of applied features gradually increased across these stages. See Supplementary Table 6 for details.

Every ML model was trained with Qry200 and tested with Qry400. None of the proteins from Qry200 nor their family-level homologs overlapped with Qry400. Each Qry200 query was compared with all structures of the SCOP-2.07 dataset. Family-level homologs of the query were classified as positive, while the others were negative cases. The number of positive cases obtained with Qry200 was 24,462, and that of negative cases was 28,951,338. To handle this class imbalance, we applied random sampling to construct multiple balanced subsets and trained an ensemble of small ML models for the binary classification, as follows,

1. Make a training subset by randomly choosing 20,000 cases from the positive set and 20,000 from the negative set.
2. Compute the feature values for each case.
3. Establish a small DT model for this training subset.
   (1) The classification branch of the CART algorithm (Classification And Regression Tree) is applied with Gini’s diversity index⁶⁰.
   (2) The Gini index is an impurity measure for tree building. If p_pos and p_neg represent the probability of a subject being classified into the positive and negative sets, the index is 1 – p_pos² – p_neg².
4. Establish a small ANN model for this training subset.
   (1) The ANN model comprises an input layer with n_i nodes, two hidden layers, and an output layer with one node, where n_i equals the number of features.
   (2) The first and second hidden layers contain n_i × 1.5 and √(n_i × 1.5) nodes, respectively.
   (3) In the training process, the output answer for a positive case is 1, and that for a negative case is 0.
5. Repeat Steps 1–4 m times, establishing m small DT and m small ANN models.
6. In SARST2 database searching, for each hit, use the small DT models as a filter to determine whether it can remain in the subject pool.
   (1) Calculate the feature values of the hit and input them into the small models.
   (2) If any model classifies the hit as a positive case, it remains in the pool and goes to Step 7; otherwise, it is discarded.
7. For each hit remaining in the subject pool, compute its probability score of being a family-level homolog of the query, i.e., the p-score in Eq. (10).
   (1) Input the feature values of the hit into the m small ANN models.
   (2) Average the m output values of the small models to make the final output. This final output is defined as the p-score for the hit.
   (3) The output of each small ANN model is a real number ∈ [0, 1]; hence, the p-score also has a minimum of 0 and a maximum of 1.
8. Based on Eqs. (10)–(12), calculate each hit’s confidence score and pC-value.
9. Update the hit list by sorting the hits according to the confidence score from high to low (meaning, the pC-value from low to high).
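The decision logic of the ensemble can be sketched without committing to any ML library. In the example below, each trained model is represented by a plain callable, which is an assumption made for illustration; the function names are hypothetical, the two-class Gini impurity is written in its standard form 1 − p_pos² − p_neg², and rounding the hidden layer sizes to integers is likewise an assumption.

```python
def gini_impurity(p_pos):
    """Standard two-class Gini impurity: 1 - p_pos^2 - p_neg^2."""
    p_neg = 1.0 - p_pos
    return 1.0 - p_pos ** 2 - p_neg ** 2


def hidden_layer_sizes(n_features):
    """Sizes of the two hidden layers: n_i * 1.5 and sqrt(n_i * 1.5), rounded."""
    first = n_features * 1.5
    return round(first), round(first ** 0.5)


def passes_filter(features, dt_models):
    """A hit stays in the pool if any small DT model calls it positive.

    dt_models: callables mapping a feature vector to True/False.
    """
    return any(model(features) for model in dt_models)


def p_score(features, ann_models):
    """Average of the outputs of the small ANN models (each in [0, 1]).

    ann_models: callables mapping a feature vector to a float in [0, 1].
    """
    outputs = [model(features) for model in ann_models]
    return sum(outputs) / len(outputs)
```

Because the filter keeps a hit when any single DT model votes positive, adding more small models can only make the filter more permissive, which favors recall at the filtering stage and leaves the finer ranking to the averaged p-score.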

Trained with Qry200 and tested with Qry400, SARST2 outperformed state-of-the-art methods like TM-align, MICAN-SQ, and MADOKA (compare Supplementary Table 7 with Fig. 2). Since the training and testing sets comprised different SCOP families, the evaluation may be considered independent. Nevertheless, we further evaluated SARST2 with more difficult cases. CATH, another structural classification database established through automated matching and manual inspections⁵², classifies proteins hierarchically into Class, Architecture, Topology, and Homologous superfamily levels. SCOP and CATH differ substantially in domain definitions and in how they partition the fold space across classification levels. We obtained the 40% identity non-redundant subset from the CATH website, nrCATH40 for short, as the independent test dataset (Supplementary Data 3). This dataset was also highly challenging because of the low sequence identity between homologs. Supplementary Table 7 demonstrates that, despite decreased precision, SARST2 still performed well on nrCATH40. The precision of Foldseek and TM-align decreased to the same level (~72%) under this challenge. Foldseek was the most accurate, with a precision 0.89% higher than that of SARST2, which slightly outperformed TM-align. SARST2 was the fastest, running 1.37× faster than Foldseek. For reference, the precision of BLAST (~26%) declined much more than that of SARST2, Foldseek, and TM-align. SARST2 and Foldseek were superior in speed, running 2.49× and 1.82× faster than BLAST, respectively. SARST2 thus remained robust, even under these challenging independent tests. The final SARST2 program was then trained by combining Qry200 and Qry400 into a single query dataset. This final version was used for all other experiments described in this report, such as those shown in Figs. 2–5.

Performance evaluations by information retrieval (IR)

The Qry400 and SCOP-2.07 were utilized as the query and target datasets in most of our IR experiments. For each query protein in the Qry400, a hit list of candidate homologs was retrieved from the SCOP-2.07 using an assessed alignment method. The accuracy of the search method was defined by how condensed the known SCOP family-level homologs of the query were ranked in the hit list. Let N_total denote the total number of known homologs in the target database, and N_ret the number of retrieved homologs counted from the top of the hit list to a specified position N_i. IR parameters, recall (r) and precision (P)³⁴, were computed as follows:

$$r=\frac{{N}_{{{\rm{ret}}}}}{{N}_{{{\rm{total}}}}}\times 100(\%)$$

(13)

$${{\rm{P}}}(r)=r\times \frac{{N}_{{{\rm{total}}}}}{{N}_{{{\rm{i}}}}}(\%)$$

(14)

These metrics exhibit an inverse relationship: increasing recall often decreases precision. To more intuitively evaluate the overall accuracy of a method, we computed the eleven-point average precision³² (referred to as “average precision” in this report for simplicity),

$${{\rm{Average}}}\; {{\rm{precision}}}=\frac{{\sum }_{k=0}^{10}{{\rm{P}}}(10k)}{11}$$

(15)

where P(0), the precision at 0% recall, was calculated as the precision when the first known homolog is retrieved (i.e., N_ret = 1)^22,32.
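Eqs. (13)–(15) can be sketched for a single query as follows, given the ranked hit list as a sequence of booleans (True when the hit is a known homolog). Two points are assumptions of this example rather than statements from the text: a recall level of 10k% is taken as reached once ⌈k × N_total / 10⌉ homologs have been retrieved, and a recall level that the hit list never reaches contributes a precision of 0. The function name is hypothetical.

```python
def eleven_point_average_precision(hit_is_homolog, n_total):
    """Eleven-point average precision (%) for one ranked hit list.

    hit_is_homolog: ranked sequence of booleans, best hit first.
    n_total:        number of known homologs in the target database.
    """
    # precisions[m - 1] = precision (%) when the m-th homolog is retrieved
    precisions = []
    found = 0
    for rank, is_homolog in enumerate(hit_is_homolog, start=1):
        if is_homolog:
            found += 1
            precisions.append(100.0 * found / rank)

    total = 0.0
    for k in range(11):
        # Homologs needed to reach 10k% recall; P(0) uses the first homolog.
        needed = max(1, -(-(k * n_total) // 10))
        if needed <= len(precisions):
            total += precisions[needed - 1]
    return total / 11
```

A hit list that ranks all known homologs first yields 100%, and any non-homolog ranked above a homolog lowers the precision at every recall level reached after it.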

Parameter optimization for database searching algorithms

The parameter settings of SARST1 and iSARST followed configurations determined through performance benchmarking in our previous studies^22,23, with word size = 3, gap opening and extension penalties = 9 and 2, and the low-complexity filter for query sequences turned off. The parameters of SARST2 were selected based on its IR performance using the Qry200 dataset, with Qry400 used to validate the chosen settings. In the current implementation, the primary user-adjustable parameter influencing speed and accuracy is the word size, which was fixed as 5 for all IR experiments reported in this paper. The main parameters of Foldseek and BLAST were tuned individually by evaluating their IR performance on Qry400, as detailed in Supplementary Table 1. For Foldseek, the settings were: sensitivity = 0.85, word size = 5, and CPU instruction set = AVX2 (Advanced Vector Extensions 2). For BLAST, the settings were: word size = 2, scoring matrix = BLOSUM62, and gap opening and extension penalties = 11 and 2. All algorithms provided user-adjustable quality control cutoffs for the reported hits (e.g., E-value, pC-value, or the number of hits). Foldseek also offered an adjustable clustered database search mode. These options were systematically tested and compared (Supplementary Table 1). The most effective quality control cutoff for SARST2 was the pC-value, while Foldseek and BLAST achieved optimal performance using E-values. In both the SCOP-2.07 dataset (145 thousand proteins; Figs. 2, 3) and the whole AlphaFold DB (214 million proteins; Supplementary Table 2), Foldseek exhibited improved performance when using the clustered database compared to its standard database.

Performance evaluations by pairwise alignment

Not all the assessed methods supported pairwise alignment without extra preparation steps. FAST, TM-align, and MICAN-SQ were implemented as pairwise programs. The classic BLAST package provides a pairwise program, bl2seq. The Fr-TM-align⁵⁰ and MADOKA programs could make batch computations, but their pairwise alignment modes reported more details. The iSARST web server had no pairwise function. The SARST1 program could not run a pairwise alignment unless a target database containing the single subject structure was first prepared by a formatdb command. SARST2, Foldseek, and the modern BLAST (blastp) were database search algorithms, but their programs could also perform pairwise alignments without any formatdb step. We evaluated the pairwise alignment performance of several algorithms. The average running time shown in Fig. 3a was calculated from all-against-all pairwise alignments between the Qry400 and SCOP-2.07, totaling ~58 million alignments (400 × 144,879). The identity mapping and distribution data exhibited in Fig. 5a, b were generated from a sampled dataset of homolog pairs to avoid the bias caused by unequal family sizes. A large SCOP family might contain several thousand proteins, potentially dominating the identity distribution compared to small families. The smallest SCOP family of Qry400 had 20 proteins, which could form 190 (20 × 19/2) unique homolog pairs, defining the upper sampling limit per family. We randomly selected 150 pairs from each family, forming a dataset of 60,000 homolog pairs to analyze alignment quality and identity distributions across different methods (Fig. 5).
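The per-family sampling of homolog pairs can be sketched as below, assuming each family is available as a list of member identifiers. The function name and the fixed random seed are assumptions of this example. Enumerating all pairs is fine for small families but becomes memory-hungry for families with thousands of members, where sampling index pairs directly would be preferable.

```python
import itertools
import random


def sample_homolog_pairs(families, pairs_per_family=150, seed=0):
    """Draw the same number of unique homolog pairs from every family.

    families: dict mapping a family id to a list of member identifiers.
    A family of n members offers n * (n - 1) / 2 unique pairs, so a
    20-member family offers 190 and can supply 150 without repetition.
    """
    rng = random.Random(seed)
    sampled = []
    for members in families.values():
        all_pairs = list(itertools.combinations(members, 2))
        sampled.extend(rng.sample(all_pairs, pairs_per_family))
    return sampled
```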

The evaluated database search algorithms, SARST2, Foldseek, and BLAST, provided multiple parameters that could affect pairwise alignment results. The same optimal parameter settings used in IR experiments (Supplementary Table 1) were also applied in the pairwise alignment experiments.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Unless otherwise stated, all data supporting the results of this study can be found in the article, supplementary, and source data files. Protein structure data used in this study are available at the SCOPe [https://scop.berkeley.edu/], CATH [https://www.cathdb.info/], and the AlphaFold Protein Structure Database [https://alphafold.ebi.ac.uk/] web servers, with accession codes listed in Supplementary Data 1–3. These third-party datasets are publicly accessible without restriction. The numerical source data underlying all graphs in the main Figures (Figs. 2a, 3b, c, 4a–e, and 5a, b) and Supplementary Fig. 3 are compiled in a single Excel file labeled Source Data, with each figure panel presented on a separate worksheet. Source data are provided with this paper.

Code availability

The compiled executables of SARST2, v2.0.30, are provided as Supplementary Software 1–3 for Linux, macOS, and Windows 10+ systems. The source code and executables of the latest version are available at the 10lab website [https://10lab.ceb.nycu.edu.tw/sarst2] and GitHub [https://github.com/NYCU-10lab/sarst]. The GitHub repository has been archived on Zenodo (https://doi.org/10.5281/zenodo.17144647) to ensure long-term accessibility.

References

Rose, P. W. et al. The RCSB Protein Data Bank: views of structural biology for basic and applied research and education. Nucleic Acids Res. 43, D345–D356 (2015).
Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
Juan, S. H., Chen, T. R. & Lo, W. C. A simple strategy to enhance the speed of protein secondary structure prediction without sacrificing accuracy. PLoS ONE 15, e0235153 (2020).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596 (2021).
Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).
Arnold, K., Bordoli, L., Kopp, J. & Schwede, T. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22, 195–201 (2006).
Chen, C. C., Hwang, J. K. & Yang, J. M. (PS)2: protein structure prediction server. Nucleic Acids Res. 34, W152–W157 (2006).
Wang, Z. et al. Protein function annotation with structurally aligned local sites of activity (SALSAs). BMC Bioinform. 14, S13 (2013).
Ruperti, F. et al. Cross-phyla protein annotation by structural prediction and alignment. Genome Biol. 24, 113 (2023).
Vakser, I. A. Protein-protein docking: from interaction to interactome. Biophys. J. 107, 1785–1793 (2014).
Tuncbag, N., Keskin, O., Nussinov, R. & Gursoy, A. Prediction of protein interactions by structural matching: prediction of PPI networks and the effects of mutations on PPIs that combines sequence and structural information. Methods Mol. Biol. 1558, 255–270 (2017).
Lee, H. S. & Im, W. Identification of ligand templates using local structure alignment for structure-based drug design. J. Chem. Inf. Model. 52, 2784–2795 (2012).
Brylinski, M. Local alignment of ligand binding sites in proteins for polypharmacology and drug repositioning. Methods Mol. Biol. 1611, 109–122 (2017).
Holm, L. & Sander, C. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123–138 (1993).
Shindyalov, I. N. & Bourne, P. E. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11, 739–747 (1998).
Zhu, J. & Weng, Z. FAST: a novel protein structure alignment algorithm. Proteins 58, 618–627 (2005).
Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
Martin, A. C. The ups and downs of protein topology; rapid comparison of protein structure. Protein Eng. 13, 829–837 (2000).
Carpentier, M., Brouillet, S. & Pothier, J. YAKUSA: a fast structural database scanning method. Proteins 61, 137–151 (2005).
Yang, J. M. & Tung, C. H. Protein structure database search and evolutionary classification. Nucleic Acids Res. 34, 3646–3659 (2006).
Lo, W. C., Huang, P. J., Chang, C. H. & Lyu, P. C. Protein structural similarity search by Ramachandran codes. BMC Bioinform. 8, 307 (2007).
Lo, W. C., Lee, C. Y., Lee, C. C. & Lyu, P. C. iSARST: an integrated SARST web server for rapid protein structural similarity searches. Nucleic Acids Res. 37, W545–W551 (2009).
Minami, S., Sawada, K., Ota, M. & Chikenji, G. MICAN-SQ: a sequential protein structure alignment program that is applicable to monomers and all types of oligomers. Bioinformatics 34, 3324–3331 (2018).
Deng, L., Zhong, G., Liu, C., Luo, J. & Liu, H. MADOKA: an ultra-fast approach for large-scale protein structure similarity searching. BMC Bioinform. 20, 662 (2019).
van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).
Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Shih, C. H., Chang, C. M., Lin, Y. S., Lo, W. C. & Hwang, J. K. Evolutionary information hidden in a single protein structure. Proteins 80, 1647–1657 (2012).
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540 (1995).
Fox, N. K., Brenner, S. E. & Chandonia, J. M. SCOPe: Structural Classification of Proteins-extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 42, D304–D309 (2014).
Zhang, E. & Zhang, Y. in Encyclopedia of Database Systems (eds Liu, L. & Özsu, M. T.) 981–982 (Springer US, 2009).
Aung, Z. & Tan, K. L. Rapid 3D protein structure database searching using information retrieval techniques. Bioinformatics 20, 1045–1052 (2004).
Manning, C. D., Raghavan, P. & Schütze, H. Introduction to Information Retrieval (Cambridge University Press, 2008).
Lo, W. C. et al. Deciphering the preference and predicting the viability of circular permutations in proteins. PLoS ONE 7, e31791 (2012).
Lo, W. C. et al. CPred: a web server for predicting viable circular permutations in proteins. Nucleic Acids Res. 40, W232–W237 (2012).
Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).
Henikoff, S. & Henikoff, J. G. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA 89, 10915–10919 (1992).
Madhusudhan, M. S., Marti-Renom, M. A., Sanchez, R. & Sali, A. Variable gap penalty for protein sequence-structure alignment. Protein Eng. Des. Sel. 19, 129–133 (2006).
Barton, G. J. in Protein Structure Prediction: A Practical Approach (ed. Sternberg, M. J. E.) 31–63 (IRL Press at Oxford University Press, 1996).
Nishikawa, K. & Ooi, T. Radial locations of amino acid residues in a globular protein: correlation with the sequence. J. Biochem. 100, 1043–1047 (1986).
Amitai, G. et al. Network analysis of protein structures identifies functional residues. J. Mol. Biol. 344, 1135–1146 (2004).
Rost, B. Review: protein secondary structure prediction continues to rise. J. Struct. Biol. 134, 204–218 (2001).
Yang, Y. et al. Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Brief. Bioinform. 19, 482–494 (2018).
Tomii, K., Sawada, Y. & Honda, S. Convergent evolution in structural elements of proteins investigated using cross profile analysis. BMC Bioinform. 13, 11 (2012).
Wright, E. S. Tandem repeats provide evidence for convergent evolution to similar protein structures. Genome Biol. Evol. https://doi.org/10.1093/gbe/evaf013 (2025).
Liu, J. W. et al. On the relationship between residue structural environment and sequence conservation in proteins. Proteins 85, 1713–1723 (2017).
Mohammadi, A., Zahiri, J., Mohammadi, S., Khodarahmi, M. & Arab, S. S. PSSMCOOL: a comprehensive R package for generating evolutionary-based descriptors of protein sequences from PSSM profiles. Biol. Methods Protoc. 7, bpac008 (2022).
Kinjo, A. R. & Nakamura, H. Nature of protein family signatures: insights from singular value analysis of position-specific scoring matrices. PLoS ONE 3, e1963 (2008).
Pandit, S. B. & Skolnick, J. Fr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-score. BMC Bioinform. 9, 531 (2008).
Lo Conte, L. et al. SCOP: a structural classification of proteins database. Nucleic Acids Res. 28, 257–259 (2000).
Orengo, C. A. et al. CATH—a hierarchic classification of protein domain structures. Structure 5, 1093–1108 (1997).
Chen, T. R., Juan, S. H., Huang, Y. W., Lin, Y. C. & Lo, W. C. A secondary structure-based position-specific scoring matrix applied to the improvement in protein secondary structure prediction. PLoS ONE 16, e0255076 (2021).
Pommie, C., Levadoux, S., Sabatier, R., Lefranc, G. & Lefranc, M. P. IMGT standardized criteria for statistical analysis of immunoglobulin V-REGION amino acid properties. J. Mol. Recognit. 17, 17–32 (2004).
Timberlake, K. C. Chemistry. 5th edn (HarperCollins Publishers Inc., 1992).
Mathews, C. K., Van Holde, K. E. & Ahern, K. G. Biochemistry (Benjamin Cummings, 2000).
Efimov, A. V. Standard structures in proteins. Prog. Biophys. Mol. Biol. 60, 201–239 (1993).
Lipman, D. J. & Pearson, W. R. Rapid and sensitive protein similarity searches. Science 227, 1435–1441 (1985).
Chen, T. R., Lo, C. H., Juan, S. H. & Lo, W. C. The influence of dataset homology and a rigorous evaluation strategy on protein secondary structure prediction. PLoS ONE 16, e0254555 (2021).
Breiman, L. Classification and Regression Trees (Wadsworth International Group, 1984).


Acknowledgements

We thank the National Science and Technology Council (NSTC), Taiwan, for supporting this study through grants MOST 111-2221-E-A49-163-, NSTC 111-2218-E-A49-031-, NSTC 112-2221-E-A49-119-, and NSTC 114-2221-E-A49-147- to W.-C.L., and NSTC 112-2813-C-A49-032-E to S.-W.W. This work was financially supported by the “Center for Intelligent Drug Systems and Smart Bio-devices (IDS²B)” from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan. We acknowledge Sheng-Hung Juan and Yu-Wei Huang for implementing the PSSM algorithm and establishing the evaluation procedures, and Yu-Jhen Huang for assisting in setting up the server room and collecting experimental data. We also thank Ling-Yen Kung, Chun-Kai Wu, Yu-Ru Shen, Jun Jie Lau, and Hao-Yu Peng for testing the SARST2 software package and preparing the multilingual user manuals.

Author information

Authors and Affiliations

Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
Wei-Cheng Lo, Yan-Jie Li, Shih-Chung Yen & Shih-Wen Weng
Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
Wei-Cheng Lo, Chia-Hua Lo, Chia Yee Choke, Jyun-Yi Yang & Shih-Wen Weng
Center for Intelligent Drug Systems and Smart Bio-devices (IDS²B), National Yang Ming Chiao Tung University, Hsinchu, Taiwan
Wei-Cheng Lo
The Center for Bioinformatics Research, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
Wei-Cheng Lo
Department of Chemistry, University of Southern California, Los Angeles, CA, USA
Arieh Warshel & Shih-Chung Yen
Institute of Bioinformatics and Structural Biology, National Tsing Hua University, Hsinchu, Taiwan
Chia-Hua Lo

Authors

Wei-Cheng Lo
Arieh Warshel
Chia-Hua Lo
Chia Yee Choke
Yan-Jie Li
Shih-Chung Yen
Jyun-Yi Yang
Shih-Wen Weng

Contributions

W.-C.L. conceived the project, developed the core library, designed the algorithm, supervised the study, and wrote the manuscript. A.W. participated in algorithm design and helped supervise the study. C.-H.L. and C.-Y.C. contributed to the development of the core library. C.-H.L., C.-Y.C., Y.-J.L., J.-Y.Y., and S.-W.W. contributed to algorithm development, dataset preparation, and performance benchmarking. S.-C.Y. contributed to machine learning refinement and data analysis. J.-Y.Y., Y.-J.L., and S.-W.W. prepared datasets and created the software documentation. All authors helped draft the manuscript.

Corresponding author

Correspondence to Wei-Cheng Lo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Jeffrey Skolnick and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Supplementary Software 1

Supplementary Software 2

Supplementary Software 3

Reporting Summary

Transparent Peer Review file

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Lo, WC., Warshel, A., Lo, CH. et al. SARST2 high-throughput and resource-efficient protein structure alignment against massive databases. Nat Commun 16, 8691 (2025). https://doi.org/10.1038/s41467-025-63757-9


Received: 02 March 2024
Accepted: 26 August 2025
Published: 30 September 2025
DOI: https://doi.org/10.1038/s41467-025-63757-9