Fig. 1: Flowchart of the SARST2 algorithm.
From: SARST2 high-throughput and resource-efficient protein structure alignment against massive databases

First, sequence (seq) and structural information of the query and subject (Sbj) protein structures are extracted and encoded into text strings, according to which the subject structures are grouped, each group having a representative head. Second, four filters progressively remove subject groups, represented by their heads, that are unlikely to be homologs of the query: (1) a word-matching filter that rapidly counts similar AA (amino acid) and SSE (secondary structure element) fragments between the query and subject group heads; (2) a DP (dynamic programming) filter that produces an initial alignment and score for the query and subject group heads from their structural strings; (3) a DP filter that adjusts the alignments according to the SSE and AAT (amino acid type) at each residue position; (4) a quick TM-score filter that roughly computes structural similarities. After these filters, the subject pool expands by restoring the group members as individual subject proteins. Third, two refinement procedures produce the final hit list, in which subject proteins are ordered according to their structural similarities to the query. The hit list can be output to the screen or saved as an interactive HTML (hypertext markup language) document with structural superimpositions. Protein structure images in this figure were generated using PyMOL v2.0.6. ASize alignment size, CIF crystallographic information file, PDB Protein Data Bank, RMSD root-mean-square distance, TM template modeling, WCN weighted contact number.
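The filter-then-expand control flow described in this caption can be illustrated with a minimal sketch. This is not the SARST2 implementation: the function names (`shared_words`, `lcs_length`, `cascade`), the thresholds, and the toy strings are all assumptions made for illustration. A shared k-mer count stands in for the word-matching filter, and a longest-common-subsequence score stands in for a DP filter. Only group heads are scored, and members are restored once the filters have run.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical filter type: a scoring function plus a minimum score to pass.
Filter = Tuple[Callable[[str, str], int], int]


def shared_words(a: str, b: str, k: int = 3) -> int:
    """Count distinct length-k substrings ("words") that occur in both strings."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(words_a & words_b)


def lcs_length(a: str, b: str) -> int:
    """Longest common subsequence length, used here as a toy DP score."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def cascade(
    query: str,
    heads: Dict[str, str],
    groups: Dict[str, List[str]],
    filters: List[Filter],
) -> List[str]:
    """Apply each filter to the group heads in turn, then restore group members."""
    surviving = list(heads)
    for score, threshold in filters:
        surviving = [h for h in surviving if score(query, heads[h]) >= threshold]
    # Expansion step: each surviving head is replaced by all members of its group.
    return [member for h in surviving for member in groups[h]]


# Toy encoded strings (assumed alphabet: H helix, E strand, L loop).
query = "HHHEEELLHHH"
heads = {"g1": "HHHEEELLHHH", "g2": "LLLLLLLLLLL", "g3": "HHHEELLLHHH"}
groups = {"g1": ["g1", "g1a"], "g2": ["g2"], "g3": ["g3", "g3a", "g3b"]}

hits = cascade(query, heads, groups, [(shared_words, 4), (lcs_length, 9)])
print(hits)
```

Ordering the filters from cheapest to most expensive means the costlier score is computed only for heads that passed the cheaper one, which is the general motivation for a cascade of this kind.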