Fig. 1: Schematic method overview.
From: Protein remote homology detection and structural alignment using deep learning

a, An integrated TM-Vec + DeepBLAST pipeline could consist of two stages: retrieval and alignment. First, TM-Vec takes a query protein sequence and rapidly retrieves proteins that are predicted to have similar structures (TM-scores) to the query. Then, DeepBLAST produces alignments for the proteins with the highest predicted structural similarity. Note that benchmarking was carried out for TM-Vec and DeepBLAST separately.

b, TM-Vec is trained on pairs of amino acid sequences and their TM-scores. We first input a pair of sequences (domains, chains, proteins) and use a pretrained deep protein language model to extract embeddings for every residue of the sequence. Next, we apply a twin neural network, called ϕ, to the embeddings of each sequence and produce a vector representation, z, for each sequence. The ϕ network is trained on millions of pairs of sequences, and its architecture is detailed in Supplementary Fig. 1. Finally, we compute the cosine similarity of the vector representations, which is our prediction for the TM-score of the pair.

c, We build a TM-Vec database by encoding large databases of protein sequences using a trained TM-Vec model. As an example, we input the sequences from Swiss-Prot, extract vector representations for every sequence and finally build an indexed database of TM-Vec’s structure-aware vector representations of proteins.

d, Demonstration of protein structure search using the TM-Vec pipeline. As the indexed database of vector representations has already been built, protein search consists of first encoding the query sequence using the trained TM-Vec model and then performing fast vector search and TM-score prediction using cosine similarity as the search metric. As search results, we return the k nearest neighbors with the highest predicted structural similarity (TM-score) to the query sequence.

e, As a last step, we apply DeepBLAST to produce structural alignments for the k nearest neighbors to a query sequence.
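
The retrieval step in panels c and d can be illustrated with a minimal sketch in plain Python. It is not the TM-Vec implementation: the three-dimensional vectors stand in for the learned representations z, the protein names are made up, and `normalize`, `build_index` and `search` are hypothetical helpers introduced here. It shows only the idea stated in the caption, that the cosine similarity between two vectors serves as the predicted TM-score and that search returns the k most similar database entries.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def build_index(vectors):
    """Panel c (toy version): store one unit-length vector per database protein."""
    return {name: normalize(vec) for name, vec in vectors.items()}

def search(index, query_vec, k):
    """Panel d (toy version): rank database entries by cosine similarity to the query.

    Returns the k best (name, similarity) pairs, highest similarity first.
    This brute-force scan is an assumption made for clarity; a real system
    would use an approximate or indexed vector search.
    """
    q = normalize(query_vec)
    scored = [
        (name, sum(a * b for a, b in zip(q, vec)))
        for name, vec in index.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Made-up vectors standing in for the encoder output z of each protein.
database = {
    "protA": [1.0, 0.0, 0.0],
    "protB": [0.9, 0.1, 0.0],
    "protC": [0.0, 1.0, 0.0],
    "protD": [0.0, 0.0, 1.0],
}

index = build_index(database)
hits = search(index, [0.8, 0.2, 0.0], k=2)
for name, similarity in hits:
    print(f"{name}\tpredicted similarity {similarity:.3f}")
```

In this toy setting the hits returned by `search` would then be handed to an aligner, which corresponds to the DeepBLAST step in panel e. Normalizing the vectors once at index time means each query needs only dot products.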