Fig. 3: PLMSearch accurately detects remote homology pairs.
From: PLMSearch: Protein language model powers accurate and fast sequence search for remote homology

a Case study. The sequence identity between the blue and green structures is low (0.216 < 0.3), but they share similar structures. Foldseek, Foldseek-TM, TM-align, and PLMSearch recall this remote homology pair, which is missed by MMseqs2 and Blastp.

b Definition diagram. Legend a marks three pairs with a TM-score > 0.5, which are usually assumed to have the same fold [52, 53]. Legend b marks six pairs with a TM-score between 0.2 and 0.5. Legend c marks six pairs with a TM-score < 0.2, which are usually regarded as randomly selected irrelevant pairs [52, 53]. Legend d marks six filtered pairs. Legend e marks the pair at (3,3), which has a TM-score > 0.5 but is not among the filtered pairs; it is a “Missed” pair. Correspondingly, the protein pairs at (1,4) and (2,2) are “Recalled” pairs. Legend f marks the pair at (3,5), which has a TM-score < 0.2 but is among the filtered pairs; it is a “Wrong” pair.

c–h From the search results of five queries, randomly selected to avoid oversampling (with Swiss-Prot as the target dataset, a total of 2,150,700 query-target pairs), we selected the 5000 pairs with the highest similarity for each search method and counted the recalled and missed pairs: c MMseqs2. d Blastp. e Foldseek. f Foldseek-TM. g SS-predictor. h PLMSearch. In each subplot, the recalled pairs (left) and missed pairs (right) are shown on 2D scatter plots of TM-score (x-axis) against sequence identity (y-axis). The thresholds, sequence identity = 0.3 [51] and TM-score = 0.5 [52, 53], are shown by dashed lines. All methods successfully recall the easy pairs in the first quadrant. For the remote homology pairs in the fourth quadrant, however, SS-predictor and PLMSearch performed best, followed by Foldseek and Foldseek-TM, while MMseqs2 and Blastp performed worst. Supplementary Table 7 records the specific values of each metric. Source data are provided as a Source Data file.
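The definitions in panel b and the quadrants in panels c–h can be read as a small decision rule. The following Python sketch illustrates that reading; it is not the authors' code. The function names, the "other" label for pairs the caption leaves unnamed, and the toy values are all hypothetical. The thresholds are the ones stated in the caption.

```python
from collections import Counter

# Thresholds as stated in the caption.
TM_SAME_FOLD = 0.5        # TM-score above this: usually assumed same fold
TM_IRRELEVANT = 0.2       # TM-score below this: usually assumed irrelevant
IDENTITY_THRESHOLD = 0.3  # sequence-identity dashed line


def classify_pair(tm_score, selected):
    """Label a pair following panel b (label names are an assumption)."""
    if tm_score > TM_SAME_FOLD:
        return "recalled" if selected else "missed"
    if tm_score < TM_IRRELEVANT and selected:
        return "wrong"
    return "other"  # cases the caption does not name


def is_remote_homolog(tm_score, seq_identity):
    """Fourth quadrant: similar structure, low sequence identity."""
    return tm_score > TM_SAME_FOLD and seq_identity < IDENTITY_THRESHOLD


def top_k_selected(similarities, k):
    """Return the ids of the k pairs with the highest similarity score."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return set(ranked[:k])


# Toy data (hypothetical values, for illustration only).
toy = {
    "A": {"tm": 0.80, "ident": 0.60, "sim": 0.95},
    "B": {"tm": 0.70, "ident": 0.22, "sim": 0.90},
    "C": {"tm": 0.65, "ident": 0.20, "sim": 0.10},
    "D": {"tm": 0.10, "ident": 0.15, "sim": 0.85},
    "E": {"tm": 0.35, "ident": 0.25, "sim": 0.20},
}

selected = top_k_selected({pid: p["sim"] for pid, p in toy.items()}, k=3)
labels = {pid: classify_pair(p["tm"], pid in selected) for pid, p in toy.items()}
counts = Counter(labels.values())
remote = {pid for pid, p in toy.items() if is_remote_homolog(p["tm"], p["ident"])}
```

In the figure the cutoff is the 5000 highest-similarity pairs per method; `k=3` here only keeps the toy example small.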