Extended Data Fig. 1: Evaluation of Hit Rates for Enzyme Commission (EC) Numbers and CRISPR-associated Proteins Using ProTrek’s Text-to-Sequence Retrieval Function.
From: A trimodal protein language model enables advanced protein searches

Textual descriptions were used to query the UniProt50 database and retrieve the top 100 proteins not included in the training data. a, Bar plot of hit rates for six EC numbers selected from six top-level EC classes. Each bar is divided into exact matches and progressively less accurate matches (Mismatch-1: mismatch in the last digit only; Mismatch-2: mismatch in the last two digits; Mismatch-3: only the first digit matches; Mismatch: no digits match). Match accuracy varies across the EC numbers tested. b, Bar plot of hit rates for six extensively studied CRISPR-associated proteins (Cas10, Cas3, Cas9, Cas12, Cas13, and Csm/Cmr). Each bar is segmented into matches and mismatches. A retrieved protein is counted as a true hit if the corresponding keyword, such as ‘Cas10’, ‘Cas3’, ‘Cas9’, ‘Cas12’, or ‘EC:3.4.15.1’, appears in its ground truth.
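The scoring scheme described in this legend can be sketched in a few lines of Python. This is an illustrative reading of the caption, not the authors' evaluation code: the function names `ec_match_level` and `keyword_hit_rate` are hypothetical, the EC comparison assumes that mismatch levels are determined by how many leading fields of the four-part EC number agree, and the keyword test assumes a simple case-insensitive substring match against the ground-truth text.

```python
# Match categories in the order used by panel a, from most to least accurate.
LEVELS = ["Exact", "Mismatch-1", "Mismatch-2", "Mismatch-3", "Mismatch"]


def _fields(ec: str) -> list:
    """Split an EC number such as 'EC:3.4.15.1' into its dot-separated fields."""
    ec = ec.strip()
    if ec.upper().startswith("EC:"):
        ec = ec[3:]
    return ec.split(".")


def ec_match_level(query: str, retrieved: str) -> str:
    """Classify a retrieved EC number against a fully specified query EC number.

    Assumption: the level depends on the number of leading fields that agree
    (4 -> Exact, 3 -> Mismatch-1, 2 -> Mismatch-2, 1 -> Mismatch-3, 0 -> Mismatch).
    """
    q, r = _fields(query), _fields(retrieved)
    if len(q) != 4:
        raise ValueError("query must be a four-field EC number")
    shared = 0
    for a, b in zip(q, r):
        if a != b:
            break
        shared += 1
    return LEVELS[4 - shared]


def keyword_hit_rate(keyword: str, ground_truths: list) -> float:
    """Fraction of retrieved proteins whose ground-truth text contains the keyword.

    Assumption: naive case-insensitive substring matching; a real evaluation
    may need stricter matching to keep, for example, 'Cas1' from matching 'Cas10'.
    """
    if not ground_truths:
        return 0.0
    hits = sum(keyword.lower() in text.lower() for text in ground_truths)
    return hits / len(ground_truths)
```

For example, under these assumptions `ec_match_level("EC:3.4.15.1", "3.4.15.2")` falls in the Mismatch-1 category, and `keyword_hit_rate` would be applied to the ground truths of the top 100 retrieved proteins for each query.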