Fig. 1: Study Design and Motivation for Protein Homology Search Using Conformal Prediction. | Nature Communications

From: Functional protein mining with conformal guarantees

(A) A query sequence q is compared against a lookup database D using a protein search model (e.g., Protein-Vec). The model generates similarity scores \(S_{ij}\), which are compared against a threshold \(\hat{\lambda}\) determined through calibration. Hits whose scores exceed the threshold are included in the retrieval set \(C_{\hat{\lambda}}\). Hits whose scores fall below the threshold (e.g., F98079 with 0.943) are highlighted in red to indicate their exclusion. (B) The calibration process involves computing scores on calibration data, obtaining quantiles, and constructing prediction sets. This approach provides statistical guarantees on the validity of the returned sets, which enhances the interpretability and reliability of protein search results. (C) The distribution of Protein-Vec similarity scores for UniProt motivates the need for effective thresholds and confidence measures in protein homology searches, particularly because many scores cluster near 1. (D) Illustration of the error loss calculation for two enzymes: EC 2.1.1.12 (methionine S-methyltransferase) and EC 2.1.1.13 (methionine synthase). The loss function ℓ(q, C) assigns a value based on the maximum hierarchical loss of the enzymes in a retrieval set C ⊆ D, with 0 meaning every retrieved protein is an exact match. The hierarchical classification tree for part of the transferases (EC 2) is shown, with methionine synthase as the ground-truth EC number and methionine S-methyltransferase in the model's retrieval set. This results in a hierarchical loss of ℓ(q, C) = 1, due to a mismatch at the fourth level of the EC hierarchy. Source data are provided as a Source Data file.
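
The loss calculation described for panel D can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `hierarchical_loss` and `set_loss` are hypothetical. The rule that the loss equals four minus the number of matching leading EC levels is an assumption: it reproduces the two cases the caption states (0 for an exact match, 1 for a fourth-level mismatch), but the values for mismatches at higher levels are not confirmed by the caption.

```python
from typing import Iterable

EC_LEVELS = 4  # a fully specified EC number has four levels, e.g. "2.1.1.13"


def hierarchical_loss(true_ec: str, retrieved_ec: str) -> int:
    """Count the EC levels from the first mismatch down to the leaf.

    Assumption (not stated in the caption): loss = 4 - matching leading levels.
    This yields 0 for an exact match and 1 for a fourth-level mismatch.
    """
    true_parts = true_ec.split(".")
    retrieved_parts = retrieved_ec.split(".")
    if len(true_parts) != EC_LEVELS or len(retrieved_parts) != EC_LEVELS:
        raise ValueError("expected fully specified four-level EC numbers")

    matched = 0
    for t, r in zip(true_parts, retrieved_parts):
        if t != r:
            break  # everything below the first mismatch counts as wrong
        matched += 1
    return EC_LEVELS - matched


def set_loss(true_ec: str, retrieved_ecs: Iterable[str]) -> int:
    """Maximum hierarchical loss over a retrieval set (0 if the set is empty)."""
    return max((hierarchical_loss(true_ec, ec) for ec in retrieved_ecs), default=0)


# Ground truth: methionine synthase (EC 2.1.1.13).
# Retrieval set also contains methionine S-methyltransferase (EC 2.1.1.12).
print(set_loss("2.1.1.13", ["2.1.1.13", "2.1.1.12"]))  # prints 1
```

Taking the maximum over the set means a single distant hit determines the loss, which matches the caption's statement that a loss of 0 requires every retrieved protein to be an exact match.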
