Extended Data Fig. 4: Identity between positive and negative samples and prediction accuracy on the Central Dogma task. | Nature Machine Intelligence

From: Generalized biological foundation model with unified nucleic acid and protein language

a, b. Relationship between sequence identity metrics and LucaOne prediction accuracy: NCBI blastn sequence identity for nucleic acid and protein sequences before and after mutation. c, d. Euclidean distances between mean-pooled embeddings, and the corresponding LucaOne prediction accuracy, for nucleic acid and protein sequences before and after mutation. Upper panels: sample distributions across ranges of sequence similarity, change ratio, or embedding Euclidean distance. Lower panels: prediction counts and accuracy of the LucaOne embedding within each range. Note: data for a and b include all nucleic acid and protein negative samples from the validation and testing sets. Data for c and d include only positive-negative sample pairs in which both members are present in the combined validation and testing datasets. The statistical intervals for each metric are divided into quartiles according to the data distribution.
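The quantities named in the caption (mean-pooled embeddings, Euclidean distance, quartile intervals, per-interval accuracy) can be illustrated with a minimal sketch. It is not the authors' analysis code: the function names `mean_pool`, `euclidean_distance`, and `accuracy_by_quartile` are hypothetical, the inputs are assumed to be NumPy-compatible arrays, and the handling of values that fall exactly on a quartile boundary is an assumption, since the caption does not specify it.

```python
import numpy as np

def mean_pool(token_embeddings):
    """Average per-token embeddings (shape L x D) into one D-dimensional vector."""
    return np.asarray(token_embeddings, dtype=float).mean(axis=0)

def euclidean_distance(u, v):
    """Euclidean (L2) distance between two pooled embedding vectors."""
    return float(np.linalg.norm(np.asarray(u, dtype=float) - np.asarray(v, dtype=float)))

def accuracy_by_quartile(metric, correct):
    """Split samples into four intervals at the metric's quartiles and
    return (interval index, sample count, accuracy) for each interval.

    Assumption: a value equal to a quartile edge is placed in the lower interval.
    """
    metric = np.asarray(metric, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    edges = np.quantile(metric, [0.25, 0.5, 0.75])
    bins = np.digitize(metric, edges, right=True)  # interval index 0..3
    rows = []
    for b in range(4):
        mask = bins == b
        n = int(mask.sum())
        acc = float(correct[mask].mean()) if n else float("nan")
        rows.append((b, n, acc))
    return rows
```

In this sketch, `metric` could hold either a sequence identity value or an embedding distance for each sample, and `correct` marks whether the model's prediction for that sample was right; the returned counts and accuracies correspond to the kind of per-range summary described for the lower panels.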
