Table 1 Summary of encodings of protein sequences, models, and acquisition functions tested in this work

From: Active learning-assisted directed evolution

| Encoding | Dimension per Residue | Description |
| --- | --- | --- |
| AAIndex | 4 | Continuous fixed amino acid descriptors |
| Georgiev (ref. 71) | 19 | Continuous fixed amino acid descriptors |
| Onehot | 20 | Categorical (which amino acid) |
| ESM2 (ref. 33) | 1280 | Learned embedding from a protein language model (ESM2 with 650 million parameters) |
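
As a rough illustration of the Onehot row (20 dimensions per residue, one per amino acid), the sketch below encodes a short variant as a flat vector. The alphabetical residue ordering, the function name `onehot_encode`, and the example sequence `"VDGV"` are assumptions made for illustration and are not taken from the source.

```python
import numpy as np

# Assumed ordering of the 20 canonical amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def onehot_encode(seq: str) -> np.ndarray:
    """Return a flat one-hot vector with 20 dimensions per residue."""
    encoding = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(seq):
        # Set the single entry that identifies which amino acid is at this position.
        encoding[pos, AA_INDEX[aa]] = 1.0
    return encoding.reshape(-1)


# Hypothetical 4-residue variant: 4 residues x 20 dimensions = 80 features.
x = onehot_encode("VDGV")
print(x.shape)
```

The other encodings in the table would replace each 20-dimensional indicator with a fixed descriptor (4 or 19 values) or a learned embedding (1280 values) per residue.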

| Model | Bayesian? | Deep Learning? | Description |
| --- | --- | --- | --- |
| Boosting Ensemble | N | N | An ensemble of 5 boosting models |
| Gaussian Process (GP) | Y | N | A collection of continuous functions described by a posterior |
| DNN Ensemble | N | Y | An ensemble of 5 multilayer perceptrons (deep neural networks, DNNs) |
| Deep Kernel Learning (DKL) (ref. 29) | Y | Y | A GP on the last layer of a deep neural network |
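
For the two ensemble models, one common way to obtain a mean and an uncertainty estimate is to summarize the predictions of the individual members. The sketch below assumes the member predictions are already available as an array; the function name `ensemble_summary` and the use of the sample standard deviation are illustrative assumptions, not details confirmed by the source.

```python
import numpy as np


def ensemble_summary(member_preds: np.ndarray):
    """Summarize ensemble predictions.

    member_preds has shape (n_members, n_candidates), e.g. n_members = 5.
    Returns the per-candidate mean and sample standard deviation.
    """
    mean = member_preds.mean(axis=0)
    std = member_preds.std(axis=0, ddof=1)  # spread across ensemble members
    return mean, std


# Hypothetical predictions from 5 members for 3 candidate sequences.
rng = np.random.default_rng(0)
preds = rng.normal(size=(5, 3))
mean, std = ensemble_summary(preds)
print(mean.shape, std.shape)
```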

| Acquisition Function | Deterministic? | Description |
| --- | --- | --- |
| Greedy | Y | Acquires the maximum value of the mean from the posterior |
| Upper Confidence Bound (UCB) | Y | Acquires the maximum value of a certain confidence interval from the posterior (tuned by a hyperparameter) |
| Thompson Sampling (TS) | N | Acquires the maximum value of a random function sampled from the posterior |
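
The three acquisition functions can be sketched for a finite set of candidates as follows. This is a minimal single-pick illustration under stated assumptions: the posterior is summarized by a per-candidate `mean` and `std`, the UCB hyperparameter is called `beta` here, and Thompson sampling is approximated by choosing one of several pre-drawn posterior samples at random. None of these names or choices come from the source.

```python
import numpy as np


def greedy(mean: np.ndarray) -> int:
    """Pick the candidate with the highest posterior mean."""
    return int(np.argmax(mean))


def ucb(mean: np.ndarray, std: np.ndarray, beta: float = 2.0) -> int:
    """Pick the candidate with the highest upper confidence bound.

    beta (assumed name) scales how much uncertainty is rewarded.
    """
    return int(np.argmax(mean + beta * std))


def thompson(samples: np.ndarray, rng: np.random.Generator) -> int:
    """Pick the maximizer of one randomly chosen posterior draw.

    samples has shape (n_draws, n_candidates).
    """
    draw = samples[rng.integers(samples.shape[0])]
    return int(np.argmax(draw))


# Hypothetical posterior summary for 3 candidates.
mean = np.array([0.5, 0.8, 0.6])
std = np.array([0.05, 0.05, 0.3])

print(greedy(mean))          # index of the highest mean
print(ucb(mean, std, 2.0))   # index of the highest mean + 2 * std

rng = np.random.default_rng(0)
samples = rng.normal(loc=mean, scale=std, size=(10, 3))
print(thompson(samples, rng))
```

In this toy example, Greedy selects the candidate with the best mean, whereas UCB prefers the third candidate because its larger uncertainty raises its upper bound. Thompson sampling is the only non-deterministic choice of the three, matching the table.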