Introduction

Proteins have been leveraged in human society for a variety of applications such as therapeutics, food processing, textiles, commodity materials, waste remediation, and have the potential for future impacts such as plastic upcycling1,2,3,4,5,6,7,8. The efficacy of these applications is often hampered by a key challenge: enhancing the high temperature stability of active proteins towards more favorable system conditions. Researchers have been searching for methods to rapidly increase the thermal stability of proteins for decades, with many contradictory findings, revealing that there are no generally applicable “rules” that can predictably improve thermal stability. Instead, the thermal stability of each protein is convoluted and dependent on the protein itself, i.e. each protein is a unique engineering challenge9,10,11,12,13,14,15. This is made additionally challenging by the frequent coupling of high temperature stability to other extreme conditions, and to the evolutionary history of each exhibiting organism16.

Recently, deep learning has been an accelerating force for protein engineers, and large attention based models are at the forefront17,18,19. Successes include supervised predictors of a property of interest, including thermal stability20,21, zero-shot predictors of the same22,23, structure prediction models24,25, and sequence design models that can be used to sample mutations or variants less likely to inactivate the protein than random mutants26,27. Existing supervised strategies help rank proteins among a pool of variants after training on a specific thermal stability target, but require time and resource intensive labeled data from the specific protein of interest to be accurate28,29,30. Zero-shot predictors remove the need for labeled data by either learning from evolutionary scale sequence datasets or by conditioning on homologs of the protein of interest and impressively achieve some predictive performance on observable properties. Unfortunately, the latter requires many known homologs to retain fidelity, and both are not strictly targeting thermal stability, unable to reliably rank the global optimum among a dataset28. Structure to sequence designers are able to create amino acid sequences that likely fold to a desired 3D structure at ambient temperatures, but do not specifically target high temperature stability and none have yet been show to reliably target function at a particular temperature. These deficiencies motivate the current work—can evolutionary sequence data be organized in such a way that we can accelerate the design of high temperature proteins in a context dependent manner without supervised training?

Presented is a method for designing and ranking high temperature proteins by learning a translation between ambient and high temperature protein space using neural machine translation: Neural Optimization for Melting-temperature Enabled by Leveraging Translation (NOMELT). An overview of our work is given in Fig. 1. By training a neural machine translator on a dataset greater than 4 million pairs of protein homologs, where one protein occurs in an ambient temperature prokaryote and the other in a high temperature prokaryote, we show that a protein language model can target thermal stability31,32. The model was used to autoregressively (allowing for insertions and deletions) inject mutations to create a variant of the Drosophila melanogaster Engrailed Homeodomain (EnHD) that is stable at a higher temperature according to both molecular dynamics simulations and thermal melt experiments. The model is also shown to have zero-shot predictive capabilities for ranking variants by experimental melting temperatures and catalytic half activation temperatures for a number of experimental reference datasets. Unlike existing zero-shot predictors, NOMELT does not require MSA homologs as input and was not trained on tens of millions of data points. These results suggest that leveraging organism growth temperature improves the richness of data for designing high temperature proteins.

Fig. 1
figure 1

Overview of presented work. (A) We fine-tuned and autoregressive encoder-decoder protein language model to recapitulate thermophilic variants of mesophilic proteins, using 4 million meso-thermo homolog examples from bacterial origin. The dataset is limited by only those where thermophilic organism temperature could be acquired, yet covers 2.7 k protein families labeled using Pfam34. (B) We used the trained model to score protein variants on thermal stability benchmark datasets including single and multi mutant data. The model achieves statistically significant correlation with measured melting temperatures and catalytic half inactivation temperatures. (C) We use the trained model to redesign 1ENH. The model suggested 14 changes, including insertions, that increased the melting temperature of the protein by 15.5 K.

Results and discussion

Thermophilic sequences can be recapitulated from mesophilic homologs

The NOMELT model was trained, given an input mesophilic protein sequence, to construct a thermophilic homolog by causally building the amino acids from N to C terminus. It is known that proteins among a family can have as little as 20% identity and observe insertions and deletions, yet still retain same-or-similar function33. Thus, we would not expect nor want the model to perfectly recover every thermophilic amino acid in a particular protein pair. Instead, the data that the model was trained on are redundant among protein pairs, as seen in Figure S1, where the probability density of the number of times a particular sequence is seen in the training set for mesophilic and thermophilic sequences is shown. Here, each unique thermophilic or mesophilic protein in the dataset may be paired to one or more homologs. Many participate in only one homologous pair, however especially for thermophilic proteins, many examples are paired to multiple counterparts. Each thermophilic sequence to be constructed has on average 103.6 mesophilic counterparts, and 7.1 for the inverse. This is desirable as NOMELT has some redundancy across evolutionary space to learn from. The mesophilic sequences were labeled with Pfam 35 (E < 0.0001), finding 2.7 k total families from 348 clans in the dataset out of 19.6 k possible families, with family counts ranging from seen only once to seen > 1 m times, and 52 k examples having no significant Pfam label34. The data available were limited by only those prokaryotic proteins for which a thermophilic homolog could be associated to a known thermophile32. See Table S1 for counts of all families in the dataset, and Figure S2 for counts of Pfam clans, example 3D structures for some of the most observed clans shown.

To evaluate the model’s ability to recapitulate thermophilic counterparts, we consider a held-out test set of protein clusters. The test set was extracted from the overall dataset by 50% identity clusters such that no sequence had > 50% identity with another across development splits. Further, to reduce evaluation bias towards highly represented protein families and make test evaluation feasible, only a single sequence from 1000 random clusters was selected. The resulting sequences were diverse, exhibiting 479 unique Pfam labels, while 355 examples had no recognized family. On this test set we compute a number of metrics by comparing the thermophilic sequence to the model’s output, see Table 1. The test loss is determined causally with teacher forcing for previous amino acids, while all other metrics are determined by comparing to a BEAM search over autoregressive stochastic translations, yielding a single translated sequence35.

Table 1 Scores of the model for recapitulating the test set.

The model generates sequences with a 2.3% difference in length (shorter or longer) compared to the true thermophilic sequence, and has a transcription error rate of 47%, indicating that 53% of the time the model places the exactly correct amino acid in the correct position. Note that the error rate does not consider homology/analogous amino acids, or gaps and deletions, and is normalized to the total number of residues in the test set as opposed to the number of sequences. In order to ground the model’s understanding of the meso to thermo translation space, we leverage known natural variation to qualitatively compare to the test loss value. A Multiple Sequence Alignment (MSA) was built for each thermophilic sequence by searching the entire thermophilic dataset of homologs using jackhmmer36. The cross entropy for the thermophilic targets over the natural variation, residue-wise, is 0.94 on average. The model’s test categorical cross entropy loss is 0.90, indicating that the model is slightly better at predicting the true amino acid at a position than treating each position independently and sampling the most probable residue from natural amino acid distribution over thermophilic homologs. Critically, the model does not require an MSA of other thermophilic homologs, but a single input mesophilic sequence.

To consider the viability of the model to produce full sequences, as opposed to residue-wise prediction, we also took each generated sequence and aligned it to its true test thermophilic sequence and found full length alignment identities between 4.5 and 100% with an average of 43%, and bits-per-residue of 2.4 according to the BLOSUM62 scoring matrix. Only 19% of the time is the generated sequence further from both the mesophilic and thermophilic sequence in identity than they are from each other. Further, we labeled the secondary structure using pyDSSP following ESMFold structure prediction for each generated and thermophilic sequence25,85. The difference in distribution over Helix, Strand, and Loop structures for each pair of generated-ground truth sequence is measured by the Jenson-Shannon divergence. With a value of 0.01 on average, the model is correctly transcribing the expected secondary structures in target proteins. Lastly, 3D alignment of the generated and thermophilic predicted structures using FATCAT produces a mean P-value of 0.01 and a max of 0.1, indicating that the generated sequence is close in structure to the ground truth, according to ESMFold and FATCAT37. The variability of these metrics across the protein families in the test set is given in Figure S3. While we expect scores across families not in the test set to occupy a similar distribution, they are clearly variable and system-dependent. We observed a statistically significant worsening of metrics as the rarity of the protein family in the dataset increased. For 4 test examples, the BEAM search collapsed and the model did not produce a full protein sequence predicted to fold to a similar structure to the true thermophilic homolog (FATCAT P-value > 0.05). We recommend that the predicted structure of BEAM outputs from NOMELT be confirmed to be of the expected fold before considering variation suggested by the model.

Known thermophilic protein attributes are captured by the model

The community has been searching through the pool of thermophilic proteins for years in order to determine the rules and modes of high temperature stability. While the general consensus is that there is no universal set of rules, with high temperature stability being highly sequence-dependent, a few notable observations are widely accepted15,39. Firstly, it is known that thermophiles leverage a different distribution of amino acids than do mesophiles40. We measured the distributions of amino acids for model-generated sequences and found that the model uses a similar shift in amino acid propensities as has been previously reported. In Fig. 2, we can see the change in amino acid prevalence between thermophilic and mesophilic for model-generated sequences follows the previously reported values in direction and magnitude. Note that the literature values are derived from only 16 proteomes, and our data differs in magnitude for some amino acids. Model generated sequences generally follow a similar distribution as our test data. Uniquely, we found that Valine, which both diverged from the literature shift in our model and was found to be statistically significant, was most likely to be replaced with Isoleucine (24%), generally a good substitute, when not predicted correctly (65%). Indeed, the model produced more Isoleucine than observed in the test set.

Fig. 2
figure 2

Change in Amino Acids frequencies between mesophilic proteins and thermophilic ones. In black, values derived from 16 proteomes in literature40. Arrow starts indicate mesophilic usage of the amino acid, and arrow ends indicate the amino acid usage for the compared group. In orange, the test set of our data, and in purple, generated sequences by our model. Statistically significant shifts determined in literature are highlighted in blue. For almost all significant amino acids, our data has about the same direction in shift as identified from reference proteomes, with most being the correct order of magnitude. The generated sequences recapitulate the distribution observed in our data.

Next, while the evolutionary advantage of disulfide bonds for high temperature stability is unclear, it is accepted that thermophilic archaea tend to leverage them to a greater extent than do mesophiles41,42. The model outputs were probed for the predicted likelihood (Eq. 1) of cysteine to determine if it understands the importance of such bonds for high temperature stability. It was found that the model is significantly more likely (T-test P-value = 8.5e−22) to place a cysteine in the sequence to complete a disulfide bond than to place one that does not form a bond, as shown in Fig. 3. Note that disulfide bonds are posited using the heuristic of 7.5 Å alpha carbon distance on the ESMFold predicted structure25,43.

Fig. 3
figure 3

Model predicted log likelihood of Cysteine residues for positions that would or wouldn’t form a disulfide bond with another cysteine in the predicted structure (Cα distance < 7.5Å43). In green, the log likelihood of an amino acid assuming random uniform. For residues where the Cysteine would not form a bond with an existing Cysteine, the model has little Cysteine bias, sometimes predicting Cysteine with high probability and other times choosing different amino acids. For residue positions that would form a disulfide bond, the model has a very heavy bias towards predicting Cysteine, T-test P-value = 8.5e−22.

Strictly speaking, the model is trained to convert proteins to look more like thermophilic variants, however we know that these must be stable at high temperature, in order for the host organism to thrive. To confirm this and evaluate the model’s ability to capture it, we adapt the recent mAF-min method by normalizing by sequence length44. For the test set examples, we compared the mAF-min estimated stability of the ground truth thermophilic homolog and the model translation to the mesophilic protein, finding that both are estimated to be more stable with statistical significance (Fig. 4). While the model generated predicted stabilizing variants on average, it learned from natural thermophilic proteomes which have evolutionary baggage other than strictly thermally stabilizing variations. The model outputs should be considered possibly stabilizing variation to search over but are not guaranteed to produce a stabilized variant. We did not observe a correlation between length differences (insertions and deletions suggested by the model) and estimated thermal stabilization.

Fig. 4
figure 4

Shift in stability from mesophilic protein, according to the mAF-min method. Ground truth thermophilic sequences, are stabilizing on average, with 56% of examples in the dataset having > 95% confidence of having a high folding free energy change. The sequences generated by the model on the test set capture a similar conference of increased stability, with 72% meeting the > 95% interval.

Case study 1: use NOMELT to engineer thermally stabilizing sequence variation

We ran the model on the Engrailed homeodomain (EnHD), a three-helix-bundle transcription factor found in Drosophila melanogaster45,46. Note that while NOMELT’s data contains no eukaryotic proteins, and no sequence in the training set has a BLASTp E-value < 1.5 to EnHD, the helix-turn-helix motif was frequent in the prokaryotic dataset used for model training, see Figure S247. A BEAM search produced a sequence with 14 suggested mutations (insertions and deletions included) when aligned with the wild type protein, as seen in Figure S4.

To determine any incurred stability increase, we ran molecular dynamics on EnHD variants, which has been done extensively to study the stability of this particular protein48,49. Such computational studies have observed several thermal equilibrium ensembles on shorter timescales than sampled in this paper50. In order to effectively sample the range of unfolding pathways and resolve differences of ~ 10 K in melting temperatures, a large ensemble of replicas with randomized starting velocities is needed. Five replicates of 1 microsecond dynamics were run at a number of temperatures for each of the wild type protein, the NOMELT variant, and a previously engineered variant that is thermostable up to > 373 K48,51. The Cα RMSD over each simulation was averaged, yielding a distribution of 5 time-averaged values per protein per temperature. Note that the starting structure (prior to equilibration) for the NOMELT variant is an AF2 predicted structure. In Fig. 5, the average RMSD relative to 298 K is given as a function of temperature. The previously engineered variant, “UVF,” retains a tight distribution of displacement around its starting structure up to 370 K, while the wild type protein begins to open up at its melting temperature of 325 K, indicated by a tight distribution of RMSD values widening and an increase in magnitude discontinuously52. The NOMELT variant, while not nearly as stable as UVF, which required significant human and computational resources to design, retains its structure until at least 340 K before doing the same.

Fig. 5
figure 5

RMSD over 1 microsecond dynamics simulations of EnHD variants. RMSD values are relative to the average value of the RMSD of the variant over the 296 K simulations. Distributions are over 5 independent simulations. The UVF variant, experimentally determined to have a melting temperature > 372 K, remains tightly bound to its initial structure at all temperatures52. The wild type protein opens up from its starting structure at its melting temperature, while the NOMELT variant retains its structure at least another 10 K.

We next evaluated the NOMELT EnHD variant with circular dichroism (CD) to observe secondary structure components and thermal denaturation, as seen in Fig. 6. The NOMELT variant was helical in solution at room temperature (298 K, 25 °C), and its melting temperature was 340.0 ± 1.8 K (67.0 °C), an increase of 15.5 K relative to EnHD’s melting temperature of 324.5 ± 0.3 K (51.5 °C). While in this case the model outputs explicitly produced a stabilizing effect, this strategy needs to be tested on more systems to determine reliability. Indeed, the in silico metrics we observed on the test dataset were system dependent, as discussed earlier. Frequency of the family of the protein of interest in the dataset may be a heuristic for the model’s likelihood to not break the protein fold (Figure S3).

Fig. 6
figure 6

Experimental characterization of a stabilized EnHD variant. (a) Mean residue ellipticity at 298 K (25 °C) for EnHD (filled circles) and NOMELT (open circles). (b) Thermal denaturation monitored by CD at 222 nm with melting temperatures inlaid.

Case Study 2: NOMELT as a zero-shot estimator of thermophilicity

The output token likelihood of the decoder is a way to estimate the “thermophilicity” of a target mutation or variant according to the model’s learned distribution. The idea of using protein language models either zero-shot53,54 or after self-supervised training on a protein family23,55,56 to evaluate fitness even when labeled fitness data is unavailable is not novel. However, unlike the family specific techniques, NOMELT is pretrained and does not require an MSA input, and unlike both methods, NOMELT is trained to specifically consider high temperature proteins. For example, ESM2 or MSATransformer are trained on all available proteins sequences and model protein diversity across the evolutionary tree25,57. Models like these can be helpful to evaluate fitness by filtering out variation that is universally unnatural, but we argue does not account for interactions that may be specifically important at high temperatures far from typical natural conditions.

Here, we take protein variants with experimental data and evaluate the relationship between NOMELT likelihoods and ground truth thermal stability data. The parent or wild type protein is given to the encoder, and a variant given to the decoder, which predicts a probability vector. After aggregating, the quantity is normalized by the wild type value. See Methods for mathematical details. Firstly, the model was used to rank variants of LovD and LipA from a directed evolution campaign for which experimental melting temperature was measured, with up to 29 and 12 mutations away from wild type respectively58,59. Acetyltransferase LovD is a promising enzyme for the production of Lovastatin, used for treating heart disease, while lipases see a wide range of uses in industrial applications60. We observe model correlation with the experimental melting temperature, as shown in Fig. 7. Note that these variants have a minimum E-value of 1.0 with respect to the NOMELT training set, but homologs are in the ProtT5 pretraining set. For LovD, NOMELT is extremely predictive out of the box, with a Pearson correlation of 0.939 to the true melting temperature. When the original pretrained T5 model weights are used we observe a correlation of − 0.927. This is not just a sign switch; T5 scores for 10 variants with 10 random mutations of LovD receive an average log likelihood score of − 212, compared to the stabilized variants with 10 and 29 mutations with scores of − 173 and − 578, respectively. NOMELT on the other hand scores the random variants at − 34 while nearly all of the stabilized variants receive positive scores, thus differentiating random from stabilizing variants. Recalling that the temperature stability for this benchmark increases as mutations accumulate, this suggests that the original model trained on bulk natural sequences cannot be used to rank unnatural stabilizing mutations for this system, an ability granted by our training regime. For LipA, predictions are poorer; the model can qualitatively rank most high temperature variants, but it overestimates the wild type. We hypothesize that this is because the starting melting temperature of the LipA protein is already quite high, and the model was trained on protein pairs where growth temperature was not continuously differentiated, E.g. the model was not told the difference between a 338 K protein and a 353 K one. This suggests that NOMELT has some ability out of the box to rank protein variants in terms of high temperature stability.

Fig. 7
figure 7

Experimental melting temperature measurements vs NOMELT logP sequence scores for two protein variants sets. Black line is the wild type score. (A) NOMELT is highly predictive of LovD variants, with a low wild type melting temperature. Note that for 10 variants each with 10 random mutations, NOMELT scores at -34 on average. (B) For LipA variants, the model can qualitatively rank some of the high temperature variants, but overall ranks the wild type sequence, which is already a relatively high melting temperature, too high.

Lipase has also been the subject of a deep mutational scan (DMS) with catalytic half activation temperature as a target61. This experiment of 2172 single point mutants serves as the only benchmark in ProteinGym with a thermal stability target62. When our model is used to rank variants, we note a Spearman correlation to the true target of 0.32, which is state of the art for models that do not require MSAs, homologs, or structures as inputs as of Jan, 2024. NOMELT matches the score of ESM2, which was trained on significantly more data, indicating that focusing on thermophilicity has merit25. Note that in the original data, replicates were taken for each variant, while ProteinGym only considers the mean. When we consider only the 34% (N = 1.7 mil) of pairwise comparisons with T-test P-value < 0.05, NOMELT qualitatively distinguishes the mutations 64% of the time. This is impressive in that variants in the DMS library typically have less than 2 K differences. When the pretrained model weights are used, the correlation is insignificant at < 0.001 and P > 0.95.

Lastly, in order to include zero-shot results for many more protein folds than thus far explored, we consider thermal stability data found within FireProtDB63. Note that these data, unlike the previous data in this case study, come from many different experiments and assays and the quality of each point cannot be verified. When the model was used to score mutants of proteins with fewer than 500 amino acids, resulting in 13 k proteins from at least 97 Pfam families, we found a Spearman’s correlation of 0.3 (P-value 7.55e−266) with actual changes in melting temperature. The database is known to be biased towards destabilizing mutations—If we consider the model qualitatively and balanced over label bias, it predicted the correct direction of melting temperature change with an area under the receiver operating curve of 0.61. The original T5 model weights produce no better than random Spearman and AUROC.

Methods

Dataset

We leveraged our recent dataset, learn2thermDB, to train and test this model32. The data contains pairs of homologous proteins across mesophilic and thermophilic temperature regimes. As a first pass, we rigorously filtered this data to contain the strongest possible protein pairs. We only considered pairs with a > 20 K difference in optimal growth temperature, with the thermophilic source organism of at least 333 K and mesophilic less than 313 K. The protein pairs themselves were filtered such that only those with > 95% sequence alignment coverage of both strands were kept, and the difference in sequence length less than or equal to 10%. This produced 4.6 million protein pairs.

The dataset was clustered using MMSeqs2 to 50% identity on the mesophilic sequences. The parameters used were as follows: min-seq-id = 0.5, cluster-reassign = 1, cluster-steps = 5, s = 7, max-seqs = 1000, c = 0.95, cov-mode = 0, similarity-type = 2, e = 1e−3, cluster-mode = 164.

Train, validation, and test sets (80/10/10) were split by cluster such that no sequences in a particular data fold are within similarity thresholds of any sequence in another data fold. For the test set, we randomly sampled 1000 clusters and selected from each a single random sequence. This reduces the bias of large clusters on the evaluation and makes computed predicted structure based metrics computationally feasible.

NOMELT model

NOMELT is an encoder-decoder style transformer neural network. We started from the pretrained ProtT5 foundational protein Language Model, trained to mask-fill protein sequences on the BFD database65. The HuggingFace ecosystem of code was used to write the training scripts66. Out of the box, the model can be used to generate latent embeddings of proteins, and using a causal language modeling head, the decoder can recapitulate an encoded protein sequence with > 99% accuracy. By starting from this model, we leverage general protein knowledge that has been shown to be captured by large protein language models, and we save a significant amount of model training carbon cost. In this work, the model was fine tuned to instead causally generate a thermophilic protein homolog of an input mesophilic protein.

We trained the model in a supervised sequence-to-sequence manner using categorical cross entropy loss on each amino acid in the output sequence in a causal, teacher-forcing strategy. We used a linear ramp-up, ramp-down learning rate and an Adam optimizer with maximum learning rate 1e-4 and a ramp rate of 10% of 3 epochs. The model was stopped according to early stopping on validation set loss using a patience of 4 and an improvement threshold of 0.1. This required 7 NVIDIA a40 GPUs 6 h before the validation loss stopped improving, and cost approximately 2.4 kg of carbon emissions. DeepSpeed ZeRO Stage 3 was used to shard and parallelize batches, parameters, and optimization states. BF16 parameter precision was used. A batch size of 140 per device with 2 accumulation steps, was used, for an effective batch size of 1,960 pairs. The model saw each meso-thermo pair independently, receiving no explicit information about which examples were homologous.

Model likelihood

We refer to the softmax probability distribution over amino acids output by the causal model at a specific position \(i\) in the sequence \({P}_{i}(\cdot |{x}_{0},{x}_{1},...,{x}_{i-1})\) as \({P}_{i}\) where \({x}_{i}\) is the amino acid identity at position \(i\). The normalized probability of a particular amino acid \(j\) from the vector at that position is written as \({P}_{i}(j)\). Following recent work that leverages the probability of language models to evaluate the fitness/target quality due to mutation or of a particular sequence23,67, we measure the log likelihood L of a sequence or a set of mutations at specific positions according to Eq. 1 below, where \(M\) is some set of positions in the sequence:

$$L\left( x \right) = \frac{1}{M}\mathop \sum \limits_{i \in M}^{{}} ln\left( {P_{i} \left( {x_{i} } \right)} \right)$$
(1)

Thus, we can measure the likelihood of a specific mutation at a particular position given a fixed set of all previous amino acids, the combined effect of a set of mutations, or to evaluate the causal likelihood of an entire sequence of arbitrary length. This quantity is useful when comparing to another mutation set or sequence, such as evaluating the likelihood of a mutation occurring relative to the wild type, WT, in Eq. 2 below:

$$\Delta L\left( {x_{variant} } \right) = L\left( {x_{variant} } \right) - L\left( {x_{WT} } \right)$$
(2)

Here, a positive value indicates an increase in likelihood relative to the wild type according to the model. Keeping in mind that the model was trained to generate a thermophilic looking sequence, this quantity carries a different meaning than seen before in canonical zero-shot predictors. This quantity does not represent the estimated likelihood of the mutation/variant among variation, but instead the estimated likelihood of the mutation/variant resembling a thermophilic homolog of the input mesophilic sequence. Note that for evaluating variants that do not have insertions or deletions, we only compare probabilities at mutated residues to the wild type, while instead we must consider the whole sequence probability for instances where there are variable residue counts in variants.

Model and natural entropy

The categorical cross entropy of the model for some ground truth residue in a sequence at position \(i\):

$$CCE_{i} = \mathop \sum \limits_{j}^{AA} - 1\left( {j = y_{i} } \right) ln(P_{i} \left( j \right))$$
(3)

where AA is all amino acids, plus a number of additional tokens with near 0.0 probability that can be generally discarded. The true amino acid label is \({y}_{i}\). This quantity is averaged over all tokens in the dataset and considered the loss function of the model. In order to compare this quantity evaluated on the test set to nature, we took each thermophilic test-set sequence and built an MSA using jackhammer against all other thermophilic sequences > 333 K OGT34. Search parameters are listed below. For each column in the MSA that appears in the original sequence, we compute a distribution over natural diversity, \({P}_{i}\), and use this vector to evaluate the test set entropy on natural variation. This represents a model that treats each position independently instead of causally, and relies upon already existing thermophilic homologs.

Parameters for the jackhmmer search were F1 = 0.0005, F2 = 0.00005, F3 = 0.0000005, incE = 0.0001, E = 0.0001, and N = 1 (phmmer). We used 32 cores.

Disulfide bonds

For probing the model’s understanding of Disulfide bonds, we view the test dataset on a teacher-forcing residue basis and consider only positions within the dataset where the true thermophilic sequences have a cysteine. For each of these positions (N = 1687) we measure \({P}_{i}(Cystein)\) according to the model. The ESMFold predicted structure of each thermophilic test sequence is used to identify which cysteines are likely forming Disulfide bonds. Here, we use the heuristic of 7.5 Å between cysteine Cα to label a Disulfide bond43. The second cysteine in bond-forming pairs, causally, are labeled as Bond-forming, while all others are labeled as non-bond forming. Note that the model is only aware of residues to the left in sequence, thus from the model’s perspective, the first cysteine in a bonded pair occurring in the sequence is not forming a bond at the time of inference. We computed the one-sided T-statistic between \(ln{(P}_{C-bond}(C))\) and \(ln{(P}_{C-non-bond}(C))\).

Sequence alignment

BLOSUM62 was used to score all sequence alignments38. For searching case study proteins against the training set, BLASTp with the following parameters were used: evalue = 2.0, word_size = 3, qcov_hsp_perc = 8047. For aligning model designs to ground truths, full Smith-Waterman was used using BioPython with the following parameters: match_score = 1, mismatch_score = − 1, gapopen = − 4, gapextend = − 1, penalize_end_gaps = false68.

Structural prediction

For visualization purposes, Disulfide bond estimation, and test set metrics we use ESMFold due to the cost advantage over AlphaFold24,25. Note that crystal structures were used wherever possible. For the mAF-min method, we use AlphaFold using the same parameters originally described e.g. reduced database size and no force field relaxation44.

Thermal stability estimation

The mAF-min method was used in order to qualitatively compare the high temperature stability of variants of a protein44. This method qualitatively ranks variants accumulating many mutations with high accuracy, for a number of protein families. The raw output of the method is an average over Rosetta energy scores for an ensemble of predicted protein structures69. Such a score resembles a free energy of folding. Comparing two of these values would seem akin to, but does not equate, a change in Gibbs folding free energy, \(\Delta \Delta G\). Such values are used to describe the stability of the protein in its folded state, which differs from thermal stability better described by a metric such as melting temperature. Yet, when the Rosetta scores over the ensembles of this method are compared to thermal stability metrics such as melting temperature, they are qualitative for a number of protein families. While a perfectly accurate predictor of thermal stability that generalizes over protein folds does not yet exist, mAF-min with its use of ensembles of structure, represents one of the best available thermal stability oracles70.

Here, we use the method exactly as described in the original work, with the following exceptions: an ensemble size of 25 instead of 100 was used as suggested by the authors, which was shown to have little loss in accuracy. Even with this reduction, the expense of calling AF2 many times is high, and we randomly selected 50 test set examples to contribute to the results depicted in Fig. 4. Additionally, to account for small numbers insertions and deletions not present in the original work, we normalize the method output which qualitatively represents free energy of the system by the number of residues. To validate this choice, we conducted the method with sequence length normalization on 5 variants of Ribonucleases with sequences differing in length from 96 to 110 and melting temperatures between 314.3 and 326.4 K71. This resulted in a Spearman’s correlation of − 0.999 with P-value 1.4e−24 to melting temperature. Note that the original authors did not provide open source code, so the method was rewritten according to their manuscript in house and is given in this work’s repository.

Molecular dynamics

GROMACS was used to conduct molecular dynamics simulations72. All simulations were set up identically with the exception of temperature. The protein was placed in a cubic periodic boundary box with at least 1 nm between protein and box edge to eliminate unwanted protein–protein interactions across periodic boundary conditions. System was solvated with TIP3P and charge balanced with ions73. Data presented is from 1 microsecond long production runs after 100 ps NPT equilibration. The charmm36m force field was used74. The full configuration files used, including additional information such as timesteps, constraints, velocity cutoffs, etc. are given in a separate repository75. Five replicates were conducted for each protein variant at each temperature. Using MDAnalysis 2.3, RMSD values were aligned to energy-minimized PDB reference structures and calculated over residues 10–52 on all variants to capture only the core residues of the 3-helix bundle76,77. The RMSD values averaged over each 1 microsecond simulation make up an ensemble of five time-averaged RMSD values, which are depicted in Fig. 5. The RMSD of each simulation was normalized to its ambient temperature behavior by subtracting the variant’s mean simulation RMSD at 298 K. This emphasizes the difference of variants over temperatures as opposed to differences between the baseline behavior of the variants.

Cloning, protein expression, and purification

Genes for EnHD were built using oligos designed by DNAWorks 2.078 and ordered as a gBlock for NOMELT (IDT). Both were cloned into a pET21 expression plasmid containing an N-terminal, TEV-cleavable His6 tag under a T7 promoter and ampicillin selection using restriction ligation (NEB). Sequences were confirmed by Sanger sequencing (Sequetech or Genewiz). The plasmids were transformed into BL21(DE3) pLysS E. coli cells (Novagen) for expression.

The E. coli cultures were grown in 0.05 mg/mL ampicillin to an OD600 of 0.7–0.8 at 37 °C, induced with 1 mM IPTG (Apex), grown for another 3 h, and then pelleted and stored at -80 °C overnight. The pellets were lysed using a freeze–thaw method and resuspended in 50 mM Tris pH 8.0, 250 mM NaCl, 10 mM imidazole with benzonase (EMD Millipore) and lysozyme (RPI). Lysates were cleared via centrifugation, and the supernatant was run over a HisTrap column (Cytiva) via FPLC (ÄKTA). Peak fractions were pooled and dialyzed back into the same buffer for cleavage by AcTEV protease (ThermoFisher). Cleaved proteins were purified over a Supradex 75 Increase 10/300 GL (Cytiva). Protein size and purity were confirmed by polyacrylamide gel electrophoresis and concentrations were determined based on UV/Vis A280 (Cary). Masses were additionally confirmed within 1% mass tolerance by MALDI-TOF (Shumatzu).

Circular dichroism spectroscopy

The proteins were dialyzed into 50 mM sodium phosphate buffer pH 5.5, 100 mM NaCl and diluted to a concentration of 5 μM. CD measurements were made on an Olis rapid-scanning monochromator controlled by a Peltier heat exchange system with a Quantum TC 125 temperature controller in a 1-cm quartz cuvette (Hellma) with stirring. Spectra were collected in triplicate from 200 to 250 nm in 1-nm increments at 25 °C, 95 °C, then again at 25 °C (298 and 368 K) to confirm reversible folding, and the buffer blank was collected in triplicate at 25 °C. The plotted spectra (Fig. 6) are the averaged initial spectra at 25 °C with the averaged buffer blank subtracted and millidegrees converted to mean residue ellipticity. Melts were monitored at 222 nm from 5 to 98 °C (278 to 371 K) in 3-degree increments with 5 min of equilibration at each temperature and 60 s of signal integration. Melt data were fit to a 2-state model with linear folded and unfolded baselines. For EnHD, the ΔCp was set to the experimentally determined value of 0.7 kcal/mol52 and for NOMELT, it was allowed to vary.