Transformer-based protein generation with regularized latent space optimization

A preprint version of the article is available at arXiv.

Abstract

The development of powerful natural language models has improved the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution and next-generation sequencing have allowed for the accumulation of large amounts of labelled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder, which features a highly structured latent space that is trained to jointly generate sequences as well as predict fitness. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and a novel approach for efficient fitness landscape traversal. Using ReLSO, we explicitly model the sequence–function landscape of large labelled datasets and generate new molecules by optimizing within the latent space using gradient-based methods. We evaluate this approach on several publicly available protein datasets, including variant sets of anti-ranibizumab and green fluorescent protein. We observe greater sequence-optimization efficiency (increase in fitness per optimization step) with ReLSO than with other approaches, and find that ReLSO more robustly generates high-fitness sequences. Furthermore, the attention-based relationships learned by the jointly trained ReLSO models provide a potential avenue towards sequence-level fitness attribution information.
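
The abstract describes two coupled mechanisms: an autoencoder whose encoder is trained jointly on sequence reconstruction and fitness regression, and gradient-based optimization of latent codes against the fitness head. The sketch below illustrates that general pattern in PyTorch; it is a minimal toy, not the authors' implementation (see the Code availability section for that), and every module, dimension and hyperparameter here is an illustrative assumption rather than a value from the paper.

```python
# Minimal sketch (illustrative only) of (1) jointly training an autoencoder on
# sequence reconstruction + fitness prediction and (2) gradient ascent on the
# fitness head over a latent code. All sizes/names are assumptions.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, D_MODEL, D_LATENT = 21, 100, 128, 32  # 20 amino acids + pad

class JointAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_latent = nn.Linear(D_MODEL, D_LATENT)
        # Toy decoder: map the latent code straight to per-position logits.
        self.decoder = nn.Linear(D_LATENT, SEQ_LEN * VOCAB)
        # Fitness prediction head attached to the same latent space.
        self.fitness_head = nn.Sequential(nn.Linear(D_LATENT, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))          # (B, L, D_MODEL)
        z = self.to_latent(h.mean(dim=1))             # pool to one latent code per sequence
        logits = self.decoder(z).view(-1, SEQ_LEN, VOCAB)
        return z, logits, self.fitness_head(z).squeeze(-1)

model = JointAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
tokens = torch.randint(0, VOCAB, (8, SEQ_LEN))        # stand-in batch of tokenized sequences
labels = torch.randn(8)                               # stand-in fitness measurements

# Joint training step: reconstruction and fitness losses share one encoder,
# so both objectives shape the latent space.
z, logits, fitness = model(tokens)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), tokens) \
       + nn.functional.mse_loss(fitness, labels)
opt.zero_grad()
loss.backward()
opt.step()

# Latent-space optimization: freeze the model, treat one latent code as the
# free variable, and take gradient steps that increase predicted fitness.
for p in model.parameters():
    p.requires_grad_(False)
z_opt = z[:1].detach().clone().requires_grad_(True)
latent_opt = torch.optim.SGD([z_opt], lr=0.1)
for _ in range(10):
    latent_opt.zero_grad()
    (-model.fitness_head(z_opt)).sum().backward()     # minimize negative fitness = ascend
    latent_opt.step()

# Decode the optimized latent point back into a candidate sequence.
candidate = model.decoder(z_opt).view(-1, SEQ_LEN, VOCAB).argmax(dim=-1)
```

Because the reconstruction and regression losses backpropagate through a shared encoder, the latent geometry reflects fitness; optimized latent points can then be decoded back into candidate sequences, which is the traversal strategy the abstract refers to.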


Figures

Fig. 1: ReLSO maps sequences to a regularized model fitness landscape.
Fig. 2: ReLSO learns smooth representations of protein sequences.
Fig. 3: Comparison of methods for ML-based protein sequence optimization efficiency.
Fig. 4: Protein sequence optimization of anti-ranibizumab antibodies.
Fig. 5: Optimization of fluorescence brightness of GFP.
Fig. 6: Leveraging attention relationships in ReLSO for fitness attribution.


Data availability

The datasets used in this study are available in their processed form at the project’s GitHub repository (https://github.com/KrishnaswamyLab/ReLSO-Guided-Generative-Protein-Design-using-Regularized-Transformers/tree/main/data). Additionally, we include links to the original data sources in the README file of the repository.

Code availability

Implementations of the models and optimization algorithms are available at the project’s GitHub repository (https://github.com/KrishnaswamyLab/ReLSO-Guided-Generative-Protein-Design-using-Regularized-Transformers) and archived on Zenodo (https://doi.org/10.5281/zenodo.6946416).


Acknowledgements

E.C. is funded by a National Library of Medicine training grant (LM007056). D.B. is funded by a Yale–Boehringer Ingelheim Biomedical Data Science Fellowship. S.K. acknowledges funding from the NIGMS (R01GM135929, R01GM130847), an NSF CAREER grant (2047856), Chan Zuckerberg Initiative grants (CZF2019-182702 and CZF2019-002440) and a Sloan Fellowship (FG-2021-15883). We thank A. Tong and other members of the Krishnaswamy Lab for their suggestions and discussion on this project.

Author information


Contributions

E.C. and S.K. conceived and planned the work presented here. E.C. implemented the models and performed numerical experiments, with help from A.G., J.R. and K.G. K.G. and D.B. performed in silico validation of the model predictions. E.C. and S.K. authored the manuscript with assistance from D.B. in proofreading and formatting. All authors discussed the results and commented on the manuscript.

Corresponding author

Correspondence to Smita Krishnaswamy.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Brandon Allgood, Markus Buehler and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Tables 1–4 and Figs. 1 and 2.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Castro, E., Godavarthi, A., Rubinfien, J. et al. Transformer-based protein generation with regularized latent space optimization. Nat Mach Intell 4, 840–851 (2022). https://doi.org/10.1038/s42256-022-00532-1

