The 2024 Nobel Prize in Chemistry, awarded to Demis Hassabis and John Jumper of DeepMind and to David Baker of the Institute for Protein Design at the University of Washington, recognizes transformative achievements in artificial intelligence-driven protein structure prediction and design. It ushers in a new era for chemistry and biology, acknowledging the profound impact of artificial intelligence (AI) on scientific research and on practical applications across disciplines, an impact also recognized more broadly by the 2024 Nobel Prize in Physics.

At the core of the advances behind the Chemistry prize is a computational understanding of living matter at the atomic level, achieved through AI models capable of predicting, analyzing and designing the 3D structures of proteins, alone or, more recently, in complexes with other molecules such as nucleic acids, ions, and small ligands. Such capabilities tackle one of biology’s most enduring challenges, and they explain why the arrival of AI in the field was already felt as a revolution when AlphaFold 2 “won” the 14th edition of CASP (the Critical Assessment of protein Structure Prediction) in 2020. After nearly 25 years of, at best, incremental improvements, CASP was finally bearing its first substantial fruits.

Artificial neural networks that understand biomolecules

DeepMind had already entered the previous edition of CASP with its first AlphaFold model (version 1), which squeezed the most out of the same techniques that the top academic groups were applying at the time, all mainly capitalizing on recent breakthroughs in predicting residue contacts (and distances and orientations) from multiple sequence alignments (MSAs) through coevolution calculations1. AlphaFold 2 was not simply a new version but a full redesign and rethinking of the protein structure prediction problem, whose performance left scientists in awe, and initially frustrated, until a period of illumination came that changed the future of structural biology forever. The AlphaFold 2 paper2 put forward several innovations that other scientists could subsequently build on. Two key innovations were the Evoformer module and the integration of attention mechanisms to model proteins as spatial graphs directly within the AI model itself, unlike all other methods - including the first AlphaFold model - which only predicted contacts, distances and angles that were then fed into a conventional protein folding program. In particular, the Evoformer module allowed AlphaFold 2 to process multiple sequence alignments and extract coevolutionary information in an indirect way that made the system more tolerant to problems in the MSAs. The attention mechanisms, in turn, allowed the system to capture evolutionary relationships and physical interactions between distant residues, enabling highly accurate 3D predictions even for protein complexes. Importantly, too, integrating the structure calculation stage into the neural network itself connected the input data (sequences, alignments, and 3D structures of candidate templates) with the outputs (modeled structures together with various confidence scores) in a single, end-to-end differentiable pipeline. This meant that the system could be run iteratively to better process the information and achieve better convergence. It was also critical from the users’ point of view that AlphaFold 2 returned not just structural models but also metrics that report the quality of its own predictions (a global predicted TM-score, residue-wise pLDDT, and pairwise PAE scores) - something that CASP had always pushed for but rarely assessed1.
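To give a flavor of how a pair representation can bias attention between residues - one of the central ideas popularized by the Evoformer - the toy PyTorch sketch below implements a single attention head whose logits receive an additive bias computed from pair features. The dimensions, the single-head formulation and all names are illustrative assumptions: this is a drastically simplified cartoon of the idea, not DeepMind’s implementation.

```python
import torch
import torch.nn as nn


class PairBiasedAttention(nn.Module):
    """Toy single-head attention over residues, biased by a pair representation.

    The pair features z[i, j] contribute an additive bias to the attention
    logits between residues i and j, so coevolution-derived pair information
    can steer which residues attend to each other. Dimensions are illustrative.
    """

    def __init__(self, d_single: int = 64, d_pair: int = 32):
        super().__init__()
        self.to_q = nn.Linear(d_single, d_single, bias=False)
        self.to_k = nn.Linear(d_single, d_single, bias=False)
        self.to_v = nn.Linear(d_single, d_single, bias=False)
        self.pair_bias = nn.Linear(d_pair, 1, bias=False)  # z_ij -> scalar bias
        self.scale = d_single ** -0.5

    def forward(self, s: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # s: (L, d_single) per-residue features; z: (L, L, d_pair) pair features
        q, k, v = self.to_q(s), self.to_k(s), self.to_v(s)
        logits = (q @ k.T) * self.scale                   # (L, L) residue-residue logits
        logits = logits + self.pair_bias(z).squeeze(-1)   # add pair-derived bias
        attn = torch.softmax(logits, dim=-1)
        return attn @ v                                   # (L, d_single) updated features


# Toy usage: 10 residues with random features
L = 10
s = torch.randn(L, 64)
z = torch.randn(L, L, 32)
print(PairBiasedAttention()(s, z).shape)  # torch.Size([10, 64])
```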

The availability of such a powerful tool as AlphaFold 2 dramatically accelerated research across structural biology, as thousands of previously unknown protein structures became accessible through computational means, especially when backed by high-quality confidence metrics. DeepMind rapidly partnered with the European Bioinformatics Institute3 to produce millions of structural models that soon became available through UniProt and the Protein Data Bank resources themselves. Far from competing with experimental structure determination methods, AlphaFold 2 became their perfect ally, boosting the efficiency of scientists and of software processing experimental data by orders of magnitude. Already in CASP14, when AlphaFold 2 came out, its models helped to solve the phase problem for the X-ray diffraction data available for some of the targets4; cryo-EM structures can now be solved much faster when at least parts of the volumetric maps can be filled with AlphaFold 2 models whose conformations are then optimized as they are fit into the experimental densities5; and NMR structure determination has been driven to almost full automation by tools like NMRtist, especially when assisted with reliable AlphaFold 2 models6.
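As an illustration of how easily these precomputed models can now be consumed, the short Python sketch below downloads a model from the AlphaFold Database hosted at EMBL-EBI and reads the per-residue pLDDT values, which AlphaFold stores in the B-factor column of the coordinate file. The file-naming pattern and the “model_v4” version suffix are assumptions based on the database’s conventions at the time of writing and may change; requests and Biopython handle the download and parsing.

```python
import io

import requests
from Bio.PDB import PDBParser

accession = "P69905"  # human hemoglobin subunit alpha, as an example
# Assumed AlphaFold DB file-naming convention; check the current database version.
url = f"https://alphafold.ebi.ac.uk/files/AF-{accession}-F1-model_v4.pdb"

response = requests.get(url, timeout=30)
response.raise_for_status()

structure = PDBParser(QUIET=True).get_structure(accession, io.StringIO(response.text))

# AlphaFold models store per-residue pLDDT (0-100) in the B-factor column,
# so the CA atom's B-factor is the residue's predicted confidence.
plddts = [
    residue["CA"].get_bfactor()
    for residue in structure.get_residues()
    if "CA" in residue
]

print(f"{len(plddts)} residues, mean pLDDT = {sum(plddts) / len(plddts):.1f}")
```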

Beyond AlphaFold 2’s direct applications, the wealth of new concepts, methods and algorithms presented in the AlphaFold 2 paper inspired many academic and private groups to recycle, build on, or adapt the new knowledge and tools in their own methods and software. The result was a burst of new tools for computational structural biology that facilitate all kinds of studies on biomolecular structures: predicting interacting surfaces7,8 or stabilizing mutations9 given a structure, filling structures with ligands10, modeling the 3D structures of RNA (although, notably, non-AI methods seemed to perform best in CASP’s only assessment of RNA folding11), predicting structures of proteins complexed with non-protein molecules (pioneered by Baker with RoseTTAFold All-Atom12), processing MSAs to explore protein structure and evolution13, and designing proteins in whole new ways14,15,16,17 - the latter discussed below.

Expanding AI to all biomolecules

While the initial focus of AlphaFold and similar models was on predicting the structure of proteins, the latest advances have expanded their scope to other biomolecules, including nucleic acids, as well as ions, lipids, and other small molecules. This broader reach marks a critical shift from studying proteins in isolation to modeling complex molecular environments, and it promises a new revolution in biology, as this new generation of AI models can essentially handle all the kinds of molecules and interactions relevant to life.

The first program capable of parsing and modeling more than protein atoms was RoseTTAFold All-Atom from the Baker lab12. AlphaFold 318 then came out through an extremely simple-to-use web server within the Google domain, but with serious limitations that the community did not welcome: no source code, a limited number of jobs per day, use restricted to academic not-for-profit work, and support for only a limited set of small molecules and ions despite the program’s intrinsic capability, in theory at least, to handle any small molecule. Today, new programs are appearing that incorporate these “all-atom” functionalities in more permissive ways, such as Chai-119 from Chai Discovery, which can be executed locally or through a web interface similar to AlphaFold 3’s but without restrictions on the small-molecule inputs, accepting any molecule provided as a Simplified Molecular Input Line Entry System (SMILES) string.
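Because tools like Chai-1 accept arbitrary ligands as SMILES strings, a sensible precaution is to validate and canonicalize those strings before submission. The sketch below uses the open-source RDKit toolkit for that purpose; it is a generic pre-processing step, not part of Chai-1 itself, and the example molecules are arbitrary.

```python
from rdkit import Chem


def canonical_smiles(smiles: str) -> str:
    """Parse a SMILES string and return its canonical form, or raise if invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


# Example usage with two well-known small molecules (illustrative only)
print(canonical_smiles("CC(=O)Oc1ccccc1C(=O)O"))         # aspirin
print(canonical_smiles("CN1C=NC2=C1C(=O)N(C)C(=O)N2C"))  # caffeine
```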

These all-atom models not only make it possible to model life at the atomic level like never before, but also offer new ways for computers to assist drug development. While the “canonical” protocol for testing whether a ligand binds to a target protein involves knowing their 3D structures and sampling possible binding poses in silico, with little hope for any required conformational changes to take place during the docking procedure, the new AI models can simultaneously sample ligand and protein target conformations as they are “co-folded”. As these all-atom models become more efficient, we can expect a shift toward AI-driven drug discovery through this “co-folding” approach. This will have profound implications for the pharmaceutical, biotechnology and healthcare sectors, likely reducing costs and experimental research time in drug development pipelines. The application is so important that various companies are working on it and CASP has dedicated a specific track to the problem since its 15th edition20.

Understanding protein structure enables protein engineering

Prof. David Baker’s pioneering efforts in de novo protein design21, initially devoid of AI methods at their core but in recent years largely relying on them, especially through RoseTTAFold12 and the ProteinMPNN family of methods14,22, set the stage for a future where AI not only deciphers natural biology but also designs new molecular entities for use in biotechnology, medicine, and beyond. Baker’s group at the University of Washington’s Institute for Protein Design pioneered methods to create novel proteins from scratch, a feat that became significantly more powerful with the advent of AI - especially diffusion models to design protein conformations in space23 and message-passing neural networks to produce sequences that fold into the designed structures14,22. Key proofs of concept and concrete applications of these AI-based tools from the Baker lab include the efficient design of new enzymes24 and the stabilization of existing ones25, the crafting of complex multiprotein assemblies26, the design of multi-state proteins27, the engineering of binders with therapeutic applications, and the construction of protein crystals of use in materials science, to mention some notable examples.
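Conceptually, these design pipelines iterate three steps: generate a backbone, design a sequence for it, and keep the design only if an independent structure predictor reproduces the intended backbone (a “self-consistency” filter). The Python sketch below captures that loop; the callables, the 2 Å cutoff and all names are hypothetical placeholders standing in for real tools such as RFdiffusion, ProteinMPNN and a structure predictor, whose actual interfaces are not reproduced here.

```python
from typing import Callable, Dict, List


def design_proteins(
    target_spec: str,
    generate_backbone: Callable[[str, int], object],  # placeholder for a diffusion model (e.g. RFdiffusion)
    design_sequence: Callable[[object], str],          # placeholder for a sequence designer (e.g. ProteinMPNN)
    predict_structure: Callable[[str], object],        # placeholder for a structure predictor (e.g. AlphaFold 2)
    backbone_rmsd: Callable[[object, object], float],  # placeholder for a structural comparison routine
    n_designs: int = 100,
    rmsd_cutoff: float = 2.0,                           # assumed self-consistency threshold, in angstroms
) -> List[Dict]:
    """Generate backbones, design sequences for them, and keep only designs
    whose re-predicted structure matches the intended backbone."""
    accepted = []
    for seed in range(n_designs):
        backbone = generate_backbone(target_spec, seed)   # 1) backbone generation (diffusion model)
        sequence = design_sequence(backbone)              # 2) sequence design (message-passing network)
        predicted = predict_structure(sequence)           # 3) independent structure re-prediction
        rmsd = backbone_rmsd(predicted, backbone)         # 4) self-consistency filter
        if rmsd < rmsd_cutoff:
            accepted.append({"sequence": sequence, "rmsd": rmsd})
    return accepted
```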

Proteins designed with AI methods are already proving powerful, for example as multivalent single-chain proteins of potential use as vaccines28, as soluble analogs of membrane proteins that facilitate their study29, and as high-affinity binders useful for therapies or as sensors30. Broader applications include the regulation of protein function through designed binders, including engineered antibodies of clinical relevance31, as well as enzyme stabilization and computational evolution32, among others.

The future of AI in structural biology and of “holistic” AI models for biology

CASP16 is now under way, with results expected in late 2024, and promises an assessment of the state of the art of structure prediction beyond static tertiary protein structures. As CASP15 revealed, modeling of multimeric assemblies still needs some tweaks, and now that protein-only modeling is close to solved, the new frontiers are modeling ligand binding to proteins, multiple protein conformations, and nucleic acid folding, all already tackled in CASP15. Besides, CASP16 reintroduced the track assessing integrative modeling, which involves modeling typically large multicomponent complexes from sparse and varied data and which, after years of rather poor results33, could reach new heights as AI methods step in. All these special evaluation tracks in CASP16 point to the directions in which the field of computational structural biology will progress next, likely carrying along protein design and, importantly, small-molecule discovery and drug development.

Another big piece of the AI-for-biology picture is that of multimodal foundation models for biology, trained for the moment on massive amounts of DNA, RNA and protein sequences. Training on protein sequences “only” has already proved useful to predict protein structures and detect structure-consistent evolutionary relationships, with Meta’s ESMFold as the most prominent example13. Meanwhile, foundation models built around biology’s central dogma hold promise for new applications in genomics, transcriptomics and proteomics34.
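As an example of how accessible single-sequence prediction has become, the snippet below sketches a minimal ESMFold run using Meta’s open-source fair-esm package, following the usage documented in that package’s README as we recall it; the entry-point names, installation extra and GPU requirement are assumptions that should be checked against the current release.

```python
import torch
import esm  # the fair-esm package from Meta AI (pip install "fair-esm[esmfold]")

# Load the ESMFold model: a protein-language-model-based predictor that needs
# only a single sequence, with no multiple sequence alignment.
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()  # a GPU with sufficient memory is assumed

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # predicted structure returned as PDB text

with open("prediction.pdb", "w") as handle:
    handle.write(pdb_string)
```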

Next, multimodal foundation models that also span molecular structure are perfectly foreseeable with current technologies. Such models could bring a whole new suite of tools to interrogate and understand biology holistically, for example explaining complex changes in gene expression patterns in molecular and structural terms and then inferring which molecular effectors could restore the disrupted pathways.