Main

The protein structural space has been rendered substantially more accessible with DeepMind’s AlphaFold and the AlphaFold Protein Structure Database1,2,3. AlphaFold was originally trained and parameterized for single protein chains only, but it has since been retrained for multimer applications4. The recent CASP15-CAPRI round of blind structure prediction has shown widespread use of its inference engine in the modeling of protein assemblies, with notable success5,6. After demonstrating a proof of concept with the application of massive AlphaFold sampling to protein–peptide interactions7, the AFsample tool was successfully applied to the modeling of protein complexes, including difficult-to-model nanobody complexes8,9, and ranked first in the CASP15-CAPRI assembly modeling category5. Very recently, it was shown that the massive sampling approach can also be applied to binding as specific as antigen–antibody interactions10. For monomeric structures too, increased sampling can help in the investigation of conformational variability11. In addition, it has become evident that increasing the number of recycles may also improve prediction quality12, but at the cost of prolonged computing times for every single prediction. Overall, massive sampling comes at a high cost: it cannot run in parallel and is highly demanding in graphics processing unit (GPU) resources and time, making it impractical to run even for dedicated research teams.

The computing infrastructures that host GPU clusters and provide resources for such high computing demands often impose restrictive job walltimes because of the high demand on these clusters, preventing prolonged AlphaFold calculations. For large assemblies, these walltimes may even prevent the completion of a ‘standard’ AlphaFold-Multimer run of 25 predictions.

In this Brief Communication we present MassiveFold, which combines the framework of AlphaFold1 with the enhanced sampling of AFsample8 and the added functionality of ColabFold12. MassiveFold is a parallelization engine that calls a structure prediction tool, either AFmassive, an extended version of AFsample that we developed alongside MassiveFold, or ColabFold, and then post-processes the results. Other structure prediction engines can be integrated into MassiveFold in the future, provided they support massive sampling. MassiveFold includes all versions of neural network (NN) models released by AlphaFold so far, offers multiple parameters that increase structural diversity (a full list is provided in the Methods), and can be instructed to keep only the results of the most promising predictions. The program can run many instances in parallel, down to a single prediction per GPU, thereby making optimal use of the available computing infrastructure and substantially reducing the time required to obtain prediction results, from several months to hours. MassiveFold is easy to install through a conda environment and easy to use, running from a single command line with a JavaScript Object Notation (JSON) parameter file.
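For illustration, a run could be launched along the following lines (a minimal sketch; the flag names shown here are illustrative assumptions, and the authoritative usage is documented with the MassiveFold distribution):

  # Hypothetical invocation: 67 predictions per NN model with AFmassive,
  # with diversity and infrastructure settings taken from a JSON file.
  massivefold --input target.fasta \
              --run my_run \
              --tool AFmassive \
              --predictions_per_model 67 \
              --parameters massivefold_params.json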

To enable full access to the diversity parameters, MassiveFold integrates an optimized parallelization that consists of three parts (Fig. 1): (1) alignment computation on a central processing unit (CPU), (2) structure inference split into many batches on GPUs and (3) a final post-processing step on a CPU that gathers the results, ranks all the predictions and generates plots (details are provided in the Methods).

Fig. 1: Computing processes handled automatically by MassiveFold.

The provided inputs are the FASTA sequence(s) and parameter options for AFmassive or ColabFold. MassiveFold then runs the alignments on a CPU, producing multiple sequence alignments (MSAs), and divides the structure predictions for massive sampling into batches to be run on GPUs. After completion, MassiveFold automatically gathers all predictions, ranks them following the AlphaFold ranking confidence score, the predicted template modeling score (pTM) and the interface predicted template modeling score (ipTM), and generates plots.

The post-processing of MassiveFold assembles all prediction results and produces several plots. These include the well-known predicted local distance difference test (pLDDT) and predicted aligned error (PAE) plots following the AlphaFold and/or ColabFold coloring schemes (Supplementary Fig. 2a,b), as well as the ColabFold alignment depth plots (Supplementary Fig. 2b), even if ColabFold was not selected as the inference engine. In addition, MassiveFold plots the distribution of confidence scores per AlphaFold NN version (Fig. 2a), per individual NN model (Fig. 2d–f) or all together (Supplementary Fig. 3). Because MassiveFold can be run with different parameter sets, a plot comparing the distribution of confidence scores between these sets can also be generated (Fig. 2b). The final plot shows the evolution of the AlphaFold confidence score over the recycling steps, together with the distance between consecutive structures, which is compared to the early-stop tolerance parameter (Fig. 2c).
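As an indication of how such distributions can be derived from the gathered output, the following minimal Python sketch plots per-NN-version score densities. It assumes an AlphaFold-style ranking_debug.json mapping prediction names to ranking confidence scores and a naming convention that encodes the NN version (both assumptions); it is not the MassiveFold plotting code itself.

  import json
  from collections import defaultdict
  import matplotlib.pyplot as plt

  # Assumption: multimer-style ranking file, {"iptm+ptm": {name: score}, ...}.
  with open('ranking_debug.json') as fh:
      scores = json.load(fh)['iptm+ptm']

  # Assumption: names such as 'model_1_multimer_v3_pred_12' encode the version.
  by_version = defaultdict(list)
  for name, score in scores.items():
      version = name.split('_pred_')[0].split('_')[-1]  # e.g. 'v3'
      by_version[version].append(score)

  for version, values in sorted(by_version.items()):
      plt.hist(values, bins=30, density=True, alpha=0.5, label=version)
  plt.xlabel('Ranking confidence')
  plt.ylabel('Density')
  plt.legend()
  plt.savefig('ranking_confidence_per_version.png')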

Fig. 2: Plots generated by MassiveFold using AFmassive for structure prediction for CASP15 multimeric target H1140.

a, Ranking confidence density for each of the three NN versions currently available, running 67 predictions per NN model, with default parameters, that is, without diversity parameters activated. b, Ranking confidence distributions for two sets of parameters, running 67 predictions per NN model. c, Recycle plot of a single prediction, with an early-stop tolerance set to 0.1 and at most 1,000 recycling steps performed. Ranking confidence is shown in red and the distance to the previous structure in blue. As this distance fell below the early-stop tolerance (shown in gray), early stop was triggered at step 164. d–f, Boxplots of the ranking confidence for each NN model (15 for multimers, five for monomers) sorted by the highest ranking confidence: five predictions per NN model, default parameters (d); 67 predictions per NN model, default parameters (e); 67 predictions per NN model, activating dropout in the Evoformer and structure modules, and not using templates (f). Each box in cyan extends from the first quartile to the third quartile, with a green line at the median; the whiskers reach out to the furthest data point within 1.5 times the interquartile range from the box, and outliers (green crosses) lie beyond the whiskers.


The score distribution and recycle plots show the prediction behavior, as they highlight diversity in the AlphaFold confidence score as a function of the applied NN model. Figure 2d shows the diversity in the predictions for a default run of MassiveFold for CASP15 target H1140 (ref. 6). Here, 75 structures were generated (five per NN model), with the highest score not exceeding 0.6. Extending the calculation to 1,005 structures (67 per NN model, no other changes in parameters) already produces a few outliers with confidence scores above 0.8 (Fig. 2e). The distribution improves further when dropout is activated and templates are excluded (Fig. 2f), demonstrating the added value of the massive sampling strategy. The figures also show that, in this instance, the v1 NN models were the only ones to produce high-confidence structures, so the computing time could have been reduced by extending the sampling of only the first five NN models (all v1).

An additional approach to massive sampling is through the recycling parameters, which play a non-negligible role in diversity generation. Figure 2c shows the recycling behavior of a structure prediction for CASP target H1140, using AFmassive with dropout and without templates, with an early-stop tolerance of 0.1, while allowing up to 1,000 recycling steps. The figure shows low confidence scores for the first 160 recycling steps, which then suddenly jump to 0.846 and 0.908. With an early-stop tolerance of 0.5, only four of the ten best predictions show this jump, as opposed to all ten for an early-stop tolerance of 0.1 (Supplementary Table 1). Extensive recycling should therefore be considered a viable complement to massive sampling. Splitting the computation with MassiveFold gives easy access to such an exploration.
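The early-stop criterion amounts to the following logic (a simplified Python sketch of the behavior described above, not the AFmassive implementation; run_one_recycle is a hypothetical callable standing in for one recycling pass):

  def recycle_with_early_stop(run_one_recycle, max_recycles=1000, tolerance=0.1):
      """Recycle until the structure stops moving or the step budget is spent.

      run_one_recycle() is assumed to return the current structure and its
      distance to the structure from the previous recycling step.
      """
      for step in range(max_recycles):
          structure, distance_to_previous = run_one_recycle()
          # Early stop: consecutive structures closer than the tolerance.
          if distance_to_previous < tolerance:
              break
      return structure, step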

MassiveFold can use either AFmassive or ColabFold as inference engines. In both cases, however, outliers with high confidence scores will only be generated by using a massive sampling strategy with diversity parameters activated (Fig. 2 and Supplementary Fig. 4).

Recently, DeepMind published AlphaFold313, an all-in-one tool for structure prediction of biomolecular interactions, which the authors claim improves protein–protein complex predictions compared to their latest release, AlphaFold2.3, in particular for antibody–antigen predictions. To show the added value of using massive sampling via MassiveFold, we computed predictions with AlphaFold3 for the six CASP15 targets highlighted in Wallner’s massive sampling manuscript9 and for two additional CASP15 antibody–antigen targets for which massive sampling produced better models that were, however, not recognized as such9. Supplementary Table 2 shows that AlphaFold3 only marginally outperforms massive sampling for three of the eight targets, whereas MassiveFold produces good models for seven of them. For the remaining target (H1167), neither approach produces acceptable models. However, the main advantage of AlphaFold3 is that, for antibody–antigen targets, it produces a more reliable score than AlphaFold2, which fails to score these predictions efficiently, as demonstrated in refs. 9,14. Depending on the target, either MassiveFold or AlphaFold3 may produce the best models, highlighting the benefit of integrating AlphaFold3 into MassiveFold, which we intend to do should the code be released by the authors.

MassiveFold was designed to facilitate access to the diversity parameters and to optimally manage the computing. It takes full advantage of a GPU cluster for the inference step, while using a CPU for the multiple sequence alignments and post-processing, which do not require a GPU. It is also optimized for use on a single-GPU machine, because massive sampling jobs can be run at low priority, allowing higher-priority jobs to insert themselves into the computing queue. MassiveFold is ready for a massive exploration of the AlphaFold protein structure prediction landscape.

Methods

MassiveFold was developed in bash and Python 3. MassiveFold v1.2.5 integrates the optimized parallelization into CPU and GPU jobs, including post-processing for reranking and plot generation (Fig. 1). The user can select either AFmassive v1.1.5 or ColabFold v1.5.512 for structure inference, both of which are included in the MassiveFold distribution.

AFmassive was developed in Python 3. It is an extended version of AFsample8 based on AlphaFold v2.3.2. It integrates all versions of the AlphaFold NN models currently available, that is, one for monomers and three for multimers, and includes additional parameters (listed in the Supplementary Notes). These parameters can be set in the AFmassive JSON parameter file (or in the ColabFold JSON parameter file when ColabFold is used).

Diversity parameters

The diversity parameters included in MassiveFold are the following: all NN models released by AlphaFold so far (including previous versions, that is, five for monomers and 15 for multimers); the activation of dropout in the Evoformer and structure modules; the use of templates; the number of recycling steps; and the early-stop tolerance threshold, recycling being stopped if the distance between the current and preceding structures falls below this threshold. In addition, MassiveFold accepts an additional JSON file as input, specifying individual dropout rates (Supplementary Fig. 1 presents a list of rates), thereby providing the user with additional options to increase structural diversity. An illustrative parameter file is sketched below.
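As a purely illustrative sketch (the key names here are hypothetical; the actual parameter names are listed in the Supplementary Notes), a set of diversity parameters in JSON could resemble:

  {
    "dropout": true,
    "dropout_structure_module": true,
    "dropout_rates_filename": "dropout_rates.json",
    "templates": false,
    "max_recycles": 1000,
    "early_stop_tolerance": 0.1
  }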

MassiveFold process

MassiveFold v1.2.5 integrates parallelization based on the Simple Linux Utility for Resource Management (SLURM) workload manager. Input given on the command line includes a FASTA file with the protein sequence(s), a JSON parameter file, the inference engine to use (AFmassive or ColabFold so far) and the desired number of predictions per NN model, divided into batches of automatically calibrated or manually set size. An example of the JSON parameter file is provided; it contains the parameters for the computing infrastructure and the individual runs, including, most importantly, the diversity parameters. The autocalibration adapts the batch size following an initial basic run of MassiveFold (for example, with five predictions per NN model) to keep the process duration under the walltime: the maximum prediction time observed is compared with the specified walltime to automatically calculate the number of batches. The minimum number of batches is the number of NN models used.
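A minimal Python sketch of this calibration logic (an illustration under the assumptions just stated, not the MassiveFold code; all times are in the same unit):

  import math

  def calibrate_batches(total_predictions, max_prediction_time, walltime, n_nn_models):
      # Largest batch whose worst-case duration still fits within the walltime,
      # taking max_prediction_time from an initial small run.
      batch_size = max(1, int(walltime // max_prediction_time))
      n_batches = math.ceil(total_predictions / batch_size)
      # The minimum number of batches is the number of NN models used.
      return max(n_batches, n_nn_models)

  # Example: 67 predictions for each of 15 multimer NN models, at most
  # 30 min per prediction, 24 h (1,440 min) walltime.
  print(calibrate_batches(67 * 15, 30, 1440, 15))  # -> 21 batches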

Once these parameters are set, the pipeline is as follows (Fig. 1): (1) the multiple sequence alignments run on CPU cores; (2) the structure inference processes each batch of calculations on a single GPU, the number of GPUs used corresponding to the number of batches to run; (3) the post-processing runs on CPU cores to gather and rank the predictions (following the AlphaFold metrics) and to generate plots.

In step (1), the alignments are performed either with JackHMMer and HHblits when using AFmassive, or with MMseqs2 when using ColabFold. In step (2), the structure inference is performed by either AFmassive or ColabFold. In step (3), if ColabFold is used, the outputs are converted to AlphaFold’s output format: structure file names are prefixed by their ranking index, a ranking_debug.json file is created and pickle file names are reformatted. AFmassive uses this format natively. In both cases, a ‘light’ pickle option is available, which substantially reduces the size of the pickle files while keeping the main information. Steps (2) and (3) only start once the previous step is completed. Note that pre-computed alignments can be used by placing them in the output folder; they will be detected and not computed again unless a recalculation is forced.
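This chaining of steps maps naturally onto SLURM job dependencies. A minimal bash sketch of the pattern (the script names and the N_BATCHES variable are illustrative; this is not the MassiveFold submission code):

  # Step 1: alignments on CPU.
  align_id=$(sbatch --parsable alignment.slurm)

  # Step 2: one inference job per batch, each on a single GPU, started
  # only after the alignments completed successfully.
  batch_ids=""
  for batch in $(seq 1 "$N_BATCHES"); do
      jid=$(sbatch --parsable --dependency=afterok:"$align_id" \
            --gres=gpu:1 inference.slurm "$batch")
      batch_ids="$batch_ids:$jid"
  done

  # Step 3: post-processing on CPU once every batch has finished.
  sbatch --dependency=afterok"$batch_ids" post_processing.slurm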

In addition, a gather_runs.py script is provided to let the user collate several runs of MassiveFold: it gathers all the predictions and ranks them. This was used during the generation of MassiveFold predictions for CASP16, allowing a consolidated ranking over all eight applied run conditions (including ranking_debug.json, pdb and pickle files), for a total of up to 8,040 predictions per CASP16 target.
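For example (the arguments shown here are hypothetical; the script’s actual options are documented in its help):

  # Collate the predictions of several MassiveFold runs into one ranking.
  python gather_runs.py --runs run_default run_dropout run_no_templates \
                        --output consolidated_ranking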

Calculation

All inference calculations were performed on V100 or A100 GPUs. The five sets of parameters used for the massive sampling generation of predictions with AFmassive are listed in the Supplementary Notes, as well as the two sets of parameters used for ColabFold.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this Article.