Abstract
Molecular dynamics (MD) simulations produce a substantial volume of high-dimensional data, and traditional methods for analyzing these data pose significant computational demands. Advances in MD simulation analysis combined with deep learning-based approaches have led to the understanding of specific structural changes observed in MD trajectories, including those induced by mutations. In this study, we model the trajectories resulting from MD simulations of the SARS-CoV-2 spike protein-ACE2, specifically the receptor-binding domain (RBD), as interresidue distance maps, and use deep convolutional neural networks to predict the functional impact of point mutations, related to the virus’s infectivity and immunogenicity. Our model was successful in predicting mutant types that increase the affinity of the S protein for human receptors and reduce its immunogenicity, both based on MD trajectories (precision = 0.718; recall = 0.800; \(\hbox {F}_1\) = 0.757; MCC = 0.488; AUC = 0.800) and their centroids. In an additional analysis, we also obtained a strong positive Pearson’s correlation coefficient equal to 0.776, indicating a significant relationship between the average sigmoid probability for the MD trajectories and binding free energy (BFE) changes. Furthermore, we obtained a coefficient of determination of 0.602. Our 2D-RMSD analysis also corroborated predictions for more infectious and immune-evading mutants and revealed fluctuating regions within the receptor-binding motif (RBM), especially in the \(\beta _{1}^{\prime }/\beta _{2}^{\prime }-C\) loop. This region presented a significant standard deviation for mutations that enable SARS-CoV-2 to evade the immune response, with RMSD values of 5Å in the simulation. This methodology offers an efficient alternative to identify potential strains of SARS-CoV-2, which may be potentially linked to more infectious and immune-evading mutations. Using clustering and deep learning techniques, our approach leverages information from the ensemble of MD trajectories to recognize a broad spectrum of multiple conformational patterns characteristic of mutant types. This represents a strategic advantage in identifying emerging variants, bypassing the need for long MD simulations. Furthermore, the present work tends to contribute substantially to the field of computational biology and virology, particularly to accelerate the design and optimization of new therapeutic agents and vaccines, offering a proactive stance against the constantly evolving threat of COVID-19 and potential future pandemics.
Similar content being viewed by others
Introduction
In bioinformatics, emerging solutions have transformed the growing volume of biological omics data into knowledge, through methodologies based on machine learning1. Deep learning (DL), a subfield of machine learning, has been successful in extracting meaningful patterns from data, as it can handle high-dimensional representations of data by modeling high-level abstractions2. This is achieved from multiple nonlinear transformations in neural networks, where the output of one node is connected to the input of each succeeding node. This arrangement of neuronal units forms dense layers, constituting a fully connected network3.
In general, deep learning-based pipelines for molecular dynamics simulations are composed of a feature engineering step that reduces the high-dimensional structure space to a low-dimensional feature space; pre-processing that adapts the simulation data to a representation appropriate for algorithm input; a deep neural network composed of feature extraction and classification modules; and a stage of evaluation and interpretability of the predictions made by the model2.
In structural bioinformatics, convolutional neural networks (CNNs) have been widely used to predict protein structure or function1, which can be attributed to two primary aspects: (i) CNNs benefit from the symmetries of structural representations (e.g., distance maps) through translation invariance, effectively learning patterns regardless of their spatial location in the input; (ii) convolutional layers primarily capture local dependencies within their receptive fields through learned filters, focusing on small spatial regions of the input rather than the entire global structure at once2,4,5. These architectures have represented the state-of-the-art in structure prediction1,6 and in the identification of functional states from MD trajectories7,8,9.
Molecular dynamics (MD) is another relevant computational method to investigate the dynamic behavior of biomolecular systems at the atomic level10. MD simulations combined with experimental data have allowed the investigation of biological processes that would be difficult or even impossible to observe experimentally11.
Given the voluminous and high-dimensional nature of MD trajectories, advances in MD simulation analysis applied to molecules, such as cluster analysis, have enabled more efficient exploration of the conformational states in protein complexes12. Integrating this method with DL-based approaches improves the prediction of intrinsic protein movements, reducing the risk of overlooking subtle but significant conformations, in which manual analysis could result9.
Moreover, remarkable progress has been made in the use of methodologies for in-depth analysis of MD trajectories. For example,7 and9 combined pixel representations and CNNs to identify functional states in MD trajectories of G protein-coupled receptors (GPCRs), as well as the residues underlying the active states, elucidating the activation mechanisms of these receptors. Both studies transform data extracted from trajectories, including atom coordinates (x, y, and z), into pixel-based representations using RGB components, generating a representation known as pixel maps. Although the model was successful at classifying the ligands, the pixel representation has the disadvantage of being sensitive to translational and rotational movements of proteins, requiring additional preprocessing of the trajectory to remove bias.
Distance maps (DMs) have provided an alternative to this problem as they are invariant to protein rotation and translation. By using distance maps to reduce high-dimensional MD data while preserving crucial spatial information, we gain an advanced approach for analyzing the dynamic behavior of proteins over time. The inherent equivariance of distance maps enables the model to generalize more effectively to unseen structures, regardless of their orientation or position, compared to the purely geometric13 or pixel-based representations7,9 used in earlier studies. This approach also minimizes the need for extensive preprocessing to remove bias or additional data augmentation steps to ensure robustness against transformations. Furthermore, their low dimensionality is desirable in AI applications5.
Recently, 3D ResNets have been used to identify conformational changes associated with spatiotemporal patterns of ligands bound to the \(\beta\)2-adrenergic receptor (\(\beta\)2AR), a specific type of GPCR8. For this purpose, the model used time series of protein distance maps (PDMs) derived from the trajectories of \(\beta\)2AR-ligand complexes, as input. The study revealed that ligands of the same type tend to share the same dynamics, characterized by conformational changes throughout the GPCR system induced by residues at the binding site8.
Despite the relevant contributions of these works, it is possible to observe that the analyzed conformations involve significant rearrangements within the protein structure, often referred to as large-scale conformational changes8,9. However, small localized changes, such as loop movements within protein motifs, are crucial for protein stability, functionality, and ligand binding and merit significant attention.
In this study, we focus on the Spike (S) protein of SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) as a case study and model the trajectories from MD simulations of the SARS-CoV-2 S protein interacting with human receptor, specifically targeting the receptor binding domain (RBD) and represent them as interresidue distance maps. Subsequently, we developed a deep learning-based system to predict the functional impact of point mutations that could potentially increase the protein’s affinity for the human receptor while reducing its immunogenicity.
Our hypothesis is that specific point mutations in the RBD region should promote non-trivial conformational changes that are indicative of certain variants or even entire lineages of the virus. To support this hypothesis, a literature review was performed. The results highlight the importance of these mutations in the conformation of the virus and its pathogenicity14,15,16,17. In vitro data show that mutations in the SARS-CoV-2 and SARS-CoV-1 RBDs are capable of evading neutralizing antibodies18,19. In addition to structural changes, we discuss how the proposed conformational changes can influence transmissibility and immune response, highlighting the importance of these mechanisms for effective control of viral spread.
Some SARS-CoV-2 strains can increase infectivity and transmissibility, posing a challenge for antiviral drug and vaccine projects against COVID-19 in a short period of time. Because it is a hypermutable virus, its transmissibility tends to favor the emergence of variants with critical mutations and shared zoonotic biological characteristics. Therefore, identifying the patterns that characterize the conformations exhibited by these variants could help to accurately predict potential strains linked to the virus. These findings could be valuable in the development of new drugs and vaccines against COVID-19 or even for future epidemics.
Previous research has successfully employed graph neural networks13 or self-attention mechanisms20 to predict changes in binding affinity between RBD and ACE2, without requiring MD simulations beyond training data. However, these architectures require structural information from both viral and human proteins, making them dependent on the complex. Furthermore, the effects of some mutations (e.g., S:V367F, S:S477N) may be time-dependent and emerge only after prolonged molecular interactions, requiring simulations of the order of nanoseconds or longer to fully capture their functional impact21,22,23,24.
The inherent flexibility of the SARS-CoV-2 spike protein (S) is pivotal in modulating its binding affinity to the ACE2 receptor, directly influencing viral virulence22. Several studies underscore the importance of this dynamic adaptability, particularly within the receptor-binding motif (RBM), for optimizing interaction with the human receptor25. Molecular dynamics (MD) simulations have revealed a heightened flexibility in the receptor binding domain (RBD) of SARS-CoV-2 compared to its counterpart in SARS-CoV, notably in key regions of ACE2 recognition comprising residues Gln474—Gly485, Cys488—Phe490 and Ser494— Tyr50524,26. These simulations provide a nuanced understanding of the dynamic ACE2-RBD interaction network, revealing subtle conformational changes (e.g, flexible loops) that remain elusive to static structural analyzes22,27,28.
In this sense, the integration of MD simulations offers a comprehensive understanding of the dynamic interaction landscape between the SARS-CoV-2 spike protein’s receptor-binding domain (RBD) and the ACE2 receptor29. This approach reveals a broader spectrum of conformational patterns across various mutant types, making it crucial for unraveling the time-dependent structural evolution of RBD. Such insights are crucial to understanding the mechanisms of viral adaptation and immune evasion in the context of emerging variants30.
Furthermore, we verified whether it is possible to anticipate the effects of mutations by analyzing the most probable conformation along the MD trajectory. This approach avoids the need for a large number of simulation frames as input to the AI model, even though the model has been trained with MD trajectories.
Methods
System preparation and molecular dynamics protocol
We obtained the crystal structure of the SARS-CoV-2 spike receptor binding domain (RBD) bound to angiotensin-converting enzyme 2 (ACE2) from PDB: 6M0J and selected amino acid residues 333 to 527, corresponding to the RBD, using PyMOL (PyMOL Molecular Graphics System, Version 2.4 Schrödinger LLC)31. In total, we performed 37 systems based on the RBD FASTA sequence and used a homology model to determine the 3D structures of proteins in SWISS-MODEL32.
We estimated protonation states for all residues in both systems from a standard physiological blood pH of 7.4, a salinity of 0.15 M, and internal and external relative permittivities of 10 and 80, respectively, using the H\({++}\) web server33. We parameterized the systems using the tleap tool from AmberTools2134 and solvated them in a cubic box with a minimum edge distance of 12Å. We describe the water, protein, and glycidic fractions of the systems with the AMBER force fields TIP3P, ff14SB, and GLYCAM-06, respectively.
We also added ions \(Na^{+}\) and \(Cl^{-}\) to reach a physiological salinity of 0.15 M. After solvation, minimization, equilibration, and productive MD simulation steps were performed using NAMD 2.13 CUDA verbs35. with the AMBER force field. We performed all simulations with a time step of 1 fs using an NPT ensemble, with the temperature and pressure kept constant at 310 K and 1 atm, respectively, by means of a Langevin thermostat and a Langevin piston. We used periodic boundary conditions and a 10 Å electrostatic interaction cut-off point for non-bonded interactions and calculated the long-range electrostatic interactions using Particle Mesh Ewald.
A relaxation protocol was applied under NPT conditions to all systems before performing the productive MD simulation, and a multistep equilibration protocol was used. This protocol consisted of (1) 500 minimization steps with harmonic constraints on protein atoms; (2) 500 steps of unconstrained minimization; (3) 300 ps equilibrium with harmonic constraints on protein atoms; (4) 300 ps equilibrium with harmonic constraints on the backbone atoms; (5) 300 ps of unrestricted balance; and (6) 1 ns preproductive MD simulation with reset speeds and no restrictions. Finally, we performed three independent 100 ns MD simulations for each S RBD protein system with the corresponding mutations on final relaxation coordinates on the SDumont supercomputer at the Brazilian National Laboratory for Scientific Computing (LNCC).
Trajectory analysis
We performed a root mean square fluctuation analysis per residue (RMSF) against the average frame for the \(C\alpha\) carbons, aligning the entire RBD with the least mobile residues (333–437, 454–456, 492–494 and 509–526). We estimated RMSFs for each of the three replicates individually, as well as for their combined data. Additionally, to evaluate the convergence between the different simulations of each system, two-dimensional (2D) root mean square deviation (2D-RMSD) plots relative to the protein backbone were generated using the cpptraj plug-in36 of AmberTools 2134.
Database
Our database consists of 38 (thirty-eight) distinct systems, each derived from MD simulations, with most systems containing between 1 (one) and 3 (three) point mutations per mutant (a comprehensive table listing all systems utilized in this study, including their specific RBD mutations, is presented in Supplementary Table 1). Among these, 17 (seventeen) were selected for machine learning-based analyzes, which involved approximately 85K frames; These included the original 2019-nCoV strain, also known as the “wild type” (WT)37; neutral mutants; and variants currently monitored by the WHO (VBM)38. These variants, previously classified as variants of concern (VOCs) or variants of interest (VOIs), are known for their increased infectivity and transmissibility; We assign the label “\(++\)” to the corresponding systems (Table 1).
Furthermore, for systems that represent neutral mutants - those that neither increase receptor affinity for ACE2 nor enhance resistance to antibodies - we label them as “\({-}{-}\)”. The list of single-type mutations, presented in the S:mutation (label) format55,56, is as follows: S:V445L (\({-}{-}\)), S:K417Y (\({-}{-}\)), S:N439R (\({-}{-}\)), S:Q498I (\({-}{-}\)), S:S494K (\({-}{-}\)), S:Y489F (\({-}{-}\)), and S:Y505F (\({-}{-}\))57,58. We labeled the systems according to the published literature. The wild-type strain was labeled “\({-}{-}\)”.
Complementarily, we conducted a comprehensive analysis of all molecular dynamics simulations to characterize the systems based on their mutant types, with particular attention to mobility changes in key regions such as the RBD, RBM, and loop regions. In this context, systems with mutant types that exhibit decreased affinity for ACE2 and increased antibody resistance, labeled as “\(-+\)”, as well as those in which the mutation demonstrates increased affinity for ACE2 and decreased antibody resistance, labeled as “\(+-\)”, were also analyzed. Table 2 provides a summary of the mutant types present in the database, along with their associated effects on viral binding affinity and immune evasion.
The systems correspond to triplicates of 25,000-frame trajectories (dcd files) obtained through 100 ns simulations, totaling 75,000 frames (300 ns). We sampled 5000 frames using a skip of 15 frames. Next, we separated each of these 5000 frames into .pdb files and generated 5 (five) clusters from the hieragglo average linkage algorithm59, that is, using the average distance between members of two clusters calculated in cpptraj36 and a cutoff of 2 Å for each system. Thus, although there is a reduction in the sample space, the most likely conformations tend to occur with greater probability.
Feature transformation
Our approach focuses on a restricted data representation to structural information, specifically pairwise distances within the receptor binding domain (RBD) of the SARS-CoV-2 S protein (Fig. 1a). From the MD trajectories, we generate 2D distance maps that serve as the main geometric descriptors of the structure throughout the simulation. Distance maps are graphical representations that detail the spatial arrangements between all pairs of amino acid residues within a molecule, provided as a 2D matrix of the real-valued distances between residues (distance matrices)60,61,62. Consider a set of \(N\) points in a three-dimensional space, where each point is defined by its coordinates \((x_i, y_i, z_i)\) and is represented by the position vector \(\vec {r}_i\). The Euclidean distance \(\delta _{ij}\) between the \(i\)-th and \(j\)-th points is defined by the following equation:
In the context of structural biology, \(\delta _{ij}\) denotes the Euclidean distance between the \(i\)-th and \(j\)-th amino acid residue. Therefore, the pairwise distance matrix \([\delta _{ij}]_{N \times N}\) is constructed, where each element \(d_{ij}\) represents the Euclidean interresidue distances, \(d_{ij}\). The matrix is symmetric, with \(\delta _{ij} = \delta _{ji}\) for all \(i, j\), and its diagonal elements are zero, indicating that the distance from any residue to itself is zero60,63. The complete distance matrix can be represented as
We selected the coordinates (x, y, z) of 194 \(\alpha\) carbon atoms5,60 that constitute the RBD, resulting in distance matrices with dimensions of \(194\times 194\) (Fig. 1b,c). The asymptotic computational complexity of this method can be expressed as \(O(n^2)\), where n indicates the total number of residues in the protein. We converted these matrices into a 2D image (.png format) using the Matplotlib library in Python. To ensure compatibility between map dimensions and model input, we resized the dimensions of the maps to \(224\times 224\) pixels64 by adding padding zeros. We implemented the algorithms for generating the DMs in Python (version 3.9.13).
CNN-based model development
Since the problem of identifying subtle conformational changes in MD trajectories falls under the AI-complete category, and considering that DMs have a grid-like topology2, we use the DL technique known as convolutional neural networks (CNN)65. We choose the VGG architecture, known as Visual Geometry Group64 in our methodology. Specifically, we implemented the VGGNet-B variant, characterized by a configuration of 13 trainable layers that include weights.
We represent DMs as 4D tensors, structured based on image dimensions, number of image channels (RGB), and batch size (BS)66. In this context, we employ a batch size of 64 samples, as previous studies have shown that batch sizes ranging from 32 to 256 yield improved results67, as also observed in related applications9. Moreover, we preprocess DM by normalizing the pixel values to a range of \([0-1]\), which is more suitable for efficient processing by neural networks2.
Our implementation follows the VGG configuration64 (Fig. 1d), except for a slight variation, in which we use a single neuron in the output with a sigmoid activation function, thus transforming the CNN into a binary classifier, resulting in 129 M parameters. After each conv. and pooling layer64, we incorporated batch normalization (BN) and rectified linear unit (ReLU) activation. To avoid the problem of internal covariate shift and regularize the model, we added BN before the activation layer68.
Additionally, we use dropout after FC layers to mitigate potential overfitting and improve the generalizability of the model2. We choose a dropout rate of 0.5, as values within the range of \(\left[ 0.3-0.6\right]\) tend to reduce the error rate during training69. Dropout is particularly beneficial for datasets exceeding 5K samples, such as the data set of the present problem69. We developed the model using well-established ML and neural network libraries such as TensorFlow (version 2.15.0)70 and Keras66.
Model training
We partitioned the database into four main subsets of trajectories: WT, VOC, VOI, and neutral. For training the model, we used DMs obtained from MD trajectories related to wild-type (WT) and Beta and Delta variants of concern (VOCs). We selected these variants because they share most of the mutations observed in VOI and are commonly associated with an increased affinity for ACE2 and resistance to natural antibodies or vaccines71.
During training, we use cross-validation (CV), a heuristic to minimize the model’s generalization error and optimize hyperparameters72. We define a percentage \(\gamma < 0.5\) of the training data as a reference to the validation72, and used k-fold CV.
Deep learning-based pipeline. (a) Receptor Binding Domain. (b) Distance map referring to a frame of the WT MD trajectory. (c) Distance map referring to a frame of the Beta variant MD trajectory. It is evident that both maps are indistinguishable to the naked eye, highlighting the intricate nature of the problem. (d) VGG-B architecture64. (e) Feature map extracted from the initial block of conv layers. (Conv1_2), for the centroid of cluster 0 of the Gamma variant. (f) Projection of the high-intensity pixels of the feature map onto the 3D structure of the RBD.
We randomly partition the training set into k mutually exclusive subsets of the same size (n/k), where n is the total number of instances. Thus, validation takes place for one of the subsets, while the remaining \(k-1\) serve to train the model. This process occurs k successively, and we estimate the average cross-validation error rate (binary cross-entropy loss) as a reference to optimize the model hyperparameters72. We set \(k = 5\) since \(\gamma \ge 0.1\) is commonly recommended and proves effective in various applications72. We select the optimal parameter configuration based on the error rate estimates73. After adjusting the parameters, we utilize the entire training data set to make predictions on the independent test set.
We conducted the training in Google’s virtual environment, Colab, which provided access to a Jupyter Notebook. The computational resources included an NVIDIA A100 GPU with 40 GB of VRAM and an additional 89.6 GB of RAM.
Model evaluation
The test set comprises 14 (fourteen) systems, each distributed equally across classes, totaling 70,014 DMs. We evaluated mutations associated with VOIs such as Iota, Kappa, Lambda, Mu, and Theta, as well as VOCs such as Gamma and Omicron. Notably, key mutations including S:L452R, S:T478K, S:E484K, and S:N501Y were commonly observed in most of these systems, leading us to assign them the label “\(++\)”41,48,71,74. Furthermore, we analyzed point mutations S:V445L, S:K417Y, S:N439R, S:Q498I, S:S494K, S:Y489F, and S:Y505F, which are considered neutral57, and assigned them the label “\({-}{-}\)”, thus completing the evaluation of the test set.
During validation, our optimized model achieved an average error rate of 1% with the following specific parameters: learning rate of 1E-3, dropout of 0.5, BS of 64 and 100 training epochs (more details are provided in the Supplementary Material). We used the receiver operating characteristic (ROC) curve (see Fig. 1a in the Supplementary Material) to determine the optimal operating threshold, 0.5, which resulted in a better balance between the recall and false positive rate (1-specificity)75.
In the testing, we derived performance metrics from the confusion matrix, including precision (prec), true positive rate (TPR)/recall (rec), false positive rate (FPR), \(\hbox {F}_\beta\) score and Matthews’ correlation coefficient (MCC)75. In relation to the \(\hbox {F}_\beta\) score, we employ the \(\hbox {F}_1\) score, which balances precision and recall by setting \(\beta =1\). Furthermore, we calculate the area under the ROC curve (AUC). We calculated complementary metrics such as precision and recall for the test set, since relying solely on the error rate can be limiting as a measure of discriminability. Furthermore, we sought to determine the success rate of the model for each class of problem75.
Results and discussions
2D-RMSD analysis of the impact of mutations in the RBD
We developed an analysis that focused on structural variations between the original S protein and its strains, which exhibit different levels of affinity for interaction with the cellular receptor ACE2 and antibodies. In this sense, we generated 2D-RMSD plots to identify regions of high mobility, highlighting differences in conformational sampling between different mutant types in the RBD. Since RBD is the main target of vaccine-induced neutralizing antibodies (nAb), our objective was to understand how these mutations impact interactions between ACE2 and other antibodies76. It is relevant to determine whether these mutations compromise the effectiveness of mRNA vaccines and natural immunity, given the ongoing emergence of new variants40,76. The following table shows the labels adopted for each mutation phenotype.
Our MD analysis of all systems in the database, categorized by their mutant types, revealed that the most significant residue fluctuations and standard deviations were associated with the region spanning residues 360 to 374, as well as along the RBM. especially the \(\beta _{1}^{\prime }/\beta _{2}^{\prime }-C\) loop (residues 470 to 490), as illustrated in the RMSF plots (Fig. 2a,c).
2D-RMSD analysis. (a) Root-mean-square fluctuation (RMSF) for the entire receptor binding domain (RBD). (b) Two-dimensional root-mean-square deviation (2D-RMSD) graph for the receptor binding motif (RBM). (c) RMSF of the RBM region (residues 438–508), aligned and analyzed exclusively within this region. RMSF values are presented as average (lines) and deviation (shading) across three independent replicates, including an aggregated result of all three for each mutation. (d) 2D-RMSD graph for the \(\beta _{1}^{\prime }/\beta _{2}^{\prime }-C\) loop. The color scheme, ranging from blue to red, represents changes in mobility for the \(C\alpha\) atoms of the most mobile ensembles. This coding helps to distinguish the alignment of mutant poses designated as “\({-}{-}\)”, “\(++\)”, “\(+-\)” (mutations more infectious and less resistant to the antibody), and “\(-+\)”(mutations less infectious and more resistant to the antibody) across all systems and simulations. Each position on the x-axis represents a frame, compared to each frame on the y-axis; the diagonal, always zero, indicates a frame compared against itself.
In the RBM region, we observed a higher standard deviation for mutations characterized by a lower affinity for interaction with the ACE2 cell receptor and increased resistance to antibodies, labeled “\(-+\)”. Notably, fluctuations were pronounced in the RBM loops at positions 484 and 501, as indicated in Fig. 2a. These fluctuations were even more marked in the C-terminal extension region of the loop \(\beta _{1}^{\prime }/\beta _{2}^{\prime }\), as shown in Fig. 2c. The 2D-RMSD graph presents an all-against-all comparison of molecular dynamics poses, with warmer colors denoting increased mobility. In particular, this loop includes position 484, where the critical mutation S:E484K, common to most VOCs identified by the WHO, occurs. We observed a greater fluctuation in the mobility of the \(\beta _{1}^{\prime }/\beta _{2}^{\prime }-C\) loop, particularly associated with the “\(-+\)” mutations, as shown in Fig. 2b,d. Analyzing these movements relative to the RBD and their internal mobility within the RBM provides significant information.
The magnitude of fluctuation across simulations may be related to mutations that exhibit greater antibody resistance, leading to greater diversity of RBM conformations. This increased structural variability may be a strategy adopted by mutations to ’evade’ the host’s immune response. The same pattern is observed for the key mutation S:N501Y, which exhibits high floating points. In particular, the lambda variant (lineage C.37, mutations S: L452Q and S: F490S) and single-form mutations S: F490S, S: F490L and S: G446V consistently showed RMSD values of 5Å throughout the simulation. These findings suggested that these mutations may help the virus ’escape’ from the immune response and reduce the effectiveness of vaccines, potentially promoting the gain of RBM mobility to evade immune responses77,78.
In mutations that increase affinity for the ACE2 interaction and confer antibody resistance, our analyzes revealed a reduced occurrence of conformations compared to the WT strain and neutral mutations denoted “\({-}{-}\)”. This pattern was clearly visible on the RMSF plots (Fig. 2a,c), particularly within the \(\beta _{1}^{\prime }/\beta _{2}^{\prime }-C\) loop (Fig. 2c). Upon analyzing the 2D-RMSD plot, which compares the “\(++\)” and “\({-}{-}\)” systems, it is evident that the “\({-}{-}\)” mutants exhibit a slight increase in mobility within the RBM region, compared to the “\(++\)” mutations. Notably, the blue square representing the “\({-}{-}\)” mutations, there is a discernible increase in warmer colors, indicating enhanced mobility close to 5 Å in the RBM. In contrast, the \(\beta _{1}^{\prime }/\beta _{2}^{\prime }-C\) loop shows increased mobility for the “\({-}{-}\)” mutations, compared to the “\(++\)” mutations, demonstrated by the warmer colors in the 2D-RMSD graph. For the orange square representing the “\(++\)” mutations, an increase in cooler colors is observed, suggesting reduced mobility near 0 Å. These observations may provide insight into why mutations with increased ACE2-binding affinity in S RBD could influence viral infectivity and immunogenicity. These findings align with those presented in Figs. 2c and 2 of the Supplementary Material.
Molecular dynamics trajectory classification
As observed in related works8,9, we evaluated the generalizability of our predictor to unseen conformations in MD trajectories from an independent test set. These trajectories refer to strains of identical lineage to those included in the training set, in this case belonging to the B.1 lineage55. We represent these trajectories using DMs, allowing us to evaluate the predictor’s performance under novel conditions.
To evaluate the potential of a new MD simulation product to match a more infectious and immune-evading variant, it is essential to determine the proportion of correctly predicted instances belonging to the positive class. In this scenario, we use the precision of the positive class, recall, and \(\hbox {F}_1\) score to identify trajectories that exhibit distance patterns similar to those observed in the variants present in the training set75. These patterns may suggest subtle conformational changes in RBD associated with a gain in affinity for ACE215,79,80,81, as well as transmissibility30,82.
Considering the independent test set, we estimate the recall and FPR for the systems assigned the labels \(++\) and \({-}{-}\), respectively. Additionally, we calculated the mean (\(\mu\)) and standard deviation (\(\sigma\)) of the metrics derived by the model for both classes. Figure 3 summarizes the results, demonstrating the effectiveness of the model in predicting instances with the label \(++\). Given \(\mu\) and \(\sigma\), the recall is considerably higher (\(0.800 \pm 0.132\)) than the FPR (\(0.314 \pm 0.148\)), highlighting the need to also reduce the FPR. The model achieved a precision of 0.718, resulting in an \(\hbox {F}_1\) score of 0.757, indicating a balanced performance between precision and recall. From the confusion matrix, the MCC was calculated to be 0.488. Furthermore, the estimated AUC was 0.800. These results provide robust evidence of the predictive capacity of the model, particularly for more infectious and less immunogenic mutant types, underscoring its utility to predict the impact of point mutations on the function of the S protein75.
Comparison with state-of-the-art methods
Mutation-induced binding free energy (BFE), represented as \(\Delta \Delta G_{\text {Bind}}\) or simply \(\Delta \Delta G\), is defined as the difference between the BFE of the mutant type and that of the wild type (WT). Specifically, \(\Delta \Delta G_{\text {Bind}} = \Delta G_{\text {Bind}}^{\text {WT}} - \Delta G_{\text {Bind}}^{\text {MT}}\), where \(\Delta G_{\text {Bind}}^{\text {WT}}\) is the BFE of the WT and \(\Delta G_{\text {Bind}}^{\text {MT}}\) is the BFE of the mutant. This definition reflects the energetic impact of mutation on binding affinity58,83,84,85. Recent studies have established that the transmissibility/infection of viral variants in host cells is proportional to changes in the BFE between the S RBD and ACE285. A positive change in BFE (\(\Delta \Delta G_{Bind} > 0\)) reveals the ability of the mutation to increase binding between S RBD and ACE2, while a negative change (\(\Delta \Delta G_{Bind} < 0\)) or a value close to zero (\(\Delta G^{MT}_{Bind} \approx \Delta G^{WT}_{Bind}\)) suggests reduced or no impact on protein function57,83,85,86.
In this context, our objective was to determine the biological significance of our model’s predictions by comparing them with those produced by the TopNetTree model84,85, a topology-based network tree methodology specifically developed to forecast changes in the binding free energy of protein-protein interactions (PPI) resulting from mutations. In this study, the authors consolidated more than 1.4 million SARS-CoV-2 genomic sequences from patients and identified 683 point mutations specifically within the receptor binding domain (RBD). They also used a comprehensive library of 130 SARS-CoV-2 antibody structures. Using sophisticated techniques such as viral genotyping, algebraic topology algorithms, and deep learning, they evaluated that RBD comutations influence both binding free energy and antibodies interactions85. Their BFE predictions demonstrated a Pearson correlation coefficient of 0.78 when juxtaposed with experimental data83,86,87. This high level of precision underscores the potential of these analyses to reliably predict the impacts of viral mutations on vaccine efficacy and antibody neutralization, thereby informing future therapeutic strategies.
We developed our analysis based on changes in BFE (kcal mol−1) in the RBD-ACE2 complex, sourced from reputable references57,85,86,88. Additionally, for point mutations, we also integrated BFE change data from the online tool “SARS-CoV-2 Mutation Analyzer” (https://weilab.math.msu.edu/MutationAnalyzer/)86. Both studies used the TopNetTree model84. We calculate the average probability, \({\overline{p}}\), obtained from the output layer’s sigmoid function for the frames of the trajectories, which determines the likelihood of the instances belonging to the positive class. To ensure uniformity and scale the values between 0 and 1, similar to the probability values, we normalize the changes values in BFE using the Min-Max technique, denoted \(\Delta \Delta G_{norm}\) (Table 3). Using these calculated values, we developed a scatter plot in which each data point in the scatter plot represents a database-specific mutation or variant, the coordinates representing a tuple of values \({\overline{p}}\) and \(\Delta \Delta G_{norm}\) (Fig. 4).
Correlation for model predictions and BFE changes. The estimated product-moment correlation coefficient (\(\rho\)) indicates a strong positive correlation, with a value of 0.776, and a coefficient of determination (\(R^2\)) of 0.602. The values of BFE changes (\(\Delta \Delta G\) (kcal mol−1)) were obtained from TopNetTree application and normalized (\(\Delta \Delta G_{norm}\))84,85.
Our analysis identified two distinct clusters: the first contains data points associated with neutral mutations, where changes in BFE exhibit close alignment with changes in WT BFE (\(\Delta G^{MT}_{Bind} \approx \Delta G^{WT}_{Bind}\))57. The second group included variants correlated with increased viral infectivity and decreased immunogenicity and exhibited significant increases in BFE changes85. To quantify the relationship between these groups, we calculated the product-momentum correlation coefficient (also known as Pearson’s correlation coefficient, \(\rho\)) between the average probability, \({\overline{p}}\) and the normalized change in BFE, \(\Delta \Delta G_{norm}\). This produced a strong positive correlation, with \(\rho\) at 0.776 and a coefficient of determination (\(R^2\)) of 0.602 (Fig. 4).
Prediction from centroids
As mentioned previously, the centroids derived from hierarchical clustering offer the most representative conformations of the structure. Thus, we aimed to estimate the impact of the most infective and least immunogenic types of mutants on the mobility of the RBD on the basis of the most probable conformation represented by the centroid. Considering the centroid during the model prediction, rather than analyzing all frames of the complete MD trajectory, we can reduce the time required for the MD simulation.
Table 4 presents the distribution of frames per cluster, \(c_i\), revealing that the first cluster contains the highest number of frames in the MD trajectories. In most systems, the number of frames corresponds to a percentage greater than 60.0%. To determine whether predicting the centroid for the first cluster allows us to predict the class, we calculate the probability (p) for each of the n centroids. We estimate the average of these probabilities, weighting them by the ratio between the frame frequency of the ith cluster and the number of frames in the trajectory (w), mathematically described as
The results indicate that the prediction of the centroid concerning cluster 0 already corresponds to the mutant type class, a finding supported by the weighted average, \({\overline{p}}_w\). This behavior agrees with the predictions made for the MD trajectories, as the centroid represents the most representative conformation of the cluster.
Feature visualization
We use a technique known as feature visualization to make the learned features explicit by maximizing activation89. The aim is to find the input that maximally activates a specific unit. In CNNs, these units correspond to feature maps. From these maps, we can extract the learned patterns from the MD trajectories, allowing for a better understanding of the conformations.
Initially, we considered several approaches developed in related studies, which involved selecting a percentage of frames from the DM trajectory to identify key residues from deep CNNs to correctly predict test frames7,8. In our context, clusters provide a simplified analysis of the conformations sampled for mutant types, and each cluster presents a corresponding centroid that represents the most representative conformation.
Due to their representativeness, we focused on the centroids of cluster 0 of the MD trajectories used in model training, that is, the WT and Beta variants, in addition to the trajectory referring to the Gamma variant. In the WT trajectory, 60.4% of the frames are in cluster 0, with an average interpoint distance of 1.238 Å (\(\pm \, 0.15\) Å). The Beta variant has 87.3% of frames in cluster 0, with an average distance of 1.223 Å (\(\pm \, 0.193\) Å). In turn, the Gamma variant has 91.8% of frames in cluster 0, with an average distance of 1.405 Å (\(\pm \, 0.319\) Å). We used a 2 Å cutoff distance.
We generated the corresponding DM for the centroids and used it as input to the model. Initially, we extracted an instance of a feature map from the initial conv block. layers (Conv1_2), for the centroid of cluster 0 of the Gamma variant. We observed that the highlighted region in the feature maps corresponds to the C-terminal extension of the loop \(\beta _{1}^{\prime }/\beta _{2}^{\prime }-C\) (comprising residues 470–490 of the RBD) (Fig. 1e). In sequence, we projected the high-intensity pixels of the map onto the 3D structure of the RBD (Fig. 1f). We observe that this segment presents greater flexibility than the structure (see Fig. 2 in the Supplementary Material), which is consistent with the results of 2D-RMSD analyses and corroborated by previous studies.
Conclusions and perspectives
The application of MD simulations to the analysis of conformational states and functional mechanisms of macromolecules is a well-established approach. However, due to their high dimensionality and large scale, analyses involving MD data typically pose substantial computational demand. In this study, we represent the trajectories of MD simulations of the SARS-CoV-2 spike protein-ACE2 as 2D distance maps, and combine cluster analysis and convolutional neural networks for identifying discriminative and nontrivial conformational changes in these trajectories.
Our model was successful in predicting the functional impact related to mutant types that increase the affinity of the S protein for ACE2 and reduce its immunogenicity, considering both MD trajectories (\(prec = 0.718\); \(rec = 0.800\); \(\hbox {F}_1 = 0.757\); MCC = 0.488; AUC = 0.800) and its centroids. We also obtained a strong positive Pearson correlation between the average probability of sigmoid for MD trajectories and the \(\Delta \Delta G_{Bind}\) corresponding to neutral mutants and variants, represented by \(\rho\) equal to 0.776. Furthermore, we obtained a coefficient of determination (\(R^2\)) of 0.602.
We also observed a strong positive Pearson correlation (\(\rho = 0.776\)) between the average probability of the sigmoid function for MD trajectories and the \(\Delta \Delta G_{\text {Bind}}\) corresponding to neutral mutants and variants. Furthermore, we achieved a coefficient of determination (\(R^2 = 0.602\)). Our 2D-RMSD analysis aligned with the visualization of features for more infectious and immune-evading mutant types and revealed fluctuating regions within the RBM, especially in the \(\beta _{1}^{\prime }/\beta _{2}^{\prime }-C\) loop. This region presented a significant standard deviation for mutations that allow SARS-CoV-2 to evade the immune response, with RMSD values of 5Å.
The proposed method represents an efficient alternative to identify potential SARS-CoV-2 strains, which may be potentially linked to more infectious and transmissible mutations. Our approach utilizes clustering and deep learning techniques to extract valuable insights from ensembles of MD trajectories, enabling recognition of a wide range of conformational patterns characteristic of mutant types. This provides a significant advantage in the early identification of emerging variants. Although MD simulations continue to supply the detailed structural data required by the model, our deep learning framework significantly minimizes the need for time-consuming post-simulation analyses. Furthermore, when we consider that the most frequently observed conformation in an MD trajectory often resembles the average conformation in X-ray crystallography or even in NMR experiments, this approach could identify variants of a single structure obtained from these methods without the need for a simulation.
Furthermore, our approach provides a novel perspective by demonstrating that the increase in infectivity and immune evasion associated with spike protein mutations is not solely attributable to direct protein-receptor contacts but is also significantly influenced by intrinsic changes in protein mobility. By considering the dynamic flexibility of the protein, our study offers a relevant understanding of the underlying mechanisms driving viral evolution.
The ability to accurately predict the increased binding affinity of the SARS-CoV-2 spike protein’s RBD to the human ACE2 receptor without any structural data from the ACE2 protein being provided to the learning algorithm. This is achieved by focusing solely on changes in spike protein mobility, representing a methodological advancement not explored in previous studies. This capability highlights the model’s effectiveness in predicting functional impacts based solely on dynamic protein behavior, offering a novel perspective on protein interaction analysis.
The present work tends to contribute substantially to the field of computational biology and virology, particularly to accelerate the design and optimization of new therapeutic agents and vaccines, offering a proactive stance against the constantly evolving threat of COVID-19 or future pandemics. Our findings indicate that the deep learning model can serve as a viable alternative to expensive energy calculations, providing a level of accuracy comparable to that of MD simulation analyses.
Data availability
The results relating to the analyses conducted as well as the source code are available in the public repository: https://github.com/LBS-UFMG/MD-ML-Project. The data used during this research are available from authors G.B.R (gbr@quimica.ufpb.br) and R.C.M. (raquelcm@dcc.ufmg.br) on request.
Abbreviations
- 2019-nCoV:
-
Novel coronavirus
- \(\beta\)2AR:
-
\(\beta\)2-Adrenergic receptor
- ACE2:
-
Angiotensin-converting enzyme 2
- AI:
-
Artificial intelligence
- BFE:
-
Binding free energy
- BN:
-
Batch normalization
- BS:
-
Batch size
- CNN:
-
Convolutional neural networks
- COVID-19:
-
Coronavirus disease 2019
- CV:
-
Cross-validation
- DM:
-
Distance maps
- DL:
-
Deep learning
- FPR:
-
False positive rate
- GPCR:
-
G protein-coupled receptors
- LNCC:
-
Brazilian National Laboratory for Scientific Computing
- MD:
-
Molecular dynamics
- ML:
-
Machine learning
- mRNA:
-
Messenger ribonucleic acid
- nAbs:
-
neutralizing antibodies
- NMR-spectroscopy:
-
Nuclear magnetic resonance spectroscopy
- NPT:
-
Constant-temperature, constant-pressure
- PPI:
-
Protein–protein interactions
- RBD:
-
Receptor-binding domain
- RBM:
-
Receptor-binding motif
- 3D-ResNet:
-
3D residual networks
- ReLU:
-
Rectified linear unit
- RMSD:
-
Root-mean-square deviation
- RMSF:
-
Root-mean-square-fluctuation
- ROC:
-
Receiver operating characteristic curve
- S protein:
-
Spyke protein
- SARS-CoV-2:
-
Severe Acute Respiratory Syndrome Coronavirus 2
- VBM:
-
Variants being monitored
- VGGNet:
-
Visual Geometry Group Network
- VOC:
-
Variants of concern
- VOI:
-
Variants of interest
- TopNetTree:
-
Topology-based network tree
- TPR:
-
True positive rate
- WHO:
-
World Health Organization
- WT:
-
Wild-type
References
Min, S., Lee, B. & Yoon, S. Deep learning in bioinformatics. Brief. Bioinform.18, 851–869. https://doi.org/10.1093/bib/bbw068 (2016).
Goodfellow, I. J., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
Zhang, A., Lipton, Z. C., Li, M. & Smola, A. J. Dive into Deep Learning (Cambridge University Press, 2023).
Gao, W. et al. Deep learning in protein structural modeling and design. Patterns1, 100142. https://doi.org/10.1016/j.patter.2020.100142 (2020).
Defresne, M., Barbe, S. & Schiex, T. Protein design with deep learning. Int. J. Mol. Sci.22, 11741. https://doi.org/10.3390/ijms222111741 (2021).
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature577, 706–710. https://doi.org/10.1038/s41586-019-1923-7 (2020).
Plante, A. et al. A machine learning approach for the discovery of ligand-specific functional mechanisms of GPCRs. Molecules24, 2097. https://doi.org/10.3390/molecules24112097 (2019).
Han, M. et al. Recognition of the ligand-induced spatiotemporal residue pair pattern of \(\beta\)2-adrenergic receptors using 3-D residual networks trained by the time series of protein distance maps. Comput. Struct. Biotechnol. J.20, 6360–6374. https://doi.org/10.1016/j.csbj.2022.10.036 (2022).
Li, C. et al. An interpretable convolutional neural network framework for analyzing molecular dynamics trajectories: A case study on functional states for G-protein-coupled receptors. J. Chem. Inf. Model.62, 1399–1410. https://doi.org/10.1021/acs.jcim.2c00085 (2022).
Karplus, M. & McCammon, J. A. Molecular dynamics simulations of biomolecules. Nat. Struct. Biol.9, 646–652. https://doi.org/10.1038/nsb0902-646 (2002).
Filipe, H. A. L. & Loura, L. M. S. Molecular dynamics simulations: Advances and applications. Molecules27, 2105. https://doi.org/10.3390/molecules27072105 (2022).
Hollingsworth, S. A. & Dror, R. O. Molecular dynamics simulation for all. Neuron99, 1129–1143. https://doi.org/10.1016/j.neuron.2018.08.011 (2018).
Liu, X. et al. Deep geometric representations for modeling effects of mutations on protein–protein binding affinity. PLoS Comput. Biol.17, e1009284. https://doi.org/10.1371/journal.pcbi.1009284 (2021).
Mannar, D. et al. Structural analysis of receptor binding domain mutations in SARS-CoV-2 variants of concern that modulate ACE2 and antibody binding. Cell Rep.37, 110156. https://doi.org/10.1016/j.celrep.2021.110156 (2021).
Chen, J. et al. Mutations strengthened SARS-CoV-2 infectivity. J. Mol. Biol.432, 5212–5226. https://doi.org/10.1016/j.jmb.2020.07.009 (2020).
Yang, W.-T. et al. SARS-CoV-2 e484k mutation narrative review: Epidemiology, immune escape, clinical implications, and future considerations. Infecti. Drug Resist.15, 373–385. https://doi.org/10.2147/idr.s344099 (2022).
Sergeeva, A. P. et al. Free energy perturbation calculations of mutation effects on SARS-CoV-2 RBD:ACE2 binding affinity. J. Mol. Biol.435, 168187. https://doi.org/10.1016/j.jmb.2023.168187 (2023).
Baum, A. et al. Antibody cocktail to SARS-CoV-2 spike protein prevents rapid mutational escape seen with individual antibodies. Science369, 1014–1018. https://doi.org/10.1126/science.abd0831 (2020).
Rockx, B. et al. Escape from human monoclonal antibody neutralization affects in vitro and in vivo fitness of severe acute respiratory syndrome coronavirus. J. Infect. Dis.201, 946–955. https://doi.org/10.1086/651022 (2010).
Wang, G. et al. Deep-learning-enabled protein–protein interaction analysis for prediction of SARS-CoV-2 infectivity and variant evolution. Nat. Med.29, 2007–2018. https://doi.org/10.1038/s41591-023-02483-5 (2023).
Ou, J. et al. V367f mutation in SARS-CoV-2 spike RBD emerging during the early transmission phase enhances viral infectivity through increased human ACE2 receptor binding affinity. J. Virol.[SPACE]https://doi.org/10.1128/jvi.00617-21 (2021).
Pipitò, L., Rujan, R., Reynolds, A. C. & Deganutti, G. Molecular dynamics studies reveal structural and functional features of the SARS-CoV-2 spike protein. BioEssays[SPACE]https://doi.org/10.1002/bies.202200060 (2022).
Rath, S., Padhi, K. & Mandal, N. Scanning the RBD-ACE2 molecular interactions in omicron variant. Biochem. Biophys. Res. Commun.592, 18–23. https://doi.org/10.1016/j.bbrc.2022.01.006 (2022).
Abduljalil, J. et al. How helpful were molecular dynamics simulations in shaping our understanding of SARS-CoV-2 spike protein dynamics?. Int. J. Biol. Macromol.242, 125153. https://doi.org/10.1016/j.ijbiomac.2023.125153 (2023).
Ahamad, S., Hema, K. & Gupta, D. Structural stability predictions and molecular dynamics simulations of RBD and HR1 mutations associated with SARS-CoV-2 spike glycoprotein. J. Biomol. Struct. Dyn.40, 6697–6709. https://doi.org/10.1080/07391102.2021.1889671 (2021).
Pavlova, A. et al. Machine learning reveals the critical interactions for SARS-CoV-2 spike protein binding to ace2. J. Phys. Chem. Lett.12, 5494–5502. https://doi.org/10.1021/acs.jpclett.1c01494 (2021).
Liu, J. et al. Characterization of SARS-CoV-2 worldwide transmission based on evolutionary dynamics and specific viral mutations in the spike protein. Infect. Dis. Poverty.[SPACE]https://doi.org/10.1186/s40249-021-00895-4 (2021).
Williams, J. et al. Molecular dynamics analysis of a flexible loop at the binding interface of the SARS-CoV-2 spike protein receptor-binding domain. Proteins: Struct. Funct. Bioinform.90, 1044–1053. https://doi.org/10.1002/prot.26208 (2021).
Antony, P. & Vijayan, R. Molecular dynamics simulation study of the interaction between human angiotensin converting enzyme 2 and spike protein receptor binding domain of the SARS-CoV-2 b.1.617 variant. Biomolecules11, 1244. https://doi.org/10.3390/biom11081244 (2021).
Harvey, W. T. et al. SARS-CoV-2 variants, spike mutations and immune escape. Nat. Rev. Microbiol.19, 409–424. https://doi.org/10.1038/s41579-021-00573-0 (2021).
Schrödinger, LLC. The PyMOL molecular graphics system (2020).
Waterhouse, A. et al. SWISS-MODEL: Homology modelling of protein structures and complexes. Nucleic Acids Res.46, W296–W303. https://doi.org/10.1093/nar/gky427 (2018).
Anandakrishnan, R., Aguilar, B. & Onufriev, A. V. H\(+\) 3.0: Automating pK prediction and the preparation of biomolecular structures for atomistic molecular modeling and simulations. Nucleic Acids Res.40, W537–W541. https://doi.org/10.1093/nar/gks375 (2012).
Case, D. A. et al. The Amber biomolecular simulation programs. J. Comput. Chem.26, 1668–1688. https://doi.org/10.1002/jcc.20290 (2005).
Phillips, J. C. et al. Scalable molecular dynamics with NAMD. J. Comput. Chem.26, 1781–1802. https://doi.org/10.1002/jcc.20289 (2005).
Roe, D. R. & Cheatham, T. E. PTRAJ and CPPTRAJ: Software for processing and analysis of molecular dynamics trajectory data. J. Chem. Theory Comput.9, 3084–3095. https://doi.org/10.1021/ct400341p (2013).
Wu, F. et al. A new coronavirus associated with human respiratory disease in China. Nature579, 265–269. https://doi.org/10.1038/s41586-020-2008-3 (2020).
CDC. SARS-CoV-2 Variant Classifications and Definitions. https://www.cdc.gov/coronavirus/2019-ncov/variants/variant-classifications.html (2023).
Radvak, P. et al. SARS-CoV-2 B.1.1.7 (alpha) and B.1.351 (beta) variants induce pathogenic patterns in K18-hACE2 transgenic mice distinct from early strains. Nat. Commun.12, 6559. https://doi.org/10.1038/s41467-021-26803-w (2021).
Wang, P. et al. Antibody resistance of SARS-CoV-2 variants B.1.351 and B.1.1.7. Nature593, 130–135. https://doi.org/10.1038/s41586-021-03398-2 (2021).
Faria, N. R. et al. Genomics and epidemiology of the P.1 SARS-CoV-2 lineage in Manaus, Brazil. Science372, 815–821. https://doi.org/10.1126/science.abh264 (2021).
Naveca, F. G. et al. COVID-19 in Amazonas, Brazil, was driven by the persistence of endemic lineages and P.1 emergence. Nat. Med.27, 1230–1238. https://doi.org/10.1038/s41591-021-01378-7 (2021).
Mlcochova, P. et al. SARS-CoV-2 B.1.617.2 delta variant replication and immune evasion. Nature599, 114–119. https://doi.org/10.1038/s41586-021-03944-y (2021).
Pouwels, K. B. et al. Effect of delta variant on viral burden and vaccine effectiveness against new SARS-CoV-2 infections in the UK. Nat. Med.27, 2127–2135. https://doi.org/10.1038/s41591-021-01548-7 (2021).
Zhang, W. et al. Emergence of a novel SARS-CoV-2 variant in southern California. JAMA325, 1324. https://doi.org/10.1001/jama.2021.1612 (2021).
Annavajhala, M. K. et al. Emergence and expansion of SARS-CoV-2 b.1.526 after identification in New York. Nature597, 703–708. https://doi.org/10.1038/s41586-021-03908-2 (2021).
Zhou, H. et al. B.1.526 SARS-CoV-2 variants identified in New York City are neutralized by vaccine-elicited and therapeutic monoclonal antibodies. mBio12, e0138621. https://doi.org/10.1128/mbio.01386-21 (2021).
McCallum, M. et al. Molecular basis of immune evasion by the delta and kappa SARS-CoV-2 variants. Science374, 1621–1626. https://doi.org/10.1126/science.abl8506 (2021).
Wilhelm, A. et al. Antibody-mediated neutralization of authentic SARS-CoV-2 b.1.617 variants harboring l452r and t478k/e484q. Viruses13, 1693. https://doi.org/10.3390/v13091693 (2021).
Halfmann, P. J. et al. Characterization of the SARS-CoV-2 b.1.621 (mu) variant. Sci. Transl. Med.14, eabm4908. https://doi.org/10.1126/scitranslmed.abm4908 (2022).
Laiton-Donato, K. et al. Characterization of the emerging B.1.621 variant of interest of SARS-CoV-2. Infect. Genet. Evolut.95, 105038. https://doi.org/10.1016/j.meegid.2021.105038 (2021).
Chen, J. et al. Omicron variant (B.1.1.529): Infectivity, vaccine breakthrough, and antibody resistance. J. Chem. Inf. Model.62, 412–422. https://doi.org/10.1021/acs.jcim.1c01451 (2022).
Haw, N. J. et al. Epidemiological characteristics of the SARS-CoV-2 Theta variant (P.3) in the Central Visayas region, Philippines, 30 October 2020–16 February 2021. West. Pac. Surveill. Resp. J.13, 60–62. https://doi.org/10.5365/wpsar.2022.13.1.883 (2022).
WHO. Tracking SARS-CoV-2 variants. https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/ (2020).
Aksamentov, I. et al. Nextclade: Clade assignment, mutation calling and quality control for viral genomes. J. Open Source Soft.6, 3773. https://doi.org/10.21105/joss.03773 (2021).
Hodcroft, E. B. Covariants: SARS-CoV-2 Mutations and Variants of Interest[SPACE]https://covariants.org/ (2021).
Teng, S. et al. Systemic effects of missense mutations on SARS-CoV-2 spike glycoprotein stability and receptor-binding affinity. Brief. Bioinform.22, 1239–1253. https://doi.org/10.1093/bib/bbaa233 (2020).
Laurini, E. et al. Molecular rationale for SARS-CoV-2 spike circulating mutations able to escape bamlanivimab and etesevimab monoclonal antibodies. Sci. Rep.11, 20274. https://doi.org/10.1038/s41598-021-99827-3 (2021).
Murtagh, F. A survey of recent advances in hierarchical clustering algorithms. Comput. J.26, 354–359. https://doi.org/10.1093/comjnl/26.4.354 (1983).
Kloczkowski, A. et al. Distance matrix-based approach to protein structure prediction. J. Struct. Funct. Genomics10, 67–81. https://doi.org/10.1007/s10969-009-9062-2 (2009).
Ding, W. & Gong, H. Predicting the real-valued inter-residue distances for proteins. Adv. Sci.[SPACE]https://doi.org/10.1002/advs.202001314 (2020).
Du, Y., Kabir, A., Zhao, L. & Shehu, A. From interatomic distances to protein tertiary structures with a deep convolutional neural network. in Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics 1–8 (2020). https://doi.org/10.1145/3388440.3414699.
Leach, A. Molecular Modeling: Principles and Applications 2nd edn. (Prentice Hall, 2001).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. in 3rd International Conference on Learning Representations (ICLR 2015) 1–14 (2015). ArXiv:1409.1556v6.
LeCun, Y. Backpropagation applied to handwritten zip code recognition. Neural Comput.1, 541–551. https://doi.org/10.1162/neco.1989.1.4.541 (1989).
Chollet, F. Deep Learning with Python 4th edn. (Manning, 2017).
Mishkin, D., Sergievskiy, N. & Matas, J. Systematic evaluation of convolution neural network advances on the Imagenet. Comput. Vis. Image Understand.[SPACE]https://doi.org/10.1016/j.cviu.2017.05.007 (2017).
Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. in Proceedings of the 32nd International Conference on International Conference on Machine Learning, vol. 37, 448—456 (2015).
Srivastava, N. et al. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res.15, 1929–1958 (2014).
Abadi, M. et al.TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems[SPACE]arXiv:1603.04467 (2016)
Sanches, P. R. S. et al. Recent advances in SARS-CoV-2 Spike protein and RBD mutations comparison between new variants Alpha (B.1.1.7, United Kingdom), Beta (B.1.351, South Africa), Gamma (P.1, Brazil) and Delta (B.1.617.2, India). J. Virus Erad.7, 100054. https://doi.org/10.1016/j.jve.2021.100054 (2021).
Duda, R. O., Hart, P. E. & Stork, D. G. Pattern Classification (Wiley, 2001).
Haykin, S. Neural Networks (Prentice Hall, 1999).
Cherian, S. et al. SARS-CoV-2 spike mutations, L452R, T478K, E484Q and P681R, in the second Wave of COVID-19 in Maharashtra. India. Microorgan.9, 1542–1553. https://doi.org/10.3390/microorganisms9071542 (2021).
Webb, A. R. & Copsey, K. D. Statistical Pattern Recognition (Wiley, 2011).
Dai, L. & Gao, G. F. Viral targets for vaccines against COVID-19. Nat. Rev. Immunol.21, 73–82. https://doi.org/10.1038/s41577-020-00480-0 (2020).
Tada, T. et al. SARS-CoV-2 lambda variant remains susceptible to neutralization by mRNA vaccine-elicited antibodies and convalescent serum (2021). https://doi.org/10.1101/2021.07.02.450959.
Wang, M. et al. Reduced sensitivity of the SARS-CoV-2 lambda variant to monoclonal antibodies and neutralizing antibodies induced by infection and vaccination. Emerg. Microb. Infect.11, 18–29. https://doi.org/10.1080/22221751.2021.2008775 (2021).
Tai, W. et al. Characterization of the receptor-binding domain (RBD) of 2019 novel coronavirus: Implication for development of RBD protein as a viral attachment inhibitor and vaccine. Cell. Mol. Immunol.17, 613–620. https://doi.org/10.1038/s41423-020-0400-4 (2020).
Alaofi, A. L. & Shahid, M. Mutations of SARS-CoV-2 RBD may alter its molecular structure to improve its infection efficiency. Biomolecules11, 1273. https://doi.org/10.3390/biom11091273 (2021).
Paul, D., Pyne, N. & Paul, S. Mutation profile of SARS-CoV-2 spike protein and identification of potential multiple epitopes within spike protein for vaccine development against SARS-CoV-2. VirusDisease32, 703–726. https://doi.org/10.1007/s13337-021-00747-7 (2021).
Tao, K. et al. The biological and clinical significance of emerging SARS-CoV-2 variants. Nat. Rev. Genet.22, 757–773. https://doi.org/10.1038/s41576-021-00408-x (2021).
Chen, J., Gao, K., Wang, R. & Wei, G.-W. Prediction and mitigation of mutation threats to COVID-19 vaccines and antibody therapies. Chem. Sci.12, 6929–6948. https://doi.org/10.1039/d1sc01203g (2021).
Wang, M., Cang, Z. & Wei, G.-W. A topology-based network tree for the prediction of protein–protein binding affinity changes following mutation. Nat. Mach. Intell.2, 116–123. https://doi.org/10.1038/s42256-020-0149-6 (2020).
Wang, R. et al. Emerging vaccine-breakthrough SARS-CoV-2 variants. ACS Infect. Dis.8, 546–556. https://doi.org/10.1021/acsinfecdis.1c00557 (2022).
Chen, J. et al. Revealing the threat of emerging SARS-CoV-2 mutations to antibody therapies. J. Mol. Biol.433, 167155. https://doi.org/10.1016/j.jmb.2021.167155 (2021).
Chen, J., Gao, K., Wang, R. & Wei, G. Revealing the threat of emerging SARS-CoV-2 mutations to antibody therapies. J. Mol. Biol.433, 167155. https://doi.org/10.1016/j.jmb.2021.167155 (2021).
Mishra, P. M., Anjum, F., Uversky, V. N. & Nandi, C. K. SARS-CoV-2 spike mutations modify the interaction between virus spike and human ACE2 receptors. Biochem. Biophys. Res. Commun.620, 8–14. https://doi.org/10.1016/j.bbrc.2022.06.064 (2022).
Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable (2022).
Acknowledgements
The authors gratefully acknowledge the financial support from the Brazilian agencies, institutes, and networks: Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES), Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG-MG), Fundação de Apoio à Pesquisa do Estado da Paraíba (FAPESQ-PB), Programa de Apoio a Núcleos de Excelência (PRONEX-FACEPE), Fundação de Apoio ao Desenvolvimento da Universidade Federal de Pernambuco (FADE-UFPE) and Financiadora de Estudos e Projetos (FINEP). The authors also acknowledge the physical structure and computational support provided by Universidade Federal da Paraíba (UFPB), the computer resources of Centro Nacional de Processamento de Alto Desempenho em São Paulo (CENAPAD-SP), Núcleo de Processamento de Alto Desempenho da Universidade Federal do Rio Grande do Norte (NPAD/UFRN), and Supercomputer Santos Dumont (https://sdumont.lncc.br/) at the Brazilian National Laboratory for Scientific Computing (LNCC).
Funding
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) [Grant Nos. 312143/2020-6, 307340/2021-0, 405745/2021-4, 440363/2022-5 and 440307/2022-8]; Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG) [Grant Nos. APQ-01834-21, APQ-02690-22, APQ-01838-24]; Fundação de Apoio à Pesquisa do Estado da Paraíba (FAPESQ) [Grant No. 030/2023].
Author information
Authors and Affiliations
Contributions
L.M.S. designed the machine learning model, analyzed the results, contributed to the drafting and revising of the manuscript, and approved the final draft. J.G.M. performed the molecular dynamics simulations, analyzed the data, contributed to the drafting and revising of the manuscript, and approved the final draft. Y.J.L. performed the molecular dynamics simulations, analyzed the data, authored, reviewed, and approved the final draft. L.H.L. conceived and designed the molecular dynamics experiments, analyzed the data, contributed to the drafting and revising of the manuscript, and approved the final draft. G.B.R. conceived and designed the molecular dynamics experiments, performed the experiments, analyzed the data, contributed to the drafting and revising of the manuscript, and approved the final draft. R.C.M. advised the study on the design of the machine learning models, analyzed the results, reviewed the drafts of the article, and approved the final draft.
Corresponding authors
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Moraes dos Santos, L., Gutembergue de Mendonça, J., Jerônimo Gomes Lobo, Y. et al. Deep learning for discriminating non-trivial conformational changes in molecular dynamics simulations of SARS-CoV-2 spike-ACE2. Sci Rep 14, 22639 (2024). https://doi.org/10.1038/s41598-024-72842-w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-024-72842-w