Exploring transition states of protein conformational changes via out-of-distribution detection in the hyperspherical latent space

Liu, Bojun; Boysen, Jordan G.; Unarta, Ilona Christy; Du, Xuefeng; Li, Yixuan; Huang, Xuhui

doi:10.1038/s41467-024-55228-4

Article
Open access
Published: 02 January 2025

Bojun Liu^1,2,
Jordan G. Boysen¹,
Ilona Christy Unarta^1,2,
Xuefeng Du³,
Yixuan Li^2,3 &
…
Xuhui Huang ORCID: orcid.org/0000-0002-7119-9358^1,2

Nature Communications volume 16, Article number: 349 (2025)

Abstract

Identifying transition states is crucial for understanding protein conformational changes that underlie numerous biological processes. Markov state models (MSMs), built from molecular dynamics (MD) simulations, capture these dynamics through transitions among metastable conformational states, and have demonstrated success in studying protein conformational changes. However, MSMs face challenges in identifying transition states, as they partition MD conformations into discrete metastable states (or free energy minima) and therefore lack a description of transition states located at the free energy barriers. Here, we introduce Transition State identification via Dispersion and vAriational principle Regularized neural networks (TS-DAR), a deep learning framework inspired by out-of-distribution (OOD) detection in trustworthy artificial intelligence (AI). TS-DAR offers an end-to-end pipeline that can simultaneously detect all transition states between multiple free energy minima from MD simulations using the regularized hyperspherical embeddings in latent space. The key insight of TS-DAR lies in treating transition state structures as OOD data, recognizing that they are sparsely populated and exhibit a distributional shift from metastable states. We demonstrate the power of TS-DAR by applying it to a 2D potential, alanine dipeptide, and the translocation of a DNA motor protein on DNA, where it outperforms previous methods in identifying transition states.

Introduction

Understanding the transition states of protein conformational changes, which typically encompass a collection of conformations located at the saddle points of free energy barriers that separate different free energy basins, is crucial for gaining insights into various biological processes, including protein folding, misfolding, gene expression, etc. This understanding also facilitates drug design and enzyme engineering. For example, a recent study¹ suggests that a single point mutation can rescue RNA polymerase II from pausing at an epigenetically modified DNA base by destabilizing the transition state during the conformational change of translocation. This insight into transition states deepens our understanding of how epigenetic modifications interfere with gene expression. Moreover, understanding transition states could aid in designing drug compounds that bind to these states, thereby effectively inhibiting the functions of targeted proteins. However, due to the low populations and transient features of these transition state structures, it remains challenging to directly investigate them at atomic resolutions using experimental techniques.

Molecular dynamics (MD) simulations can serve as a powerful approach to complement experimental methods in studying protein conformational changes, as they enable the elucidation of conformational dynamics at high spatial and temporal resolution. However, all-atom MD simulations typically operate with femtosecond time steps, posing a challenge in capturing protein conformational changes that usually occur over milliseconds or longer. Markov state models (MSMs)^{2,3,4,5,6,7,8,9,10,11,12,13,14} have effectively addressed this challenge by integrating multiple short MD trajectories to predict long-timescale dynamics. Despite the significant success of MSMs in studying conformational changes over the past decade^{15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35}, identifying transition state structures remains challenging. In MSMs, dynamic processes are modeled as a series of Markovian transitions among metastable conformational states (or free energy minima) at discrete time intervals (called lag times). Each MD conformation, including those in the transition state region, is therefore assigned to a specific metastable state, which makes it difficult to isolate the transition states.
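To make the role of the lag time concrete, the following is a minimal sketch of how transition counts from several short trajectories can be pooled into one transition matrix. It assumes the trajectories have already been discretized into integer state indices; the function name `estimate_transition_matrix` is illustrative, and the plain row-normalized count estimator used here is a simplification (it does not enforce detailed balance, as production MSM estimators often do).

```python
import numpy as np

def estimate_transition_matrix(dtrajs, n_states, lag):
    """Row-normalized transition counts at a given lag time (in frames)."""
    counts = np.zeros((n_states, n_states))
    for dtraj in dtrajs:
        d = np.asarray(dtraj)
        # Pair every frame with the frame `lag` steps later in the same trajectory.
        np.add.at(counts, (d[:-lag], d[lag:]), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no outgoing counts are left as zeros instead of dividing by zero.
    return counts / np.where(row_sums > 0, row_sums, 1.0)

# Two short discretized trajectories (toy data, not from the paper).
dtrajs = [[0, 0, 1, 1], [1, 0, 0]]
T = estimate_transition_matrix(dtrajs, n_states=2, lag=1)
print(T)
```

Every frame receives a state index in this construction, which is why conformations on the barrier end up absorbed into one of the neighboring states.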

For an MSM containing a large number of small states (or microstates), one approach to identifying transition state structures within the MSM pipeline is to compute the committor probabilities of these microstates. Microstates with an equal probability (committor equal to 0.5) of reaching the initial and final metastable states can be identified as transition states between the two. However, the efficacy of this method relies significantly on the quality and accuracy of the corresponding microstate MSMs. Additionally, it can only identify transition states between pairs of metastable states one at a time. Recently, a deep learning-based approach, MaxEnt-VAMPNets³⁶, has been developed to identify transition state structures to facilitate adaptive sampling. This approach utilizes the state assignment probabilities output from VAMPnets³⁷ to calculate the Shannon entropy for each MD conformation. It assumes that the conformations with higher Shannon entropy values are more likely to be located in low-probability regions (i.e., the summit of free energy barriers). However, the state assignment probabilities from VAMPnets³⁷ represent the basis functions that can best linearly reconstruct the system’s slowest dynamic modes³⁸. It is not guaranteed that they precisely reflect the probabilities of MD conformations transitioning in or out of metastable states.
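The two baseline criteria described above can be illustrated on toy inputs. The sketch below is not the code used in the paper: `committor` solves the standard linear system for the forward committor of a row-stochastic transition matrix, and `shannon_entropy` computes the per-conformation entropy of state assignment probabilities. The function names, the 5-state chain, and the example probabilities are all assumptions made for illustration.

```python
import numpy as np

def committor(T, source, sink):
    """Forward committor: probability of reaching `sink` before `source`."""
    n = T.shape[0]
    source, sink = list(source), list(sink)
    q = np.zeros(n)
    q[sink] = 1.0  # boundary conditions: 0 on the source, 1 on the sink
    inter = [i for i in range(n) if i not in source and i not in sink]
    # For intermediate states: q_i = sum_j T_ij q_j, rearranged into A q = b.
    A = np.eye(len(inter)) - T[np.ix_(inter, inter)]
    b = T[np.ix_(inter, sink)].sum(axis=1)
    q[inter] = np.linalg.solve(A, b)
    return q

def shannon_entropy(probs, eps=1e-12):
    """Entropy of each row of state assignment probabilities."""
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)

# Symmetric 5-state chain: hop to each neighbor with probability 0.25.
T = np.zeros((5, 5))
for i in range(5):
    if i > 0:
        T[i, i - 1] = 0.25
    if i < 4:
        T[i, i + 1] = 0.25
    T[i, i] = 1.0 - T[i].sum()

q = committor(T, source=[0], sink=[4])
ts_microstates = np.flatnonzero(np.isclose(q, 0.5))

# An ambiguous assignment has higher entropy than a confident one.
h = shannon_entropy(np.array([[0.5, 0.5], [0.99, 0.01]]))
```

Note that the committor needs one source and one sink per call, which reflects the one-pair-at-a-time limitation mentioned above.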

In the past decades, a number of other MD simulation-based methods have been developed for identifying the transition states of conformational dynamics. For instance, transition path sampling (TPS)^{39,40,41} directly uncovers transition states using the committor function derived from the transition path ensemble generated through Monte Carlo sampling. Additionally, a deep reinforcement learning approach, integrating an efficient path sampling method called enhanced sampling of reactive trajectories (ESoRT)^{42,43}, has been employed to identify transition states by framing the problem as a shooting game⁴⁴. Rather than directly capturing transition states, other approaches have been developed to extract the optimal reaction pathways^{45,46,47,48,49,50,51,52} or calculate committor functions^{53,54,55,56}, both of which greatly assist in transition state identification. However, despite the robustness and promise of these methods, they can be limited by high computational cost or by the need for prior knowledge and accurate characterization of the initial and final states.

Recent advances in out-of-distribution (OOD) detection within trustworthy artificial intelligence (AI) offer potential solutions to the challenge of simultaneously identifying all transition states between multiple free energy minima from MD simulations. OOD detection⁵⁷, an important task for trustworthy AI, has attracted increasing attention in recent years. The major issue targeted by OOD detection is that a model trained on a specific closed-world dataset, i.e., the in-distribution (ID) data, may make overconfident and wrong predictions on unknown examples, i.e., the OOD data, from the open world. Therefore, a reliable deep learning model should be able to perform a binary ID vs. OOD classification task and reject OOD data points. This is especially important when applying deep learning models to safety-critical applications^{58,59,60,61,62,63} such as self-driving cars and rare disease detection tasks. OOD detection, which has not been previously employed in the study of biomolecular conformational changes, bears significant potential in identifying transition states. This is because the conformations at the transition states, located at free energy barriers, are sparsely populated and exhibit out-of-distribution characteristics in contrast to conformations within the densely populated free energy basins (the ID data). Consequently, these transition states can be considered as OOD data.

Recently, Ming et al. developed a Compactness and DispErsion Regularized learning framework (CIDER)⁶⁴ for detecting OOD images. This was achieved by harnessing latent hyperspherical embeddings that are effectively regularized through the joint optimization of a compactness loss and a dispersion loss. Specifically, the hyperspherical embeddings (${{\rm{z}}}\in {{\mathbb{R}}}^{d}$) refer to a set of points in $d$-dimensional Euclidean space that are located at a constant distance from the center, collectively defining a ($d-1$)-dimensional hypersphere, as illustrated in Fig. 1c. The compactness loss encourages the tightening of image samples in each class on the hypersphere, while the dispersion loss promotes large angular distances between class prototypes. Consequently, OOD samples are expected to lie between class prototypes and can be detected based on cosine similarity-based measures. The concept introduced by CIDER serves as inspiration for our development of a deep learning approach aimed at detecting transition state structures in protein conformational changes. In the context of biomolecular dynamics, the metastable free energy basins, akin to class prototypes, should be separated from one another, while the transition state structures, representing OOD samples, are expected to reside in between them. However, a major challenge in directly applying CIDER to protein dynamics is that CIDER was developed under supervised learning settings, requiring labeled data for computing class prototypes. In contrast, protein conformations in MD trajectories typically lack these ground-truth labels. VAMPnets³⁷ offers a potential solution by bridging the gap between unsupervised learning and the need for labeled data. 
Furthermore, the VAMP-2 loss function, which is designed to capture the slowest dynamic modes of the system based on the separation of timescales, can facilitate the compression of metastable conformations within each energy basin, playing a similar role to the compactness loss in CIDER.

**Fig. 1: Schematic representation of TS-DAR for transition state identification.**

In this work, we present an end-to-end approach called Transition State identification via Dispersion and vAriational principle Regularized neural networks (TS-DAR) for detecting transition state structures of protein conformational changes from MD simulations. This approach utilizes the hyperspherical latent representations of the biomolecular conformations. The loss function comprises two terms: VAMP-2 loss and dispersion loss. By minimizing the VAMP-2 loss, the MD conformations within the same metastable state (or free energy basin) are compacted in the hyperspherical latent space. The dispersion loss further ensures the metastable state centers are uniformly distributed across the hypersphere. Consequently, all transition state conformations, located between free energy basins, could be simultaneously and automatically identified in the latent space. In TS-DAR, the key advantage of introducing hyperspherical latent space is that its geometry effectively prevents the dispersion term from pushing data infinitely apart, thereby ensuring a uniform distribution of free energy basins across the hypersphere. We have demonstrated the efficacy of our method by applying it to three systems: the 2D Müller potential, alanine dipeptide and the translocation of a DNA motor protein along double-strand DNA (dsDNA).

Results and discussion

Hyperspherical latent representations at the penultimate layer in TS-DAR

The schematic representation of TS-DAR’s model architecture is shown in Fig. 1a. Unlike VAMPnets³⁷, which directly employs two parallel encoders that take transition pairs of MD conformations and produce the SoftMax probabilities of state assignments, TS-DAR introduces an enhancement in its model architecture. Specifically, it incorporates an additional L2-norm/scale layer at the penultimate layer to extract the hyperspherical latent representations of MD conformations (Fig. 1a). These hyperspherical latent representations can be effectively regularized by the joint optimization of the VAMP-2 loss and the dispersion loss, enabling robust transition state identification on the latent hypersphere (Fig. 1c). In particular, the L2-norm/scale layer consists of two parts. The feature vectors at the penultimate layer are first divided by their L2-norms, and then rescaled by a scaling factor $\gamma$ (Fig. 1b). As a result, the feature embeddings at the penultimate layer are confined to a hypersphere of radius $\gamma$, and are referred to as the hyperspherical latent representations. To illustrate the hyperspherical representations of MD trajectories, we utilize a 2D Müller potential⁶⁵ with three minima as an example (Fig. 2a). In the latent space of the trained TS-DAR model for the Müller potential (Fig. 2b), all MD conformations lie exactly on a hypersphere. Three distinct free energy basins can be clearly identified on this hypersphere, with dashed lines denoting the mean vectors of each basin in Fig. 2b. Furthermore, these three free energy basins on the hypersphere (Fig. 2d) correspond clearly to the three energy basins of the Müller potential (see basins 1–3 in Fig. 2a). Pseudo-code for the TS-DAR algorithm is provided in SI Fig. S1, and additional details on training the TS-DAR model can be found in the Methods section.
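The two-step L2-norm/scale operation described above can be sketched in a few lines of NumPy. This is an illustration rather than the authors' implementation: the function name `l2_norm_scale`, the small `eps` guard against zero-length vectors, and the random input features are assumptions.

```python
import numpy as np

def l2_norm_scale(features, gamma=1.0, eps=1e-12):
    """Project each row of `features` onto a hypersphere of radius `gamma`."""
    # Step 1: divide each feature vector by its L2 norm.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)
    # Step 2: rescale by the factor gamma.
    return gamma * unit

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 3))          # stand-in for penultimate-layer features
z = l2_norm_scale(h, gamma=2.0)      # hyperspherical latent representations
radii = np.linalg.norm(z, axis=1)    # every entry equals gamma
```

Because only the direction of each vector survives, distances between embeddings reduce to angular distances, which is what the later cosine-based score relies on.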

**Fig. 2: Demonstration of TS-DAR on the Müller potential for learning latent hyperspherical representations.**

Utilizing VAMP-2 loss to capture the slowest dynamics within the hypersphere

As in VAMPnets³⁷, the minimization of the VAMP-2 loss (${{{\mathcal{L}}}}_{{\mbox{vamp}}}$ in Eq. 1) ensures that our TS-DAR model captures the slowest dynamic modes underlying the conformational changes of interest.

$${{{\mathcal{L}}}}_{{\rm{vamp}}}=-{\left|\left|{\bar{{{\rm{C}}}}}_{00}^{-\frac{1}{2}}{\bar{{{\rm{C}}}}}_{01}{\bar{{{\rm{C}}}}}_{11}^{-\frac{1}{2}}\right|\right|}_{F}^{2}-1$$

(1)

where ${\bar{{{\rm{C}}}}}_{01}$ represents the time-lagged correlation matrix, and ${\bar{{{\rm{C}}}}}_{00}$ and ${\bar{{{\rm{C}}}}}_{11}$ denote the instantaneous covariance matrices at times $t$ and $t+\tau$, respectively ($\tau$ is the lag time). The details for computing these matrices are presented in the Methods section. Because minimizing the VAMP-2 loss captures the slowest dynamic modes, the SoftMax outputs from TS-DAR can provide optimal state assignments that are aligned with the free energy basins, allowing the on-the-fly labeling of MD conformations during training. In addition, with the help of the VAMP-2 loss term, the hyperspherical latent representations in the trained TS-DAR model can retain all relevant kinetic geometries and compact the conformations on the hypersphere in terms of their kinetic metastability. This is demonstrated by the latent space representation of the Müller potential (Fig. 2d), where the three identified free energy minima and their order of connection correspond precisely with the arrangement of the three energy basins of the Müller potential (Fig. 2c). Additionally, we observe that the hyperspherical latent space effectively condenses MD conformations into three distinct clusters, each corresponding to a specific metastable free energy basin. For instance, the largest free energy basin (basin 1 in Fig. 2a) has been compressed more strongly than the other two basins due to its large metastability, resulting in a small but deep free energy minimum in the latent space (Fig. 2b). We attribute this observation to the presence of a single fully connected layer extending from the latent bottleneck to the outputs, compelling the latent representations to optimally capture the slowest dynamics through the VAMP-2 loss. This design of the penultimate layer for representation learning has already shown considerable promise in diverse deep learning fields ranging from computer vision to natural language processing^{66,67}. 
Finally, to elucidate the distinct roles of the VAMP-2 and dispersion terms in the loss function, we conducted a control experiment by omitting the dispersion loss. The aforementioned observations remain the same despite the absence of the dispersion loss (Fig. 2e, f). However, it is noteworthy that the three free energy basins exhibit uneven distribution in the latent space when the dispersion loss is excluded. Further discussion on this point will be presented in the next section.
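For readers who want to see Eq. 1 evaluated numerically, here is a minimal NumPy sketch. The exact estimator is defined in the Methods section, which is not reproduced here, so several choices below are assumptions: the inputs are mean-centered, the covariances are normalized by N−1, and near-zero eigenvalues are truncated before taking the inverse square root. The names `inv_sqrt`, `vamp2_loss`, and `one_hot` are illustrative.

```python
import numpy as np

def inv_sqrt(sym, eps=1e-10):
    """Inverse square root of a symmetric PSD matrix, dropping tiny eigenvalues."""
    w, v = np.linalg.eigh(sym)
    keep = w > eps
    return (v[:, keep] / np.sqrt(w[keep])) @ v[:, keep].T

def vamp2_loss(chi_t, chi_tau):
    """Negative VAMP-2 score of Eq. 1 for outputs at times t and t + tau."""
    x = chi_t - chi_t.mean(axis=0)
    y = chi_tau - chi_tau.mean(axis=0)
    n = x.shape[0]
    c00 = x.T @ x / (n - 1)   # instantaneous covariance at time t
    c11 = y.T @ y / (n - 1)   # instantaneous covariance at time t + tau
    c01 = x.T @ y / (n - 1)   # time-lagged covariance
    k = inv_sqrt(c00) @ c01 @ inv_sqrt(c11)
    return -np.sum(k ** 2) - 1.0

def one_hot(labels, n_states):
    return np.eye(n_states)[np.asarray(labels)]

labels_t = [0, 0, 0, 1, 1, 1]
# Perfectly metastable pairs: the state never changes within one lag time.
loss_perfect = vamp2_loss(one_hot(labels_t, 2), one_hot(labels_t, 2))
# Weakly correlated pairs: the state after one lag time is mostly unrelated.
loss_mixed = vamp2_loss(one_hot(labels_t, 2), one_hot([0, 1, 0, 1, 0, 1], 2))
```

In this toy case the perfectly metastable assignment gives the lower loss, which is the sense in which minimizing the VAMP-2 loss favors outputs aligned with slowly exchanging basins.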

Implementing dispersion loss for uniform latent distribution of metastable state centers and defining an OOD score for transition state identification

The dispersion loss was initially introduced by us and has successfully been employed in OOD detection for image classification tasks⁶⁴. In TS-DAR, we introduce the dispersion loss aiming to encourage the state centers (i.e., free energy minima) to be uniformly distributed across the hypersphere by maximizing the angular distances between these centers. For example, in the presence of the dispersion loss (Fig. 2d), the centers of 3 free energy basins of the Müller potential (labeled as state 1–3) are well separated and uniformly distributed in the latent space. In sharp contrast, without the dispersion loss (Fig. 2f), the three state centers exhibit an uneven distribution in the latent space. The dispersion loss is defined as follows:

$${{{\mathcal{L}}}}_{{\mbox{dis}}}=\frac{1}{C}\sum _{i=1}^{C}\log \frac{1}{C-1}\sum _{j=1}^{C}{\mathbb{1}}\left\{j\ne i\right\}{e}^{{{{\mathbf{\mu }}}}_{i}^{{{\rm{T}}}}{{{\mathbf{\mu }}}}_{j}/\sigma }$$

(2)

where $C$ is the number of states, ${{{\mathbf{\mu }}}}_{c}$ is a unit vector representing the mean direction of all conformations (state center) in state $c$, and $\sigma$ is a scaling hyperparameter set to 0.1. To compute the dispersion loss, it is necessary to first estimate the state center vectors ${\left\{{{{\mathbf{\mu }}}}_{c}\right\}}_{c=1}^{C}$. For robust and efficient training, we employ an exponential-moving-average (EMA)⁶⁸ method to estimate ${\left\{{{{\mathbf{\mu }}}}_{c}\right\}}_{c=1}^{C}$ on-the-fly and update them frequently during training (see the Methods section and SI Fig. S1 for details).
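Eq. 2 and the EMA update of the state centers can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the EMA rule (blend the old center with the new embedding, then renormalize to unit length) follows the form commonly used for hyperspherical prototypes, and the decay `alpha=0.9` and the function names are hypothetical; the paper's exact procedure is given in its Methods section and SI Fig. S1.

```python
import numpy as np

def dispersion_loss(mu, sigma=0.1):
    """Eq. 2: mean over states of log-mean-exp of pairwise center similarities."""
    c = mu.shape[0]
    sim = mu @ mu.T / sigma
    off_diag = ~np.eye(c, dtype=bool)
    per_state = []
    for i in range(c):
        s = sim[i, off_diag[i]]          # similarities to all other centers
        m = s.max()                      # subtract the max for numerical stability
        per_state.append(m + np.log(np.mean(np.exp(s - m))))
    return float(np.mean(per_state))

def ema_update(mu, z_unit, labels, alpha=0.9):
    """Move each assigned center toward new unit embeddings, then renormalize."""
    mu = mu.copy()
    for zi, c in zip(z_unit, labels):
        v = alpha * mu[c] + (1.0 - alpha) * zi
        mu[c] = v / np.linalg.norm(v)
    return mu

def centers_at(degrees):
    a = np.deg2rad(degrees)
    return np.stack([np.cos(a), np.sin(a)], axis=1)

loss_uniform = dispersion_loss(centers_at([0.0, 120.0, 240.0]))
loss_clustered = dispersion_loss(centers_at([0.0, 10.0, 20.0]))

mu_new = ema_update(np.array([[1.0, 0.0], [0.0, 1.0]]),
                    z_unit=np.array([[0.0, 1.0]]), labels=[0])
```

Evenly spread centers give a lower loss than crowded ones, so minimizing this term pushes the state centers apart on the hypersphere.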

Once the regularized hyperspherical latent representations are obtained through the joint optimization of the VAMP-2 and dispersion losses, the metastable states are compact and uniformly separated. As a result, transition states in between these metastable states share equal angular distances to their nearest state centers (Fig. 2d). This prompts us to define an OOD score based on the cosine similarity to quantify the angular distances in the hyperspherical latent space (Eq. (3)), and thus help identify all the transition states automatically and simultaneously.

$${{\mbox{OOD}}} \, {{\mbox{score}}}=-\max \left\{{{{\rm{z}}}}^{{{\rm{T}}}}{{\mathcal{U}}}\right\}+1$$

(3)

where ${{\rm{z}}}$ represents the hyperspherical latent embedding of the conformation, and ${{\mathcal{U}}}$ denotes $\left[{{{\mathbf{\mu }}}}_{1},{{{\mathbf{\mu }}}}_{2},\ldots,{{{\mathbf{\mu }}}}_{C}\right]$. With this definition, the OOD score has a minimum of $0$, with higher values indicating stronger out-of-distribution characteristics, thereby highlighting the structures in transition states. As shown in Fig. 3b, MD conformations situated at two different transition state regions (between basins 1 & 2, and between basins 2 & 3 of the Müller potential in Fig. 2a) are identified simultaneously, as they display equal and the largest OOD scores in the latent space. We then selected MD conformations with OOD scores larger than a threshold as transition state structures. As shown in Fig. 3c, the predicted conformations at various OOD threshold choices all exhibit committor probabilities with an average of 0.5 for both TS12 and TS23 (see the Methods section for the analytical solution of the committor for the Müller potential). Since transition state structures theoretically should have committor probabilities of 0.5 (i.e., equally likely to reach the product or return to the reactant), these findings suggest that TS-DAR can robustly predict the transition state structures. In addition, as the OOD threshold increases from 0.225 to 0.425, the number of predicted transition state structures decreases from ∼1,000 to ∼300 for TS12, and from ∼3,500 to ∼300 for TS23 (Fig. 3c). This provides a way to tune the OOD threshold to obtain a desired number of transition state structures.
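The OOD score of Eq. 3 and the threshold-based selection can be illustrated on a toy circle with three evenly spaced state centers. In this sketch the embeddings are rescaled to unit length before taking the cosine similarity, which is an assumption made so that the score does not depend on the radius $\gamma$; the function name `ood_score` and the toy points are likewise illustrative.

```python
import numpy as np

def ood_score(z, mu):
    """1 minus the largest cosine similarity between an embedding and any center."""
    z_unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    return 1.0 - (z_unit @ mu.T).max(axis=1)

# Three unit-length state centers spread evenly on a circle.
angles = np.deg2rad([0.0, 120.0, 240.0])
mu = np.stack([np.cos(angles), np.sin(angles)], axis=1)

gamma = 2.0
z = gamma * np.array([
    [1.0, 0.0],                                  # sits on state center 1
    [np.cos(np.pi / 3), np.sin(np.pi / 3)],      # midway between centers 1 and 2
])

scores = ood_score(z, mu)
is_transition_state = scores > 0.225   # lowest threshold quoted in the text
```

The point on a state center scores 0, while the point midway between two centers scores 1 − cos(60°) = 0.5, so only the latter passes the threshold. Raising the threshold keeps fewer, more central barrier conformations.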

Previous studies⁵⁷ in the trustworthy AI field have utilized other similarity-based metrics in the latent representations for OOD detection. However, these methods may not be ideally suited for direct application in identifying transition states for protein conformational changes. For example, one can directly detect OOD samples at the state boundaries on the hypersphere and treat them as transition state structures. However, this method may not be well-suited for identifying transition states in our TS-DAR model. This is because the VAMP-2 loss in TS-DAR is designed to optimize the slowest dynamic modes, rendering state boundaries less sensitive and thereby limiting their ability to accurately pinpoint transition state conformations. As shown in SI Fig. S3b, MD conformations located at the state boundaries in the Müller potential do not accurately correspond to true transition states. Recent approaches^{64,69}, including CIDER, aim to detect low-density regions in the latent space as OOD samples, which may serve as potential transition state structures in our context. However, these methods may overlook the transition states separated by relatively low free energy barriers or incorrectly classify the low-density regions as transition states. For example, choosing an appropriate density threshold to concurrently identify two transition states of the Müller potential is challenging. With a density threshold set at $\exp (-7.2{\mbox{k}}T)$, only the transition state between states 1 and 2 can be identified (SI Fig. S4c). Reducing the threshold to $\exp (-6{\mbox{k}}T)$ enables the identification of the transition state between states 2 and 3. However, in this case, the transition state between states 1 and 2 becomes poorly defined (SI Fig. S4d).

TS-DAR outperforms MSM’s committor probabilities and MaxEnt-VAMPNets in identifying transition states for the Müller potential

The Müller potential exhibits three potential energy basins, with two transition states positioned between them (Fig. 2a). We trained our TS-DAR model by specifying 3 metastable states. The validation curves are depicted in SI Fig. S2a, and more training details are in the Methods section. With the trained model, we first plotted the OOD scores of all MD conformations in the latent hyperspherical space (Fig. 3a). As shown in Fig. 3b, the MD conformations positioned between the state centers consistently align with those located at the summit of the energy barriers. This facilitates the straightforward identification of transition state structures. In the three-state model, both transition states situated between the three energy basins are simultaneously captured (Fig. 3b). If one’s interest lies solely in the transition state associated with the highest energy barrier (TS12), we show that a two-state TS-DAR model can also accurately identify this transition state (SI Fig. S5).

We next demonstrate that our TS-DAR outperforms two previously developed methods, MaxEnt-VAMPNets³⁶ and MSM’s committor probabilities (MSM-committor)¹⁵, in identifying the transition states for the Müller potential. For MaxEnt-VAMPNets, the VAMPnets model was trained with three states (see SI Sec. 3 for details). Regarding committor probabilities, the Müller potential was first discretized into 1,000 microstates using k-centers clustering. Committor probabilities were then computed based on the 1,000-state MSM by specifying the source and sink states (see SI Sec. 4 for details). As shown in Fig. 3f, MaxEnt-VAMPNets mistakenly identified energy basin 1, rather than the energy barrier, as the transition state region characterized by high Shannon entropy. We suspect that this misidentification occurs because the Shannon entropy computed from VAMPnets’ outputs lacks a direct physical connection with the true uncertainty of MD data in terms of transitioning in or out of energy basins. Regarding MSM-committor, its performance is heavily dependent on the selection of the source and sink states (SI Fig. S6a, c). When the source and sink states are small (each containing only one microstate), MSM-committor can identify the correct transition state region for TS12, despite assigning relatively low committor probabilities of ∼0.3 to true transition state structures obtained from the analytical solution (Fig. 3d). Strikingly, for TS23, MSM-committor erroneously predicts a broad transition state region that includes even the entire energy basin 1 (Fig. 3e). In this scenario, microstates in energy basin 1 that are far away from the sink and source states are wrongly categorized as transition states. These remote states might have weak connections with the sink and source states, and therefore share an equal probability of transitioning to either, leading to the misidentification shown in Fig. 3e. 
When the source and sink states are relatively large (e.g., containing 100 and 10 most populated microstates for TS12 and TS23, respectively), MSM-committor can successfully uncover the correct regions for both transition states (SI Fig. S6b, d).
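The k-centers discretization mentioned above can be illustrated with the standard greedy farthest-point procedure. This is a generic sketch, not the clustering code used in the paper: the function name `k_centers`, the choice of the first center, and the toy 2D points are assumptions.

```python
import numpy as np

def k_centers(points, k, first=0):
    """Greedy k-centers: repeatedly add the point farthest from all chosen centers."""
    centers = [first]
    # Distance from every point to its nearest chosen center so far.
    dist = np.linalg.norm(points - points[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    # Assign each point to its nearest center (index into `centers`).
    d_all = np.linalg.norm(points[:, None, :] - points[centers][None, :, :], axis=2)
    labels = np.argmin(d_all, axis=1)
    return np.array(centers), labels

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0], [10.0, 0.0]])
centers, labels = k_centers(points, k=3)
```

The resulting labels are the microstate assignments from which a transition matrix, and then committor probabilities for a chosen source and sink, would be computed.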

Transition states for alanine dipeptide

Alanine dipeptide consists of 10 heavy atoms, with its conformational changes often visualized through two backbone torsion angles, $\phi$ and $\psi$ (Fig. 4a, b). To identify its transition states, we trained 3 TS-DAR models for alanine dipeptide using the Cartesian coordinates of its 10 heavy atoms as input features, with the number of metastable states set to 2, 3, and 4, respectively (see Fig. 4c and refer to the Methods section for additional details). In the 2-state TS-DAR model, we successfully identified the transition state (TS13) located on the highest free energy barrier (Fig. 4d). This free energy barrier corresponds to the slowest dynamic transition between the $\beta$ (free energy basin 1 in Fig. 4b) and ${\alpha }_{{\mbox{L}}}$ (free energy basin 3 in Fig. 4b) metastable states of alanine dipeptide. In the 3-state model, we identified additional transition state regions situated between free energy basins 1 and 2 (TS12). It is noteworthy that since the torsion angles are periodic with period 2π, there are two transition state regions that separate basins 1 and 2 $({\alpha }_{{\mbox{R}}})$ (Fig. 4e). In the 4-state model, all the transition states were simultaneously identified, with additional transition states (TS34) detected corresponding to the third slowest dynamic mode, separating the two free energy basins (3 and 4) located on the right-hand side of Fig. 4f. Among the three transition states, TS13 (located at the highest free energy barrier) is the least sampled by our MD simulations, with only 6 conformations found in this region (SI Fig. S7a). In contrast, TS12 is the most sampled, and TS-DAR can provide a varying number of MD conformations in this region by changing the OOD threshold: the number of candidate transition state structures ranges from ∼30 to ∼7,000 as the OOD threshold decreases from 0.4 to 0.2 (SI Fig. S7b). 
To further assess the robustness of TS-DAR and validate its computational efficiency with respect to MD sampling, we also trained our TS-DAR models on a series of down-sampled datasets, with total MD sampling of 3, 30, 90, 300, and 750 ns (the full dataset). As shown in SI Fig. S8a–d, our 3-state TS-DAR models successfully identified TS12 across all down-sampled datasets, yielding results consistent with those obtained from the full dataset (Fig. 4e). Strikingly, as illustrated in SI Fig. S8a, TS-DAR was able to identify TS12 with just 3 ns of total sampling. For TS13, which is located at the highest free energy barrier, TS-DAR requires 90 ns of sampling to properly identify the transition state. In a previous study⁷⁰, TPS was applied to investigate the transition state TS12, which required a total of ∼77 ns of MD simulation sampling.

We then compare the performance of TS-DAR with MSM-committor and MaxEnt-VAMPNets. As shown in Fig. 4g–i, three independent analyses using MSM-committor with different pairs of source and sink states successfully identify three transition state regions for alanine dipeptide, consistent with the findings of a single 4-state TS-DAR model (Fig. 4f). However, the committor probabilities obtained from MSM-committor analysis still depend on the selection of the source and sink states (SI Fig. S9). In contrast, in MaxEnt-VAMPNets models, while MD conformations in the transition state regions TS12 and TS23 are assigned high Shannon entropy values, those in the free energy basins 1 and 2 also receive high Shannon entropy values. Consequently, this complicates the clear identification of TS12 and TS23 (Fig. 4j–l). For TS34, the 4-state MaxEnt-VAMPNets model correctly identifies this transition state (see Fig. 4l). In the 4-state TS-DAR model, we chose a three-dimensional latent space (2D hypersphere), and the centers of the 4 metastable states (indicated by the dashed lines, see Fig. 4c) form a tetrahedral geometry with an inter-state angular distance approximately equal to 109 degrees. In the implementation of the TS-DAR method, we recommend choosing a three-dimensional latent space (2D hypersphere) for models containing 4 or more states.
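The 109-degree angle quoted above for the 4-state model can be checked with a short derivation. This is a sketch under an idealized assumption: the dispersion loss drives the $C$ unit-length state centers to a perfectly symmetric arrangement in which all pairs share the same angle $\theta$ and the centers sum to zero, which is only possible when the latent dimension is at least $C-1$. The symbol $\theta$ is introduced here for illustration.

$$\left\| \sum_{i=1}^{C} \mathbf{\mu}_{i} \right\|^{2} = C + C(C-1)\cos\theta = 0 \quad \Rightarrow \quad \cos\theta = -\frac{1}{C-1}$$

For $C=3$ centers on a circle this gives $\theta = 120^{\circ}$, and for $C=4$ centers on a 2D hypersphere it gives $\theta = \arccos(-1/3) \approx 109.47^{\circ}$, the tetrahedral angle. With more than 4 states on a 2D hypersphere the centers can no longer all be equidistant, so the pairwise angles would differ.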

Transition states for the translocation of a motor protein on DNA

Bacillus cereus alkylpurine glycosylase D (AlkD) is a DNA motor protein that can translocate along dsDNA and repair DNA damage to maintain genome integrity⁷¹. Understanding the translocation of AlkD on dsDNA is crucial for comprehending how it diffuses along the genomic DNA to locate sparse lesion sites efficiently and accurately within the entire genome. We employed TS-DAR to investigate the transition states of the diffusion dynamics of AlkD along a double-stranded DNA over a distance of one base pair. We followed our previous study⁷² to select 684 pairwise distances as the input features (SI Fig. S10) and 4 states to train TS-DAR. In addition, we trained TS-DAR models on different input feature sets (see SI Sec. 2 for details) and found that TS-DAR converges robustly as more features are incorporated (SI Fig. S11). Figure 5a displays the hyperspherical latent representations of AlkD-DNA complexes obtained from the TS-DAR model, where four successively connected free energy basins (indicated by dashed lines) are uncovered and uniformly separated across the hypersphere. These four free energy basins are consistent with those revealed by MSMs in a previous study⁷² (SI Fig. S12). Specifically, free energy basins 1, 2, and 3 correspond to the pre-translocation state, an intermediate state exhibiting a rotation of AlkD on dsDNA, and the post-translocation state, respectively (Fig. 5b). Notably, we do not show basin 4, as it represents a hyper-translocation state in which AlkD has translocated along the dsDNA beyond one base pair. We identify three transition states with high OOD scores (>$0.21$, see Fig. 5a and the Methods section for additional details) that separate adjacent free energy basins in the hyperspherical latent space.

**Fig. 5: TS-DAR identifies transition states of a DNA motor (AlkD) translocating along a double-stranded DNA (dsDNA) over one base pair.**

The previous MSM⁷² revealed a two-step translocation pathway of AlkD on dsDNA involving an intermediate state (Fig. 5b). However, it remains elusive why the transition from the pre-translocation state (basin 1) to the intermediate state (basin 2) constitutes the rate-limiting step, which is an order of magnitude slower than the subsequent step from basin 2 to basin 3 (∼17.8 μs vs. ∼1.3 μs⁷²). This issue arises because MSMs can only provide information on the free energy basins, while kinetic rates are determined by both the free energy basins and the transition states. The transition states revealed by our TS-DAR successfully address this issue and elucidate the molecular mechanisms responsible for the rate-limiting step of the AlkD translocation. Specifically, the transition state (TS12) separating basin 1 and basin 2 exhibits a partial rotation of AlkD on dsDNA (see the top view of Fig. 5b), resulting in an unfavorable interaction energy between the protein and dsDNA compared to both basin 1 and basin 2 (Fig. 5d). Further analysis reveals that this unfavorable interaction energy is associated with significant reconfigurations of hydrogen bonds between AlkD and dsDNA. During the transition from basin 1 to TS12, approximately 10 existing AlkD-dsDNA hydrogen bonds must be broken, while only about 5 new hydrogen bonds are formed (Fig. 5c). Consequently, there is a net loss of ∼5 hydrogen bonds in TS12. For example, the rotational movement of AlkD disrupts the hydrogen bond initially formed between residue R43 and base 23T. Subsequently, a new hydrogen bond forms with an adjacent base, 22G (SI Fig. S13a). Thus, TS12 presents a significant free energy barrier that AlkD must overcome. In contrast, in the transition state TS23, AlkD forms a similar number of hydrogen bonds with the dsDNA compared to basins 2 and 3 (Fig. 5e), along with a comparable AlkD-dsDNA interaction energy (Fig. 5f). This results in a relatively fast transition from the intermediate state to the post-translocation state (basin 3), involving the translation of AlkD on dsDNA (see the side view of Fig. 5b). The transition states identified by TS-DAR thus offer insights into the rate-limiting step in AlkD’s translocation on dsDNA and underscore the important role of hydrogen bonds in governing the dynamics of this conformational change.

For AlkD, the transition state regions revealed by our TS-DAR (SI Fig. S14c) align with those identified through two separate MSM-committor analyses (TS12 in SI Fig. S15c and TS23 in SI Fig. S15d). The committor probabilities obtained from the MSM-committor analysis vary slightly depending on the selection of the source and sink states (SI Fig. S15c–h). The 4-state MaxEnt-VAMPNets model assigned high Shannon entropy values to the free energy basins 1 and 2, erroneously identifying them as transition state regions (SI Fig. S16). In TS-DAR, as the OOD threshold increases from 0.2 to 0.4, the number of transition state structures decreases from approximately 150 to about 20 for TS12, and from roughly 850 to around 100 for TS23 (SI Fig. S14a, b). This enables TS-DAR to provide the desired number of candidate structures for a given transition state by adjusting the OOD threshold.

In this study, we show that the hyperspherical latent representations from the penultimate layer of the TS-DAR model serve as a good reduced kinetic space for understanding the slow dynamics of protein conformational changes. Previous methods rooted in the variational approach, such as tICA^73,74 and SRVs⁷⁵, aim to identify decorrelated orthogonal collective variables (CVs) for dimensionality reduction. However, these methods may hinder a comprehensive understanding of intricate collaborative dynamic motions. In contrast, alternative approaches like RC flow⁷⁶ are specifically designed to reveal latent kinetic manifolds that preserve full-state kinetic information. Nevertheless, these methods are more challenging to train and may struggle to produce clear state boundaries. Here, we highlight the robustness of our hyperspherical latent representations for capturing the relevant reduced kinetics and underscore that these representations benefit from two design choices. First, using the penultimate layer of a deep neural network for representation learning is simple and robust. This design choice allows the learned latent representations to be directly regularized through the optimization of the loss function defined on the output layer, such as the VAMP-2 score in our framework. Second, and more importantly, representation learning on a unit sphere helps the model better capture and describe kinetics. Previous works in the field of deep learning^{77,78,79,80,81,82} have demonstrated that the hyperspherical latent space performs better than the traditional Euclidean space in applications ranging from variational autoencoders to convolutional neural networks for image classification tasks. In our specific context of biomolecular dynamics, the hyperspherical latent space is particularly important, offering greater capacity for describing more complex kinetic geometries, including specific kinetic symmetries and periodicity of the kinetic data. In the future, we anticipate that the hyperspherical latent representations in TS-DAR will have broad applications in the study of biomolecular dynamics. For example, they can provide a good platform for analyzing the parallel transition pathways of complex dynamic systems on the hypersphere. Furthermore, recent developments^76,83 have utilized autoencoder or normalizing flow architectures to learn a reduced kinetic model in the latent space, such as Brownian dynamics, by introducing specific dynamical assumptions or constraints in the loss function. We expect that the penultimate hyperspherical latent space in TS-DAR can be deployed for learning a continuous dynamic model by integrating more physics-driven designs of loss functions.

Our TS-DAR framework can also be applied in various fields beyond its original scope. For example, MaxEnt-VAMPNets can accelerate phase space exploration by iteratively launching parallel, short, unbiased MD trajectories from high-uncertainty regions, identified through the Shannon entropy of the VAMPnets outputs. Similarly, TS-DAR holds significant promise for adaptive sampling of biomolecular dynamics. The OOD scores generated by TS-DAR can serve as a robust metric to select the initial seeds (i.e., the conformations with the highest OOD scores) for adaptive sampling. This anticipated robustness can be attributed to two key reasons. First, as demonstrated in this work, OOD scores enable the automatic and simultaneous identification of transition state (TS) conformations of biomolecular dynamics. TS conformations, positioned at the free energy barriers, serve as ideal candidates for initiating trajectories that can effectively explore the configuration space. Second, OOD scores measure the degree of the distributional shift away from the free energy minima. Under-sampled regions that exhibit distributional shifts away from well-explored metastable states can thus be easily detected by OOD scores. As such, the OOD score can serve as an effective metric for detecting high-uncertainty regions for adaptive sampling in future studies.

Despite the robustness of TS-DAR, our method has potential limitations. First, TS-DAR is trained on MD simulation datasets, so it is essential that the original MD trajectories capture the transitions between free energy basins and sample the transition state regions. Second, the input number of states (N) for TS-DAR cannot be too large. As with VAMPnets, an excessively large N results in empty state assignments and unstable training. Additionally, too many states can overcrowd the hyperspherical latent space, making it difficult to distinguish between conformations in the transition state and those in the free energy basins (metastable states) using angular distance measures.

In this work, we introduce TS-DAR, a deep learning approach designed for detecting transition states from MD simulations by utilizing hyperspherical embeddings in the latent space. Inspired by recent advancements in trustworthy AI for identifying OOD data, TS-DAR treats transition state structures as OOD data. This approach allows TS-DAR to discern transition state conformations that separate multiple metastable states, facilitated by the introduction of a dispersion loss function term. Moreover, the hyperspherical embeddings of MD conformations in TS-DAR retain all relevant kinetic geometries and are compacted in terms of their kinetic metastability through the incorporation of a VAMP-2 loss function term. Collectively, our TS-DAR method establishes an end-to-end pipeline capable of simultaneously and automatically identifying all transition states across multiple free energy barriers underlying protein conformational changes. To demonstrate the efficacy of TS-DAR, we apply it to the 2D Müller potential, alanine dipeptide, and the translocation of a DNA motor protein on dsDNA. Across all three systems, we show that our TS-DAR outperforms MSM-committor and MaxEnt-VAMPNets in terms of efficiency and accuracy in simultaneously identifying all transition states for protein conformational changes. We anticipate that TS-DAR can find widespread application in identifying transition states for protein conformational changes.

Methods

Remove-mean time-instantaneous and time-lagged correlation matrices

Given a set of basis functions ${{\mathcal{{X}}}}={\left[{{{\mathcal{X}}}}_{1},{{{\mathcal{X}}}}_{2},\ldots,{{{\mathcal{X}}}}_{m}\right]}^{{{\rm{T}}}}$, and an MD trajectory of length $T$ ($\left\{{{{\rm{x}}}}_{1},\ldots,{{{\rm{x}}}}_{T}\right\}$), the remove-mean time-instantaneous and time-lagged correlation matrices are defined as follows:

$$\left\{\begin{array}{c} {\bar{{{\rm{C}}}}}_{00}=\frac{1}{T-\tau }{{{\rm{X}}}}^{{{\rm{T}}}}{{\rm{X}}}-{{{\rm{\pi }}}}_{0}{{{\rm{\pi }}}}_{0}^{{{\rm{T}}}}\\ {\bar{{{\rm{C}}}}}_{11}=\frac{1}{T-\tau }{{{\rm{Y}}}}^{{{\rm{T}}}}{{\rm{Y}}}-{{{\rm{\pi }}}}_{1}{{{\rm{\pi }}}}_{1}^{{{\rm{T}}}}\\ {\bar{{{\rm{C}}}}}_{01}=\frac{1}{T-\tau }{{{\rm{X}}}}^{{{\rm{T}}}}{{\rm{Y}}}-{{{\rm{\pi }}}}_{0}{{{\rm{\pi }}}}_{1}^{{{\rm{T}}}}\end{array}\right.$$

(4)

where ${{\rm{X}}}$ and ${{\rm{Y}}}$ are two $(T-\tau)$ by $m$ matrices, defined as ${\left[{{\mathcal{X}}}\left({{{\rm{x}}}}_{1}\right),\ldots,{{\mathcal{X}}}\left({{{\rm{x}}}}_{T-\tau }\right)\right]}^{{{\rm{T}}}}$ and ${\left[{{\mathcal{X}}}\left({{{\rm{x}}}}_{\tau+1}\right),\ldots,{{\mathcal{X}}}\left({{{\rm{x}}}}_{T}\right)\right]}^{{{\rm{T}}}}$ ($\tau$ is the lag time). ${{{\rm{\pi }}}}_{0}$ and ${{{\rm{\pi }}}}_{1}$ are the mean vectors of ${{\rm{X}}}$ and ${{\rm{Y}}}$, equal to $\frac{1}{T-\tau }{{{\rm{X}}}}^{{{\rm{T}}}}1$ and $\frac{1}{T-\tau }{{{\rm{Y}}}}^{{{\rm{T}}}}1$, respectively, where $1$ denotes the all-ones vector of length $T-\tau$. In our method, we parameterized the basis functions, ${{\mathcal{X}}}$, using the SoftMax outputs of two parallel networks with shared parameters from TS-DAR.
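The definitions in Eq. (4) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the basis-function values (for example, the SoftMax outputs) are already available as a $T\times m$ array and that the lag time is given in frames; the function name `remove_mean_correlations` and the argument name `chi` are hypothetical and are not taken from the TS-DAR source code.

```python
import numpy as np

def remove_mean_correlations(chi, tau):
    """Mean-removed correlation matrices of Eq. (4).

    chi : (T, m) array of basis-function values along one trajectory
          (hypothetical input name).
    tau : lag time in frames, 0 < tau < T.
    """
    T = chi.shape[0]
    if not 0 < tau < T:
        raise ValueError("tau must satisfy 0 < tau < T")
    X = chi[:T - tau]          # frames 1 .. T - tau
    Y = chi[tau:]              # frames tau + 1 .. T
    n = T - tau
    pi0 = X.sum(axis=0) / n    # mean vector of X
    pi1 = Y.sum(axis=0) / n    # mean vector of Y
    C00 = X.T @ X / n - np.outer(pi0, pi0)
    C11 = Y.T @ Y / n - np.outer(pi1, pi1)
    C01 = X.T @ Y / n - np.outer(pi0, pi1)
    return C00, C11, C01
```

Subtracting the outer product of the means is algebraically the same as centering $\rm{X}$ and $\rm{Y}$ before forming the products, so `C00` and `C11` are the biased (divide by $T-\tau$) covariance matrices of the two data blocks.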

Exponential-moving-average (EMA) to estimate the metastable state centers

We denote the hyperspherical embeddings of conformations as ${\left\{{{{\rm{z}}}}_{i}\right\}}_{i=1}^{N}$, where ${{{\rm{z}}}}_{i}\in {{\mathbb{R}}}^{d}$, and the metastable state indices as $\left\{1,2,\ldots,C\right\}$. The metastable state centers correspond to vectors ${\left\{{{{\mathbf{\mu }}}}_{c}\right\}}_{c=1}^{C}$ that are updated in an EMA manner:

$${{{\mathbf{\mu }}}}_{c}{{\rm{:=}}}{{\mbox{Normalize}}}\left(\theta {{{\mathbf{\mu }}}}_{c}+\left(1-\theta \right){{{\rm{z}}}}_{i}\right),c={\widetilde{y}}_{i}$$

(5)

where ${\widetilde{y}}_{i}\in \left\{1,2,\ldots,C\right\}$ represents the state index of conformation $i$, referred to as the pseudo-label and obtained from the SoftMax outputs of the TS-DAR neural network. $\theta$ is the state center update factor, which we set to 0.5.
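A minimal pure-Python sketch of the update in Eq. (5) follows. It assumes that embeddings are processed one at a time in the order given and that state indices are 0-based (the text uses 1-based indices); the helper names `normalize` and `ema_update_centers` are illustrative only and do not come from the TS-DAR code base.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        # Happens only if the mixed vector cancels exactly, e.g. z = -mu with theta = 0.5.
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def ema_update_centers(centers, embeddings, pseudo_labels, theta=0.5):
    """Apply Eq. (5) once per embedding and return the updated centers.

    centers       : list of C unit vectors (current state centers)
    embeddings    : list of unit vectors z_i on the hypersphere
    pseudo_labels : list of 0-based state indices, one per embedding
    theta         : state center update factor (0.5 in the text)
    """
    updated = [list(mu) for mu in centers]  # leave the input untouched
    for z, c in zip(embeddings, pseudo_labels):
        mixed = [theta * m + (1.0 - theta) * zi for m, zi in zip(updated[c], z)]
        updated[c] = normalize(mixed)       # project back onto the unit sphere
    return updated
```

The re-normalization step keeps every center on the unit hypersphere, so angular distances between embeddings and centers remain well defined after each update.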

MD simulation datasets

The analytical form of the Müller potential⁶⁵ (Fig. 2a) is as follows:

$${V}_{{{\rm{M\ddot{u}ller}}}}({x}_{1},{x}_{2})=\sum _{i=1}^{4}{A}_{i}\exp ({a}_{i}{({x}_{1}-{\bar{x}}_{i})}^{2}+{b}_{i}({x}_{1}-{\bar{x}}_{i})({x}_{2}-{\bar{y}}_{i})+{c}_{i}{({x}_{2}-{\bar{y}}_{i})}^{2})$$

(6)

where $\left({A}_{1},\ldots,{A}_{4}\right)=\left(-10,-5,-8.5,0.75\right)$, $\left({a}_{1},\ldots,{a}_{4}\right)=\left(-1,-1,-6.5,0.7\right)$, $\left({b}_{1},\ldots,{b}_{4}\right)=\left(0,0,11,0.6\right)$, $\left({c}_{1},\ldots,{c}_{4}\right)=\left(-10,-10,-6.5,0.7\right)$, $\left({\bar{x}}_{1},\ldots,{\bar{x}}_{4}\right)=\left(1,0,-0.5,-1\right)$, $\left({\bar{y}}_{1},\ldots,{\bar{y}}_{4}\right)=\left(0,0.5,1.5,1\right)$. We performed a Brownian dynamics simulation (time step of $2\times {10}^{-4}$, damping factor of 1) to sample this Müller potential at a temperature of $0.9$. A reflective boundary condition was adopted: ${x}_{1}\in \left[-1.5,1.2\right]$, ${x}_{2}\in \left[-0.2,2\right]$. The simulation trajectory contains $3\times {10}^{5}$ frames with a saving interval of 0.01.

For alanine dipeptide, we obtained the MD simulation dataset from a previous study⁸⁴. It contains three 250-ns MD trajectories, with a saving interval of 1 ps. As a result, the entire dataset contains 750,000 MD conformations. All conformations were aligned to the first frame according to the minimal root mean square deviation. The input features consist of the $x$, $y$ and $z$ coordinates of the 10 heavy atoms (30 input features in total).

For the translocation of a DNA motor protein (AlkD) on DNA, we obtained the MD simulation dataset from a previous study⁷². It contains 200 50-ns and 100 45-ns MD trajectories, with a saving interval of 20 ps. As a result, the entire dataset contains 725,300 MD conformations. The input features consist of the pairwise distances of 38 atoms, which comprise the phosphate atoms of five base pairs in the central region of the dsDNA and heavy atoms of five protein helices within 12 Å of the nucleotides. See SI Fig. S10 for an illustration.
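For reference, Eq. (6) with the parameter values listed above can be evaluated as in the following sketch. The function name `muller_potential` and the constant names are chosen here for illustration; only the numerical parameters come from the text.

```python
import math

# Parameters of Eq. (6), copied from the text.
A = (-10.0, -5.0, -8.5, 0.75)
a = (-1.0, -1.0, -6.5, 0.7)
b = (0.0, 0.0, 11.0, 0.6)
c = (-10.0, -10.0, -6.5, 0.7)
X_BAR = (1.0, 0.0, -0.5, -1.0)
Y_BAR = (0.0, 0.5, 1.5, 1.0)

def muller_potential(x1, x2):
    """Evaluate the Müller potential of Eq. (6) at the point (x1, x2)."""
    total = 0.0
    for A_i, a_i, b_i, c_i, xb, yb in zip(A, a, b, c, X_BAR, Y_BAR):
        dx = x1 - xb
        dy = x2 - yb
        total += A_i * math.exp(a_i * dx * dx + b_i * dx * dy + c_i * dy * dy)
    return total
```

Evaluating this function on a grid over the reflective box ${x}_{1}\in[-1.5,1.2]$, ${x}_{2}\in[-0.2,2]$ is one simple way to reproduce an energy landscape like the one referred to as Fig. 2a.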

Analytical solution of committor probabilities for the Müller potential

For this 2D potential, the forward committor function ($q\left({{\rm{x}}}\right)$) can be obtained by solving the backward Kolmogorov equation with Dirichlet boundary conditions:

$$\Bigg \{ \begin{array}{c}\nabla V\cdot \nabla q({{\rm{x}}})-{\beta }^{-1}\Delta q({{\rm{x}}})=0,x\in \Omega \backslash (A\cup B)\\ q({{\rm{x}}})=0,x\in \partial A;q({{\rm{x}}})=1,x\in \partial B\end{array}$$

(7)

where $V$ is the potential energy function, $\beta=\frac{1}{{\mbox{k}}T}$ is the inverse temperature, $\nabla=\left(\frac{\partial }{\partial {x}_{1}},\ldots,\frac{\partial }{\partial {x}_{n}}\right)$ is the gradient operator and $\Delta={\sum }_{i=1}^{n}\frac{{\partial }^{2}}{\partial {x}_{i}^{2}}$ is the Laplace operator. $A$ and $B$ are the source and sink states, and $\partial A$ and $\partial B$ represent the boundaries of the two states, respectively. We used the finite element method (FEM) to solve Eq. (7) for the 2D Müller potential with the solver FreeFem++⁸⁵. Specifically, we selected the regions with potential energy below 5.5 ${\mbox{k}}T$ in basins 1 and 2 as the source and sink states to compute the committor function for TS12, and the regions with potential energy below 3.9 ${\mbox{k}}T$ in basins 2 and 3 as the source and sink states to compute the committor function for TS23.
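To illustrate the structure of Eq. (7), consider its one-dimensional special case, with the source boundary at $x=a$ and the sink boundary at $x=b$. This reduction is an assumption made here purely for illustration; the 2D problem in this work is solved numerically by FEM. In one dimension the equation can be integrated directly:

$$V'(x)\,q'(x)-{\beta }^{-1}q''(x)=0\;\Rightarrow\;\frac{d}{dx}\left[{e}^{-\beta V(x)}q'(x)\right]=0\;\Rightarrow\;q(x)=\frac{{\int }_{a}^{x}{e}^{\beta V(s)}\,ds}{{\int }_{a}^{b}{e}^{\beta V(s)}\,ds}$$

Because the integrand ${e}^{\beta V}$ is largest near the top of the barrier, $q$ rises from 0 to 1 mainly across the barrier region.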

Training details of the TS-DAR models

We specified the scaling factor ($\gamma$) of the hyperspherical embeddings as 1, the scaling hyperparameter in the dispersion loss as 0.1, the weight ($\beta$) of the dispersion loss as 0.01 for the Müller potential and alanine dipeptide datasets and 0.05 for the AlkD-DNA dataset, the state center update factor ($\theta$) as 0.5, and the lag time ($\tau$) as 1 time step for the Müller potential, 1 ps for alanine dipeptide, and 8 ns for the AlkD-DNA dataset. In addition, we established a criterion for determining the dimensionality (denoted as $d$) of the latent hyperspherical embeddings (i.e., ($d-1$)-sphere): $d=3$ if the number of states is more than 3; otherwise, $d=2$. For the specific training hyperparameters of TS-DAR on the Müller potential, alanine dipeptide and AlkD-DNA datasets, please refer to SI Sec. 1.

The validation of $\beta$

The selection of $\beta$ is trivial in the TS-DAR model. However, choosing a proper magnitude of $\beta$ is important for the full optimization of both the VAMP-2 and dispersion losses. In this work, we performed ablation tests on $\beta$ for the Müller potential, alanine dipeptide and AlkD-DNA systems by training the TS-DAR model with different magnitudes of $\beta$. The choice of the magnitude of $\beta$ is guided by the following two criteria: (1) the dispersion loss converges to its lower bound; (2) there is no significant deviation in the VAMP-2 loss before and after integrating the dispersion loss optimization. Please see SI Fig. S17 for details.

The selection of transition state structures based on the OOD scores

Once the TS-DAR model provides OOD scores that effectively distinguish between transition and metastable states, there are multiple ways to select transition state structures based on these scores for practical applications. We recommend that users define the OOD threshold based on the desired number of transition state structures. This strategy is particularly useful for adaptive sampling. Alternatively, one can define the threshold as follows:

$${\mbox{thres}}=0.5\times \left(-\cos \left(\frac{{\theta }^{\star }}{2}\right)+1\right)$$

(8)

where ${\theta }^{\star }$ represents the angle between the two nearest-neighbor state center vectors on the hypersphere after the optimization of the TS-DAR model. In this work, we selected the transition state structures of AlkD translocation along DNA (Fig. 5) by applying the threshold of ∼0.21 defined by Eq. (8).
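Both selection strategies can be sketched in a few lines of Python. The function names `ood_threshold` and `threshold_for_count` are hypothetical, and the example angle assumes four state centers in an ideal tetrahedral arrangement, ${\theta }^{\star }=\arccos(-1/3)\approx 109.47^{\circ}$, which optimized centers only approximate.

```python
import math

def ood_threshold(theta_star):
    """Eq. (8): OOD threshold from the angle (in radians) between the
    two nearest-neighbor state centers."""
    return 0.5 * (1.0 - math.cos(theta_star / 2.0))

def threshold_for_count(scores, n_structures):
    """Count-based alternative: the n-th highest OOD score, so that keeping
    scores >= threshold yields at least n_structures candidates."""
    if not 1 <= n_structures <= len(scores):
        raise ValueError("n_structures must be between 1 and len(scores)")
    return sorted(scores, reverse=True)[n_structures - 1]

# Ideal tetrahedral separation of four state centers (an assumption).
theta_tetra = math.acos(-1.0 / 3.0)
print(round(ood_threshold(theta_tetra), 4))  # 0.2113
```

Under this idealized geometry, Eq. (8) gives a value close to the threshold of ∼0.21 quoted for the 4-state AlkD model.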

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All the MD simulation datasets and the trained TS-DAR models in this study are available in the Zenodo repository at https://zenodo.org/records/13835580⁸⁶. Source data are provided with this paper as a Source Data file.

Code availability

MSM-related analyses were conducted with MSMBuilder^87,88 (version msmbuilder2022-3.8.5), freely available at https://github.com/msmbuilder/msmbuilder2022. Biological analysis of the AlkD dataset was performed using the GROMACS⁸⁹ package (version 2019.6). TS-DAR is built on PyTorch⁹⁰. The source code of TS-DAR and associated documentation are available at https://github.com/xuhuihuang/ts-dar⁹¹.

References

Konovalov, K. A. et al. A comprehensive mechanism for 5-carboxylcytosine-induced transcriptional pausing revealed by Markov state models. J. Biol. Chem. 296, 100735 (2021).
Prinz, J.-H. et al. Markov models of molecular kinetics: generation and validation. J. Chem. Phys. 134, 174105 (2011).
Chodera, J. D., Singhal, N., Pande, V. S., Dill, K. A. & Swope, W. C. Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics. J. Chem. Phys. 126, 155101 (2007).
Husic, B. E. & Pande, V. S. Markov state models: from an art to a science. J. Am. Chem. Soc. 140, 2386–2396 (2018).
Bowman, G. R., Pande, V. S. & Noé, F. in An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation (eds. Bowman, G. R., Pande, V. S. & Noé, F.) 1–6 (Springer Netherlands, 2014).
Pan, A. C. & Roux, B. Building Markov state models along pathways to determine free energies and rates of transitions. J. Chem. Phys. 129, 064107 (2008).
Buchete, N.-V. & Hummer, G. Coarse master equations for peptide folding dynamics. J. Phys. Chem. B 112, 6057–6069 (2008).
Wang, W., Cao, S., Zhu, L. & Huang, X. Constructing Markov State Models to elucidate the functional conformational changes of complex biomolecules. WIREs Comput. Mol. Sci. 8, e1343 (2018).
Huang, X., Bowman, G. R., Bacallado, S. & Pande, V. S. Rapid equilibrium sampling initiated from nonequilibrium data. Proc. Natl Acad. Sci. 106, 19765–19769 (2009).
Malmstrom, R. D., Lee, C. T., Van Wart, A. T. & Amaro, R. E. Application of molecular-dynamics based Markov state models to functional proteins. J. Chem. Theory Comput. 10, 2648–2657 (2014).
Morcos, F. et al. Modeling conformational ensembles of slow functional motions in Pin1-WW. PLOS Comput. Biol. 6, e1001015 (2010).
Zhang, B. W. et al. Simulating replica exchange: markov state models, proposal schemes, and the infinite swapping limit. J. Phys. Chem. B 120, 8289–8301 (2016).
Konovalov, K. A., Unarta, I. C., Cao, S., Goonetilleke, E. C. & Huang, X. Markov state models to study the functional dynamics of proteins in the wake of machine learning. JACS Au 1, 1330–1341 (2021).
Liu, B., Qiu, Y., Goonetilleke, E. C. & Huang, X. Kinetic network models to study molecular self-assembly in the wake of machine learning. MRS Bull. 47, 958–966 (2022).
Noé, F., Schütte, C., Vanden-Eijnden, E., Reich, L. & Weikl, T. R. Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations. Proc. Natl Acad. Sci. 106, 19011–19016 (2009).
Bowman, G. R., Voelz, V. A. & Pande, V. S. Taming the complexity of protein folding. Curr. Opin. Struct. Biol. 21, 4–11 (2011).
Da, L.-T. et al. A jump-from-cavity pyrophosphate ion release assisted by a key lysine residue in T7 RNA polymerase transcription elongation. PLOS Comput. Biol. 11, e1004624 (2015).
Da, L.-T., Wang, D. & Huang, X. Dynamics of pyrophosphate ion release and its coupled trigger loop motion from closed to open state in RNA polymerase II. J. Am. Chem. Soc. 134, 2399–2406 (2012).
Da, L.-T. et al. Bridge helix bending promotes RNA polymerase II backtracking through a critical and conserved threonine residue. Nat. Commun. 7, 11244 (2016).
Silva, D.-A. et al. Millisecond dynamics of RNA polymerase II translocation at atomic resolution. Proc. Natl Acad. Sci. 111, 7665–7670 (2014).
Malmstrom, R. D., Kornev, A. P., Taylor, S. S. & Amaro, R. E. Allostery through the computational microscope: cAMP activation of a canonical signalling domain. Nat. Commun. 6, 7588 (2015).
Kohlhoff, K. J. et al. Cloud-based simulations on Google Exacycle reveal ligand modulation of GPCR activation pathways. Nat. Chem. 6, 15–21 (2014).
Deng, N.-J., Dai, W. & Levy, R. M. How kinetics within the unfolded state affects protein folding: an analysis based on markov state models and an ultra-long MD trajectory. J. Phys. Chem. B 117, 12787–12799 (2013).
Wan, H., Ge, Y., Razavi, A. & Voelz, V. A. Reconciling simulated ensembles of apomyoglobin with experimental hydrogen/deuterium exchange data using bayesian inference and multiensemble markov state models. J. Chem. Theory Comput. 16, 1333–1348 (2020).
Buch, I., Giorgino, T. & De Fabritiis, G. Complete reconstruction of an enzyme-inhibitor binding process by molecular dynamics simulations. Proc. Natl Acad. Sci. 108, 10184–10189 (2011).
Lawrenz, M., Shukla, D. & Pande, V. S. Cloud computing approaches for prediction of ligand binding poses and pathways. Sci. Rep. 5, 7918 (2015).
Silva, D.-A., Bowman, G. R., Sosa-Peinado, A. & Huang, X. A Role for Both Conformational Selection and Induced Fit in Ligand Binding by the LAO Protein. PLOS Comput. Biol. 7, e1002054 (2011).
Plattner, N. & Noé, F. Protein conformational plasticity and complex ligand-binding kinetics explored by atomistic simulations and Markov models. Nat. Commun. 6, 7653 (2015).
Klem, H., Hocky, G. M. & McCullagh, M. Size-and-shape space gaussian mixture models for structural clustering of molecular dynamics trajectories. J. Chem. Theory Comput. 18, 3218–3230 (2022).
Wang, B., Sexton, R. E. & Feig, M. Kinetics of nucleotide entry into RNA polymerase active site provides mechanism for efficiency and fidelity. Biochim. Biophys. Acta (BBA)—Gene Regulatory Mech. 1860, 482–490 (2017).
Khaled, M., Gorfe, A. & Sayyed-Ahmad, A. Conformational and dynamical effects of Tyr32 phosphorylation in K-ras: molecular dynamics simulation and Markov state models analysis. J. Phys. Chem. B 123, 7667–7675 (2019).
Barros, E. P., Demir, Ö., Soto, J., Cocco, M. J. & Amaro, R. E. Markov state models and NMR uncover an overlooked allosteric loop in p53. Chem. Sci. 12, 1891–1900 (2021).
Feng, J., Selvam, B. & Shukla, D. How do antiporters exchange substrates across the cell membrane? An atomic-level description of the complete exchange cycle in NarK. Structure 29, 922–933.e3 (2021).
Son, C. Y., Yethiraj, A. & Cui, Q. Cavity hydration dynamics in cytochrome c oxidase and functional implications. Proc. Natl Acad. Sci. 114, E8830–E8836 (2017).
Jiang, H. et al. Markov state models reveal a two-step mechanism of miRNA loading into the human argonaute protein: selective binding followed by structural re-arrangement. PLOS Comput. Biol. 11, e1004404 (2015).
Kleiman, D. E. & Shukla, D. Active learning of the conformational ensemble of proteins using maximum entropy VAMPNets. J. Chem. Theory Comput. 19, 4377–4388 (2023).
Mardt, A., Pasquali, L., Wu, H. & Noé, F. VAMPnets for deep learning of molecular kinetics. Nat. Commun. 9, 5 (2018).
Wu, H. & Noé, F. Variational approach for learning Markov processes from time series data. J. Nonlinear Sci. 30, 23–66 (2020).
Bolhuis, P. G., Chandler, D., Dellago, C. & Geissler, P. L. Transition path sampling: throwing ropes over rough mountain passes, in the dark. Annu. Rev. Phys. Chem. 53, 291–318 (2002).
Dellago, C., Bolhuis, P. G. & Geissler, P. L. in Advances in Chemical Physics (eds Prigogine, I. & Rice, S. A.) 1–78 (Wiley, 2002).
Jung, H., Okazaki, K.-I. & Hummer, G. Transition path sampling of rare events by shooting from the top. J. Chem. Phys. 147, 152716 (2017).
Zhang, J., Yang, Y. I., Yang, L. & Gao, Y. Q. Dynamics and kinetics study of “in-water” chemical reactions by enhanced sampling of reactive trajectories. J. Phys. Chem. B 119, 14505–14514 (2015).
Zhang, J. et al. Rich dynamics underlying solution reactions revealed by sampling and data mining of reactive trajectories. ACS Cent. Sci. 3, 407–414 (2017).
Zhang, J. et al. Deep reinforcement learning of transition states. Phys. Chem. Chem. Phys. 23, 6888–6895 (2021).
Jónsson, H., Mills, G. & Jacobsen, K. W. in Classical and Quantum Dynamics in Condensed Phase Simulations 385–404 (World Scientific, 1998).
E, W., Ren, W. & Vanden-Eijnden, E. String method for the study of rare events. Phys. Rev. B 66, 052301 (2002).
E, W., Ren, W. & Vanden-Eijnden, E. Finite temperature string method for the study of rare events. J. Phys. Chem. B 109, 6688–6693 (2005).
E, W. & Vanden-Eijnden, E. Towards a theory of transition paths. J. Stat. Phys. 123, 503–523 (2006).
Maragliano, L., Fischer, A., Vanden-Eijnden, E. & Ciccotti, G. String method in collective variables: minimum free energy paths and isocommittor surfaces. J. Chem. Phys. 125, 024106 (2006).
Pan, A. C., Sezer, D. & Roux, B. Finding transition pathways using the string method with swarms of trajectories. J. Phys. Chem. B 112, 3432–3440 (2008).
Roux, B. String method with swarms-of-trajectories, mean drifts, lag time, and committor. J. Phys. Chem. A 125, 7558–7571 (2021).
He, Z., Chipot, C. & Roux, B. Committor-consistent variational string method. J. Phys. Chem. Lett. 13, 9263–9271 (2022).
Lai, R. & Lu, J. Point cloud discretization of fokker-planck operators for committor functions. Multiscale Modeling Simul. 16, 710–726 (2018).
Khoo, Y., Lu, J. & Ying, L. Solving for high-dimensional committor functions using artificial neural networks. Res. Math. Sci. 6, 1 (2018).
Li, Q., Lin, B. & Ren, W. Computing committor functions for the study of rare events using deep learning. J. Chem. Phys. 151, 054112 (2019).
Prinz, J.-H., Held, M., Smith, J. C. & Noé, F. Efficient computation, sensitivity, and error analysis of committor probabilities for complex dynamical processes. Multiscale Modeling Simul. 9, 545–567 (2011).
Yang, J., Zhou, K., Li, Y. & Liu, Z. Generalized out-of-distribution detection: a survey. Int. J. Comput. Vis. 132, 5635–5662 (2024).
Amodei, D. et al. Concrete problems in AI safety. Preprint at https://arxiv.org/abs/1606.06565 (2016).
Dietterich, T. G. Steps Toward Robust Artificial Intelligence. AI Mag. 38, 3–24 (2017).
Leike, J. et al. AI safety gridworlds. Preprint at https://arxiv.org/abs/1711.09883 (2017).
Smuha, N. A. The EU approach to ethics guidelines for trustworthy artificial intelligence. Comput. Law Rev. Int. 20, 97–106 (2019).
Shneiderman, B. Bridging the gap between ethics and practice: guidelines for reliable, safe, and trustworthy human-centered AI systems. ACM Trans. Interact. Intell. Syst. 10, Article 26 (2020).
Mohseni, S. et al. Taxonomy of machine learning safety: a survey and primer. ACM Comput. Surv. 55, 157 (2022).
Ming, Y., Sun, Y., Dia, O. & Li, Y. How to exploit hyperspherical embeddings for out-of-distribution detection? In The Eleventh International Conference on Learning Representations (ICLR, 2023).
Müller, K. & Brown, L. D. Location of saddle points and minimum energy paths by a constrained simplex optimization procedure. Theoretica Chim. Acta 53, 75–93 (1979).
Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In: Proc. 37th International Conference on Machine Learning Vol. 119 (eds Daumé III, H. & Singh, A.) 1597–1607 (PMLR, Proceedings of Machine Learning Research, 2020).
Wang, T. & Isola, P. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. In: Proc. 37th International Conference on Machine Learning Vol. 119 (eds Daumé III, H. & Singh, A.) 9929–9939 (PMLR, Proceedings of Machine Learning Research, 2020).
Li, J., Xiong, C. & Hoi, S. MoPro: webly supervised learning with momentum prototypes. In International Conference on Learning Representations (ICLR, 2021).
Sun, Y., Ming, Y., Zhu, L. & Li, Y. Out-of-distribution detection with deep nearest neighbors. In Proc. 39th International Conference on Machine Learning 20827–20840 (PMLR, 2022).
Bolhuis, P. G., Dellago, C. & Chandler, D. Reaction coordinates of biomolecular isomerization. Proc. Natl Acad. Sci. 97, 5877–5882 (2000).
Jones, L. E. Jr. et al. Differential effects of reactive nitrogen species on DNA base excision repair initiated by the alkyladenine DNA glycosylase. Carcinogenesis 30, 2123–2129 (2009).
Peng, S. et al. Target search and recognition mechanisms of glycosylase AlkD revealed by scanning FRET-FCS and Markov state models. Proc. Natl Acad. Sci. 117, 21889–21895 (2020).
Schwantes, C. R. & Pande, V. S. Improvements in Markov state model construction reveal many non-native interactions in the folding of NTL9. J. Chem. Theory Comput. 9, 2000–2009 (2013).
Pérez-Hernández, G., Paul, F., Giorgino, T., De Fabritiis, G. & Noé, F. Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys. 139, 015102 (2013).
Chen, W., Sidky, H. & Ferguson, A. L. Nonlinear discovery of slow molecular modes using state-free reversible VAMPnets. J. Chem. Phys. 150, 214114 (2019).
Wu, H. & Noé, F. Reaction coordinate flows for model reduction of molecular kinetics. J. Chem. Phys. 160, 044109 (2024).
Xu, J. & Durrett, G. Spherical latent spaces for stable variational autoencoders. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing 4503–4513 (Association for Computational Linguistics, Brussels, Belgium, 2018).
Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T. & Tomczak, J. M. Hyperspherical variational auto-encoders. In: 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018 (eds Silva, R. & Globerson, A.) 856–865 (Association For Uncertainty in Artificial Intelligence (AUAI), 2018).
Bojanowski, P. & Joulin, A. Unsupervised learning by predicting noise. In: Proc. 34th International Conference on Machine Learning Vol. 70 (eds Precup, D. & Teh, Y. W.) 517–526 (PMLR, Proceedings of Machine Learning Research, 2017).
Mettes, P., Van der Pol, E. & Snoek, C. Hyperspherical prototype networks. Adv. Neural Inform. Process. Syst. 32 (2019).
Liu, W. et al. SphereFace: deep hypersphere embedding for face recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 6738–6746 (IEEE, 2017).
Wang, F., Xiang, X., Cheng, J. & Yuille, A. L. NormFace: L2 hypersphere embedding for face verification. In Proc. 25th ACM international conference on Multimedia 1041–1049 (Association for Computing Machinery, 2017).
Wang, D., Wang, Y., Evans, L. & Tiwary, P. From latent dynamics to meaningful representations. J. Chem. Theory Comput. 20, 3503–3513 (2024).
Nüske, F. et al. Markov state models from short non-equilibrium simulations—analysis and correction of estimation bias. J. Chem. Phys. 146, 094104 (2017).
Hecht, F. New development in FreeFem++. J. Numer. Math. 20, 251–266 (2012).
Liu, B. et al. Data for: exploring transition states of protein conformational changes via out-of-distribution detection in the hyperspherical latent space. Zenodo https://doi.org/10.5281/zenodo.13835580 (2024).
Beauchamp, K. A. et al. MSMBuilder2: modeling conformational dynamics on the picosecond to millisecond scale. J. Chem. Theory Comput. 7, 3412–3419 (2011).
Harrigan, M. P. et al. MSMBuilder: statistical models for biomolecular dynamics. Biophysical J. 112, 10–15 (2017).
Abraham, M. J. et al. GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1–2, 19–25 (2015).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inform. Process. Syst. 32 (2019).
Liu, B. xuhuihuang/ts-dar: exploring transition states of protein conformational changes via out-of-distribution detection in the hyperspherical latent space. Zenodo https://doi.org/10.5281/zenodo.14037314 (2024).

Acknowledgements

X.H. acknowledges support from the NIH/NIGMS under award number 1 R01GM147652 and from the Hirschfelder Professorship Fund of the University of Wisconsin-Madison. Y.L. acknowledges support from the AFOSR Young Investigator Program under award number FA9550-23-1-0184, National Science Foundation (NSF) awards IIS-2237037 and IIS-2331669, the Office of Naval Research under grant number N00014-23-1-2643, and faculty research awards/gifts from Google and Meta. X.H. and Y.L. are also supported by the Research Forward Fund from the University of Wisconsin-Madison Office of the Vice Chancellor for Research with funding from the Wisconsin Alumni Research Foundation. X.D. is supported by the Jane Street Graduate Research Fellowship. We also acknowledge helpful discussions with Siqin Cao and Michael Kalin.

Author information

Authors and Affiliations

Department of Chemistry, Theoretical Chemistry Institute, University of Wisconsin-Madison, Madison, WI, 53706, USA
Bojun Liu, Jordan G. Boysen, Ilona Christy Unarta & Xuhui Huang
Data Science Institute, University of Wisconsin-Madison, Madison, WI, 53706, USA
Bojun Liu, Ilona Christy Unarta, Yixuan Li & Xuhui Huang
Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, 53706, USA
Xuefeng Du & Yixuan Li

Authors

Bojun Liu
Jordan G. Boysen
Ilona Christy Unarta
Xuefeng Du
Yixuan Li
Xuhui Huang

Contributions

B.L. developed the methodology, established the software, and produced the majority of the results. J.B. contributed to conceptual discussions and validated the results. I.U. conducted the biological analysis of the AlkD-dsDNA system. X.D. also contributed to conceptual discussions. Y.L. contributed conceptually to the methodology development. X.H. initiated this study, contributed conceptually to the methodology development, and provided supervision throughout the project. All authors contributed to writing the paper.

Corresponding author

Correspondence to Xuhui Huang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Matteo Degiacomi, Johan Ulander and the other, anonymous, reviewer for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Reporting Summary

Source data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Liu, B., Boysen, J.G., Unarta, I.C. et al. Exploring transition states of protein conformational changes via out-of-distribution detection in the hyperspherical latent space. Nat Commun 16, 349 (2025). https://doi.org/10.1038/s41467-024-55228-4

Received: 07 June 2024
Accepted: 05 December 2024
Published: 02 January 2025
DOI: https://doi.org/10.1038/s41467-024-55228-4
