Introduction

Noise-induced transitions are ubiquitous in nature and occur in diverse systems with multi-stable states1. Examples include switches between different voltage and current states in circuits2, noisy genetic switches3, noise-induced biological homochirality of early-life self-replicators4, protein conformational transitions5,6, and chemical reactions7 with multi-stable probability distributions8. Learning noise-induced transitions is vital for understanding the critical phenomena of these systems. In many scenarios, only time series are available, without the mathematical equations being known a priori. To effectively learn and predict noise-induced transitions from time series, it is necessary to distinguish both slow and fast time scales: fast relaxation around distinct stable states and slow transitions between them, where the fast time-scale signals are often referred to as noise9,10. Consequently, learning stochastic transitions from time series remains elusive in general.

Recently, many efforts have been made to learn dynamics from data by machine-learning methods11,12,13,14,15,16,17,18,19,20. One type of approach uses Sparse Identification of Nonlinear Dynamics (SINDy) to identify nonlinear dynamics, denoise time-series data, and parameterize the noise probability distribution from data21,22. Due to the nonconvexity of the optimization problem, this method may struggle to robustly handle large function libraries in the regression. Another type of approach employs physics-informed neural networks for data-driven solutions and discoveries of partial differential equations23,24,25, or for extracting Koopman eigenfunctions from data26. However, this method requires an extensive quantity of data to train the deep neural network, as well as refinements of the network architecture.

Despite the broad application of the aforementioned methods, to our knowledge, they have not been utilized to study noise-induced transitions. To learn noise-induced transitions, we first apply SINDy22 and a recurrent neural network (RNN)27 to data with noise. We find that SINDy and the RNN do not faithfully predict stochastic transitions, even for a one-dimensional bistable system with Gaussian white noise. We also apply filters28,29 to the data, obtain the smoothed time series, and then process the filtered data with SINDy21. Still, this method does not accurately capture noise-induced transitions. Similarly, the First-Order, Reduced, and Controlled Error (FORCE) learning method30, including its variants full-FORCE and the spiking neuron model31, does not fully capture stochastic transitions in the experimental data and requires relatively high computational cost. These attempts indicate that these conventional methods were mainly designed to denoise noisy data and learn the deterministic dynamics, rather than to capture noise-induced phenomena. We thus need to develop a new approach to predict stochastic transitions.

We notice that one machine-learning architecture, reservoir computing (RC)11,17, may be suitable for this task. Training a reservoir computer only requires linear regression, which is less computationally expensive than neural networks that require back-propagation. Reservoir computing has been found effective for learning dynamical systems12,32,33, including chaotic systems34,35,36,37,38. Recent research started to apply reservoir computing to stochastic resonance39; however, the functional role of noise in shifting dynamics between stable states has not been investigated. Another attempt employed RC for noise-induced transitions40 but relied on the impractical assumption that the deterministic dynamical equation is known beforehand. In practice, prior knowledge of the deterministic dynamics is often lacking, and the dynamics sometimes cannot even be directly described by an equation6. Thus, can we forecast noise-induced transitions solely based on data, without any prior knowledge of the underlying equation?

In this study, we develop a framework of multi-scaling reservoir computing for learning noise-induced transitions in a model-free manner. The present method is inspired by the finding that the hyperparameter α in the reservoir determines the time scale of the reservoir dynamics41. Given a multi-scale time series, we can thus tune the hyperparameter α to match the slow time-scale dynamics. After the reservoir captures the slow time-scale dynamics by fitting the output layer matrix, we can separate the fast time-scale series as a noise distribution. During the predicting phase, we utilize the trained reservoir computer to simulate the slow time-scale dynamics, and then add back the noise sampled from the separated noise distribution (for white noise) or learnt from a second reservoir (for colored noise). The whole protocol is iterated over time points as a rolling prediction. Notably, the present method differs from previous work that regards noise merely as a disturbance22,42, and instead focuses on capturing noise-induced transitions from the data.

To demonstrate the effectiveness of the present method, we apply it to two categories of scenarios. One category has data generated from stochastic differential equations (SDE) for the purpose of testing the method, and the other has experimental data6. The first category with white noise includes a one-dimensional (1D) bistable gradient system, two-dimensional (2D) bistable gradient and non-gradient systems43, 1D and 2D gradient systems with a tilted potential, a 2D tilted non-gradient system, and a 2D tristable system44. The present approach can capture statistics of the transition time and the number of transitions. For the first category with colored noise, we study a 1D bistable gradient system with Lorenz noise (Lorenz-63 model and Lorenz-96 model40), and accurately predict the specific transition time, without the assumption of knowing the deterministic part of the dynamics as required in ref. 40. For the second category, we apply the approach to the protein folding data6, and explore the least amount of data required for accurate training, which can help reduce the demand for extensive measurements in experiments.

Results

The problem and conventional approaches

To study noise-induced transitions, we consider two types of data: one type is generated from SDE and the other type is experimental data. First, we use data generated from SDE. The continuous-state and continuous-time Markovian stochastic dynamics can be given as

$$\dot{{{{\bf{u}}}}}=f({{{\bf{u}}}})+\sigma \xi (t),$$
(1)

where the vector \(\dot{{{{\bf{u}}}}}\) is the time derivative, the deterministic part of the dynamics is f(u), and σ corresponds to the noise strength. The ξ(t) is a k-dimensional Gaussian white noise with \(\langle \xi (t)\rangle=0,\langle \xi (t){\xi }^{\top }({t}^{{\prime} })\rangle=\delta (t-{t}^{{\prime} }){I}_{k}\), where ⊤ denotes the transpose, Ik is the k-dimensional identity matrix, \(\delta (t-{t}^{{\prime} })\) is the Dirac δ function, and 〈  〉 represents the average.
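For data generation in the examples below, Eq. (1) is integrated numerically with a fixed time step δt. As a worked reference, a standard Euler-Maruyama discretization (the integration scheme here is an illustrative choice, not a prescribed part of the method) reads

$${{{{\bf{u}}}}}_{t+\delta t}={{{{\bf{u}}}}}_{t}+f({{{{\bf{u}}}}}_{t})\,\delta t+\sigma \sqrt{\delta t}\,{{{{\boldsymbol{\zeta }}}}}_{t},\qquad {{{{\boldsymbol{\zeta }}}}}_{t}\sim {{{\mathcal{N}}}}(0,{I}_{k}),$$

where the factor \(\sqrt{\delta t}\) follows from the δ-correlation of ξ(t).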

Recent methods for learning dynamical models from time series have not directly handled stochastic transitions (Supplementary Fig. 1). We apply the methods related to our work to the example below. First, we utilize two types of SINDy, SINDy-202122 and SINDy-201621,45 (Supplementary Fig. 2, Supplementary Table 1). SINDy-2021 can learn dynamics from data with noise and separate the noise distribution. However, it does not faithfully find stable states or predict stochastic transitions, while requiring high computational cost (Supplementary Table 2). SINDy-2016 also does not capture stochastic transitions from data with noise. The second method is an RNN27,46, which still does not accurately predict stochastic transitions (Supplementary Fig. 3). Besides, SINDy-2016 is not designed for data with noise; we thus preprocess the data by filters (the Kalman filter28 and the Savitzky-Golay filter29). However, SINDy-2016 still does not predict noise-induced transitions for the filtered data (Supplementary Fig. 4, Supplementary Table 3). Moreover, we find that the FORCE learning method31 can capture transitions in the bistable system with white noise (Supplementary Figs. 5, 6), but requires higher computational cost. It also does not faithfully learn stochastic transitions from the experimental data (Supplementary Figs. 7, 8).

The framework of multi-scaling reservoir computing

Given that previous approaches are not applicable to noise-induced transitions, we leverage reservoir computing to learn the transitions. In reservoir computing12,34, the input layer of the reservoir transforms the time series into the reservoir network, while the output layer transforms the reservoir variables back into a time series. The output layer is trained to minimize the difference between the input and output, while tuning the hyperparameters. The scheme is as follows:

$${{{{\bf{r}}}}}_{t+1}=(1-\alpha ){{{{\bf{r}}}}}_{t}+\alpha \tanh (A{{{{\bf{r}}}}}_{t}+{W}_{in}{{{{\bf{u}}}}}_{t}),$$
(2)
$${\tilde{{{{\bf{u}}}}}}_{t+1}={W}_{out}{{{{\bf{r}}}}}_{t+1}.$$
(3)

Here, the vector u is an n-dimensional state vector, and the initial condition is u0, with the subscript denoting time; Win is the input matrix with values uniformly sampled in \(\left[-{K}_{in},{K}_{in}\right]\); r is the N-dimensional reservoir state vector; A is the adjacency matrix of an Erdős-Rényi network with average degree D, describing the reservoir connections between the N nodes; and ρ is the spectral radius of A. The \(\tanh\) is the activation function used in this study. The \(\tilde{{{{\bf{u}}}}}\) is the output vector and Wout is the output matrix. The α is the leak hyperparameter, representing the time scale41, which becomes clearer when we rewrite Eq. (2) in its continuous-time form:

$$\frac{1}{\alpha }\dot{{{{\bf{r}}}}}=-{{{\bf{r}}}}+\tanh (A{{{\bf{r}}}}+{W}_{in}{{{\bf{u}}}}).$$
(4)
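As a concrete illustration of Eqs. (2) and (3), a minimal Python sketch of the reservoir construction and driving is given below; the function names and the exact way the random matrices are drawn are illustrative choices rather than a prescribed implementation.

```python
# A minimal sketch of the reservoir of Eqs. (2)-(3); names are illustrative only.
import numpy as np

def build_reservoir(N, D, rho, K_in, n_inputs, seed=0):
    rng = np.random.default_rng(seed)
    # Sparse random adjacency with average degree D, rescaled to spectral radius rho
    mask = rng.random((N, N)) < D / N
    A = mask * rng.uniform(-1.0, 1.0, size=(N, N))
    A *= rho / np.max(np.abs(np.linalg.eigvals(A)))
    W_in = rng.uniform(-K_in, K_in, size=(N, n_inputs))
    return A, W_in

def drive_reservoir(u, A, W_in, alpha, r0=None):
    """Apply Eq. (2) along an input time series u of shape (T, n_inputs)."""
    N = A.shape[0]
    r = np.zeros(N) if r0 is None else r0
    R = np.zeros((len(u), N))
    for t, u_t in enumerate(u):
        r = (1 - alpha) * r + alpha * np.tanh(A @ r + W_in @ u_t)
        R[t] = r
    return R  # reservoir states r_1, ..., r_T
```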

In the training phase, only Wout is trained to minimize the difference between the output time series and the training data47. With the regularization term, the loss function is given by

$$L={\sum}_{t=1}^{T}| | {{{{\bf{u}}}}}_{t}-{W}_{out}{{{{\bf{r}}}}}_{t}| {| }^{2}+\beta | | {W}_{out}| {| }^{2},$$
(5)

where β is the regression hyperparameter. We then regress the matrix Wout by minimizing the loss function (Methods). By stacking the vectors at different time points into matrices U = [u1, …, uT] and R = [r1, …, rT], with t = 1, …, T, the solution can be written in a compact form:

$${W}_{out}=({{{\bf{U}}}}{{{{\bf{R}}}}}^{\top })\cdot {({{{\bf{R}}}}{{{{\bf{R}}}}}^{\top }+\beta )}^{-1}.$$
(6)

Determining Wout is a simple linear regression, which is less computationally expensive than a neural network that requires back-propagation.
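A minimal sketch of the readout training of Eq. (6) is given below, assuming the reservoir states have been collected as above; the shape convention (U of size n × T, R of size N × T) and the function name are illustrative choices.

```python
# Ridge-regression readout of Eq. (6), solved without forming an explicit inverse.
import numpy as np

def train_readout(U, R, beta):
    """U: (n, T) training data, R: (N, T) reservoir states, beta: regularization."""
    N = R.shape[0]
    # W_out = (U R^T) (R R^T + beta I)^{-1}
    return np.linalg.solve(R @ R.T + beta * np.eye(N), R @ U.T).T
```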

The framework for learning noise-induced transitions using multi-scaling reservoir computing is summarized in Fig. 1. A reservoir acquires a time series u that contains signals with both fast and slow time scales. Given that α characterizes the time scale of the reservoir computer41, we first search for an appropriate value of α to capture the slow time scale. After identifying an appropriate α value, additional searches are conducted to find suitable values for the other hyperparameters. This process aims to improve the accuracy of the results and obtain the trained slow-scale model. We utilize the trained slow-scale model to separate the noise distribution from the original time series. Then, we sample noise from the separated distribution and employ the trained slow-scale model for rolling prediction.

Fig. 1: Framework of learning noise-induced transitions by multi-scaling reservoir computing.
figure 1

a The training data is a time series u with slow and fast time scales, and the fast time-scale part can be considered as noise, causing noise-induced transitions between stable states. In the training phase, at each time step t, the reservoir takes in ut through a matrix Win and has a reservoir state rt with a connection matrix A. The output matrix Wout is trained to fit the output time series to the training data at the next time point. Tuning the hyperparameter α alters the time scale of the output \(\widetilde{{{{\bf{u}}}}}\), and a properly chosen α leads to a match with the slow time-scale data. Then, \({{{\bf{u}}}}-\widetilde{{{{\bf{u}}}}}\) separates the fast time-scale signal η as a noise distribution. b In the predicting phase, the \({\widehat{{{{\bf{u}}}}}}_{s}\) is put into the trained reservoir to generate \({\widetilde{{{{\bf{u}}}}}}_{s+1}\). At the next time step s + 1, the input \({\widehat{{{{\bf{u}}}}}}_{s+1}\) is \({\widetilde{{{{\bf{u}}}}}}_{s+1}\) plus the noise ηs+1 sampled from the separated noise distribution. This process is iterated as a rolling prediction56 to generate the time series \(\widehat{{{{\bf{u}}}}}\). c The evaluation of the predicted transition statistics. In the middle, different colored lines of \(\widehat{{{{\bf{u}}}}}\) represent replicates of the predictions. The accuracy is evaluated by the statistics of transition time and the number of transitions. PDF: probability density function.

We use trial and error to search for appropriate hyperparameters47. In detail, the first strategy employs the information of stable states in the training set, which can be inferred by segmenting the time series between large jumps and calculating the mean value of each segment6. If reservoir computing effectively captures the slow time-scale dynamics, the trajectories generated from various initial points (e.g., ten chosen points) should converge to the corresponding stable states. To achieve this, we tune the hyperparameter α and then refine the remaining hyperparameters. In the case of nonconvergence, the hyperparameter is adjusted in the opposite direction. When adjustments of a hyperparameter do not further improve the convergence, we turn to the next hyperparameter47. The second strategy does not rely on prior information about the stable states; instead, the hyperparameters are searched by evaluating the power spectral density (PSD). The accuracy of learning the deterministic dynamics is quantified by the match between the PSDs of the predicted time series and the training data (Supplementary Fig. 9). Thus, the match between the PSDs serves as another indicator of a proper choice of hyperparameters.
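As an illustration of the second strategy, the PSDs can be compared, for example, with Welch's method; the specific mismatch score in the sketch below is an illustrative choice rather than a prescribed metric.

```python
# A sketch of the PSD-based check: compare the power spectral density of a predicted
# series with that of the training data. The log-scale mismatch score is illustrative.
import numpy as np
from scipy.signal import welch

def psd_mismatch(u_train, u_pred, dt, nperseg=1024):
    _, p_train = welch(u_train, fs=1.0 / dt, nperseg=nperseg)
    _, p_pred = welch(u_pred, fs=1.0 / dt, nperseg=nperseg)
    # Smaller values indicate a better match, hence a better choice of hyperparameters
    return np.mean((np.log10(p_train + 1e-12) - np.log10(p_pred + 1e-12)) ** 2)
```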

After finding the appropriate hyperparameters, we utilize the trained slow-scale model to separate the noise distribution. Within the training phase, at time step t, the reservoir accepts the input ut, resulting in an output \({\tilde{{{{\bf{u}}}}}}_{t+1}\). Then, the noise at time step t + 1 can be computed as

$${{{{\boldsymbol{\eta }}}}}_{t+1}={{{{\bf{u}}}}}_{t+1}-{\tilde{{{{\bf{u}}}}}}_{t+1}.$$
(7)

We then obtain the noise time series and its distribution, as depicted in Fig. 1a. We can then proceed to the rolling prediction (Fig. 1b). In the predicting phase, at time step s, the reservoir accepts \({\widehat{{{{\bf{u}}}}}}_{s}\), yielding the output \({\tilde{{{{\bf{u}}}}}}_{s+1}\). By adding ηs+1, sampled from the noise distribution, to the output \({\tilde{{{{\bf{u}}}}}}_{s+1}\) as

$${\widehat{{{{\bf{u}}}}}}_{s+1}={\tilde{{{{\bf{u}}}}}}_{s+1}+{{{{\boldsymbol{\eta }}}}}_{s+1},$$
(8)

the \({\widehat{{{{\bf{u}}}}}}_{s+1}\) is used as the input for the time step s + 1. The vector \(\widehat{{{{\bf{u}}}}}\) is the prediction.
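A minimal sketch of this rolling prediction for white noise is given below, reusing the update of Eq. (2), the readout of Eq. (3), and the residual samples η obtained from Eq. (7); all names refer to the earlier sketches and are illustrative.

```python
# Rolling prediction of Eqs. (7)-(8): the trained slow-scale model advances the state,
# and a noise sample drawn from the separated residuals eta is added back at each step.
import numpy as np

def rolling_predict(u0, r0, A, W_in, W_out, alpha, eta_samples, n_steps, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    u_hat, r = u0, r0
    trajectory = [u_hat]
    for _ in range(n_steps):
        r = (1 - alpha) * r + alpha * np.tanh(A @ r + W_in @ u_hat)   # Eq. (2)
        u_tilde = W_out @ r                                            # Eq. (3)
        eta = eta_samples[rng.integers(len(eta_samples))]              # separated noise
        u_hat = u_tilde + eta                                          # Eq. (8)
        trajectory.append(u_hat)
    return np.array(trajectory)
```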

As illustrated in Fig. 1c, to validate that the present method accurately captures noise-induced transitions, we compare the prediction with the test data. For white noise that is memoryless, we quantify the accuracy of the prediction by the statistics of noise-induced transitions. Instead of predicting a single transition, we focus on learning the statistics of transition time and the number of transitions from a set of trajectories. For colored noise, we aim to accurately forecast the occurrence of a specific noise-induced transition.

We next proceed with two categories of examples. One category is data generated from stochastic differential equations, including a 1D bistable gradient system and a 2D bistable non-gradient system with white noise, as well as a 1D bistable gradient system with colored noise. More examples are provided in Supplementary Note: a 1D tilted bistable gradient system (Supplementary Fig. 10, Supplementary Table 4), a 2D bistable gradient system (Supplementary Fig. 11, Supplementary Table 5), 2D tilted bistable gradient (Supplementary Fig. 12, Supplementary Table 6) and non-gradient (Supplementary Fig. 13, Supplementary Table 6) systems, a 2D tristable system (Supplementary Fig. 14, Supplementary Table 7), and a 1D bistable system with high-dimensional colored noise (Supplementary Fig. 15, Supplementary Table 8). The second category focuses on experimental data, where we apply the present method to protein folding data6. We also assess the performance of using a small part of the dataset (Supplementary Fig. 16).

Examples

A bistable gradient system with white noise

As a first example, we consider a 1D bistable gradient system with white noise9:

$${\dot{u}}_{1}=-b(-{u}_{1}+{u}_{1}^{3}+c)+\sqrt{2\varepsilon b}{\xi }_{1}(t),\quad t\ge 0.$$
(9)

The ξ1(t) is a Gaussian white noise. The parameter b denotes the strength of the diffusion coefficient, ε is the noise strength, and c controls the tilt of the two potential wells. The system has noise-induced transitions between the two potential wells, as illustrated in Fig. 2a. We generated a time series lasting 20000δt, with the training set t ∈ [0, 100] and the predicting set t ∈ [100, 200]. Figure 2b shows the first 3000δt of the training set.
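For reference, the following sketch generates such a trajectory from Eq. (9) with the parameters quoted in Fig. 2b, using an Euler-Maruyama step as in the discretization above; the integration scheme and function name are illustrative choices.

```python
# Generating training data from Eq. (9) by Euler-Maruyama (illustrative sketch).
import numpy as np

def simulate_bistable(b=5.0, c=0.0, eps=0.3, dt=0.01, n_steps=20000, u0=1.5, seed=0):
    rng = np.random.default_rng(seed)
    u = np.empty(n_steps + 1)
    u[0] = u0
    for t in range(n_steps):
        drift = -b * (-u[t] + u[t] ** 3 + c)
        # sqrt(2*eps*b) * xi(t) dt discretizes to sqrt(2*eps*b*dt) * N(0, 1)
        u[t + 1] = u[t] + drift * dt + np.sqrt(2 * eps * b * dt) * rng.standard_normal()
    return u
```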

Fig. 2: Capturing stochastic transitions in a bistable gradient system with white noise.
figure 2

a Schematic of noise-induced transitions in the bistable gradient system with Gaussian white noise. b Generated time series from Eq. (9) (b = 5, c = 0, ε = 0.3, u1(0) = 1.5, δt = 0.01) for t ∈ [0, 30] as the ground truth. c The trained slow-scale model transforms ten different start points into ten different slow time-scale series (colored lines), and the noise distribution is separated. d The prediction for t ∈ [100, 130]. e The number of transitions for the test and predicted data matches. A transition refers to the shift from u1 = − 1 to u1 = 1 or vice versa. The duration of the prediction is 10000δt. f Histograms of transition time for the test and predicted data. Transition time refers to the interval between two consecutive transitions.

In the training phase, the tuning of hyperparameters for the slow-scale model is performed following the protocol in our framework. After finding the proper hyperparameters listed in Table 1 (Example 1), we obtain the trained slow-scale model and the separated noise distribution (Fig. 2c). We remark that the convergence speed of the captured deterministic dynamics may differ from that of the actual dynamics. As a result, the separated noise distribution may exhibit lower or higher intensity than the actual noise distribution. In this case, we can employ a factor to magnify or reduce the noise strength. For instance, we amplify the sampled noise by a factor of 1.1 here, which improves the accuracy of the predictions (Supplementary Fig. 17).

Table 1 The list of hyperparameters used for the different examples of the main text

In the evaluation, we conduct the rolling prediction as shown in Fig. 1. The first 3000δt of the prediction is illustrated in Fig. 2d. The prediction shows noise-induced transition dynamics similar to those of the test data. Next, we generate 100 replicates of time series from Eq. (9), train the model, and produce 100 time series separately. We then compare the statistics of the noise-induced transitions for these two sets of time series, e.g., the number of transitions over 10000δt (Fig. 2e) and the transition time (Fig. 2f). The match between the test and predicted data demonstrates the effectiveness of our approach in capturing noise-induced transition dynamics.
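The transition statistics can be extracted from a trajectory, for instance, as in the sketch below; the threshold values used to assign the two wells are illustrative choices consistent with the definition in Fig. 2e, f.

```python
# Counting transitions between the wells near u1 = -1 and u1 = +1 and recording the
# intervals between consecutive transitions. The detection thresholds are illustrative.
import numpy as np

def transition_statistics(u, dt=0.01, low=-0.5, high=0.5):
    state, last_t = None, None
    times, count = [], 0
    for t, x in enumerate(u):
        well = 1 if x > high else (-1 if x < low else None)
        if well is not None and well != state:
            if state is not None:              # a completed switch between wells
                count += 1
                if last_t is not None:
                    times.append((t - last_t) * dt)
                last_t = t
            state = well
    return count, np.array(times)
```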

A bistable gradient system with colored noise

Predicting a single stochastic transition becomes possible when the system has colored noise, because we can learn the time evolution of the separated noise. Since RC is good at learning deterministic systems, we employ a second RC (the first RC learns the deterministic part) to learn the noise series for predicting a single transition. To demonstrate that the present method is applicable to such cases, we consider the system of Eqs. (10) to (13) studied in ref. 40, whose method relies on the assumption that the deterministic part of the equation is known a priori. In contrast, we do not assume any prior knowledge of the deterministic part of the dynamical system and directly learn both the deterministic part and the noise (Fig. 3a), enabling prediction in a model-free manner.

Fig. 3: Predicting the accurate transition time for a bistable gradient system with colored noise.
figure 3

The system is the same as Eq. (10)40. a The flowchart of predicting stochastic transitions with colored noise. The process for obtaining the noise ζt follows that in Fig. 1a, and a second reservoir takes in ζt through a matrix \({W}_{in}^{*}\) and has reservoir states \({{{{\bf{r}}}}}_{t}^{*}\) with a connection matrix A*. The output matrix \({W}_{out}^{*}\) is trained to learn the noise. b Target time series (x(0) = y(0) = z(0) = 1, b = 1, c = 0, ψ = 0.08, ϵ = 0.5, u1(0) = − 1.5, δt = 0.01) with 8000δt, where a noise-induced transition occurs in t ∈ [22, 25], marked by the green dashed line. The noisy data from 580δt (with a range of 550δt to 650δt empirically suitable) before the stochastic transition at t = 22 is applied to predict the noisy time series in t ∈ [22, 25]. c The trained slow-scale model transforms ten different start points into ten different slow time-scale series (colored lines). d By repeating the process in a with the same hyperparameters, 50 predicted u1(t) are obtained (fainter lines). The averaged predicted time series (thick green) matches the test data (coral). e Absolute errors of the 50 predicted time series and their mean value (thick green).

The system is a 1D bistable gradient system, as illustrated in Fig. 3b:

$${\dot{u}}_{1}=-b(-{u}_{1}+{u}_{1}^{3}+c)+\frac{\psi }{\epsilon }y,$$
(10)
$$\dot{x}=\frac{10}{{\epsilon }^{2}}(y-x),$$
(11)
$$\dot{y}=\frac{1}{{\epsilon }^{2}}(28x-xz-y),$$
(12)
$$\dot{z}=\frac{1}{{\epsilon }^{2}}\left(xy-\frac{8}{3}z\right).$$
(13)

The parameter b denotes the strength of the diffusion coefficient, ϵ corresponds to the noise strength, ψ controls the influence of the noise on the slow-scale dynamics, and c controls the tilt of the two potential wells. The noise (x, y, z) is modeled by the Lorenz-63 model48. The system has stochastic transitions between the two potential wells under the Lorenz noise.

To test the present method, we consider the time series with a stochastic transition prior to the green dashed line in Fig. 3b. In the training phase, we obtain a slow-scale model to learn the deterministic part (Fig. 3c) and to separate the noise. The hyperparameters are listed in Table 1 (Example 2, set 1). In the predicting phase, accurately forecasting the stochastic transitions requires predicting the noise time series. Thus, we utilize a second reservoir (Fig. 3a) to learn the noise separated during the training phase. The hyperparameters for the noise time series are listed in Table 1 (Example 2, set 2). With the trained slow-scale model for the deterministic component, we execute a rolling prediction to predict a single transition.
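A minimal sketch of this two-reservoir scheme is given below, reusing the drive_reservoir and train_readout sketches from above; the second reservoir is trained for one-step-ahead prediction of the separated noise ζ and then run autonomously to supply the noise added back during the rolling prediction. All names are illustrative.

```python
# Second reservoir for colored noise (Fig. 3a): learn and then autonomously generate
# the separated noise series zeta of shape (T, n). Names are illustrative only.
import numpy as np

def train_noise_reservoir(zeta, A_star, W_in_star, alpha_star, beta_star):
    R_star = drive_reservoir(zeta[:-1], A_star, W_in_star, alpha_star)
    # fit zeta_{t+1} from the reservoir state r*_t
    return train_readout(zeta[1:].T, R_star.T, beta_star)

def predict_noise(zeta_last, r_star, A_star, W_in_star, W_out_star, alpha_star, n_steps):
    preds, z = [], zeta_last
    for _ in range(n_steps):
        r_star = (1 - alpha_star) * r_star + alpha_star * np.tanh(
            A_star @ r_star + W_in_star @ z)
        z = W_out_star @ r_star
        preds.append(z)
    return np.array(preds)
```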

To evaluate the accuracy of the prediction, we applied the same hyperparameters to conduct 50 predictions, as in ref. 40. These predictions were then used alongside the trained slow-scale model for 50 rolling predictions. The average of the 50 prediction outcomes closely approximates the actual time series (Fig. 3d). Furthermore, Fig. 3e shows a near-zero average absolute error between the 50 predictions and the actual time series, indicating high accuracy. These results demonstrate that the present approach requires no assumptions about knowing the deterministic part, underscoring its potential for predicting a single stochastic transition under colored noise.

A bistable non-gradient system

We next investigate whether the present method can predict noise-induced transitions in 2D non-gradient systems. We consider a bistable non-gradient system43:

$${\dot{u}}_{1}=-b(-{u}_{1}+{u}_{1}^{3}+c)-a{u}_{2}+\sqrt{2{\varepsilon }_{1}b}{\xi }_{1}(t),\quad t\ge 0,$$
(14)
$${\dot{u}}_{2}=a(-{u}_{1}+{u}_{1}^{3}+c)-b{u}_{2}+\sqrt{2{\varepsilon }_{2}b}{\xi }_{2}(t),\quad t\ge 0,$$
(15)

with Gaussian white noise ξ1(t) and ξ2(t). In this system, b is the diffusion coefficient, a represents the strength of the non-detailed-balance part, ε1 and ε2 are the noise strengths, and c controls the tilt of the potential. The system has noise-induced transitions between the two potential wells under the noise, as illustrated in Fig. 4a. The non-detailed-balance part introduces a rotational component to the time series, which complicates the prediction.

Fig. 4: Learning noise-induced transitions in a bistable non-gradient system.
figure 4

a Schematic of transitions in the 2D bistable non-gradient system. b Generated time series from Eqs. (14) and (15) (a = b = 5, c = 0, ε1 = ε2 = 0.3, u1(0) = 0, u2(0) = 2, δt = 0.002) for t ∈ [0, 40] as the ground truth. c The trained slow-scale model transforms ten different start points into ten different slow time-scale series (colored lines), t ∈ [40, 80], and the noise distribution is separated in the training phase. d Result of prediction using the slow-scale model and the noise distribution in c. e The number of transitions for the 100 replicates simulated in t ∈ [40, 80] matches that from the 100 predicted trajectories. f Histograms of transition time for the test and predicted data. A transition occurs when the time series crosses the zero point in the u1-direction without returning for 50δt. The transition time is defined as the interval between two consecutive zero crossings.

In the training phase, we generated a series consisting of 40000δt. The training set is t ∈ [0, 40], and the predicting set is t ∈ [40, 80]. Figure 4b displays the training set. Following the method in our framework, the deterministic part is reconstructed as shown in Fig. 4c. A proper set of hyperparameters is listed in Table 1 (Example 3). We observe that the generated time series starting from the ten initial points converge to the two potential wells, where the time series has rotational dynamics. In the predicting phase, we perform the rolling prediction within t ∈ [40, 80] (Fig. 4d).

In the evaluation, we predict 100 replicates of the time series and compare them with 100 replicates simulated from Eqs. (14) and (15). Figure 4e presents histograms of the number of transitions for the 100 predicted and test time series in t ∈ [40, 80]. Figure 4f presents histograms of transition time for the 100 predicted and test time series in t ∈ [40, 80]. The results demonstrate that, for the 2D bistable non-gradient system, the present method accurately learns the dynamics and yields precise estimates of the number of transitions and the transition time.
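For instance, the transition criterion of Fig. 4f (a zero crossing of u1 that does not return within 50δt) can be implemented as in the sketch below; the handling of the settling window is an illustrative choice rather than a prescribed rule.

```python
# Counting transitions in the 2D case: a transition is registered when u1 crosses zero
# and keeps the new sign for 50 time steps. Implementation details are illustrative.
import numpy as np

def count_2d_transitions(u1, dt=0.002, hold=50):
    crossings, t = [], 1
    while t < len(u1):
        if u1[t - 1] * u1[t] < 0:                          # sign change in u1
            segment = u1[t:t + hold]
            if len(segment) == hold and np.all(np.sign(segment) == np.sign(u1[t])):
                crossings.append(t)                        # crossing persists for 50*dt
                t += hold                                  # skip the settling window
                continue
        t += 1
    waiting_times = np.diff(crossings) * dt
    return len(crossings), waiting_times
```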

Experimental data of protein folding

We apply the present method to the protein folding data6, demonstrating that it can learn the noise-induced transitions of experimental data. The talin protein has five regions with distinct states, and two states (native and unfolded) can be singled out in a short time (native folding dynamics). A short end-to-end length represents the native state, while a longer length represents the unfolded state. This shift of end-to-end length can be considered a noise-induced transition. Figure 5a shows the training data, where transitions occur between two stable states.

Fig. 5: Learning the stochastic transitions from the experimental data of protein folding.
figure 5

The u1 represents the end-to-end length of the protein. We refer to transitions from around u1 = 15 to around u1 = 30 as upward transitions, and vice versa as downward transitions. The right-pointing arrow: reduction of training data. a Time series of the training set (0–25,000 time steps). b The trained slow-scale model generates slow time-scale series (colored lines), and the noise distribution is separated out. c The prediction during time steps 25,000–50,000. d, e Histograms of upward and downward transition time for the prediction and the test data, where the length of the training set (Ttrain) is 25,000 time steps. Transition time refers to the interval between two consecutive transitions. f–i Similar histograms of upward and downward transition time with different lengths of the training sets: Ttrain = 7500 for (f, g), and Ttrain = 6000 for (h, i). The present method remains accurate even when the training length is reduced to Ttrain = 7500.

In the training phase, with a training set length (Ttrain) of 25,000 time steps, we obtain the trained slow-scale model and ten different slow time-scale series with the separated noise distribution (Fig. 5b). The proper hyperparameters are listed in Table 1 (Example 4). In the predicting phase, we employ the trained slow-scale model and the separated noise distribution to perform the rolling prediction for 100,000 time steps. The first 25,000 time steps of the prediction are plotted in Fig. 5c, showing the transitions between stable states and the asymmetric dynamics.

In experiments, the available data is often limited, and it is essential to determine the minimum amount of data required. Thus, we reduce the amount of training data to 7500 and 6000 time steps separately. We generate a prediction for 100,000 time steps and then compare it with the test data to evaluate the impact of data length on prediction accuracy. The results in Fig. 5d, e and f, g demonstrate that the present method can learn the dynamics of protein folding from data with around 7500 time steps. Figure 5h, i show a larger error between the predicted and true transition times when Ttrain is 6000 time steps. This suggests that 7500 time steps approximates the minimum data requirement for the present method in this system, allowing the behavior of protein folding to be effectively learned and simulated for more time steps. Additionally, when compared with SINDy-2021 and FORCE learning, our method has higher accuracy (Supplementary Fig. 8). Therefore, the present approach is helpful for streamlining the workload of experimentalists by learning protein folding dynamics from a small dataset.

Discussion

The choice of hyperparameter α affects the training: a larger α corresponds to time series with a faster time scale, while a smaller α leads to a slower time scale41. We utilize this characteristic to search for α to match the slow dynamics and separate the noise. If a time series is generated from a system with asymmetric potential wells, the two distinct potential wells exhibit different time scales. In this case, we may need to employ two different sets of hyperparameters (Supplementary Figs. 10, 12, 13), where our approach can identify the two time scales. For colored noise (Example 2), α for the noise RC is smaller than that for the deterministic part (Table 1), because a smaller α leads to a smoother noise time series and better captures the major trend of the colored noise.

The effectiveness of learning the slow time-scale series can also be influenced by other hyperparameters47. Although it is challenging to have a universal and systematic strategy for selecting hyperparameters49, we have proposed a general method to search for the optimal hyperparameters. We find that the power spectral density of the time series can be used to quantify the training performance (Supplementary Fig. 9). The PSD of the predicted stochastic time series closely matches that of the training data when the deterministic dynamics are accurately captured. Therefore, a closer match of PSDs indicates a better choice of the hyperparameters. Moreover, we vary the values of the hyperparameters (Supplementary Figs. 18, 19). The RC remains effective when α ∈ [0.15, 0.25] and the regularization parameter β is not too small. This demonstrates that the method is robust under a range of parameter values.

For experimental data, we have shown the possibility of accurate predictions from a small dataset, as exemplified by the protein folding data6. During a short time span, the data samples a local equilibrium involving the native and unfolded conformations. If the measurement time is significantly extended, previously inaccessible regions separated by high-energy barriers may be explored. Consequently, in order to capture a wider variety of protein folding transitions, it may be necessary to use a longer training set, which requires higher computational cost. Additionally, we observed tilted dynamics in the time series of protein folding. Even so, we can learn both the upward and downward transitions by using only one set of hyperparameters. This suggests that the time scales of upward and downward transitions might not differ significantly. When dealing with a time series generated from strongly tilted dynamics, we can employ two distinct sets of hyperparameters.

The present method is found effective for various cases of frequency distributions of the data (Supplementary Fig. 20). For Example 1, the frequency distribution has almost equal intensity over a long range of frequencies due to the white noise. In contrast, in Example 2 with colored noise, the frequency distributions of the deterministic and noise signals are mixed and indistinguishable. To further test the effectiveness of our approach in such cases, we apply it to another case with mixed frequency distributions, where the precise transition time is also accurately predicted. Besides, for the real data of protein folding, the frequency distribution is similar to Example 1, which helps us better grasp the data characteristics. In general, the frequency distribution of the data can help guide the training, including the choice of the hyperparameters.

In summary, we have provided a general framework for learning noise-induced transitions solely based on data. We have applied the method to examples from stochastic differential equations and experimental data, where the method can accurately learn transition statistics from a small training set. As potential avenues for improvement, Bayesian optimization50 and simulated annealing51 can be used to help the search for the hyperparameters. The present approach may be applied to analyze transitions of trajectories between different dynamical phases of spins52. The approach can also be generalized to examples with hidden nodes and hidden links53, or with other types of noise, where the conditional generative adversarial network54 may be employed to model the noise. We anticipate that this study can motivate a series of systematic explorations on learning noise-induced phenomena beyond mitigating noise effects in extracting deterministic dynamics, such as by extending the frameworks of SINDy and FORCE learning.

Methods

We first describe the previous training process of reservoir computing. We reformulate the loss function to derive the expression for the output matrix Wout and discuss the hyperparameters in the present method. The loss function is given by Eq. (5). In detail, we write the loss function as a sum over all time points to perform the linear regression, so that the regression reduces to sums of vector products:

$$L= \, {\sum}_{t=1}^{T}[| | {{{{\bf{u}}}}}_{t}-{W}_{out}{{{{\bf{r}}}}}_{t}| {| }^{2}+\beta | | {W}_{out}| {| }^{2}]\\= \, {\sum}_{t=1}^{T}[{({{{{\bf{u}}}}}_{t}-{W}_{out}{{{{\bf{r}}}}}_{t})}^{\top }({{{{\bf{u}}}}}_{t}-{W}_{out}{{{{\bf{r}}}}}_{t})+\beta | | {W}_{out}| {| }^{2}]\\= \, {\sum}_{t=1}^{T}[{({{{{\bf{u}}}}}_{t})}^{\top }{{{{\bf{u}}}}}_{t}-{({W}_{out}{{{{\bf{r}}}}}_{t})}^{\top }{{{{\bf{u}}}}}_{t}-{({{{{\bf{u}}}}}_{t})}^{\top }{W}_{out}{{{{\bf{r}}}}}_{t}+{({W}_{out}{{{{\bf{r}}}}}_{t})}^{\top }{W}_{out}{{{{\bf{r}}}}}_{t}+\beta | | {W}_{out}| {| }^{2}].$$
(16)

As the loss function is convex (to prove that the zero gradient is indeed the minimum, one differentiates once more to obtain the Hessian matrix and shows that it is positive definite; this is provided by the Gauss-Markov theorem), the optimal solution lies at the zero gradient:

$${\partial }_{{W}_{out}}L={\sum}_{t=1}^{T}[-2{{{{\bf{u}}}}}_{t}{({{{{\bf{r}}}}}_{t})}^{\top }+2{W}_{out}{{{{\bf{r}}}}}_{t}{({{{{\bf{r}}}}}_{t})}^{\top }+2\beta {W}_{out}]=0,$$
(17)

which leads to the regression:

$${W}_{out}={\sum}_{t=1}^{T}\left[({{{{\bf{u}}}}}_{t})\cdot {({{{{\bf{r}}}}}_{t})}^{\top }\right]\cdot {[({{{{\bf{r}}}}}_{t})\cdot {({{{{\bf{r}}}}}_{t})}^{\top }+\beta ]}^{-1},$$
(18)

where we have neglected the notation of the identity matrix and the identity vector. By stacking the vectors at different time points into matrices, the expression can be rewritten in the compact form of Eq. (6). Here, the present method further adjusts Wout by tuning the hyperparameters to capture the deterministic dynamics and separate the noise distribution, thereby enabling us to learn the stochastic dynamics.

There are six hyperparameters in the present method. The variable N represents the number of reservoir nodes, which determines the reservoir size. In most cases, performance improves with a larger reservoir55. However, using a large reservoir might lead to overfitting, requiring the application of suitable regularization techniques47. The hyperparameter Kin represents the scaling factor for the input matrix Win. The average degree of the reservoir connection network is denoted by D, and we choose the connection matrix A to be sparse11. This choice stems from the intuition that decoupling the state variables can result in a richer encoding of the input signal55. The spectral radius of the reservoir connection network, denoted as ρ, represents a critical characteristic of the dynamics of the reservoir state. Notably, it affects both the nonlinearity of the reservoir and its capacity to encode past inputs in its state42,55. The α is the leak parameter, which determines the time scale41. The hyperparameter β represents the regularization term47.
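For convenience, the six hyperparameters can be collected in a single container, as in the sketch below; the numerical values shown are placeholders for illustration only, and the values actually used in each example are those listed in Table 1.

```python
# A minimal container for the six hyperparameters discussed above (placeholder values).
from dataclasses import dataclass

@dataclass
class ReservoirHyperparameters:
    N: int = 500        # number of reservoir nodes
    K_in: float = 1.0   # scaling of the input matrix W_in
    D: float = 5.0      # average degree of the reservoir network
    rho: float = 0.9    # spectral radius of A
    alpha: float = 0.2  # leak parameter, setting the time scale
    beta: float = 1e-6  # regularization strength
```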