Introduction

Chaotic systems exhibit sensitive dependence on initial conditions: small changes in the starting state can lead to vastly different outcomes over time. Understanding such systems requires sophisticated mathematical and computational tools that enable researchers to analyze complex behaviors, identify underlying patterns, and predict system evolution. Lorenz’s foundational work on deterministic nonperiodic flow provided early insight into such behavior1.

The phase-space dynamics of Hamiltonian systems can in general be described by two-dimensional area-preserving mappings, for which numerical simulations are easily performed. In terms of action-angle variables, such a mapping takes the form

$$\begin{aligned} J_{n+1} &= J_n + \beta \; f(J_{n+1},\theta _n) \end{aligned}$$
(1)
$$\begin{aligned} \theta _{n+1} &= \theta _n + 2 \pi g(J_{n+1},\theta _n) + \beta \; h(J_{n+1},\theta _n) \end{aligned}$$
(2)

where f, g and h are nonlinear functions of the parameters, the pair \((\theta _n, J_n)\) represents a point in the action-angle phase-space at time \(t_n\), and \(\beta\) represents a perturbation parameter. For many interesting applications the functions f and g are assumed to be independent of the action variable and the angle variable, respectively, that is, \(f= f(\theta _n)\), \(g = g(J_{n+1})\), while \(h \equiv 0\). Linearizing equation (2) around a period-1 fixed point \(J_0 = J_{n+1} = J_n\), and writing a nearby trajectory as \(J_n = J_0 + \Delta J_n\), the rescaled action \(p_n = 2 \pi g^\prime \Delta J_n\) (where \(g^\prime\) is the derivative of g evaluated at \(J_0\)) satisfies the so-called standard map2,3.
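To make the linearization explicit (our own restatement of the step above): with \(f = f(\theta _n)\), subtracting \(J_0\) from Eq. (1) gives \(\Delta J_{n+1} = \Delta J_n + \beta f(\theta _n)\), while a first-order expansion of g yields \(g(J_{n+1}) \approx g(J_0) + g^\prime (J_0) \, \Delta J_{n+1}\). Multiplying the first relation by \(2 \pi g^\prime\) and substituting \(p_n = 2 \pi g^\prime \Delta J_n\) gives

$$\begin{aligned} p_{n+1} = p_n + 2 \pi g^\prime \beta f(\theta _n), \qquad \theta _{n+1} = \theta _n + 2 \pi g(J_0) + p_{n+1}, \end{aligned}$$

and, with \(f = f_{max} \sin \theta _n\) and the constant \(2 \pi g(J_0)\) absorbed modulo \(2\pi\), this reduces to the standard map with \(k = 2 \pi g^\prime \beta f_{max}\).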

$$\begin{aligned} p_{n+1} &= p_n + k \; \sin \theta _n \nonumber \\ \theta _{n+1} &= \left( \theta _n + p_{n+1}\right) \mod 2\pi \end{aligned}$$
(3)

where \(k = 2\pi g^\prime \beta f_{max}\) is the chaoticity parameter and \(f_{max}\) is the maximal value of the function f, which, as usual, has been chosen to be sinusoidal, \(f/f_{max} = \sin \theta _n\)4. When \(k = 0\), the system (3) describes an integrable Hamiltonian system, while for \(k \ne 0\) it becomes non-integrable.
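As a concrete illustration, the map (3) can be iterated in a few lines of Python (a minimal sketch; the function names are ours, not from the original code):

```python
import numpy as np

def standard_map_step(theta, p, k):
    """One iteration of the standard map, Eq. (3):
    p_{n+1} = p_n + k sin(theta_n),  theta_{n+1} = (theta_n + p_{n+1}) mod 2*pi."""
    p_next = p + k * np.sin(theta)
    theta_next = (theta + p_next) % (2.0 * np.pi)
    return theta_next, p_next

def iterate(theta0, p0, k, n_iter):
    """Evolve a single initial condition, returning arrays of length n_iter + 1."""
    theta = np.empty(n_iter + 1)
    p = np.empty(n_iter + 1)
    theta[0], p[0] = theta0, p0
    for n in range(n_iter):
        theta[n + 1], p[n + 1] = standard_map_step(theta[n], p[n], k)
    return theta, p

# Integrable case: for k = 0 the action p is conserved exactly.
theta, p = iterate(theta0=1.0, p0=0.5, k=0.0, n_iter=100)
```

Setting \(k = 0\) in this sketch gives a constant action, the integrable limit noted above.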

The Standard Map models various phenomena, such as a kicked pendulum in the absence of friction and gravity, or the interaction of a charged particle with an electrostatic wave5,6. It belongs to the family of twist maps and serves as a foundational model for both classical and quantum chaos5,6, as it describes the onset of chaotic behavior near the separatrix of non-integrable systems. Additionally, the Standard Map exhibits the key properties of two-dimensional symplectic mappings, making it a prototype for studying such systems. The dynamics of the map in phase space are governed by the KAM (Kolmogorov–Arnold–Moser) theorem4, which addresses the persistence of regular, quasi-periodic motion under small perturbations of an integrable system. For instance, when an integrable Hamiltonian system is slightly perturbed, such as by \(k \ll 1\) in the map (3), resonances between degrees of freedom may disrupt the convergence of power series expansions. As a result, near the hyperbolic fixed points of the system, the presence of homoclinic and heteroclinic points ensures a region of chaotic motion, bounded by KAM surfaces. However, regular structures characterized by irrational frequency ratios associated with the action variables persist, even at higher values of the parameter k. For a fixed k, the phase space consists of interwoven regions of regular and irregular dynamics, with the extent of regular behavior diminishing as k increases. To illustrate this, in Figure 1, we show the classical phase space \((\theta _n,p_n)\) of the Standard Map for different values of k, using \(N = 160\) random initial values \((\theta _0, p_0)\) within the range \([0, 2\pi ]\) and performing \(L = 2048\) iterations. This figure captures the dynamics described by the KAM theorem. The colors in the figure simply represent different trajectories within the phase space.
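A portrait like Fig. 1 can be generated along the following lines (our own sketch, assuming uniform random initial conditions on \([0, 2\pi)\) as described above):

```python
import numpy as np

def phase_portrait(k, n_traj=160, n_iter=2048, seed=0):
    """Evolve n_traj random initial conditions of the standard map for n_iter
    iterations. Initial values (theta_0, p_0) are drawn uniformly from
    [0, 2*pi), and both coordinates are reported modulo 2*pi.
    Returns two arrays of shape (n_traj, n_iter)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n_traj)
    p = rng.uniform(0.0, 2.0 * np.pi, size=n_traj)
    thetas = np.empty((n_traj, n_iter))
    ps = np.empty((n_traj, n_iter))
    for n in range(n_iter):
        p = p + k * np.sin(theta)                 # Eq. (3), action update
        theta = (theta + p) % (2.0 * np.pi)       # Eq. (3), angle update
        thetas[:, n] = theta
        ps[:, n] = p % (2.0 * np.pi)
    return thetas, ps

# Each row (thetas[i], ps[i]) corresponds to one coloured trajectory in Fig. 1.
thetas, ps = phase_portrait(k=1.2, n_traj=160, n_iter=256)
```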

Fig. 1

Phase space of the standard map \((\theta _n, p_n) \mod 2\pi\) for various values of the chaoticity parameter k. As k increases, chaotic trajectories begin to occupy more of the phase space, while regular trajectories persist, even at larger values of k. This illustrates the transition from regular to chaotic behavior as described by the KAM theorem, where the extent of regular regions in phase space diminishes as k increases.

Over recent years, machine learning (ML) techniques have emerged as powerful tools for forecasting and characterizing chaotic systems, often providing predictive capabilities that surpass those of traditional mathematical approaches7,8. Among these techniques, artificial neural networks (ANNs), particularly deep neural networks (DNNs), have become state-of-the-art for modeling chaotic time series. Deep learning algorithms can automatically learn hierarchical representations of data, enabling them to identify patterns at various levels of abstraction without explicit guidance from human programmers9,10,11. Various DNN architectures, including feed-forward neural networks (FFNs) and recurrent neural networks (RNNs), have been applied to chaotic systems with differing levels of success. While FFNs focus on learning static input-output relationships, RNNs and their advanced forms, such as reservoir computing models12,13 and long short-term memory (LSTM) networks14, excel in capturing temporal dependencies, making them particularly suitable for time series prediction in chaotic dynamics.

The scope of ML applications in chaotic systems has broadened significantly, introducing new tools for classification, forecasting, and parameter inference. In the context of classification, Boullé et al. (2020)15 and Celletti et al. (2022)16 demonstrate that deep learning can effectively classify chaotic and regular dynamics in complex systems, overcoming the limitations of conventional methods in high-dimensional systems. Similarly, Lee and Flach (2020)17 explore the application of deep learning techniques to classify chaotic and regular dynamics in dynamical systems, particularly in the two-dimensional Chirikov Standard Map. Using a convolutional neural network, the authors show that the model can effectively identify chaotic behavior even over short trajectories, where traditional numerical methods (such as the Lyapunov exponent) can fail to converge. The study highlights the neural network’s robustness across varying control parameters and its success in testing other discrete dynamical systems, including the one-dimensional logistic map and a discrete version of the three-dimensional Lorenz system. This approach provides a promising alternative for rapid chaos classification in complex systems.

When it comes to regression and forecasting tasks, Sangiorgio et al. (2022) demonstrate how deep learning models can effectively predict the multi-step evolution of chaotic systems, addressing the challenges posed by noise and real-world unpredictability in chaotic environments18. This work complements that of Pathak et al. (2017), who use machine learning to replicate chaotic attractors and calculate Lyapunov exponents, showing that data-driven models can capture the underlying chaotic structure without relying on exact system equations12. Similarly, Duncan et al. (2023) introduce a hybrid reservoir computing model that combines data-driven and model-based approaches, optimizing predictions in cases where partial system knowledge exists, thereby enhancing both model accuracy and robustness13. The studies by Kavuran et al. (2022) and Maathuis et al. (2017) further emphasize the role of machine learning in chaotic time series analysis. Kavuran et al. (2022) demonstrate the use of bidirectional LSTM (BiLSTM) for detecting structural variations in fractional-order chaotic systems, underscoring the flexibility of machine learning in identifying subtle dynamical changes, which is essential for applications in secure communications and encryption14. Maathuis et al. (2017) highlight the suitability of neural networks for forecasting chaotic time series, proposing these models as viable alternatives to traditional approaches, particularly in high-dimensional, nonlinear dynamic systems19.

Many complex physical phenomena are described by chaotic Hamiltonian models. However, the limited scope of experimental data available for real chaotic systems often reduces confidence in the theoretical feasibility of predicting key system properties with a finite dataset. Despite these limitations, in time series analysis, deep learning models trained on multiple short time series have been observed to outperform traditional methods, which typically require long time series to achieve similar results12,17,20. This enhanced performance stems from the ability of deep learning models to ‘learn’ the statistical properties of the system from a large set of short time series. However, this advantage relies on specific conditions: ergodicity, under which time averages coincide with ensemble averages, and stationarity, under which the statistical properties do not change over time. In non-ergodic or non-stationary systems, where these assumptions fail, the ensemble averaging approach often used by deep learning models may not hold. In such cases, advanced techniques like Finite Time Stability Exponents (FTSEs) and Finite Time Lyapunov Exponents (FTLEs), which capture the system’s dynamic structure over finite time windows, can provide more reliable insights and improve predictability21,22,32,33,34.

Suppose we examine the dynamics of a generic system described by an energy-based model, where phase-space evolution is governed by the standard map (3) with an unknown parameter k. By using the map’s dynamics across varying k values as training data for a DNN model, we aim to assess the model’s ability to estimate the unknown parameter k using only a limited number, L, of map iterations and N trajectories generated with that parameter. Unlike previous studies focusing solely on classifying trajectories or forecasting time series, this study addresses the following questions: How accurately can a DNN predict the chaoticity parameter k in a two-dimensional symplectic map? How does prediction accuracy depend on the number of iterations, L, and the initial conditions? What insights into the structure of chaotic phase space can be gained from the representations learned by the neural network?

By addressing these questions, we aim to contribute to the growing body of research on ML applications in chaos theory, demonstrating that neural networks can be utilized not only for short-term forecasting but also for the structural characterization of chaotic systems through parameter inference.

Methods

Data generation

In this study, we used 100 distinct values of k, spanning the range \(0 \le k \le 5\) with a step size of \(\Delta k = 0.05\). We generated a total of 169 initial conditions \((p_0, \theta _0)\), obtained by combining 13 discrete values of \(p_0\) and 13 discrete values of \(\theta _0\) within the interval \([0, 2\pi ]\), using a step size of \(\Delta = 0.5\). From this full set, we considered subsets of up to \(N = 160\) initial conditions, and for each initial condition, we performed a maximum of \(L = 2048\) iterations for every value of k.

To train our deep learning model, we constructed multiple datasets by extracting different subsets from this pool of initial conditions, using the following procedure. For each value of k, we randomly selected a subset of N initial conditions from the original set of 169, with \(N \in \{1, 10, 20, 30, \ldots , 160\}\). Each selected initial condition was evolved for a number of iterations, chosen from one of five possible values of L (128, 256, 512, 1024, 2048). This process was performed independently for each k. As a result, datasets such as \((L = 128, N = 100)\) and \((L = 256, N = 100)\), while containing the same number of trajectories, do not necessarily contain the same set of initial conditions, because the stochastic selection was applied independently to each dataset.
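The sampling procedure can be sketched as follows (our own illustration; the function names are assumptions, and since "100 values in \(0 \le k \le 5\) with step 0.05" leaves the endpoints ambiguous, we pick one consistent choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 values of k with step 0.05 (exact endpoints are our assumption).
K_VALUES = np.round(0.05 * np.arange(1, 101), 2)
# 13 x 13 = 169 grid initial conditions (p_0, theta_0) with step 0.5 on [0, 2*pi].
GRID = np.arange(0.0, 2.0 * np.pi, 0.5)
INITIAL_CONDITIONS = [(p0, t0) for p0 in GRID for t0 in GRID]

def make_dataset(k, n_traj, n_iter):
    """Randomly pick n_traj of the 169 grid initial conditions (without
    replacement) and evolve each for n_iter standard-map iterations.
    Returns an array of shape (n_traj, n_iter, 2) holding (p, theta)."""
    idx = rng.choice(len(INITIAL_CONDITIONS), size=n_traj, replace=False)
    data = np.empty((n_traj, n_iter, 2))
    for i, j in enumerate(idx):
        p, theta = INITIAL_CONDITIONS[j]
        for n in range(n_iter):
            p = p + k * np.sin(theta)
            theta = (theta + p) % (2.0 * np.pi)
            data[i, n] = (p, theta)
    return data

batch = make_dataset(k=1.2, n_traj=100, n_iter=128)  # one (L = 128, N = 100) instance
```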

The trajectories were divided based on the value of k into three disjoint subsets: a training set (\(70\%\)) for model fitting, a validation set (\(15\%\)) for hyperparameter tuning, and a test set (\(15\%\)) for evaluating the model’s generalization performance. For consistency, the assignment of k values to each set remained fixed across all pairs of N and L. In other words, if a particular k value was assigned to the training, validation, or test set for one specific pair of N and L, it retained that assignment across all other pairs.
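A fixed k-wise split of this kind can be sketched as follows (our illustration; the fixed seed stands in for whatever mechanism kept the assignment identical across all \((N, L)\) pairs):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed: the k-assignment stays the same
                                 # for every (N, L) pair, as described above

k_values = np.round(0.05 * np.arange(1, 101), 2)   # the 100 values of k (assumed endpoints)
shuffled = rng.permutation(k_values)

n = len(shuffled)
n_train, n_val = (70 * n) // 100, (15 * n) // 100  # 70% / 15% / 15% split
train_k = shuffled[:n_train]
val_k = shuffled[n_train:n_train + n_val]
test_k = shuffled[n_train + n_val:]
```

Because the split is made over k values rather than over trajectories, every trajectory generated with a given k lands in exactly one of the three sets.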

The ability of a trained deep learning model to generalize to new, unseen samples depends on the similarity of the distributions in the training, validation, and test sets. Ensuring that these subsets are representative of the overall data distribution is crucial for optimal model performance. To achieve this, we employed a random partitioning strategy to divide the dataset into the training, validation, and test sets, ensuring that each set accurately reflected the overall distribution.

Deep learning model

Time series data consists of observations or measurements collected sequentially over time and is found in various domains such as finance, economics, weather forecasting, signal processing, and healthcare, among others. Unlike standard classification and regression tasks, time series problems introduce the complexity of temporal dependence between observations, necessitating specialized handling during model fitting and evaluation. However, this temporal structure can also be advantageous, offering additional insights such as trends and seasonality that can enhance model performance. Deep learning algorithms, particularly Deep Neural Networks (DNNs), have emerged as a promising approach to address the challenges of analyzing and modeling time series data. Their ability to capture complex temporal patterns and dependencies has led to their increasing use in time series analysis23,24,25.

DNNs map inputs to targets through a series of simple transformations learned from examples (pairs of inputs and targets). These transformations are parameterized by the network’s weights, so training a DNN involves finding the optimal set of weight values to accurately map inputs to their corresponding targets. Since DNNs often contain millions of parameters, this optimization task is challenging, as adjusting one weight can influence the behavior of the entire network.

To guide the training process, a loss function measures the difference between the predicted and true target values. This function computes a score that reflects the network’s performance on a given example. The score serves as feedback to adjust the weights, pushing them in the direction that reduces the loss for that example. Initially, the weights are assigned random values, leading to poor predictions and high loss scores. However, through iterative processing, the weights gradually adjust to minimize the loss function, allowing the network to make increasingly accurate predictions.

In this study, we apply these principles within the framework of a convolutional neural network (CNN), a specialized type of neural network designed to process data with a grid-like topology, such as images. Initially developed for tasks like handwritten digit recognition26, CNNs have also been applied to time series analysis, as time series data can be treated as a one-dimensional grid of samples taken at regular intervals. The core operation of a CNN is convolution, which computes a weighted sum of each input element and its neighbors, with the weights given by a kernel. As the kernel slides across the input, it generates a feature map, which is passed to subsequent layers. This process enables the network to develop hierarchical representations, from basic patterns in shallow layers to more complex structures in deeper layers. This hierarchical feature extraction enables CNNs to achieve high accuracy across a range of tasks, including image recognition27, object detection28, and natural language processing29.
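The convolution operation described here can be illustrated with a minimal NumPy sketch (as in most deep learning libraries, it is implemented as cross-correlation; the kernel and input below are made up):

```python
import numpy as np

def conv1d(x, kernel):
    """Valid-mode 1D cross-correlation: each output element is the
    kernel-weighted sum of a window of the input."""
    n = len(x) - len(kernel) + 1
    return np.array([np.dot(x[i:i + len(kernel)], kernel) for i in range(n)])

x = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])   # a toy 1D signal
edge = np.array([-1.0, 0.0, 1.0])                   # simple difference kernel
feature_map = conv1d(x, edge)   # -> [2, 2, 0, -2, -2], responding to slope
```

Stacking many such kernels, each learned during training, yields the multi-channel feature maps passed between layers.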

While CNNs can process time series data, sequence modeling benefits from a different approach to handle temporal dependencies. For this reason, we next consider recurrent neural networks (RNNs), specialized neural networks designed specifically for sequence modeling. RNNs maintain an internal memory state, which is a condensed representation of past information, continuously updated with new observations at each time step. However, training RNNs on tasks that involve long-term dependencies can be challenging due to issues like vanishing and exploding gradients30. To address these challenges, long short-term memory networks (LSTMs) were developed31. LSTM models enhance gradient flow by introducing a cell state that serves as a memory unit for storing long-term information. This cell state is regulated by a set of gates: the input gate, forget gate, and output gate. The input gate controls how new information is integrated into the cell state, while the forget gate determines which information from the previous state should be discarded. The output gate regulates the flow of information either to the next time step or as the final output of the network. By controlling the information flow through these gates, LSTMs can capture and retain long-term dependencies in sequential data, overcoming the limitations of traditional RNNs.
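A single LSTM time step can be sketched in NumPy as follows (a didactic illustration of the gate equations described above with random weights, not the trained model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W, U, b hold the stacked parameters of the
    input (i), forget (f), output (o) and candidate (g) transforms."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # the three gates
    g = np.tanh(g)                                 # candidate cell update
    c_next = f * c + i * g    # forget gate discards, input gate admits
    h_next = o * np.tanh(c_next)  # output gate regulates the hidden state
    return h_next, c_next

hidden = 4
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden, 2))        # 2 input features, e.g. (p, theta)
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h = c = np.zeros(hidden)
h, c = lstm_step(np.array([0.1, 0.2]), h, c, W, U, b)
```

The additive cell-state update `f * c + i * g` is what lets gradients flow over long sequences, in contrast to the repeated matrix products of a plain RNN.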

We apply these models to a multivariate time series of length L, represented by two components, p and q, which are generated from Equation 3. In the context of the CNN, these components will be referred to as ‘channels’, while in the LSTM context, they will be referred to as ‘features’. The model’s primary objective is to predict the parameter k, which governs the system dynamics. For each dataset instance, a distinct DNN was trained, validated, and tested. The architectures of the two models, a CNN and an LSTM, are depicted in Figures 2 and 3, respectively. Detailed information about the architectures can be found in the corresponding figures.

Fig. 2

Convolutional Neural Network (CNN) architecture designed for sequence data processing. The network consists of four ‘Down Blocks’, each containing three 1D convolutional layers, followed by a ReLU activation function. Each block ends with a MaxPooling layer that downsamples the feature maps by a factor of two. The input consists of two channels corresponding to the variables p and q, which are progressively expanded to 32, then 64, and finally 128 channels across the blocks. After the final block, an adaptive average pooling layer reduces each feature map to a single value. The resulting feature vector is passed through a classifier, composed of two fully connected layers with ReLU activations, and finally, a single output neuron that produces a prediction based on the input sequence.

Fig. 3

Architecture of the Long Short-Term Memory (LSTM) network for sequence data processing. The model consists of two stacked LSTM layers with 128 hidden units that process input sequences with two features. The output from the LSTM layers is then passed through a fully connected layer, which maps the features to a single output value. This output represents the predicted value of the chaoticity parameter k, based on the learned temporal dependencies in the sequential input data.

To assess predictive performance and model robustness, we trained each model using three loss functions: Mean Squared Error (MSE), Mean Absolute Error (MAE), and SmoothL1. The Mean Absolute Error (MAE), also referred to as L1 loss (Eq. (4)), computes the average of the absolute differences between the true and the predicted values of k (\(k_{\text {true}}\) and \(k_{\text {pred}}\), respectively). Since large errors contribute linearly to this loss, MAE is less sensitive to outliers than MSE, promoting a more uniform error distribution across the dataset.

$$\begin{aligned} MAE = \frac{1}{N} \sum _{i=1}^{N} |k_{\text {true}}(i) - k_{\text {pred}}(i)| \end{aligned}$$
(4)

The Mean Squared Error (MSE), also known as L2 loss (Eq. (5)), measures the average of the squared differences between the true and the predicted \(k\) values. Because the squaring operation magnifies large errors, MSE is more sensitive to outliers. This property encourages the model to prioritize reducing larger errors, which may be beneficial in applications requiring high precision in predictions.

$$\begin{aligned} MSE = \frac{1}{N} \sum _{i=1}^{N} (k_{\text {true}}(i) - k_{\text {pred}}(i))^2 \end{aligned}$$
(5)

SmoothL1 loss, a combination of L1 and L2 losses, strikes a balance between MSE and MAE (Eq. (6)). For small errors, SmoothL1 behaves similarly to MSE by squaring the differences, whereas for larger errors, it switches to an L1 form, using only the absolute difference. This hybrid approach moderates the impact of outliers while maintaining an overall focus on error reduction, making it more robust to extreme deviations.

$$\begin{aligned} \text {SmoothL1} = \frac{1}{N} \sum _{i=1}^{N} {\left\{ \begin{array}{ll} 0.5(k_{\text {true}}(i) - k_{\text {pred}}(i))^2 & \text {if } |k_{\text {true}}(i) - k_{\text {pred}}(i)| < 1 \\ |k_{\text {true}}(i) - k_{\text {pred}}(i)| - 0.5 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)
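The three loss functions follow directly from Eqs. (4)-(6); the example below is a minimal NumPy sketch with made-up values:

```python
import numpy as np

def mae(k_true, k_pred):                  # Eq. (4), L1 loss
    return np.mean(np.abs(k_true - k_pred))

def mse(k_true, k_pred):                  # Eq. (5), L2 loss
    return np.mean((k_true - k_pred) ** 2)

def smooth_l1(k_true, k_pred):            # Eq. (6), quadratic below |error| = 1
    d = np.abs(k_true - k_pred)
    return np.mean(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5))

k_true = np.array([1.0, 2.0, 3.0])
k_pred = np.array([1.5, 2.0, 5.0])   # one small error, one large outlier
```

On this toy example the single outlier dominates the MSE, while SmoothL1 penalizes it only linearly, illustrating the robustness argument above.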

We used the Adam optimizer for training with a learning rate of 0.001, while all other hyperparameters were maintained at their default settings. Each model was trained for 100 epochs. Despite variations in the dataset instances used, the neural network architectures remained consistent, enabling a comprehensive comparison of their performance across different datasets.

Results and discussion

As expected, the results for forecasting the parameter k depend on both the number of initial conditions N and the trajectory length L. Figure 4 displays the phase space for \(k=1.2\) as presented to the deep learning model during training. Although the density of trajectories varies, the main structures within the phase space remain preserved. This section presents the outcomes of the LSTM model using the SmoothL1 loss function, as outlined in Section “Deep learning model”. Comparative analysis of the CNN and LSTM models, as well as the effects of different loss functions, is provided in the Appendix.

Fig. 4

Phase space of the standard map \((\theta _n, p_n) \mod 2\pi\) for N different initial trajectories and L iterations of the map, with \(k = 1.2\). As the number of trajectories (N) and iterations (L) increase, the phase space becomes increasingly populated, providing a more comprehensive representation of the dynamics.

After training for a fixed pair \((N, L)\), we use the reserved values of k in the forecasting phase, as outlined in Section “Data generation”, to obtain the predicted values \(k_{pred}\) from our model. Figure 5 presents the Probability Density Functions (PDFs) of \(\log (k_{\text {pred}}/k_{\text {true}})\) for various combinations of \((k, N)\), each trained on trajectories of different lengths. Figure 6 illustrates the PDFs of \(\log (k_{\text {pred}}/k_{\text {true}})\) for different combinations of \((k, L)\), each trained on a variable number of trajectories. These PDFs were estimated using Kernel Density Estimation (KDE) with a Gaussian kernel and normalized to unit area.
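The error statistic and its KDE can be reproduced with SciPy; the predictions below are synthetic stand-ins for model output, not the paper's results:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
k_true = 1.2
# Synthetic stand-in: predictions log-normally scattered around k_true.
k_pred = k_true * np.exp(rng.normal(0.0, 0.1, size=500))

log_err = np.log(k_pred / k_true)      # the quantity plotted in Figs. 5 and 6
kde = gaussian_kde(log_err)            # Gaussian-kernel density estimate
grid = np.linspace(-0.5, 0.5, 201)
pdf = kde(grid)
area = np.sum(pdf) * (grid[1] - grid[0])   # ~1: the KDE integrates to unity
```

A narrower peak of `pdf` around zero corresponds to \(k_{\text {pred}} \approx k_{\text {true}}\), the accuracy criterion used below.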

Fig. 5

Probability Density Functions (PDFs) of the logarithmic error \(\log (k_{\text {pred}}/k_{\text {true}})\), estimated using Kernel Density Estimation (KDE) with a Gaussian kernel, for three different values of the chaoticity parameter k. Each plot shows the distributions of predictions from models trained on trajectories of varying lengths for a fixed pair \((k, N)\). This demonstrates how sequence length influences the accuracy of predicting the parameter k.

Fig. 6

Probability Density Functions (PDFs) of the logarithmic error \(\log (k_{\text {pred}}/k_{\text {true}})\), estimated using KDE with a Gaussian kernel, for three different values of k. The curves in each plot represent predictions from models trained on different numbers of trajectories for a fixed pair \((k, L)\). This illustrates the effect of the number of trajectories on the accuracy of the predicted values for the chaoticity parameter k.

Narrower PDFs around \(k_{\text {pred}} = k_{\text {true}}\) indicate improved accuracy of the model’s predictions. At first glance, the data-driven approach for predicting parameters in Hamiltonian chaotic systems, represented here by the standard map, shows only limited success. Generally, the predictability of the chaotic parameter \(k\) is influenced by the map’s dynamics, governed by the KAM theorem, in which the density of regular trajectories varies with \(k\). Thus, we find that the forecast accuracy for \(k\) depends on the parameter’s values, given fixed \(N\) and \(L\). Examining the figures, we observe that for lower values of \(k\), where regular trajectories cover a larger portion of phase space, the PDFs remain broad, particularly for smaller \(L\) and \(N\) values. However, as \(k\), \(L\), and \(N\) increase, the PDFs become narrower, suggesting improved prediction accuracy. This trend aligns with expectations, as data-driven deep learning models tend to perform better with larger datasets. Additionally, the differences in predictability across various \(k\) values are non-trivial and in some cases counterintuitive.

In Fig. 7, we report the mean values and standard deviations of \(\log (k_{\text {pred}}/k_{\text {true}})\) as functions of \(k_{\text {true}}\) used in forecasting across different combinations of \((N, L)\). Our results confirm that higher values of \(N\) and \(L\) enhance the system’s predictability. A particularly interesting and counterintuitive finding is that, for a fixed \((N, L)\), there is a trend indicating that the PDFs become narrower as \(k\) increases. This suggests that predictability improves at higher \(k\) values, even when the PDFs occasionally peak away from \(k_{\text {pred}} = k_{\text {true}}\). However, the best results are achieved for intermediate values of \(k\).

In Fig. 8, we display the dispersion of \(k_{\text {pred}}\) around the corresponding \(k_{\text {true}}\) values on the \((k_{\text {true}}, k_{\text {pred}})\) plane, for selected values of \((N, L)\). This figure reveals that the minimum dispersion tends to occur at intermediate \(k\) values. In other words, the deep learning model appears to perform best when the system’s phase space is balanced between extreme conditions, avoiding overly expansive or constrained regions of chaotic and regular trajectories. This finding confirms our earlier result, indicating that prediction accuracy depends more on the phase space structure than solely on the values of \(N\) and \(L\).

Fig. 7

Average values and standard deviations of the logarithmic error \(\log (k_{\text {pred}}/k_{\text {true}})\) across test sets for various combinations of the pair \((N, L)\). A solid horizontal line at zero indicates perfect predictions (\(\text {predictions} = \text {true}\)). This figure provides insight into how the model’s prediction accuracy improves as the number of trajectories and sequence length vary.

Fig. 8

Comparison of the average predictions of k by the deep neural network (DNN) (on the y-axis) against the ground truth values (on the x-axis) for the test sets. The diagonal line represents perfect predictions, where \(\text {predictions} = \text {true}\). Error bars represent the standard deviation from the average, indicating the variability of the predictions. The average and standard deviation are calculated over predictions made using varying numbers of trajectories, N.

Conclusion

In this study, we applied deep neural networks (DNNs) to analyze the standard map, a model for non-integrable Hamiltonian systems governed by the KAM theorem, to evaluate its effectiveness in predicting the chaoticity parameter \(k\). Our goal was to explore whether deep learning can reliably estimate the chaoticity parameter, which represents the degree of chaos, by training on variations in initial conditions and trajectory lengths. The results underscore that prediction accuracy is highly dependent on both the phase-space structure, controlled by \(k\), and the density of trajectories in the phase space. Specifically, prediction accuracy improves as the phase space becomes more evenly populated with both regular and chaotic trajectories, particularly for intermediate values, where regular and irregular structures are balanced.

For lower values of \(k\), where regular trajectories dominate, the DNN model showed limited predictive ability, likely due to the relatively uniform coverage of phase space by regular trajectories. In contrast, high values of \(k\) presented another challenge: while dominated by chaotic trajectories, these regions also retained persistent stable structures, such as stable manifolds and regular islands, within the predominantly chaotic regions of phase space. These structures are a distinctive feature of mixed-phase-space systems, governed by the KAM theorem, where regular and chaotic dynamics coexist. Even in regions where chaos predominates, these stable structures remain embedded and can influence the dynamics, making it more difficult for the DNN to make accurate predictions. These ‘hidden’ regularities are not necessarily obvious or widespread, but their presence creates a complex topology that hinders the model’s ability to learn and predict accurately. However, for intermediate values of \(k\), the model achieved its best performance, suggesting that deep learning is most effective when the phase space is neither entirely regular nor entirely chaotic.

Further analysis, as presented in the Appendix, provided deeper insights into the model’s behavior concerning chaotic and non-chaotic trajectories. The results indicate that predictions based solely on non-chaotic trajectories are significantly more accurate than those for chaotic trajectories, emphasizing the model’s sensitivity to regular patterns.

Additionally, the comparative analysis of convolutional neural networks (CNNs) and long short-term memory (LSTM) models revealed distinct advantages. The LSTM model demonstrated superior performance in predicting chaotic trajectories compared to the CNN model, which exhibited higher loss values for chaotic paths. This suggests that LSTMs, which can capture temporal dependencies, are better suited for handling chaotic systems. Furthermore, the results confirmed that increasing the number of initial conditions (N) and the trajectory length (L) enhances the model’s predictive accuracy, though improvements plateau beyond a certain threshold.

Loss function analysis further supported these findings, with SmoothL1 outperforming Mean Squared Error (MSE) and Mean Absolute Error (MAE) across different trajectory lengths and models. This underscores the importance of selecting appropriate loss functions when training deep learning models for chaotic systems.

Ultimately, this research demonstrates that while DNNs have significant potential in chaotic system analysis, their efficacy in parameter inference is subject to inherent system complexities as defined by the KAM theorem. Future research can leverage these insights to refine deep learning models for better adaptability in systems with varying chaotic and regular dynamics.