Background and Summary

Proteins, the building blocks of life, are central to nearly all biological processes, and understanding their structure and dynamics is crucial for advancements in fields ranging from biochemistry to pharmaceuticals. The convergence of advanced computational methods and biophysical techniques has led to unprecedented insights into molecular structures and functions of proteins. Molecular dynamics (MD), for example, is a compute-intensive technique that attempts to model the dynamics of biological macromolecules in realistic environments, often at all-atom resolution, based on empirical force-fields whose quality has been improving over decades1,2,3. Machine learning, especially through the development of neural network potentials (NNPs), has the potential to further enhance computational protein research by enabling more accurate predictions and simulations of behaviors4,5,6. However, the lack of comprehensive datasets capturing the dynamic behaviors of proteins remains a significant challenge7. Such datasets are vital for training machine learning models that can predict protein folding, functions, and interactions — often dynamic and transient processes, yet critical for understanding how macromolecules work, interact, and how they might be targeted. High-quality datasets are thus pivotal in advancing our comprehension of these complex phenomena. In recent years, efforts have been made to provide MD datasets, especially for key targets in drug discovery. Notable databases include GPCRmd8, a platform dedicated to the study of G-protein-coupled receptors (GPCRs) dynamics, and SCOV2-MD9 as well as BioExcel-CV1910, both showcasing the power of collaborative MD databases in the context of COVID-19 research. However, these initiatives are limited by their focus on specific proteome subsets, leaving a gap in comprehensive proteome-wide dynamic datasets. Previous projects such as MoDEL11, Dynameomics12 and ATLAS13, and the MDDB14 and MDRepo15 initiatives have been introduced to provide dynamics datasets encompassing a broader range of proteins, often in a single replica and at room temperature, but the computational cost of MD has generally limited databases in terms of coverage breadth and timescales.

Here, we introduce mdCATH, a dataset focused on providing extensive all-atom MD-derived dynamics for most protein domains in the CATH classification system16. mdCATH features simulations of 5,398 domains at five different temperatures, each in five replicas, therefore offering statistically relevant large-scale insights into protein structure dynamics under a multiplicity of conditions. This extensive and homogeneously-collected dataset of all-atom molecular dynamics simulations fills a critical void in the available molecular datasets by offering a rich, diverse, and physiologically relevant array of protein domain dynamics, enabling systematic, proteome-wide studies into protein thermodynamics, folding, and kinetics. It is possible to exploit mdCATH for learning data-driven (e.g. neural network-based) potentials17, also thanks to the inclusion, unique to our knowledge, of instantaneous forces derived from a state-of-the-art all-atom force field. We hope that the mdCATH dataset will facilitate improvements in the design and refinement of biomolecular force fields.

Dataset Requirements

Our goal is to take a step forward in creating a proteome-wide molecular dynamics dataset for advancing drug discovery and enabling researchers to explore the dynamic behaviors of diverse protein targets. We built the mdCATH dataset to meet the following design features:

  • Comprehensive coverage of structural features. mdCATH provides molecular dynamics information across 5,398 protein domains from the CATH classification system. This extensive coverage ensures a broad representation of the proteome, making the dataset valuable for a wide range of research applications in drug discovery.

  • MD-derived coordinates and forces. The dataset includes both coordinates and forces from simulated trajectories. The presence of forces is a unique feature in this dataset, which enables training force-based machine learning potentials.

  • Wide conformational space sampling. mdCATH features multiple replicas at different temperatures, capturing a variety of conformations, including higher energy states encountered in molecular dynamics simulations. This ensures that the potential functions trained on this dataset produce accurate results across all relevant conformations.

  • High quality data. To ensure the highest accuracy, mdCATH utilizes state-of-the-art force fields, code, and computational resources. The accuracy of the dataset directly impacts the performance of models trained on it, making the use of the most accurate level of theory practical a priority.

  • Derived metadata. The dataset includes pre-computed information such as root-mean-square deviation (RMSD), root-mean-square fluctuation (RMSF), secondary structure composition, and so on.

  • Reproducibility. Reproducibility is ensured by including the PDB and PSF files in the dataset. Additionally, the data is stored in the efficient HDF5 binary data format, facilitating easy access and manipulation of the dataset for further research and model training.

Methods

We built the dataset on the basis of the domain definitions provided by the CATH database18,19,20. CATH, a publicly available resource maintained by the Orengo group, provides a set of domains clustered by general architecture according to the class, architecture, topology, and homologous superfamily hierarchy16. We started from 14,433 non-homologous domains at the S20 (20%) homology level in CATH release 4.2.0. We then restricted the selection to the subset of 13,470 domains between 50 and 500 amino acids, to focus on globular structures. Next, we excluded all the structures whose backbone was non-contiguous, e.g. due to unresolved regions in the original experimental structures; we also excluded sequences containing non-standard amino acids (also absent from CATH model files). The inclusion criteria left 5,883 residues for further processing.

All the domain structures have been prepared with a standard protonation protocol at pH 7 including charge state assignments, proton placement and H-bond network optimization21. Peptide chains were capped with acetylated and N-methylated termini. The systems were solvated in cubic boxes of TIP3P water with at least 9 Å of padding on each side, neutralized, and ionized with Na+ and Cl ions at 0.150 M concentration. Systems whose resulting solvation cubic box was larger than (100 Å)3 were discarded. The final dataset includes 5,398 accepted domains, as illustrated in Fig. 1. HTMD version 1.16 was used for all the building steps22,23.

Fig. 1
figure 1

Exclusion criteria and the resulting number of domains at each step, starting from the 14,433 domains in the S20 homology set of CATH release 4.2.0, and ending with 5,398 domains included in the mdCATH dataset presented in this work.

All systems were parameterized with the CHARMM22* forcefield1. Long-range electrostatic forces were treated with the particle-mesh Ewald (PME) summation24, with an integration timestep of 4 fs enabled by the hydrogen mass repartitioning scheme of 4 amu per H atom25. The simulations were performed with ACEMD26 on GPUGRID.net distributed network27.

Each system thus obtained was subjected to a pre-equilibration phase for 20 ns with a time-step of 4 fs in the NPT ensemble at 1 atm and 300 K utilizing the Montecarlo barostat. Harmonic restraints were applied to the protein’s carbon α atoms (1.0 kcal/mol/Å) and heavy atoms (0.1 kcal/mol/Å) to maintain them close to their initial positions during the first half (10 ns) of equilibration. The second half of equilibration (10 ns to 20 ns) was performed without restraints. No restraints were used during the subsequent production phase.

The final configuration of each system was used as a starting point for 25 production simulations, spawning runs at five temperatures in geometric progression (320 K, 348 K, 379 K, 413 K, 450 K), each in five replicas. The production simulations were performed in the NVT ensemble using Langevin thermostat for integration and a 0.1 ps−1 relaxation time. The use of the constant-volume ensemble sidesteps issues with the poor reproduction of the water phase and pressure by TIP3P28,29. Bonds involving hydrogen atoms were constrained at the equilibrium length with the M-shake algorithm30 with a tolerance of 10−5. Atom positions and forces acting on each atom were recorded every 1 ns and made available as part of the dataset as described below. A sampling rate of 1 ns bounds the tractable kinetics, enabling the resolution of the dynamics of relatively slow degrees of freedom such as conformational changes, but not faster motions (e.g. solvent-exposed side-chain rotations). For both NPT and NVT simulations, a 9 Å cutoff was applied for PME, while van der Waals interactions used a cutoff of 9 Å and a switching distance of 7.5 Å. Analysis of the trajectories was conducted using the HTMD library23, in order to include potentially useful pre-computed metadata. Secondary structure assignments have been computed for each frame and residue using the implementation of the DSSP algorithm in moleculekit version 1.8.32, encoded following the customary 8-class codes31.

Data Records

The mdCATH dataset makes the trajectories available under a CC BY 4.0 license. It is available at HuggingFace32. It is possible to (1) download individual domain files from HuggingFace via a browser; (2) retrieve them via the HuggingFace dataset API (Listing 2); (3) visualize them interactively (without downloading) on the PlayMolecule website (see the “Code Availability” section); (4) download them from PlayMolecule in XTC format.

Organization

The dataset is provided as a set of files in the Hierarchical Data Format, version 5 (HDF5). HDF5 allows the efficient storage and random access of heterogeneous data fields and arrays organized in a filesystem-like hierarchy. For the sake of simplicity, all of the data related to a given domain were collected into an individual HDF5 file. The dataset provided is structured into fields that describe snapshots of molecular simulation trajectories and derived quantities as shown in Table 1. The root group of each file in the dataset is the domain ID, which aggregates fields such as chain, element, resid, resname, and z, each a vector of length N, representing the number of protein atoms. The pdb and psf strings hold, respectively, the verbatim PDB file used for the simulation (with solvent) and its topology in CHARMM/XPLOR protein structure file (PSF) format; pdbProteinAtoms holds a PDB of the N solute atoms used for analysis. Data on the dynamics are organized hierarchically: five groups at the top-most level named according to the temperature; each temperature group includes five groups for each of the replicas; finally, each replica holds fields for atomic coordinates, forces, simulation box, as well as pre-computed derived quantities such as secondary structure assignments, instantaneous gyration radius, root-mean-square deviation, and fluctuations. Coordinates and forces are stored as three-dimensional arrays, their axes running along frames, atoms, and spatial dimensions. DSSP secondary structure assignments are provided per residue and frame following the standard 8-letter codes.

Table 1 Hierarchical organization of the data fields in the mdCATH dataset, with units and description.

Size

At the production cut-off date, we collected 134,950 trajectories for 5,398 domains, which were included in the dataset. Figure 2a and 2b show the distribution of system sizes that made it to the production simulation phase in terms of the number of solute atoms and the number of amino acids. Due to the distributed nature of the computing network, the length of the simulations varies (independently from system size), the majority of trajectories being 500 ns long (average 464 ns, standard deviation 76 ns; Fig. 2c). The total simulated time is over 62 ms. The full dataset size is over 3 TB. Further aggregate statistics are reported in Table 2.

Fig. 2
figure 2

(a) Distribution of the number of atoms per domain, revealing a variation of nearly an order of magnitude in the atom counts across systems. (b) Distribution of the total number of residues per domain, showing a broad peak of around 100 residues and a long tail of up to 500 residues (cut-off size). (c) Distribution of trajectory lengths, peaking at 500 ns. (d) Distribution of root mean square deviation (RMSD) of the protein’s heavy atoms between the first and the last frame of each trajectory.

Table 2 Descriptive statistics of the mdCATH dataset.

Technical Validation

We perform several statistical analyses of the dataset to validate its content.

Validation of temperature denaturation

As a first validation of the dataset, we examined the correlation between the amount of secondary structure and the radius of gyration, which was assumed to be a proxy for domain compactness. The fraction of amino acids that are in helical or β-strand configurations, represented by the DSSP codes G, H, I, E, and B, is used to define the amount of secondary structure. This will be referred to as “α + β” for simplicity. Figure 3 shows the results for six domains at 320 K (only one replica is shown for clarity). The radius of gyration and the fraction of sequence in secondary structure elements naturally depend on the domain architecture. At 320 K the domains are generally stable, and both values exhibit fluctuations around mean values but no systematic drift nor marked correlations, with the possible exception of 1w9rA00, which undergoes a transition compacting its radius of gyration from 2.4 nm to 1.8 nm.

Fig. 3
figure 3

Relation between the radius of gyration and the number of residues in α or β secondary structure elements for six of the mdCATH domains simulated. Each point represents a frame, taken between 0 and 500 ns (blue to yellow) at 1 ns intervals, from the first replica of a run at 320 K.

We then validated whether the relationship holds at increasing temperatures. Figure 4 shows the relation between the radius of gyration and the fraction of sequence in secondary structure elements for a specific domain, subtilisin inhibitor-like, a 2-layer α-β sandwich of 106 amino acids (CATH-Gene3D entry G3DSA:3.30.350.10), at increasing temperatures. Between 320 K and 379 K, the dynamics appear essentially unchanged, namely both quantities fluctuate randomly and uncorrelated within the 500 ns of sampled time. Some destabilization starts to appear at 413 K: the fraction of α/β structure is unchanged, while the radius of gyration has a marked increase beyond the 1.4 nm threshold. At 450 K the system unfolds: the amount of secondary structure drops below 30%, and the radius of gyration grows beyond 1.5 nm within 100 ns.

Fig. 4
figure 4

Relation between the radius of gyration and the number of residues in α or β secondary structure elements for domain 5sicI00 (subtilisin inhibitor-like, a 2-layer α-β sandwich of 106 amino acids) at increasing temperatures. Destabilization is seen at 413 K, and at 450 K complete unfolding occurs within 100 ns (last two panels, fixed scale and full view respectively). Axes and legend are as in Fig. 3.

Fluctuation-unfolding cooperativity

We further validated the dataset by assessing the fluctuation of residues in relation to secondary structure and temperatures. Figure 5 displays, for each residue, the fraction of time spent in an α or β secondary structure element compared to the root mean squared fluctuation (RMSF) of the same residue. The structure-fluctuation relationships are shown for three domains taken as examples, namely 5j8eA00 (actin-binding protein, T-fimbrin, domain 1; mainly α), 2a06B02 (cytochrome Bc1 complex, chain A, domain 1; α-β), and 2xryA01 (HUP superfamily, 6-strand sheet Rossmann fold), in rows, each shown at low (320 K, left column) and high temperature (450 K, right column). A clear inverse relationship between local structure and fluctuation emerges which supports that the dataset is well constructed.

Fig. 5
figure 5

Relationship between the fraction of time each residue spends in α or β secondary structure elements and its root mean squared fluctuation (RMSF) across actin-binding protein T-fimbrin domain (5j8eA00), cytochrome Bc1 complex domain (2a06B02), and HUP superfamily domain (2xryA01). Each point indicates a residue, colored by its position along the sequence, from purple (N-terminal) to red (C-terminal). The relationship is presented in two temperature conditions, 320 K (left) and 450 K (right; note the different RMSF scale). At 320 K, secondary structure presence exhibits a bimodal distribution, weakly correlated with RMSF. Bimodality disappears at 450 K, showing a continuum in the participation to structure elements, which is roughly inversely correlated to the corresponding fluctuations.

Class-wise thermodynamics of denaturation

It is possible to combine the annotations and metadata provided by the CATH database to cross-reference dynamic data with protein classification. For example, we can leverage CATH metadata by conditioning the analysis on the top-most classification level of CATH (Class), defined in terms of the general architectural organization of the domain: mainly α, mainly β, α-β, few secondary structures, and special.

Figure 6 illustrates the construction of probability distribution for various domains, conditioned using domain class annotations. This figure uses ternary plots to show the distribution of protein secondary structures — helical (top), strand (left), and coil/turn (right) content — on a plane. These plots are based on data from the last snapshot of all replicas across all domains, categorized by temperature and domain type. The plots clearly show a shift in the fractions of helical and strand structures toward coil content at temperatures of 413 K and 450 K. Notably, the strand content shows greater resistance to thermal denaturation compared to the helical content.

Fig. 6
figure 6

Distribution of protein secondary structures content–helical (top), strand (left), and coil/turn (right)–organized by CATH domain class and temperature. It represents data taken from the final snapshot of all replicas across all domains, illustrating how the proportions of helical and strand structures shift toward coil content as temperatures increase.

Kinetics of secondary structure loss

As a last example, we show how it is possible to combine the annotations and metadata provided by the CATH database to extract proteome-wide kinetic data. Supplementary Figure S1 analyzes the conservation of α/β structure in time as a function of temperature for the four classes (mdCATH has no representative of the “special” class). Each panel reports time on the horizontal axis and the fraction of residues in secondary structure elements, normalized so the initial value is one, on the vertical axis. Values for 50 domains per class and replicas are aggregated and displayed as distributions. Different cooperativity regimes emerge for the four classes (Kolmogorov-Smirnov tests for all distribution pairs at 400 ns: p 10−6). Mainly β domains appear to be the most stable, losing structure only at 450 K. Mainly α domains exhibit a partial loss of structure at 413 K; interestingly, at 450 K their transition to a low-secondary structure state is, on average, abrupt (100 ns). Mixed α-β domains have an intermediate behaviour showing aspects of both. Lastly, as expected, the few secondary structures class is pretty much diffuse and heterogeneous.

Usage Notes

An ad-hoc class, torch_geometric.data.Dataset, has been integrated into TorchMD-Net33 to streamline the use of the mdCATH dataset, providing precise control over the protein domain selection and advanced filtering options for trajectories. Listing 1 shows a self-contained code demonstrating how to use the mdCATH data loader in TorchMD-Net for model training, highlighting how additional dataset arguments can be used to focus on specific cases of interest. Future dataset releases will include additional simulations at 300 K to expand coverage around room-temperature conditions.

Listing 1. Importing mdCATH as a training set in TorchMD-NET.

Listing 2. Example of how to download an mdCATH HDF5 file using the HuggingFace API.

Fig. 7
figure 7

Example of an mdCATH trajectory loaded in the PlayMolecule platform.