Introduction

In recent years, machine learning (ML) has emerged as a powerful new tool in the materials scientist’s toolkit1,2,3,4. Sophisticated ML models have found their way into a multitude of applications. Surrogate ML models for “instant” predictions of properties such as formation energies, band gaps, mechanical properties, etc.5,6,7,8,9,10,11,12,13 have greatly expanded our ability to explore vast chemical spaces for new materials. In addition, ML has been widely used to parameterize potential energy surfaces (PESs)14,15,16, enabling the direct prediction of potential energies, forces, and stresses from atomic positions and chemical species. These ML interatomic potentials (MLIPs)17,18,19,20,21,22,23,24,25,26,27 provide the means to parameterize complex PESs and perform large-scale atomistic simulations with unprecedented accuracies.

Among ML model architectures, graph deep learning models, also known as graph neural networks (GNNs), utilize a natural representation that incorporates a physically intuitive inductive bias for a collection of atoms28. Figure 1 depicts a typical graph deep learning architecture. In the graph representation, the atoms are nodes and the bonds between atoms (usually defined based on a cutoff radius) are edges. In most implementations, each node is represented by a learned embedding vector for each unique atom type (element). Additionally, some architectures such as the MatErials Graph Network (MEGNet)5 and Materials 3-body Graph Network (M3GNet)29 also include an optional global state feature (u) to provide greater expressive power, for instance, in the handling of multifidelity data30,31. A graph deep learning model is constructed by performing a sequence of update operations, also known as message passing or graph convolutions. In the final layer, the embeddings are pooled and passed through a multilayer perceptron (MLP) to arrive at the final prediction. GNNs can be broadly divided into two classes in terms of how they incorporate symmetry constraints. Invariant GNNs use scalar features such as bond distances and angles to describe the structure, ensuring that the predicted properties remain unchanged with respect to translation, rotation, and permutation. Equivariant GNNs, on the other hand, go one step further by ensuring that the transformation of tensorial properties, such as forces, dipole moments, etc., with respect to rotations is properly handled, thereby allowing the use of directional information extracted from relative bond vectors. For a comprehensive overview of different GNN architectures and their applications, readers are referred to recent literature32,33. Given sufficient training data, GNN architectures such as NequIP34, MACE35, Equiformer36 and many others37,38,39 have been shown to provide state-of-the-art accuracies in the prediction of various properties and PESs5,40,41,42. Furthermore, unlike other MLIP architectures based on local-environment descriptors, GNNs have a distinct advantage in the representation of chemically complex systems. The recent emergence of foundation potentials (FPs)29,43,44,45,46,47, i.e., universal MLIPs with coverage of the entire periodic table of elements, is a particularly effective demonstration of the ability of GNNs to handle diverse chemistries and structures.
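As a concrete illustration of this representation (a minimal sketch, not MatGL code), the snippet below builds the node and edge lists of a materials graph from a Pymatgen Structure using a cutoff radius; the bcc Fe cell and the 4 Å cutoff are arbitrary choices for demonstration.

```python
# Minimal illustration (not MatGL code): atoms become nodes, and atom pairs
# within a cutoff radius become the directed edges of the materials graph.
from pymatgen.core import Lattice, Structure

structure = Structure(Lattice.cubic(2.87), ["Fe", "Fe"], [[0, 0, 0], [0.5, 0.5, 0.5]])
cutoff = 4.0  # Å; pairs within this radius define the edges (bonds)

node_features = [site.specie.Z for site in structure]  # e.g., atomic number per node
edges = []  # (source index, destination index, bond distance)
for i, neighbors in enumerate(structure.get_all_neighbors(cutoff)):
    for neighbor in neighbors:
        edges.append((i, neighbor.index, neighbor.nn_distance))

print(f"{len(node_features)} nodes, {len(edges)} directed edges")
```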

Fig. 1: Graph deep learning architecture for materials science.

Vn and En denote the sets of node/atom ({vi}) and edge/bond ({eij}) features, respectively, in the nth layer. Some implementations include a global state feature (U) for greater expressive power. Between layers, a sequence of edge (fE), node (fV) and state (fU) update operations are performed. fE, fV and fU are usually modeled using multilayer perceptrons. In the final step, the edge, node and state features are pooled (P) and passed through a multilayer perceptron to arrive at a prediction.
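The update sequence in Fig. 1 can be sketched schematically as follows. This is a plain-PyTorch illustration with mean aggregation and a single MEGNet-style layer, not the actual MatGL implementation; the feature dimensions and toy inputs are arbitrary.

```python
# Schematic sketch of the fE -> fV -> fU update sequence and the final readout.
import torch
from torch import nn


def mlp(dim_in, dim_out, hidden=64):
    return nn.Sequential(nn.Linear(dim_in, hidden), nn.SiLU(), nn.Linear(hidden, dim_out))


class SchematicLayer(nn.Module):
    def __init__(self, dv=16, de=16, du=16):
        super().__init__()
        self.f_e = mlp(2 * dv + de + du, de)  # edge update from (v_i, v_j, e_ij, u)
        self.f_v = mlp(dv + de + du, dv)      # node update from (v_i, aggregated edges, u)
        self.f_u = mlp(dv + de + du, du)      # state update from pooled nodes, edges and u

    def forward(self, v, e, u, src, dst):
        n_nodes, n_edges = v.shape[0], e.shape[0]
        e = self.f_e(torch.cat([v[src], v[dst], e, u.expand(n_edges, -1)], dim=-1))
        # Mean-aggregate the updated edge features onto their destination nodes.
        agg = torch.zeros(n_nodes, e.shape[1]).index_add(0, dst, e)
        counts = torch.zeros(n_nodes).index_add(0, dst, torch.ones(n_edges)).clamp(min=1)
        v = self.f_v(torch.cat([v, agg / counts[:, None], u.expand(n_nodes, -1)], dim=-1))
        u = self.f_u(torch.cat([v.mean(0), e.mean(0), u]))
        return v, e, u


# Toy usage: 2 nodes and 2 directed edges, followed by average pooling and a readout MLP.
v, e, u = torch.randn(2, 16), torch.randn(2, 16), torch.randn(16)
src, dst = torch.tensor([0, 1]), torch.tensor([1, 0])
v, e, u = SchematicLayer()(v, e, u, src, dst)
prediction = mlp(16, 1)(v.mean(0))  # pooled node features -> final MLP -> property
```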

At the time of writing, most software implementations of materials GNNs48,49,50 are for a single architecture, built on PyTorch-Geometric51, TensorFlow52 or JAX53. However, recent benchmarks show that the Deep Graph Library (DGL)54 outperforms PyTorch-Geometric in terms of memory efficiency and speed, particularly when training on large graphs with the same GNN architectures across various benchmarks54,55. This improved efficiency enables the training of models with larger batch sizes as well as the performance of large-size and long-time-scale simulations.

In this work, we introduce the Materials Graph Library (MatGL), an open-source, modular, extensible graph deep learning library for materials science. MatGL is built on DGL, PyTorch and the popular Python Materials Genomics (Pymatgen)56 and Atomic Simulation Environment (ASE)57 materials software libraries. MatGL provides a user-friendly workflow for training property models and MLIPs, with data pipelines and PyTorch Lightning (PL) training modules designed for the unique needs of materials science. In its present version, MatGL provides implementations of several state-of-the-art invariant and equivariant GNN architectures, including the Materials 3-body Graph Network (M3GNet)29, MatErials Graph Network (MEGNet)5, Crystal Hamiltonian Graph Neural Network (CHGNet)43, TensorNet58 and SO3Net49, as well as pre-trained FPs and property models based on these architectures. To facilitate the use of pre-trained FPs in atomistic simulations, MatGL also implements interfaces to widely used simulation packages such as the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) and ASE. The intent is for MatGL to serve as a common platform for the scientific community to collaboratively advance graph deep learning architectures and models for materials science.

Results

In the following sections, we present the MatGL framework, with the manuscript organized as follows: We start with a schematic overview of the core model components, followed by a concise summary of the data pipeline and preprocessing steps. We then introduce the available graph neural network (GNN) architectures for property prediction and the construction of MLIPs. Next, we detail the key components involved in training and deploying these architectures, explaining their integration into MatGL. Additionally, we introduce the simulation interfaces for atomistic simulations and the command-line interface for various applications. Finally, we demonstrate the performance of different GNN architectures on widely used datasets, encompassing both molecular and periodic systems.

MatGL architecture

MatGL is organized around four components: data pipeline, model architectures, model training and simulation interfaces. Figure 2 gives an overview of the MatGL architecture, and detailed descriptions of each component are provided in the following paragraphs.

Fig. 2: Overview of MatGL.

Class names are in italics. MatGL can be broken down into four main components: 1. the data pipeline component preprocesses a set of raw data into graphs and labels; 2. the architecture component builds the GNN model from the modular layers implemented in the library; 3. the training component utilizes PyTorch Lightning to train either property models or MLIPs; and 4. the simulation component integrates the MatGL models with atomistic packages such as ASE and LAMMPS to perform molecular dynamics simulations.

The first core component introduced is the data pipeline and preprocessing. The MatGL data pipeline consists primarily of MGLDataset, a subclass of DGLDataset, and MGLDataLoader, a wrapper around DGL’s GraphDataLoader. MGLDataset is used for processing, loading and saving materials graph data, and includes tools to easily convert Pymatgen Structure or Molecule objects into directed or undirected graphs, while MGLDataLoader batches a set of preprocessed inputs with customized collate functions for training and evaluation. The main features of MGLDataset and MGLDataLoader are summarized below.

An important feature of MGLDataset is that it provides a pipeline for constructing graphs from inputs and for loading and saving DGL graphs and labels. The commonly used inputs consist of the following items:

  • structures: A set of Pymatgen Structure or Molecule objects.

  • converter: A graph converter that transforms a configuration into a DGL graph.

  • cutoff: A cutoff radius that defines a bond between two atoms.

  • labels: A list of target properties used for training.

Other inputs such as global state attributes and a cutoff radius for three-body interactions are optional depending on the model architecture and applications. The default units for PES properties are Å for distance, eV for energy, eV Å−1 for force, and GPa for stress. MGLDataset also includes the ability to cache pre-processed graphs, which can facilitate the reuse of data for the training of different models. Once the MGLDataset is successfully loaded or constructed, the dataset can be randomly split into the training, validation, and testing sets using the DGL split_dataset method. MGLDataLoader is then used to batch the separated training, validation and optional testing sets for either training or evaluation via PL modules.
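A sketch of this data pipeline is shown below, assembled from the inputs described above. The keyword names follow this description and may differ slightly between MatGL releases, and the structures and labels are dummy values for illustration; consult the MatGL documentation for the authoritative signatures.

```python
# Hedged sketch of the MatGL data pipeline: dataset construction, splitting and batching.
from dgl.data.utils import split_dataset
from pymatgen.core import Lattice, Structure

from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import MGLDataLoader, MGLDataset

structures = [Structure(Lattice.cubic(2.87), ["Fe", "Fe"], [[0, 0, 0], [0.5, 0.5, 0.5]])] * 20
labels = {"eform": [0.0] * 20}  # structure-wise target property (eV/atom); dummy values

# Graph converter: bonds are defined by a 5 Å cutoff radius.
converter = Structure2Graph(element_types=get_element_list(structures), cutoff=5.0)
dataset = MGLDataset(structures=structures, converter=converter, labels=labels)

# Random 90/5/5 split using DGL's split_dataset, followed by batching via MGLDataLoader.
train_data, val_data, test_data = split_dataset(dataset, frac_list=[0.9, 0.05, 0.05], shuffle=True, random_state=42)
train_loader, val_loader, test_loader = MGLDataLoader(
    train_data=train_data, val_data=val_data, test_data=test_data, batch_size=8
)
```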

Another core component is the set of GNN architectures implemented in the matgl.models package, using different layers implemented in the matgl.layers package. The models and layers are all subclasses of torch.nn.Module, which offers forward and backward functions for inference and calculation of the gradient of the outputs with respect to the inputs via the autograd function. Different models will utilize different combinations of layers, but, where possible, layers are implemented in a modular manner such that they are usable across different models (e.g., the MLP layer implementing a simple feed-forward neural network). MatGL offers various pooling operations, including set2set59, average, and weighted average, to combine atomic, edge, and global state features into a structure-wise feature vector for predicting intensive properties. The pooled structural feature vector is then passed through an MLP for regression tasks, while a sigmoid function is applied to the output for classification tasks.

Table 1 summarizes the GNN models currently implemented in MatGL. The details of the models are comprehensively described in the cited references, to which interested readers are referred. It should be noted that this is merely an initial set of model implementations. In addition, all MatGL models subclass the MatGLModel abstract base class, which requires every model to implement a convenience predict_structure method that takes a Pymatgen Structure/Molecule and returns a prediction.

Table 1 Graph neural network models implemented in MatGL
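For illustration, the sketch below loads one of the pretrained property models distributed with MatGL and calls predict_structure on a Pymatgen Structure. The model name is an example of a pretrained formation-energy model shipped with MatGL and may differ between releases.

```python
# Sketch of inference with a pretrained MatGL property model via predict_structure.
import matgl
from pymatgen.core import Lattice, Structure

model = matgl.load_model("MEGNet-MP-2018.6.1-Eform")  # example pretrained formation-energy model
structure = Structure(Lattice.cubic(4.05), ["Al"] * 4,
                      [[0, 0, 0], [0.5, 0.5, 0], [0.5, 0, 0.5], [0, 0.5, 0.5]])
eform = model.predict_structure(structure)  # predicted formation energy (eV/atom)
print(float(eform))
```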

A key assumption in MLIPs is that the total energy can be expressed as the sum of atomic contributions. For PES models, the graph-convoluted atomic features are fed into either gated or equivariant gated multilayer perceptrons to predict the atomic energies. In addition, we have implemented a Potential class in the matgl.apps.pes package that acts as a wrapper to handle MLIP-related operations. For instance, a best practice for MLIPs is to first rescale the total energies, for example by computing either the formation energy or the cohesive energy, using the energies of the elemental ground states or isolated atoms, respectively, as the zero reference. The Potential class takes care of accounting for this normalization of the total energies, as well as computing the gradients to obtain the forces, stresses and Hessians. Other atomic properties such as magnetic moments and partial charges can also be predicted at the same time with the Potential class.
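The sketch below wraps a PES model with the Potential class. The keyword names shown are assumptions based on the description above and may differ between MatGL releases; consult the MatGL documentation for the exact signature.

```python
# Hedged sketch of wrapping a GNN PES model with the Potential class.
from matgl.apps.pes import Potential
from matgl.models import M3GNet

model = M3GNet(element_types=("Li", "P", "S"), is_intensive=False)  # extensive (atomic-energy) model
potential = Potential(
    model=model,
    calc_forces=True,    # forces from the negative gradient of the total energy
    calc_stresses=True,  # stresses from the strain derivative
    calc_hessian=False,
)
# Elemental reference energies / normalization factors used to rescale the total
# energies (e.g., cohesive or formation energy referencing) can also be supplied.
```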

For the training module, MatGL leverages the PL framework, which supports efficient parallelization schemes and a variety of hardware including CPUs, GPUs and TPUs. MatGL provides two PL modules, ModelLightningModule and PotentialLightningModule, for property model and PES model training, respectively. Figure 3 illustrates the training workflow for building property models and MLIPs in MatGL. A set of reference calculations comprising structures and target properties is generated from ab initio methods or experiments. The reference structures are converted into a list of Pymatgen Structure/Molecule objects, and the target properties are stored in a dictionary, where the property names are the keys and the corresponding data are the values. These inputs are passed through MGLDataset, followed by splitting of the dataset into training, validation, and optional test sets, and then through MGLDataLoader to obtain batched graphs, stacked state attributes, and labels. The desired GNN model architecture is initialized with the requisite settings such as the number of radial basis functions, cutoff radii, etc. Various algorithms such as Glorot60 and Kaiming61 implemented in PyTorch can also be used to initialize the learnable parameters in GNNs. The primary difference between the two PL modules lies in their respective loss functions. In ModelLightningModule, the loss is defined solely as the error between the predicted and target structural properties. In contrast, PotentialLightningModule uses a weighted sum of errors across various PES properties, such as energies, forces, and stresses. It can also optionally include other atomic properties that influence the PES, such as magnetic moments and charges.
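The sketch below continues from the data-pipeline sketch above (reusing its structures and data loaders) and trains a property model with the PL workflow. The Lightning-module keyword names are assumptions based on the description in the text and may differ between MatGL releases.

```python
# Hedged sketch of the PL training step for a property model.
import pytorch_lightning as pl

from matgl.models import M3GNet
from matgl.utils.training import ModelLightningModule

# Property model: intensive readout of the pooled structure-wise features.
model = M3GNet(element_types=get_element_list(structures), is_intensive=True)
lit_module = ModelLightningModule(model=model, lr=1e-3)

# For MLIPs, PotentialLightningModule is used instead, with a weighted sum of
# energy, force and stress errors (e.g., 1:1:0.1) as the loss.
trainer = pl.Trainer(max_epochs=1000, accelerator="gpu", devices=1)
trainer.fit(lit_module, train_dataloaders=train_loader, val_dataloaders=val_loader)
```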

Fig. 3: Workflow for training property models and machine learning interatomic potentials in MatGL.

The initial raw data include a list of Pymatgen Structure/Molecule objects, optional global state attributes and labels such as structure-wise and PES properties. MGLDataset preprocesses these inputs into training, validation and optional test sets containing tuples of DGL graphs, labels, optional line graphs and state attributes. These datasets are then fed into MGLDataLoader to create the batched inputs, including graphs, state attributes and labels, for training and validation. The GNN architecture is initialized with the chosen hyperparameters and passed, together with the training and validation data loaders, to the PL training modules.

For molecular simulations, MatGL currently provides interfaces to ASE and LAMMPS to perform simulations with Potential models, i.e., MLIPs. For ASE, a PESCalculator class, initialized using a Potential class and state attributes, calculates energies, forces, stresses, and other atomic properties such as magnetic moments and charges for an ASE Atoms object, with the necessary conversion into DGL graphs handled within the class itself. In addition, a Relaxer class allows users to perform structural optimization of both Pymatgen Structure/Molecule and ASE Atoms objects with different settings, such as the optimization algorithm (e.g., FIRE62, BFGS63,64 and the Gaussian process minimizer (GPMin)65) and variable-cell relaxation. Finally, a MolecularDynamics class makes it easy to perform molecular dynamics (MD) simulations under different ensembles with various thermostats such as Berendsen66, Andersen67, Langevin68 and Nosé-Hoover69,70. Additional functionality to compute material properties, such as elasticity, phonon analysis and minimum energy paths, using PESCalculator is available in the MatCalc71 package. An interface to LAMMPS has also been implemented by AdvanceSoft, which utilizes PESCalculator to provide PES predictions for simulations. This interface enables the use of MatGL for a wide range of simulations supported by LAMMPS, including replica exchange72, grand canonical Monte Carlo (GCMC)73, etc.
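The sketch below runs atomistic simulations with a pretrained FP through the ASE interface. The pretrained model name is an example of a model shipped with MatGL, and the keyword arguments are assumptions that may differ between releases.

```python
# Hedged sketch of single-point evaluation, relaxation and MD through the ASE interface.
import matgl
from matgl.ext.ase import MolecularDynamics, PESCalculator, Relaxer
from pymatgen.core import Lattice, Structure
from pymatgen.io.ase import AseAtomsAdaptor

pot = matgl.load_model("M3GNet-MP-2021.2.8-PES")  # example pretrained foundation potential
structure = Structure(Lattice.cubic(4.05), ["Al"] * 4,
                      [[0, 0, 0], [0.5, 0.5, 0], [0.5, 0, 0.5], [0, 0.5, 0.5]])
atoms = AseAtomsAdaptor.get_atoms(structure)

# Single-point energy, forces and stresses via an ASE calculator.
atoms.calc = PESCalculator(potential=pot)
print(atoms.get_potential_energy(), atoms.get_forces().shape)

# Variable-cell relaxation (FIRE by default) and NVT molecular dynamics at 300 K.
relaxed = Relaxer(potential=pot).relax(structure, fmax=0.05)
md = MolecularDynamics(atoms, potential=pot, ensemble="nvt", temperature=300, timestep=1.0)
md.run(100)  # number of MD steps
```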

Finally, MatGL offers a command-line interface (CLI) for performing a variety of tasks including model training, evaluation and atomistic simulations. This interface minimizes the user’s effort and time in preparing scripts to run calculations such as property prediction, geometry relaxation, MD, model training, and evaluation.

  • matgl predict. This command is used to perform structure-wise property prediction, such as formation energy and band gap of materials. The prediction requires at least a structure file that can be read using the Structure.from_file method from Pymatgen and a directory that stores the trained property model. Additionally, predictions for multiple structure-wise properties are also supported.

  • matgl relax. This command is used to perform geometry relaxation using the Relaxer class with a trained MLIP. Users can flexibly decide whether to perform variable-cell relaxation and can adjust the maximum allowable force components to define the relaxation criteria. The default optimizer is the FIRE algorithm62, although other optimization algorithms are also available.

  • matgl md. This command is used to perform MD simulations using the MolecularDynamics class. Similar to matgl relax, it also requires a structure and a trained MLIP. Users can customize various simulation parameters, including the step size, ensemble type, number of time steps, target pressure, and temperature. Furthermore, ensemble-dependent settings such as the collision probability, external stress, and coupling constants for thermostats can also be adjusted for specific systems.

  • matgl train and matgl evaluate. These commands are used to perform model training and evaluation, including data preprocessing, splitting, setting up the GNN architecture, and configuring Lightning modules. Users only need to provide an input file containing structures and their corresponding target properties, along with the settings for graph construction, GNN architecture, and training hyperparameters. These settings can be modified in the configuration file or specified as input arguments.

Property benchmarks

In the following paragraphs, we benchmark the performance of different GNN architectures, trained on various popular datasets, in terms of accuracy and inference time.

We first compared the performance of the various GNN architectures for predicting properties from the QM9 molecular74 and Matbench bulk crystal75 datasets. The QM9 dataset contains 130,831 organic molecules composed of H, C, N, O and F. GNN models were trained on the isotropic polarizability (α), free energy (G) and the gap (Δϵ) between the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO), which were computed using density functional theory (DFT) with the B3LYP functional.

Table 2 shows the mean absolute errors (MAEs) of the different GNN architectures. Consistent with previous analyses, MEGNet obtains the highest errors, while the other models are comparable. For example, MEGNet achieves validation and test MAEs of 0.037 eV for the free energy, while the other models achieve 0.025–0.027 eV. It should be noted that these experiments aim to demonstrate the capabilities of MatGL with consistent settings. A rigorous comparison of the best achievable accuracies of the different architectures would require an extensive search over preprocessing treatments of the target properties and hyperparameters such as the learning rate, scheduler, and weight initialization.

Table 2 Mean absolute errors (MAEs) of GNN models trained on QM9 dataset

For the Matbench dataset, we trained four different GNNs on three properties: formation energy (Eform), Voigt-Reuss-Hill bulk modulus (log(Kvrh)) and shear modulus (log(Gvrh)). The corresponding datasets contained 132,752, 10,987, and 10,987 crystals, respectively, resulting in a total of 12 property models.

Table 3 reports the MAEs of material properties, including formation energy and bulk/shear moduli, with respect to reference DFT-PBE results. All GNN models achieve state-of-the-art accuracy in terms of training, validation and test errors74,76. MEGNet generally obtains the highest MAEs of the models considered. For instance, the calculated validation and test MAEs of MEGNet for the formation energy are 0.037 eV atom−1, while the other models reduce the error by about 40%. The poorer performance of MEGNet is attributed to the less informative geometric representation of structures based only on bond distances. Recent studies77 find that distance-only GNNs fail to uniquely distinguish atomic environments, which degrades the accuracy of structure-wise properties owing to the resulting degeneracies in the representation. Models such as M3GNet, TensorNet and SO3Net achieve considerably higher accuracy by taking additional geometric information, such as bond angles and relative position vectors, into account. The learning curves for QM9 and Matbench are provided in Supplementary Figs. S1, S2.

Table 3 Mean absolute errors (MAEs) of GNNs trained on Matbench dataset

We also evaluated the efficiency of the different GNNs for property prediction. Table 4 shows the inference times on the test sets of the QM9 and Matbench datasets for the different GNN models. MEGNet achieves the shortest inference times of 12 s and 11 s for around 6500 small molecules and crystals, respectively, although its accuracy is the worst. TensorNet generally achieves the best compromise between accuracy and efficiency, taking less than 15 s for both datasets. M3GNet and SO3Net have the longest inference times for molecules and crystals, respectively. This indicates that SO3Net becomes slower than M3GNet when the number of neighbors within the spatial cutoff sphere is larger.

Table 4 Inference times of GNN models for property prediction

PES benchmarks

The following paragraphs summarize the performance of various GNN model architectures in constructing MLIPs using popular large databases, namely ANI-1x78, MPF-2021.2.8 and the recently released Materials Potential Energy Surface (MatPES) dataset v2025.179. The results and benchmarks are presented below.

The first benchmark dataset is ANI-1x78, which contains roughly 5 million conformers generated from 57,000 distinct molecules containing H, C, N, and O for constructing general-purpose organic molecular MLIPs. For comparison, we also included a transfer-learned M3GNet (M3GNet-TL) MLIP, obtained by adopting the embedding layer of a model pretrained on the ANI-1xnr dataset80 and optimizing only the remaining model parameters. We note that the ANI-1xnr dataset encompasses a significantly larger configuration space than ANI-1x, owing to the extensive structural diversity obtained from condensed-phase reactions. These reactions include carbon solid-phase nucleation, graphene ring formation from acetylene, biofuel additive reactions, methane combustion, and the spontaneous formation of glycine from early-earth small molecules.

Table 5 shows the MAEs of energies and forces computed with the different GNNs with respect to DFT. M3GNet and TensorNet achieve comparable training and validation MAEs for energies and forces, while SO3Net significantly outperforms both. A similar conclusion can be drawn from the test errors, for which SO3Net also achieves the lowest MAEs for energies and forces.

Table 5 Mean absolute errors on ANI-1x subset

The results are consistent with previous findings, indicating that equivariant models are typically more accurate and transferable than invariant models for molecular systems. Moreover, M3GNet-TL reduces the errors in energies and forces by 10–15% compared to M3GNet trained from scratch and also exhibits significantly faster convergence, as shown in Supplementary Fig. S3. The improvements are attributed to the pretrained embedding layer from the ANI-1xnr dataset, which covers a greater diversity of local atomic environments.

To further evaluate the extrapolation abilities of the GNN models, we compare the energies and forces of the molecules in the COMP6 benchmarks with respect to DFT. Figure 4 shows the MAEs of energies and forces computed with M3GNet, M3GNet-TL, TensorNet and SO3Net. Both M3GNet and M3GNet-TL perform the worst, with energy and force errors above 14 meV atom−1 and 0.14 eV Å−1 on the ANI-MD subset, which comprises MD trajectories of 14 well-known drug molecules and 2 small proteins. The large errors may be attributed to the poor transferability of MLIPs trained on small molecules to larger ones, as the largest molecule in the training set contains 63 atoms, whereas the molecules in the ANI-MD dataset contain up to 312 atoms. TensorNet significantly reduces the energy and force errors to 11 meV atom−1 and 0.1 eV Å−1, while SO3Net further reduces them to 2.3 meV atom−1 and 0.044 eV Å−1. This trend can also be found in the other benchmark datasets.

Fig. 4: Mean absolute errors on COMP6 benchmark.

Bar plots of a energy and b force errors for M3GNet, transfer-learned M3GNet (M3GNet-TL) from ANI-1xnr, TensorNet and SO3Net with respect to DFT.

To further demonstrate the performance of MLIPs constructed with MatGL against state-of-the-art models, we calculated the energy of two well-known molecules as a function of dihedral torsion. Figure 5a shows the PES of ethane during torsion. All MLIPs, including the reference ANI-1x78 and MACE-Large81, predict the same torsion angles for the maxima and minima of the PESs, while the energy barriers differ slightly. For instance, both ANI-1x and M3GNet predict a higher energy barrier of 0.15 eV, whereas MACE-Large obtains 0.125 eV. SO3Net and TensorNet predict the lowest energy barrier of 0.1 eV. For the more complex dimethyl-benzamide molecule, all the MLIPs produce a similar shape of the PES with respect to the dihedral angle, but the predicted barrier heights differ. For example, the ANI-1x model has the largest barrier height of 1.5 eV at 180°, while both TensorNet and M3GNet underestimate the energy barrier considerably, by about 0.6 eV. The energy barriers for SO3Net and MACE-Large range from 0.9 to 1.2 eV.

Fig. 5: Potential energy surface of organic molecules during torsion.

The torsion energy profiles of a ethane and b dimethyl-benzamide computed with different MLIPs. The reference ANI-1x78 and MACE-Large81 results are plotted as black and purple lines. The black arrows indicate the dihedral torsion of the molecules.

The second dataset is a manually selected subset of MPF-2021.2.8, which contains configurations sampled from the geometry relaxation trajectories of both the first and second calculation steps in the Materials Project. The total number of crystal structures is 185,877. Moreover, isolated atoms of 89 elements were also included in the training set to improve the extrapolation behavior of the final potential. The details of data generation and selection can be found in ref. 82. Here we excluded SO3Net from the benchmarks due to its relatively high sensitivity to noisy datasets, which led to extremely large fluctuations in training errors.

Table 6 shows that CHGNet generally outperforms M3GNet and is noticeably better than TensorNet in terms of energies, forces and stresses. The convergence of the validation loss and PES properties is plotted in Supplementary Fig. S4. This can be attributed to the fact that CHGNet provides additional message passing between angles and edges compared to M3GNet. Moreover, the DFT calculation settings, such as the electronic convergence criteria and the reciprocal-space grid density, are less strict, resulting in large numerical noise in the forces and stresses, which makes training particularly challenging for equivariant models that are very sensitive to these properties. Furthermore, most structures are crystals without substantial structural diversity, which diminishes the advantage of equivariant models in providing more informative representations of complex atomic environments. More detailed benchmarks on structurally diverse datasets with stricter electronic convergence are required for constructing general-purpose FPs in future studies. We also performed benchmarks on crystals, focusing in particular on binary systems obtained from the Materials Project database.

Table 6 Mean absolute error on MPF-2021.2.8 subset

The first step is to investigate the performance of the GNNs on the geometry relaxation of binary crystals and the corresponding energies with respect to DFT. It should be noted that such benchmarks for existing FPs have been reported in recent studies83,84. Figure 6a shows the cumulative structural fingerprint distance between DFT- and MLIP-relaxed structures using the CrystalNN algorithm85, which quantifies the similarity between two structures based on their local atomic environments. Overall, M3GNet and TensorNet have similar performance in terms of fingerprint distance. CHGNet shows only a modest improvement, with more structures within a distance of about 0.01 compared to M3GNet and TensorNet. Figure 6b shows the cumulative absolute energy errors of the MLIPs with respect to DFT. CHGNet predicts that about 60% of structures have an energy difference below 25 meV atom−1, which is comparable to M3GNet and 10% better than TensorNet.

Fig. 6: Performance of foundation potentials for variable-cell geometry relaxation of binary crystals.

a Cumulative absolute fingerprint distance between DFT- and MLIP-relaxed structures using the CrystalNN algorithm, and b cumulative absolute errors between DFT and MLIP energies of the relaxed crystals.

We also compared the bulk moduli predicted with the different models. Figure 7 shows the parity plots of bulk moduli computed with the FPs and DFT. All models have similar R2 scores and MAEs of about 0.8 and 20 GPa, respectively. Finally, we computed the heat capacity of the binary systems at 300 K under the harmonic phonon approximation and compared the results with DFT reference data at the PBEsol level obtained from PhononDB. Figure 8 shows that all models are in very good agreement with DFT. A very recent study86 noted a small shift between PBE and PBEsol in the prediction of phonon properties. Nevertheless, these benchmarks demonstrate that our trained MLIPs can provide reliable preliminary predictions of material properties from geometry relaxations and phonon calculations. These FPs can perform reasonably stable MD simulations across a wide range of systems at low temperatures, as their covered configuration space partially overlaps with relaxation trajectories near the equilibrium region29,43,87.

Fig. 7: Performance of foundation potentials for bulk modulus of binary crystals.

Parity plots for Voigt-Reuss-Hill bulk modulus calculated with M3GNet, TensorNet and CHGNet compared to DFT.

Fig. 8: Comparison of foundation potentials for the heat capacity of binary crystals.

Parity plots for heat capacity calculated with M3GNet, TensorNet and CHGNet compared to DFT.

We have also conducted additional benchmarks of TensorNet trained on the recently developed MatPES dataset79. Here, we use only the FP trained on MatPES PBE data (TensorNet-MatPES-PBE-v2025.1). These benchmarks include surface energies, vibrational entropies, phonon dispersions, and the structural properties of amorphous materials. These properties are generally derived from structures that are not within the training dataset.

Figure 9 shows the surface energies of fcc Cu and bcc Mo predicted by different MLIPs. The TensorNet-MatPES-PBE-v2025.1 predictions are in excellent agreement with DFT (mostly within 0.1 J m−2). The custom qSNAP MLIPs88 perform well for the fcc Cu surface energies, but consistently underestimate the Mo surface energies. TensorNet-MPF performs significantly worse for both systems and does not reproduce even the qualitative trends in surface energies between different Miller indices for Mo. This is likely due to well-known deficiencies in the MPF training dataset, as discussed by Kaplan et al.79. Figure 10 shows the calculated phonon dispersion and vibrational entropy as a function of temperature for silicon (Materials Project ID: mp-149) and gallium oxide (Materials Project ID: mp-1243) from DFT, TensorNet-MatPES-PBE-v2025.1 and custom SNAP88 and GAP89 MLIPs. TensorNet-MatPES-PBE-v2025.1 shows good agreement with both DFT and the custom MLIPs. Finally, the structural properties of amorphous Li3PS4 were calculated using TensorNet-MatPES-PBE-v2025.1 and the custom DeepMD potential90. Figure 11 shows that TensorNet-MatPES-PBE-v2025.1 generally agrees with DeepMD in terms of the peak positions of the radial distribution function (RDF). The small differences in magnitude may be attributed to the additional Grimme D3 dispersion correction91 and the use of different pseudopotentials. Overall, these extended benchmarks illustrate that the FPs can be used to study various material properties with reasonably good accuracy.

Fig. 9: Surface energies of elemental metals.

Surface energies of a fcc Cu and b bcc Mo computed with DFT and different MLIPs. The DFT data are obtained from ref. 110, while the qSNAP potential is taken from ref. 88.

Fig. 10: Phonon properties of silicon and gallium oxide.

a Phonon dispersion and b vibrational entropy of silicon (left) and gallium oxide (right) computed with different MLIPs and DFT. The DFT data are obtained from PhononDB, while the SNAP and GAP potentials are obtained from ref. 88 and ref. 89, respectively.

Fig. 11: Structural properties of amorphous material.

Radial distribution functions computed with a DeepMD and b TensorNet-MatPES-PBE-v2025.1. The DeepMD data are taken from ref. 90.

The reliability of material properties extracted from MD simulations depends critically on the accuracy of the trained MLIPs92,93. MatGL provides ASE and LAMMPS interfaces to perform MD simulations, enabling the benchmarking of different GNN architectures42,94. In addition to the accuracy of GNNs, computational efficiency is crucial for large-scale atomistic simulations. We used the above MLIPs to perform MD simulations of 1000 timesteps for scalability tests on a single GPU via the ASE and LAMMPS interfaces. Figure 12a shows the computational time for NVT simulations of non-periodic water clusters using ASE, with sizes increasing from 15 to 2892 atoms. SO3Net becomes significantly more demanding than TensorNet and M3GNet when simulating clusters with more than 100 atoms. TensorNet is the most efficient in all cases compared to M3GNet and SO3Net due to its model architecture, which does not require costly three-body calculations or tensor products. With the more scalable and optimized LAMMPS interface, Fig. 12b shows the computational time of NPT simulations for silicon diamond supercells ranging from 8 to 5832 atoms, where each Si atom has around 70 neighbors within a spatial cutoff of 5 Å. CHGNet achieves the shortest computational time, while the computational cost of M3GNet is the highest. This is likely due to the additional cost of a larger cutoff for counting triplets and three-body interactions. These models can already serve as “foundation” models for preliminary calculations with reasonably good accuracy. Moreover, building customized MLIPs often requires extensive ab initio molecular dynamics (AIMD) simulations to sample snapshots from the trajectories for training. Such demanding AIMD simulations can be replaced by the FPs at considerably reduced cost82.

Fig. 12: Inference time of MD simulations.

The number of timesteps per second for a NVT simulations of water clusters of different sizes using ASE and b NPT simulations of silicon diamond supercells of various sizes using LAMMPS. All MD simulations were performed using a single Nvidia RTX A6000 GPU.

Discussion

Graph deep learning has made tremendous progress in atomistic simulations. Here we have implemented MatGL, which comprises data pipelines, state-of-the-art graph deep learning architectures, PyTorch Lightning training modules, interfaces to atomistic simulation packages, and a command-line interface.

We also provide detailed documentation and examples in our public GitHub repository to help users become familiar with training custom models and conducting simulations using the ASE and LAMMPS packages. In addition, we provide several pretrained property prediction models and FPs, which can be used out-of-the-box for organic molecules and materials. With the combination of excellent chemical scalability and large databases, these models empower users to perform simulations across a wide range of applications, speeding up materials discovery by enabling high-throughput screening of hypothetical materials across a large chemical space95,96,97,98. Moreover, users can efficiently train their customized models with significantly faster convergence through fine-tuning from our available pretrained models. For example, our recently developed Dimensionality-Reduced Encoded Clusters with sTratified (DIRECT) sampling method significantly reduces the number of training structures required to cover large configuration spaces generated by high-throughput MD simulations using FPs82. In the GitHub repository, we have provided Jupyter notebook tutorials on fine-tuning FPs for target applications. This fine-tuning procedure can be adapted and combined with high-throughput automation frameworks such as atomate99 for active learning where necessary. Additionally, MatGL allows developers to design their own graph deep learning architectures and benchmark their performance with minimum effort, complemented by the modules available in the library. MatGL has been integrated into various frameworks, including MatSciML100 and the Amsterdam Modeling Suite101, expanding access for researchers in materials science and chemistry to conduct computational studies on a wide range of materials using GNNs. In future work, the efficiency of MLIPs can be further enhanced by integrating multi-GPU support with efficient parallelization algorithms44. In addition, training on massive databases exceeding millions of structures may encounter bottlenecks due to the memory needed to store all graphs and labels. To address this, the Lightning Memory-Mapped Database (LMDB) can be utilized to manage such large-scale training with affordable computational resources. Relevant tools for constructing reliable and robust MLIPs, such as uncertainty quantification102,103, active learning workflows104,105, and model interpretability106,107, will also be integrated into MatGL in the near future. We expect that the upcoming version of MatGL will substantially increase the accessible training set size for constructing FPs and enhance the efficiency of large-scale MD simulations, enabling the study of many interesting phenomena in materials science and chemistry.

Methods

Model training

All models were trained using ModelLightningModule for structure-wise properties and PotentialLightningModule for PES models. The optimizer was the AMSGrad variant of AdamW with a learning rate of 10−3 and a weight decay coefficient of 10−5. A cosine annealing scheduler was used to adjust the learning rate during training, with the maximum number of iterations and the minimum learning rate set to 104 and 10−5, respectively. The mean absolute error between predicted and target properties was used as the loss function. For PES training, relative weights of 1:1:0.1 were applied to the energies, forces and stresses. The maximum number of epochs was set to 1000, and early stopping was applied with a patience of 500 epochs. The gradients for model weight updates were accumulated over 4 batches, and the gradient clipping threshold to prevent gradient explosion was set to 2.0. The input settings for the data loaders are listed in Supplementary Tables S1, S2, and a complete set of hyperparameters for each model and training module is provided in Supplementary Tables S3–S7. For detailed descriptions of all models, interested readers are referred to the respective publications.
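For reference, a minimal PyTorch sketch of the optimizer and scheduler settings listed above is shown below. This is illustrative only; MatGL's Lightning modules configure these internally, and the placeholder module stands in for a GNN.

```python
# Minimal sketch of the training hyperparameters described above (plain PyTorch).
import torch

model = torch.nn.Linear(8, 1)  # placeholder module standing in for a GNN
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5, amsgrad=True)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000, eta_min=1e-5)
```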

Benchmark details

For the dihedral torsion benchmark, the initial structures of ethane and dimethyl-benzamide were relaxed using the FIRE algorithm with the molecular MLIPs under a stricter force threshold of 0.01 eV Å−1. The conformers for scanning the dihedral angles were generated using RDKit108 at 1° intervals, resulting in a total of 359 single-point calculations to produce the PES. For the geometry relaxation benchmarks, the 20,160 initial DFT-relaxed binary crystals were taken from the Materials Project database. All these structures were re-optimized using the FPs with variable-cell geometry relaxation under a looser force threshold of 0.05 eV Å−1. The default settings for CrystalNN were employed to measure the similarity between the DFT- and MLIP-relaxed structures based on the fingerprints of their local environments. It should be noted that two structures failed during relaxation with CHGNet due to the failed construction of bond graphs caused by unphysical configurations. To benchmark the Voigt-Reuss-Hill bulk modulus and heat capacity, a total of 4653 and 1183 binary crystals with available Voigt-Reuss-Hill bulk modulus and heat capacity data were obtained from the Materials Project and PhononDB, respectively. Additional filters were applied to remove unconverged DFT calculations and unphysical bulk moduli, and the remaining 3576 structures were analyzed. For the heat capacity, 1183 binary crystals were compared. All predicted properties derived from the MLIPs were calculated using ElasticityCalc and PhononCalc from the MatCalc library. The default settings were used, except for a stricter force convergence threshold of 0.05 eV Å−1. Notably, all phonon calculations completed successfully with a looser symmetry search tolerance of 0.1. The surface energy is defined as

$${\gamma }_{hkl}^{\sigma }=\frac{{E}_{{\rm{slab}}}^{hkl,\sigma }-{E}_{{\rm{bulk}}}^{hkl,\sigma }\cdot {n}_{{\rm{slab}}}}{2{A}_{{\rm{slab}}}},$$
(1)

where Eslab and Ebulk denote the total energy of the slab with an exposed (hkl) plane and termination σ and the corresponding bulk energy per atom, respectively, nslab is the number of atoms in the slab, and Aslab refers to the cross-sectional area of the slab. The fcc Cu and bcc Mo surfaces were included in this benchmark. All surface and bulk structures were obtained from the Materials Project. Only the atomic positions were relaxed using the MLIPs with a force threshold of 0.01 eV Å−1, and the relaxed structures were then used to calculate the energies. The vibrational entropy and phonon dispersion were calculated using MatCalc interfaced with Phonopy. Silicon in the diamond structure (Materials Project ID: mp-149) and gallium oxide (Materials Project ID: mp-1243) were selected as test systems. The initial structures were relaxed using a stricter force convergence threshold of 0.0001 eV Å−1. Atomic displacements of 0.01 Å were applied to compute the force constants. A 20 × 20 × 20 mesh was used for the phonon calculations, with all other settings kept at their default values in MatCalc. Finally, an amorphous structure of Li3PS4 was generated using a melt-and-quench protocol, following the methodology outlined in ref. 90. The initial structure is a 4 × 4 × 3 supercell of β-Li3PS4, consisting of 1152 atoms in total. The system was equilibrated at 1500 K for 100 ps and subsequently quenched to 300 K at a cooling rate of 2.5 K/ps under the NPT ensemble. A subsequent 500 ps production run was conducted to compute the radial distribution function.
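As a worked illustration of Eq. (1), the short function below evaluates the surface energy from the slab and bulk energies, with the standard unit conversion from eV Å−2 to J m−2.

```python
# Sketch of Eq. (1): surface energy from the slab and bulk energies.
EV_PER_A2_TO_J_PER_M2 = 16.0218  # 1 eV/Å^2 = 16.0218 J/m^2


def surface_energy(e_slab: float, e_bulk_per_atom: float, n_slab: int, area: float) -> float:
    """Return the surface energy in J/m^2.

    e_slab: total energy of the slab (eV); e_bulk_per_atom: bulk energy per atom (eV);
    n_slab: number of atoms in the slab; area: cross-sectional slab area (Å^2).
    """
    return (e_slab - e_bulk_per_atom * n_slab) / (2.0 * area) * EV_PER_A2_TO_J_PER_M2
```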

Dataset details

All datasets except ANI-1x were randomly split into training, validation and test sets in a ratio of 0.9, 0.05 and 0.05, respectively. Due to the large size of the ANI-1x dataset, only a subset was used for demonstration purposes. We randomly sampled the conformations of each molecule with ratios of 0.2, 0.05, and 0.05 for training, validation and testing. For molecules with fewer than 10 conformations, all conformations were included in the training split to ensure that every molecule in the ANI-1x dataset is represented in the training set. The datasets are described as follows. The QM9 dataset consists of 130,831 organic molecules containing H, C, N, O and F. It is a subset of the GDB-17 database109, and the isotropic polarizability, free energy and the gap between the HOMO and LUMO were calculated using DFT at the B3LYP/6-31G level. The Matbench dataset consists of 132,752 and 10,987 crystals for formation energy and bulk/shear modulus computed with DFT, respectively. All datasets were generated using the Materials Project API on 4/12/2019. The details can be found in ref. 74. The ANI-1x dataset extends the ANI-1 dataset78 through active learning based on three different sampling strategies: molecular dynamics, normal mode and torsion sampling. All energies and forces of the conformers were calculated using DFT at the ωB97x/6-31G level. The MPF-2021.2.8 dataset consists of 185,877 configurations sampled manually from the relaxation trajectories of 60,000 crystals from the Materials Project. Additionally, isolated atoms of 89 different elements were included in the training set. Finally, the MatPES-PBE-v2025.1 dataset consists of 434,712 structures, providing comprehensive coverage of 89 elements. These structures were sampled from 281 million snapshots generated by high-throughput MD simulations at 300 K, conducted on both unit cells and supercells. A two-step DIRECT sampling approach was developed to ensure robust coverage of the configuration space. Interested readers are referred to ref. 79 for more details.