Introduction

Crystal materials design, transitioning from traditional trial-and-error approaches toward rational, structure- and data-driven strategies, has become imperative for fields such as condensed matter physics, energy storage, and mechanical engineering1,2,3. Advanced computational methods provide unprecedented atomistic perspectives to leverage crystal structures for achieving performance breakthroughs4,5. Owing to rapid advancements in artificial intelligence, machine learning methods have emerged as a pivotal engine, enabling forward structure-property relationship establishment through quantum chemistry (QC), density functional theory (DFT), and molecular dynamics (MD) to elucidate intrinsic processes6,7, while concurrently facilitating reverse functional design via machine learning-driven crystal structure prediction (CSP) approaches, such as graph neural networks and diffusion models8,9. The success of AI relies on a data infrastructure that balances high precision, extensive coverage, and low redundancy to ensure robust generalisation10. Shortfalls in coverage or distribution alignment can amplify extrapolation errors, erode simulation stability, and ultimately undermine predictive reliability across diverse atomic configurations11. In crystal systems, the combinatorial explosion of high-dimensional configurational possibilities, arising from sparse sampling, complex symmetry constraints, and intricate interatomic interactions, leads to a data bottleneck that cannot be resolved by simply enlarging datasets or increasing model complexity12,13, thereby hindering the generalisation and predictive accuracy of crystal AI.

Given these limitations, recent efforts have focused on efficient sampling methodologies to overcome the critical data bottleneck. Topology-guided sampling leverages persistent homology to systematically navigate atomic configurations across diverse material morphologies, enabling the efficient identification of active phases in systems like Pd hydrides and Pt clusters14. Symmetry-principle-guided evolutionary algorithms integrate group and graph theory to extract structural features and generate high-quality initial structures, proving highly effective in complex systems such as phosphorus allotropes and diamond-silicon surfaces15. Graph-deep-learning-based techniques rapidly target low-energy regions on the potential energy surface through innovative potential energy surface slicing, achieving remarkable accuracy with minimal computational samples, as demonstrated in studies of boron allotropes and CuIn5Se816. However, reliance on symmetry-idealised configurations or neglect of thermal fluctuations and metastable disordered phases limits the capacity to explore vast polymorph conformational spaces17, while the inherent methodological divergence between minimum potential energy-driven crystal structure prediction and thermodynamics configuration-based structure-property relationship establishment poses significant challenges in constructing a unified crystal design architecture.

In this work, we present PolymorphGen‑MLPKD (Polymorph Generator and Machine Learning Potential Knowledge Distillation), an integrated framework driven by entropy‑symmetry landscapes for topological analysis, targeted generation, and structure‑property relationship establishment of polymorphs. The entropy‑symmetry landscape, a projection onto the instantaneous pair entropy (sS) and the sixth‑order Steinhardt symmetry parameter (Q6), offers an alternative to the lossy compression inherent in conventional dimensionality‑reduction methods such as PCA, UMAP, and t‑SNE applied to high‑dimensional descriptors for topological visualization18,19,20. PolymorphGen resolves phase behaviours across diverse crystal systems, efficiently generates paracrystalline diamond structures, and constructs a continuous distribution of intermediate states along the graphite-diamond transformation, identifying a pathway with a lower energy barrier than that obtained from nudged elastic band (NEB) calculations. The framework incorporates Auto‑DFT scheduling, multi‑resolution classification, and MLPKD to transfer cross‑scale accuracy from density functional theory (DFT) through the message passing neural network (MPNN) to the deep neural network (DNN), achieving a 10^6‑fold speed enhancement through knowledge distillation. The resulting DNN model trained via knowledge distillation (DNN-KD) accurately captures phonon dispersion, thermodynamic, defect, and elastic properties. Two scaling laws for machine learning potential accuracy are revealed: performance saturates beyond a threshold data density under uniform entropy‑symmetry coverage, yet degrades systematically when coverage is uneven. The DNN‑KD models successfully compare twinning and dislocation‑mediated plasticity in FCC metals and further resolve stress‑induced BCC cluster formation, FCC‑BCC‑HCP phase transitions, and twinning routes in brittle iridium.
This work may be of interest to researchers across AI for science, materials science, physics, machine learning, and computational methods, and we anticipate that PolymorphGen‑MLPKD will serve as a guideline for thermodynamic polymorph generation along with MLP data preparation, training, and testing.

Results

Design of PolymorphGen-MLPKD

Establishing a unified crystal AI requires a comprehensive and effective evaluation of structural information. To this end, we propose an entropy-symmetry landscape for polymorphs, which integrates global and local order parameters to quantitatively assess crystal similarity and enable universal topological mapping. For global order, we employ the instantaneous pair entropy (sS), a parameter derived from liquid state theory that expands the excess entropy per atom into an infinite series of multiparticle correlation functions and is widely used in atomistic simulations to drive crystallization processes21,22. Local order is characterized by the sixth-order Steinhardt symmetry parameter (Q6), which quantifies the degree of order within an atom’s first coordination shell and captures short-range orientational correlations among nearest neighbours23. The instantaneous pair entropy sS is derived from the two‑body excess entropy and quantifies radial disorder by integrating the mollified radial distribution function. It is sensitive to changes in density correlations (particularly under solid-liquid coexistence) but is relatively insensitive to angular symmetry. By contrast, the sixth‑order Steinhardt parameter Q6 captures orientational correlations through a spherical harmonic expansion of bond vectors within the first coordination shell. Q6 distinguishes crystal structures, identifies solid and liquid atoms, and detects defects, yet it is largely insensitive to bond‑length variations. Together, sS and Q6 probe orthogonal dimensions of structural order: sS reflects thermodynamic (radial) features, while Q6 encodes crystallographic (angular) information. Their combination therefore provides a complementary, information‑preserving projection of the high‑dimensional configurational space onto a two‑dimensional plane, as illustrated in Fig. 1a. 
To validate the sensitivity of the entropy-symmetry landscape to lattice perturbations, we applied both normal (εxx, εyy, εzz) and shear (εxy, εyz, εzx) strains to an HCP structure with five independent elastic constants. The landscape exhibited symmetric distribution patterns, allowing interpretable analysis of perturbation type, directionality, and magnitude, as detailed in Supplementary Fig. 1. Furthermore, by introducing random atomic displacements into the NaCl structure, we demonstrated that the entropy-symmetry landscape effectively identifies the degree of atomic perturbation, as visualized in Supplementary Fig. 2. To further illustrate that sS and Q6 can resolve structural information with energy-level resolution even without prior knowledge, we performed a melting MD simulation of silicon and tracked the temporal evolution of energy, sS, Q6, and the simple cubic crystal structure parameter. As shown in Supplementary Fig. 3, while the simple cubic parameter fails to quantify the disordered liquid region after melting, both sS and Q6 evolve smoothly and remain strongly correlated with energy throughout the entire trajectory. This confirms that, during continuous structural evolution such as melting, the entropy-symmetry landscape captures progressive disordering with a resolution comparable to energy-based descriptions, relying solely on configurational geometry without requiring energy calculations or prior knowledge of the phases involved.
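As a rough illustration of the two order parameters described above, the following Python sketch evaluates a Steinhardt Q6 value from the bond vectors of one atom's first coordination shell and a two-body excess entropy from a tabulated radial distribution function. It is not the implementation used in this work. The function names are hypothetical, Q6 is computed through the Legendre addition theorem (equivalent to the usual sum over spherical harmonics), and the entropy integral assumes a precomputed g(r) on a radial grid, a number density rho, and units of k_B per atom; the mollification of g(r) is left to the caller.

```python
import numpy as np
from numpy.polynomial import legendre


def steinhardt_q(bonds, l=6):
    """Rotationally invariant Steinhardt parameter Q_l for one atom.

    bonds: (N, 3) array of vectors from the atom to its first-shell neighbours.
    Uses Q_l^2 = (1/N^2) * sum_ij P_l(cos gamma_ij), which follows from the
    spherical-harmonic addition theorem.
    """
    b = np.asarray(bonds, dtype=float)
    u = b / np.linalg.norm(b, axis=1, keepdims=True)   # unit bond vectors
    cos_gamma = np.clip(u @ u.T, -1.0, 1.0)            # all pairwise angles
    coeffs = np.zeros(l + 1)
    coeffs[l] = 1.0                                     # select P_l only
    q2 = legendre.legval(cos_gamma, coeffs).mean()
    return float(np.sqrt(max(q2, 0.0)))


def pair_entropy(r, g, rho):
    """Two-body excess entropy in units of k_B per atom.

    s = -2 * pi * rho * integral of [g ln g - g + 1] r^2 dr,
    evaluated with the trapezoidal rule on the supplied grid.
    """
    r = np.asarray(r, dtype=float)
    g = np.asarray(g, dtype=float)
    safe_g = np.where(g > 0.0, g, 1.0)                  # g ln g -> 0 as g -> 0
    integrand = (g * np.log(safe_g) - g + 1.0) * r**2
    integral = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(r))
    return -2.0 * np.pi * rho * integral
```

For the twelve nearest-neighbour bonds of an ideal FCC lattice this Q6 evaluates to about 0.5745, and for the six bonds of a simple cubic lattice to about 0.3536, while a structureless g(r) = 1 gives zero pair entropy. A point on the sS-Q6 plane would then be obtained by averaging such per-atom values over a configuration.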

Fig. 1: The integrated framework navigating polymorph generation and distilled potential development guided by entropy-symmetry landscapes.

a Schematic diagram of the entropy-symmetry landscape, using instantaneous pair entropy (sS) and the sixth-order Steinhardt parameter (Q6) as global and local order parameters, respectively, for topological analysis of polymorphs. Using snapshots of silicon melting from cubic diamond (CD) to liquid as an example, the progression through CD → CD + the first neighbour CD (1st CD) → CD + 1st CD + the second neighbour CD (2nd CD) → CD + 1st CD + 2nd CD + liquid → liquid is visualized based on energy states. Dark red shades indicate low-energy crystalline phases, while light red shades represent high-energy disordered states. Inset atomic structures are coloured according to diamond‑structure type: blue for cubic diamond, cyan for cubic diamond (1st neighbour), light green for cubic diamond (2nd neighbour), orange for hexagonal diamond, yellow for hexagonal diamond (1st neighbour), lime green for hexagonal diamond (2nd neighbour), and light grey for other. b Schematic diagram of PolymorphGen, which enables efficient construction of configuration libraries on sS-Q6 landscapes through polymorph generation inspired by genetic mutation. PolymorphGen bypasses the reliance on time-consecutive integration for structure generation, moving from the conventional structure-temperature-pressure (S-T-P) ab initio molecular dynamics (AIMD) framework to the entropy-symmetry-displacement-volume-shape (sS-Q6-D-V-S) framework. c Schematic diagram of the machine learning potential knowledge distillation (MLPKD) framework, which enables multi-resolution classification and, when combined with the PolymorphGen configuration library, facilitates the transfer of cross-scale accuracy from density functional theory (DFT) through message passing neural network (MPNN) to deep neural network (DNN), thereby achieving a 10^6-fold speed enhancement. The Auto-DFT scheduling platform is detailed in Supplementary Fig. 4. The number of standard thermodynamics configurations after multi-resolution classification ranges from 10^1 to 10^3, while the number of supplementary configurations ranges from 10^2 to 10^4. d Comparison between previous active-learning-based structure-property relationship frameworks and our integrated PolymorphGen-MLPKD framework, visually demonstrating the non-iterative nature of our streamlined approach that overcomes the accuracy-efficiency limitations. n, number of iterations; TPre, initial data preparation time; TMLP, machine learning potential training time; TExp, configuration exploration time; TDFT, DFT computation time; TPG, PolymorphGen processing time; TMPNN, MPNN model training time; TDNN-KD, DNN model trained via knowledge distillation (DNN-KD) training time. Source data are provided as a Source Data file.

By integrating the entropy-symmetry landscape with genetic mutation-inspired sampling, we developed PolymorphGen to enable targeted polymorph generation within the sS-Q6 plane, thereby overcoming limitations of existing CSP methods, including dataset dependency, neglect of thermal fluctuations, and challenges in exploring metastable disordered materials. PolymorphGen operates in three phases (Fig. 1b): (i) Topological mapping of entropy (sS) and symmetry (Q6) parameters of crystal structures onto a two-dimensional plane provides insight into the phase distribution of input structures; (ii) Treating sS and Q6 as chromosomal elements and atomic positional relationships as the DNA sequence, we employ atomic displacement (D), cell volume (V), and cell shape (S) as mutation gene segments to regulate genetic variation, as illustrated in Supplementary Movie 1; (iii) Using existing configurations as parent structures, new structures are directionally generated on the entropy-symmetry plane through mutation operations on D, V, and S, integrated with a roulette wheel iteration process. Whereas previous studies relied on metadynamics methods that bias configurational energy, we deliberately introduced entropy-symmetry topological mapping as the core driver of iterative configurational evolution. PolymorphGen provides a comprehensive perspective to orchestrate an efficient genetic algorithm, directly mutating configurations to avoid local optima traps and the inherent error accumulation of first-principles calculations. The genetic mutation parameters D, V, and S, which enable atomic displacement and cell shape variation, constitute the core of the Parrinello-Rahman method and are essential for studying crystal-related processes22,24. By transforming the traditional structure-temperature-pressure (S-T-P) paradigm of ab initio molecular dynamics (AIMD) into an sS-Q6-D-V-S framework (Fig. 1b, highlighted region), PolymorphGen incorporates thermodynamic influences into CSP in a manner not realised by conventional methods, yielding a polymorph configuration library that encompasses stable crystals, metastable intermediates, and thermodynamically disordered states. Critically, unlike existing disordered-structure sampling approaches25, PolymorphGen fundamentally circumvents the need for prohibitively expensive quantum computations and iterative active-learning cycles. This methodological development establishes our strategy as a computationally efficient and scalable alternative for exploring complex phase landscapes.
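The mutation and selection steps in phases (ii) and (iii) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PolymorphGen code: the function names, the Gaussian and uniform perturbation distributions, the default magnitudes, and in particular the roulette-wheel weights (here inversely proportional to how densely a parent's neighbourhood on the sS-Q6 plane is already populated) are all hypothetical choices made for the example.

```python
import numpy as np


def mutate(frac, cell, rng, d_sigma=0.05, v_range=0.05, s_sigma=0.02):
    """Apply one D-V-S mutation to a periodic configuration.

    frac: (N, 3) fractional coordinates; cell: (3, 3) lattice vectors as rows.
    Returns the mutated (frac, cell) pair.
    """
    # S: deform the cell shape with a small random symmetric strain
    eps = rng.normal(0.0, s_sigma, size=(3, 3))
    eps = 0.5 * (eps + eps.T)
    new_cell = cell @ (np.eye(3) + eps)
    # V: rescale isotropically to a randomly perturbed target volume
    target = abs(np.linalg.det(cell)) * (1.0 + rng.uniform(-v_range, v_range))
    new_cell *= (target / abs(np.linalg.det(new_cell))) ** (1.0 / 3.0)
    # D: displace atoms in Cartesian space, then wrap back into the cell
    cart = frac @ new_cell + rng.normal(0.0, d_sigma, size=frac.shape)
    new_frac = (cart @ np.linalg.inv(new_cell)) % 1.0
    return new_frac, new_cell


def roulette_select(descriptors, rng, bins=10):
    """Pick a parent index by roulette wheel on the (sS, Q6) plane.

    descriptors: (M, 2) array of (sS, Q6) values of the current library.
    Assumption: sparsely populated regions receive larger weights, so that
    new structures are steered toward gaps in the landscape.
    """
    d = np.asarray(descriptors, dtype=float)
    hist, xe, ye = np.histogram2d(d[:, 0], d[:, 1], bins=bins)
    ix = np.clip(np.digitize(d[:, 0], xe) - 1, 0, bins - 1)
    iy = np.clip(np.digitize(d[:, 1], ye) - 1, 0, bins - 1)
    weights = 1.0 / np.maximum(hist[ix, iy], 1.0)
    return int(rng.choice(len(d), p=weights / weights.sum()))
```

In such a loop, a parent chosen by `roulette_select` would be passed to `mutate`, the sS and Q6 values of the child evaluated, and the child added to the library; no energy evaluation is involved at any point, consistent with the description above.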

The configuration library generated by PolymorphGen, which covers thermodynamic phase distributions, provides a data foundation for directly establishing structure-property relationships using CSP methods. Inspired by knowledge distillation techniques from large-scale models, the MLPKD framework offers an architectural basis for achieving efficient and high-quality cross-scale accuracy transfer through the entropy-symmetry landscape. The MLPKD framework operates in three phases (Fig. 1c): (i) Configurations uniformly distributed at low density across the full entropy-symmetry landscape are selected as standard thermodynamics configurations. A high-accuracy reference dataset is obtained using DFT combined with the Auto-DFT scheduling platform (Supplementary Fig. 4). (ii) The DFT dataset is used to train a highly generalizable yet costly MPNN model. Multi-resolution classification generates supplementary thermodynamics configurations based on specific properties, and the MPNN model is used to produce a high-efficiency workbook dataset. (iii) A hybrid-fidelity dataset formed by combining the DFT and MPNN datasets is employed to train a low-cost DNN-KD model suitable for large-scale atomic system simulations. The PolymorphGen-MLPKD framework enables efficient and high-quality knowledge distillation of cross-scale accuracy from complex to simple models, achieving a 10^6-fold speed enhancement. As shown in Fig. 1d, compared to conventional active learning frameworks26, our framework bridges the critical gap between structure generation and structure-property relationship establishment in crystal AI. Its unidirectional architecture avoids multiple cycles of DFT and MLP computations, overcoming accuracy-efficiency limitations by streamlining the entire workflow from polymorph analysis and data preparation to training and testing.
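The data flow of the three MLPKD phases can be mimicked with a deliberately small toy problem. In the sketch below, everything is a stand-in chosen for illustration: a one-dimensional sine curve plays the role of the DFT energy, a degree-7 polynomial plays the role of the MPNN teacher, a degree-5 polynomial plays the role of the DNN-KD student, and the weighting of reference versus teacher labels is an arbitrary assumption. The example only shows how a hybrid-fidelity dataset is assembled and used; it says nothing about the speed or accuracy of real potentials.

```python
import numpy as np
from numpy.polynomial import Polynomial


def reference_energy(x):
    """Hypothetical stand-in for an expensive reference (DFT) label."""
    return np.sin(x)


def distill(n_ref=12, n_sup=200, teacher_deg=7, student_deg=5, ref_weight=5.0):
    """Toy three-phase distillation; returns (teacher, student) models."""
    # (i) sparse, uniformly spread standard configurations with reference labels
    x_ref = np.linspace(0.0, 3.0, n_ref)
    y_ref = reference_energy(x_ref)
    # (ii) train the accurate teacher, then let it label a denser
    #      set of supplementary configurations (the "workbook" dataset)
    teacher = Polynomial.fit(x_ref, y_ref, teacher_deg)
    x_sup = np.linspace(0.0, 3.0, n_sup)
    y_sup = teacher(x_sup)
    # (iii) train the cheaper student on the hybrid-fidelity dataset,
    #       giving reference labels a larger weight than teacher labels
    x_all = np.concatenate([x_ref, x_sup])
    y_all = np.concatenate([y_ref, y_sup])
    w = np.concatenate([np.full(n_ref, ref_weight), np.ones(n_sup)])
    student = Polynomial.fit(x_all, y_all, student_deg, w=w)
    return teacher, student
```

The point of the construction is that the student never needs more reference labels than the teacher did: the additional training signal comes entirely from teacher predictions on configurations that were cheap to generate.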

PolymorphGen used for CSP

To demonstrate the broad applicability of our entropy-symmetry topological mapping method, we applied it to transform high-dimensional structures from diverse processes27,28,29,30,31, including multithermal-multibaric simulations of MgSiO3 perovskite (Fig. 2a), solid-solid phase transitions in Ti3O5 (Fig. 2b), ice nucleation (Supplementary Fig. 5a), structural reconfigurations in AgI solid-state electrolytes (Supplementary Fig. 5b), and solidification of GeTe nanoparticles (Supplementary Fig. 5c), into interpretable two-dimensional representations. The selection of phases for annotation was guided by a combination of point distribution patterns and energy correspondence. The entropy-symmetry landscape alone does not automatically delineate phase boundaries; rather, it provides a low-dimensional representation in which structural similarity is reflected by point proximity, enabling interpretable visualization that, when combined with available energy or phase-label information, facilitates the identification of distinct phases and transition pathways. Unlike conventional dimensionality reduction techniques27,28, our approach not only accurately discriminates distinct phase structures across all systems, but also reveals smooth continuous distributions of intermediate states and well-defined phase transition pathways, using structural information alone without requiring precise first-principles computations. We note that the entropy-symmetry landscape does not automatically partition phase space into discrete clusters, but rather provides a continuous similarity metric in which distances reflect structural differences. This continuity is physically meaningful for datasets obtained from multithermal-multibaric sampling, where intermediate states naturally form a continuum. Where gaps appear in the landscape, they indicate under-sampled metastable regions, which are precisely the targets for prediction by PolymorphGen.
Particularly compelling is the topological mapping of the Ti3O5 solid-solid phase transition dataset (Fig. 2b; Supplementary Fig. 5d), which captures not only the β-λ transition but also transition states TS1 and TS2, the high-temperature α phase, and metadynamics sampling configurations. This coherent behaviour underscores that our structure-based entropy-symmetry mapping is interpretable and broadly applicable, rather than being a lossy or non-unique dimensionality reduction method.

Fig. 2: PolymorphGen unveils dynamic phase transitions across crystal systems.

a Snapshots of MgSiO3 configurations27 from multithermal multibaric simulations, including perovskite, defective perovskite, liquid-crystal interface, and liquid structures, are visualized based on energy states, with dark hues indicating low-energy crystalline phases and light hues indicating high-energy disordered states. Inset atomic structures are coloured by element: orange‑yellow for Mg, blue for Si, and red for O. b Snapshots of the solid-solid phase transition in Ti3O528, encompassing β, λ, and α crystalline phases, are visualized based on energy states, with dark hues indicating low-energy crystalline phases and light hues indicating high-energy disordered states, while ellipses highlight the regions of transition states TS1 and TS2, as well as configurations sampled by metadynamics. Ellipses are for visual guidance only, highlighting regions identified with additional energy information and do not represent automatic clustering from the entropy-symmetry landscape. Inset atomic structures are coloured by element: blue for Ti and red for O. c PolymorphGen generated and experimental32 structure factor S(Q) of paracrystalline diamond. Inset: The crystal and entropy view of the paracrystalline diamond model generated by PolymorphGen. HD, hexagonal diamond; CD, cubic diamond. d Comparison of the relative frequency distribution of the number of atoms in paracrystalline diamond structures generated by PolymorphGen and MD. e Snapshots of graphite-to-diamond configurations generated by PolymorphGen, visualized based on energy states. The distribution reveals distinct regions corresponding to the graphite (Gr) phase, other phase, CD phase, Gr-other transition phase, and Gr-CD transition phase. Inset atomic structures are coloured by structural type: purple for graphene, dark grey for other, blue for cubic diamond, and cyan for cubic diamond (1st neighbour). 
f Energy profile of graphite-to-diamond transformation pathways from PolymorphGen compared to the nudged elastic band (NEB) method. Inset: Typical structures of Path I and Path II. Source data are provided as a Source Data file.

The advancement of PolymorphGen in CSP lies in its ability to recognise and effectively inherit the entropy-symmetry characteristics of existing structures. Through physically-informed genetic mutation perturbations and constraints, it directly generates configurations capable of crossing phase transition barriers without relying on computed parameters such as energy. Taking the generation of paracrystalline diamond structures as an example, the static structure factors of the numerous paracrystalline configurations generated by PolymorphGen show strong agreement with experimental measurements32 (Fig. 2c), demonstrating the method’s reliability. Compared to traditional molecular dynamics (MD), PolymorphGen produces a significantly broader variety of paracrystalline diamond types (Fig. 2d). In addition to conventional MD sampling, comparisons with active learning approaches and metadynamics enhanced sampling methods for silicon crystallization (Supplementary Fig. 6) reveal distinct sampling characteristics. Projection of these sampling trajectories onto the entropy-symmetry landscape shows that active learning misses many metastable states despite multi-temperature sampling, while metadynamics at 1700 K explores regions distinct from standard MD at the same temperature (Supplementary Fig. 6a). As shown in Supplementary Fig. 6b, the configurational ensemble generated by PolymorphGen covers a substantially broader region of the entropy‑symmetry landscape and spans a wider energy range than metadynamics sampling. This increased diversity, together with its thermodynamic representativeness, arises from mutation‑based exploration across the two‑dimensional landscape rather than biased sampling along collective variables. 
Furthermore, a direct quantitative comparison demonstrates that PolymorphGen reduces the CPU time per configuration by approximately two to three orders of magnitude relative to Gamma‑point DFT (representative of metadynamics) and multiple k‑point DFT (representative of active learning), as detailed in Supplementary Fig. 7. Unlike conventional methods that couple exploration with energy evaluation, PolymorphGen decouples these two stages: configurational exploration proceeds without any energy calculation, enabling batch generation of thousands of candidate structures, while the subsequent DFT evaluation is maximally parallelised via our Auto-DFT scheduling platform. This decoupling is the key to achieving both high efficiency and the ability to discover globally low-energy configurations that lie far from linearly interpolated paths. Furthermore, PolymorphGen enables exploration of phase transition pathways from point to plane: using only five initial structures33 as parent configurations, it successfully generated a series of configurations depicting the continuous transition from graphite to diamond (Fig. 2e). Energy analysis of these configurations using our Auto-DFT scheduling platform (detailed in Supplementary Fig. 4) revealed smooth and uniform energy variations, encompassing low-energy structures such as graphite and cubic diamond, medium-energy partially crystalline intermediates including graphite-cubic diamond and graphite-other hybrids, and high-energy transition structures. The key transition structures near the cubic diamond phase, such as graphite-cubic diamond-graphite, align with existing studies33,34, supporting the validity of the generated structures. Notably, the structural reconstruction from graphite to other phases, which were absent in the initial configurations, demonstrates PolymorphGen’s effectiveness in incorporating thermodynamic influences to generate novel structures. 
We emphasise that PolymorphGen is designed to generate thermodynamically continuous configuration ensembles for MLP training, rather than to directly predict a unique transition pathway. The large number of generated configurations reflects the intrinsic complexity of the configurational space and serves as a rich data resource for training generalizable machine learning potentials. Two phase transition pathways generated by PolymorphGen are presented in Fig. 2f: Path I follows the left edge of the landscape, corresponding to the largest energy change between graphite and diamond, while Path II traces the right edge, representing the smallest energy change. Detailed transition structures for both pathways are provided in Supplementary Fig. 8. The significant energy differences between them originate from volume changes and atomic rearrangements, mirroring the effects of temperature and pressure in real phase transitions. Notably, Path II exhibits a lower energy barrier than the NEB path33 computed from the same endpoints (Fig. 2f). This demonstrates that mutation-based exploration across the two-dimensional landscape can uncover lower-energy transformation routes missed by NEB calculations initialised from linear interpolation, the standard initial guess in conventional NEB implementations. The decoupled exploration-evaluation paradigm underlying PolymorphGen enables such discoveries by freeing configurational generation from the local constraints inherent in energy-coupled sampling. This advances CSP research from one‑dimensional pathway exploration to two‑dimensional landscape mapping.

PolymorphGen-MLPKD for structure-property relationship establishment

A distinguishing feature of the PolymorphGen-MLPKD framework in establishing structure-property relationships is its expanded exploration of configurations beyond existing methods. We compared the distribution of configurations from traditional AIMD and those added by our method on the entropy-symmetry landscape (Fig. 3a), visually revealing expanded sampling of metastable and thermodynamically disordered states. To ensure a rigorous and unbiased assessment, we constructed two distinct datasets: one comprising AIMD configurations and the other consisting of PolymorphGen structures. Identical MPNN model architectures were trained on each dataset. The test set included configurations strictly excluded from both training sets, evaluated using the Auto-DFT scheduling platform to provide reference data. As quantified in Fig. 3b, the MPNN model trained on our framework demonstrably outperformed its AIMD‑based counterpart, achieving lower root‑mean‑square errors for energy, force and virial predictions, as well as smaller errors in elastic constants (Supplementary Fig. 9). A more stringent test under large compression, summarised in Supplementary Table 1, further illustrates this improved generalisation: the model trained exclusively on AIMD configurations produced unrealistically large energies at reduced volumes that would lead to computational instability, whereas the PolymorphGen‑trained model closely followed the DFT reference with consistently small errors. The insufficiency of merely broad thermodynamic coverage is further evidenced by a comparison with DP‑GEN, an active learning method. Despite sampling an extensive range of 0–15,500 K and 0–500 GPa and using 983,941 configurations (Supplementary Fig. 10), the DNN model trained on this dataset failed to capture the correct mechanical behaviour of iridium, predicting C12 > C44 in contradiction with experiments35 and the known brittle character of Ir. 
In contrast, our PolymorphGen‑trained model accurately reproduced the elastic constants and brittle nature. This demonstrates that broad coverage alone is not sufficient; the quality and uniformity of the sampled configurations are equally critical. PolymorphGen‑MLPKD thus provides a comprehensive perspective for MLP research by covering stable, metastable and disordered states, avoiding the limitations of prior knowledge gaps, and enabling fair training and testing on a unified scale. As shown in the test results in Fig. 3c, the equivariant MPNN-based MACE36 model achieved the highest accuracy. We validated PolymorphGen-MLPKD on metals with FCC, BCC, and HCP structures, using dataset sizes of 3417 for Al, 3259 for Ir, 4053 for Mo, and 2202 for Zr. To evaluate the accuracy and efficiency, we compared its predictions with experimental and DFT data across multiple domains, including vacancy formation energy (Fig. 3d), elastic properties (Fig. 3e, Supplementary Table 2), thermodynamic properties (Supplementary Fig. 11a, c), transformation kinetics via Bain deformation paths (Supplementary Fig. 11b), dynamic stability from phonon dispersion (Supplementary Fig. 11d), plastic deformation mechanisms (Supplementary Fig. 11e), and twinning mechanisms (Supplementary Fig. 11f). Furthermore, analyses of the generalized stacking fault energy (GSFE) and generalized planar fault energy (GPFE) curves confirmed that twinning is its dominant deformation mechanism (Supplementary Fig. 11f).

Fig. 3: Integrated PolymorphGen-MLPKD framework enabling efficient transfer of cross-scale accuracy.

a Snapshots of the configuration distribution within Ir’s training dataset compare structures derived from ab initio molecular dynamics (AIMD) with those added by PolymorphGen. b Comparison of the root mean squared errors (RMSE) for energy, force, and virial on configurations strictly excluded from both training sets, between machine learning potentials (MLPs) trained on AIMD-derived and PolymorphGen-derived configurations. c Top: Distribution of datasets on the entropy-symmetry landscape. Polymorph datasets covering stable, metastable, and disordered states enable fair training and testing of MLPs. Bottom: RMSEs of the NequIP, DPA, DPMD, and MACE models compared with density functional theory (DFT) calculations. Both training and testing of the models follow this concept. d Vacancy formation energy of Al54, Ir54, Mo55, and Zr56 calculated by message passing neural network (MPNN) models, compared with DFT calculations. e Elastic constants (C11, C12, C44, C13, C33) and bulk modulus (B), shear modulus (G), and Young’s modulus (E) of Al57, Ir58,59,60, Mo61, and Zr62,63 calculated using MPNN models compared with experimental data. Inset: Enlarged view of the selected region. f The force prediction RMSEs for different numbers of training configurations demonstrate model loss convergence. Inset: Radial distribution functions of liquid-Ir computed using MPNN trained by 77 and 3259 configurations, compared with DFT results. g Impact of configurational coverage bias on MLP prediction performance. Incomplete sampling by the biased data (inset) manifests as an under-sampled high-error region in the energy prediction error distribution (centre), which is quantified by consistently higher RMSEs across energy, force, and virial predictions compared to the model trained on uniformly sampled data (around). h The error distributions in energy, force, and virial for DNN, DNN model trained via knowledge distillation (DNN-KD), and MPNN models are compared with DFT calculations. 
The box chart with normal distribution curve depicts the error distribution using five statistics: minimum, lower quartile, median, upper quartile, and maximum. i Phonon dispersion spectra of DFT, DNN, DNN-KD, and MPNN models. Source data are provided as a Source Data file.

The multi‑resolution classification strategy within PolymorphGen‑MLPKD offers a distinct advantage over opaque data filtering techniques prevalent in the field by enabling interpretable assessment of data redundancy37. First, systematic down‑sampling, illustrated in Fig. 3f and Supplementary Fig. 12, reveals a clear convergence behaviour: once a threshold of approximately 1400 uniformly distributed configurations is reached, further increasing the dataset size yields only marginal improvements in force predictions. Remarkably, models trained on as few as 77 configurations accurately capture iridium’s liquid structure, unique brittleness and phonon spectra, as shown in Fig. 3f and Supplementary Figs. 13–14. This establishes a data‑efficiency scaling law: for a given material with uniform configurational coverage, model error saturates beyond a relatively small training set size, implying that configurational diversity, not merely quantity, is the primary driver of accuracy. Second, we examined how the uniformity of configurational coverage affects model generalisation, a distinct aspect tied to the topology of the entropy‑symmetry landscape, and uncovered a coverage‑uniformity scaling law. Using a deliberately biased dataset with gaps in thermodynamic coverage (Fig. 3g inset), we found that prediction errors concentrated precisely in the under‑sampled regions (Fig. 3g centre), and overall RMSEs remained consistently higher than those from uniformly sampled data regardless of total dataset size (Fig. 3g around). This demonstrates that predictive performance is critically sensitive to how evenly training configurations populate the entropy‑symmetry landscape; incomplete or uneven sampling leads to systematic extrapolation errors even with large datasets. 
Together, these two complementary scaling laws underscore that the entropy‑symmetry landscape provides a powerful low‑dimensional descriptor for assessing and designing optimal training datasets, enabling both data efficiency and robust generalisation in MLP development.
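
The down-sampling idea can be pictured as grid thinning on the landscape. The sketch below is only an illustration under stated assumptions, not the PolymorphGen-MLPKD implementation: it assumes that classification at a given resolution amounts to keeping one configuration per square cell of the (sS, Q6) plane. The function name `thin_on_landscape`, the `cell` parameter, and the example coordinates are all hypothetical.

```python
from math import floor

def thin_on_landscape(points, cell):
    """Keep the first configuration that falls into each square cell.

    points: list of (entropy, q6) tuples, one per configuration.
    cell:   edge length of the square bins on the landscape (assumed knob).
    Returns the indices of the retained configurations.
    """
    seen = set()
    kept = []
    for idx, (s, q) in enumerate(points):
        key = (floor(s / cell), floor(q / cell))  # integer cell address
        if key not in seen:
            seen.add(key)
            kept.append(idx)
    return kept

# Made-up landscape coordinates: two near-duplicate pairs and one isolated point.
points = [(-3.02, 0.41), (-3.04, 0.43), (-2.15, 0.25), (-2.17, 0.26), (-1.05, 0.05)]
coarse = thin_on_landscape(points, 0.1)    # near-duplicates collapse
fine = thin_on_landscape(points, 0.005)    # every point survives
print(coarse, fine)
```

A coarse cell removes near-duplicate configurations while a fine cell keeps them, so sweeping `cell` gives an interpretable handle on redundancy. Cells that stay empty at a given resolution mark the under-sampled regions discussed above.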

We observed that polymorphs of the same crystal type exhibit highly similar distributions on the entropy-symmetry landscape. This led us to propose that structure-property relationships for isotypic structures can be directly trained by replacing element types within a consistent thermodynamic configuration library. As shown in Supplementary Fig. 15a, this concept was validated using the FCC-Al model trained on FCC-Ir configurations (Supplementary Fig. 15b) and the FCC-IrRe doping model (Supplementary Fig. 15c). Complex models can accurately capture structure-property relationships with small datasets, but their application in large atomic systems is constrained by inherent architectural limitations, such as the exponential increase in computational complexity with the number of atomic neighbours introduced by equivariant message passing in models like MACE. PolymorphGen-MLPKD transfers DFT-level accuracy from MPNN to DNN without changing the model architecture. Compared to DNN models trained directly on DFT datasets, the DNN-KD model, which is distilled from the MPNN teacher, mitigates energy biases typical in small-dataset training. It exhibits smaller, more concentrated errors in force and virial predictions (Fig. 3h), reduces RMSE by approximately 30%, and delivers more accurate predictions of dynamic stability (Fig. 3i).

Brittle metal twinning routes explored by the distilled-potential

We applied PolymorphGen-MLPKD to develop a DNN-KD model for iridium, in which MPNN was used to enhance its phonon scattering properties (Supplementary Fig. 16). The purpose of transferring cross-scale accuracy to the DNN-KD model was to investigate emergent mechanical behaviours, such as dislocations and twinning, that are only observable through large-scale atomic simulations. To demonstrate the universality of our framework across different deformation mechanisms, we extended the same workflow to aluminum, a system known to deform via dislocation-mediated plasticity38, in stark contrast to the twinning-dominated behaviour of iridium. The comparative results are summarised in Fig. 4a, b. Strikingly, while aluminum ultimately undergoes conventional Shockley partial dislocation slip via intrinsic stacking fault (ISF) nucleation and expansion, our simulations reveal a far richer incipient plasticity pathway. Upon ISF nucleation and propagation into stress-concentrated regions, the faulted structure transiently transforms into nanoscale twins of 1–2 atomic layer thickness, accompanied by the concurrent nucleation of additional ISFs at other locations. As strain proceeds, the system undergoes multiple twinning and subsequent detwinning events, eventually evolving into steady-state dislocation slip. This complex, transient twinning-detwinning cascade during the early stage of plasticity in aluminum is resolved in detail through our framework. The full dynamical evolution is provided in Supplementary Movie 2 (iridium) and Supplementary Movie 3 (aluminum) for direct comparison.

Fig. 4: Twinning route of brittle metal iridium explored by the distilled-potential.

a The twinning-dominated deformation pathway in iridium proceeds via: intrinsic stacking fault (ISF) nucleation, twin nucleation, and continuous twinning propagation. b The deformation pathway in aluminum, though ultimately governed by dislocation slip, involves a transient twinning-assisted incipient stage comprising: ISF nucleation, ISF expansion, twin nucleation, multiple twinning, detwinning, and eventual transition to steady-state dislocation slip. c Variation in BCC cluster defect content with strain during nanopillar compression at 300 K, 500 K, and 800 K. d Shockley partial dislocation-mediated twinning nucleation and propagation pathway in iridium micro-pillar compression, illustrating the continuous structural evolution from perfect FCC arrangement to BCC cluster defects, to ISF, to extrinsic stacking fault (ESF), and finally to coherent twin boundary (CTB). Source data are provided as a Source Data file.

Experimental observations confirm that iridium deforms by twinning, but the specific twinning route remains unexplored39. We employed micro-pillar compression MD simulations to directly capture the dynamic formation mechanism of three-layer twins; the micro-pillar model is shown in Supplementary Fig. 17. Within the elastic range of compression, we observed the emergence of BCC cluster defects in iridium (Supplementary Fig. 18). Moreover, the structure transitions through a BCC phase during the stress-induced FCC-to-HCP phase transformation (Supplementary Fig. 19a). The complete FCC-BCC-HCP transition pathway is detailed in Supplementary Fig. 19b and is consistent with compression studies in FCC high-entropy alloys40. The population of these BCC cluster defects is temperature dependent, increasing progressively with temperature as shown in Fig. 4c. In Fig. 4d, three consecutive (111) planes labelled A, B, and C represent the perfect crystalline repeat units in FCC iridium. Shockley dislocation emission transforms the FCC stacking sequence from ABCABCA to ABC’A’BCA containing BCC cluster defects, and further into the stacking fault sequence ABCBCAB. With increasing applied stress, the atomic stacking order evolves from the continuous stacking fault sequence ABCBCAB to ABCBABC, where an FCC layer is sandwiched between two coherent twin boundaries (CTB). This state further develops into a three-layer twin, in which the stacking order changes from ABCBABC (with an FCC layer between two CTBs) to BACBABC (with two atomic layers between two CTBs), ultimately forming a structure with eight atomic layers between two CTBs (Supplementary Fig. 20). This twinning route is identical to the novel 1-3-2 twinning pathway observed via high-resolution TEM (HRTEM) in metals with high intrinsic stacking fault energy41. 
This work demonstrates the applicability and practical utility of PolymorphGen-MLPKD for establishing high-accuracy structure-property relationships through direct CSP-based structure prediction in the study of mechanical behaviour in large-scale atomic systems.
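
The stacking sequences quoted above can be checked with a few lines of Python. This is a minimal sketch based on the standard rule that a close-packed layer is locally HCP-like when the layers directly below and above it share the same position, and FCC-like otherwise. The function names are hypothetical, and the primed layers (C’, A’) that mark BCC cluster defects are outside the scope of this simple rule.

```python
def classify_layers(stacking):
    """Label each interior layer as 'f' (FCC-like) or 'h' (HCP-like).

    The first and last layers are skipped because they have only one
    neighbour inside the window.
    """
    labels = []
    for below, above in zip(stacking, stacking[2:]):
        labels.append("h" if below == above else "f")
    return "".join(labels)

def describe(stacking):
    """Name the planar fault implied by the pattern of HCP-like layers."""
    labels = classify_layers(stacking)
    if "hh" in labels:
        return "intrinsic stacking fault"   # two adjacent HCP-like layers
    if "hfh" in labels:
        return "extrinsic stacking fault"   # two HCP-like layers around one FCC layer
    if "h" in labels:
        return "coherent twin boundary"     # isolated HCP-like layer
    return "perfect FCC"

for seq in ("ABCABCA", "ABCBCAB", "ABCBABC"):
    print(seq, classify_layers(seq), describe(seq))
```

For ABCBABC the two HCP-like layers separated by a single FCC-like layer are the two coherent twin boundaries described in the text, which is the same pattern as an extrinsic stacking fault.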

Discussion

In conclusion, we present a universal framework, PolymorphGen-MLPKD, that bridges the critical gap between crystal structure prediction and structure-property relationship establishment in crystal AI, overcoming long-standing accuracy-efficiency limitations. Central to this framework is the entropy‑symmetry landscape, a two‑dimensional projection of the high‑dimensional configurational space onto coordinates defined by two physically meaningful, structure‑based parameters: the instantaneous pair entropy (sS), which reflects thermodynamic (radial) features, and the sixth‑order Steinhardt symmetry parameter (Q6), which encodes crystallographic (angular) information. Their combination provides a complementary, information‑preserving representation that offers a comprehensive and invariant perspective, unveiling dynamic phase transitions across diverse systems, including perovskites, solid‑solid transformations, nanoparticle solidification, and ice nucleation. The framework’s capability in crystal structure prediction is demonstrated through its improved efficiency and configurational diversity relative to existing sampling methods such as MD, AIMD, active learning, metadynamics, and NEB calculations, as evidenced by explorations of paracrystalline diamond and graphite‑to‑diamond transition pathways. Multi‑resolution screening of the entropy‑symmetry landscape reveals two complementary scaling laws for machine learning potentials: data‑efficiency saturation under uniform coverage and performance degradation arising from coverage bias. The utility of the entropy-symmetry landscape as a structural similarity metric is independently validated by the scaling-law analysis, where coverage uniformity on the landscape directly predicts MLP generalisation error, confirming that sS and Q6 provide a reproducible and objective measure of configurational proximity. 
Our knowledge distillation approach enables cross‑scale accuracy transfer from DFT through MPNN to DNN‑KD with a 106‑fold speed enhancement while maintaining high generalisation capability. The practical utility of PolymorphGen‑MLPKD is confirmed through investigations of distinct plasticity mechanisms, namely dislocation‑mediated slip in aluminum and twinning‑dominated deformation in brittle iridium, demonstrating its applicability across materials with fundamentally different deformation modalities. The core innovation lies in the physics-informed generation and selection of polymorphs, which strategically directs the most representative structures to models of varying complexity, thereby ensuring the accuracy and generalizability of crystal AI at the data level. Our strategy of using fixed standard thermodynamic configurations for isotypic structures effectively controls configuration, one of the key variables along with composition and relative positioning, in MLP training for doped and even high-entropy alloys, substantially mitigating data explosion in structure-property mapping. PolymorphGen-MLPKD will serve as a guideline for thermodynamic polymorph generation and MLP data preparation, training, and testing, laying the critical foundation for the next generation of high-fidelity atomic simulations.

Methods

Entropy-symmetry landscape

We employ sixth-order Steinhardt symmetry parameters (Q6) to quantify the short-range order of the system23. This parameter encodes the bond-orientational order between each central particle and its nearest neighbours into spherical harmonics to sensitively capture local symmetry features, without being limited by the absence of long-range order. For each particle \(i\) and its bond with neighbour \(j\), represented by a vector \({{{{\bf{r}}}}}_{{{{\bf{ij}}}}}\), the local bond-order vector \({q}_{{lm}}\) captures the orientation information of the local environment of particle \(i\). It is defined as the average projection of the directional information of all bonds onto spherical harmonics:

$${q}_{{lm}}\left(i\right)=\frac{1}{{N}_{b}\left(i\right)}\mathop{\sum }_{j=1}^{{N}_{b}\left(i\right)}\,{Y}_{{lm}}\left({\theta }_{{ij}},{\phi }_{{ij}}\right),$$
(1)

where \({Y}_{{lm}}\) are the standard spherical harmonics, which describe patterns of angular distribution in three-dimensional space and are determined by two integer indices \(l\) and \(m\). The spherical harmonics are expressed as:

$${Y}_{{lm}}\left(\theta,\phi \right)={\left(-1\right)}^{m}\sqrt{\frac{\left(2l+1\right)\left(l-m\right)!}{4\pi \left(l+m\right)!}}{P}_{l}^{m}\left(\cos \theta \right){e}^{{im}\,\phi },$$
(2)

where \({P}_{l}^{m}\) are the associated Legendre polynomials, and different \(l\) values correspond to different orders of spatial symmetry. \({N}_{b}\left(i\right)\) is the number of nearest neighbours of particle \(i\), and \(\left({\theta }_{{ij}},{\phi }_{{ij}}\right)\) is the spherical coordinate of the bond vector \({{{{\bf{r}}}}}_{{{{\bf{ij}}}}}\) connecting central particle \(i\) and neighbour \(j\) in a preset global reference frame. To obtain a physical quantity that reflects the strength of local symmetry and is independent of the choice of coordinate system, Steinhardt introduced the rotation-invariant local order parameter \({q}_{l}\left(i\right)\). This parameter is constructed by summing the squared moduli of all \(m\) components for a given \(l\) and normalising, thereby eliminating dependence on coordinate rotation:

$${q}_{l}\left(i\right)=\sqrt{\frac{4\pi }{2l+1}\mathop{\sum }_{m=-l}^{l}\,{\left|{q}_{{lm}}\left(i\right)\right|}^{2}}.$$
(3)

By averaging the local bond-order vectors, we aim to describe the overall order of the system:

$$\left\langle {q}_{{lm}}\right\rangle=\frac{1}{N}\mathop{\sum }_{i=1}^{N}\,{q}_{{lm}}\left(i\right).$$
(4)

For the \(l\)-th order Steinhardt symmetry parameter \({Q}_{l}\), it is defined as:

$${Q}_{l}=\sqrt{\frac{4\pi }{2l+1}\mathop{\sum }_{m=-l}^{l}\,{\left|\left\langle {q}_{{lm}}\right\rangle \right|}^{2}}.$$
(5)

Here, we choose Q6, the average bond-orientational order at angular wave number \(l=6\), to quantify the system, because sixth-order spherical harmonics can effectively capture and distinguish common crystal symmetries.
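
As a quick numerical check of Eqs. (3) and (5), the invariants can be evaluated without constructing the spherical harmonics explicitly. The sketch below swaps in the spherical-harmonic addition theorem, \(\frac{4\pi }{2l+1}\sum_{m}{Y}_{lm}(\hat{u}){Y}_{lm}^{*}(\hat{v})={P}_{l}(\hat{u}\cdot \hat{v})\), so that only the Legendre polynomial \({P}_{6}\) is needed. It is a pure-Python illustration with hypothetical function names, not the code used in this work, and it scales quadratically with the number of bonds.

```python
from math import sqrt

def legendre_p6(x):
    """Legendre polynomial P_6(x)."""
    x2 = x * x
    return (231.0 * x2**3 - 315.0 * x2**2 + 105.0 * x2 - 5.0) / 16.0

def _unit(v):
    """Normalise a bond vector to unit length."""
    n = sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def _p6_sum(us, vs):
    """Sum of P_6(u . v) over all pairs of unit vectors from two bond lists."""
    return sum(legendre_p6(sum(a * b for a, b in zip(u, v))) for u in us for v in vs)

def local_q6(bonds):
    """q_6(i) of Eq. (3) from the bond vectors of a single atom."""
    us = [_unit(b) for b in bonds]
    return sqrt(max(_p6_sum(us, us), 0.0)) / len(us)

def global_q6(bonds_per_atom):
    """Q_6 of Eq. (5): q_6m is averaged over atoms before taking the invariant."""
    units = [[_unit(b) for b in bonds] for bonds in bonds_per_atom]
    total = sum(_p6_sum(ui, uk) / (len(ui) * len(uk)) for ui in units for uk in units)
    return sqrt(max(total, 0.0)) / len(units)

# Twelve nearest-neighbour bonds of an ideal FCC site.
fcc = ([(a, b, 0) for a in (1, -1) for b in (1, -1)]
       + [(a, 0, b) for a in (1, -1) for b in (1, -1)]
       + [(0, a, b) for a in (1, -1) for b in (1, -1)])
print(round(local_q6(fcc), 4))  # 0.5745 for ideal FCC
```

In a perfect crystal every site has the same environment, so `global_q6` equals `local_q6`; in a disordered system the per-atom \({q}_{6m}\) vectors partly cancel in the average and \({Q}_{6}\) falls towards zero.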

Global order is quantified using liquid state theory, where the excess entropy per atom is expressed as an infinite series of multiparticle correlation functions21, with the two-body term defined as:

$${S}_{2}=-2\pi \rho {k}_{B}\int _{0}^{\infty }\,[g\left(r\right){{{\mathrm{ln}}}} \, g\left(r\right)-g\left(r\right)+1]{r}^{2}{dr},$$
(6)

where \(g\left(r\right)\) is the radial distribution function and \(\rho\) is the density of the system. Here, we employ a mollified version of the radial distribution function

$${g}_{m}\left(r\right)=\frac{1}{4\pi N\rho {r}^{2}}\mathop{\sum }_{i\ne j}\,\frac{1}{\sqrt{2\pi {\sigma }^{2}}}{e}^{-{\left(r-{r}_{{ij}}\right)}^{2}/\left(2{\sigma }^{2}\right)},$$
(7)

as defined by Parrinello et al.22, to compute the instantaneous entropy (sS), where \({r}_{{ij}}\) is the distance between particles \(i\) and \(j\), and \(\sigma\) is a broadening parameter. Because \({g}_{m}\left(r\right)\) has continuous derivatives with respect to atomic positions, it can be substituted directly into Eq. (6); the integral is evaluated with the trapezoid rule up to a cutoff distance \({r}_{\max }\), which is chosen to optimise the numerical integration.
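
Equations (6) and (7) can be sketched in a few lines of pure Python. The example below is illustrative only: it assumes a cubic periodic box with the minimum-image convention, reports \({S}_{2}\) in units of \({k}_{B}\), and uses hypothetical function names and parameter values (`sigma`, `r_max`, `n_grid`) that are not taken from this work.

```python
from math import exp, log, pi, sqrt

def pair_distances(positions, box):
    """Unordered pair distances in a cubic periodic box (minimum image)."""
    dists = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d2 = 0.0
            for a, b in zip(positions[i], positions[j]):
                d = a - b
                d -= box * round(d / box)
                d2 += d * d
            dists.append(sqrt(d2))
    return dists

def mollified_rdf(r, dists, n_atoms, density, sigma):
    """g_m(r) of Eq. (7); each unordered pair is counted twice (sum over i != j)."""
    norm = 1.0 / sqrt(2.0 * pi * sigma * sigma)
    total = sum(exp(-((r - d) ** 2) / (2.0 * sigma * sigma)) for d in dists)
    return 2.0 * norm * total / (4.0 * pi * n_atoms * density * r * r)

def pair_entropy(dists, n_atoms, density, r_max, sigma, n_grid=200):
    """S_2 / k_B of Eq. (6), trapezoid rule on [0, r_max]."""
    dr = r_max / n_grid
    near = [d for d in dists if d < r_max + 5.0 * sigma]  # distant pairs contribute nothing
    integral = 0.0
    prev = 0.0  # integrand taken as zero at r = 0 because of the r^2 factor
    for k in range(1, n_grid + 1):
        r = k * dr
        g = mollified_rdf(r, near, n_atoms, density, sigma)
        glog = g * log(g) if g > 0.0 else 0.0  # g ln g -> 0 as g -> 0
        cur = (glog - g + 1.0) * r * r
        integral += 0.5 * (prev + cur) * dr
        prev = cur
    return -2.0 * pi * density * integral

# Simple-cubic test lattice: 64 atoms, unit spacing, density 1.
positions = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]
dists = pair_distances(positions, 4.0)
print(pair_entropy(dists, len(positions), 1.0, r_max=1.8, sigma=0.05))
```

Since \(g\,{{\mathrm{ln}}}\,g-g+1\ge 0\) for every \(g\ge 0\), the result is never positive; sharply peaked (ordered) distributions give strongly negative values, while \(g\left(r\right)\approx 1\) gives values near zero.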

First-principles calculation

The first-principles calculations were performed using the Vienna ab initio simulation package (VASP)42 v6.4.3. The ion-electron interactions were described using the Projector Augmented Wave (PAW) basis set, with a cutoff energy of 600 eV43. The Perdew-Burke-Ernzerhof (PBE) functional within the Generalised Gradient Approximation (GGA) framework was employed for the exchange-correlation interactions44. The energy and force convergence tolerances for geometry relaxation were 10⁻⁶ eV and 0.01 eV/Å, respectively. VASPKIT45 was used to generate k-points files with a reciprocal space resolution of 2π × 0.04 Å⁻¹. Structures requiring first-principles computations are automatically processed by a high-throughput computation scheduling platform, which manages the entire workflow, including input file generation, computations across multiple CPU and GPU nodes, and result retrieval.

The initial configurations for exploring the structure-property dynamics of FCC/BCC/HCP metals were created with Atomsk46 using their experimental lattice constants and crystal types. These configurations were then used in on-the-fly machine learning accelerated AIMD simulations performed with VASP, which employed a Bayesian learning algorithm. An isothermal-isobaric (NPT) ensemble, employing a Langevin thermostat based on the Parrinello-Rahman algorithm24,47, was utilized to fully melt systems consisting of 108-atom Ir, 128-atom Mo, 108-atom Al, and 128-atom Zr at a pressure of 10 GPa, followed by cooling to a temperature of 300 K, with each stage lasting 200 ps. Configurations obtained exclusively from these AIMD simulations were used for topological analysis and as starting points for structure generation through genetic mutation, ensuring comprehensive sampling of the phase landscape.

MLPs construction and validation

The MPNN, DNN, and DNN-KD models were trained using the DeePMD-kit48 v3.0.1 package, which employs a plugin mechanism to integrate diverse models, enabling fair comparisons under unified datasets and training parameters. For DNN and DNN-KD, the classic ‘se_e2_a’ descriptor type was utilized, with model compression and fine-tuning applied after each training cycle. The descriptor was constructed using a neural network with layers containing 32, 64, and 128 neurons, respectively. The fitting network consisted of six layers with 240, 240, 240, 240, and 240 neurons, ensuring robust learning of complex interatomic interactions. In contrast, the MPNN model adopted the ‘mace’ descriptor type within the DeePMD-kit v3 package, incorporating 128 equivariant messages. All models employed a cutoff radius of 8.00 Å to capture interatomic interactions and used an exponentially decaying learning rate, starting at 0.001 and decreasing to 3.51 × 10⁻⁸, with distinct loss function weights assigned for optimizing energy, force, and virial terms.
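
For orientation, the exponentially decaying learning rate can be written as a staircase schedule that falls from the start value to the stop value over the course of training. The sketch below is generic and is not claimed to reproduce the DeePMD-kit implementation exactly. The start and stop values are those quoted above, whereas `total_steps` and `decay_steps` are hypothetical placeholders because the training length is not stated here.

```python
from math import exp, log

def exp_decay_lr(step, start_lr=1.0e-3, stop_lr=3.51e-8,
                 total_steps=1_000_000, decay_steps=5_000):
    """Staircase exponential decay that reaches stop_lr at total_steps.

    total_steps and decay_steps are assumed values for illustration only.
    """
    # Per-interval factor chosen so that the product over all intervals
    # equals stop_lr / start_lr.
    decay_rate = exp(log(stop_lr / start_lr) / (total_steps / decay_steps))
    return start_lr * decay_rate ** (step // decay_steps)

for step in (0, 250_000, 500_000, 1_000_000):
    print(step, exp_decay_lr(step))
```

With these placeholder settings the rate drops by a constant factor every `decay_steps` steps, so it spends equal numbers of steps in each decade between 10⁻³ and 3.51 × 10⁻⁸.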

All validations of the machine learning potentials were performed using the open-source code Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS49) 29 Aug 2024 version, with standardized workflows for specific property evaluations implemented through auxiliary packages employing LAMMPS as the computational solver. Elastic constants were derived using the auto-test module of the DP-GEN26 package, while Bain path, γ surface, generalized planar fault energy (GPFE), and generalized stacking fault energy (GSFE) curves were calculated via the Lava Wrapper package50.

MD setup

All MD simulations were conducted using the open-source code LAMMPS, integrated with our trained DNN-KD model. For the micro-pillar compression MD simulation, the simulation volume was constructed with a diameter-to-height ratio of 1:1:1.6 and oriented along the principal crystallographic directions [100], [010], and [001] of the FCC lattice, as shown in Supplementary Fig. 17. The simulation temperature was set to 300 K. After energy minimization and NPT relaxation for 100 ps, uniaxial compression along the z-axis was performed with a strain rate of 5.0 × 10⁷ s⁻¹. Fixed layers with a thickness of 6 Å were applied at the top and bottom along the z-axis to inhibit the periodic propagation of dislocations, thereby preventing artificial interactions across periodic boundaries and ensuring the accuracy of the plastic deformation simulation. The Open Visualisation Tool (OVITO)51 was used to visualise the plastic deformation process in the MD simulations, specifically to identify FCC planar faults.

Paracrystalline diamond structure generation

The initial configuration was a 2 × 2 × 2 supercell of C60 (mp-1196583) from the Materials Project, comprising 1920 atoms. Conventional MD simulations were performed using LAMMPS49 with the Tersoff potential52. After energy minimization, the system was heated from 300 K to 5000 K over 1 ns at 50 GPa under the NPT ensemble, held at 5000 K for 1 ns, and then cooled to 300 K with pressure reduced to 30 GPa over 1 ns. Structures from this MD trajectory were selected as the initial configuration library for polymorph generation. In PolymorphGen, the cutoff for the symmetry parameter (Q6) was set to 2.4 Å, and the program automatically computed the initial entropy-symmetry landscape. The landscape range for genetically mutated configurations was expanded by 10% from the current values. For paracrystalline diamond, each target point was evolved over 30 generations with up to 50 individuals per generation.

Graphite-to-diamond transition exploration

Initial configurations were taken from ref. 33, with a Q6 cutoff of 3.6 Å. The landscape range for mutation was set equal to the current entropy-symmetry values. Ten evenly spaced target points were automatically selected along each edge of the landscape. Each target was evolved over 30 generations with up to 50 individuals per generation. The procedure generated 117,799 configurations. After multi-resolution classification with screening thresholds of 0.1 and 0.005 on the entropy-symmetry landscape, 6724 configurations were retained in the first round, and their energy distribution was computed using the Auto-DFT platform.
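
The placement of target points along the landscape edges can be sketched as follows. This is an assumption-laden illustration, not the PolymorphGen code: the function names are hypothetical, the example ranges are made up, the optional `expand` argument assumes that a fractional expansion widens each axis symmetrically, and corner points shared by two edges are kept only once, which may differ from how targets are counted in practice.

```python
def _expand(bounds, fraction):
    """Widen (lo, hi) so that the total width grows by `fraction` (assumed symmetric)."""
    lo, hi = bounds
    pad = 0.5 * fraction * (hi - lo)
    return lo - pad, hi + pad

def edge_targets(s_range, q_range, per_edge=10, expand=0.0):
    """Evenly spaced (entropy, Q6) targets on the four edges of the landscape box."""
    s_lo, s_hi = _expand(s_range, expand)
    q_lo, q_hi = _expand(q_range, expand)
    targets = []
    for k in range(per_edge):
        t = k / (per_edge - 1)
        s = (1.0 - t) * s_lo + t * s_hi   # exact at both end points
        q = (1.0 - t) * q_lo + t * q_hi
        targets += [(s, q_lo), (s, q_hi), (s_lo, q), (s_hi, q)]
    # Corners lie on two edges; keep each once while preserving order.
    return list(dict.fromkeys(targets))

# Made-up landscape bounds; expand=0.0 keeps the range equal to the current values.
targets = edge_targets((-3.0, -1.0), (0.0, 0.5), per_edge=10, expand=0.0)
print(len(targets))  # 36 unique points: 4 edges x 10 minus 4 shared corners
```

Setting `expand=0.1` would instead mimic a 10% widening of the range before the targets are placed.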

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.