Introduction

In materials science, structural disorder arising from impurities, defects1,2 or random atomic placements in high-entropy alloys3 enhances the diversity of material properties. While introducing structural disorder is a common strategy to modify material properties, it complicates precise structure determination, both experimentally and theoretically4,5,6. Experimental techniques for probing disordered materials face limitations. X-ray diffraction (XRD) reveals long-range order but lacks local atomistic details7. Solid-state Nuclear Magnetic Resonance (NMR) resolves atomic positions and interactions but is limited to NMR-active nuclei8. Mössbauer spectroscopy (MES) offers detailed insights into atomic positions and coordination but is restricted to elements like Fe and Ni and suffers from temperature sensitivity9,10. These constraints hinder direct correlations between properties and atomic structures.

Theoretical approaches also face challenges in the discovery of disordered materials. First, chemical doping results in numerous nearly degenerate configurations in the thermodynamic limit. Identifying these low-energy configurations is essential for a high-throughput screening workflow. Second, chemical doping often causes structural changes that reduce symmetry and make electronic structure calculations more complex, which complicates property analysis and prediction. Numerous computational methods have been developed for disordered materials. Early methods like the Virtual Crystal Approximation (VCA)11 and Coherent Potential Approximation (CPA)12 treated disordered structures as ideal crystals, overlooking doping-induced lattice distortion and local atomic changes. The Special Quasirandom Structure (SQS) method13, introduced in 1990, improved accuracy by simulating short-range correlations but struggled with efficiency for complex systems. Structure enumeration programs, like Site Occupancy Disorder (SOD)14, enumlib15, Supercell16, and disorder17, generate all possible non-redundant structures, but evaluating them with first-principles methods is computationally intensive. The classic Cluster Expansion (CE) method is a well-established technique for efficiently calculating energies of various configurations in disordered materials18,19,20. More recently, the CE framework has been extended to incorporate force constants, enabling the prediction of thermodynamic properties, phonon density of states (DOS), and phase diagrams21,22,23. However, as highlighted by Nguyen et al.24, although CE is effective for predicting ground-state energies when atomic relaxations are minimal, its predictive accuracy can be significantly diminished in systems exhibiting substantial atomic relaxations arising from chemical disorder.
Ricardo et al.25 highlighted a strong correlation between electrostatic energy (E_elec) and ground state energy (E_opt). Although E_elec is computationally inexpensive and widely used in high-throughput screening, it may fail in non-ionic systems26,27,28. Fu et al.29 found that the single-point energy (E_sp) can reproduce E_opt results for systems with mild lattice relaxation. An effective screening workflow must therefore incorporate structural relaxation. The main computational challenge is the costly structural relaxation of each initial structure in a large-scale dataset using density functional theory (DFT) calculations.

In recent years, many Machine Learning Potentials (MLPs)30,31,32,33 have been developed to predict the ground state structure from the initial configuration, known as the initial structure to relaxed structure (IS2RS) task34. These MLPs can significantly reduce computational costs by bypassing DFT calculations, thereby facilitating materials discovery. However, for disordered inorganic structures with varying compositions, MLPs face substantial challenges due to the combinatorial explosion caused by doping. Ideally, they require separate training for each compositional space, which imposes prohibitively high data requirements. Active learning-based methods have been developed to alleviate this issue32,33,35,36,37, but the fundamental need to approximate the PES across diverse compositions remains a bottleneck. Furthermore, MLP-based relaxation is inherently indirect, as it approximates the PES before performing energy minimization. This process introduces error accumulation and potential deviations from true relaxed structures, particularly when generalizing to novel systems38. To address these limitations, recent efforts have explored end-to-end methods38,39,40,41,42. These approaches bypass MLP training entirely and directly predict relaxed structures from initial configurations, offering a promising solution for structural prediction in complex disordered systems. Yang et al.41,42 proposed DeepRelax, an iteration-free deep generative model using a periodicity-aware equivariant graph neural network (PaEGNN) with uncertainty quantification, enabling efficient and robust structural relaxation across diverse systems for scalable material discovery. Leveraging the power of graph convolutional neural networks (GCN), Kim and Zuo et al.39,40 applied the pix2pix domain translation model and the BOWSR algorithm based on structure features to relax geometry without DFT.
While achieving remarkable performance, these methods still require extensive DFT data for training and are typically material-specific43. Most notably, Yoon et al.38 used a graph neural network (GNN) to develop the DOGSS method, a machine-learned harmonic force field that approximates the ground state structure and properties of inorganic multicomponent surfaces. This approach eliminates the need for expensive electronic data such as energies, forces, and/or stresses, but it still depends on an extensive set of input structures.

To the best of our knowledge, the current mainstream end-to-end approaches are generally based on deep learning frameworks. While these methods are highly effective once established, their development often requires huge computing resources due to their data-hungry and black-box nature. Consequently, their application to novel complex systems outside established databases like the Materials Project is relatively rare because of the out-of-distribution learning issue (i.e., test data drawn from a distribution different from the training distribution)44.

Inspired by DOGSS, we propose a simple, chemistry-driven approach called the Structure Beautification Algorithm (SBA), which predicts the ground state structure from an initial configuration using a surrogate harmonic potential constructed directly from a small dataset with chemistry-driven parameterization. Tested on rigid systems such as FeCo2SixAl1-x and ZnxCd1-xS, as well as the flexible iron carbide system, SBA proves to be data-efficient, interpretable, and intuitive for approximating ground state structures. Building on these features, we establish an efficient high-throughput screening workflow for chemically disordered materials with extensive configurational space, accelerating material discovery.

Results and discussion

Rigid structure systems

Heusler alloys are attractive for electromagnetic applications due to their unique properties, which are significantly influenced by atomic disorder45,46,47,48,49,50,51,52. Research has concentrated on stable configurations and their properties in FeCo2SixAl1−x3,53,54,55,56,57,58,59. However, chemical disorder complicates synthesis and characterization. Similarly, in ZnxCd1-xS compounds, varying the Zn content allows precise control over bandgap and optical properties, though it introduces computational challenges due to disordered structures60,61,62,63,64,65,66. Here too, chemical disorder complicates experimental characterization. While theoretical studies focus on the most stable structures67, metastable ones with favorable properties may also exist68. Energy alone does not determine the most relevant structure; statistical averaging with Boltzmann weights69, as shown in Supplementary Fig. S1, offers a more accurate analysis. A comprehensive understanding of disordered materials requires considering all low-energy configurations.
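
The Boltzmann-weighted averaging mentioned above can be sketched as follows. This is a minimal illustration, not the workflow used in this study: the function names, the example energies and band gaps, and the temperature of 300 K are hypothetical, and configurational degeneracies are ignored for simplicity.

```python
import math

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def boltzmann_weights(energies_ev, temperature_k):
    """Return normalized Boltzmann weights for a list of configuration energies (eV)."""
    e_min = min(energies_ev)  # shift by the minimum energy for numerical stability
    factors = [
        math.exp(-(e - e_min) / (K_B_EV_PER_K * temperature_k))
        for e in energies_ev
    ]
    partition = sum(factors)
    return [f / partition for f in factors]


def boltzmann_average(values, energies_ev, temperature_k):
    """Average a property over configurations using Boltzmann weights."""
    weights = boltzmann_weights(energies_ev, temperature_k)
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical relative energies (eV) and band gaps (eV) of three configurations.
energies = [0.00, 0.01, 0.05]
band_gaps = [0.70, 0.68, 0.60]

weights = boltzmann_weights(energies, 300.0)
avg_gap = boltzmann_average(band_gaps, energies, 300.0)
print(weights, avg_gap)
```

Because metastable configurations keep a nonzero weight, the averaged property can differ from that of the single lowest-energy structure, which is the motivation for considering all low-energy configurations.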

E_elec and E_sp are common descriptors for structure screening, and E_SBA_sp represents the single-point energy after SBA relaxation. We compare the ground state energy (E_opt) of each structure with E_elec, E_sp, and E_SBA_sp (Fig. 1). Pearson coefficients70 were computed based on Eq. (1) in the Supporting Information for E_elec/E_opt, E_sp/E_opt, and E_SBA_sp/E_opt. A coefficient of 1 indicates a perfect correlation, while a coefficient of -1 signifies a perfect anti-correlation. For FeCo2Si0.5Al0.5, the correlation coefficients are 82.19% for E_elec/E_opt and 90.37% for E_sp/E_opt, while E_SBA_sp/E_opt reaches 99.36% (Fig. 1a). In the Zn0.15Cd0.85S system, E_elec fails due to the ionic nature and E_sp shows a lower correlation of 54.19% with E_opt. In contrast, E_SBA_sp/E_opt is improved to 91.92% (Fig. 1b). The higher correlation of E_SBA_sp with E_opt makes the search for low-energy structures more accurate and improves on the plain E_sp approach to structure screening.
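
As a reminder of how such a correlation is evaluated, the sketch below implements the standard Pearson coefficient in plain Python. It is assumed, not confirmed, that Eq. (1) of the Supporting Information has this standard form; the energy lists are hypothetical placeholders.

```python
import math


def pearson(x, y):
    """Standard Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


# Hypothetical energies (eV) for four configurations.
e_opt = [-10.0, -9.8, -9.5, -9.1]  # fully relaxed reference energies
e_sp = [-9.0, -8.9, -8.4, -8.3]    # descriptor energies to be assessed

r = pearson(e_opt, e_sp)
print(f"Pearson coefficient: {100 * r:.2f}%")
```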

Fig. 1: Energy landscapes of different systems under different approaches.

a Comparative energy analysis of different methods in the FeCo2Si0.5Al0.5 system. The left y-axis represents the energy range of the ground state energy (E_opt), the single-point energy (E_sp) and the single-point energy after SBA relaxation (E_SBA_sp), while the right y-axis represents the electrostatic energy (E_elec). Symbols: E_opt (gray circles), E_sp (red circles), E_SBA_sp (green circles) and E_elec (blue circles). b Comparison of energy profiles across various approaches for Zn0.15Cd0.85S. The left y-axis represents the energy range of E_opt (gray circles) and E_SBA_sp (green circles), while the right y-axis represents the E_sp range (red circles).

To investigate the efficiency of the screening process, we also compare the cost-effectiveness of different methods. Here, we use the Receiver Operating Characteristic (ROC) curve to evaluate these methods in predicting energetically favorable configurations. The ROC curve offers a comprehensive perspective on the performance of different classifiers and is robust to changes in the distribution of the test samples. To construct the ROC curve, we define the true positive rate (TPR) as the fraction of stable structures predicted as stable and the false positive rate (FPR) as the fraction of unstable structures predicted as stable. We define 30 low-energy structures from the E_opt ranking of the FeCo2Si0.5Al0.5 system as targets and evaluate the performance of various methods (Fig. 2a). The closer the ROC curve lies to the top left corner, the more effectively true low-energy structures are classified. The area under the ROC curve (AUC) quantifies this, with values approaching 1 signifying superior performance. With an AUC of 0.99, the single-point energy calculation after SBA (SBA+sp) is well suited to classifying stable structures in the FeCo2Si0.5Al0.5 system. In contrast, the single-point (sp) and electrostatic energy (elec) methods, with AUCs of 0.81 and 0.87 respectively, show less satisfactory performance. Moreover, the computational cost analysis (Fig. 2b) supports this, showing that the SBA+sp method cuts the total cost to 31.94 h and the waste to 0.94 h, compared with 80 and 50 h for elec and sp, respectively. This underscores its superior performance in reducing computational overhead. Tracing this back to its root cause, we find that SBA reduces the per-atom force from 0.64 eV/Å to 0.25 eV/Å, yielding more accurate geometries from the initial configurations. This improvement leads to higher accuracy and efficiency.
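
The ROC construction described above can be sketched in a few lines. The following is an illustrative reimplementation under stated assumptions, not the script used for Fig. 2: the function names are hypothetical, the "stable" class is assumed to be the `n_target` structures with the lowest reference energy, and a lower descriptor energy is taken to mean "predicted more stable".

```python
def roc_points(e_ref, e_pred, n_target):
    """Return (FPR, TPR) points obtained by sweeping a threshold over e_pred.

    Positives are the n_target structures with the lowest reference energy.
    """
    n = len(e_ref)
    if len(e_pred) != n or not 0 < n_target < n:
        raise ValueError("need matching lengths and 0 < n_target < n")
    by_ref = sorted(range(n), key=lambda i: e_ref[i])
    positives = set(by_ref[:n_target])
    by_pred = sorted(range(n), key=lambda i: e_pred[i])
    n_pos, n_neg = n_target, n - n_target
    tp = fp = 0
    points = [(0.0, 0.0)]
    for rank, i in enumerate(by_pred):
        if i in positives:
            tp += 1
        else:
            fp += 1
        # Emit a point only once a group of tied descriptor values is complete.
        last = rank == n - 1
        if last or e_pred[by_pred[rank + 1]] != e_pred[i]:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical reference and descriptor energies for six structures.
e_opt_demo = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
e_pred_demo = [1.0, 1.3, 1.2, 1.5, 1.6, 1.7]
auc_demo = auc(roc_points(e_opt_demo, e_pred_demo, n_target=2))
print(auc_demo)
```

A descriptor that ranks every target structure ahead of every non-target structure gives an AUC of 1, while a descriptor that swaps a target with a non-target, as in the demo data, lowers it.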

Fig. 2: Performance and cost analysis of low-energy structure prediction.

a Receiver operating characteristic curve for the prediction of low-energy configurations. It displays the ability of electrostatic energy (elec), single-point energy (sp), and SBA calculations to classify stable structures when applied to initial structures. The x-axis is the fraction of unstable structures classified as stable. The y-axis is the fraction of stable structures classified as stable. The area under the curve (AUC) is also shown for all methods. b Sampling cost for screening low-energy structures with different methods. Total-cost (gray-blue bar) is the overall cost to screen target low-energy structures, while Waste-cost (gray-orange bar) represents the extra computational expense incurred due to inaccuracies in different methods.

Benefiting from the structure prediction ability of SBA, we establish a generalized workflow for screening low-energy structures in disordered materials (Fig. 3), in which different systems undergo distinct processes. For rigid systems, we first apply SBA to relax the initial structures and then perform a straightforward sp calculation, bypassing the time-consuming DFT relaxation and allowing for the efficient identification of low-energy structures. Because SBA increases the value of the sp calculation, this workflow has the potential to replace conventional high-throughput screening in such systems.

Fig. 3: High-Throughput screening of low-energy configurations in chemically disordered materials.

This flowchart outlines the key steps in identifying low-energy configurations, including structure generation, energy evaluation and stability classification.

To validate the structures relaxed by SBA, we compare the electronic properties of FeCo2Si0.5Al0.5 for initial, ground state, and SBA-relaxed configurations. Figure 4a shows that all three structures exhibit a consistent spin-down band gap of 0.70 eV near the Fermi level, indicating half-metallic behavior. Pearson coefficients of 94.8% and 99.9% for the initial and SBA-relaxed structures compared to the ground state demonstrate the accurate structure prediction ability of SBA. For 10 randomly selected configurations, SBA relaxation reduces discrepancies in the averaged band gap and magnetic moment, aligning properties closer to the ground state (Fig. 4b). Likewise, for Zn0.15Cd0.85S, the average band gap deviation is just 1.5% (Fig. 4c), and the DOS similarity between SBA-relaxed and ground state structures is 96.54% (Fig. 4d). This accuracy allows SBA not only to replace high-throughput calculations for screening low-energy structures but also to predict electronic properties directly, achieving over 95% savings in property evaluation costs. CE methods predict thermodynamic properties using a limited set of first-principles calculations but fail to explicitly account for lattice relaxation effects21,22,23. In contrast, SBA incorporates relaxation into harmonic potential construction, generating optimized structural analogs to mitigate the impact of lattice relaxation. SBA is particularly suited for systems requiring explicit relaxation treatment, while CE remains advantageous for large-scale configurational sampling due to its lower computational overhead. To demonstrate the robustness and universality of SBA more efficiently and comprehensively, we use the Spanish Initiative for Electronic Simulations with Thousands of Atoms (SIESTA) code71, which offers lower computational cost, to perform calculations on the Ni1-xMox (x = 0.4) alloy system. Supplementary Figs. S2 and S3 further highlight the excellent performance of SBA.

Fig. 4: Electronic structure and magnetic properties of FeCo2Si0.5Al0.5 and Zn0.15Cd0.85S systems.

a Comparison of the DOS for the initial structure (light-gray shading), SBA-relaxed structure (light-blue shading) and ground state structure (light-purple shading) for the FeCo2Si0.5Al0.5 system. b The band gap and magnetic moment difference for the initial structures and SBA-relaxed structures for the FeCo2Si0.5Al0.5 system. The left y-axis shows the band gap (blue bar) and the right y-axis shows the magnetic moment difference (orange bar). c Comparison of the DOS for the initial structure (light-gray shading), SBA-relaxed structure (light-blue shading) and ground state structure (light-purple shading) in the Zn0.15Cd0.85S system. d Comparison of the average band gap for the initial structure (light-purple bar), SBA-relaxed structure (gray-blue bar), and ground state structure (gray-orange bar) in the Zn0.15Cd0.85S system.

Flexible structure with floppy potential energy surface

Iron carbide intermetallic compounds play a crucial role in catalysis for carbon nanotubes (CNTs)72, pollutant removal73, electro-catalysis74, electronic water splitting and Fischer-Tropsch synthesis. It is well known that their catalytic activities are mainly determined by their structures75,76, which strongly affect the number of active sites and the electronic structure of heterogeneous catalysts. Discerning the active structure is essential for establishing the intrinsic structure-activity relationship, which is the basis for high-performance catalyst design. However, iron and carbon can form variable disordered structures under reaction conditions, which poses a challenge for experimental characterization. Therefore, numerous studies have been carried out to improve the catalytic performance by designing well-defined structures. In theoretical studies, Gao et al.77 used molecular dynamics to analyze carburization in Body-Centered Cubic (BCC) iron and found lattice changes to Face-Centered Cubic (FCC) and Hexagonal Close-Packed (HCP) at a late stage. Yuan et al.78 systematically explored iron carbide phases through a global structure search approach coupled with DFT methods and enriched the understanding of the local structure and properties of iron carbides.

In general, iron carbide catalysts are synthesized through iron carburization by a carbon-source feedstock. Accordingly, in this work, the FCC iron lattice serves as the host, and we investigate the effect on the iron structure of a small content of permeated carbon atoms located at the octahedral sites79. Enumeration and Boltzmann distribution weights (Fig. 5a) are employed to derive reasonable low-energy structures, with the Pearson coefficient evaluating the E_elec/E_opt and E_sp/E_opt correlations. Correlations of 46.19% and 50.3% (Fig. 5b) indicate that both E_sp and E_elec are inadequate for reliably screening stable candidates in this flexible system. Although the SBA method is applied to relax the initial structures, the correlation of E_SBA_sp/E_opt only improves slightly to 54.2%. This poor correlation is attributed to the mismatch in Fe/C atomic sizes in the C-doped system. However, the averaged per-atom force within the structure decreases from 2.77 eV/Å to 1.34 eV/Å, which indicates that the structures become more reasonable in geometry after SBA relaxation. Furthermore, when full relaxation is performed starting from SBA-relaxed geometries, different initial structures exhibit varying degrees of computational resource savings, reaching up to 60% (Fig. 6a). The ionic steps and time required to reach convergence are both reduced by 33% (Fig. 6b). These results demonstrate the impact of SBA in accelerating calculations for flexible systems. Supplementary Figs. S4 and S5 and Table S1, derived from SIESTA calculations for the Al2.5CoNi4.5 system, provide additional evidence supporting the performance of SBA.

Fig. 5: Energy distribution and computational approaches in the Fe32C4 system.

a Boltzmann proportion of low-energy configurations at a specific temperature. b Energy landscape of different computational approaches in the Fe32C4 system, including ground state energy (E_opt), single-point energy (E_sp), single-point energy after SBA relaxation (E_SBA_sp) and electrostatic energy (E_elec). The left y-axis represents the energy range of E_opt, E_sp, and E_SBA_sp, while the right y-axis shows the range of E_elec. Symbols: E_opt (gray circles), E_sp (red circles), E_elec (blue circles), and E_SBA_sp (green circles).

Fig. 6: Comparing time efficiency and convergence rate in structural optimization with DFT structure optimization.

a Time cost improvement of each configuration after SBA relaxation. Time cost improvement refers to the savings in computational resources compared to DFT structure optimization. b Comparison of structural optimization convergence rates. The left y-axis shows the ionic steps and the right y-axis shows the time cost of DFT relaxation from the initial structures (gray-purple bar) and from the SBA-relaxed structures (gray-orange bar).

A cost-effective method such as the E_elec approach is valuable for reducing the structural space when dealing with a flexible system with a vast configuration space. Although the accuracy of the E_elec method is limited, it is highly effective for rapid structure screening and has been well received in the literature26,27,28. Alternatively, advanced sampling-space reduction algorithms can speed up the structural screening process, e.g., the LAsou technique developed by our group. It is based on machine learning potentials and has proved efficient in many material fields36,80. Combining LAsou with SBA would make structural screening simpler and more manageable in the future.

In summary, we introduce an algorithm that leverages the locality of chemical environments and develops a parameter-free harmonic potential for refining unreasonable initial structures. This algorithm represents a significant advancement, moving beyond the prevailing reliance on data-driven MLPs for predicting stable structures. For rigid systems, SBA attains over 90% agreement in energy between SBA-relaxed and ground state structures, effectively identifying low-energy configurations and reducing property calculation costs by more than 95%. Even for the challenging flexible systems, it achieves a 30% reduction in computational cost, a performance comparable to that of the DOGSS model. Since the SBA method does not depend on a large training dataset, it exhibits notable transferability and practicality. By employing its relaxation capability, we establish a comprehensive workflow for the systematic identification of low-energy structures in disordered materials. This methodology aims to improve the efficiency of the screening process and accelerate the discovery of novel materials.

Methods

Structure beautification algorithm

As illustrated in Fig. 7, algorithms for predicting ground state structures from input initial structures can be roughly classified into three categories. MLPs are typically constructed using computationally expensive DFT data to fit the global potential energy surface (PES), enabling predictions of ground-state configurations from structural inputs (Fig. 7a). While active learning strategies33,34,36,37,80,81,82 improve data efficiency, unresolved challenges remain. In complex systems such as those with disordered doping, the vast configuration space severely limits the efficacy of active learning because of limited transferability across different compositional spaces. To address this practical issue, Yoon et al.38 introduced an innovative approach to accelerate the structure relaxation of inorganic multicomponent surfaces. This approach avoids the need for expensive datasets containing energies and forces, relying only on pairs of initial and relaxed structures. Rather than directly constructing a global PES, DOGSS first uses a GNN model to predict the parameters of a local PES for a given input structure, represented by a classical model potential such as a harmonic or Lennard-Jones (L-J) potential. The predicted local PES is then minimized to yield the relaxed structure (Fig. 7b). This model is not evaluated on how well it fits energies and forces, but on how well the predicted structure matches the ground state structure. It greatly simplifies data preparation, achieving better transferability, wider applicability and practical utility38. Nevertheless, because DOGSS is built on a GNN-based deep learning framework, it requires a substantially large training dataset (GASPy) for initialization.

Fig. 7: Different algorithms and workflows for predicting ground state structures.

a The overall workflow for using MLPs to predict the ground state structure. b Workflow of the DOGSS algorithm38. c Workflow proposed in this work.

Following the direction pioneered by Yoon et al.38, we explore the possibility of simplifying dataset preparation by opening the black box of the machine learning component with chemical insights. As illustrated in Fig. 7c, we propose a straightforward chemistry-driven model to predict the ground state of an input structure. This approach circumvents the traditional necessity for extensive, high-dimensional training datasets, offering a more efficient pathway for high-throughput screening of novel complex disordered structures. The fundamental chemical insight behind SBA is that the complex configurational space can be significantly simplified at the coarse-grained local chemical environment level using graph isomorphism. Graph theory is widely applied in areas such as data science and chemical engineering83,84. Greeley et al.85 applied a graph theory-based approach to simplify surface adsorption complexities, facilitating the identification of unique configurations and the systematic estimation of high coverage models on low-symmetry catalytic surfaces.

In our method, each atom within a disordered configuration is characterized by a subgraph representing its local chemical environment, comprising the central atom and its nearest neighbors within a defined cutoff, as illustrated in Table 1. These subgraphs, represented as undirected graphs (atoms as nodes, bonds as edges), are then subjected to redundancy filtering to identify unique topological motifs for reference. For simplicity, duplicate motifs are represented by one randomly chosen subgraph. Similar to DOGSS, SBA constructs a local harmonic potential for a given structure and then minimizes it to approximate the ground state structure. In contrast to traditional machine learning potentials requiring extensive iterative training, our harmonic potential parameters are directly derived from a very small pre-computed dataset of relaxed structures without iterative fitting. The subgraphs needed to assemble the topological structure are identified within the reference subgraphs, enabling the direct retrieval of the pairwise optimal spring distances l_ij of the harmonic potential. For each pair of atoms i and j, the spring constant is set to k_ij = 1/l_ij^2, following the force-directed graph drawing algorithm86. Hence, the local harmonic potential is constructed to approximate the ground state spatial structure without iterative fitting.
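
The lookup of spring parameters from reference motifs can be illustrated with a deliberately simplified sketch. Everything below is an assumption made for illustration and is not the SBA implementation: the function names are hypothetical, a sorted list of neighbor species stands in for a true graph isomorphism test, periodic images are ignored, and the first occurrence of a motif is kept as its representative rather than a random one.

```python
import math


def neighbors(positions, i, cutoff):
    """Indices of atoms within `cutoff` of atom i (no periodic images)."""
    return [
        j for j in range(len(positions))
        if j != i and math.dist(positions[i], positions[j]) <= cutoff
    ]


def motif_key(symbols, positions, i, cutoff):
    """Hashable stand-in for the local-environment subgraph of atom i."""
    nbrs = neighbors(positions, i, cutoff)
    return (symbols[i], tuple(sorted(symbols[j] for j in nbrs)))


def build_reference(symbols, initial_pos, relaxed_pos, cutoff):
    """Map (motif, neighbor species) to a relaxed distance from one reference pair."""
    reference = {}
    for i in range(len(symbols)):
        key = motif_key(symbols, initial_pos, i, cutoff)
        for j in neighbors(initial_pos, i, cutoff):
            # Keep the first occurrence as the representative of duplicate motifs.
            reference.setdefault(
                (key, symbols[j]), math.dist(relaxed_pos[i], relaxed_pos[j])
            )
    return reference


def assign_springs(symbols, positions, cutoff, reference):
    """Return {(i, j): (l_ij, k_ij)} with i < j and k_ij = 1 / l_ij**2."""
    springs = {}
    for i in range(len(symbols)):
        key = motif_key(symbols, positions, i, cutoff)
        for j in neighbors(positions, i, cutoff):
            l_ij = reference.get((key, symbols[j]))
            pair = (min(i, j), max(i, j))
            if l_ij is not None and pair not in springs:
                springs[pair] = (l_ij, 1.0 / l_ij ** 2)
    return springs


# Hypothetical three-atom chain: initial geometry and its relaxed counterpart.
symbols_demo = ["Fe", "C", "Fe"]
initial_demo = [(0.0, 0.0, 0.0), (1.8, 0.0, 0.0), (3.6, 0.0, 0.0)]
relaxed_demo = [(0.0, 0.0, 0.0), (1.9, 0.0, 0.0), (3.8, 0.0, 0.0)]

reference_demo = build_reference(symbols_demo, initial_demo, relaxed_demo, cutoff=2.0)
springs_demo = assign_springs(symbols_demo, initial_demo, 2.0, reference_demo)
print(springs_demo)
```

In this toy case the reference is built from the same structure it is applied to; in practice the reference motifs would come from a small set of relaxed structures and be reused for unseen configurations.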

Table 1 Overview of the input micro-environments extraction for constructing the harmonic potential in the three different systems

Besides the energies and atomic forces of the harmonic potential, the virial stress inside a periodic cell is derived as formulated by Thompson et al.87 to handle variable lattices. The energy minimization was carried out using the FIRE optimizer88 from the Atomic Simulation Environment (ASE) package89. A stringent force convergence criterion of 0.001 eV/Å was applied: atomic positions were relaxed until the maximum force on any atom fell below this criterion.

$$E\left(r\right)=\mathop{\sum }\limits_{i\ne j}\frac{1}{{l}_{{ij}}^{2}}{({r}_{{ij}}-{l}_{{ij}})}^{2}$$
(1)
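
To make Eq. (1) concrete, the sketch below evaluates the harmonic energy and its analytic forces and relaxes a toy structure. It is a simplified illustration, not the implementation used in this work: plain steepest descent is swapped in for the FIRE optimizer, the cell is fixed and non-periodic (no virial stress), and the function names, step size, and example geometry are hypothetical. Since Eq. (1) sums over ordered pairs i ≠ j, each unordered pair contributes twice, which gives the factors 2 and 4 in the code.

```python
import math


def harmonic_energy_and_forces(positions, springs):
    """Evaluate Eq. (1) and its forces.

    springs: {(i, j): l_ij} for unordered pairs; k_ij = 1 / l_ij**2.
    Atoms of a pair are assumed not to coincide (r_ij > 0).
    """
    energy = 0.0
    forces = [[0.0, 0.0, 0.0] for _ in positions]
    for (i, j), l_ij in springs.items():
        d = [a - b for a, b in zip(positions[i], positions[j])]
        r = math.sqrt(sum(c * c for c in d))
        k = 1.0 / l_ij ** 2
        # Each unordered pair appears twice in the sum over i != j.
        energy += 2.0 * k * (r - l_ij) ** 2
        # Force on atom i is -dE/dr_i = -4 k (r - l_ij) * d / r.
        coeff = -4.0 * k * (r - l_ij) / r
        for axis in range(3):
            forces[i][axis] += coeff * d[axis]
            forces[j][axis] -= coeff * d[axis]
    return energy, forces


def relax(positions, springs, fmax=1e-3, step=0.05, max_steps=10000):
    """Steepest-descent stand-in for FIRE: stop when the largest force is below fmax."""
    pos = [list(p) for p in positions]
    for _ in range(max_steps):
        _, forces = harmonic_energy_and_forces(pos, springs)
        largest = max(math.sqrt(sum(c * c for c in f)) for f in forces)
        if largest < fmax:
            break
        for p, f in zip(pos, forces):
            for axis in range(3):
                p[axis] += step * f[axis]
    final_energy, _ = harmonic_energy_and_forces(pos, springs)
    return pos, final_energy


# Hypothetical three-atom chain with target spring lengths of 1.9 (arbitrary units).
start = [(0.0, 0.0, 0.0), (1.8, 0.0, 0.0), (3.6, 0.0, 0.0)]
targets = {(0, 1): 1.9, (1, 2): 1.9}

e_start, _ = harmonic_energy_and_forces(start, targets)
relaxed, e_final = relax(start, targets)
print(e_start, e_final, math.dist(relaxed[0], relaxed[1]))
```

With k_ij = 1/l_ij^2 the energy is dimensionless, so the force threshold here is in the units of this toy model rather than eV/Å.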

Computational details

To evaluate the feasibility of SBA as illustrated in Fig. 7c, we demonstrated the performance of our model on both rigid and flexible systems. In the context of chemical doping, rigid systems exhibit minor deviations from their initial structures, whereas flexible systems can undergo significant lattice distortions. Based on these system characteristics, different strategies can be developed for high-throughput screening. Specifically, we used the FeCo2SixAl1-x (x = 0.5) and ZnxCd1-xS (x = 0.15) systems as cases for rigid systems, and iron carbide Fe32Cx (x = 4) for flexible systems. All initial structures were generated using the open-source Supercell program16, which allows for the customization of supercell sizes and lattice site occupancy. In total, 153, 325, and 71 structures were generated for the FeCo2SixAl1-x (x = 0.5), ZnxCd1-xS (x = 0.15), and Fe32Cx (x = 4) systems, respectively.

We performed first-principles calculations using the plane-wave code Vienna ab initio simulation package (VASP)90,91. The generalized gradient approximation in the Perdew-Burke-Ernzerhof parameterization (GGA-PBE)92 was adopted for the exchange-correlation functional. In addition, the PBE + U method93 (Ueff = 2 eV for Co and Fe)56 was employed. All atoms were allowed to relax to their equilibrium states, with an energy cutoff of 500 eV and convergence criteria of 1 × 10^-4 eV for energy and 0.02 eV/Å for forces.