Introduction

Solid-state nuclear magnetic resonance (SSNMR) is a powerful tool for probing structural differences in local environments for both crystalline and amorphous materials. As a local probe of structure, SSNMR can be a highly effective characterization tool, because long-range order (often required for diffraction methods) is not needed. Consequently, there is broad applicability of NMR across chemical, biological and materials science fields, characterizing diverse systems ranging from battery anodes, to biological solids, to zeolites1,2,3,4,5,6.

The most familiar NMR methods have focused on nuclear spin–½ (I = 1/2) systems, such as 1H and 13C; however, a majority of NMR-active isotopes are quadrupolar, with I > 1/2. Studies of these nuclei can yield exquisite detail about the interaction between the nuclear electric quadrupole moment and the electric field gradient (EFG) produced by the surrounding electron clouds7,8,9,10,11,12,13,14. Even with the revolutionary demonstration of new NMR pulse-sequence methods for quadrupolar species in the 1990s (resulting in highly resolved spectra)15, SSNMR experiments using quadrupolar probe nuclei are still often plagued by complicated lineshapes that can overlap and become difficult to interpret in terms of structural features. Hence, ‘NMR crystallography’, which combines NMR with other experimental techniques such as diffraction and with computational methods like density functional theory (DFT) to achieve a comprehensive, data-consistent picture of a material’s structure16,17,18,19,20,21, has been transformative for the interpretation of SSNMR spectra of quadrupolar species.

NMR crystallography relies on state-of-the-art first-principles calculations such as DFT. However, the computational cost of DFT is relatively large, which hinders broad adoption. More importantly, the reliability of these calculations in predicting experimental parameters has to be assessed one isotope at a time, with the literature focusing on 1H, 13C, 29Si, 31P and 17O in various systems22,23,24,25.

Literature benchmarks have provided large datasets of computed data available via community databases, such as in the Materials Project (materialsproject.org) and the Collaborative Computational Project for NMR (CCP-NC)26,27 that can be utilized for advanced machine learning (ML) studies to reduce the computational cost. The cubic scaling28 of DFT calculation time with respect to the number of valence electrons in the system limits these datasets to focusing on comparatively small unit cells of perfect crystalline materials modeled at a temperature of 0 K. Still, appropriately trained ML algorithms have demonstrated the ability to capture local geometry to predict δiso with accuracy close to DFT while requiring only a fraction of computing time28,29,30,31,32,33,34,35,36,37. While most of the machine learning efforts have been focused on the prediction of δiso, the experimentally measured isotropic chemical shift (or σiso, the DFT computed chemical shielding)38,39,40,41, there have been fewer studies of quadrupolar nuclei that demonstrate the ability of machine learning algorithms in predicting expressions of the electric field gradient (EFG) tensor parameters, such as CQ42,43,44. NMR of quadrupolar nuclei often results in complex lineshapes that must be deconvoluted in order to extract chemical insights from the connection between the structure and spectroscopic lineshape. The lineshape itself is representative of the electron density distribution surrounding nuclei. The quadrupolar tensor elements provide a complementary measurement of small perturbations to local environments, especially when it is hard to distinguish different sites based on isotropic chemical shift alone45,46. Thus, the development of a machine learning method for the prediction of EFG tensor elements, and expressions of those elements such as CQ, can be highly informative for NMR crystallography studies.

Herein, we present a solid-state 27Al NMR benchmarking set with both DFT-calculated EFG and magnetic shielding tensor elements and their experimentally measured counterparts reported in the literature. The most common way to report magnetic shielding tensor elements is with expressions that employ the “Haeberlen” convention. We follow that convention here, for ready comparison between computed and experimentally measured quantities, reporting both computed values and their experimental complements for isotropic chemical shielding (σiso) in the Supplementary Information Table S1 and Table S2. The full definition of NMR parameters can be found in the Supplementary Information Section V. The DFT calculations were performed with two popular DFT packages: the Vienna Ab initio Simulation Package (VASP) and the Cambridge Serial Total Energy Package (CASTEP)47,48. The reliability of DFT predictions of σiso, CQ, and the tensor elements (Vab) of the EFG tensor V in the principal axis system was confirmed for 27Al materials. We further trained a “random-forest” machine learning model to predict the quadrupolar coupling constant CQ, a widely used experimental parameter, for compounds containing 4-, 5- and 6-coordinate 27Al sites, based on a larger DFT-calculated dataset of 1681 aluminum-containing crystalline solid materials. To train the model, we constructed two sets of features based on the crystal structure, structural features and elemental features (sometimes termed “alchemical features” in the machine learning literature)49, to represent the 27Al local environment. We have found that the 27Al CQ value is closely correlated with the geometric properties of the next-neighbor bonding environment (surprisingly, regardless of the chemical identity of the bonded species). The next-neighbor bonding environment is typically depicted, for ease of visualization, as a space-filling polyhedron. Distortions of the polyhedron, given by the variance of bond lengths and bond angles, in combination with other features denoting elemental variance, produce a simple but effective model.

Results and discussion

27Al DFT benchmarking

We begin by benchmarking the ability of DFT to predict chemical shielding tensors against experimentally compiled chemical shift tensors. Unfortunately, in cases where the central transition pattern is strongly influenced by the second-order quadrupolar interaction, it can be challenging to extract information on the chemical shielding tensor50,51. In particular, many references do not report the chemical shift anisotropy or the asymmetry parameter of the shielding tensor, because these are difficult to determine with precision when quadrupolar interactions are present52, leaving only the isotropic chemical shift to compare with our computational dataset.

Figure 1 shows the correlation between the experimental isotropic chemical shift (δiso) and the DFT-calculated isotropic shielding (σiso) for two different packages (VASP and CASTEP). Both packages accurately predict 27Al isotropic chemical shifts, with R2 = 0.98 and RMSE values of 4.0 ppm (VASP) and 4.4 ppm (CASTEP). Due to VASP’s unique definition of the shielding tensor53, we have inverted the sign of its output to ensure consistency in the interpretation of the plots. Further details on this issue can be found in our previous work24. In addition, Fig. 1(c) demonstrates a strong correlation between the two packages, with R2 = 0.99 and RMSE = 3.0 ppm, suggesting that future calculations with either of these codes should yield comparable results. As expected for the isotropic chemical shift (shielding), the data points in the correlation plots of Fig. 1 are grouped, or “clustered”, according to the local coordination number (4, 5 and 6). To better understand the performance of DFT within each individual coordination environment, we also plot the correlation between DFT and experiment separately for each, for reference. Those data are found in the Supplementary Information Section IX. Outliers can be identified by plotting the standardized residual values against each independent variable (Supplementary Information Figure S1) and identifying those that fall outside a given confidence interval, which for this study was set at 99%. A discussion of the possible origin of such outliers can be found in Supplementary Information Section II.
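The outlier screening described above can be sketched as follows. This is only a minimal illustration, assuming a straight-line least-squares fit of one variable on the other and a two-sided normal-quantile cutoff; the function name flag_outliers and the simple (non-studentized) standardization are assumptions made here and may differ from the procedure documented in the Supplementary Information.

```python
import math
from statistics import NormalDist


def flag_outliers(x, y, confidence=0.99):
    """Indices whose standardized residual from a straight-line
    least-squares fit of y on x lies outside the two-sided confidence level."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom (two fitted parameters).
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    if s == 0.0:
        return []  # perfect fit, nothing to flag
    # Two-sided normal quantile, about 2.576 for a 99% level.
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return [i for i, r in enumerate(residuals) if abs(r / s) > z_crit]
```

Here x would hold, for example, the computed σiso values and y the experimental δiso values for the benchmark sites; the returned indices are the candidate outliers to inspect individually.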

Fig. 1

(a) VASP calculated σiso (ppm) versus experimental δiso (ppm). (b) CASTEP calculated σiso (ppm) versus experimental δiso (ppm). (c) CASTEP calculated σiso (ppm) versus VASP calculated σiso (ppm). In plots (a) and (b), outliers identified by a standardized residual plot with a 99% confidence range are highlighted in red (see Supplementary Information, Figure S1).

We compared the computed diagonalized EFG tensor components against experimentally reported values for CQ and ηQ. The diagonalized EFG tensor for quadrupolar nuclei can be translated into these convenient algebraic expressions, CQ and ηQ, which reflect the appearance of the experimentally measured spectra. For quadrupolar species such as 27Al, both CQ and ηQ values from the EFG tensor are often reported in the literature, which enables a more sensible and direct comparison between experimentally acquired and DFT-computed results. (We report in the Supplementary Information, Section X, the principal components of the second-rank symmetric EFG tensor, V (i.e., Vxx, Vyy, and Vzz), and a comparison to each of the DFT-computed components.) Figure 2 shows the high degree of correlation between experimentally measured |CQ| and the corresponding values calculated by the DFT packages (VASP and CASTEP). The sign of CQ cannot be measured from a 1D NMR spectrum, and a special double-resonance experiment must be used instead54,55; thus nearly all experimental papers report only the magnitude of CQ. The strong correlation between DFT and experiment, with R2 = 0.96 for VASP and R2 = 0.95 for CASTEP, demonstrates that DFT can accurately predict the EFG tensor. We note two outliers, identified with the same confidence-interval method used previously. Two significantly different CQ values have been reported for β-AlF3, with one publication stating that the single 27Al site in β-AlF3 has |CQ| = 3.4 MHz and another stating |CQ| = 0.8 MHz56,57. Our calculated result (-1.31 MHz) lies closer to 0.8 MHz, and this value is supported by a more recent publication58 from 2014. The second outlier is the previously noted (CaO)4(Al2O3)3, with experimentally reported |CQ| = 2.4 MHz and VASP-calculated CQ = 4.41 MHz. It is still unclear whether our idealized structural model is an accurate representation of the local structural motifs in the measured sample of (CaO)4(Al2O3)3; if it is not, the comparison of NMR parameters is inappropriate. Figure 2(c) shows a strong correlation between the two DFT packages, with R2 = 0.99 for CQ, suggesting that future calculations with either of these codes should yield comparable results.
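For readers who want to see how CQ and ηQ follow from the principal components, the short sketch below applies the commonly used definitions CQ = eQVzz/h and ηQ = (Vxx − Vyy)/Vzz with the ordering |Vzz| ≥ |Vyy| ≥ |Vxx|. It is an illustration only: the function name, the choice of atomic units for the input, and the 27Al quadrupole moment of 146.6 mb are assumptions made here, and the definitions in Supplementary Information Section V take precedence.

```python
# Physical constants in SI units.
E_CHARGE = 1.602176634e-19   # elementary charge, C
PLANCK = 6.62607015e-34      # Planck constant, J s
EFG_AU = 9.7173624292e21     # one atomic unit of electric field gradient, V / m^2
Q_27AL_BARN = 0.1466         # ASSUMED 27Al nuclear quadrupole moment, barn


def quadrupolar_parameters(principal_components, q_barn=Q_27AL_BARN):
    """Return (CQ in MHz, eta_Q) from three EFG principal components
    given in atomic units (an assumption about the input units)."""
    # Order the components so that |Vzz| >= |Vyy| >= |Vxx|.
    vxx, vyy, vzz = sorted(principal_components, key=abs)
    if vzz == 0:
        raise ValueError("Vzz is zero, so eta_Q is undefined")
    # CQ = e * Q * Vzz / h, with Q converted from barn to m^2.
    cq_hz = E_CHARGE * (q_barn * 1e-28) * (vzz * EFG_AU) / PLANCK
    # eta_Q is a ratio of a (possibly small) difference to Vzz.
    eta = (vxx - vyy) / vzz
    return cq_hz / 1e6, eta
```

Because ηQ is built from the difference Vxx − Vyy, small changes in the two smaller components can shift it noticeably while CQ, which depends only on Vzz, barely moves.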

Fig. 2

(a) VASP calculated |CQ| (MHz) versus experimental |CQ| (MHz). (b) CASTEP calculated |CQ| (MHz) versus experimental |CQ| (MHz). (c) CASTEP calculated CQ (MHz) versus VASP calculated CQ (MHz). In plots (a) and (b), the outlier species are highlighted in red.

The correlation between experimentally reported ηQ and DFT-computed ηQ is shown in Fig. 3. The values computed by CASTEP and VASP correlate strongly with each other (R2 = 0.95), which suggests that these codes remain self-consistent with respect to the full expression of the EFG tensor. In contrast, any correlation between computed and experimental values is tenuous at best, with many outliers. This weak correlation may not be surprising: ηQ values, at present, offer limited utility for benchmarking EFG tensors, because the mathematical definition of ηQ makes it numerically unstable and sensitive to small perturbations59,60. Some experimentalists resort to assuming an ηQ value based on knowledge of the crystal structure, usually at one of the two extremes (0 or 1)61. Consequently, these data demonstrate that the experimentally reported asymmetry parameter may not be sufficiently robust for benchmarking comparisons. Nevertheless, the high degree of correlation of the individual tensor elements (Vxx, Vyy, Vzz), shown in the Supplementary Information, Section X, shows promise, even when the expression of these elements that reflects the measured lineshape, ηQ, is less robust.

Fig. 3

Correlation of experimentally reported values for ηQ with DFT calculated ηQ. (a) VASP calculated versus experimentally-reported values; (b) CASTEP-calculated versus experimentally-reported values; and (c) correlation of CASTEP versus VASP values for ηQ.

Fast prediction of 27Al CQ with machine learning

As shown above, CQ is predicted well by DFT and is therefore a good target for machine learning. In the following, we construct a machine learning model that predicts, from crystal structures, the CQ value measured in 27Al SSNMR experiments, in order to enable fast computation of this informative NMR parameter. We chose CQ as the training target because it is a frequently reported parameter representing aspects of the EFG, and its values appear to be robust (in contrast with, for example, ηQ). Hence, this parameter offers an effective way to compare experiments with calculated values, and therefore to gauge the accuracy of machine learning predictions.

By leveraging the strengths of both DFT calculations and machine learning algorithms, we aim to develop a powerful predictive tool that bridges the gap between experimental observations and theoretical predictions in solid-state NMR spectroscopy of quadrupolar nuclei.

DFT calculated 27Al database

To predict CQ values for 27Al with machine learning, we constructed a VASP-calculated database with 1681 aluminum-containing solid crystalline materials utilizing the high-throughput DFT framework of the Materials Project27. The sites in the database can be classified as belonging to three types of local coordination environments: 4-coordinate tetrahedral (termed “T:4”), 5-coordinate trigonal bipyramidal (“T:5”) and 6-coordinate octahedral (“O:6”). Unusual geometries such as 2-coordinate linear or bent geometries, 3-coordinate trigonal planar, or 4-coordinate square planar were excluded.

Feature engineering

One of the most critical aspects of a successful machine learning model lies in “feature engineering.” In terms of materials science, features are usually properties related to the materials or values that can be derived or calculated based on materials’ structural or chemical information62,63. In terms of these chemical entities, our effort is to select features that provide a means for recognizing patterns in the data, and to correlate an NMR measurement with one or more specific chemical (or structural) properties. When successfully identified, features, either singly or combined, can form a numerical representation of the material, usually expressed in the form of a 1D vector. For machine learning prediction, these numerical representations need to capture the variance of the target parameter across different materials to be successful. The process of feature engineering can be as simple as collecting the atomic numbers (i.e., the chemical identity of an atom participating in a bond), while for many data sets, more complex constructed features are needed64.

There has been considerable research on feature engineering for predicting NMR parameters in materials science, such as the use of smooth overlap of atomic positions (SOAP) descriptors49,65, the Coulomb matrix34 and Behler-Parrinello symmetry functions (BPSF)66. While these features are capable of describing the variance of geometries for structures with different NMR parameters such as isotropic chemical shifts/shieldings, they were designed to be general in order to be useful in many different types of applications (beyond NMR)67. For specific targets that aim to extract highly localized perturbations (as in NMR spectroscopy), these features may yield suboptimal results. For example, the size of a SOAP kernel scales quadratically with the number of elemental species considered, which makes it slow to process when applied to datasets with a variety of elements. Instead of using complex descriptors like SOAP, we can employ customized, NMR-specific features to streamline and optimize our feature set, for instance, based on the local environment of the target nucleus (e.g., 27Al) under study. This approach not only enhances the model’s performance but also improves computational efficiency.

Here we propose two types of customized features to predict the CQ of the EFG tensor: structural features and elemental features. Structural features are extracted from the local geometry of the target nucleus alone, without taking into consideration any difference between neighboring atomic species. Significant research in solid-state 27Al NMR of aluminum-containing materials has focused on the empirical correlation between NMR-measurable parameters such as CQ and simple descriptive parameters derived from the local geometry68,69,70,71,72. Many of these empirical correlations are particularly useful for recent efforts to build computational predictive models for NMR spectroscopy. For example, Ghose and Tsang68 defined the longitudinal strain and the shear strain to quantify the distortion of the local polyhedron from Platonic solid-like forms (i.e., geometric solids with identical faces). Later, Baur et al.69 suggested a distortion index (DI) to measure the angular distortion of the local geometry. These parameters were shown to correlate strongly with CQ.

It is important to highlight that this is not the only approach to such EFG predictions. Autschbach and coworkers developed a method employing atomic orbitals (AOs) and localized molecular orbitals (MOs) for multiple systems such as 13C, 33S, 14N, 27Al, 93Nb and 99Ru. In addition to empirical correlations, Autschbach et al.13 analyzed the AO contributions to the EFG semi-quantitatively, using an AO contribution model, and quantitatively, with first-principles computations accompanied by analyses of the EFG tensor in terms of localized MOs. Determining ways to capture features via molecular orbitals would make an interesting comparison to the method we employ here, but it is entirely separate and beyond the scope of this work.

Using the DI parameter introduced by Baur as motivation, we implemented this DI parameter in Python along with eight other features derived from local polyhedral geometry: namely the maximum, minimum, standard deviation and mean of the first-order bond lengths (fbl) and bond angles (fba). A full list of structural features and their corresponding abbreviations can be found in Supplementary Information Table S6. Figure 4 shows a correlation “heat map” between the DFT-calculated NMR parameter CQ and these structural features. (Details on “feature importance” can be found in the Supplementary Information, Section VII). The standard deviation of the first-order bond length “std(fbl)” has a high level of correlation with CQ, which illustrates the power of such simple features when used for the right target. The distortion index (DI) has the second-largest correlation with CQ. The std(fbl) and DI characterize the distortion of the local polyhedron from its ideal (Platonic) form (e.g. a perfect octahedron or tetrahedron) in terms of bond length and bond angle, respectively. We found these two features are complementary to each other in the prediction of CQ. More details about feature complementarity can be found in the Supplementary Information, Section IV. The correlation matrix also reveals strong interrelationships among the structural features themselves, suggesting a potential redundancy in the information they convey—commonly referred to as multicollinearity in machine learning. While multicollinearity may detrimentally affect the performance of linear models, it typically does not exert a significant influence on the performance of tree-based models, such as random forests.
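To make the nine structural features concrete, the following sketch computes them for a single coordination polyhedron. It is a hedged illustration, not the authors' implementation: the function name, the use of every neighbor-Al-neighbor pair as a “first-order bond angle”, the population standard deviation, and the form of the distortion index (mean absolute relative deviation of the angles from their mean) are all assumptions made here.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def structural_features(center, neighbors):
    """Nine geometric descriptors of one coordination polyhedron.

    center: (x, y, z) of the Al site; neighbors: list of (x, y, z) of the
    bonded atoms in the same Cartesian frame (periodic images resolved).
    """
    bonds = [tuple(n[k] - center[k] for k in range(3)) for n in neighbors]
    lengths = [math.sqrt(sum(c * c for c in b)) for b in bonds]

    angles = []
    for (b1, l1), (b2, l2) in combinations(zip(bonds, lengths), 2):
        cos_t = sum(p * q for p, q in zip(b1, b2)) / (l1 * l2)
        cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding outside [-1, 1]
        angles.append(math.degrees(math.acos(cos_t)))

    mean_angle = mean(angles)
    # ASSUMED form of the angular distortion index.
    di = sum(abs(a - mean_angle) for a in angles) / (len(angles) * mean_angle)

    return {
        "max(fbl)": max(lengths), "min(fbl)": min(lengths),
        "std(fbl)": pstdev(lengths), "mean(fbl)": mean(lengths),
        "max(fba)": max(angles), "min(fba)": min(angles),
        "std(fba)": pstdev(angles), "mean(fba)": mean(angles),
        "DI": di,
    }
```

For an ideal tetrahedron this returns std(fbl) = 0 and DI = 0. Note that, because the sketch includes every pair of neighbors, an ideal octahedron would still give a nonzero DI from its 180° trans angles; restricting the angle list to adjacent neighbors would be one way to avoid that.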

Fig. 4

Correlation heat map across CQ and structural features. “fbl” and “fba” here are abbreviations of the first-order bond length and the first-order bond angle. The entries are arranged to match the color gradient of the CQ row/column. The number in each block is Pearson’s correlation coefficient (PCC).

Using just the structural features, we trained a random forest model for 27Al CQ, which derives the target value by performing data segmentation with an ensemble of decision trees73. Figure 5 shows the correlation between the DFT-calculated 27Al CQ and the model-predicted CQ. The plot shows that the set of simple structural features can already predict CQ with an R2 of 0.95 and an RMSE of 0.77 MHz. We do note that there are still a number of outliers (better depicted in Supplementary Information Section II), suggesting that characteristics other than structural features can play a significant role in dictating NMR properties. In addition, the majority of the 27Al sites in the dataset are coordinated only by oxygen. To further test the predictive performance of the structural-feature model on data with more atomic variance, we rebalanced the data using the SMOTE (Synthetic Minority Oversampling) technique74 to oversample the minority group (sites with non-oxygen neighbors) and undersample the majority group (sites with only oxygen neighbors), and then compared the model before and after rebalancing. More details on the implementation of SMOTE can be found in the Supplementary Information Section VIII.

Fig. 5

Comparison between random forest-predicted 27Al CQ with DFT-calculated 27Al CQ for aluminum-containing compounds. The random forest model was trained with structural features only (i.e., not with elemental properties). The size of the test set is 1171 individual 27Al sites.

Expanding this analysis further, any EFG tensor is expected to be related not only to the geometry of the local environment but also to the properties of the surrounding atomic species, because it is derived from the electron density distribution. To further improve the prediction of CQ, we therefore need to represent the variation in local chemical composition. We selected twelve elemental properties, such as atomic number and electron affinity (Supplementary Information Table S7), and applied three treatments to those elemental properties, grouping them into three feature sets, shown schematically in Fig. 6 (“Simple statistics…”, “Distance normalized deviation…” and “Pairwise atomic properties…”).

Fig. 6

Illustration of the feature engineering process for element-specific features. A list of atomic properties for each atom within the first coordination shell was collected and then transformed into three sets of features: simple statistics of atomic properties, distance-normalized deviation of those atomic properties, and pairwise atomic properties matrices. * pc and pn are the atomic properties of the central atom, c, and coordinating atom, n; N is the coordination number; rcn is the corresponding bond length. For the pairwise matrices, pn are the atomic properties of the atoms within the first coordination shell and rmn are the inter-atomic distances between atom m and atom n.

We first obtain the twelve elemental properties for each atom in the first coordination shell around the 27Al sites. The first set of features is represented by simple statistics of each of the elemental properties: its maximum, its minimum, standard deviation, and average value. The second set of features measures the differences between the neighbor atoms and the core atom (aluminum in our case).

$$\sum_{n}\frac{\left|p_{c}-p_{n}\right|}{N\cdot r_{cn}}$$
(1)

Here pc and pn are the atomic properties of the central atom (c) and coordinate atoms (n); N is the coordination number; rcn is the corresponding bond length.
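As a worked illustration of Eq. (1), the sketch below evaluates the distance-normalized deviation for one elemental property. The function name and the input layout (a list of property and position pairs) are hypothetical choices made for this example.

```python
import math


def distance_normalized_deviation(p_center, center, neighbors):
    """Eq. (1): sum over neighbors of |p_c - p_n| / (N * r_cn).

    neighbors: list of (p_n, (x, y, z)) pairs for the atoms in the
    first coordination shell; center: (x, y, z) of the central atom.
    """
    n_coord = len(neighbors)  # coordination number N
    total = 0.0
    for p_n, position in neighbors:
        r_cn = math.dist(center, position)  # bond length to this neighbor
        total += abs(p_center - p_n) / (n_coord * r_cn)
    return total


# Atomic number as the property: Al (Z = 13) with four O (Z = 8) neighbors,
# each placed 2.0 length units away (an artificial geometry).
shell = [(8, (2.0, 0, 0)), (8, (-2.0, 0, 0)), (8, (0, 2.0, 0)), (8, (0, 0, 2.0))]
print(distance_normalized_deviation(13, (0, 0, 0), shell))  # 4 * 5 / (4 * 2.0) = 2.5
```

The same function would be called once per elemental property, giving one feature per property for each 27Al site.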

For the third set of features, we draw inspiration from the classic Coulomb matrix. For each of the twelve elemental properties, a matrix considering all the atoms within the first neighbor shell was generated.

$$M_{ij}=\begin{cases}1, & i=j\\ \dfrac{p_{i}p_{j}}{r_{ij}^{2}}, & i\ne j\end{cases}$$
(2)

Like a Coulomb matrix, this feature also considers the pairwise comparison of the selected properties between two atoms in the lattice. One challenge is that when the number of atoms considered is different, the size of the resultant matrix will also be different. In our specific case, the size of the matrix for 4-, 5- and 6-coordinated Al sites will be different. This is troublesome for machine learning predictions because most algorithms require the dimensionality of the feature space to be uniform across all the samples. To solve the problem, we decompose the matrix with singular value decomposition (SVD) and use 5 singular values, the maximum number of possible singular values for our system, as our features instead of the whole matrix.
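The matrix construction of Eq. (2) and its reduction to a fixed-length feature vector can be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical, the matrix is built from whichever atoms are passed in, and zero-padding when fewer than five singular values exist is a choice made here to keep the output length uniform.

```python
import numpy as np


def pairwise_svd_features(properties, positions, n_values=5):
    """Leading singular values of the Eq. (2) matrix for one property,
    truncated or zero-padded to n_values entries."""
    p = np.asarray(properties, dtype=float)
    xyz = np.asarray(positions, dtype=float)
    # r_ij for every pair of atoms (zero on the diagonal).
    r = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    np.fill_diagonal(r, 1.0)                 # placeholder, avoids division by zero
    m = np.outer(p, p) / r**2                # off-diagonal: p_i * p_j / r_ij^2
    np.fill_diagonal(m, 1.0)                 # diagonal: M_ii = 1
    s = np.linalg.svd(m, compute_uv=False)   # singular values, descending order
    out = np.zeros(n_values)
    k = min(n_values, s.size)
    out[:k] = s[:k]
    return out
```

With this layout, 4-, 5- and 6-coordinate sites all yield a vector of the same length, so the features from different coordination environments can be stacked into one table.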

We retrained the random forest model with both structural and elemental features (structural + elemental), which improves the model accuracy to R2 = 0.98 and RMSE = 0.61 MHz for CQ. To further assess the performance, we also compared our models with a benchmark using SOAP features49,65. The SOAP model was also trained with a random forest algorithm on the same set of data as the other two models; the only difference is the features used for training. Instead of our structural and elemental features, it uses SOAP features generated by the open-source package DScribe63, which results in a very large feature set (4,163,280). Figure 7 shows a performance comparison between models based on our proposed features (structural, structural + elemental) and the model based on SOAP. As shown in Fig. 7a, both of our proposed feature sets perform significantly better than SOAP, irrespective of the sample size. Structural + elemental features perform better than structural features alone as the sample size grows, which gives confidence in using this combination of features for very large datasets. Figure 7b-d shows the correlation plots between the VASP-calculated and the machine learning-predicted |CQ| values for the three models. The structural + elemental model significantly reduces the number of extreme outliers that are evident in both Fig. 7c and d. The SOAP model achieves a usable performance benchmark of R2 = 0.92 and RMSE = 0.97 MHz.

Fig. 7

Comparison of the random forest models trained on three different feature sets (structural + elemental, structural, and SOAP features). (a) Learning curve of model performance (test RMSE) versus sample size for all three models. (b-d) Correlations between random forest-predicted and VASP-calculated 27Al |CQ| values for aluminum-containing compounds. The random forest model was trained with: (b) structural and elemental features, (c) structural features only, and (d) SOAP features.

It is noteworthy that, despite the significantly increased computational cost of the SOAP features, this method lacks the accuracy of our straightforward feature set. Most importantly, the SOAP features produce some strong outliers. Consequently, we show that a simple set of features customized for a specific problem, such as NMR parameter prediction, can outperform universal features, because it excludes unnecessary information that could significantly decrease the performance of the model in terms of both efficiency and accuracy.

Conclusions

By studying the correlation between experimentally measured 27Al NMR parameters and DFT calculated values with a relatively large benchmarking set, we can confirm that DFT calculations are accurate in predicting isotropic chemical shielding σiso and quadrupolar coupling constant |CQ| for crystalline materials that contain aluminum species. Similar to our previous benchmarking effort on spin ½ nuclei for 29Si, DFT predictions of asymmetry parameters (both ηCS and ηQ) are shown to be more prone to error due to the sensitivity of this parameter to slight variations in local geometry and the difficulty of determining η experimentally with precision.

Having shown DFT’s accuracy at predicting 27Al NMR parameters, we built a simple machine learning model to predict 27Al CQ values based on a large VASP-calculated NMR dataset of 1681 aluminum-containing solid materials. The structural and elemental features that we selected were proven to be effective in predicting CQ, likely by capturing the variation of local environments to which experimentally-measured NMR parameters are very sensitive. It is surprising for us to find that among all the features, the pure geometrical variations such as that of bond lengths are the dominant features for CQ prediction. This demonstration shows the possibility of building simple but effective features for the prediction of materials’ properties, instead of using larger universal features.

These results also give a better understanding of the relationship between local geometry and SSNMR spectra: specifically, SSNMR spectra of quadrupolar nuclei are determined primarily by distortions of the local geometry. These data are publicly available for further investigation via the Materials Project. Our final model proved effective in predicting |CQ| for 4-, 5- and 6-coordinate aluminum sites, with R2 = 0.98 and RMSE = 0.61 MHz. This accuracy is comparable with that of DFT calculations versus experiment (RMSE = 0.70 MHz for VASP), making this machine learning method a fast and agile complement to DFT calculations.

Methods

Data sets

Benchmarking data: We collected experimental 27Al NMR parameters from the literature on 56 different crystalline materials, accounting for 105 unique sites, including a few repeated structures with independent measurements. The distribution of coordination numbers of the 27Al sites is: 41 4-coordinate, 9 5-coordinate, and 55 6-coordinate. All the reported parameters were collected via SSNMR experiments employing either magic-angle spinning (MAS) or multiple-quantum MAS (MQMAS). All of the structures were calculated with both VASP and CASTEP.

Machine learning data: For machine learning model training, a larger dataset of DFT-computed 27Al NMR parameters was constructed by VASP calculation. The dataset is composed of 1681 aluminum-containing structures which correspond to 8081 27Al sites (5852 after removing duplicates). The coordinating environment of the 27Al sites was confined to 4-coordinate (4696 sites), 5-coordinate (202 sites), or 6-coordinate aluminum (3183 sites). There are 104 different compositions in terms of the first coordination sphere (e.g., neighboring heteronuclear species) of 27Al sites in the dataset. The histogram of the 10 most commonly-found compositions is reported in the Supplementary Information Figure S6(b). All the crystal structures were obtained from the Materials Project and were geometry optimized before NMR calculations.

DFT details

DFT calculations with CASTEP were performed within the Perdew-Burke-Ernzerhof (PBE) generalized gradient approximation (GGA) formulation of the exchange-correlation functional. These were performed in two steps: an initial geometry optimization in which the lattice was allowed to adjust, followed by an NMR calculation on the relaxed structure. On-the-fly ultra-soft pseudopotentials were used as an approximation of nuclear and core electron interactions. Convergence tests were performed on γ-LiAlO2 to find optimal energy cutoffs and k-points. See Supplementary Information for more details. It was determined that an energy cutoff of 750 eV with a Monkhorst-Pack grid of 5 × 4 × 4 was sufficient to converge the NMR calculations.

DFT calculations were also performed using the projector augmented wave (PAW) method48,75 as implemented in the Vienna Ab Initio Simulation Package (VASP)76,77,78 within the PBE-GGA formulation of the exchange-correlation functional79. A plane-wave cutoff of 520 eV and a uniform k-point density of approximately 1,000/atom were employed. We note that the computational and convergence parameters were chosen in compliance with the settings used in the Materials Project27 to enable direct comparisons with the large set of available Materials Project data.

Machine learning details

The RandomForestRegressor object from the Scikit-Learn library (https://scikit-learn.org/stable/) was employed to construct the random forest model. A randomly selected 20% of the overall dataset (1171 sites) served as the test set, while the remaining 80% (4681 sites) was used for training. The model was trained on the training set, and its performance was then assessed on the 20% test set. Hyperparameter optimization was conducted via the RandomizedSearchCV method provided by Scikit-Learn, using 5-fold cross-validation across 100 iterations. For additional information regarding the hyperparameter search range, please refer to the code repository at github.com/wushanyun64/27Al_CQ_prediction.
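The training procedure described above can be sketched with Scikit-Learn as follows. This is a minimal sketch, not the authors' script: the function name and the hyperparameter search space are hypothetical (the range actually used is given in the repository), and the random seed is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# HYPOTHETICAL search space, for illustration only.
PARAM_DISTRIBUTIONS = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 10, 20, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", 1.0],
}


def train_cq_forest(X, y, n_iter=100, cv=5, seed=0):
    """80/20 split, randomized hyperparameter search with k-fold
    cross-validation on the training set, then test-set RMSE and R2."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=seed),
        param_distributions=PARAM_DISTRIBUTIONS,
        n_iter=n_iter,   # number of sampled hyperparameter settings
        cv=cv,           # number of cross-validation folds
        random_state=seed,
    )
    search.fit(X_train, y_train)  # refits the best setting on all training data
    y_pred = search.best_estimator_.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    return search.best_estimator_, rmse, float(r2_score(y_test, y_pred))
```

Here X would be the table of structural (and, optionally, elemental) features with one row per 27Al site, and y the corresponding DFT-calculated CQ values.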