Introduction

Molecular polarizability \({\boldsymbol{\alpha }}\), the measure of an electron cloud’s response to an external electric field, plays a fundamental role in determining a material’s dielectric and optical properties1,2,3,4,5. Quantum mechanical (QM) methods such as density functional theory (DFT) allow accurate calculation of polarizabilities for small molecules and solids. However, for large-scale systems such as proteins and polymers, calculating polarizabilities remains a daunting task, mainly because the computational cost scales superlinearly with the system size6. Traditional empirical methods such as the bond polarizability model7 and atom-dipole interaction models8 face serious challenges in accuracy and reliability9. Fragment-based methods, which partition molecular systems into smaller subsystems for analysis10,11,12,13,14, improve scalability but require significant expertise in defining partitions15 and still face resource constraints for large-scale applications16.

Machine learning (ML)-based polarizability models can potentially address this challenge because they offer a good balance between accuracy and efficiency17,18,19,20,21,22,23,24. The efficacy of ML-based polarizability models for molecules and crystalline solids has been demonstrated in previous studies17,25,26,27,28. However, predicting polarizabilities for proteins and polymers remains challenging due to the significant computational cost and effort required to generate accurate polarizability training data using high-precision DFT calculations. Therefore, reducing the cost of data set preparation for model training is crucial for the efficient modeling of large-scale systems.

Utilizing small cluster structures extracted from large-scale systems as training data for ML-based polarizability models may provide a viable strategy. Previous studies have demonstrated the feasibility of simulating bulk systems with atom-centered machine learning force fields (MLFFs) trained on small fragment data, as MLFFs only rely on atomic energies with a local environment dependence29,30,31,32. In principle, this should also be applicable to atom-centered ML-based polarizability models, as they infer molecular polarizability from individual atomic contributions in the same way. However, the allocation of atomic polarizabilities is neither unique nor rigorous, since current ML-based polarizability models are typically trained to predict molecular polarizabilities only. It has been reported that if only the global quantity is rigorously defined during training, the decomposition of the global quantity into local contributions by ML models can take place in numerous different ways29,33,34. For ML-based polarizability models, the atomic polarizabilities that the models allocate internally can be flexible and arbitrary and, in some instances, fail to characterize the polarization of atoms correctly. This inevitably introduces uncertainty into the model predictions and thus affects the transferability of ML-based polarizability models. Further research is therefore needed to develop robust methodologies that address uncertainties in atomic polarizability predictions, thereby enhancing the reliability and transferability of ML-based polarizability models from small clusters to large systems without target data.

The tensorial neuroevolution potential (TNEP) models for molecular polarizability, proposed and implemented in our previous work28, have shown high accuracy and extraordinary efficiency and were successfully applied to liquid water and perovskite BaZrO3. In this work, the transferability of TNEP models trained on cluster data to condensed-phase systems was investigated, after which the atomic polarizability constraint was manually introduced into the TNEP framework. First, an original TNEP model was trained on cluster data truncated from bulk systems of n-heneicosane with a maximum cutoff radius of 7 Å, which was determined by the convergence test based on the QM method. Subsequently, a constrained TNEP model (referred to as the TNEP-C model) was trained on the same training data set augmented with atomic polarizabilities derived from the semi-empirical QM method (referred to as GFN2-computed atomic polarizabilities). Test data sets, including cluster data of varying sizes, were constructed to evaluate the extrapolative performance of these two models. Comparisons of schemes for partitioning molecular polarizability into atomic contributions from these two TNEP models and the QM method using the Hirshfeld partitioning scheme were conducted to further elucidate the key improvement by incorporating the atomic polarizability constraint into the TNEP model. Finally, the performance of these two models on bulk systems was also investigated using committee error estimates (CEEs) as the indicator.

Results

Performance of the original TNEP model

The extrapolative performance of the original TNEP model was evaluated on a series of test data sets, including structures of varying sizes (truncated from bulk systems of n-heneicosane with cutoff radii ranging from 6 to 13 Å in 1 Å increments). The test data sets were labeled “R6~R13”, where “R” stands for the cutoff radius used in constructing each data set. Performance metrics, including root mean square error (RMSE) and the coefficient of determination (\({R}^{2}\)), were calculated to evaluate the model’s accuracy in predicting per-atom diagonal and off-diagonal elements of molecular polarizabilities for configurations in test data sets, using DFT reference values as the benchmark standard.
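The two metrics can be illustrated with a minimal sketch. The function names and the toy numbers below are hypothetical and are not taken from this work; per-atom values are assumed here to be obtained by dividing each molecular polarizability component by the number of atoms in the configuration.

```python
import numpy as np

def rmse(pred, ref):
    """Root mean square error between predicted and reference values."""
    pred, ref = np.asarray(pred, dtype=float), np.asarray(ref, dtype=float)
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def r_squared(pred, ref):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    pred, ref = np.asarray(pred, dtype=float), np.asarray(ref, dtype=float)
    ss_res = np.sum((ref - pred) ** 2)
    ss_tot = np.sum((ref - ref.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def per_atom(alpha_mol, n_atoms):
    """Divide molecular polarizability components by the atom count
    (assumed normalization for the per-atom quantities)."""
    return np.asarray(alpha_mol, dtype=float) / np.asarray(n_atoms, dtype=float)

# Hypothetical toy values (a.u.) for three configurations.
n_atoms = [10, 20, 30]
ref_xx = per_atom([100.0, 210.0, 330.0], n_atoms)
pred_xx = per_atom([101.0, 208.0, 333.0], n_atoms)
print(rmse(pred_xx, ref_xx), r_squared(pred_xx, ref_xx))
```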

While the original TNEP model achieved high consistency with DFT reference values in predicting molecular polarizabilities for small data sets such as R6 and R7, the prediction errors increased significantly when extrapolating to much larger clusters. As shown in Fig. 1, a systematic increase in RMSEs of the per-atom diagonal elements of molecular polarizability (\({\bar{\alpha }}_{{\mathit{xx}}}^{{\rm{TNEP}},{\rm{mol}}},{\bar{\alpha }}_{{\mathit{yy}}}^{{\rm{TNEP}},{\rm{mol}}},{\bar{\alpha }}_{{\mathit{zz}}}^{{\rm{TNEP}},{\rm{mol}}}\)) was observed for configurations exceeding the size of those in the training data set, accompanied by a corresponding decrease in \({R}^{2}\) values. In contrast, the predictions for per-atom off-diagonal elements of molecular polarizabilities (\({\bar{\alpha }}_{{\mathit{xy}}}^{{\rm{TNEP}},{\rm{mol}}},{\bar{\alpha }}_{{\mathit{yz}}}^{{\rm{TNEP}},{\rm{mol}}},{\bar{\alpha }}_{{\mathit{xz}}}^{{\rm{TNEP}},{\rm{mol}}}\)) remained relatively stable, with slight increases in RMSEs and minor decreases in \({R}^{2}\).

Fig. 1: Performance of the original TNEP model for predicting the per-atom molecular polarizabilities of configurations in test data sets.

Performance of the original TNEP model for predicting per-atom diagonal and off-diagonal elements of molecular polarizabilities of configurations in R6~R13 test data sets, evaluated by a RMSEs and b \({R}^{2}\) values.

Parity plots of the diagonal elements of the predicted molecular polarizabilities versus the DFT reference values in Fig. 2 confirmed the large prediction errors by the original TNEP model for clusters much larger than those in the training data set. As the size of molecular clusters increased, the diagonal elements of molecular polarizabilities predicted by the TNEP model (\({\alpha }_{{\rm{diag}}}^{{\rm{TNEP}},{\rm{mol}}}\)) gradually deviated from the DFT reference values (\({\alpha }_{{\rm{diag}}}^{{\rm{ref}},{\rm{mol}}}\)), with significant overestimations for large clusters such as those in R12 and R13 test data sets. Conversely, the off-diagonal components predicted by the TNEP model were in good correlation with those calculated by the DFT method (Supplementary Fig. 1).

Fig. 2: Parity plots of the diagonal elements of the molecular polarizabilities predicted by the original TNEP model versus the DFT reference values for configurations in test data sets.

Parity plots of the diagonal elements of the molecular polarizabilities predicted by the original TNEP model versus the DFT reference values for configurations in a R6 test data set, b R7 test data set, c R8 test data set, d R9 test data set, e R10 test data set, f R11 test data set, g R12 test data set, and h R13 test data set.

The opposite trends of prediction errors for diagonal and off-diagonal elements of molecular polarizability tensors may originate from their intrinsic characteristics. Diagonal elements of molecular polarizability tensors in ML-based polarizability models are typically divided into local atomic contributions modulated by the chemical environments. However, the schemes by which the TNEP model assigns atomic contributions are closely related to the differences in the configurational features of the training and test data sets, as discussed in the next section. In contrast, for an isotropic system, the off-diagonal elements of polarizability tensors are mainly affected by molecular symmetry such as rotational operations35,36, which weakens the influence of differences in configurational features between the training and test data sets.

Analysis of key factors affecting the transferability of the original TNEP model

The significant discrepancy between predicted diagonal elements of molecular polarizabilities and reference values in Fig. 2 suggests that the original TNEP model tends to uniformly overestimate the polarizabilities of large clusters unseen in the training data set. Because TNEP models are atom-centered, this discrepancy in molecular polarizabilities calls for an inspection of the local atomic contributions to the total polarizability. Atomic polarizability in TNEP models is calculated from individual artificial neural networks (ANNs) that take as inputs the local chemical environments consisting of all atoms inside a cutoff sphere. Therefore, the differences in atomic environments between the training and test data sets (inputs to the ANNs) and the distributions of atomic polarizabilities (outputs of the ANNs) were both analyzed to investigate the main factors that may affect the validity of the TNEP model trained on small molecules when applied to larger systems.

For the inspection of the inputs of ANNs, the similarity in atomic environments across different data sets was first compared by using the descriptor space analysis for carbon atoms and hydrogen atoms in the training and test data sets. Supplementary Figs. 2 and 3 suggested that the projections of the training data set almost entirely covered those of R6~R13 test data sets. This revealed that transferring from the training data set to test data sets should not introduce significant changes to the diversity of local atomic environments.

For the inspection of the outputs of ANNs, atomic polarizability distributions for hydrogen and carbon atoms in clusters of varying sizes centered on a certain carbon atom were calculated by the original TNEP model and the QM method using the Hirshfeld partitioning scheme, respectively. Note that only averaged isotropic polarizability (\({\bar{\alpha}}_{{\rm{iso}}}^{{\rm{atomic}}}=({\alpha }_{{\mathit{xx}}}^{{\rm{atomic}}}+{\alpha }_{{\mathit{yy}}}^{{\rm{atomic}}}+{\alpha }_{{\mathit{zz}}}^{{\rm{atomic}}})/3\)) was considered here, and QM calculations were limited to clusters with cutoff radii ranging from 3 to 9 Å due to the computational costs. The results from the QM method using the Hirshfeld partitioning scheme (referred to as QM-based atomic polarizabilities) were scaled to ensure consistency with the reference data for the TNEP model at the computational level. Fig. 3 shows that the original TNEP model tends to allocate polarizabilities to hydrogen and carbon atoms in a manner that differs markedly from the QM-based atomic polarizabilities. Specifically, hydrogen atoms were assigned excessively high values, while carbon atoms were assigned substantially lower values. As the size of the clusters increased, the polarizabilities assigned to carbon and hydrogen atoms by the original TNEP model gradually approached a common value, indicating a diminishing capability of the model to differentiate between elements. In addition, the spread of the predicted atomic polarizabilities generally increased (the anomaly observed in clusters with a cutoff radius of 3 Å may be due to the limited number of atoms in the cluster). This indicated that, for atoms of a given type, new atomic environments may have emerged that the model failed to represent, and that the proportion of atoms in such environments increased as the clusters expanded.
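As a minimal illustration of the averaged isotropic polarizability defined above, the sketch below takes one third of the trace of each atomic tensor. The function name and the two toy tensors are hypothetical and do not correspond to values from this work.

```python
import numpy as np

def isotropic_polarizability(alpha):
    """Return (alpha_xx + alpha_yy + alpha_zz) / 3 for each 3x3 tensor.

    `alpha` has shape (..., 3, 3), for example (n_atoms, 3, 3).
    """
    alpha = np.asarray(alpha, dtype=float)
    return np.trace(alpha, axis1=-2, axis2=-1) / 3.0

# Hypothetical atomic tensors (a.u.), not values from the paper.
atomic = np.array([
    np.diag([9.0, 10.0, 11.0]),  # a carbon-like atom
    np.diag([2.0, 3.0, 4.0]),    # a hydrogen-like atom
])
iso = isotropic_polarizability(atomic)
print(iso)
```

Only the diagonal elements enter this quantity, so any off-diagonal information in the atomic tensors is discarded by construction.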

Fig. 3: Comparisons of atomic polarizability distributions calculated by the QM method using the Hirshfeld partitioning scheme and the original TNEP model for clusters of varying sizes centered on a certain carbon atom.

Comparisons of distributed atomic polarizabilities of a hydrogen and b carbon atoms in clusters of varying sizes centered on a certain carbon atom calculated by the QM method using the Hirshfeld partitioning scheme and the original TNEP model.

This limitation arose because the original TNEP model only accounted for the loss function of total polarizability during the training process, which left the atomic polarizability contributions internally unconstrained. The scheme for decomposing total polarizability into atomic contributions is sensitive to the configurational features of training data sets, including chemical compositions and proportions of atoms in various chemical environments. Table 1 shows that the stoichiometric ratios and the fractions of carbon atoms in bulk-like environments rose as the size of the configurations in data sets increased. First, the differences in the stoichiometric ratios between the training and the test data sets hinder the model’s ability to distinguish between chemical elements in such a transfer, similar to findings in previous work on MLFFs29. Second, the limited fraction of carbon atoms in bulk-like environments within the training data set leaves the original TNEP model undertrained in representing bulk-like atomic environments. As a result, when applied to much larger structures, the TNEP model keeps applying physically unreasonable partitioning schemes, leading to an overall overestimation of molecular polarizabilities.

Table 1 Results of configurational features including stoichiometric ratios and fractions of carbon atoms in bulk-like environments inside the cutoff radius of the symmetry functions for the training and test data sets, and the periodic system of n-heneicosane

Performance of the TNEP-C model

Since the original TNEP model failed to give a reliable decomposition of molecular polarizabilities among the constituent atoms, GFN2-computed atomic polarizabilities were introduced into the training process of the TNEP model as constraints. Before integrating, a systematic error correction of atomic polarizabilities was necessary according to the previous work37, and a quadratic fit with the zero intercept provided the best match between the GFN2-computed and DFT-computed polarizabilities of the labeled data (Supplementary Fig. 4).
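A zero-intercept quadratic fit of this kind can be sketched as an ordinary least-squares problem. The sketch assumes the GFN2-computed values are the independent variable and the DFT-computed values the target; the function names and the synthetic data are hypothetical, and the fitted coefficients of this work are not reproduced here.

```python
import numpy as np

def fit_quadratic_zero_intercept(x, y):
    """Least-squares fit of y ~ a*x**2 + b*x with no constant term."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    design = np.column_stack([x ** 2, x])  # columns: x^2 and x only
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs  # array([a, b])

def apply_correction(x, coeffs):
    """Map raw values onto the reference scale with the fitted curve."""
    a, b = coeffs
    x = np.asarray(x, dtype=float)
    return a * x ** 2 + b * x

# Synthetic data generated from a known curve, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 0.02 * x ** 2 + 0.9 * x
coeffs = fit_quadratic_zero_intercept(x, y)
corrected = apply_correction(x, coeffs)
print(coeffs)
```

Omitting the constant column from the design matrix is what enforces the zero intercept, so a vanishing raw polarizability maps to a vanishing corrected value.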

The extrapolative performance of the TNEP-C model demonstrated advantages over the original TNEP model. While maintaining the performance for the training data set (Supplementary Fig. 5) and the R6~R7 test data sets, the RMSEs of the per-atom diagonal elements of molecular polarizabilities predicted by the TNEP-C model decreased substantially for large data sets from R8 to R13 (Fig. 4 and Supplementary Table 1). The parity plots of the diagonal and off-diagonal elements of the predicted molecular polarizabilities versus the DFT reference values shown in Supplementary Figs. 6 and 7 also confirmed that by learning GFN2-computed atomic polarizabilities, the TNEP-C model exhibited improved accuracy on large clusters.

Fig. 4: Performance of the TNEP-C model for predicting the per-atom molecular polarizabilities of configurations in test data sets.

Performance of the TNEP-C model for predicting per-atom diagonal and off-diagonal elements of molecular polarizabilities of configurations in R6~R13 test data sets, evaluated by a RMSEs and b \({R}^{2}\) values. Dashed lines represent the results from the original TNEP model and the solid lines represent the results from the TNEP-C model.

Atomic polarizability distributions for hydrogen and carbon atoms in clusters of varying sizes centered on a certain carbon atom were also calculated to investigate the improvement of the TNEP-C model. As shown in Fig. 5a, b, the results from the TNEP-C model were in good agreement with QM-based atomic polarizabilities and remained robust when extrapolating. Taking a cluster with a cutoff radius of 6 Å as an example, distributed atomic polarizabilities were plotted as polarizability ellipsoids and atoms were colored based on their contributions to molecular polarizability. The values of the polarizability tensors for hydrogen atoms should be substantially smaller than those of carbon atoms, due to their smaller electronic population (Fig. 5c). However, the original TNEP model tended to allocate comparable values to carbon and hydrogen atoms (Fig. 5d). In contrast, the TNEP-C model can correctly differentiate between carbon and hydrogen atoms (Fig. 5e). This indicated that the TNEP-C model has embedded a physics-compliant strategy for partitioning total polarizabilities into atomic contributions. Consequently, the constrained model delivered enhanced robustness in extrapolation compared with the unconstrained counterpart, as evidenced by reduced errors when predicting diagonal elements of molecular polarizabilities for larger clusters beyond the training domain.

Fig. 5: Comparisons of atomic polarizability distributions calculated by the QM method using the Hirshfeld partitioning scheme, the original TNEP model and the TNEP-C model for clusters of varying sizes centered on a certain carbon atom.

Comparisons of distributed atomic polarizabilities of a hydrogen and b carbon atoms in clusters of varying sizes centered on a certain carbon atom calculated by the QM method using the Hirshfeld partitioning scheme, the original TNEP model and the TNEP-C model; graphical representation of distributed atomic polarizabilities calculated by c the QM method using the Hirshfeld partitioning scheme, d the original TNEP model and e the TNEP-C model for a cluster with a cutoff radius of 6 Å.

Extrapolating the original TNEP and TNEP-C models to bulk systems

In addition to the evaluation on large molecular clusters, we also assessed the transferability of the original TNEP and TNEP-C models on bulk systems. The bulk data set contains 125,000 configurations sampled from the MD trajectories of n-heneicosane with an interval of 2 fs. Since calculations of polarizabilities at the DFT level for such a large system are infeasible, alternative uncertainty estimation metrics are required to evaluate the prediction errors instead of RMSE. Here, we employed the CEE algorithm38, which quantifies the generalization error in the form of the committee disagreement. CEE has previously been used to evaluate the performance of ML models39,40,41,42, including TNEP models43. In this work, five instances of both the original TNEP and TNEP-C models were independently trained. The corresponding CEEs were computed on the test data set and compared with RMSEs for validation. The results indicated that, while CEE tended to underestimate RMSE, the overall consistency between these two metrics suggested that CEE can serve as a reasonable approximation for evaluating prediction errors on systems that pose challenges to the DFT methods. This correlation has also been reported in previous studies43.
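The idea of committee disagreement can be sketched as follows. This sketch assumes the disagreement is measured as the sample standard deviation of the predictions across committee members; the exact definition used by the CEE algorithm of ref. 38 may differ, and the function name and toy predictions are hypothetical.

```python
import numpy as np

def committee_error_estimate(predictions):
    """Committee mean and disagreement per sample.

    `predictions` has shape (n_models, n_samples), one row per committee
    member. The disagreement is taken as the sample standard deviation
    across members (ddof=1), which is an assumption of this sketch.
    """
    predictions = np.asarray(predictions, dtype=float)
    mean = predictions.mean(axis=0)
    spread = predictions.std(axis=0, ddof=1)
    return mean, spread

# Five hypothetical committee members predicting two samples (a.u. per atom).
preds = np.array([
    [1.0, 2.0],
    [1.2, 2.0],
    [0.8, 2.0],
    [1.1, 2.0],
    [0.9, 2.0],
])
mean, cee = committee_error_estimate(preds)
print(mean, cee)
```

No reference values are needed for this estimate, which is why it remains available for bulk configurations where DFT labels cannot be computed.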

Significant enhancements were observed in the predictions for the per-atom diagonal elements of molecular polarizabilities. Figure 6a shows that, for the original TNEP model, the CEE can reach as high as 0.1 a.u. per atom when extrapolating to the bulk systems, whereas for the TNEP-C model this error is reduced to 0.03 a.u. per atom (Fig. 6b). This result further demonstrates that, if only the molecular polarizabilities are fitted, TNEP models can achieve high accuracy on the training data set with several internal schemes to partition total polarizabilities into atomic contributions. However, the atom-specific polarizabilities assigned by the original TNEP model can vary greatly with changes in configurational features of training data sets, and sometimes these atomic predictions are physically inconsistent44. This uncertainty can lead to poor transferability across different test data sets, and can be substantially reduced via the implementation of constraints on atomic polarizabilities. In contrast, the original TNEP and TNEP-C models exhibited comparable accuracy in predicting the off-diagonal elements of molecular polarizabilities (Fig. 6c, d), even when transferring to the bulk systems. This is because the isotropic GFN2-computed atomic polarizabilities incorporated into the TNEP-C model do not impose constraints on the off-diagonal components. Consequently, the predictions for the off-diagonal components remain primarily governed by the original TNEP model’s inherent capability, which is less affected by the variations in the system sizes and configurational features compared with the diagonal elements.

Fig. 6: Comparisons of prediction errors of per-atom molecular polarizabilities for the test data sets and the bulk data set of n-heneicosane obtained by the original TNEP model and the TNEP-C model.

Comparisons of prediction errors for per-atom (a, b) diagonal and (c, d) off-diagonal elements of molecular polarizabilities for the test data sets and the bulk data set of n-heneicosane, obtained by the original TNEP model and the TNEP-C model, respectively.

Room for further improvement remains in this approach. For instance, while the GFN2-computed atomic polarizabilities are generally reasonable and readily obtainable, the current model may be limited when applied to systems with strong anisotropic effects due to the absence of contributions from off-diagonal components. Introducing atomic polarizabilities obtained from the partitioning of ground-state and field-perturbed electron densities of a molecular system, such as those from the quantum theory of atoms in molecules (QTAIM)45,46,47, as training constraints in TNEP models may potentially yield better results, but it also poses a challenge in terms of the computational costs.

Discussion

We demonstrated that incorporating atomic polarizability constraints into the TNEP model can significantly enhance its transferability, enabling accurate predictions of polarizabilities for condensed-phase systems based only on small molecular cluster data. By integrating atomic polarizabilities derived from semi-empirical QM calculations, the TNEP-C model learned a physically grounded partitioning scheme to divide atomic contributions, especially for configurations with increased stoichiometric ratios and proportions of atoms in bulk-like environments. Consequently, the TNEP-C model showed boosted performance in extrapolation, with largely reduced errors in predicting polarizabilities of large clusters and bulk systems.

In principle, this approach can be extended to organic systems with higher complexity at the chemical composition level, such as systems that include elements like oxygen and nitrogen. For systems involving metallic atoms, more sophisticated methods for assigning atomic polarizabilities (like QTAIM) will be essential and necessitate further testing and validation to ensure accuracy and reliability. Importantly, this methodology could also be applied to other atom-centered ML-based polarizability models, thereby providing a robust strategy for scalable, data-efficient predictions of molecular polarizability in complex condensed-phase materials. This would potentially pave the way for more insightful simulations of molecular properties, particularly in understanding electronic and spectroscopic characteristics of various materials.

Methods

The training data set was composed of molecular clusters truncated from the configurations extracted from the MD trajectories of n-heneicosane (C21H44) with varying cutoff radii from 3 to 7 Å in 1 Å increments. To improve the efficiency of the data set construction, we started with the smallest clusters truncated with a cutoff radius of 3 Å, and iteratively supplemented the clusters with larger cutoff radii through the farthest point sampling (FPS) method. This section is organized as follows. First, the method to construct the clusters from the bulk system is introduced, followed by the computational details of calculating molecular polarizabilities for the initial training data set. Subsequently, the principle of the original TNEP model is briefly outlined. The explorations of larger clusters through the FPS method are then detailed, and the section ends with the calculations of atomic polarizabilities and the refactoring of TNEP models for atomic polarizabilities.

Construction of molecular clusters

An orthogonal structure of n-heneicosane with a bilayer of 6 × 8 unit cells (6240 molecules) was constructed as the initial configuration for MD simulations48,49,50. The temperature was held at 301 K using the Nosé–Hoover thermostat51,52 with a time constant of 0.1 ps, and the pressure was maintained at 1 bar using the Parrinello-Rahman barostat53. The MD simulation was performed using the COMPASS force field54 in LAMMPS55. Snapshots were dumped every 25 ps from an MD production run of 250 ps, and a total of 11 snapshots were obtained.

Molecular clusters of different sizes were truncated from the bulk structures of n-heneicosane by extracting atoms within cutoff radii from 3 to 13 Å in 1 Å increments surrounding each carbon atom. Carbon atoms outside a given radius were kept if the valency was situated on a hydrogen or bonded to two carbon atoms within the radius, as illustrated in Fig. 7. Subsequently, free valencies were saturated with hydrogen atoms56. Constrained optimizations were performed for every cluster using the COMPASS force field to adjust the positions of hydrogen atoms while keeping the carbon skeleton fixed.
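The first step of this procedure, selecting the atoms within a cutoff radius of a central atom, can be sketched as below. The sketch assumes an orthorhombic periodic box and the minimum-image convention (valid when the cutoff is smaller than half the shortest box edge); it does not implement the retention rule for boundary carbon atoms or the hydrogen saturation, and the function name and toy coordinates are hypothetical.

```python
import numpy as np

def atoms_within_cutoff(positions, box, center, cutoff):
    """Indices of atoms within `cutoff` of atom `center`.

    Uses the minimum-image convention for an orthorhombic periodic box
    with edge lengths `box` (same length unit as `positions`).
    """
    positions = np.asarray(positions, dtype=float)
    box = np.asarray(box, dtype=float)
    delta = positions - positions[center]
    delta -= box * np.round(delta / box)  # wrap to the nearest periodic image
    dist = np.linalg.norm(delta, axis=1)
    return np.flatnonzero(dist <= cutoff)

# Toy coordinates in a 10 Å cubic box, for illustration only.
positions = [[0.5, 0.0, 0.0],
             [9.5, 0.0, 0.0],   # 1 Å from atom 0 across the boundary
             [5.0, 0.0, 0.0],   # 4.5 Å away, outside a 3 Å cutoff
             [2.0, 0.0, 0.0]]   # 1.5 Å away
idx = atoms_within_cutoff(positions, box=[10.0, 10.0, 10.0], center=0, cutoff=3.0)
print(idx)
```

The returned index list includes the central atom itself, since its distance to itself is zero.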

Fig. 7: Schematic representation of molecular clusters derived from the bulk structure of n-heneicosane.

The central carbon atoms of the clusters are highlighted in blue and orange, respectively, and the red sticks represent the atoms artificially introduced to saturate the free valences.

Calculations of molecular polarizability

Molecular polarizabilities were calculated for each structure in the training data sets at the DFT level by solving the coupled perturbed self-consistent field equations using the GTH-PBE pseudopotential and the DZVP-MOLOPT-SR-GTH basis set (400 Ry cutoff, Γ point)57,58,59. All DFT calculations were carried out using the Gaussian Plane Waves (GPW) method in CP2K60,61.

Principle of the original TNEP model

The TNEP model for predicting tensorial properties is developed based on the NEP framework, which is implemented in the GPUMD package62. In the NEP-based potential energy surface (PES) model, the total energy of a system is given by the sum of atomic site energies \({U}_{i}\), which are computed using individual ANNs and depend on the local atomic chemical environments. Following the work of Behler and Parrinello63, the input layer consists of high-dimensional descriptor vectors constructed from Chebyshev and Legendre polynomials64,65. Explicit expressions of the descriptor vector and more detailed information on the NEP-based PES model are given in refs. 66,67,68,69.

The molecular polarizability tensor of a given structure with \(N\) atoms is a second-order symmetric tensor with nine components, of which six are independent28. Components of \({\boldsymbol{\alpha }}\) can be expressed as:

$${\alpha }_{\mu \nu }=\mathop{\sum}\limits_{i}^{N}{U}_{i}{\delta }_{\mu \nu }-\mathop{\sum}\limits_{i}^{N}\mathop{\sum}\limits_{j\ne i}{r}_{{ij}}^{\mu }\frac{\partial {U}_{i}}{\partial {r}_{{ij}}^{\nu }}$$
(1)

where ν refers to the direction of the applied external electric field (e.g., \(x,{y},{z}\) in Cartesian coordinates), while μ denotes the direction of the induced dipole moment (e.g., \(x,{y},{z}\) in Cartesian coordinates). The polarizability tensor component \({\alpha }_{\mu \nu }\) quantifies the linear response between the external electric field applied in the ν-direction and the induced dipole moment in the μ-direction. When \(\mu =\nu\), \({\alpha }_{\mu \nu }\) corresponds to the diagonal element of the molecular polarizability tensor, while \(\mu \ne \nu\) represents the off-diagonal component. \({\delta }_{\mu \nu }\) is the Kronecker delta. \({r}_{{ij}}^{\mu }\) is the μ-component of the vector \({{\boldsymbol{r}}}_{{ij}}\equiv {{\boldsymbol{r}}}_{j}-{{\boldsymbol{r}}}_{i}\), and \({{\boldsymbol{r}}}_{j}\) is the position of neighboring atom j around atom i. \({U}_{i}\) here has the dimension of polarizability.
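Equation (1) can be evaluated numerically once the site quantities \({U}_{i}\) and their derivatives are available. The sketch below takes both as plain input arrays, which is an assumption made for illustration: in the actual TNEP model they are produced by the ANNs and the descriptors, and neighbors are restricted by a cutoff rather than stored densely. The function name and the two-atom toy data are hypothetical.

```python
import numpy as np

def polarizability_from_sites(U, positions, dU_dr):
    """Assemble the 3x3 tensor of Eq. (1).

    U         : (N,) site quantities U_i (dimension of polarizability)
    positions : (N, 3) atomic positions
    dU_dr     : (N, N, 3), dU_dr[i, j] = dU_i / dr_ij (zero for non-neighbors)
    """
    U = np.asarray(U, dtype=float)
    positions = np.asarray(positions, dtype=float)
    dU_dr = np.asarray(dU_dr, dtype=float)
    # r_ij[i, j] = r_j - r_i; the j = i entries vanish, so they add nothing.
    r_ij = positions[None, :, :] - positions[:, None, :]
    pair_term = np.einsum("ijm,ijn->mn", r_ij, dU_dr)
    return U.sum() * np.eye(3) - pair_term

# Two-atom toy system with made-up derivatives.
U = [2.0, 3.0]
positions = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
dU_dr = np.zeros((2, 2, 3))
dU_dr[0, 1] = [0.5, 0.0, 0.0]
dU_dr[1, 0] = [-0.5, 0.0, 0.0]
alpha = polarizability_from_sites(U, positions, dU_dr)
print(alpha)
```

For this toy input the pair term contributes only to the xx component, so the result is diagonal with a reduced xx element.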

The loss function of the original TNEP model is given by the weighted sum of the RMSEs of the molecular polarizability \({{\mathcal{L}}}_{{\rm{mol}}}\left({\mathbf{z}}\right)\) as well as two regularization terms, as:

$${\mathcal{L}}\left({\mathbf{z}}\right)={{\mathcal{L}}}_{{\rm{mol}}}\left({\mathbf{z}}\right)+{{{\lambda }}}_{1}\frac{1}{{N}_{{\rm{par}}}}\mathop{\sum }\limits_{n=1}^{{N}_{{\rm{par}}}}|{z}_{n}|+{{{\lambda }}}_{2}\sqrt{\frac{1}{{N}_{{\rm{par}}}}\mathop{\sum }\nolimits_{n=1}^{{N}_{{\rm{par}}}}{z}_{n}^{2}}$$
(2)

where \({\mathbf{z}}\) is a set of trainable parameters from the descriptors and the ANN model, and \({N}_{{\rm{par}}}\) is the total number of tunable parameters. The last two terms represent \({{\mathcal{L}}}_{1}\) and \({{\mathcal{L}}}_{2}\) regularizations. The weights \({{\rm{\lambda }}}_{1}\) and \({{\rm{\lambda }}}_{2}\) are tunable hyperparameters.

The loss term accounting for molecular polarizability is defined as:

$$\begin{array}{l}{{\mathcal{L}}}_{{\rm{mol}}}\left({\mathbf{z}}\right)=\left\{\frac{1}{6{N}_{{\rm{str}}}}\mathop{\sum }\limits_{n=1}^{{N}_{{\rm{str}}}}\left[\mathop{\sum}\limits_{\mu =\nu }{\left({\alpha }_{\mu \nu }^{{\rm{TNEP}},{\rm{mol}}}({n,\mathbf{z}})-{\alpha }_{\mu \nu }^{{\rm{ref}},{\rm{mol}}}(n)\right)}^{2}\right.\right.\\\qquad\qquad\qquad\left.\left.+\,{{{\lambda }}}_{{\rm{s}}}^{{\rm{mol}}}\left(\mathop{\sum}\limits_{\mu > \nu }{\left({\alpha }_{\mu \nu }^{{\rm{TNEP}},{\rm{mol}}}\,(n,{\mathbf{z}})-{\alpha }_{\mu \nu }^{{\rm{ref}},{\rm{mol}}}(n)\right)}^{2}\right)\right]\right\}^{\frac{1}{2}}\end{array}$$
(3)

where \({N}_{{\rm{str}}}\) is the number of structures in the whole training data set. \({\alpha }_{\mu \nu }^{{\rm{TNEP}},{\rm{mol}}}\left(n,{\mathbf{z}}\right)\) is the molecular polarizability component predicted by the original TNEP model with parameters \({\mathbf{z}}\) for the \({n}^{{\rm{th}}}\) structure while \({\alpha }_{\mu \nu }^{{\rm{ref}},{\rm{mol}}}\left(n\right)\) is the corresponding reference molecular polarizability component typically obtained by the DFT method. Since molecular polarizability is a symmetric second-order tensor (\({\alpha }_{\mu \nu }={\alpha }_{\nu \mu }\)), we utilize the lower-triangular off-diagonal components (\(\mu > \nu\)) of \({\alpha }_{\mu \nu }^{{\rm{TNEP}},{\rm{mol}}}\left(n,{\mathbf{z}}\right)\) and \({\alpha }_{\mu \nu }^{{\rm{ref}},{\rm{mol}}}\left(n\right)\) for implementation. \({{\rm{\lambda }}}_{{\rm{s}}}^{{\rm{mol}}}\) is introduced as a regularization parameter to balance the contributions from the diagonal and off-diagonal components.
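Equations (2) and (3) can be combined into a short numerical sketch. The function names are hypothetical; the default weights follow the values stated in this work (\({{\rm{\lambda }}}_{{\rm{s}}}^{{\rm{mol}}}=1\), \({{\rm{\lambda }}}_{1}={{\rm{\lambda }}}_{2}=0.03\)), and the toy tensors and parameter vector are made up for illustration.

```python
import numpy as np

def loss_mol(alpha_pred, alpha_ref, lambda_s=1.0):
    """Eq. (3): loss over diagonal and lower-triangular components.

    alpha_pred, alpha_ref : arrays of shape (N_str, 3, 3).
    """
    diff = np.asarray(alpha_pred, dtype=float) - np.asarray(alpha_ref, dtype=float)
    n_str = diff.shape[0]
    diag = np.einsum("nii->n", diff ** 2)          # sum over mu = nu
    rows, cols = np.tril_indices(3, k=-1)          # pairs with mu > nu
    off = (diff[:, rows, cols] ** 2).sum(axis=1)
    return float(np.sqrt(np.sum(diag + lambda_s * off) / (6.0 * n_str)))

def total_loss(alpha_pred, alpha_ref, z, lambda_1=0.03, lambda_2=0.03, lambda_s=1.0):
    """Eq. (2): molecular loss plus L1 and L2 regularization of parameters z."""
    z = np.asarray(z, dtype=float)
    l1 = lambda_1 * np.mean(np.abs(z))
    l2 = lambda_2 * np.sqrt(np.mean(z ** 2))
    return loss_mol(alpha_pred, alpha_ref, lambda_s) + float(l1 + l2)

# One toy structure: error of 1 on xx and on the symmetric xy/yx pair.
ref = np.zeros((1, 3, 3))
pred = np.zeros((1, 3, 3))
pred[0, 0, 0] = 1.0
pred[0, 0, 1] = pred[0, 1, 0] = 1.0
z = np.array([3.0, -4.0])
print(loss_mol(pred, ref), total_loss(pred, ref, z))
```

Because only the \(\mu > \nu\) entries are summed, each symmetric off-diagonal pair is counted once, which matches the factor of six components per structure in the normalization.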

For the radial components of the original TNEP model, a cutoff radius of 7 Å and seven radial functions (each being a linear combination of 11 basis functions) were used in this work. For the angular components, a cutoff radius of 4 Å and seven radial functions (each being a linear combination of 11 basis functions) were used. The maximum expansion orders for the three-, four-, and five-body terms of the angular descriptor components are 4, 2, and 1, respectively. The fitting component is an ANN composed of one hidden layer with 30 neurons. For the regularization parameters, \({{\rm{\lambda }}}_{{\rm{s}}}^{{\rm{mol}}}\) was set to 1, and \({{\rm{\lambda }}}_{1}\) and \({{\rm{\lambda }}}_{2}\) were both set to 0.03. The original TNEP model was trained for 300,000 generations using the SNES algorithm with a population size of 80.

Iterative explorations of larger clusters through the FPS method

To improve computational efficiency, for the smallest cutoff radius (3 Å), only one structure for each stoichiometric ratio of hydrocarbon was randomly selected and labeled as the training data set to train the pre-TNEP model. The remaining clusters were labeled as the unselected data set, and the completeness of the current training data set relative to the unselected one was evaluated by descriptor space analysis with the pre-TNEP model70. Where needed, new samples were added via the FPS method to build the initial data set.

Starting from the initial data set and its corresponding TNEP model, new samples were then added using the FPS method, with all cluster structures constructed with a cutoff radius of 4 Å serving as the unselected data set. This iteration continued until the cluster structures constructed with a cutoff radius of 7 Å had been supplemented. The procedure for supplementing the training data set by the FPS method is shown schematically in Fig. 8. The final training data set contains 1980 configurations: 380 constructed with a cutoff radius of 3 Å, and 400 for each cutoff radius from 4 to 7 Å in 1 Å increments. The reason for choosing a converged cutoff radius of 7 Å is discussed in the Supplementary Information.
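The supplementation step can be illustrated with a greedy farthest-point selection in descriptor space. This is only a sketch under stated assumptions: FPS is read here as farthest-point sampling, descriptors are compared with the Euclidean distance, and the function name, the `n_new` budget, and the `min_dist` stopping threshold are all hypothetical rather than the settings used in this work.

```python
import math

def farthest_point_sampling(selected, candidates, n_new, min_dist=0.0):
    """Greedily pick up to n_new candidates that lie farthest from the
    points already selected.

    selected:   non-empty list of descriptor vectors already in the
                training set.
    candidates: list of descriptor vectors of the unselected structures.
    min_dist:   stop once no candidate is farther than this from the
                selected set (hypothetical completeness criterion).
    Returns the indices of the chosen candidates, in selection order.
    """
    # Distance from each candidate to its nearest selected point
    nearest = [min(math.dist(c, s) for s in selected) for c in candidates]
    picked = []
    for _ in range(min(n_new, len(candidates))):
        idx = max(range(len(candidates)), key=lambda i: nearest[i])
        if nearest[idx] <= min_dist:
            break  # remaining candidates are already well covered
        picked.append(idx)
        # The new pick now counts as selected: refresh nearest distances
        for i, c in enumerate(candidates):
            nearest[i] = min(nearest[i], math.dist(c, candidates[idx]))
    return picked
```

In an iterative scheme of the kind described above, the returned structures would be labeled, added to the training set, and the model retrained before moving on to the next cutoff radius.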

Fig. 8: Schematic representation of supplementing the training data set by the FPS method.

Points in different colors in the projection represent configurations in different data sets.

For each cutoff radius from 6 to 13 Å (in 1 Å increments), 100 structures were randomly selected from the corresponding unselected data set and labeled as the R6 to R13 test data sets. The extrapolative performance of the TNEP models was evaluated on these test data sets.

Calculations of atomic polarizability

Atomic polarizabilities of structures in the training data set were calculated by the GFN2-xTB method71. GFN2-computed atomic polarizabilities were selected as training constraints because they are derived from atom types, including the element number, the hybridization state of carbon atoms, and some basic structural information, and are therefore potentially better motivated physically and more transferable. They depend on pre-computed atomic polarizabilities at a certain molecular geometry, i.e., with the atom having a GFN2-xTB computed atomic partial charge \({q}_{r}\) and a covalent coordination number \({{CN}}_{\mathrm{cov}}^{r}\) (the index \(r\) indicates values for the reference structures)71,72,73.

In addition, a more accurate approach based on QM calculations and the Hirshfeld partitioning scheme was employed. It served as an independent benchmark for evaluating how well the TNEP models partition the isotropic molecular polarizability into atomic contributions. This method relies on the observation that atomic polarizability is proportional to the fuzzy atomic volume of the electron cloud74,75,76,77:

$${\alpha }_{{\rm{eff}}}\left(0\right)\equiv {\alpha }_{{\rm{free}}}\left(0\right)\frac{{V}_{{\rm{eff}}}}{{V}_{{\rm{free}}}}$$
(4)

where \({\alpha }_{{\rm{eff}}}(0)\) and \({\alpha }_{{\rm{free}}}(0)\) are the static polarizabilities of the atom in a molecule (effective atomic polarizability) and of the free atom, respectively, and \({V}_{{\rm{eff}}}\) and \({V}_{{\rm{free}}}\) are measures of the “volume” of the atom in a molecule and of the free atom, respectively.

By applying the Hirshfeld partitioning scheme to the electron density obtained from DFT calculations76,78,79, the ratio of the atom-in-a-molecule volume to the free-atom volume can be derived for each atom, and the QM-based atomic polarizability then follows from Eq. 4.
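The volume scaling of Eq. 4 amounts to a single multiplication per atom, as the short sketch below shows. The function name is hypothetical, and the numbers in the usage note are illustrative placeholders, not values from this work.

```python
def effective_polarizability(alpha_free, v_eff, v_free):
    """Effective atomic polarizability from Eq. 4.

    alpha_free: static polarizability of the free atom.
    v_eff:      Hirshfeld volume of the atom in the molecule.
    v_free:     volume of the free atom (same units as v_eff).
    """
    if v_free <= 0:
        raise ValueError("v_free must be positive")
    # alpha_eff = alpha_free * (V_eff / V_free)
    return alpha_free * v_eff / v_free
```

For example, an atom whose Hirshfeld volume is 80% of its free-atom volume receives 80% of the free-atom polarizability: `effective_polarizability(10.0, 8.0, 10.0)` gives `8.0`.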

Principle of the TNEP-C model

The architecture of the TNEP-C model is shown in Fig. 9. To pose a constraint on atomic polarizability, an additional term is included in the loss function of the TNEP-C model as:

$$\begin{array}{l}{{\mathcal{L}}}_{{\rm{atomic}}}\left({\mathbf{z}}\right)=\left\{\displaystyle\frac{1}{6{N}_{{\rm{str}}}}\mathop{\sum }\limits_{n=1}^{{N}_{{\rm{str}}}}\mathop{\sum }\limits_{i=1}^{{N}_{{\rm{a}}}}\left[\mathop{\sum}\limits_{\mu =\nu }{\left({\alpha }_{\mu \nu }^{{\rm{TNEP}}-{\rm{C}},{\rm{atomic}}}(n,i,{\mathbf{z}}){-}{\alpha }_{\mu \nu }^{{\rm{ref}},{\rm{atomic}}}(n,i)\right)}^{2}\right.\right.\\\qquad\qquad\qquad\left.\left.+\,{{\rm{\lambda }}}_{{\rm{s}}}^{{\rm{atomic}}}\left(\mathop{\sum}\limits_{\mu > \nu }{\left({\alpha }_{\mu \nu }^{{\rm{TNEP}}-{\rm{C}},{\rm{atomic}}}(n,i,{\mathbf{z}}){-}{\alpha }_{\mu \nu }^{{\rm{ref}},{\rm{atomic}}}(n,i)\right)}^{2}\right)\right]\right\}^{\frac{1}{2}}\end{array}$$
(5)

where \({N}_{{\rm{str}}}\) is the number of structures in the whole training data set and \({N}_{{\rm{a}}}\) is the number of atoms in the \({n}^{{\rm{th}}}\) structure. \({\alpha }_{\mu \nu }^{{\rm{TNEP}}-{\rm{C}},{\rm{atomic}}}(n,i,{\mathbf{z}})\) is the atomic polarizability component predicted by the TNEP-C model with parameters \({\mathbf{z}}\) for the \({i}^{{\rm{th}}}\) atom in the \({n}^{{\rm{th}}}\) structure. \({\alpha }_{\mu \nu }^{{\rm{ref}},{\rm{atomic}}}(n,i)\) is the corresponding reference atomic polarizability component, which is obtained by the GFN2-xTB method in this work. Since the GFN2-computed atomic polarizabilities are inherently isotropic, the contributions from the off-diagonal components are zero in the training process. Consequently, \({{\rm{\lambda }}}_{{\rm{s}}}^{{\rm{atomic}}}\), which is designed to balance the contributions from diagonal and off-diagonal elements, is set to a default value with no tuning required.
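The atomic term of Eq. 5 can be sketched in the same way as the molecular term, with an additional sum over the atoms of each structure. The sketch assumes that each isotropic reference value is placed on the diagonal of an otherwise zero 3×3 tensor. The function names and the data layout are illustrative and are not taken from the TNEP-C implementation.

```python
import math

def isotropic_tensor(alpha):
    """3x3 tensor with alpha on the diagonal and zeros elsewhere
    (assumed representation of an isotropic reference value)."""
    return [[alpha if m == n else 0.0 for n in range(3)] for m in range(3)]

def atomic_loss(pred, ref_iso, lambda_s=1.0):
    """Atomic polarizability loss in the form of Eq. 5.

    pred:    pred[n][i] is the predicted 3x3 tensor of atom i in structure n.
    ref_iso: ref_iso[n][i] is the isotropic reference value of that atom.
    """
    n_str = len(pred)
    total = 0.0
    for atoms_pred, atoms_ref in zip(pred, ref_iso):
        # The number of atoms N_a may differ from structure to structure
        for a, alpha in zip(atoms_pred, atoms_ref):
            b = isotropic_tensor(alpha)
            diag = sum((a[m][m] - b[m][m]) ** 2 for m in range(3))
            off = sum((a[m][n] - b[m][n]) ** 2
                      for m in range(3) for n in range(m))
            total += diag + lambda_s * off
    # Normalization follows Eq. 5: 1 / (6 * N_str)
    return math.sqrt(total / (6 * n_str))
```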

Fig. 9: Schematic representation of the TNEP-C architecture.

The section highlighted in red represents the introduced loss function term enforcing the atomic polarizability constraint.

The total loss function for the TNEP-C model is thus defined as:

$${\mathcal{L}}\left({\mathbf{z}}\right)={{\mathcal{L}}}_{{\rm{mol}}}\left({\mathbf{z}}\right)+{{{{\lambda }}}_{{\rm{atomic}}}\,{{\cdot}}\,{\mathcal{L}}}_{{\rm{atomic}}}\left({\mathbf{z}}\right)+{{{\lambda }}}_{1}\frac{1}{{N}_{{\rm{par}}}}\mathop{\sum }\limits_{n=1}^{{N}_{{\rm{par}}}}|{z}_{n}|+{{{\lambda }}}_{2}\sqrt{\frac{1}{{N}_{{\rm{par}}}}\mathop{\sum }\nolimits_{n=1}^{{N}_{{\rm{par}}}}{z}_{n}^{2}}$$
(6)

where \({{\rm{\lambda }}}_{{\rm{atomic}}}\) is the weight of the atomic polarizability term to balance the contributions from the molecular polarizability and atomic polarizability.

For the TNEP-C model, \({{\rm{\lambda }}}_{{\rm{atomic}}}\) was set to 0.2 and other settings were kept identical to those implemented in the original TNEP model. The TNEP-C model was trained for 400,000 generations to ensure that the loss terms for molecular polarizability in the training and test data sets had largely converged.
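To make the weighting in Eq. 6 concrete, the sketch below assembles the total loss from precomputed molecular and atomic terms and a flat list of trainable parameters. The defaults mirror the values reported in this work (0.2 for the atomic weight and 0.03 for both regularization weights). The function name and argument layout are illustrative assumptions.

```python
import math

def total_loss(l_mol, l_atomic, z,
               lambda_atomic=0.2, lambda_1=0.03, lambda_2=0.03):
    """Total TNEP-C loss in the form of Eq. 6.

    l_mol, l_atomic: values of the molecular and atomic loss terms.
    z:               flat list of trainable parameters.
    """
    n_par = len(z)
    # L1 regularization: mean absolute parameter value
    l1 = sum(abs(v) for v in z) / n_par
    # L2 regularization: root-mean-square parameter value
    l2 = math.sqrt(sum(v * v for v in z) / n_par)
    return l_mol + lambda_atomic * l_atomic + lambda_1 * l1 + lambda_2 * l2
```

Setting `lambda_atomic` to 0 removes the atomic constraint and leaves only the molecular term and the two regularization terms.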