Introduction

Among the seventeen Sustainable Development Goals (SDGs) for the year 2030 set by the United Nations, eight rely on maintaining a healthy soil environment. However, soil contamination by heavy metals (HMs) poses a formidable threat to arable soil resources worldwide1. Due to the persistent, nondegradable, and bioaccumulative nature of toxic HMs, HM-contaminated soils have long-lasting detrimental effects on global food safety and human health. According to estimates, more than 9 million premature deaths are attributable to HM pollution2, underscoring the critical importance of accurately assessing the environmental risks of soil HMs and developing appropriate soil remediation strategies3.

Regional or continental maps of HM distributions in agricultural soils have been lacking until recently. Prior to the late 1990s, only a few regional- to continental-scale soil geochemical surveys with sufficiently large sampling coverage were carried out to assess the topsoil distribution of HMs. Notably, China conducted three major National Soil Surveys in 1959, 1979, and 2022, with increasing sampling density and numbers of measured soil properties4. Additionally, in 2012, China conducted its first national survey focused on agricultural soil pollution, particularly HM contents5. In the European Union (EU), three major soil geochemical surveys were conducted with repeated sampling every few years: the FORum of European Geological Surveys (FOREGS)6 in 1997, the GEochemical Mapping of Agricultural and grazing land Soil (GEMAS)7 in 2008, and the Land Use/land Cover Area frame Survey (LUCAS)8,9 in 2009. The LUCAS Topsoil Survey sampled a greater variety of land-use classes and soil types, measured more soil properties, and had a greater sampling density than the other two surveys (on average, 1 site/5000 km2 in FOREGS, 1 site/2500 km2 in GEMAS, and 1 site/200 km2 in LUCAS). These data and the derived maps have proven to be useful tools for environmental and resource management at regional or continental scales. For example, spatial assessments of total Hg10, Cu11, Zn12, and Cd13 in European soils have been conducted using their total contents in LUCAS, which facilitated the identification of the key properties that determine the spatial variability of HMs in soils.

While total HM contents are commonly used to assess soil contamination risk, they are insufficient for assessing HM mobility, bioavailability, and toxicity in soils14. The interactions and adsorption of HMs on soil components induce considerable variability in bioavailability across different soil matrices15. Consequently, the bioavailable fraction of HMs, which represents the portion that can be taken up by plants and soil organisms, has been proposed as a more reliable parameter for risk assessment16. Various methods, including chemical extraction and modeling methods, have been developed to determine the bioavailability of HMs in soils17,18. Among these, Multi-Surface Models (MSMs) have shown promising results in predicting the bioavailability of HMs, such as Cd, Ni, and Zn, across different soil types19. MSMs assume that HMs in the soil solution are in thermodynamic equilibrium with different solid phases, including organic matter (OM), iron oxides, and clay minerals19, and therefore use geochemical surface complexation models to capture the complex interaction and adsorption mechanisms between HMs and soil matrices with greater generalizability. However, MSMs require a large number of model parameters20,21,22, and difficulties arise because reaction parameters diverge among geochemical surface complexation models owing to their different simplifications and assumptions23. Therefore, a reliable and convenient modeling approach is urgently needed to accurately predict the mechanistic adsorption distribution and bioavailability of HMs in soils at larger scales.

Machine learning (ML) has become an indispensable tool in environmental science and geoscience due to its robustness and high prediction accuracy, which provides a new path forward in directly leveraging the ever-expanding wealth of scientific data available in the literature24. This technique has demonstrated success in high-resolution soil property mapping. One application involves estimating global patterns of soil organic carbon (SOC) decomposition rates by fusing multisource data to uncover the hidden nonlinear relationships among geochemical factors25. Moreover, ML techniques have emerged as promising alternatives for dealing with multiphysical or chemical processes by embedding the knowledge of underlying physical or chemical laws25. Recently, hybrid ML geochemical models have been proposed for determining the fate and transport of uranium anions in subsurface environments, which has the potential to solve the problems that limit the development of self-consistent geochemical surface complexation models as well as MSMs for large-scale applications26.

In this study, we developed a geochemical-integrated ML framework to assess the speciation distribution of Cd in soils. Our approach, which swiftly predicts the bioavailable fraction of soil Cd, offers a mechanistic geochemical interpretation at the continental scale. It can facilitate risk assessment, guide policy formulation, and support targeted remediation of soil HM pollution, which in turn promotes sustainable agricultural practices and contributes to long-term environmental health at regional, continental, and even global scales.

Results and discussion

ML-based framework

Our geochemical-integrated ML framework used regional soil surveys, crop uptake data, and MSM modeling to predict the Cd associated with soil interfaces and the dissolved fraction of nonindustrial soil Cd at the continental scale (Fig. 1). The overall structure of the work can be broadly divided into four sections (details on the ML framework and validation procedures are described in “Methods”): (1) To predict the amorphous ferrihydrite content (represented by hydrous ferric oxides and abbreviated as HFO) and Cd speciation distribution (abbreviated as dist) in the EU and China, four datasets (EUHFO, EUdist, CNHFO and CNdist) were compiled; (2) ML algorithms were developed on EUHFO and CNHFO to predict the HFO content at each site in EUdist and CNdist (Methods and Texts S1, S2); (3) The Cddist dataset was generated by Latin hypercube sampling and combined with MSM output variables to train ML models (Methods and Text S3). The best-performing ML algorithm was used to predict the solid/liquid distribution at each site in EUdist and CNdist, and the maps of the total and dissolved fractions of Cd were compared; and (4) Knowledge transfer (KT) models were established within the ML framework to estimate the accumulation of Cd in wheat grains and roots.

Fig. 1: Flow diagram of the modeling framework.

The pipeline integrates data collection, database construction, model development, and spatial visualization.

Relationships between soil properties and Cd distribution patterns

Based on previous research on the retention of Cd in soils, the bioavailable Cd content in soils is predominantly controlled by adsorption onto four major reactive soil components: soil organic matter (SOM), dissolved organic matter (DOM), clay, and HFO. Cd adsorption to SOM and DOM was described with different ratios of humic acid (HA) and fulvic acid (FA). HA-Cd, FA-Cd, Clay-Cd, and HFO-Cd represent the amounts of Cd adsorbed on these surfaces (Methods and Text S7). The reliability of the MSMs in both the EU and China was demonstrated by validating the correlation between the MSM-calculated equilibrium dissolved Cd (MSMs-Cd) and empirical data on plant uptake and extracted Cd. The robust linear correlation demonstrated that MSM-calculated dissolved Cd effectively indicates soil Cd bioavailability (Text S4), and it was therefore used to evaluate Cd bioavailability. Within the geochemical-integrated ML framework, the gradient boosting regression tree (GBRT) model trained on the Cddist dataset achieved satisfactory R2 values of 0.998 and 0.989 on the internal and external test sets, respectively, and thus was chosen as the final algorithm to predict the solid‒liquid distribution of Cd at each site (Text S3). The dissolved Cd concentration predicted by GBRT was termed ML-Cd.
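For readers less familiar with the metric, the R2 quoted above is presumably the usual coefficient of determination, 1 − SS_res/SS_tot. The following minimal sketch shows that calculation under this assumption; the function name and the sample values are illustrative and are not taken from the study.

```python
def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical values only: MSM-calculated targets vs. ML predictions.
msm_cd = [10.0, 20.0, 30.0, 40.0]
ml_cd = [11.0, 19.0, 31.0, 39.0]
score = r2_score(msm_cd, ml_cd)
```

A value close to 1 on an external test set, as reported for GBRT, indicates that the surrogate reproduces the MSM output closely on data it was not trained on.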

Taking the prediction of the ML-Cd as an example, three different feature analysis techniques were used to evaluate the importance of different features (Figs. S1 and S2). The total Cd content, pH, and SOC (accounting for 58% of OM27) content were identified as the three most important features for the prediction of the ML-Cd, with SOC being the dominant sorbent in the soil. As shown in Fig. S1a, SHapley Additive exPlanations (SHAP) analysis (a feature importance method, detailed in Methods) for predicting ML-Cd revealed that pH, SOC, clay, and HFO were negatively correlated with the content of ML-Cd. Conversely, higher concentrations of total Cd were associated with greater predictions of ML-Cd. For total Cd content, a distinct pattern was observed: instances of high Cd concentration (red points) formed a dense cluster with small, positive SHAP values, while instances of lower Cd concentration (blue points) extended further toward the left, suggesting that low total Cd concentration had a stronger negative impact on ML-Cd. More details about the interaction effects, including the dependence plot, interaction plot, and heatmap, are given in Text S5 and Fig. S3.
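To make the sign convention of SHAP values concrete, the sketch below computes exact Shapley values for a three-feature toy model by enumerating all feature subsets. It is an illustration only: the linear model, its coefficients, and the single baseline point are assumptions, whereas the study applied SHAP to the trained GBRT model.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a subset take their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # k = size of the subset drawn from the other features
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_ml_cd(z):
    """Hypothetical linear stand-in: rises with total Cd, falls with pH and SOC."""
    total_cd, ph, soc = z
    return 100.0 * total_cd - 5.0 * ph - 0.5 * soc


site = [0.6, 5.5, 40.0]        # hypothetical site: total Cd, pH, SOC
reference = [0.4, 6.5, 20.0]   # hypothetical baseline (e.g. dataset means)
phi = shapley_values(toy_ml_cd, site, reference)
```

In this toy case the contributions are +20 (total Cd above the baseline), +5 (pH below the baseline), and −10 (SOC above the baseline), and they sum to the difference between the prediction at the site and at the baseline.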

HFO is identified as an important interface for Cd, since it can enhance Cd fixation during soil redox cycles28. Furthermore, it plays a crucial role in certain localities, especially in areas with high pH and low SOC content29. For example, the pH on the Iberian Peninsula increases from north to south, accompanied by decreases in the contents of SOC and HFO (Fig. S4)30,31. Three representative areas within the region were selected, the corresponding distributions of Cd species on various soil components were computed for each soil site within these areas, and the 25 sites with the highest proportions of HFO-Cd (the percentage ratio of HFO-Cd relative to total soil Cd) are presented in Fig. 2. The results indicated that the proportion of Cd adsorbed by HFO in these three areas increased from north to south, with average values of 1.5 ± 2.9%, 15.5 ± 10.5%, and 30.7 ± 4.0%, respectively. Similarly, along the east coast of Italy, where the soil exhibited high pH, low SOC content, and relatively high HFO levels, the adsorption capacity of HFO was more pronounced, with an average HFO-Cd proportion of 33.3 ± 6.0%. The fraction of Cd on HFO ranged from 0 to 53.1% of the total Cd in the four selected areas.

Fig. 2: The distribution of Cd associated with different soil phases under different soil property conditions.

a Interpolated pH map of the EU (adapted from Ballabio et al.30, licensed under CC BY 4.0), with the study area shown as a black inset, and b four hotspots exemplifying the varying significance of the HFO-Cd fraction.

Comparison of the spatial distribution of the total and dissolved fractions of soil Cd between the EU and China

The Cd content in soil is influenced not only by human activities (e.g., industrial activities, mining processes, and fertilization inputs13) but also by the regional geological background and weathering-to-soil processes. Specifically, soils over carbonate bedrocks accumulate Cd through a self-regulating cycle: carbonate rocks have a high potential for releasing Cd, while OM and Fe/Mn oxides immobilize it via adsorption/complexation32. This equilibrium between mobilization and retention explains the consistent spatial overlap of high-Cd areas and carbonate-rich regions in both the EU and China (Fig. 3a, b)33. As shown in Tables S1 and S2, the total Cd content in the EU varied between 0.11 and 1.55 mg kg−1, with an average of 0.37 mg kg−1 and a standard deviation of 0.17 mg kg−1, while the total Cd content in China ranged from 0.01 to 14.22 mg kg−1, with an average of 0.41 mg kg−1 and a standard deviation of 1.03 mg kg−1. The mean total Cd content in Chinese topsoil was approximately 10.8% greater than that in the EU.

Fig. 3: Spatial interpolation map of total and dissolved Cd.

a European map of total Cd13, b Chinese map of total Cd; c European map of dissolved Cd, and d Chinese map of dissolved Cd.

The prediction of the Cd solid‒liquid distribution at each site was carried out using the geochemical-integrated ML framework (GBRT trained on the Cddist dataset), and the major statistics of the prediction results are summarized in Table S3. Figs. 3c, d and S5 present distribution maps illustrating the ML-Cd and adsorbed-Cd concentrations on different soil reactive components in the EU and China. In particular, the predicted range for the ML-Cd within nonindustrial topsoil in the EU was 0.3–971.2 μg L−1, with a mean value of 96.9 μg L−1 and a standard deviation of 116.5 μg L−1. In China, the corresponding range was 0.06–4151.2 μg L−1, with a mean of 113.2 μg L−1 and a standard deviation of 337.3 μg L−1. The mean ML-Cd in China was 16.8% greater than that in the EU. Table S4 and Fig. S6 present, respectively, the statistical results and distribution maps of the proportions of the various Cd forms relative to the total Cd content in China and the EU. On average, the proportions of ML-Cd to the total Cd in the EU and China were 26.3 ± 22.9% and 24.6 ± 18.0%, respectively. Moreover, the percentages of SOM-bound Cd (HA-Cd + FA-Cd) were 61.1 ± 20.0% and 65.4 ± 19.3%, respectively, surpassing the percentages of Cd associated with the other interfaces (clay-Cd, constituting 3.5 ± 3.3% and 5.8 ± 4.6%, and HFO-Cd, constituting 9.1 ± 11.7% and 4.3 ± 6.8% of the total Cd in the EU and China, respectively). Based on Tables S1 and S2, the relatively higher total Cd and lower contents of OC and HFO in Chinese topsoil were thought to be the main factors contributing to the greater bioavailability. With nearly comparable mean pH values (6.33 in the EU and 6.64 in China), the mean OC content in China (14.76 g kg−1) was markedly lower than that in the EU (36.53 g kg−1), and a similar pattern was also observed for the mean HFO content (2.6 g kg−1 in the EU and 1.6 g kg−1 in China). Consequently, as shown in Table S3, the mean adsorption amounts of Cd on HA and HFO in China are less than those in the EU (148.6 μg L−1 and 7.2 μg L−1 in China, respectively, compared to 194.3 μg L−1 and 28.1 μg L−1 in the EU).
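The percentage differences between China and the EU quoted in this section appear to follow directly from the reported means. A short check, using only the mean values stated in the text (the helper name is ours):

```python
def percent_greater(value, reference):
    """Relative excess of `value` over `reference`, in percent."""
    return (value - reference) / reference * 100.0


# Mean total Cd (mg kg-1): China 0.41 vs. EU 0.37
total_cd_gap = percent_greater(0.41, 0.37)
# Mean ML-Cd (ug L-1): China 113.2 vs. EU 96.9
ml_cd_gap = percent_greater(113.2, 96.9)

print(round(total_cd_gap, 1), round(ml_cd_gap, 1))
```

Both results agree with the figures given in the text (10.8% and 16.8%). Note that these are comparisons of means; given the much larger standard deviations in the Chinese data, they say little about individual sites.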

As shown in Fig. 3a, the highest amount of total Cd in nonindustrial topsoil was found in Ireland, followed by northern Spain, northern Sweden, Finland, and Poland. Lithuania, Slovenia, central Romania, and the west coast of Greece also exhibited relatively high Cd contents. Nevertheless, Fig. 3c depicts a different picture in which northern Sweden, Finland, and Poland exhibited the highest bioavailability rather than Ireland; on the other hand, Italy showed higher total Cd levels but a very low Cd bioavailability. The total Cd and MSM-Cd in England, Estonia, Latvia, Hungary, and Bulgaria posed low environmental risks. By combining SHAP analysis and maps of pH30, SOC31, and clay31, it was determined that pH and SOC content were the primary drivers behind the differences in the spatial distributions of the total and ML-Cd (Fig. S6). Typically, low soil pH was the major driver for regions with lower total Cd contents but higher Cd bioavailability, such as Sweden and Poland. This is consistent with the scenario depicted in Fig. S6, where regions with a high proportion of ML-Cd closely align with areas of lower pH. On the other hand, despite the high total Cd content in Ireland, the high SOC content in Ireland resulted in substantial Cd adsorption by HA and FA and therefore a comparatively low Cd bioavailability.

In contrast to those in the EU, the distributions of ML-Cd and the proportion of ML-Cd in China show a pattern similar to that of total Cd. As shown in Fig. 3b, d, areas exhibiting higher levels of both total and ML-Cd were found in Taiwan Province, Guangxi Province, and the junction of Yunnan and Guizhou Provinces. This phenomenon can be attributed to a noticeable pattern in the distribution of soil properties across China. Specifically, soil pH tends to be higher in the north and lower in the south5, while SOC and clay content tend to be lower in the north and higher in the south5,34.

Knowledge transfer for bioavailability prediction

Most regional or continental risk assessments of soil Cd have relied solely on total content13, while measured bioavailable Cd contents remain problematic for risk evaluation because different extraction procedures define different types of bioavailable Cd. Therefore, we integrated the above geochemical adsorption processes with crop Cd uptake within the soil‒plant system using knowledge transfer (KT). The KT model was established by incorporating the predictive outcomes of the Cd speciation distribution obtained from the best-performing GBRT model (Table S5), and was trained alongside a nonmechanistic data-driven (DD) model on Chinese wheat data. Although comparative performance metrics (performance score in Table S6; SHAP for DD shown in Fig. 4a, c, SHAP for KT shown in Fig. 4b, d) demonstrated similar accuracy between models, the KT model distinguished itself by encoding the geochemical interactions governing Cd distribution. Feature importance analysis (Fig. 4b, d) confirms that ML-Cd and clay-Cd exert the predominant influence in the KT model, outperforming all other soil properties by substantial margins. This is physically sound, because both soluble Cd and electrostatically clay-bound Cd can be attributed to the exchangeable fraction35. These findings are also consistent with the view that Cd bioavailability is not solely determined by the ML-Cd fraction but is also influenced by the presence of Cd at other interfaces during the dynamic Cd uptake process36. Hence, the KT model effectively addresses this complexity by leveraging the knowledge transferred from a different but related task in an end-to-end paradigm. Consequently, our geochemical-integrated ML framework bridged the knowledge gap by integrating knowledge from geochemical processes into crop uptake processes, leading to a better understanding of Cd accumulation in both wheat roots and grains. Its potential applications in diverse species and complex scenarios await further validation.

Fig. 4: Comparison of knowledge transfer and nonmechanistic data-driven models in predicting Cd plant uptake.

Performance and SHAP values for predictive models of wheat Cd accumulation: a data-driven (DD) model in grain, b knowledge transfer (KT) model in grain, c DD model in root, d KT model in root.

Conclusions

The bioavailability and risks of HMs in agricultural soils remain important obstacles to achieving global food safety and security. Previous studies have focused on specific case studies, whereas accurate regional- and continental-scale assessments of Cd bioavailability and risk have been lacking. This study presents an approach that integrates the mechanistic aspects of geochemical complexation processes for predicting the Cd speciation distribution in soils and crop uptake, bridging the gap between laboratory research and field applications. The end-to-end geochemical-integrated ML framework, which uses only geographical location and a few soil properties (pH, clay fraction, and SOC content), drastically reduced the costs and time needed for accurate estimation of bioavailable Cd. By leveraging existing survey data, the framework generated maps depicting Cd contents in various phases across the EU and China, achieving substantial savings in monitoring costs. Crop uptake of Cd with mechanistic interpretation was also accomplished through the extended KT model, providing valuable insights for environmental risk assessments, sustainable food security policies, and effective soil remediation strategies. Our findings underscore that relying solely on total Cd content can underestimate risk, highlighting the importance of considering different Cd species in risk assessments. The developed framework can also be readily extended to most other HMs, although prediction of bioavailable lead (Pb) and mercury (Hg) would remain challenging due to their complex behaviors and transformations in soils19,37. If more sophisticated soil-HM interaction mechanisms are incorporated into the framework, it may provide a more comprehensive understanding of HM interactions and their associated risks in soils.

The accuracy of the predictions is subject to several data and methodological constraints. First and foremost, while MSMs ideally require chemically reactive Cd (typically 0.43 M HNO3-extractable) as input, our reliance on aqua regia-extractable total Cd, the only metric consistently available across datasets, may overestimate MSMs-Cd38. Future refinements should prioritize either developing pH-dependent empirical functions to convert total Cd to reactive fractions or incorporating aging models39 to better constrain the chemically available Cd inputs for MSMs. For European predictions, 73% of LUCAS soil Cd concentrations fell below the detection limit (0.07 mg kg−1). Although the predictive R2 of the interpolated map reached 0.45 at a resolution of 100 m13, the interpolated data carry inherent uncertainties relative to the source data. Similarly, Chinese estimates combine 40-year-old national survey data with heterogeneous literature sources, potentially misrepresenting current contamination patterns. While the risk assessment of HMs tends to prioritize hotspots with high HM contents, it is important to acknowledge these limitations and strive for improved data collection and data quality in future studies. Methodological approximations, including equilibrium state assumptions (despite the nonequilibrium nature of soil processes) and interpolated soil properties, introduce further uncertainties. Geological processes such as carbonate weathering release Cd into soils, while pedological factors such as OM and Fe/Mn oxides drive its re-immobilization, collectively shaping the net available Cd pool. Incorporating these coupled mechanisms could refine the predictive accuracy of Cd bioavailability.

Despite the known biases in its predictions, the ML framework remains valuable for decision-making at regional and continental scales40. Improving data accuracy will enable better predictions and risk assessments for soil Cd contamination, facilitating informed decision-making for sustainable environmental protection.

Methods

Datasets

Soil properties: Soil organic carbon (SOC), clay minerals, and amorphous ferrihydrite (HFO) were selected as the primary solid phases responsible for Cd adsorption in soils in MSMs as well as in the geochemical-integrated machine learning framework. HFO is particularly essential for the retention of Cd in alkaline soils19,41. However, HFO content has not been determined in most of the literature or in the LUCAS survey series (Table S7), except for LUCAS 2018. To address this gap, we established two datasets for predicting HFO content in Europe and China: (1) the EUHFO dataset, comprising HFO measurements from 2510 LUCAS 2018 sampling sites with complete characterization of 11 soil properties, and (2) the CNHFO dataset, incorporating 132 literature-derived HFO measurements from China, limited to three commonly reported parameters (pH, OC, and clay content) due to data availability constraints. The resulting HFO values were then compiled, together with other soil properties, into the datasets used to predict the Cd speciation distribution at soil interfaces (EUdist and CNdist). All data were strictly collected from topsoil layers (0–20 cm depth). An overview of the main statistics and probability density distribution histograms of these four datasets is given in Tables S1, S2, S8, and S9 and Figs. S7–S10. The detailed procedures for data processing for these datasets are available in Text S6.

Environmental Covariates: The formation of HFO is closely related to the interactions between soil properties, vegetation, climate, parent material, land surface moisture and thermal conditions, and human activity over time. Following Liu et al.5, we incorporated 14 environmental covariates (climate, topography, and land surface characteristics) as part of the predictor variables in EUHFO and CNHFO for modeling HFO content. These environmental covariates are derived from satellite data such as MODIS and other global datasets (details in Table S10 and accompanying notes).

Spatial Covariates: To account for spatial association in HFO content, longitude, latitude, and the six spatial interpolation covariates used in random forest spatial interpolation (RFSI), namely the values observed at the three observation points nearest the target point and the distances to those points42, were introduced as covariates carrying information about spatial correlation.
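The six RFSI-style covariates described above (three neighbouring observations and three distances) can be assembled as in the following minimal sketch. The function name, the tuple layout, and the use of planar Euclidean distance are assumptions made for illustration; with longitude and latitude, a geodesic or projected distance would be more appropriate.

```python
from math import hypot


def rfsi_covariates(target, observations, n_neighbors=3):
    """Return [value_1, dist_1, value_2, dist_2, ...] for the nearest observations.

    target: (x, y) of the prediction point.
    observations: iterable of (x, y, value), e.g. measured HFO contents.
    """
    ranked = sorted(
        (hypot(x - target[0], y - target[1]), value)
        for x, y, value in observations
    )
    nearest = ranked[:n_neighbors]
    if len(nearest) < n_neighbors:
        raise ValueError("not enough observations for the requested neighbours")
    covariates = []
    for distance, value in nearest:
        covariates.extend([value, distance])
    return covariates


# Hypothetical observations (x, y, HFO value); not data from the study.
obs = [(3.0, 4.0, 2.0), (1.0, 0.0, 1.5), (0.0, 2.0, 3.0), (10.0, 10.0, 9.0)]
features = rfsi_covariates((0.0, 0.0), obs)
```

When such covariates are built for training points, the point itself would normally be excluded from its own neighbour list so that the target value does not leak into the features.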

Multisurface model description and validation

The EUdist and CNdist datasets, filled with predicted HFO, were then used in the MSMs to predict the dissolved Cd fraction and speciation distribution in those two areas. MSMs are thermodynamic equilibrium-based geochemical models19 that include four reactive surfaces for Cd adsorption, i.e., soil organic matter (SOM), dissolved organic matter (DOM), iron oxides, and clay minerals (<0.002 mm). Thus, the total Cd, SOM, DOM, HFO, and clay fraction for each data point in both EUdist and CNdist were used as input variables to calculate the equilibrium dissolved Cd (MSMs-Cd) and Cd speciation (Cd adsorption on the various solid phases) by MSMs. By comparison with the EC-derived background electrolyte concentration, 0.001 M CaCl2 was suggested to represent coexisting ions41. The soil-solution ratio was set at 10 g L−1 based on its demonstrated performance in predicting both porewater Cd and crop uptake23,43. The calculations were performed using the ORCHESTRA program44. The detailed model settings and calculation processes are described in Texts S7, S8. Validation against plant uptake demonstrated that MSMs-Cd is an effective indicator of the bioavailability of Cd in soils (Text S4)43,45.
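The mass-balance idea behind a multisurface calculation can be illustrated with a deliberately simplified sketch in which each surface binds Cd linearly in proportion to the dissolved concentration. This is not the MSM used in the study: the actual surface complexation models are solved in ORCHESTRA (Texts S7, S8), and the affinity coefficients, sorbent amounts, and units below are hypothetical placeholders.

```python
def toy_partition(total_cd, sorbent_amounts, affinities):
    """Split total Cd between solution and surfaces under linear sorption.

    Assumes sorbed_i = K_i * amount_i * dissolved, so that
    total = dissolved * (1 + sum_i K_i * amount_i).
    """
    weights = {
        name: affinities[name] * amount
        for name, amount in sorbent_amounts.items()
    }
    dissolved = total_cd / (1.0 + sum(weights.values()))
    result = {"dissolved": dissolved}
    for name, weight in weights.items():
        result[name] = weight * dissolved
    return result


# Hypothetical inputs in arbitrary but consistent units.
amounts = {"HA": 2.0, "FA": 1.0, "clay": 5.0, "HFO": 1.0}
affinity = {"HA": 1.0, "FA": 0.5, "clay": 0.05, "HFO": 0.25}
speciation = toy_partition(100.0, amounts, affinity)
```

With these placeholder values a quarter of the Cd stays in solution and the remainder is distributed over the four surfaces, the largest share on HA. In a real MSM the binding is nonlinear and depends on pH and competing ions, which is why the study relied on a geochemical solver and then trained an ML surrogate on its output.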

Machine learning model development and selection

Latin hypercube sampling-based dataset: Latin hypercube sampling (LHS) and its variants are widely used random sampling techniques that allow for more efficient recreation of input distributions with fewer samples from a multidimensional distribution (more details about LHS and other sampling methods can be found in Text S9). The sampling methods were applied in this study to determine which ML model requires the smallest training set to achieve optimal prediction performance. LHS was employed repeatedly to generate a series of combinations of soil properties and total Cd content, reconstructing the distribution of soil properties in EUdist (which is larger and has more consistent measurement standards than CNdist) with fewer data points and expanding the predictive capabilities of the ML models (Text S10). MSMs were applied to these combinations (paired soil properties and total Cd content) to calculate the solid/liquid distribution of Cd. The combinations served as descriptors, while the outputs of the MSMs served as targets, collectively forming the LHS-based dataset (labeled LSH-N according to the number of sampling repetitions (N)). Various ML models were trained and compared on LSH-N to choose the optimal ML model with the smallest dataset. LSH-4000 was eventually chosen as the final training set and named Cddist (Text S3).
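As a minimal illustration of the LHS idea, the sketch below divides each variable's range into equally probable strata and draws exactly one point per stratum, shuffling the strata independently for each variable. It assumes uniform marginals within hypothetical bounds; the study instead reconstructs the empirical distribution of EUdist (Texts S9, S10), which would require mapping the unit-interval samples through each variable's inverse cumulative distribution.

```python
import random


def latin_hypercube(n_samples, bounds, seed=None):
    """Draw n_samples points; each variable gets one point per stratum."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each of the n_samples equal-width strata.
        points = [(k + rng.random()) / n_samples for k in range(n_samples)]
        rng.shuffle(points)  # decouple the strata across variables
        columns.append([low + p * (high - low) for p in points])
    return [list(row) for row in zip(*columns)]


# Hypothetical ranges: pH, SOC (g kg-1), clay (%), total Cd (mg kg-1).
bounds = [(3.5, 9.0), (1.0, 200.0), (1.0, 70.0), (0.01, 1.5)]
samples = latin_hypercube(8, bounds, seed=42)
```

Each row of `samples` is one soil property and total Cd combination that could then be passed to the geochemical model to obtain a training target.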

Machine learning model development: Ten ML algorithms were used in this study, including five traditional learning methods (ridge regression (Ridge), Lasso regression (Lasso), Elastic Net regression (ElasticNet), K-Nearest Neighbors (KNN) and Support Vector Regression (SVR)), four ensemble models (random forest (RF), extremely randomized trees (ERT), Gradient Boosting Regression Tree (GBRT), and eXtreme Gradient Boosting (XGBoost)), and one deep learning model (Multi-Layer Perceptron (MLP)). The hyperparameters adjusted for these algorithms are listed in Table S11. All models were employed in constructing geochemical-integrated ML models (a more detailed procedure is given in Text S3).
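Because GBRT was ultimately selected, a compact sketch of the underlying idea may be helpful: each new tree is fitted to the residuals of the current ensemble, and its output is added with a small learning rate. The version below uses single-split trees (stumps) and squared-error loss in plain Python purely for illustration; it is not the implementation, library, or hyperparameter setting used in the study (see Text S3 and Table S11), and the data are hypothetical.

```python
def fit_stump(X, residuals):
    """Least-squares regression stump: (feature, threshold, left, right)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = sum((r - left_mean) ** 2 for r in left)
            sse += sum((r - right_mean) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    if best is None:  # every feature is constant: predict the mean residual
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_gbrt(X, y, n_estimators=50, learning_rate=0.1):
    """Gradient boosting for squared error: fit each stump to the residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [t - p for t, p in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, row)
            for p, row in zip(predictions, X)
        ]
    return base, learning_rate, stumps


def predict_gbrt(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


# Hypothetical data: the target drops once the first feature exceeds about 6.
X = [[4.0, 10.0], [5.0, 30.0], [5.5, 20.0], [6.5, 15.0], [7.0, 25.0], [8.0, 5.0]]
y = [50.0, 50.0, 50.0, 10.0, 10.0, 10.0]
model = fit_gbrt(X, y, n_estimators=30, learning_rate=0.5)
```

In practice, deeper trees, subsampling, and tuned learning rates are used, and an established library implementation would be preferable to hand-written code.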

To obtain the best model for predicting HFO, ten ML models were trained on the EUHFO and CNHFO datasets. To select the optimal ML model that can mimic MSM computations with minimal training data, the LHS technique was applied repeatedly to EUdist, with different numbers of repetitions (N), to form the LSH-N datasets. Then, the distribution of Cd(II) at solid‒liquid interfaces in the LSH-N datasets was calculated using the ORCHESTRA program. Finally, the four ensemble models and MLP were trained and compared using soil properties in LSH-N as features and the outputs of ORCHESTRA as targets.
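One possible way to formalize "the model that needs the least training data" is to pick the smallest dataset size at which any model comes within a tolerance of the best score observed. The rule, the tolerance, the model names, and the scores below are all assumptions for illustration; the criterion actually applied in the study is described in Text S3.

```python
def pick_model(curves, tolerance=0.005):
    """curves: {model_name: {n_samples: r2}} -> (model_name, n_samples, r2).

    Chooses the smallest n_samples whose R2 lies within `tolerance` of the
    best R2 over all models and sizes; ties go to the higher R2.
    """
    best = max(r2 for curve in curves.values() for r2 in curve.values())
    candidates = [
        (n, -r2, name)
        for name, curve in curves.items()
        for n, r2 in curve.items()
        if r2 >= best - tolerance
    ]
    n, neg_r2, name = min(candidates)
    return name, n, -neg_r2


# Made-up learning curves for two placeholder models.
curves = {
    "model_A": {500: 0.950, 1000: 0.980, 2000: 0.990, 4000: 0.996, 8000: 0.997},
    "model_B": {500: 0.930, 1000: 0.960, 2000: 0.975, 4000: 0.985, 8000: 0.990},
}
choice = pick_model(curves)
```

With these placeholder curves the rule selects model_A at 4000 samples, because doubling the data to 8000 improves R2 by less than the tolerance.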

Feature importance evaluation: To evaluate the importance of different descriptors and interpret the predictive model, three feature importance analysis methods, namely, permutation feature importance analysis, impurity feature importance analysis, and the SHapley Additive exPlanations (SHAP) method, were applied. Compared to common feature importance methods, SHAP estimates not only how much but also how each feature contributes to model prediction, providing a fresh perspective for interpreting the interplay between soil properties and HMs46. Moreover, the results can also be used to simplify the ML model, and the most valuable features can be identified by combining the feature importance and correlation results46.
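Of the three methods, permutation importance is the simplest to state: shuffle one feature column, re-evaluate the model, and record how much the error increases. The following sketch assumes mean squared error as the score and uses a hypothetical toy model and dataset; it is not the study's implementation.

```python
import random


def mean_squared_error(model, X, y):
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mean_squared_error(model, shuffled, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical model that depends only on the first feature."""
    return 3.0 * row[0]


X = [[float(i), float((i * 3) % 8)] for i in range(8)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
```

The second feature receives an importance of exactly zero because the toy model ignores it, whereas shuffling the first feature degrades the predictions. Unlike SHAP, this score gives no indication of the direction of a feature's effect.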

Knowledge transfer: Transfer learning based on neural networks has made considerable progress in processing large amounts of image and text data owing to its speed, performance, and cost savings47. However, effectively transferring knowledge with small datasets remains a challenge in the environmental field. Therefore, a knowledge transfer (KT) algorithm based on tree algorithms and tabular data has been proposed48. Specifically, the Cd speciation distribution results predicted by the best-performing GBRT model were used as additional features to predict plant uptake. We compared the performance of the models after KT with that of the standalone nonmechanistic data-driven (DD) models trained solely on the absorption data of Chinese wheat.
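The feature-augmentation step at the heart of this KT scheme can be sketched as follows: predictions from an upstream speciation model are appended to the soil properties before the downstream uptake model is fitted. In this illustration the upstream model is a hypothetical stand-in function rather than the trained GBRT, the downstream learner is a plain k-nearest-neighbour regressor rather than a tree-based model, and all numbers are invented.

```python
def augment_with_speciation(soil_rows, speciation_model):
    """Append upstream speciation predictions to each soil feature row."""
    return [list(row) + list(speciation_model(row)) for row in soil_rows]


def knn_predict(train_X, train_y, row, k=3):
    """Average target of the k nearest training rows (squared Euclidean)."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, row)), target)
        for x, target in zip(train_X, train_y)
    )
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


def toy_speciation(row):
    """Hypothetical stand-in for the speciation model: [dissolved, sorbed]."""
    total_cd, ph, soc = row
    dissolved = total_cd / (1.0 + 0.5 * ph + 0.05 * soc)
    return [dissolved, total_cd - dissolved]


# Invented training data: (total Cd, pH, SOC) and grain Cd.
soil = [[0.3, 5.0, 10.0], [0.5, 6.5, 20.0], [0.8, 7.5, 15.0], [0.4, 8.0, 30.0]]
grain_cd = [0.12, 0.10, 0.09, 0.03]

dd_features = soil                                    # data-driven (DD) inputs
kt_features = augment_with_speciation(soil, toy_speciation)  # KT inputs

new_site = [0.6, 6.0, 12.0]
dd_estimate = knn_predict(dd_features, grain_cd, new_site, k=2)
kt_estimate = knn_predict(
    kt_features, grain_cd, new_site + toy_speciation(new_site), k=2
)
```

The DD and KT models differ only in their inputs, which is what allows feature importance analysis on the KT model to reveal how much the transferred speciation features contribute. A distance-based learner such as this one would also need feature scaling in practice.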