Abstract
Evaluating heavy metal bioavailability is crucial for comprehensive soil contamination assessment but challenging at large scales because the analytical procedures are complex and resource-intensive. The amount of dissolved metal in soils represents the relative solubility and potential mobility of cadmium, which is a key factor determining bioavailability. Here, we developed a geochemical-integrated machine learning framework using multi-source data to predict cadmium speciation distribution in European and Chinese non-industrial topsoils. Average total cadmium content in Chinese topsoils (0.41 mg kg−1) was ~10.8% higher than that in Europe, while average dissolved cadmium content (113.2 μg L−1) was ~16.8% higher. Mechanistic interpretation revealed that lower pH, soil organic matter, and amorphous ferrihydrite contents mainly contributed to the higher bioavailability in China. The framework, coupled with knowledge transfer bridging the knowledge gap between geochemical processes and crop uptake, would facilitate informed decision-making and targeted remediation measures for sustainable agricultural practices and long-term environmental health.

Introduction
Among the seventeen Sustainable Development Goals (SDGs) set by the United Nations for the year 2030, eight rely on maintaining a healthy soil environment. However, soil contamination by heavy metals (HMs) poses a formidable threat to arable soil resources worldwide1. Due to the persistent, nondegradable, and bioaccumulative nature of toxic HMs, HM-contaminated soils have long-lasting detrimental effects on global food safety and human health. According to estimates, more than 9 million premature deaths are attributable to HM pollution2, underscoring the critical importance of accurately assessing the environmental risks of soil HMs and developing appropriate soil remediation strategies3.
Regional or continental maps of HM distributions in agricultural soils have been lacking until recently. Prior to the late 1990s, only a few regional- to continental-scale soil geochemical surveys with sufficiently large sampling coverage were carried out to assess the topsoil distribution of HMs. Notably, China conducted three major National Soil Surveys in 1959, 1979, and 2022, each with greater sampling density and more soil properties than the last4. Additionally, in 2012, China conducted its first national survey focused on agricultural soil pollution, particularly HM contents5. In the European Union (EU), three major soil geochemical surveys were conducted with repeated sampling every few years: the FORum of European Geological Surveys (FOREGS)6 in 1997, the GEochemical Mapping of Agricultural and grazing land Soil (GEMAS)7 in 2008, and the Land Use/land Cover Area frame Survey (LUCAS)8,9 in 2009. The LUCAS Topsoil Survey sampled a greater variety of land-use classes and soil types with more soil properties and a greater sampling density compared with the other two surveys (on average, 1 site/5000 km2 in FOREGS, 1 site/2500 km2 in GEMAS, and 1 site/200 km2 in LUCAS). These data and the derived maps have proven to be useful tools for environmental and resource management at regional or continental scales. For example, spatial assessments of total Hg10, Cu11, Zn12, and Cd13 in European soils have been conducted utilizing their total contents in LUCAS, which facilitated the identification of the key properties that determine the spatial variability of HMs in soils.
While total HM contents are commonly used to assess soil contamination risk, they are insufficient for assessing HM mobility, bioavailability, and toxicity in soils14. The interactions and adsorption of HMs on soil components induce considerable variability in bioavailability across different soil matrices15. Consequently, the bioavailable fraction of HMs, which represents the portion that can be taken up by plants and soil organisms, has been proposed as a more reliable parameter for risk assessment16. Various methods, including chemical extraction and modeling methods, have been developed to determine the bioavailability of HMs in soils17,18. Among these, the Multi-Surface Models (MSMs) have shown promising results in predicting the bioavailability of HMs, such as Cd, Ni, and Zn, across different soil types19. MSMs assume that HMs in the soil solution are in thermodynamic equilibrium with different solid phases, including organic matter (OM), iron oxides, and clay minerals19, and therefore use geochemical surface complexation models to capture the complex interaction and adsorption mechanisms between HMs and soil matrices with greater generalizability. However, MSMs require a large number of model parameters20,21,22, and difficulties arise because the reaction parameters of different geochemical surface complexation models diverge owing to their different simplifications and assumptions23. Therefore, a reliable and convenient modeling approach is urgently needed to accurately predict the mechanistic adsorption distribution and bioavailability of HMs in soils at larger scales.
Machine learning (ML) has become an indispensable tool in environmental science and geoscience due to its robustness and high prediction accuracy, which provides a new path forward in directly leveraging the ever-expanding wealth of scientific data available in the literature24. This technique has demonstrated success in high-resolution soil property mapping. One application involves estimating global patterns of soil organic carbon (SOC) decomposition rates by fusing multisource data to uncover the hidden nonlinear relationships among geochemical factors25. Moreover, ML techniques have emerged as promising alternatives for dealing with multiphysical or chemical processes by embedding the knowledge of underlying physical or chemical laws25. Recently, hybrid ML geochemical models have been proposed for determining the fate and transport of uranium anions in subsurface environments, which has the potential to solve the problems that limit the development of self-consistent geochemical surface complexation models as well as MSMs for large-scale applications26.
In this study, we developed a geochemical-integrated ML framework to assess the speciation distribution of Cd in soils. Our approach, which swiftly predicts the bioavailable fraction of soil Cd, offers a mechanistic geochemical interpretation at the continental scale. It is instrumental in facilitating risk assessment, guiding policy formulation, and implementing targeted remediation measures for soil HM pollution, which in turn promotes sustainable agricultural practices and substantially contributes to long-term environmental health across various scales, spanning from regional to continental and even global levels.
Results and discussion
ML-based framework
Our geochemical-integrated ML framework utilized regional soil surveys, crop uptake data and MSM modeling to predict the Cd associated with soil interfaces and the dissolved fraction of nonindustrial soil Cd at the continental scale (Fig. 1). The overall structure of the work can be broadly divided into four sections (details on the ML framework and validation procedures are described in “Methods”): (1) To predict the amorphous ferrihydrite content (represented by hydrous ferric oxides and abbreviated as HFO) and Cd speciation distribution (abbreviated as dist) in the EU and China, four datasets (EUHFO, EUdist, CNHFO and CNdist) were compiled; (2) ML algorithms were developed on the EUHFO and CNHFO to predict the HFO content at each site in the EUdist and CNdist (Methods and Texts S1–S2); (3) The Cddist dataset was formed by the Latin hypercube sampling technique and was combined with MSM output variables to train ML models (Methods and Text S3). The best-performing ML algorithm was used to predict the solid/liquid distribution at each site in EUdist and CNdist, and a comparative analysis of the differences between maps of the total and dissolved fractions of Cd was conducted; and (4) Knowledge transfer (KT) models were established within the ML framework to estimate the accumulation of Cd in wheat grains and roots.
Relationships between soil properties and Cd distribution patterns
Based on previous research on the retention of Cd in soils, the bioavailable Cd content in soils is predominantly controlled by adsorption onto four major reactive soil components: soil organic matter (SOM), dissolved organic matter (DOM), clay, and HFO. Cd adsorption to SOM and DOM was described with different ratios of humic acid (HA) and fulvic acid (FA). HA-Cd, FA-Cd, Clay-Cd, and HFO-Cd represent the amount of Cd adsorbed on these surfaces (Methods and Text S7). The reliability of the MSMs in both the EU and China was demonstrated by validating the correlation between the MSM-calculated equilibrium dissolved Cd (MSMs-Cd) and empirical data on plant uptake and extracted Cd. The robust linear correlation demonstrated that MSM-calculated dissolved Cd effectively indicates soil Cd bioavailability (Text S4), and it was therefore used to evaluate Cd bioavailability. Within the geochemical-integrated ML framework, the gradient boosting regression tree (GBRT) model trained on the Cddist dataset achieved satisfactory R2 values of 0.998 and 0.989 on the internal and external test sets, respectively, and thus was chosen as the final algorithm to predict the solid‒liquid distribution of Cd at each site (Text S3). The dissolved Cd concentration predicted by GBRT was termed ML-Cd.
Taking the prediction of the ML-Cd as an example, three different feature analysis techniques were utilized to evaluate the importance of different features (Figs. S1 and S2). The total Cd content, pH, and SOC (accounting for 58% of OM27) content were identified as the three most important features for the prediction of the ML-Cd, among which SOC was the dominant sorbent in the soil. As shown in Fig. S1a, SHapley Additive exPlanation (SHAP) analysis (a feature importance method, detailed in Methods) for predicting ML-Cd revealed that pH, SOC, clay, and HFO were negatively correlated with the content of ML-Cd. Conversely, higher concentrations of total Cd were associated with greater predictions of ML-Cd. A distinct pattern was observed for total Cd content: instances of high Cd concentration (red points) formed a dense cluster with small, positive SHAP values, while instances of lower Cd concentration (blue points) extended further toward the left, suggesting that low total Cd concentration had a stronger negative impact on ML-Cd. More details about the interaction effects can be found in the dependence plot, interaction plot, and heatmap (Text S5 and Fig. S3).
HFO is identified as an important interface for Cd, since it can enhance Cd fixation during soil redox cycles28. Furthermore, it plays a crucial role in certain localities, especially in areas with high pH and low SOC content29. For example, the pH on the Iberian Peninsula increases from north to south, accompanied by decreases in the contents of SOC and HFO (Fig. S4)30,31. Three representative areas within the region were selected, the corresponding distributions of Cd species on various soil components were computed for each soil site within these areas, and the 25 sites with the highest proportions of HFO-Cd (the percentage ratio of HFO-Cd relative to total soil Cd) are presented in Fig. 2. The results indicated that the proportion of Cd adsorbed by HFO in these three areas increased from north to south, with average values of 1.5 ± 2.9%, 15.5 ± 10.5%, and 30.7 ± 4.0%, respectively. Similarly, along the east coast of Italy, where the soil exhibited high pH, low SOC content, and relatively high HFO levels, the adsorption capacity of HFO was more pronounced, with an average HFO-Cd proportion of 33.3 ± 6.0%. The fraction of Cd on HFO ranged from 0 to 53.1% of the total Cd in the four selected areas.
Fig. 2: a pH interpolated map of the EU (adapted from Ballabio et al.30, licensed under CC BY 4.0), with the study area shown as a black inset; b four hotspots exemplifying the varying significance of the HFO-Cd fraction.
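The HFO-Cd proportion used above is a simple ratio, and the ranking of sites follows directly from it. The snippet below is a minimal sketch of that bookkeeping, not the authors' code: the function names, the dictionary keys, and the three example sites are hypothetical, and the use of the sample standard deviation for the "±" values is an assumption.

```python
import statistics

def hfo_cd_share(hfo_cd, total_cd):
    """Percentage of total soil Cd adsorbed on HFO (both inputs in the same units)."""
    if total_cd <= 0:
        raise ValueError("total_cd must be positive")
    return 100.0 * hfo_cd / total_cd

def top_sites(sites, n=25):
    """Return the n sites with the highest HFO-Cd share, highest first."""
    return sorted(
        sites,
        key=lambda s: hfo_cd_share(s["hfo_cd"], s["total_cd"]),
        reverse=True,
    )[:n]

def summarize(sites):
    """Mean and sample standard deviation of the HFO-Cd share (%)."""
    shares = [hfo_cd_share(s["hfo_cd"], s["total_cd"]) for s in sites]
    return statistics.mean(shares), statistics.stdev(shares)

# Hypothetical sites for illustration only (not data from the study).
example_sites = [
    {"id": "A", "hfo_cd": 1.0, "total_cd": 4.0},   # 25 %
    {"id": "B", "hfo_cd": 1.0, "total_cd": 10.0},  # 10 %
    {"id": "C", "hfo_cd": 2.0, "total_cd": 5.0},   # 40 %
]
```

Calling `top_sites(example_sites, n=2)` returns sites C and A in that order, and `summarize` gives the kind of mean ± SD figure reported for each area.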
Comparison of the spatial distribution of the total and dissolved fractions of soil Cd between the EU and China
The Cd content in soil is influenced not only by human activities (e.g., industrial activities, mining processes, and fertilization inputs13) but also by the regional geological background and weathering-to-soil processes. Specifically, soils over carbonate bedrocks accumulate Cd through a self-regulating cycle: carbonate rocks have a high potential for releasing Cd, while OM and Fe/Mn oxides immobilize it via adsorption/complexation32. This equilibrium between mobilization and retention explains the consistent spatial overlap of high-Cd areas and carbonate-rich regions in both the EU and China (Fig. 3a, b)33. As shown in Tables S1 and S2, the total Cd content in the EU varied between 0.11 and 1.55 mg kg−1, with an average of 0.37 mg kg−1 and a standard deviation of 0.17 mg kg−1, while the total Cd content in China ranged from 0.01 to 14.22 mg kg−1, with an average of 0.41 mg kg−1 and a standard deviation of 1.03 mg kg−1. The total Cd content in Chinese topsoil was approximately 10.8% greater than that in the EU.
Fig. 3: a European map of total Cd13; b Chinese map of total Cd; c European map of dissolved Cd; d Chinese map of dissolved Cd.
The prediction of the Cd solid‒liquid distribution at each site was carried out using the geochemical-integrated ML framework (GBRT trained on the Cddist dataset), and the major statistics of the prediction results are summarized in Table S3. Figs. 3c, d and S5 present distribution maps illustrating the ML-Cd and adsorbed-Cd concentrations on different soil reactive components in the EU and China. In particular, the predicted range for the ML-Cd within nonindustrial topsoil in the EU was 0.3–971.2 μg L−1, with a mean value of 96.9 μg L−1 and a standard deviation of 116.5 μg L−1. In China, the corresponding range was 0.06–4151.2 μg L−1, with a mean of 113.2 μg L−1 and a standard deviation of 337.3 μg L−1. The mean ML-Cd in China was thus 16.8% greater than that in the EU. Table S4 and Fig. S6 present the statistical results and distribution maps of the proportions of various Cd forms to the total Cd content in China and the EU, respectively. On average, the proportions of ML-Cd to the total Cd in the EU and China were 26.3 ± 22.9% and 24.6 ± 18.0%, respectively. Moreover, the percentages of SOM-bound Cd (HA-Cd + FA-Cd) were 61.1 ± 20.0% and 65.4 ± 19.3%, respectively, surpassing the percentages of Cd associated with the other interfaces (clay-Cd, constituting 3.5 ± 3.3% and 5.8 ± 4.6%, and HFO-Cd, constituting 9.1 ± 11.7% and 4.3 ± 6.8% of the total Cd in the EU and China, respectively). Based on Tables S1 and S2, the relatively higher total Cd and lower contents of OC and HFO in Chinese topsoil were thought to be the main factors contributing to the greater bioavailability. With nearly comparable mean pH values (6.33 in the EU and 6.64 in China), the mean OC content in China (14.76 g kg−1) was markedly lower than that in the EU (36.53 g kg−1), and a similar pattern was also observed for the mean HFO content (2.6 g kg−1 in the EU and 1.6 g kg−1 in China). Consequently, as shown in Table S3, the mean adsorption amounts of Cd on HA and HFO in China are less than those in the EU (148.6 μg L−1 and 7.2 μg L−1 in China, respectively, compared to 194.3 μg L−1 and 28.1 μg L−1 in the EU).
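The percentage differences and mean proportions quoted above can be reproduced from the reported means with a few lines of arithmetic. The following sketch uses only values stated in the text; the function name and dictionary layout are illustrative choices.

```python
def percent_higher(value, reference):
    """Relative excess of value over reference, in percent."""
    return 100.0 * (value - reference) / reference

# Mean values as reported in the text.
total_cd = {"EU": 0.37, "China": 0.41}        # mg kg-1
dissolved_cd = {"EU": 96.9, "China": 113.2}   # ug L-1 (ML-Cd)

total_excess = round(percent_higher(total_cd["China"], total_cd["EU"]), 1)
dissolved_excess = round(percent_higher(dissolved_cd["China"], dissolved_cd["EU"]), 1)

# Mean proportions (%) of each Cd pool relative to total Cd, as reported.
pools = {
    "EU": {"ML-Cd": 26.3, "SOM-Cd": 61.1, "clay-Cd": 3.5, "HFO-Cd": 9.1},
    "China": {"ML-Cd": 24.6, "SOM-Cd": 65.4, "clay-Cd": 5.8, "HFO-Cd": 4.3},
}
pool_sums = {region: round(sum(p.values()), 1) for region, p in pools.items()}

print(total_excess, dissolved_excess)  # 10.8 16.8
print(pool_sums)                       # {'EU': 100.0, 'China': 100.1}
```

The four mean proportions close the mass balance to within rounding (100.0% for the EU and 100.1% for China), which is a quick consistency check on the reported speciation.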
As shown in Fig. 3a, the highest amount of total Cd in nonindustrial topsoil was found in Ireland, followed by northern Spain, northern Sweden, Finland, and Poland. Lithuania, Slovenia, central Romania, and the west coast of Greece also exhibited relatively high Cd contents. Nevertheless, Fig. 3c depicts a different picture in which northern Sweden, Finland, and Poland exhibited the highest bioavailability rather than Ireland; on the other hand, Italy showed higher total Cd levels but a very low Cd bioavailability. The total Cd and MSM-Cd in England, Estonia, Latvia, Hungary, and Bulgaria posed low environmental risks. By combining SHAP analysis and maps of pH30, SOC31, and clay31, it was determined that pH and SOC content were the primary drivers behind the differences in the spatial distributions of total Cd and ML-Cd (Fig. S6). Typically, low soil pH was the major driver for regions with lower total Cd contents but higher Cd bioavailability, such as Sweden and Poland. This is consistent with the scenario depicted in Fig. S6, where regions with a high proportion of ML-Cd closely align with areas of lower pH. On the other hand, despite the high total Cd content in Ireland, the high SOC content there resulted in substantial Cd adsorption by HA and FA and therefore a comparatively low Cd bioavailability.
In contrast to those in the EU, the distributions of ML-Cd and of the proportion of ML-Cd in China show a pattern similar to that of total Cd. As shown in Fig. 3b, d, areas exhibiting higher levels of both total Cd and ML-Cd were found in Taiwan Province, Guangxi Province, and the junction of Yunnan and Guizhou Provinces. This phenomenon can be attributed to a noticeable pattern in the distribution of soil properties across China. Specifically, soil pH tends to be greater in the north and lower in the south5, while SOC and clay content tend to be lower in the north and higher in the south5,34.
Knowledge transfer for bioavailability prediction
Most regional or continental risk assessments of soil Cd have relied solely on total content13, while bioavailable Cd measurements remain problematic for risk evaluation because different extraction procedures define different types of bioavailable Cd. Therefore, we integrated the above geochemical adsorption processes with crop Cd uptake within the soil‒plant system using knowledge transfer (KT). The KT model was established by incorporating the predictive outcomes of the Cd speciation distribution obtained from the best-performing GBRT model (Table S5), and trained alongside a nonmechanistic data-driven (DD) model on Chinese wheat data. Although comparative performance metrics (performance score in Table S6; SHAP for DD shown in Fig. 4a, c, SHAP for KT shown in Fig. 4b, d) demonstrated similar accuracy between models, the KT model distinguished itself by encoding the geochemical interactions governing Cd distribution. Feature importance analysis (Fig. 4b, d) confirms that ML-Cd and clay-Cd exhibit predominant influence in the KT model, outperforming all other soil properties by substantial margins. It is physically sound that both the soluble and electrostatically clay-bound Cd can be attributed to the exchangeable fraction35. These findings are also consistent with the view that Cd bioavailability is not solely determined by the ML-Cd fraction but is also influenced by the presence of Cd at other interfaces during the dynamic Cd uptake process36. Hence, the KT model effectively addresses this complexity by leveraging the knowledge transferred from a different but related task in an end-to-end paradigm. Consequently, our geochemical-integrated ML framework successfully bridged the knowledge gap by integrating knowledge from geochemical processes into crop uptake processes, leading to a better understanding and prediction of Cd accumulation in both wheat roots and grains. Its potential applications in diverse species and complex scenarios await further validation.
Conclusions
The bioavailability and risks of HMs in agricultural soils remain important obstacles to achieving global food safety and security. Previous studies have focused on specific case studies, whereas accurate regional- and continental-scale assessments of Cd bioavailability and risk have been lacking. This study presents an approach that integrates the mechanistic aspects of geochemical complexation processes for predicting the Cd speciation distribution in soils and crop uptake, bridging the gap between laboratory research and field applications. The end-to-end geochemical-integrated ML framework, using only geographical location and a few soil properties (pH, clay fraction, and SOC content), drastically reduced the costs and time needed for accurate estimation of bioavailable Cd. By leveraging existing survey data, the framework generated maps depicting Cd contents in various phases across the EU and China, achieving substantial savings in monitoring costs. Crop uptake of Cd with mechanistic interpretation was also accomplished through the extended KT model, providing valuable insights for environmental risk assessments, sustainable food security policies, and effective soil remediation strategies. Our findings underscore that relying solely on total Cd content can underestimate risk, highlighting the importance of considering different Cd species in risk assessments. The developed framework can also be readily extended to encompass most other HMs, although prediction of bioavailable lead (Pb) and mercury (Hg) would remain challenging due to their complex behaviors and transformations in soils19,37. If more sophisticated soil-HM interaction mechanisms are incorporated into the framework, it may provide a comprehensive understanding of HM interactions and their associated risks in soils.
The accuracy of predictions is subject to several data and methodological constraints. First and foremost, while MSMs ideally require chemically reactive Cd (typically 0.43 M HNO3-extractable) as input, our reliance on aqua regia-extractable total Cd (the only consistently available metric across datasets) may overestimate MSMs-Cd38. Future refinements should prioritize either developing pH-dependent empirical functions to convert total Cd to reactive fractions, or incorporating aging models39 to better constrain chemically available Cd inputs for MSMs. For European predictions, 73% of LUCAS soil Cd concentrations fell below the detection limit (0.07 mg kg−1). Although the predictive R2 of the interpolated map reached 0.45 at a resolution of 100 m13, there are inherent uncertainties associated with the interpolated data compared to source data. Similarly, Chinese estimates combine 40-year-old national survey data with heterogeneous literature sources, potentially misrepresenting current contamination patterns. While the risk assessment of HMs tends to prioritize hotspots with high HM contents, it is important to acknowledge these limitations and strive for improved data collection and data quality in future studies. Methodological approximations, including equilibrium state assumptions (despite the nonequilibrium nature of soil processes) and interpolated soil properties, introduce further uncertainties. Geological processes such as carbonate weathering release Cd into soils, while pedological factors such as OM and Fe/Mn oxides drive its re-immobilization, collectively shaping the net available Cd pool. Incorporating these coupled mechanisms could refine the predictive accuracy of Cd bioavailability.
Despite the known biases in its predictions, the ML framework remains valuable for decision-making at regional and continental scales40. Improving data accuracy will enable better predictions and risk assessments for soil Cd contamination, facilitating informed decision-making for sustainable environmental protection.
Methods
Datasets
Soil properties: Soil organic carbon (SOC), clay minerals, and amorphous ferrihydrite (HFO) were selected as the primary solid phases responsible for Cd adsorption in soils in MSMs as well as in the geochemical-integrated machine learning framework. HFO is particularly essential for the retention of Cd in alkaline soils19,41. However, HFO content has not been determined in most literature or in the LUCAS survey series (Table S7) except for LUCAS 2018. To address this gap, we established two datasets for predicting HFO content in Europe and China: (1) the EUHFO dataset comprising HFO measurements from 2510 LUCAS 2018 sampling sites with complete characterization of 11 soil properties, and (2) the CNHFO dataset incorporating 132 literature-derived HFO measurements from China, limited to three commonly reported parameters (pH, OC, and clay content) due to data availability constraints. The resulting HFO predictions were then compiled, together with other soil properties, into the datasets used for the prediction of Cd speciation distribution at soil interfaces (EUdist and CNdist). All data were strictly collected from topsoil layers (0–20 cm depth). An overview of the main statistics and probability density distribution histograms of these four datasets is given in Tables S1–S2 and S8–S9 and Figs. S7–S10. The detailed procedures for data processing for these datasets are available in Text S6.
Environmental Covariates: The formation of HFO is closely related to the interactions between soil properties, vegetation, climate, parent material, land surface moisture and thermal conditions, and human activity over time. Following Liu et al.5, we incorporated 14 environmental covariates (climate, topography, and land surface characteristics) as part of predictive variables in EUHFO and CNHFO for modeling HFO content. These environmental covariates are derived from satellite data such as MODIS and other global datasets (details in Table S10 and accompanying notes).
Spatial Covariates: To take the spatial associations of HFO content between eigenvalues into consideration, longitude, latitude, and the 6 spatial interpolation covariates used in random forest spatial interpolation (RFSI), which include values obtained at three observation points near the target point and the distance between the points42, were introduced as covariates carrying information about spatial correlation.
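The six RFSI-style covariates described above (three neighbouring observations and their distances to the target point) can be sketched as follows. This is an illustrative reading of the description rather than the authors' implementation: the function name, the planar Euclidean distance, and the example coordinates are assumptions, and a real application would use projected or great-circle distances and exclude the target site itself when it is also an observation.

```python
import math

def rfsi_covariates(target_xy, observations, n_nearest=3):
    """Return [v1, v2, v3, d1, d2, d3] for the n nearest observations.

    observations: list of ((x, y), value) pairs in planar coordinates.
    Values and distances are ordered from nearest to farthest.
    """
    if len(observations) < n_nearest:
        raise ValueError("not enough observations")
    ranked = []
    for (x, y), value in observations:
        distance = math.hypot(x - target_xy[0], y - target_xy[1])
        ranked.append((distance, value))
    # Stable sort on distance only, so ties keep their input order.
    ranked.sort(key=lambda pair: pair[0])
    nearest = ranked[:n_nearest]
    values = [value for _, value in nearest]
    distances = [distance for distance, _ in nearest]
    return values + distances

# Hypothetical HFO observations (g kg-1) around a target at the origin.
example_obs = [((3.0, 4.0), 2.0), ((0.0, 1.0), 1.5), ((6.0, 8.0), 3.0), ((1.0, 0.0), 2.5)]
example_features = rfsi_covariates((0.0, 0.0), example_obs)
# example_features == [1.5, 2.5, 2.0, 1.0, 1.0, 5.0]
```

Together with longitude and latitude, these six numbers give the model an explicit description of the local neighbourhood of each prediction point.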
Multisurface model description and validation
The EUdist and CNdist datasets, completed with the predicted HFO contents, were then used in the MSMs to predict the dissolved Cd fraction and speciation distribution in those two areas. MSMs are thermodynamic equilibrium-based geochemical models19 that include four reactive surfaces for Cd adsorption, i.e., soil organic matter (SOM), dissolved organic matter (DOM), iron oxides, and clay minerals (<0.002 mm). Thus, the total Cd, SOM, DOM, HFO, and clay fraction for each data point in both EUdist and CNdist were used as input variables to calculate the equilibrium dissolved Cd (MSMs-Cd) and Cd speciation (Cd adsorption on the various solid phases) by MSMs. By comparison with the EC-derived background electrolyte concentration, 0.001 M CaCl2 was suggested to represent coexisting ions41. The soil-solution ratio was set at 10 g L−1 based on its demonstrated performance in predicting both Cd in porewater and crop uptake23,43. The calculations were performed using the ORCHESTRA program44. The detailed model settings and calculation processes are described in Texts S7, S8. Validation against plant uptake demonstrated that MSMs-Cd is an effective indicator of the bioavailability of Cd in soils (Text S4)43,45.
Machine learning model development and selection
Latin hypercube sampling-based dataset: Latin hypercube sampling (LHS) and its variants are widely used random sampling techniques that allow for more efficient recreation of input distributions with fewer samples from a multidimensional distribution (more details about LHS and other sampling methods can be found in Text S9). The sampling methods were applied in this study to determine which ML model requires the smallest training set to achieve optimal prediction performance. LHS was employed repeatedly to generate a series of combinations of soil properties and total Cd content, reconstructing the distribution of soil properties in the EUdist (which has a larger volume and more consistent measurement standards than CNdist) with fewer data points and expanding the predictive capabilities of ML models (Text S10). MSMs were employed on the datasets to calculate the solid/liquid distribution of Cd for these combinations (paired soil properties-total Cd content). These combinations served as descriptors, while the outputs of the MSMs served as targets, collectively forming the LHS-based dataset (labeled LHS-N according to the number of sampling repetitions, N). Various ML models were trained and compared on LHS-N to choose the optimal ML model within the smallest dataset. LHS-4000 was eventually chosen as the final training set and was named Cddist (Text S3).
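The stratification idea behind LHS can be illustrated with a short, dependency-free sketch. It assumes independent uniform ranges for each variable, which is a simplification: the study reconstructs the empirical distributions of EUdist, and the bounds and variable order below are hypothetical.

```python
import random

def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points so that each variable has exactly one point
    in each of n_samples equal-width strata of its range.

    bounds: list of (low, high) tuples, one per variable.
    """
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each stratum [i/n, (i+1)/n).
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so strata are paired at random across variables.
        rng.shuffle(unit)
        columns.append([low + u * (high - low) for u in unit])
    return [tuple(column[i] for column in columns) for i in range(n_samples)]

# Hypothetical ranges: pH, SOC (g kg-1), clay (%), total Cd (mg kg-1).
example_bounds = [(4.0, 9.0), (1.0, 100.0), (1.0, 60.0), (0.01, 2.0)]
example_design = latin_hypercube(8, example_bounds, seed=42)
```

Each row of `example_design` would then be passed to the MSM to obtain the target speciation, giving one descriptor-target pair per sampled combination.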
Machine learning model development: Ten ML algorithms were used in this study, including five traditional learning methods (ridge regression (Ridge), Lasso regression (Lasso), Elastic Net regression (ElasticNet), K-Nearest Neighbors (KNN) and Support Vector Regression (SVR)), four ensemble models (random forest (RF), extremely randomized trees (ERT), Gradient Boosting Regression Tree (GBRT), and eXtreme Gradient Boosting (XGBoost)), and one deep learning model (Multi-Layer Perceptron (MLP)). The hyperparameters adjusted for these algorithms are listed in Table S11. All models were employed in constructing geochemical-integrated ML models (the detailed procedure is given in Text S3).
To obtain the best model for predicting HFO, ten ML models were trained on the EUHFO and CNHFO datasets. To select the optimal ML model that can mimic MSM computations with minimal training data, the LHS technique was applied repeatedly to draw different numbers (N) of samples from EUdist, forming the LHS-N datasets. Then, the distribution of Cd(II) at solid‒liquid interfaces in the LHS datasets was calculated using the ORCHESTRA program. Finally, the four ensemble models and MLP were trained and compared using soil properties in LHS-N as features and the outputs of ORCHESTRA as targets.
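The comparison across training-set sizes amounts to scoring each candidate with R2 and keeping the smallest dataset that is good enough. The sketch below shows that logic under stated assumptions: the acceptance threshold and the example scores are hypothetical, and the actual selection criteria are those described in Text S3.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true has zero variance")
    return 1.0 - ss_res / ss_tot

def smallest_sufficient_size(score_by_size, threshold):
    """Smallest training-set size N whose test R2 reaches the threshold,
    or None if no size qualifies."""
    for n in sorted(score_by_size):
        if score_by_size[n] >= threshold:
            return n
    return None

# Hypothetical external-test R2 values for one model trained on LHS-N.
example_scores = {500: 0.95, 1000: 0.97, 2000: 0.98, 4000: 0.99}
chosen_n = smallest_sufficient_size(example_scores, threshold=0.985)  # 4000
```

Repeating this for every candidate algorithm gives, for each one, the smallest LHS-N at which it emulates the MSM outputs acceptably, which is the quantity compared when choosing the final model.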
Feature importance evaluation: To evaluate the importance of different descriptors and interpret the predicted model, three feature importance analysis methods, namely, permutation feature importance analysis, impurity feature importance analysis, and the SHapley Additive exPlanation (SHAP) method, were applied. Compared to common feature importance methods, SHAP estimates not only how much but also how each feature contributes to model prediction, providing a fresh perspective for interpreting the interplay between soil properties and HMs46. Moreover, the results can also be used to simplify the ML model, and the most valuable features can be identified by combining the feature importance and correlation results46.
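To make concrete how SHAP attributes both the size and the direction of each feature's contribution, the sketch below computes exact Shapley values for a toy model by enumerating all feature coalitions. It is an illustration only: absent features are replaced by a single baseline value, the linear toy model and its coefficients are hypothetical, and practical SHAP analyses of tree ensembles rely on much faster specialised algorithms.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are set to their baseline value.
    The cost grows as 2**n, so this is only practical for a few features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(z):
    """Hypothetical surrogate: dissolved Cd rises with total Cd, falls with pH and SOC."""
    total_cd, ph, soc = z
    return 2.0 * total_cd - 1.0 * ph - 0.5 * soc

example_phi = shapley_values(toy_model, x=[1.0, 2.0, 4.0], baseline=[0.0, 0.0, 0.0])
# Approximately [2.0, -2.0, -2.0]: positive for total Cd, negative for pH and SOC.
```

The values sum to the difference between the prediction and the baseline prediction, which is the additivity property that lets SHAP plots be read as signed contributions.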
Knowledge transfer: Transfer learning based on neural networks has made considerable progress in the field of processing large amounts of image and text data due to its advantages of faster speed, better performance, and cost savings47. However, it is a challenge to effectively transfer knowledge in small datasets in the environmental field. Therefore, a knowledge transfer (KT) algorithm based on tree algorithms and tabular data has been proposed48. Specifically, the Cd speciation distribution results predicted by the best-performing GBRT model were used as additional features to predict plant uptake. We compared the performance of the models after KT with that of the standalone nonmechanistic data-driven (DD) models trained solely on the uptake data of Chinese wheat.
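In tabular form, the KT step described above is a feature-augmentation step: predicted Cd pools are appended to the measured soil properties before the crop-uptake model is trained. The sketch below shows only that data flow. The function names, column names, and the placeholder speciation model are hypothetical; in the study, the speciation predictions come from the trained GBRT model.

```python
def augment_with_speciation(soil_rows, speciation_model):
    """Append predicted Cd speciation features to each soil record.

    soil_rows: list of dicts of measured soil properties.
    speciation_model: callable returning a dict of predicted Cd pools.
    The input rows are left unchanged; new dicts are returned.
    """
    augmented = []
    for row in soil_rows:
        new_row = dict(row)
        new_row.update(speciation_model(row))
        augmented.append(new_row)
    return augmented

def placeholder_speciation(row):
    """Stand-in for the trained speciation model (NOT real chemistry):
    fixed fractions of total Cd, used only to make the example run."""
    return {
        "ml_cd": 0.25 * row["total_cd"],
        "clay_cd": 0.05 * row["total_cd"],
    }

# Hypothetical wheat-field records.
dd_rows = [
    {"ph": 6.5, "soc": 12.0, "clay": 20.0, "total_cd": 4.0},
    {"ph": 7.8, "soc": 8.0, "clay": 15.0, "total_cd": 2.0},
]
kt_rows = augment_with_speciation(dd_rows, placeholder_speciation)
```

A DD model would be trained on `dd_rows` and a KT model on `kt_rows` against the same wheat Cd targets, so that any difference in performance or feature importance can be traced to the transferred speciation features.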
Data availability
The European data and maps of this study will be made available in the European Soil Data Centre (ESDAC, https://esdac.jrc.ec.europa.eu/)31. Data for this manuscript are available at Zenodo with the following link: https://doi.org/10.5281/zenodo.15667161.
Code availability
Code for this manuscript is available at Zenodo with the following link: https://doi.org/10.5281/zenodo.15667161.
References
Hou, D. et al. Metal contamination and bioremediation of agricultural soils for food safety and sustainability. Nat. Rev. Earth Environ. 1, 366–381 (2020).
Fuller, R. et al. Pollution and health: a progress update. Lancet Planet. Health 6, e535–e547 (2022).
Han, R. et al. Bibliometric overview of research trends on heavy metal health risks and impacts in 1989–2018. J. Clean. Prod. 276, 123249 (2020).
Li, M. et al. National multi-purpose regional geochemical survey in China. J. Geochem. Explor. 139, 21–30 (2014).
Liu, F. et al. Mapping high resolution national soil information grids of China. Sci. Bull. 67, 328–340 (2022).
Salminen, R., De Vos, W. & Tarvainen, T. Geochemical atlas of Europe. (Geological survey of Finland Espoo, 2005).
Reimann, C., Birke, M., Demetriades, A., Filzmoser, P. & O’Connor, P. Chemistry of Europe’s agricultural soils, part A. (Schweizerbart’sche Verlagsbuchhandlung, 2014).
Fernández-Ugalde, O. et al. LUCAS 2018 Soil Module. (Publications Office of the European Union, 2022).
Fernández-Ugalde, O. et al. LUCAS 2018 Soil Module. Presentation of dataset and results (2018).
Ballabio, C. et al. A spatial assessment of mercury content in the European Union topsoil. Sci. Total Environ. 769, 144755 (2021).
Ballabio, C. et al. Copper distribution in European topsoils: an assessment based on LUCAS soil survey. Sci. Total Environ. 636, 282–298 (2018).
Van Eynde, E., Fendrich, A. N., Ballabio, C. & Panagos, P. Spatial assessment of topsoil zinc concentrations in Europe. Sci. Total Environ. 892, 164512 (2023).
Ballabio, C., Jones, A. & Panagos, P. Cadmium in topsoils of the European Union–an analysis based on LUCAS topsoil database. Sci. Total Environ. 912, 168710 (2024).
Tóth, G., Hermann, T., Da Silva, M. R. & Montanarella, L. Heavy metals in agricultural soils of the European Union with implications for food safety. Environ. Int. 88, 299–309 (2016).
Huang, J., Fan, G., Liu, C. & Zhou, D. Predicting soil available cadmium by machine learning based on soil properties. J. Hazard. Mater. 460, 132327 (2023).
Kim, R.-Y. et al. Bioavailability of heavy metals in soils: definitions and practical implementation—a critical review. Environ. Geochem. Health 37, 1041–1061 (2015).
Tipping, E. WHAMC—a chemical equilibrium model and computer code for waters, sediments, and soils incorporating a discrete site/electrostatic model of ion-binding by humic substances. Comput. Geosci. 20, 973–1023 (1994).
Peijnenburg, W. J. G. M., Zablotskaja, M. & Vijver, M. G. Monitoring metals in terrestrial environments within a bioavailability framework and a focus on soil extraction. Ecotoxicol. Environ. Saf. 67, 163–179 (2007).
Weng, L., Temminghoff, E. J. M. & Van Riemsdijk, W. H. Contribution of individual sorbents to the control of heavy metal activity in sandy soil. Environ. Sci. Technol. 35, 4436–4443 (2001).
Bonten, L. T., Groenenberg, J. E., Weng, L. & van Riemsdijk, W. H. Use of speciation and complexation models to estimate heavy metal sorption in soils. Geoderma 146, 303–310 (2008).
Dijkstra, J. J., Meeussen, J. C. L. & Comans, R. N. J. Evaluation of a generic multisurface sorption model for inorganic soil contaminants. Environ. Sci. Technol. 43, 6196–6201 (2009).
Groenenberg, J. E., Romkens, P. F. A. M., van Zomeren, A., Rodrigues, S. M. & Comans, R. N. J. Evaluation of the single dilute (0.43 M) nitric acid extraction to determine geochemically reactive elements in soil. Environ. Sci. Technol. 51, 2246–2253 (2017).
Li, Y. et al. Combining multisurface model and Gouy–Chapman–Stern model to predict cadmium uptake by cabbage (Brassica Chinensis L.) in soils. J. Hazard. Mater. 416, 126260 (2021).
Xiang, D., Wang, G., Tian, J. & Li, W. Global patterns and edaphic-climatic controls of soil carbon decomposition kinetics predicted from incubation experiments. Nat. Commun. 14, 2171 (2023).
Wu, X. et al. Sensing prior constraints in deep neural networks for solving exploration geophysical problems. Proc. Natl. Acad. Sci. USA 120, e2219573120 (2023).
Chang, E., Zavarin, M., Beverly, L. & Wainwright, H. A chemistry-informed hybrid machine learning approach to predict metal adsorption onto mineral surfaces. Appl. Geochem. 155, 105731 (2023).
Nelson, D. W. & Sommers, L. E. Total carbon, organic carbon, and organic matter. Methods Soil Anal. Part 2 Chem. Microbiol. Prop. 9, 539–579 (1983).
Imoto, Y. & Yasutaka, T. Comparison of the impacts of the experimental parameters and soil properties on the prediction of the soil sorption of Cd and Pb. Geoderma 376, 114538 (2020).
Benjamin, M. M. & Leckie, J. O. Multiple-site adsorption of Cd, Cu, Zn, and Pb on amorphous iron oxyhydroxide. J. Colloid Interface Sci. 79, 209–221 (1981).
Ballabio, C. et al. Mapping LUCAS topsoil chemical properties at European scale using Gaussian process regression. Geoderma 355, 113912 (2019).
Panagos, P. et al. European Soil Data Centre 2.0: soil data and knowledge in support of the EU policies. Eur. J. Soil Sci. 73, e13315 (2022).
Quezada-Hinojosa, R. P., Matera, V., Adatte, T., Rambeau, C. & Föllmi, K. B. Cadmium distribution in soils covering Jurassic oolitic limestone with high Cd contents in the Swiss Jura. Geoderma 150, 287–301 (2009).
Goldscheider, N. et al. Global distribution of carbonate rocks and karst water resources. Hydrogeol. J. 28, 1661–1677 (2020).
Song, X.-D. et al. Mapping soil organic carbon content by geographically weighted regression: A case study in the Heihe River Basin, China. Geoderma 261, 11–22 (2016).
Cui, Y. & Weng, L. Interpretation of heavy metal speciation in sequential extraction using geochemical modelling. Environ. Chem. 12, 163–173 (2015).
Li, Q. et al. Speciation of heavy metals in soils and their immobilization at micro-scale interfaces among diverse soil components. Sci. Total Environ. 825, 153862 (2022).
Gworek, B., Dmuchowski, W. & Baczewska-Dąbrowska, A. H. Mercury in the terrestrial environment: a review. Environ. Sci. Eur. 32, 128 (2020).
Garforth, J. M., Tye, A. M., Young, S. D., Bailey, E. H. & Lofts, S. A comparison of characterisation and modelling approaches to predict dissolved metal concentrations in soils. Environ. Chem. 21, EN23075 (2024).
Xu, L., Lofts, S. & Lu, Y. Terrestrial ecosystem health under long-term metal inputs: modeling and risk assessment. Ecosyst. Health Sustain. 2, e01214 (2016).
Vijver, M. G., Spijker, J., Vink, J. P. & Posthuma, L. Determining metal origins and availability in fluvial deposits by analysis of geochemical baselines and solid–solution partitioning measurements and modelling. Environ. Pollut. 156, 832–839 (2008).
Li, Y. et al. Prediction of the uptake of Cd by rice (Oryza sativa) in paddy soils by a multi-surface model. Sci. Total Environ. 724, 138289 (2020).
Zhao, W. et al. Accurate prediction of soil heavy metal pollution using an improved machine learning method: a case study in the Pearl River Delta, China. Environ. Sci. Technol. 57, 17751–17761 (2023).
Zhu, B., Liao, Q., Zhao, X., Gu, X. & Gu, C. A multi-surface model to predict Cd phytoavailability to wheat (Triticum aestivum L.). Sci. Total Environ. 630, 1374–1380 (2018).
Meeussen, J. C. L. ORCHESTRA: an object-oriented framework for implementing chemical equilibrium models. Environ. Sci. Technol. 37, 1175–1182 (2003).
Qu, X. et al. A field study to predict Cd bioaccumulation in a soil-wheat system: application of a geochemical model. J. Hazard. Mater. 400, 123135 (2020).
Palansooriya, K. N. et al. Prediction of soil heavy metal immobilization by biochar using machine learning. Environ. Sci. Technol. 56, 4187–4198 (2022).
Zhuang, F. et al. A comprehensive survey on transfer learning. Proc. IEEE 109, 43–76 (2020).
Zhong, S., Zhang, Y. & Zhang, H. Machine learning-assisted QSAR models on contaminant reactivity toward four oxidants: combining small data sets and knowledge transfer. Environ. Sci. Technol. 56, 681–692 (2021).
Acknowledgements
This study was supported by the National Natural Science Foundation of China (42225701, 41977027, and 41671239) and the National Key Research and Development Program of China (2021YFC1809100 and 2020YFC1806801). LUCAS soil samples were collected and analyzed with the support of EUROSTAT, DG AGRI, CLIMA, and ENV (European Commission). We are thankful for the data support from the Soil SubCenter, National Earth System Science Data Center, National Science & Technology Infrastructure of China (http://soil.geodata.cn).
Author information
Contributions
Naichi Zhang: Writing – review and editing, Writing – original draft, Visualization, Investigation, Formal analysis, Data curation. Chen Lv: Writing – review and editing, Visualization, Formal analysis, Data curation. Yan Li: Methodology, Formal analysis, Data curation. Panos Panagos: Writing – review and editing, Funding acquisition, Data curation. Cristiano Ballabio: Writing – review and editing, Data curation. Jun Man: Writing – review and editing, Methodology, Data curation. Xueyuan Gu: Writing – review and editing, Methodology, Data curation. Fang-Jie Zhao: Writing – review and editing, Data curation. Peng Wang: Writing – review and editing, Data curation. Xingmei Liu: Writing – review and editing, Data curation. Yifan Qian: Visualization, Data curation. Peixin Cui: Writing – review and editing, Visualization, Data curation. Tongliang Wu: Writing – review and editing, Visualization, Data curation. Meiying Huang: Writing – review and editing, Visualization, Data curation. Cun Liu: Writing – review and editing, Validation, Supervision, Resources, Methodology, Funding acquisition, Conceptualization. Yujun Wang: Writing – review and editing, Validation, Supervision, Resources, Methodology, Funding acquisition, Conceptualization.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Earth and Environment thanks the anonymous reviewers for their contribution to the peer review of this work. Handling Editor(s): Somaparna Ghosh [A peer review file is available].
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, N., Lv, C., Li, Y. et al. Geochemical-integrated machine learning approach predicts the distribution of cadmium speciation in European and Chinese topsoils. Commun Earth Environ 6, 548 (2025). https://doi.org/10.1038/s43247-025-02516-6
DOI: https://doi.org/10.1038/s43247-025-02516-6