Geochemical-integrated machine learning approach predicts the distribution of cadmium speciation in European and Chinese topsoils

Zhang, Naichi; Lv, Chen; Li, Yan; Panagos, Panos; Ballabio, Cristiano; Man, Jun; Gu, Xueyuan; Zhao, Fang-Jie; Wang, Peng; Liu, Xingmei; Qian, Yifan; Cui, Peixin; Wu, Tongliang; Huang, Meiying; Liu, Cun; Wang, Yujun

doi:10.1038/s43247-025-02516-6

Article
Open access
Published: 11 July 2025

Communications Earth & Environment volume 6, Article number: 548 (2025)

Abstract

Evaluating heavy metal bioavailability is crucial for comprehensive soil contamination assessment but is challenging at large scales because the analytical procedures are complex and resource-intensive. The amount of dissolved metal in soils represents the relative solubility and potential mobility of cadmium and is a key factor determining bioavailability. Here, we developed a geochemical-integrated machine learning framework using multi-source data to predict cadmium speciation distribution in European and Chinese non-industrial topsoils. Average total cadmium content in Chinese topsoils (0.41 mg kg⁻¹) was ~10.8% higher than that in Europe, while average dissolved cadmium content (113.2 μg L⁻¹) was ~16.8% higher. Mechanistic interpretation revealed that lower pH, soil organic matter, and amorphous ferrihydrite contents mainly contributed to the higher bioavailability in China. The framework, coupled with knowledge transfer bridging the knowledge gap between geochemical processes and crop uptake, would facilitate informed decision-making and targeted remediation measures for sustainable agricultural practices and long-term environmental health.

Introduction

Among the seventeen Sustainable Development Goals (SDGs) for the year 2030 set by the United Nations, eight are reliant on maintaining a healthy soil environment. However, soil contamination by heavy metals (HMs) poses a formidable threat to arable soil resources worldwide¹. Due to the persistent, nondegradable, and bioaccumulative nature of toxic HMs, HM-contaminated soils have long-lasting detrimental effects on global food safety and human health. According to estimates, more than 9 million premature deaths are attributable to HM pollution², underscoring the critical importance of accurately assessing the environmental risks of soil HMs and developing appropriate soil remediation strategies³.

Regional or continental maps of HM distributions in agricultural soils have been lacking until recently. Prior to the late 1990s, only a few regional- to continental-scale soil geochemical surveys with sufficiently large sampling coverage were carried out to assess the topsoil distribution of HMs. Notably, China conducted three major National Soil Surveys in 1959, 1979, and 2022, each with greater sampling density and more soil properties measured⁴. Additionally, in 2012, China conducted its first national survey focused on agricultural soil pollution, particularly HM contents⁵. In the European Union (EU), three major soil geochemical surveys were conducted with repeated sampling every few years: the FORum of European Geological Surveys (FOREGS)⁶ in 1997, the GEochemical Mapping of Agricultural and grazing land Soil (GEMAS)⁷ in 2008, and the Land Use/land Cover Area frame Survey (LUCAS)^8,9 in 2009. The LUCAS Topsoil Survey sampled a greater variety of land-use classes and soil types with more soil properties and a greater sampling density compared with the other two surveys (on average, 1 site/5000 km² in FOREGS, 1 site/2500 km² in GEMAS, and 1 site/200 km² in LUCAS). These data and the derived maps have proven to be useful tools for environmental and resource management at regional or continental scales. For example, spatial assessments of total Hg¹⁰, Cu¹¹, Zn¹², and Cd¹³ in European soils have been conducted utilizing their total contents in LUCAS, which facilitated the identification of the key properties that determine the spatial variability of HMs in soils.

While total HM contents are commonly used to assess soil contamination risk, they are insufficient for assessing HM mobility, bioavailability, and toxicity in soils¹⁴. The interactions and adsorption of HMs on soil components induce considerable variability in bioavailability across different soil matrices¹⁵. Consequently, the bioavailable fraction of HMs, which represents the portion that can be taken up by plants and soil organisms, has been proposed as a more reliable parameter for risk assessment¹⁶. Various methods, including chemical extraction and modeling methods, have been developed to determine the bioavailability of HMs in soils^17,18. Among these, the Multi-Surface Models (MSMs) have shown promising results in predicting the bioavailability of HMs, such as Cd, Ni, and Zn, across different soil types¹⁹. MSMs consider that HMs in the soil solution are in thermodynamic equilibrium with different solid phases, including organic matter (OM), iron oxides, and clay minerals¹⁹, and therefore use geochemical surface complexation models to capture the complex interaction and adsorption mechanisms between HMs and soil matrices with greater generalizability. However, MSMs require a large number of model parameters^20,21,22, and difficulties arise because reaction parameters diverge among geochemical surface complexation models that rely on different simplifications and assumptions²³. Therefore, a reliable and convenient modeling approach is urgently needed to accurately predict the mechanistic adsorption distribution and bioavailability of HMs in soils at larger scales.

Machine learning (ML) has become an indispensable tool in environmental science and geoscience due to its robustness and high prediction accuracy, which provides a new path forward in directly leveraging the ever-expanding wealth of scientific data available in the literature²⁴. This technique has demonstrated success in high-resolution soil property mapping. One application involves estimating global patterns of soil organic carbon (SOC) decomposition rates by fusing multisource data to uncover the hidden nonlinear relationships among geochemical factors²⁵. Moreover, ML techniques have emerged as promising alternatives for dealing with multiphysical or chemical processes by embedding the knowledge of underlying physical or chemical laws²⁵. Recently, hybrid ML geochemical models have been proposed for determining the fate and transport of uranium anions in subsurface environments, which has the potential to solve the problems that limit the development of self-consistent geochemical surface complexation models as well as MSMs for large-scale applications²⁶.

In this study, we developed a geochemical-integrated ML framework to assess the speciation distribution of Cd in soils. Our approach, which swiftly predicts the bioavailable fraction of soil Cd, offers a mechanistic geochemical interpretation at the continental scale. It is instrumental in facilitating risk assessment, guiding policy formulation, and implementing targeted remediation measures for soil HM pollution, which in turn promotes sustainable agricultural practices and substantially contributes to long-term environmental health across various scales, spanning from regional to continental and even global levels.

Results and discussion

ML-based framework

Our geochemical-integrated ML framework utilized regional soil surveys, a collection of crop uptake data, and MSM modeling to predict the Cd associated with soil interfaces and the dissolved fraction of nonindustrial soil Cd at the continental scale (Fig. 1). The overall structure of the work can be broadly divided into four sections (details on the ML framework and validation procedures are described in “Methods”): (1) To predict the amorphous ferrihydrite content (represented by hydrous ferric oxides and abbreviated as HFO) and Cd speciation distribution (abbreviated as dist) in the EU and China, four datasets (EU_HFO, EU_dist, CN_HFO and CN_dist) were compiled; (2) ML algorithms were developed on the EU_HFO and CN_HFO to predict the HFO content at each site in the EU_dist and CN_dist (Methods and Texts S1–S2); (3) The Cd_dist dataset was formed by the Latin hypercube sampling technique and was combined with MSM output variables to train ML models (Methods and Text S3). The best-performing ML algorithm was used to predict the solid/liquid distribution at each site in EU_dist and CN_dist, and a comparative analysis of the differences between maps of the total and dissolved fractions of Cd was conducted; and (4) Knowledge transfer (KT) models were established within the ML framework to estimate the accumulation of Cd in wheat grains and roots.

**Fig. 1: Flow diagram of the modeling framework.**

Relationships between soil properties and Cd distribution patterns

Based on previous research on the retention of Cd in soils, the bioavailable Cd content in soils is predominantly controlled by adsorption onto four major reactive soil components: soil organic matter (SOM), dissolved organic matter (DOM), clay, and HFO. Cd adsorption to SOM and DOM was described with different ratios of humic acid (HA) and fulvic acid (FA). HA-Cd, FA-Cd, Clay-Cd, and HFO-Cd represent the amount of Cd adsorbed on these surfaces (Methods and Text S7). The reliability of the MSMs in both the EU and China was assessed by correlating the MSM-calculated equilibrium dissolved Cd (MSMs-Cd) with empirical data on plant uptake and extracted Cd. The robust linear correlation demonstrated that MSM-calculated dissolved Cd effectively indicates soil Cd bioavailability (Text S4), and it was therefore used to evaluate Cd bioavailability. Within the geochemical-integrated ML framework, the gradient boosting regression tree (GBRT) model trained on the Cd_dist dataset achieved satisfactory R² values of 0.998 and 0.989 on the internal and external test sets, respectively, and thus was chosen as the final algorithm to predict the solid‒liquid distribution of Cd at each site (Text S3). The dissolved Cd concentration predicted by GBRT was termed ML-Cd.
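For readers less familiar with the R² metric quoted above, the following minimal sketch shows how a coefficient of determination can be computed for a set of reference values and model predictions. The function name `r2_score` and the numbers are illustrative assumptions, not data or code from the study.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviation from the mean of the reference.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical dissolved-Cd values (reference vs. prediction), not study data.
reference = [10.0, 20.0, 30.0, 40.0]
predicted = [11.0, 19.0, 31.0, 39.0]
r2 = r2_score(reference, predicted)
```

In the setting described above, the reference values would be the MSM-calculated outputs and the predictions would come from the trained model, evaluated separately on the internal and external test sets.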

Taking the prediction of the ML-Cd as an example, three different feature analysis techniques were utilized to evaluate the importance of different features (Figs. S1 and S2). The total Cd content, pH, and SOC (accounting for 58% of OM²⁷) content were identified as the three most important features for the prediction of the ML-Cd, among which SOC was the dominant sorbent in the soil. As shown in Fig. S1a, SHapley Additive exPlanation (SHAP) analysis (a feature importance method, detailed in Methods) for predicting ML-Cd revealed that pH, SOC, clay, and HFO were negatively correlated with the content of ML-Cd. Conversely, higher concentrations of total Cd were associated with greater predictions of ML-Cd. A distinct pattern was observed for total Cd content: instances with high Cd concentration (red points) formed a dense cluster with small, positive SHAP values, while instances with lower Cd concentration (blue points) extended further toward the left, suggesting that low total Cd concentration had a stronger negative impact on ML-Cd. More details about the interaction effects can be found in the dependence plot, interaction plot, and heatmap, and the details are given in Text S5 and Fig. S3.
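To make the sign convention of SHAP values concrete, the sketch below computes exact Shapley values by brute-force enumeration for a toy linear model. This is an illustration of the underlying idea only: the model, feature names, and numbers are hypothetical, and the study's analysis was run on the trained GBRT rather than on anything like this toy function.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one instance `x` relative to `baseline`.

    Features outside a coalition are set to their baseline value. The cost
    grows as 2**n, so this is only practical for a handful of features.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        inputs = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(inputs)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


# Hypothetical toy model (NOT the study's GBRT): the output rises with
# total Cd and falls with pH and SOC.
def toy_model(f):
    return 100.0 * f["total_cd"] - 5.0 * f["ph"] - 0.5 * f["soc"]


baseline = {"total_cd": 0.4, "ph": 6.5, "soc": 20.0}
site = {"total_cd": 0.6, "ph": 5.5, "soc": 30.0}
phi = shapley_values(toy_model, site, baseline)
```

For this toy site, the above-baseline total Cd and below-baseline pH both receive positive contributions, while the above-baseline SOC receives a negative one, and the contributions sum to the difference between the site prediction and the baseline prediction.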

HFO is identified as an important interface for Cd, since it can enhance Cd fixation during soil redox cycles²⁸. Furthermore, it plays a crucial role in certain localities, especially in areas with high pH and low SOC content²⁹. For example, the pH on the Iberian Peninsula increases from north to south, accompanied by decreases in the contents of SOC and HFO (Fig. S4)^30,31. Three representative areas within the region were selected, and the corresponding distributions of Cd species on various soil components were computed for each soil site within these areas. The 25 sites with the highest proportions of HFO-Cd (the percentage ratio of HFO-Cd relative to total soil Cd) are presented in Fig. 2. The results indicated that the proportion of Cd adsorbed by HFO in these three areas increased from north to south, with average values of 1.5 ± 2.9%, 15.5 ± 10.5%, and 30.7 ± 4.0%, respectively. Similarly, along the east coast of Italy, where the soil exhibited high pH, low SOC content, and relatively high HFO levels, the adsorption capacity of HFO was more pronounced, with an average HFO-Cd proportion of 33.3 ± 6.0%. The fraction of Cd on HFO ranged from 0 to 53.1% of the total Cd in the four selected areas.

**Fig. 2: The distribution of Cd associated with different soil phases under different soil property conditions.**

Comparison of the spatial distribution of the total and dissolved fractions of soil Cd between the EU and China

The Cd content in soil is influenced not only by human activities (e.g., industrial activities, mining processes, and fertilization inputs¹³) but also by the regional geological background and weathering-to-soil processes. Specifically, soils over carbonate bedrocks accumulate Cd through a self-regulating cycle: carbonate rocks have a high potential for releasing Cd, while OM and Fe/Mn oxides immobilize it via adsorption/complexation³². This equilibrium between mobilization and retention explains the consistent spatial overlap of high-Cd areas and carbonate-rich regions in both the EU and China (Fig. 3a, b)³³. As shown in Tables S1 and S2, the total Cd content in the EU varied between 0.11 and 1.55 mg kg⁻¹, with an average of 0.37 mg kg⁻¹ and a standard deviation of 0.17 mg kg⁻¹, while the total Cd content in China ranged from 0.01 to 14.22 mg kg⁻¹, with an average of 0.41 mg kg⁻¹ and a standard deviation of 1.03 mg kg⁻¹. The total Cd content in Chinese topsoil was approximately 10.8% greater than that in the EU.

**Fig. 3: Spatial interpolation map of total and dissolved Cd.**

The prediction of the Cd solid‒liquid distribution at each site was carried out using the geochemical-integrated ML framework (GBRT trained on the Cd_dist dataset), and the major statistics of the prediction results are summarized in Table S3. Figs. 3c, d and S5 present distribution maps illustrating the ML-Cd and adsorbed-Cd concentrations on different soil reactive components in the EU and China. In particular, the predicted range for the ML-Cd within nonindustrial topsoil in the EU was 0.3–971.2 μg L⁻¹, with a mean value of 96.9 μg L⁻¹ and a standard deviation of 116.5 μg L⁻¹. In China, the corresponding range was 0.06–4151.2 μg L⁻¹, with a mean of 113.2 μg L⁻¹ and a standard deviation of 337.3 μg L⁻¹. The mean ML-Cd in China was thus 16.8% greater than that in the EU. Table S4 and Fig. S6 present the statistical results and distribution maps of the proportions of various Cd forms to the total Cd content in China and the EU. On average, the proportions of ML-Cd to the total Cd in the EU and China were 26.3 ± 22.9% and 24.6 ± 18.0%, respectively. Moreover, the percentages of SOM-bound Cd (HA-Cd + FA-Cd) were 61.1 ± 20.0% and 65.4 ± 19.3%, respectively, surpassing the percentages of Cd associated with the other interfaces (clay-Cd, constituting 3.5 ± 3.3% and 5.8 ± 4.6%, and HFO-Cd, constituting 9.1 ± 11.7% and 4.3 ± 6.8% of the total Cd in the EU and China, respectively). Based on Tables S1 and S2, the relatively higher total Cd and lower contents of OC and HFO in Chinese topsoil were thought to be the main factors contributing to the greater bioavailability. With nearly comparable mean pH values (6.33 in the EU and 6.64 in China), the mean OC content in China (14.76 g kg⁻¹) was markedly lower than that in the EU (36.53 g kg⁻¹), and a similar pattern was also observed for the mean HFO content (2.6 g kg⁻¹ in the EU and 1.6 g kg⁻¹ in China). Consequently, as shown in Table S3, the mean adsorption amounts of Cd on HA and HFO in China are less than those in the EU (148.6 μg L⁻¹ and 7.2 μg L⁻¹ in China, respectively, compared to 194.3 μg L⁻¹ and 28.1 μg L⁻¹ in the EU).
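The percentage comparisons quoted in this section can be reproduced from the reported means with simple arithmetic. The short sketch below does this, and also checks that the reported mean shares of the Cd pools add up to roughly 100% in each region. The helper name `percent_higher` is an illustrative choice, and the SOM-bound share is taken as the reported HA-Cd + FA-Cd sum.

```python
def percent_higher(value, reference):
    """Relative difference of `value` over `reference`, in percent."""
    return (value - reference) / reference * 100.0


# Mean total Cd (mg/kg): China 0.41 vs EU 0.37.
total_cd_gap = percent_higher(0.41, 0.37)
# Mean ML-Cd (ug/L): China 113.2 vs EU 96.9.
ml_cd_gap = percent_higher(113.2, 96.9)

# Mean shares of total Cd (%) as reported in the text.
eu_shares = {"ML-Cd": 26.3, "SOM-bound": 61.1, "clay-Cd": 3.5, "HFO-Cd": 9.1}
cn_shares = {"ML-Cd": 24.6, "SOM-bound": 65.4, "clay-Cd": 5.8, "HFO-Cd": 4.3}
eu_sum = sum(eu_shares.values())
cn_sum = sum(cn_shares.values())

print(f"total Cd: China {total_cd_gap:.1f}% higher than EU")
print(f"ML-Cd:    China {ml_cd_gap:.1f}% higher than EU")
print(f"share sums: EU {eu_sum:.1f}%, China {cn_sum:.1f}%")
```

The share sums come out at 100.0% for the EU and 100.1% for China; the small excess for China is consistent with rounding of the individual means.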

As shown in Fig. 3a, the highest amount of total Cd in nonindustrial topsoil was found in Ireland, followed by northern Spain, northern Sweden, Finland, and Poland. Lithuania, Slovenia, central Romania, and the west coast of Greece also exhibited relatively high Cd contents. Nevertheless, Fig. 3c depicts a different picture in which northern Sweden, Finland, and Poland exhibited the highest bioavailability rather than Ireland; on the other hand, Italy showed higher total Cd levels but a very low Cd bioavailability. The total Cd and ML-Cd in England, Estonia, Latvia, Hungary, and Bulgaria posed low environmental risks. By combining SHAP analysis and maps of pH³⁰, SOC³¹, and clay³¹, it was determined that pH and SOC content were the primary drivers behind the differences in the spatial distributions of total Cd and ML-Cd (Fig. S6). Typically, low soil pH was the major driver for regions with lower total Cd contents but higher Cd bioavailability, such as Sweden and Poland. This is consistent with the scenario depicted in Fig. S6, where regions with a high proportion of ML-Cd closely align with areas of lower pH. On the other hand, despite the high total Cd content in Ireland, the high SOC content there resulted in substantial Cd adsorption by HA and FA and therefore a comparatively low Cd bioavailability.

In contrast to those in the EU, the distributions of ML-Cd and the proportion of ML-Cd in China show a pattern similar to that of total Cd. As shown in Fig. 3b, d, areas exhibiting higher levels of both total Cd and ML-Cd were found in Taiwan Province, Guangxi Province, and the junction of Yunnan and Guizhou Provinces. This phenomenon can be attributed to a noticeable pattern in the distribution of soil properties across China. Specifically, soil pH tends to be greater in the north and lower in the south⁵, while SOC and clay content tend to be lower in the north and higher in the south^5,34.

Knowledge transfer for bioavailability prediction

Most regional or continental risk assessments of soil Cd have relied solely on total content¹³, while measurements of bioavailable Cd content remain problematic for risk evaluation because different extraction procedures define different types of bioavailable Cd. Therefore, we integrated the above geochemical adsorption processes with crop Cd uptake within the soil‒plant system using knowledge transfer (KT). The KT model was established by incorporating the predictive outcomes of the Cd speciation distribution obtained from the best-performing GBRT model (Table S5), and trained alongside a nonmechanistic data-driven (DD) model on Chinese wheat data. Although comparative performance metrics (performance score in Table S6; SHAP for DD shown in Fig. 4a, c, SHAP for KT shown in Fig. 4b, d) demonstrated similar accuracy between models, the KT model distinguished itself by encoding the geochemical interactions governing Cd distribution. Feature importance analysis (Fig. 4b, d) confirms that ML-Cd and Clay-Cd exhibit predominant influence in the KT model, outperforming all other soil properties by substantial margins. It is physically sound that both the soluble and electrostatically clay-bound Cd can be attributed to the exchangeable fraction³⁵. These findings were also consistent with the view that Cd bioavailability is not solely determined by the ML-Cd fraction but is also influenced by the presence of Cd at other interfaces during the dynamic Cd uptake process³⁶. Hence, the KT model effectively addresses this complexity by leveraging the knowledge transferred from a different but related task in an end-to-end paradigm. Consequently, our geochemical-integrated ML framework successfully bridged the knowledge gap by integrating knowledge from geochemical processes into crop uptake processes, leading to a better understanding of Cd accumulation in both wheat roots and grains. Its potential applications in diverse species and complex scenarios await further validation.

**Fig. 4: Comparison of knowledge transfer and nonmechanistic data-driven models in predicting Cd plant uptake.**

Conclusions

Bioavailability and risks of HMs in agricultural soils remain one of the important obstacles to achieving global food safety and security. Previous studies have focused on specific case studies, whereas accurate regional- and continental-scale assessments of Cd bioavailability and risk have been lacking. This study presents an approach that integrates the mechanistic aspects of geochemical complexation processes for predicting the Cd speciation distribution in soils and crop uptake, bridging the gap between laboratory research and field applications. The end-to-end geochemical-integrated ML framework, using only geographical location and a few soil properties (pH, clay fraction, and SOC content), drastically reduced the costs and time needed for accurate estimation of bioavailable Cd. By leveraging existing survey data, the framework generated maps depicting Cd contents in various phases across the EU and China, achieving substantial savings in monitoring costs. Crop uptake of Cd with mechanistic interpretation was also accomplished through the extended KT model, providing valuable insights for environmental risk assessments, sustainable food security policies, and effective soil remediation strategies. Our findings underscore that relying solely on total Cd content can underestimate risk, highlighting the importance of considering different Cd species in risk assessments. The developed framework can also be readily extended to encompass most other HMs, although prediction of bioavailable lead (Pb) and mercury (Hg) would remain challenging due to their complex behaviors and transformations in soils^19,37. If more sophisticated soil-HM interaction mechanisms are incorporated into the framework, it may provide a comprehensive understanding of HM interactions and their associated risks in soils.

The accuracy of predictions is subject to several data and methodological constraints. First and foremost, while MSMs ideally require chemically reactive Cd (typically 0.43 M HNO₃-extractable) as input, our reliance on aqua regia-extractable total Cd (the only consistently available metric across datasets) may overestimate MSMs-Cd³⁸. Future refinements should prioritize either developing pH-dependent empirical functions to convert total Cd to reactive fractions, or incorporating aging models³⁹ to better constrain chemically available Cd inputs for MSMs. For European predictions, 73% of LUCAS soil Cd concentrations fell below the detection limit (0.07 mg kg⁻¹). Although the predictive R² of the interpolated map reached 0.45 at a resolution of 100 m¹³, there are inherent uncertainties associated with the interpolated data compared to source data. Similarly, Chinese estimates combine 40-year-old national survey data with heterogeneous literature sources, potentially misrepresenting current contamination patterns. While the risk assessment of HMs tends to prioritize hotspots with high HM contents, it is important to acknowledge these limitations and strive for improved data collection and data quality in future studies. Methodological approximations, including equilibrium state assumptions (despite the nonequilibrium nature of soil processes) and interpolated soil properties, introduce further uncertainties. Geological processes such as carbonate weathering release Cd into soils, while pedological factors such as OM and Fe/Mn oxides drive its re-immobilization, collectively shaping the net available Cd pool. Incorporating these coupled mechanisms could refine the predictive accuracy of Cd bioavailability.

Despite the known biases in its predictions, the ML framework remains valuable for decision-making at regional and continental scales⁴⁰. Improving data accuracy would enhance predictions and risk assessments for soil Cd contamination, facilitating informed decision-making for sustainable environmental protection.

Methods

Datasets

Soil properties: Soil organic carbon (SOC), clay minerals, and amorphous ferrihydrite (HFO) were selected as the primary solid phases responsible for Cd adsorption in soils in MSMs as well as in the geochemical-integrated machine learning framework. HFO is particularly essential for the retention of Cd in alkaline soils^19,41. However, HFO content has not been determined in most literature or in the LUCAS survey series (Table S7) except for LUCAS 2018. To address this gap, we established two datasets for predicting HFO content in Europe and China: (1) the EU_HFO dataset comprising HFO measurements from 2510 LUCAS 2018 sampling sites with complete characterization of 11 soil properties, and (2) the CN_HFO dataset incorporating 132 literature-derived HFO measurements from China, limited to three commonly reported parameters (pH, OC, and clay content) due to data availability constraints. The resulting HFO predictions were then compiled, alongside other soil properties, into the datasets used for predicting Cd speciation distribution at soil interfaces (EU_dist and CN_dist). All data were strictly collected from topsoil layers (0–20 cm depth). An overview of the main statistics and probability density distribution histograms of these four datasets is given in Tables S1–S2 and S8–S9 and Figs. S7–S10. The detailed procedures for data processing for these datasets are available in Text S6.

Environmental Covariates: The formation of HFO is closely related to the interactions between soil properties, vegetation, climate, parent material, land surface moisture and thermal conditions, and human activity over time. Following Liu et al.⁵, we incorporated 14 environmental covariates (climate, topography, and land surface characteristics) as part of predictive variables in EU_HFO and CN_HFO for modeling HFO content. These environmental covariates are derived from satellite data such as MODIS and other global datasets (details in Table S10 and accompanying notes).

Spatial Covariates: To account for spatial associations in HFO content, longitude, latitude, and the 6 spatial interpolation covariates used in random forest spatial interpolation (RFSI), namely the values observed at the three observation points nearest to the target point and the distances from the target point to those observation points⁴², were introduced as covariates carrying information about spatial correlation.
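A minimal sketch of how such nearest-observation covariates could be assembled is given below. It assumes planar (projected) coordinates and Euclidean distance, which may differ from the distance measure used in the study; the function name `nearest_obs_covariates` and the sample values are hypothetical.

```python
from math import dist


def nearest_obs_covariates(target, observations, k=3):
    """Return [value_1, distance_1, ..., value_k, distance_k].

    target:       (x, y) of the point to predict, in projected coordinates.
    observations: list of ((x, y), value) pairs with measured values.
    The k nearest observations are ordered from closest to farthest.
    """
    if len(observations) < k:
        raise ValueError("need at least k observations")
    ranked = sorted(observations, key=lambda obs: dist(target, obs[0]))
    covariates = []
    for coords, value in ranked[:k]:
        covariates.extend([value, dist(target, coords)])
    return covariates


# Hypothetical HFO observations (coordinates in km, values in g/kg).
observations = [
    ((3.0, 4.0), 1.8),
    ((0.0, 1.0), 2.5),
    ((6.0, 8.0), 3.1),
    ((30.0, 40.0), 9.9),
]
covs = nearest_obs_covariates((0.0, 0.0), observations)
```

With k = 3 this yields six covariates per target point. When building training rows, the observation at the target location itself would normally be excluded from its own neighbour list so that the model cannot simply copy the measured value.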

Multisurface model description and validation

The EU_dist and CN_dist datasets, filled with predicted HFO, were then used in the MSMs to predict the dissolved Cd fraction and speciation distribution in those two areas. MSMs are thermodynamic equilibrium-based geochemical models¹⁹ that include four reactive surfaces for Cd adsorption, i.e., soil organic matter (SOM), dissolved organic matter (DOM), iron oxides, and clay minerals (<0.002 mm). Thus, the total Cd, SOM, DOM, HFO, and clay fraction for each data point in both EU_dist and CN_dist were used as input variables to calculate the equilibrium dissolved Cd (MSMs-Cd) and Cd speciation (Cd adsorption on the various solid phases) by MSMs. By comparison with the EC-derived background electrolyte concentration, 0.001 M CaCl₂ was suggested to represent coexisting ions⁴¹. The soil-solution ratio was set at 10 g L⁻¹ based on its demonstrated performance in predicting both Cd in porewater and crop uptake^23,43. The calculations were performed using the ORCHESTRA program⁴⁴. The detailed model settings and calculation processes are described in Texts S7 and S8. Validation against plant uptake demonstrated that MSMs-Cd is an effective indicator of the bioavailability of Cd in soils (Text S4)^43,45.

Machine learning model development and selection

Latin hypercube sampling-based dataset: Latin hypercube sampling (LHS) and its variants are widely used random sampling techniques that allow for more efficient recreation of input distributions with fewer samples from a multidimensional distribution (more details about LHS and other sampling methods can be found in Text S9). The sampling methods were applied in this study to determine which ML model requires the smallest training set to achieve optimal prediction performance. LHS was employed repeatedly to generate a series of combinations of soil properties and total Cd content, reconstructing the distribution of soil properties in the EU_dist (which has a larger volume and more consistent measurement standards than CN_dist) with fewer data points and expanding the predictive capabilities of ML models (Text S10). MSMs were employed on the datasets to calculate the solid/liquid distribution of Cd for these combinations (paired soil properties-total Cd content). These combinations served as descriptors, while the outputs of the MSMs served as targets, collectively forming the LHS-based dataset (labeled LSH-N, where N is the number of sampling repetitions). Various ML models were trained and compared on LSH-N to choose the optimal ML model with the smallest dataset. Ultimately, LSH-4000 was chosen as the final training set and was given the name Cd_dist (Text S3).
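The stratification idea behind LHS can be sketched in a few lines. The version below draws from uniform ranges only, whereas the study reconstructs the empirical distributions in EU_dist (Text S10), so it should be read as a simplified illustration; the function name `latin_hypercube` and the variable ranges are assumptions.

```python
import random


def latin_hypercube(n_samples, bounds, seed=None):
    """Draw n_samples points so that, in every dimension, each of the
    n_samples equal-width strata contains exactly one point.

    bounds: list of (low, high) tuples, one per variable.
    """
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each stratum of the unit interval.
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so that strata are paired randomly across dimensions.
        rng.shuffle(unit)
        columns.append([low + u * (high - low) for u in unit])
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


# Hypothetical ranges: pH, SOC (g/kg), total Cd (mg/kg).
bounds = [(4.0, 9.0), (1.0, 100.0), (0.05, 2.0)]
samples = latin_hypercube(8, bounds, seed=0)
```

Each sampled combination would then be passed to the geochemical model to obtain the corresponding target values. Compared with plain random sampling, the stratification guarantees coverage of the full range of every variable even for small N.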

Machine learning model development: Ten ML algorithms were used in this study, including five traditional learning methods (ridge regression (Ridge), Lasso regression (Lasso), Elastic Net regression (ElasticNet), K-Nearest Neighbors (KNN) and Support Vector Regression (SVR)), four ensemble models (random forest (RF), extremely randomized trees (ERT), Gradient Boosting Regression Tree (GBRT), and eXtreme Gradient Boosting (XGBoost)), and one deep learning model (Multi-Layer Perceptron (MLP)). The hyperparameters adjusted for these algorithms are listed in Table S11. All models were employed in constructing geochemical-integrated ML models (a more detailed procedure is given in Text S3).

To obtain the best model for predicting HFO, ten ML models were trained on the EU_HFO and CN_HFO datasets. To select the optimal ML model that can mimic MSM computations with minimal training data, the LHS technique was applied repeatedly to draw different numbers (N) of samples from EU_dist to form LSH-N. Then, the distribution of Cd(II) at solid‒liquid interfaces in the LSH datasets was calculated using the ORCHESTRA program. Finally, the four ensemble models and MLP were trained and compared using soil properties in LSH-N as features and the outputs of ORCHESTRA as targets.

Feature importance evaluation: To evaluate the importance of different descriptors and interpret the predictive model, three feature importance analysis methods, namely, permutation feature importance analysis, impurity feature importance analysis, and the SHapley Additive exPlanation (SHAP) method, were applied. Compared to common feature importance methods, SHAP estimates not only how much but also how each feature contributes to model prediction, providing a fresh perspective for interpreting the interplay between soil properties and HMs⁴⁶. Moreover, the results can also be used to simplify the ML model, and the most valuable features can be identified by combining the feature importance and correlation results⁴⁶.
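Of the three methods, permutation feature importance is the simplest to illustrate: a feature matters if shuffling its values degrades the model's error. The sketch below implements this idea with the standard library for any callable model. The toy model, feature names, and values are hypothetical and are not taken from the study.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of `model` over feature rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=20, seed=0):
    """Mean increase in MSE when one feature column is shuffled."""
    rng = random.Random(seed)
    base = mse(model, rows, targets)
    importances = {}
    for name in rows[0]:
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            # Replace only the chosen column; other features stay in place.
            permuted = [{**r, name: v} for r, v in zip(rows, shuffled)]
            increases.append(mse(model, permuted, targets) - base)
        importances[name] = sum(increases) / n_repeats
    return importances


# Hypothetical toy model that ignores the "noise" feature entirely.
def toy_model(f):
    return 100.0 * f["total_cd"] - 5.0 * f["ph"]


rows = [
    {"total_cd": 0.1 + 0.2 * i, "ph": 5.0 + 0.5 * i, "noise": float(i % 3)}
    for i in range(6)
]
targets = [toy_model(r) for r in rows]
imp = permutation_importance(toy_model, rows, targets)
```

Because the toy model never reads `noise`, its importance is exactly zero, while shuffling `total_cd` or `ph` increases the error. Unlike SHAP, this score conveys only the magnitude of a feature's influence and not its direction.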

Knowledge transfer: Transfer learning based on neural networks has made considerable progress in the field of processing large amounts of image and text data due to its advantages of faster speed, better performance, and cost savings⁴⁷. However, it is a challenge to effectively transfer knowledge in small datasets in the environmental field. Therefore, a knowledge transfer (KT) algorithm based on tree algorithms and tabular data has been proposed⁴⁸. Specifically, the Cd speciation distribution results predicted by the best-performing GBRT model were used as additional features to predict plant uptake. We compared the performance of the models after KT with that of the standalone nonmechanistic data-driven (DD) models trained solely on the uptake data of Chinese wheat.
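The feature-augmentation step described above can be sketched as follows. Only the construction of the KT feature table is shown; the downstream regressor is omitted, and `toy_speciation_model` is a hypothetical stand-in for the trained speciation model, with made-up feature names and an arbitrary formula.

```python
def augment_with_speciation(rows, speciation_model):
    """Append predicted Cd speciation to each row of soil features.

    rows:             list of dicts of measured soil properties (DD features).
    speciation_model: callable returning a dict of predicted Cd pools.
    Returns new dicts (KT features); the input rows are not modified.
    """
    augmented = []
    for row in rows:
        predicted = speciation_model(row)
        overlap = set(predicted) & set(row)
        if overlap:
            raise ValueError(f"feature name clash: {sorted(overlap)}")
        augmented.append({**row, **predicted})
    return augmented


# Hypothetical stand-in for the trained speciation model (NOT the study's
# GBRT): the dissolved share simply decreases with pH.
def toy_speciation_model(row):
    dissolved_share = max(0.05, min(0.95, 1.0 - 0.1 * row["ph"]))
    return {
        "ml_cd": row["total_cd"] * dissolved_share,
        "sorbed_cd": row["total_cd"] * (1.0 - dissolved_share),
    }


dd_features = [
    {"ph": 6.0, "soc": 15.0, "clay": 20.0, "total_cd": 0.5},
    {"ph": 8.0, "soc": 10.0, "clay": 30.0, "total_cd": 0.3},
]
kt_features = augment_with_speciation(dd_features, toy_speciation_model)
```

A DD model would be trained on `dd_features` and a KT model on `kt_features`, using the same plant uptake targets and the same learner, so that any difference in behaviour can be attributed to the transferred speciation features.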

Data availability

The European data and maps of this study will be made available in the European Soil Data Centre (ESDAC, https://esdac.jrc.ec.europa.eu/)³¹. Data for this manuscript are available at Zenodo with the following link: https://doi.org/10.5281/zenodo.15667161.

Code availability

Code for this manuscript is available at Zenodo with the following link: https://doi.org/10.5281/zenodo.15667161.

References

Hou, D. et al. Metal contamination and bioremediation of agricultural soils for food safety and sustainability. Nat. Rev. Earth Environ. 1, 366–381 (2020).
Fuller, R. et al. Pollution and health: a progress update. Lancet Planet. Health 6, e535–e547 (2022).
Han, R. et al. Bibliometric overview of research trends on heavy metal health risks and impacts in 1989–2018. J. Clean. Prod. 276, 123249 (2020).
Li, M. et al. National multi-purpose regional geochemical survey in China. J. Geochem. Explor. 139, 21–30 (2014).
Liu, F. et al. Mapping high resolution national soil information grids of China. Sci. Bull. 67, 328–340 (2022).
Salminen, R., De Vos, W. & Tarvainen, T. Geochemical atlas of Europe. (Geological survey of Finland Espoo, 2005).
Reimann, C., Birke, M., Demetriades, A., Filzmoser, P. & O’Connor, P. Chemistry of Europe’s agricultural soils, part A. (Schweizerbart’sche Verlagsbuchhandlung, 2014).
Fernández-Ugalde, O. et al. LUCAS 2018 Soil Module. (Publications Office of the European Union, 2022).
Fernández-Ugalde, O. et al. LUCAS 2018 Soil Module. Presentation of dataset and results (2018).
Ballabio, C. et al. A spatial assessment of mercury content in the European Union topsoil. Sci. Total Environ. 769, 144755 (2021).
Ballabio, C. et al. Copper distribution in European topsoils: an assessment based on LUCAS soil survey. Sci. Total Environ. 636, 282–298 (2018).
Van Eynde, E., Fendrich, A. N., Ballabio, C. & Panagos, P. Spatial assessment of topsoil zinc concentrations in Europe. Sci. Total Environ. 892, 164512 (2023).
Ballabio, C., Jones, A. & Panagos, P. Cadmium in topsoils of the European Union–an analysis based on LUCAS topsoil database. Sci. Total Environ. 912, 168710 (2024).
Tóth, G., Hermann, T., Da Silva, M. R. & Montanarella, L. Heavy metals in agricultural soils of the European Union with implications for food safety. Environ. Int. 88, 299–309 (2016).
Huang, J., Fan, G., Liu, C. & Zhou, D. Predicting soil available cadmium by machine learning based on soil properties. J. Hazard. Mater. 460, 132327 (2023).
Kim, R.-Y. et al. Bioavailability of heavy metals in soils: definitions and practical implementation—a critical review. Environ. Geochem. Health 37, 1041–1061 (2015).
Tipping, E. WHAMC—a chemical equilibrium model and computer code for waters, sediments, and soils incorporating a discrete site/electrostatic model of ion-binding by humic substances. Comput. Geosci. 20, 973–1023 (1994).
Peijnenburg, W. J. G. M., Zablotskaja, M. & Vijver, M. G. Monitoring metals in terrestrial environments within a bioavailability framework and a focus on soil extraction. Ecotoxicol. Environ. Saf. 67, 163–179 (2007).
Weng, L., Temminghoff, E. J. M. & Van Riemsdijk, W. H. Contribution of individual sorbents to the control of heavy metal activity in sandy soil. Environ. Sci. Technol. 35, 4436–4443 (2001).
Bonten, L. T., Groenenberg, J. E., Weng, L. & van Riemsdijk, W. H. Use of speciation and complexation models to estimate heavy metal sorption in soils. Geoderma 146, 303–310 (2008).
Dijkstra, J. J., Meeussen, J. C. L. & Comans, R. N. J. Evaluation of a generic multisurface sorption model for inorganic soil contaminants. Environ. Sci. Technol. 43, 6196–6201 (2009).
Groenenberg, J. E., Romkens, P. F. A. M., van Zomeren, A., Rodrigues, S. M. & Comans, R. N. J. Evaluation of the single dilute (0.43 M) nitric acid extraction to determine geochemically reactive elements in soil. Environ. Sci. Technol. 51, 2246–2253 (2017).
Li, Y. et al. Combining multisurface model and Gouy–Chapman–Stern model to predict cadmium uptake by cabbage (Brassica Chinensis L.) in soils. J. Hazard. Mater. 416, 126260 (2021).
Xiang, D., Wang, G., Tian, J. & Li, W. Global patterns and edaphic-climatic controls of soil carbon decomposition kinetics predicted from incubation experiments. Nat. Commun. 14, 2171 (2023).
Wu, X. et al. Sensing prior constraints in deep neural networks for solving exploration geophysical problems. Proc. Natl. Acad. Sci. USA 120, e2219573120 (2023).
Chang, E., Zavarin, M., Beverly, L. & Wainwright, H. A chemistry-informed hybrid machine learning approach to predict metal adsorption onto mineral surfaces. Appl. Geochem. 155, 105731 (2023).
Nelson, D. W. & Sommers, L. E. Total carbon, organic carbon, and organic matter. Methods Soil Anal. Part 2 Chem. Microbiol. Prop. 9, 539–579 (1983).
Imoto, Y. & Yasutaka, T. Comparison of the impacts of the experimental parameters and soil properties on the prediction of the soil sorption of Cd and Pb. Geoderma 376, 114538 (2020).
Benjamin, M. M. & Leckie, J. O. Multiple-site adsorption of Cd, Cu, Zn, and Pb on amorphous iron oxyhydroxide. J. Colloid Interface Sci. 79, 209–221 (1981).
Ballabio, C. et al. Mapping LUCAS topsoil chemical properties at European scale using Gaussian process regression. Geoderma 355, 113912 (2019).
Panagos, P. et al. European Soil Data Centre 2.0: soil data and knowledge in support of the EU policies. Eur. J. Soil Sci. 73, e13315 (2022).
Quezada-Hinojosa, R. P., Matera, V., Adatte, T., Rambeau, C. & Föllmi, K. B. Cadmium distribution in soils covering Jurassic oolitic limestone with high Cd contents in the Swiss Jura. Geoderma 150, 287–301 (2009).
Goldscheider, N. et al. Global distribution of carbonate rocks and karst water resources. Hydrogeol. J. 28, 1661–1677 (2020).
Song, X.-D. et al. Mapping soil organic carbon content by geographically weighted regression: A case study in the Heihe River Basin, China. Geoderma 261, 11–22 (2016).
Cui, Y. & Weng, L. Interpretation of heavy metal speciation in sequential extraction using geochemical modelling. Environ. Chem. 12, 163–173 (2015).
Li, Q. et al. Speciation of heavy metals in soils and their immobilization at micro-scale interfaces among diverse soil components. Sci. Total Environ. 825, 153862 (2022).
Gworek, B., Dmuchowski, W. & Baczewska-Dąbrowska, A. H. Mercury in the terrestrial environment: a review. Environ. Sci. Eur. 32, 128 (2020).
Garforth, J. M., Tye, A. M., Young, S. D., Bailey, E. H. & Lofts, S. A comparison of characterisation and modelling approaches to predict dissolved metal concentrations in soils. Environ. Chem. 21, EN23075 (2024).
Xu, L., Lofts, S. & Lu, Y. Terrestrial ecosystem health under long-term metal inputs: modeling and risk assessment. Ecosyst. Health Sustain. 2, e01214 (2016).
Vijver, M. G., Spijker, J., Vink, J. P. & Posthuma, L. Determining metal origins and availability in fluvial deposits by analysis of geochemical baselines and solid–solution partitioning measurements and modelling. Environ. Pollut. 156, 832–839 (2008).
Li, Y. et al. Prediction of the uptake of Cd by rice (Oryza sativa) in paddy soils by a multi-surface model. Sci. Total Environ. 724, 138289 (2020).
Zhao, W. et al. Accurate prediction of soil heavy metal pollution using an improved machine learning method: a case study in the Pearl River Delta, China. Environ. Sci. Technol. 57, 17751–17761 (2023).
Zhu, B., Liao, Q., Zhao, X., Gu, X. & Gu, C. A multi-surface model to predict Cd phytoavailability to wheat (Triticum aestivum L.). Sci. Total Environ. 630, 1374–1380 (2018).
Meeussen, J. C. L. ORCHESTRA: an object-oriented framework for implementing chemical equilibrium models. Environ. Sci. Technol. 37, 1175–1182 (2003).
Qu, X. et al. A field study to predict Cd bioaccumulation in a soil-wheat system: application of a geochemical model. J. Hazard. Mater. 400, 123135 (2020).
Palansooriya, K. N. et al. Prediction of soil heavy metal immobilization by biochar using machine learning. Environ. Sci. Technol. 56, 4187–4198 (2022).
Zhuang, F. et al. A comprehensive survey on transfer learning. Proc. IEEE 109, 43–76 (2020).
Zhong, S., Zhang, Y. & Zhang, H. Machine learning-assisted QSAR models on contaminant reactivity toward four oxidants: combining small data sets and knowledge transfer. Environ. Sci. Technol. 56, 681–692 (2021).

Acknowledgements

This study was supported by the National Natural Science Foundation of China (42225701, 41977027, and 41671239) and the National Key Research and Development Program of China (2021YFC1809100 and 2020YFC1806801). LUCAS soil samples were collected and analyzed with the support of EUROSTAT, DG AGRI, CLIMA, and ENV (European Commission). We thank the Soil SubCenter, National Earth System Science Data Center, National Science & Technology Infrastructure of China (http://soil.geodata.cn) for data support.

Author information

Authors and Affiliations

State Key Laboratory of Soil & Sustainable Agriculture, Institute of Soil Science, Chinese Academy of Sciences, Nanjing, China
Naichi Zhang, Chen Lv, Yan Li, Jun Man, Yifan Qian, Peixin Cui, Tongliang Wu, Meiying Huang, Cun Liu & Yujun Wang
University of Chinese Academy of Sciences, Beijing, China
Naichi Zhang, Yifan Qian & Yujun Wang
College of Environmental Science and Engineering, Yangzhou University, Yangzhou, China
Chen Lv
European Synchrotron Radiation Facility (ESRF), Grenoble, France
Yan Li
European Commission, Joint Research Centre (JRC), Ispra, Italy
Panos Panagos & Cristiano Ballabio
State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, China
Xueyuan Gu
College of Resources and Environmental Sciences, Nanjing Agricultural University, Nanjing, China
Fang-Jie Zhao & Peng Wang
College of Environmental & Resource Sciences, Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Zhejiang University, Hangzhou, China
Xingmei Liu

Contributions

Naichi Zhang: Writing – review and editing, Writing – original draft, Visualization, Investigation, Formal analysis, Data curation.
Chen Lv: Writing – review and editing, Visualization, Formal analysis, Data curation.
Yan Li: Methodology, Formal analysis, Data curation.
Panos Panagos: Writing – review and editing, Funding acquisition, Data curation.
Cristiano Ballabio: Writing – review and editing, Data curation.
Jun Man: Writing – review and editing, Methodology, Data curation.
Xueyuan Gu: Writing – review and editing, Methodology, Data curation.
Fang-Jie Zhao: Writing – review and editing, Data curation.
Peng Wang: Writing – review and editing, Data curation.
Xingmei Liu: Writing – review and editing, Data curation.
Yifan Qian: Visualization, Data curation.
Peixin Cui: Writing – review and editing, Visualization, Data curation.
Tongliang Wu: Writing – review and editing, Visualization, Data curation.
Meiying Huang: Writing – review and editing, Visualization, Data curation.
Cun Liu: Writing – review and editing, Validation, Supervision, Resources, Methodology, Funding acquisition, Conceptualization.
Yujun Wang: Writing – review and editing, Validation, Supervision, Resources, Methodology, Funding acquisition, Conceptualization.

Corresponding authors

Correspondence to Cun Liu or Yujun Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Earth and Environment thanks the anonymous reviewers for their contribution to the peer review of this work. Handling Editor(s): Somaparna Ghosh [A peer review file is available].

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Transparent Peer Review file

Supplemental Information

nr-reporting-summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhang, N., Lv, C., Li, Y. et al. Geochemical-integrated machine learning approach predicts the distribution of cadmium speciation in European and Chinese topsoils. Commun Earth Environ 6, 548 (2025). https://doi.org/10.1038/s43247-025-02516-6

Received: 17 January 2025
Accepted: 24 June 2025
Published: 11 July 2025
Version of record: 11 July 2025
DOI: https://doi.org/10.1038/s43247-025-02516-6
