Introduction

Soil is an indispensable foundation for human production and the cornerstone of human development. In recent years, rapid population growth and urbanization have led to a significant influx of waste pollutants into the soil environment, stemming from diverse sources including industrial and agricultural activities as well as traffic emissions1. Notably, heavy metal pollution in soil is a more severe issue than other forms of soil pollution owing to its high toxicity and recalcitrance to degradation. The accumulation of these heavy metals imperils the soil and its surrounding ecological habitats and infiltrates the food chain, posing a threat to animal and human health.

The prohibitive cost of collecting soil heavy metal samples makes a dense and thorough sampling analysis impossible. Moreover, testing inconsistencies or environmental fluctuations can yield aberrant data, and outright deletion would be wasteful. Therefore, it is imperative to devise an efficient and precise model for determining the soil’s heavy metal content. Such a model is vital in predicting, evaluating soil heavy metal pollution, and guiding soil remediation efforts.

Given the significant financial investment and extended timeline required for remediation of soil heavy metal pollution, preventive measures are prioritized in soil management. However, research on soil pollution prediction is relatively sparse. Currently, predicting soil heavy metal content predominantly relies on two approaches: deep-learning-based methods and methods that utilize the spectral characteristics of soil heavy metals. Deep learning techniques require a substantial amount of data for training, coupled with manual adjustment of numerous hyperparameters. This results in predictions that are challenging to interpret. Spectral-based methods rely on specialized equipment operated by experts2, which is not readily accessible to average management departments or research institutions, thereby hindering their widespread adoption. Hence, this paper introduces a refined tensor completion model to predict the soil’s heavy metal content while aiming for a more precise forecast. The proposed method is expected to provide technical support for soil pollution prediction and offer scientific references for soil remediation planning and environmental decision-making. By enabling early identification of high-risk areas and quantifying pollution severity, the model holds promise for assisting in priority setting, resource allocation, and the formulation of targeted remediation strategies.

The main contributions of this study are as follows:

(1) This study proposes a prediction model for soil heavy metal content, integrating total variation regularization with low-rank tensor completion techniques. Through extensive experiments, the prediction accuracy and robustness of the algorithm were validated under varying degrees of missing data.

(2) The Alternating Direction Method of Multipliers (ADMM) is utilized in the model to optimize the total variation regularization of low-rank tensors, thereby solving the model effectively. This step-by-step decomposition and coordination approach enhances both the efficiency and accuracy of tensor completion.

(3) This paper proposes a C2F strategy that captures both global and local data structures by utilizing data from the coarse and fine phases.

(4) This study introduces an optimization approach to determine suitable local ranks in both low-rank and high-rank regions, thereby further improving the performance of tensor completion algorithms.

Related work

Research on prediction for soil heavy metal content

In recent years, the rapid development of artificial intelligence (AI) technology has driven the application of machine learning and deep learning in predicting soil heavy metal concentrations. Scholars have increasingly turned to neural networks, machine learning, and other related technologies to conduct research in this field. For example, Ren et al.3 used a backpropagation neural network model (BPNN) to predict the missing values of heavy metals and polycyclic aromatic hydrocarbons, but BPNN suffers from issues such as random initialization, poor stability, and low accuracy. Sergeev et al.4 combined an artificial neural network (ANN) with geostatistical methods to model the spatial distribution of soil heavy metals. Chen et al.5 used an improved particle swarm optimization algorithm to enhance an SVM model for predicting Cr and Pb in soil. Cheng et al.6 employed a genetic algorithm (GA) to optimize the weights and thresholds of BPNN, enhancing the prediction accuracy for Pb, Mn, Cr, and Cu. Zhang et al.7 proposed the MA-DKELM model, which enhances stability and accuracy by optimizing input weights and biases, building upon the DELM model. Cao et al.8,9 proposed a collaborative composite neural network model (CCNN) and a deep composite model (DCM) to predict soil heavy metal content.

Researchers have leveraged the spectral characteristics of soil heavy metals to predict their contents. Liu et al.10 established spectral estimation models for nine heavy metal elements using partial least squares regression (PLSR). Tan et al.11 developed estimation models based on selected relevant features and comprehensive spectral characteristics using a random forest (RF) algorithm. This approach effectively predicted the concentrations of heavy metals, including Zn, Cr, As, and Pb, in Xuzhou, China. Yang et al.12 established soil hyperspectral prediction models utilizing stepwise multiple and partial least squares regression techniques. These models predicted concentrations of six heavy metals (Cd, As, Pb, Cr, Ni, and Zn) in the tailings.

Research on soil pollution prediction model

Scholars have extensively utilized diverse methods and models to forecast the status of soil heavy metal pollution, which can be categorized into three primary groups: machine learning techniques, statistical approaches, and predictive methodologies that incorporate soil-specific characteristics.

For soil heavy metal pollution prediction based on machine learning, Fei et al.13 integrated soil quality information with comprehensive heavy metal sampling surveys. Shi et al.14 proposed a model combining the least absolute shrinkage and selection operator (LASSO), genetic algorithm (GA), and BPNN to estimate the content of eight heavy metals in a city, resulting in more accurate results.

In statistical method-based soil heavy metal pollution prediction, Korre et al.15 integrated traditional statistics with geostatistics to assess heavy metal pollution in an abandoned mining area in Greece. Similarly, Lee et al.16 used the inverse distance-weighted interpolation method, bolstered by GIS technology, to generate a detailed map depicting the distribution of trace heavy metal elements such as Cd, Co, Cr, Cu, Ni, Pb, and Zn in the surface soil of Hong Kong Island.

For soil heavy metal prediction methods based on merged soil characteristics, Lamine et al.17 combined spectral radiometric measurements with geochemical data to model heavy metal pollution in the floodplains of Wales, UK. Munawar et al.18 used infrared spectroscopy, a cutting-edge technique that offers real-time, swift, and robust detection of harmful pollution stemming from the accumulation of heavy metals, such as zinc and lead, in agricultural soils.

Notably, machine learning training requires a considerable time investment, while adjusting multiple hyperparameters can significantly influence prediction accuracy19. Additionally, statistical methods often rely on assumptions of data stationarity, which may not align with real-world data conditions, thereby limiting their practical applicability. Furthermore, prediction techniques incorporating soil characteristics require specialized equipment and expertise, which restricts their implementation to dedicated institutional laboratories20.

Soil heavy metal content prediction based on improved tensor completion

Design of total variation low-rank tensor completion algorithm

In tensor completion, let \(\:\varvec{y}\) be the incomplete observed tensor. Incorporating total variation regularization into low-rank tensor completion21, the following objective function is proposed:

$$\begin{aligned}& \mathop {min}\limits_{z,g,\left\{ {{{\bf{U}}^{\left( n \right)}}} \right\}_{n = 1}^N} {\lambda _1}\mathop \sum \limits_{n = 1}^N {\beta _n}\left| {{{\bf{F}}_n}{{\bf{Z}}_{\left( n \right)}}} \right| + \frac{1}{N}\mathop \sum \limits_{n = 1}^N {\|{\bf{U}}^{\left( n \right)}\|}_* + {\lambda _2}\|g\|_{F}^{2} \nonumber\\ & s.t. \quad z = g { \times _1}{\bf{U}}^{(1)} { \times _2}{\bf{U}}^{(2)} \cdots { \times _N}{\bf{U}}^{(N)} \nonumber\\ & \left[ {\bf{z}} \right]_\Omega = \left[ {\bf{y}} \right]_\Omega \end{aligned}$$
(1)

In Eq. (1), we performed Tucker decomposition on \(\:\varvec{z}\), i.e., \(\:\varvec{z}=\varvec{g}{\times\:}_{1}{\mathbf{U}}^{\left(1\right)}{\times\:}_{2}{\mathbf{U}}^{\left(2\right)}\cdots\:{\times\:}_{N}{\mathbf{U}}^{\left(N\right)}\). Here, \(\:\varvec{g}\in\:{R}^{{J}_{1}\times\:{J}_{2}\times\:\cdots\:\times\:{J}_{N}}\) denotes the core tensor, while \(\:{\left\{{\mathbf{U}}^{\left(n\right)}\right\}}_{n=1}^{N}\) represents the factor matrices of the decomposition. Tucker decomposition was chosen over CP decomposition because Tucker is the more general model, of which CP is a special case.
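To make the Tucker construction in Eq. (1) concrete, the following is a minimal NumPy sketch (illustrative Python, not the paper's MATLAB implementation; the function names are ours) of the mode-n product and the reconstruction z = g ×1 U(1) ⋯ ×N U(N):

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    """Multiply a tensor by a matrix along the given mode (mode-n product)."""
    # Move the target mode to the front, flatten the rest, multiply, restore shape.
    t = np.moveaxis(tensor, mode, 0)
    shape = t.shape
    result = matrix @ t.reshape(shape[0], -1)
    return np.moveaxis(result.reshape((matrix.shape[0],) + shape[1:]), 0, mode)

def tucker_reconstruct(core, factors):
    """Reconstruct z = g x_1 U^(1) x_2 U^(2) ... x_N U^(N)."""
    z = core
    for n, U in enumerate(factors):
        z = mode_n_product(z, U, n)
    return z

# Toy example: a 4x5x6 tensor built from a 2x2x2 core.
rng = np.random.default_rng(0)
g = rng.standard_normal((2, 2, 2))
Us = [rng.standard_normal((I, 2)) for I in (4, 5, 6)]
z = tucker_reconstruct(g, Us)
```

Applying the factor matrices mode by mode is equivalent to a single multilinear contraction of the core tensor, which is what the ADMM updates exploit when unfolding along each mode.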

The objective function in Eq. (1) comprises the following three terms: The first term represents the total variation regularization of recovery tensor \(\:\varvec{z}\). The second term indicates that the Tucker decomposition factors of the expected tensor are of low rank, without imposing low-rank constraints on the unfolded matrices. The third term was used to prevent overfitting. Removing the total variation term reduces the model to simultaneous tensor decomposition and completion (STDC). However, this simplified model lacks Laplacian regularization of the factor matrices. Considering that the three terms in Eq. (1) are interdependent, the ADMM was employed to optimize the parameters of the improved tensor completion model, thereby enabling the results to more accurately approximate the true data. Specifically, to facilitate optimization using ADMM, a set of matrices \(\:{\left\{{\mathbf{Q}}_{n}\right\}}_{n=1}^{N}\), \(\:{\left\{{\mathbf{R}}_{n}\right\}}_{n=1}^{N}\), and \(\:{\left\{{\mathbf{V}}^{\left(n\right)}\right\}}_{n=1}^{N}\) is introduced as auxiliary variables, transforming the objective function into

$$\begin{aligned} &{min}{\lambda\:}_{1}{\sum\:}_{n=1}^{N}{\beta\:}_{n}\left|{\mathbf{Q}}_{\left(n\right)}\right|+\frac{1}{N}{\sum\:}_{n=1}^{N}{\|{\mathbf{U}}^{\left(n\right)}\|}_{*}+{\lambda\:}_{2}{\|\varvec{g}\|}_{F}^{2}\nonumber\\ & \begin{array}{c}{\begin{array}{cc}s.t.\quad \left\{{\mathbf{Q}}_{n}={\mathbf{F}}_{n}{\mathbf{R}}_{n},{\mathbf{R}}_{n}={\mathbf{Z}}_{\left(n\right)},{\mathbf{V}}^{\left(n\right)}={\mathbf{U}}^{\left(n\right)}\right\}\end{array}}_{n=1}^{N}\end{array}\nonumber\\ &\varvec{z}=\varvec{g}{\times\:}_{1}{\mathbf{V}}^{\left(1\right)}{\times}_{2}{\mathbf{V}}^{\left(2\right)}\cdots\:{\times\:}_{N}{\mathbf{V}}^{\left(N\right)}\nonumber\\ &{\left[\varvec{z}\right]}_{\varOmega\:}={\left[\varvec{y}\right]}_{\varOmega\:}\end{aligned}$$
(2)

Using the augmented Lagrangian formulation, the following objective function is obtained:

$$\begin{aligned} &\varvec{L}={\sum\:}_{n=1}^{N}{\beta\:}_{n}\cdot\:\left({\lambda\:}_{1}\left|{\mathbf{Q}}_{n}\right|+\frac{{\rho\:}_{1}}{2}{\|{\mathbf{Q}}_{n}-{\mathbf{F}}_{n}{\mathbf{R}}_{n}+\frac{{\varvec{\Lambda\:}}_{n}}{{\rho\:}_{1}}\|}_{F}^{2}\right)\nonumber\\&\quad +{\sum\:}_{n=1}^{N}{\beta\:}_{n}\cdot\:\left(\frac{{\rho\:}_{2}}{2}{\|{\mathbf{R}}_{n}-{\mathbf{Z}}_{\left(n\right)}+\frac{{\varvec{\Phi\:}}_{n}}{{\rho\:}_{2}}\|}_{F}^{2}\right)\nonumber\\ &\quad +\frac{1}{N}{\sum\:}_{n=1}^{N}{\|{\mathbf{U}}^{\left(n\right)}\|}_{*}+{\sum\:}_{n=1}^{N}\frac{{\rho\:}_{3}}{2}{\|{\mathbf{V}}^{\left(n\right)}-{\mathbf{U}}^{\left(n\right)}+\frac{{\varvec{\Gamma\:}}_{n}}{{\rho\:}_{3}}\|}_{F}^{2}+{\lambda }_{2}{\|\varvec{g}\|}_{F}^{2}\nonumber\\ &\quad + \frac{{\rho }_{4}}{2}{\|\varvec{z}-\varvec{g}{\times\:}_{1}{\mathbf{V}}^{\left(1\right)}{\times\:}_{2}{\mathbf{V}}^{\left(2\right)}\cdots\:{\times\:}_{N}{\mathbf{V}}^{\left(N\right)}+\frac{\varvec{w}}{{\rho\:}_{4}}\|}_{F}^{2}\nonumber\\&\quad s.t.\quad \left[\varvec{z}\right]_{\varOmega\:}={\left[\varvec{y}\right]}_{\varOmega}\end{aligned}$$
(3)

where matrices \(\:{\left\{{\varvec{\Lambda\:}}_{n}\right\}}_{n=1}^{N}\), \(\:{\left\{{\varvec{\Phi\:}}_{n}\right\}}_{n=1}^{N}\), \(\:{\left\{{\varvec{\Gamma\:}}_{n}\right\}}_{n=1}^{N}\) and tensor \(\:\varvec{w}\) are Lagrange multipliers; \(\:{\rho\:}_{1},{\rho\:}_{2},{\rho\:}_{3},{\rho\:}_{4}>0\) are the penalty parameters; and \(\:{\|\bullet\:\|}_{F}\) represents the Frobenius norm of a matrix or tensor. Next, we derived the update formulas for \(\:{\left\{{\mathbf{Q}}_{n}\right\}}_{n=1}^{N}\), \(\:{\left\{{\mathbf{R}}_{n}\right\}}_{n=1}^{N}\), \(\:{\left\{{\mathbf{U}}^{\left(n\right)}\right\}}_{n=1}^{N}\), \(\:{\left\{{\mathbf{V}}^{\left(n\right)}\right\}}_{n=1}^{N}\), \(\:\varvec{z}\), and \(\:\varvec{g}\), further refining the solution process.

(1) Update \(\:{\left\{{\mathbf{Q}}_{n}\right\}}_{n=1}^{N}\):

The update of \(\:{\mathbf{Q}}_{n}\) is implemented by minimizing the total variation regularization term, and the update formula of \(\:{\mathbf{Q}}_{n}\) is computed as:

$$\:\begin{array}{c}{\mathbf{Q}}_{n}={\beta\:}_{n}\cdot\:shrinkag{e}_{\frac{{\lambda\:}_{1}}{{\rho\:}_{1}}}\left({\mathbf{F}}_{n}{\mathbf{R}}_{n}-\frac{1}{{\rho\:}_{1}}{\varvec{\Lambda\:}}_{n}\right)\end{array}$$
(4)

where \(\:shrinkag{e}_{\alpha\:}\left(\bullet\:\right)\) is the element-wise soft-thresholding operator for matrices, and is defined as:

$$\:\begin{array}{c}{\left[shrinkag{e}_{\alpha\:}\left(\mathbf{A}\right)\right]}_{i,j}={\left[\mathbf{A}\right]}_{i,j}-{min}\left(\alpha\:,\left|{\left[\mathbf{A}\right]}_{i,j}\right|\right)\cdot\:\frac{{\left[\mathbf{A}\right]}_{i,j}}{\left|{\left[\mathbf{A}\right]}_{i,j}\right|}\end{array}$$
(5)

When \(\:{\left[\mathbf{A}\right]}_{i,j}=0\), \(\:\frac{{\left[\mathbf{A}\right]}_{i,j}}{\left|{\left[\mathbf{A}\right]}_{i,j}\right|}\) is defined as 0.
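Eq. (5) is the standard element-wise soft-thresholding operator; a minimal NumPy sketch (illustrative Python, not the paper's implementation) in which `np.sign` automatically handles the zero-entry convention:

```python
import numpy as np

def shrinkage(A, alpha):
    """Element-wise soft-thresholding (Eq. 5): shrink each entry toward zero by alpha.
    np.sign(0) = 0, which matches the convention for zero entries."""
    return np.sign(A) * np.maximum(np.abs(A) - alpha, 0.0)

A = np.array([[1.5, -0.3],
              [0.0,  2.0]])
S = shrinkage(A, 0.5)
```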

(2) Update \(\:{\left\{{\mathbf{R}}_{n}\right\}}_{n=1}^{N}\):

Extract the terms involving \(\:{\mathbf{R}}_{n}\) from the augmented Lagrangian function (3) and minimize them to achieve joint optimization of the total variation regularization term and the unfolding-matrix constraint:

$$\:\begin{array}{c}{\mathbf{R}}_{n}={\beta\:}_{n}{\left({\rho\:}_{1}{\mathbf{F}}_{n}^{T}{\mathbf{F}}_{n}+{\rho\:}_{2}\mathbf{I}\right)}^{-1}\left({\mathbf{F}}_{n}^{T}{\varvec{\Lambda\:}}_{n}+{\rho\:}_{1}{\mathbf{F}}_{n}^{T}{\mathbf{Q}}_{n}+{\rho\:}_{2}{\mathbf{Z}}_{\left(n\right)}-{\varvec{\Phi\:}}_{n}\right)\end{array}$$
(6)

where \(\:I\) represents the identity matrix.

(3) Update \(\:{\left\{{\mathbf{U}}^{\left(n\right)}\right\}}_{n=1}^{N}\):

The update formula of \(\:{\mathbf{U}}^{\left(n\right)}\) is computed as:

$$\:\begin{array}{c}{\mathbf{U}}^{\left(n\right)}={D}_{\frac{1}{N{\rho\:}_{3}}}\left({\mathbf{V}}^{\left(n\right)}+\frac{1}{{\rho\:}_{3}}{\varvec{\Gamma\:}}_{n}\right)\end{array}$$
(7)

where \(\:{D}_{\alpha\:}\left(\mathbf{A}\right)=\mathbf{U}{\left(diag\left\{{\delta\:}_{i}-\alpha\:\right\}\right)}_{+}{\mathbf{V}}^{T}\) is the singular value thresholding operator, the singular value decomposition of \(\:\mathbf{A}\) is \(\:\mathbf{A}=\mathbf{U}\left(diag\left({\left\{{\delta\:}_{i}\right\}}_{0\le\:i\le\:rank\left(\mathbf{A}\right)}\right)\right){\mathbf{V}}^{T}\), and \(\:{t}_{+}={max}\left(0,t\right)\).
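The singular value thresholding operator can be sketched in NumPy as follows (illustrative only): each singular value is shrunk by α and truncated at zero, which also reduces the rank of the result.

```python
import numpy as np

def svt(A, alpha):
    """Singular value thresholding: D_alpha(A) = U diag((sigma - alpha)_+) V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the columns of U by the thresholded singular values, then recompose.
    return (U * np.maximum(s - alpha, 0.0)) @ Vt

A = np.diag([3.0, 1.0, 0.2])
D = svt(A, 0.5)
```

Singular values below the threshold are zeroed out entirely, so the operator promotes low rank while keeping the dominant structure of the matrix.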

(4) Update \(\:{\left\{{\mathbf{V}}^{\left(n\right)}\right\}}_{n=1}^{N}\):

The update of \(\:{\mathbf{V}}^{\left(n\right)}\) is implemented by minimizing the terms involving the low-rank constraint, and the update formula of \(\:{\mathbf{V}}^{\left(n\right)}\) is computed as:

$$\:\begin{array}{c}{\mathbf{V}}^{\left(n\right)}=\left(-{\varvec{\Gamma\:}}_{n}+{\rho\:}_{3}{\mathbf{U}}^{\left(n\right)}+\left({\mathbf{W}}_{\left(n\right)}+{\rho\:}_{4}{\mathbf{Z}}_{\left(n\right)}\right){\mathbf{V}}^{\left(-n\right)}{\mathbf{G}}_{\left(n\right)}^{T}\right){\left({\rho\:}_{3}\mathbf{I}+{\rho\:}_{4}{\mathbf{G}}_{\left(n\right)}{{\mathbf{V}}^{\left(-n\right)}}^{T}{\mathbf{V}}^{\left(-n\right)}{\mathbf{G}}_{\left(n\right)}^{T}\right)}^{-1}\end{array}$$
(8)

where \(\:{\mathbf{V}}^{\left(-n\right)}={\mathbf{V}}^{\left(1\right)}\otimes\:{\mathbf{V}}^{\left(2\right)}\cdots\:\otimes\:{\mathbf{V}}^{\left(n-1\right)}\otimes\:{\mathbf{V}}^{\left(n+1\right)}\cdots\:\otimes\:{\mathbf{V}}^{\left(N\right)}\) is the Kronecker product of all factor matrices except \(\:{\mathbf{V}}^{\left(n\right)}\); \(\:\otimes\:\) denotes the Kronecker product, which expands products of higher-order tensors into matrix operations; and \(\:{\mathbf{W}}_{\left(n\right)}\) is the mode-\(n\) unfolding of \(\:\varvec{w}\).

(5) Update \(\:\varvec{g}\):

Extracting all terms containing the core tensor \(\:\varvec{g}\) from Eq. (3), the update formula for \(\:\varvec{g}\) is derived using vectorization and Kronecker product properties:

$$\:\begin{array}{c}vec\left(\varvec{g}\right)={\left[{{\mathbf{V}}^{\left(-n\right)}}^{T}{\mathbf{V}}^{\left(-n\right)}\otimes\:{\rho\:}_{4}{{\mathbf{V}}^{\left(n\right)}}^{T}{\mathbf{V}}^{\left(n\right)}+{\lambda\:}_{2}\mathbf{I}\right]}^{-1}vec\left({{\mathbf{V}}^{\left(n\right)}}^{T}\left({\mathbf{W}}_{\left(n\right)}+{\rho\:}_{4}{\mathbf{Z}}_{\left(n\right)}\right){\mathbf{V}}^{\left(-n\right)}\right)\end{array}$$
(9)

where \(\:vec\left(\varvec{g}\right)\) represents the vectorization of tensor \(\:\varvec{g}\); the tensor \(\:\varvec{g}\) is then recovered by reshaping \(\:vec\left(\varvec{g}\right)\).

(6) Update \(\:\varvec{z}\):

By projecting onto the observation domain \(\:{\Omega\:}\) and using the data-consistency constraint to ensure that the completion result agrees with the observed data, the update formula for \(\:\varvec{z}\) is derived as:

$$\:\begin{array}{c}{\left[\varvec{z}\right]}_{\bar{\varOmega\:}}={\left[\frac{{\sum\:}_{n=1}^{N}{\beta\:}_{n}\,{fold}_{n}\left({\varvec{\Phi\:}}_{n}+{\rho\:}_{2}{\mathbf{R}}_{n}\right)-\varvec{w}+{\rho\:}_{4}\varvec{t}}{{\sum\:}_{n=1}^{N}{\beta\:}_{n}{\rho\:}_{2}+{\rho\:}_{4}}\right]}_{\bar{\varOmega\:}}\quad s.t.\:{\left[\varvec{z}\right]}_{{\Omega\:}}={\left[\varvec{y}\right]}_{{\Omega\:}}\end{array}$$
(10)

where \(\:\varvec{t}=\varvec{g}{\times\:}_{1}{\mathbf{V}}^{\left(1\right)}{\times\:}_{2}{\mathbf{V}}^{\left(2\right)}\cdots\:{\times\:}_{N}{\mathbf{V}}^{\left(N\right)}\) and \(\:{fold}_{n}(\cdot )\) represents the inverse operation of the mode-\(n\) unfolding of a tensor.

The improved tensor completion model was solved iteratively within the ADMM framework described above. This iterative process resulted in the optimal solution \(\:\varvec{z},\) which was used for imputing the missing data.
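Within each ADMM sweep, the data-consistency constraint [z]_Ω = [y]_Ω is enforced by overwriting the observed entries of the current iterate. A minimal NumPy sketch (illustrative Python; `omega` is assumed to be a Boolean mask of observed positions):

```python
import numpy as np

def enforce_observed(z, y, omega):
    """Keep completed values on the missing set, restore observed entries,
    i.e., project the iterate so that [z]_Omega = [y]_Omega."""
    z = z.copy()
    z[omega] = y[omega]
    return z

y = np.array([[1.0, 2.0], [3.0, 4.0]])
omega = np.array([[True, False], [True, True]])   # observed mask
z_est = np.array([[9.0, 9.0], [9.0, 9.0]])        # some intermediate iterate
z = enforce_observed(z_est, y, omega)
```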

Design of an improved tensor completion algorithm under the C2F framework

By applying a global low-rank assumption, conventional low-rank tensor completion algorithms strike a balance between potential low-rank and high-rank regions. In contrast, the proposed C2F framework avoids this compromise by leveraging data from both the coarse and fine stages, placing greater emphasis on local structures. The C2F framework involves two tensor completion stages: coarse and fine22.

  1. Coarse Stage: During the coarse stage, the entire region \(\:\varvec{y}\) is completed with the improved tensor completion algorithm, producing the coarse-stage completed region \(\:{\varvec{z}}_{temp}\). Because the global low-rank assumption affects both low-rank and high-rank regions, the process then moves to the fine stage to achieve higher-quality completion results.

  2. Fine Stage: The aim of the fine stage is to obtain satisfactory completion results by repeatedly completing small data regions. Each fine stage consists of three steps: completion, comparison, and replacement.

Step 1: Completion. The initial region \(\:\varvec{y}\) is segmented into \(\:N\) identically sized subregions, denoted as \(\:{\varvec{y}}_{patch}^{\left(1\right)},{\varvec{y}}_{patch}^{\left(2\right)},\cdots\:,{\varvec{y}}_{patch}^{\left(N\right)}\). Each subregion overlaps with its neighboring regions to provide smooth boundaries for the ensuing completion step. Each subregion is then completed separately using the proposed improved tensor completion algorithm with an appropriate local rank value. Compared with the coarse stage, the local rank in the fine stage is set lower.
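The partitioning in Step 1 can be sketched as follows (illustrative NumPy only; the function name and overlap width are our assumptions, chosen to mirror the \(\:{2}^{2f}\) partitioning with overlapping boundaries described later in the workflow):

```python
import numpy as np

def split_overlapping(region, f, overlap=5):
    """Split a 2-D region into 2^(2f) patches (a 2^f x 2^f grid), each
    extended by `overlap` pixels into its neighbors for smooth boundaries."""
    k = 2 ** f                       # patches per side
    H, W = region.shape[:2]
    patches, slices = [], []
    for i in range(k):
        for j in range(k):
            r0 = max(i * H // k - overlap, 0)
            r1 = min((i + 1) * H // k + overlap, H)
            c0 = max(j * W // k - overlap, 0)
            c1 = min((j + 1) * W // k + overlap, W)
            patches.append(region[r0:r1, c0:c1])
            slices.append((slice(r0, r1), slice(c0, c1)))
    return patches, slices

region = np.arange(64 * 64, dtype=float).reshape(64, 64)
patches, slices = split_overlapping(region, f=1, overlap=5)
```

Recording the slices alongside the patches lets the later replacement step write each (possibly replaced) patch back to its original position and average the overlaps.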

Step 2: Comparison. After the completion step under the local rank setting, the completed small regions \(\:{\widehat{\varvec{z}}}_{temp}^{\left(1\right)},{\widehat{\varvec{z}}}_{temp}^{\left(2\right)},\cdots\:,{\widehat{\varvec{z}}}_{temp}^{\left(N\right)}\) are obtained. To compare the small regions at corresponding positions from the two completion stages, the coarse-stage completed region \(\:{\varvec{z}}_{temp}\) is divided into \(\:N\) small regions of the same size based on the partitioning used in the current fine stage, resulting in \(\:{\varvec{z}}_{temp}^{\left(1\right)},{\varvec{z}}_{temp}^{\left(2\right)},\cdots\:,{\varvec{z}}_{temp}^{\left(N\right)}\). The discrepancies between corresponding small regions are quantified using the relative squared error (RSE):

$$\:\begin{array}{c}gap\left({\widehat{\varvec{z}}}_{temp}^{\left(k\right)},{\varvec{z}}_{temp}^{\left(k\right)}\right)=\frac{{\|{\widehat{\varvec{z}}}_{temp}^{\left(k\right)}-{\varvec{z}}_{temp}^{\left(k\right)}\|}_{F}}{{\|{\varvec{z}}_{temp}^{\left(k\right)}\|}_{F}}\end{array}$$
(11)

A larger RSE value indicates a larger error in the completion result for the current small region.

Step 3: Replacement. Decide whether to replace \(\:{\varvec{z}}_{temp}^{\left(k\right)}\) with \(\:{\widehat{\varvec{z}}}_{temp}^{\left(k\right)}\) based on a predetermined threshold \(\:\epsilon\:\) and the gap between the two corresponding small regions:

$$\:\begin{array}{c}{\varvec{z}}_{temp}^{\left(k\right)}=\left\{\begin{array}{c}{\varvec{z}}_{temp}^{\left(k\right)},\begin{array}{cc}if&\:gap\left({\widehat{\varvec{z}}}_{temp}^{\left(k\right)},{\varvec{z}}_{temp}^{\left(k\right)}\right)\ge\:\epsilon\:\end{array}\\\:{\widehat{\varvec{z}}}_{temp}^{\left(k\right)},\begin{array}{cc}if&\:gap\left({\widehat{\varvec{z}}}_{temp}^{\left(k\right)},{\varvec{z}}_{temp}^{\left(k\right)}\right)<\epsilon\:\end{array}\end{array}\right.\end{array}$$
(12)

This replacement scheme was adopted because the coarse-stage completion captures the underlying true overall structure of the region. If the newly completed small region \(\:{\widehat{\varvec{z}}}_{temp}^{\left(k\right)}\) differs significantly from the corresponding small region \(\:{\varvec{z}}_{temp}^{\left(k\right)}\), there is a risk of serious deviation from the true data for that region. Therefore, only the small regions that satisfy the threshold condition are replaced at the corresponding positions of \(\:{\varvec{z}}_{temp}\), and the overlapping parts of the small regions are averaged.
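Steps 2 and 3 together can be sketched as follows (illustrative NumPy; `gap` implements Eq. (11) and `replace_patch` the rule in Eq. (12)):

```python
import numpy as np

def gap(z_fine, z_coarse):
    """Relative squared error between corresponding patches (Eq. 11)."""
    return np.linalg.norm(z_fine - z_coarse) / np.linalg.norm(z_coarse)

def replace_patch(z_coarse_patch, z_fine_patch, eps):
    """Keep the fine-stage patch only when it stays close to the coarse
    result (Eq. 12); otherwise fall back to the coarse-stage patch."""
    if gap(z_fine_patch, z_coarse_patch) < eps:
        return z_fine_patch
    return z_coarse_patch

coarse = np.ones((4, 4))
fine_ok = np.ones((4, 4)) * 1.05   # small gap: accepted
fine_bad = np.ones((4, 4)) * 2.0   # large gap: rejected
r1 = replace_patch(coarse, fine_ok, eps=0.15)
r2 = replace_patch(coarse, fine_bad, eps=0.15)
```

`np.linalg.norm` defaults to the Frobenius norm for 2-D arrays, matching the norm used in Eq. (11).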

Because the set local rank may not necessarily be suitable for small regions in the fine stage, the fine stage was conducted multiple times in sequence. The size of the small regions decreases in the order of the fine stage. Through multiple fine-stage completions, both the high- and low-rank parts are likely to obtain the appropriate tensor ranks because of the simultaneous decrease in the size of the small regions and the use of local ranks set by the divide-and-conquer approach.

This fine stage was repeated several times. The initial fine stage aims to capture the sub-global structure of the entire area using larger divided regions. If all intermediate fine stages are omitted, the transition from global to local will be too abrupt, ignoring important sub-global relationships between small regions and resulting in a completion outcome that significantly deviates from the real data.

Construction of a prediction model for soil heavy metal content

A workflow of the prediction model for soil heavy metal content is shown in Fig. 1.

Fig. 1

The workflow of the improved tensor completion algorithm under the C2F framework. The interpolated data in this figure are generated from the soil heavy metal dataset, which was created using the ordinary Kriging interpolation method with ESRI ArcGIS 10.8 software (https://www.esri.com).

  1. Determine inputs and outputs. Kriging interpolation was applied to the study area, and a portion of the data was randomly removed in proportion.

  2. Enter the coarse stage. The low-rank tensor completion algorithm improved with total variation regularization is applied and optimized using ADMM. Initialize the number of ADMM iterations and, upon completion of all iterations, obtain the coarse-stage completion results.

  3. Enter the fine stage. Initialize the number of fine-stage iterations, divide the entire region into \(\:{2}^{2f}\) equally sized small regions, and apply the improved tensor completion algorithm to each small region, where \(\:f\) indicates that the \(\:f\)-th fine stage is currently being performed.

  4. Each small region completed in the previous step is compared with the corresponding small region from the coarse-stage results. Based on a comparison of the difference with the threshold, determine whether to replace the corresponding small region in the coarse-stage results.

  5. After the total number of fine-stage iterations is reached, the completion results are obtained. These results were compared with the Kriging interpolation results to evaluate the predictive performance of the model.

Analysis of experimental results

The hardware used in this experiment included an AMD Ryzen 7 4800U CPU with Radeon Graphics, an NVIDIA GeForce GTX 1650 Ti GPU with 4 GB of memory, and 16 GB of RAM. The software environment was the Windows 10 operating system, and the experimental platform was MATLAB R2020b.

Data preprocessing and visualization

The soil heavy metal dataset consisted of 319 sampling points, each with varying metal contents. The tensor constructed for the soil metal content data in the research area is therefore \(\:\varvec{t}\in\:{R}^{319\times\:319\times\:8}\). Using ArcGIS 10.8 software, the latitude and longitude coordinates of each point were marked on a map, and ordinary Kriging interpolation was performed for each heavy metal content. The parameter settings were as follows: a spherical model was selected to fit the spatial autocorrelation structure; a natural logarithmic transformation was applied to the original concentration data to reduce heteroscedasticity; and second-order polynomial trend components were removed. The interpolation used 12 lags, retained the default settings for anisotropy and other parameters, and generated the final spatial distribution raster at 30-meter resolution. The interpolated results are shown in Fig. 2. From the interpolation map, it can be observed that the spatial distribution of soil heavy metal content in the research area is centered on pollution sources, forming concentric circles that spread outward, with lower content farther from the pollution source. The heavy metals Cd, Zn, and Cu were relatively concentrated and mostly distributed near the pollution sources, whereas As, Cr, Hg, Ni, and Pb were relatively dispersed and spread across most of the research area. This spatial distribution reveals significant spatiotemporal correlations, implying that the tensor structure formed by the heavy metal data possesses low-rank or approximately low-rank properties. The predictive capability of the model may therefore depend on these low-rank spatial distributions: for instance, its performance might not generalize directly to regions with weak spatiotemporal correlations, and it may be limited for unusual diffusion processes dominated by anthropogenic factors.

Fig. 2

Kriging interpolation results for different metals. The interpolation results in this figure are generated from the soil heavy metal dataset, which was created using the ordinary Kriging interpolation method with ESRI ArcGIS 10.8 software (https://www.esri.com).

Model evaluation metrics and parameter settings

To evaluate the performance of tensor completion in the research area, the relative squared error (RSE) and peak signal-to-noise ratio (PSNR) were adopted as evaluation metrics. These two metrics are defined as follows.

$$\:\begin{array}{c}RSE=\frac{{\|\varvec{z}-{\varvec{z}}_{true}\|}_{F}}{{\|{\varvec{z}}_{true}\|}_{F}}\end{array}$$
(13)
$$\:\begin{array}{c}PSNR=10{{log}}_{10}\frac{\left({\prod\:}_{k=1}^{d}{I}_{k}\right){\widehat{\varvec{z}}}_{true}^{2}}{{\|\varvec{z}-{\varvec{z}}_{true}\|}_{F}^{2}}\end{array}$$
(14)

Here, \(\:\varvec{z}\) represents the completed data for the research area, \(\:{\varvec{z}}_{true}\) denotes the true values for the research area, \(\:{\widehat{\varvec{z}}}_{true}\) indicates the maximum true value for the research area, and \(\:{I}_{k}\:\left(k=1,\cdots\:,d\right)\) are the sizes of the tensor modes, so that \(\:{\prod\:}_{k=1}^{d}{I}_{k}\) is the number of tensor entries. The performance of the completion method is inversely related to the RSE value and directly related to the PSNR.
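The two metrics can be computed as in the following NumPy sketch (illustrative Python, not the paper's MATLAB code; the PSNR form assumes the squared peak value times the number of tensor entries in the numerator, the standard convention in tensor completion):

```python
import numpy as np

def rse(z, z_true):
    """Relative squared error (Eq. 13): Frobenius error over Frobenius norm."""
    return np.linalg.norm(z - z_true) / np.linalg.norm(z_true)

def psnr(z, z_true):
    """Peak signal-to-noise ratio: peak value squared times the number of
    entries, over the squared Frobenius error, in decibels."""
    peak = np.max(np.abs(z_true))
    sq_err = np.sum((z - z_true) ** 2)
    return 10 * np.log10(z_true.size * peak ** 2 / sq_err)

z_true = np.ones((8, 8)) * 4.0
z = z_true + 0.1     # a completion off by a constant 0.1
```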

To implement the C2F framework, set the number of repetitions in the fine stage to F = 3. In the first repetition of the fine stage, the threshold for replacing small regions is set to ε = 0.15. In the second and third repetitions, the threshold for replacing small regions is updated according to Formula (15):

$$\:\epsilon\:=\left\{\begin{array}{ll}\frac{3}{2}\:\underset{k}{{max}}\:gap\left({\widehat{\varvec{z}}}_{temp}^{\left(k\right)},{\varvec{z}}_{temp}^{\left(k\right)}\right),&if\:F=2,\\\frac{3}{2}\:\underset{k}{{max}}\left(gap\left({\widehat{\varvec{z}}}_{temp}^{\left(k\right)},{\varvec{z}}_{temp}^{\left(k\right)}\right)\cdot\:{r}_{k}\right),&if\:F=3.\end{array}\right.$$
(15)


In the second loop of the fine stage, vector \(\:\mathbf{r}\) is binary and records whether the completed small regions are replaced by their corresponding completed regions in the coarse stage, with elements of 0 or 1. The threshold increases as the size of the small regions decreases.
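The threshold update of Formula (15) can be sketched as follows (illustrative NumPy; we assume the binary vector `r` marks with 1 the regions whose replacement status should be counted, following the convention described above):

```python
import numpy as np

def update_threshold(gaps, F, r=None):
    """Threshold update for later fine stages (Formula 15).
    gaps: RSE gaps of all small regions from the previous fine stage.
    r: binary vector with elements 0 or 1, used only when F == 3."""
    gaps = np.asarray(gaps, dtype=float)
    if F == 2:
        return 1.5 * np.max(gaps)
    if F == 3:
        return 1.5 * np.max(gaps * np.asarray(r, dtype=float))
    raise ValueError("Formula (15) defines the update only for F = 2 and F = 3")

eps2 = update_threshold([0.05, 0.12, 0.08], F=2)
eps3 = update_threshold([0.05, 0.12, 0.08], F=3, r=[1, 0, 1])
```

Masking the gaps by `r` in the third repetition restricts the maximum to the regions selected by the binary vector, which makes the threshold adapt as the small regions shrink.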

To ensure convergence accuracy while maintaining computational efficiency, the number of ADMM iterations was set to 300, with a convergence threshold of 1e-8. A moderate low-rank constraint is introduced by setting the regularization coefficient λ₁ to 0.5, preventing the loss of spatial details due to overly strong regularization. The regularization coefficient λ₂ is set to 1000 to enhance robustness against noise and outliers. The multi-scale parameter scale_num is set to 5 to balance modeling spatial heterogeneity against computational complexity. To improve continuity in boundary regions during reconstruction, an image-patch overlap of overlap_pixel = 5 is adopted. Furthermore, a small-region replacement threshold of thre = 0.1 is employed to identify poorly reconstructed regions, enabling a more effective local update strategy during the fine-grained refinement stage. All other parameters are kept at their default settings. These hyperparameters were determined through grid search and performance evaluation on the validation set.

Analysis of random missing data completion results

To more intuitively illustrate the improvements brought by the improved tensor completion algorithm under the C2F framework, Fig. 3 shows the completion process for the element As with a 60% missing rate. As was chosen for its representativeness: its distribution is complex, and the differences in content between locations are significant. Figure 3a shows the original complete heatmap of the As content, whereas Fig. 3b depicts the heatmap with a 60% missing rate, where the black regions indicate missing data. Figure 3c shows the results of ordinary tensor completion, and Fig. 3d shows the results of the improved tensor completion algorithm proposed herein. Notably, the data obtained by the improved tensor completion algorithm under the C2F framework are richer in detail and closer to the original data. The data obtained using the ordinary tensor completion algorithm differed significantly from the original data, and the errors were more pronounced in boundary areas where the content values differ. In summary, the improved tensor completion algorithm under the C2F framework consistently enhances the performance of existing tensor completion algorithms and preserves more detail.

Fig. 3

As completion process. The original heatmaps in Fig. 3 were generated from the soil heavy metal dataset, which was created using the ordinary Kriging interpolation method in ESRI ArcGIS 10.8 (https://www.esri.com).

To simulate content completion under random missing scenarios, entries of the complete dataset were first randomly deleted at set proportions, and the proposed algorithm was then used to complete it. The completion results for missing rates ranging from 20 to 80% are listed in Table 1. For a given heavy metal, as the missing rate increased, the PSNR decreased continuously while the RSE increased continuously. Analysis of the PSNR and RSE across multiple heavy metals showed that when the missing rate was below 40%, the prediction accuracy of the proposed model decreased slowly; once it exceeded 40%, the accuracy decreased rapidly. For As, for example, compared with a 20% missing rate, at 40% the RSE increased by 23.53% and the PSNR decreased by 6.29%; at 60%, the RSE increased by 52.38% and the PSNR decreased by 8.1%; and at 80%, the RSE increased by only 3.13% and the PSNR decreased by only 0.13% relative to 60%. The proposed model therefore achieved an ideal completion effect for data with missing rates below 40%. Even at an 80% missing rate, however, the PSNR remained at approximately 30, which is within an acceptable error range, indicating that prediction for data with high missing rates remains feasible. Further analysis revealed that the proposed method performs differently for different heavy metals. The best completion results were observed for Pb and Cd, with small decreases in PSNR (4.74% and 6.49%, respectively) and relatively low increases in RSE (16.67% and 25.81%, respectively).
In contrast, the completion performance for Cu and Ni was poorer, with larger decreases in PSNR (16.02% and 19.84%, respectively) and more pronounced increases in RSE (125% and 208.33%, respectively). This difference is related to the concentration distribution characteristics of the different heavy metals: Pb and Cd are more regularly distributed, whereas Cu and Ni exhibit higher spatial heterogeneity. Moreover, given the conditions of the study area, which is rich in mineral resources and clustered with heavy industrial sectors, the soil contains significant amounts of heavy metals such as Pb and Cd. These metals are readily adsorbed by clay minerals, resulting in a relatively stable spatial distribution. In contrast, Cu and Ni exhibit higher mobility, which leads to greater variability in their concentrations and greater difficulty in prediction. Overall, the proposed method performs well for most heavy metals when the missing rate is below 40% and maintains acceptable prediction accuracy even at higher missing rates, making it effective for data completion in practical applications.
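The random-missing experiment and the two evaluation metrics can be sketched as follows. The definitions of PSNR (using the reference data's peak value) and RSE (a Frobenius-norm ratio) are the standard ones; the paper's exact conventions are assumed to match:

```python
import numpy as np

def random_mask(shape, missing_rate, seed=0):
    """Boolean mask with True = observed; entries are hidden uniformly at random."""
    rng = np.random.default_rng(seed)
    return rng.random(shape) >= missing_rate

def psnr(original, completed):
    """Peak signal-to-noise ratio (dB), with the peak taken from the reference data."""
    mse = np.mean((original - completed) ** 2)
    return 10 * np.log10(original.max() ** 2 / mse)

def rse(original, completed):
    """Relative squared error: ||X_hat - X||_F / ||X||_F."""
    return np.linalg.norm(completed - original) / np.linalg.norm(original)
```

With these definitions, a larger PSNR and a smaller RSE both indicate a completion closer to the withheld ground truth, matching the trends reported in Table 1.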

Table 1 Completion results for different heavy metals at various missing rates.

Algorithm comparative analysis

To verify the effectiveness of the proposed algorithm (C2F-STDCTV), it was compared with the total-variation-regularized low-rank tensor completion algorithm (STDCTV), the STDC algorithm, high-accuracy low-rank tensor completion (HALRTC), and low-rank total variation with proximal decomposition splitting (LRTV-PDS). Given the local conditions, where Pb and Zn have been mined over a long period, the soil contains large amounts of these heavy metals, and it can be inferred that they are the most severe pollutants in the study area. These two metals were therefore selected to demonstrate the predictive performance of the different algorithms.

Tables 2 and 3 present the PSNR and RSE values of metals Pb and Zn, respectively, under different missing rates (20%, 40%, 60%, and 80%). From the tables, it is evident that the C2F-STDCTV algorithm consistently outperforms the other algorithms across all missing rates. For instance, for metal Pb at a 60% missing rate, the PSNR of C2F-STDCTV is 33.569, significantly higher than that of STDCTV, STDC, HALRTC, and LRTV-PDS. Similarly, for metal Zn at a 40% missing rate, the PSNR of C2F-STDCTV is 32.379, surpassing that of STDCTV, STDC, HALRTC, and LRTV-PDS. These results further validate the superiority of C2F-STDCTV in handling complex data missing problems.

Table 2 PSNR and RSE comparison results for metal Pb under different missing rates.
Table 3 PSNR and RSE comparison results for metal Zn under different missing rates.

Figure 4 depicts the trend in PSNR for Pb and Zn as the random missing rate ranges from 10 to 90%. Overall, the PSNR of all the prediction algorithms decreased as the missing rate increased. When the missing rate was below 40%, the decrease in PSNR was relatively slow; between 40% and 80%, the rate of decrease grew; and beyond 80%, the PSNR dropped rapidly. Thus, the proposed algorithm recovers data well at missing rates below 80% and performs best below 40%. In detail, the prediction performance of C2F-STDCTV exceeded that of STDCTV, STDC, HALRTC, and LRTV-PDS. For Pb at an 80% missing rate, for example, the PSNR of C2F-STDCTV was 32.427, significantly higher than that of the other algorithms: an improvement of 76.830% over STDC, 28.485% over STDCTV, 43.214% over HALRTC, and 23.128% over LRTV-PDS.

Fig. 4

PSNR comparison results of different algorithms.

Figure 5 illustrates the trend in RSE for Pb and Zn as the random missing rate ranges from 10 to 90%. Overall, the RSE of all the prediction methods increased with the missing rate. When the missing rate was below 40%, the increase in RSE was relatively slow; between 40% and 80%, the rate of increase grew; and beyond 80%, the RSE rose rapidly. The proposed algorithm again excelled at recovering data with missing rates below 80% and performed best below 40%. In detail, C2F-STDCTV outperformed STDCTV, STDC, HALRTC, and LRTV-PDS. For Zn at an 80% missing rate, for example, the RSE of C2F-STDCTV was 0.038, significantly lower than that of the other algorithms: a reduction of 61.224% relative to STDC, 44.118% relative to STDCTV, 56.322% relative to HALRTC, and 37.705% relative to LRTV-PDS. The proposed model thus demonstrates superior completion performance and greater stability when the missing rate is below 80%. This advantage is primarily attributable to the adaptive local rank-setting strategy. Unlike fixed rank settings, the proposed approach introduces a multi-stage coarse-to-fine completion process: in each fine-grained stage, the shrinking local regions are dynamically matched with appropriate tensor ranks through a divide-and-conquer strategy, allowing the model to maintain stable performance even as the missing rate increases.
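The coarse-to-fine strategy just described can be sketched in minimal form. This is an illustrative reduction, not the paper's implementation: iterative truncated-SVD imputation stands in for the full STDCTV solver (which additionally carries TV and sparsity terms), and the patch size, ranks, and threshold are assumed defaults:

```python
import numpy as np

def lowrank_impute(data, mask, rank, n_iter=200):
    """Iterative truncated-SVD imputation; a simple stand-in for the
    full STDCTV solver used in the paper."""
    x = np.where(mask, data, data[mask].mean())
    approx = x
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(x, full_matrices=False)
        approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
        x = np.where(mask, data, approx)  # keep observed entries fixed
    return approx  # low-rank fit; its misfit on observed entries flags failure

def c2f_complete(data, mask, coarse_rank=2, fine_rank=2, patch=16, thre=0.1):
    """Coarse global completion, then re-solve patches whose relative error
    on observed entries exceeds `thre`, each with its own local rank."""
    est = lowrank_impute(data, mask, coarse_rank)
    h, w = data.shape
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            sl = np.s_[i:i + patch, j:j + patch]
            m = mask[sl]
            if m.sum() < fine_rank + 1:
                continue  # too few observations for a local solve
            obs = data[sl][m]
            err = np.linalg.norm(est[sl][m] - obs) / (np.linalg.norm(obs) + 1e-12)
            if err > thre:  # poorly reconstructed region: local fine update
                est[sl] = lowrank_impute(data[sl], m, fine_rank)
    return np.where(mask, data, est)  # restore observed entries
```

The key point the sketch captures is the divide-and-conquer step: the global (coarse) solve fixes most of the field, and only regions where the fit to observed entries is poor are re-solved locally with a rank matched to that region.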

Fig. 5

RSE comparison results of different algorithms.

Comparing the PSNR and RSE of the different tensor completion algorithms at various missing rates shows that adding total variation regularization to low-rank tensor completion, which effectively exploits local smoothness and piecewise priors, gives STDCTV higher completion accuracy than STDC. The C2F framework resolves the trade-off between low-rank and high-rank regions that a global low-rank setting imposes; thus, the completion accuracy of C2F-STDCTV is higher than that of STDCTV. The performance of HALRTC and LRTV-PDS lies between that of STDCTV and STDC, indicating that although these methods address some limitations of STDC, they do not reach the optimization level of C2F-STDCTV. In conclusion, the C2F-STDCTV method resolves tensor completion issues and significantly improves completion accuracy. Multiple experiments on soil heavy metal data demonstrated that the proposed method can be effectively applied to soil heavy metal pollution prediction.
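The role of the total variation term can be illustrated with a toy measure. The function below is the standard anisotropic TV (summed absolute differences between neighboring entries), shown here only to demonstrate that smooth concentration fields score far lower than noisy ones; the example fields are synthetic:

```python
import numpy as np

def total_variation(x):
    """Anisotropic total variation: the summed absolute differences between
    neighboring entries, i.e., the smoothness prior that TV regularization
    penalizes."""
    return np.abs(np.diff(x, axis=0)).sum() + np.abs(np.diff(x, axis=1)).sum()

# A smooth gradient field versus the same field with added noise: penalizing
# TV during completion steers estimates toward the piecewise-smooth version,
# which is why STDCTV outperforms STDC on spatially continuous content maps.
smooth = np.linspace(0.0, 1.0, 32)[None, :] * np.ones((32, 1))
noisy = smooth + np.random.default_rng(1).normal(0.0, 0.2, smooth.shape)
```

Because heavy metal concentrations vary smoothly within zones but jump at zone boundaries, a TV penalty preserves sharp boundaries while suppressing spurious oscillations in the completed field.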

Conclusion

The C2F-STDCTV model proposed in this paper has significant advantages over existing tensor completion methods, particularly in handling complex data missingness. Traditional tensor completion methods often fail to address the challenges posed by high missing rates and spatial heterogeneity in heavy metal data. In contrast, the C2F-STDCTV model improves completion accuracy by incorporating total variation regularization and the ADMM optimization algorithm. Additionally, the integrated optimization in both the coarse and fine stages of the C2F-STDCTV model enhances the accuracy of heavy metal content prediction, enabling better handling of incomplete data and errors. Finally, the model was applied to predict soil heavy metal content. Experimental results demonstrate that, compared with the traditional tensor completion method STDC, the proposed method shows significant improvements: at an 80% missing rate for Pb, the PSNR increases by 76.830% and the RSE decreases by 80.952%. The model also performs best on both quantitative evaluation metrics across various heavy metals and missing rates, indicating that the C2F-STDCTV model has superior predictive performance. The method proposed in this paper provides a more effective and accurate solution for soil heavy metal pollution prediction, demonstrating more reliable performance in practical applications than traditional methods. To facilitate further research and application, the code and data used in this study are available upon request via email.