Improving model performance in mapping black-soil resource with machine learning methods and multispectral features

Hu, Jianfang; Tang, Yulei; Yan, Jiapan; Zhang, Jiahong; Zhao, Yuxin; Chen, Zhansheng

doi:10.1038/s41598-024-82399-3

Article
Open access
Published: 07 January 2025

Jianfang Hu^1,2,
Yulei Tang^1,2,
Jiapan Yan^1,2,
Jiahong Zhang³,
Yuxin Zhao⁴ &
…
Zhansheng Chen^1,2

Scientific Reports volume 15, Article number: 1199 (2025)

Abstract

Accurate information on the distribution of regional black-soil resource is an important element of sustainable soil management, and it can provide decision makers with robust data that can be translated into better decision making. This study utilized all Sentinel-2 images covering the study area from April to July in 2022. After masking clouds, all images were synthesized monthly. Based on the revised random forest classification algorithm, model performance using different feature combination schemes was evaluated to search for an efficient, high-precision method for mapping black-soil resource. The impact on model performance of adding temperature, precipitation and slope geographic covariates was analyzed, and the robustness of the model was verified using Landsat-8 data with lower spatial resolution. The results showed that (1) the model based on multi-temporal ensemble features for mapping black-soil resource shows the best performance, with an OA of 94.6%; (2) adding the temperature covariate can effectively improve the accuracy of black-soil resource mapping; (3) compared to the Sentinel-2 data, the performance of the model based on Landsat-8 data is reduced but still plausible, verifying the robustness of the model. This study provides a robust method to improve model performance for rapid mapping of black-soil resource.

Introduction

Black soil refers to suitable agricultural soil which has a black surface horizon, enriched with organic carbon that is at least 25 cm deep^1,2. Black soil is recognized as one of the most fertile soils in the world, including three main soil groups Chernozem, Kastanozem and Phaeozem³. There are only four major black soil distribution areas in the world, which are in the Mississippi Plain of North America, the Pampas Plain of South America, the Northeast Plain of China, and the Ukrainian Plain^2,4. Although black soils account for only one-sixth of the global arable land area, a significant percentage of oilseed, cereals, and tuber crops are harvested from black soils⁵. Apart from being highly productive lands, black soils are responsible for multiple ecosystem services. One of the most valuable services is the high potential to mitigate climate change because of their high soil organic carbon sequestration potential⁶. Due to anthropogenic intervention and intensive cultivation systems, black soils are suffering severe degradation such as erosion, loss of organic matter, compaction and soil acidification^7,8. Considering their relevance for food security and their increasing vulnerability to soil degradation, it is of the utmost importance to promote the sustainable use, management and conservation of black-soil resource. Therefore, it is necessary to obtain accurate, updated, and detailed information regarding the distribution of black-soil resource.

Satellite remote sensing data has emerged as a vital tool for soil resource surveys and information generation. It contributes to the evolution of optimal land-use plans at scale ranging from regional to micro levels for sustainable development⁹. Traditionally, survey of black-soil resource based on ground surveys requires soil sampling at a large scale, which is relatively difficult and costly to implement^10,11. Moreover, it is difficult to ensure the accuracy of the results in areas with large spatial heterogeneity. Satellite remote sensing data with rapid, low cost, and wide spatiotemporal coverage play a vital role in soil mapping by reducing the need for extensive time-consuming and costly field surveys^12,13,14. Sahadevan explored the potential of multi-label classification approach to classify soil types using airborne hyperspectral image and Sentinel-2 images¹⁴. With the advancement of technology, multispectral images have the ability to obtain adequate soil information independently¹⁵. An ensemble modelling approach to improve the accuracy of soil texture component estimation using Sentinel-2 data has been investigated¹⁶. Multiple spectral transformation and topographic factors were also used by Neyestani to map soil class in central Iran, and their model had an overall accuracy of 53% and a kappa coefficient of 0.39¹⁷. However, the relationship between black soil reflectance spectroscopy and image information still needs to be further explored.

The properties of the soils that govern the spectral reflectance are color, organic matter, moisture, texture and mineralogy^9,18,19. Soil organic matter content (SOMC) primarily controls the expression of black soil properties. Previous results have indicated that increasing SOMC results in an overall decrease in reflectance in the visible (VIS), near-infrared (NIR), and shortwave infrared (SWIR) spectral regions^20,21,22. The most commonly used optical satellites, Sentinel-2 and Landsat-8, cover the relevant sensitive bands. Based on the relationship between soil properties and the spectral information of satellite images, many studies have accurately mapped the regional distribution of soil information²³. For example, Lin explored a new way to accurately determine the SOMC by focusing on the influence of soil moisture on the SOM²⁴. Luo used multitemporal synthetic Landsat-8 images in Google Earth Engine to obtain the regional mapping of SOMC²⁵. A local strategy was proposed in the study of Meng to reduce the influence of the high heterogeneity of SOMC and environmental variables on the prediction results²⁶, and it was used to generate the first high-resolution global SOMC product for black soil regions. Unfortunately, the band parameters extracted from ground spectra tend to suffer from the influence of environmental noise. Spectral indices based on linear or nonlinear band combinations can partly suppress the effects of factors such as soil surface conditions and the atmosphere^27,28.

There is a need to further explore the potential of spectral indices for efficiently modeling precise information on the extent and distribution of black-soil resource. Time series images have been proven to be more effective than mono-temporal images for crop mapping, wetland mapping and land use mapping^29,30. Previous studies extracted spatial and spectral parameters from time series remote sensing images to reflect more complete soil surface information, which is conducive to the development of soil classification mapping³¹. Consequently, establishing a relationship between time series spectral indices and soil properties enables the discrimination of spatial variations in black-soil resource³². An effective model is essential for handling large-scale, multitemporal, and high-dimensional input features. Random forest (RF), as a concrete implementation of the ensemble machine learning method, can effectively solve the problem of processing and analyzing remote sensing images with huge data volume and complex features^33,34,35. With the continuous improvements in computing platforms, both the probability map and the categorical map of the black soil distribution have been generated by using the RF model^36,37,38. However, the RF model is better at handling sensitive features, and redundant features will reduce the model performance³⁸. It is therefore necessary to develop an innovative framework that combines the RF model with a method for filtering the key features that best reflect the properties of the black-soil resource.

The primary aim of this paper is to provide a more effective mapping model for black-soil resource. We focused on enhancing model performance through the utilization of readily available multispectral features and machine learning methods. To achieve this, the main objectives of the study are as follows: (1) study the methods of extracting and combining features from multitemporal remote sensing images for black soil mapping; (2) improve the RF model by integrating feature filtering technology to increase the accuracy of black-soil resource discrimination; (3) evaluate the reliability of adding slope data, temperature data and precipitation data for the mapping of black-soil resource; (4) demonstrate the robustness of this model by applying it to Landsat-8 data. This study can provide a model reference for discriminating black-soil resource by using easily available multispectral data and similar spectral curves. The resulting moderate-spatial-resolution (10-m) black-soil resource map can be used as technical support for soil monitoring and protection.

Materials and methods

Study area

Figure 1 illustrates that the study area is mainly located in Kangping County (131°27′–132°15′E, 46°28′–46°59′N) in the Liao River basin of Northeast China. The main towns covered by the study area are XCZ, HZWB, BSJZ and others (the information on towns is in Supplementary Table S1). The study area is located at the southern edge of the Horqin Sandy Land, adjacent to Changtu and Faku counties, covering an area of over 4000 km². The study area features a continental monsoon climate of the north temperate zone. The terrain of the study area is flat with an average elevation of 382.1 m³⁹. The annual average temperature is 6.9 °C, the average sunshine time is 2867 h, the average wind speed is 4.6 m/s, the average precipitation is 524 mm, and the average frost-free period is 151 days⁴⁰. The main types of black soil in the study area are meadow soil, black soil, dark brown soil, etc. The plots in the area are large, which makes it suitable for large-scale planting. The sandy topography of the Horqin Sandy Land poses a threat of black soil degradation in the study area⁴¹.

According to previous studies, the bare-soil period in study area lasts from the beginning of April to the end of May every year. From April to July, the region typically experiences a period characterized by low vegetation coverage, except for some areas covered by snow³⁰. This facilitates the use of remote sensing satellites to monitor the physical and chemical properties of the black soil.

Data acquisition and treatment

Training samples and verification samples

In recent years, our research group has conducted numerous field expeditions to the study area. The results of these field expeditions revealed a general view of the distribution and the characteristics of black soil resources in the study area. We classified the soil types in the study area as two dominant classes (black-soil and other-soil). Six different major land cover classes have been identified within the study area. The detailed description of each class was given in Table 1.

Table 1 Detail description of land cover classes.

Both training samples and verification samples were collected from field sampling data in 2022. The sampling strategy used was stratified random sampling considering soil type, land cover types and other factors. Specifically, the study area was divided into six strata according to the land cover classes in Table 1. Pixels of 10 m × 10 m were designated as the smallest sampling unit, and the number of sampling points within each stratum was determined based on the following ratio: “black-soil: other-soil: water: vegetation: building: floodplain = 3:2:1:1:2:1”. The central position of each sample point was recorded with a global positioning system, and the soil color, soil thickness and SOC content of each sample were collected. The quality of the training sample points strongly affects model performance. The sample points were therefore filtered according to the soil thickness and SOC content, supplemented by Google Earth and high-resolution optical remote sensing images. A total of 298 samples were selected based on established classification systems for the study area (the information on the sample data is in Supplementary Table S2). Among them, the total number of black soil sample points is 92.

Satellite retrievals

Sentinel-2 satellite data has the comprehensive advantages of high spatial and temporal resolution, rich spectral information and low acquisition cost. It provides a new opportunity for extracting information on the distribution of black soil on a large scale and with high precision. The Sentinel-2 Level-2A (L2A) data product represents geometrically corrected and atmospherically processed surface reflectance data in 13 bands covering the visible to short-wave infrared with a spatial resolution of up to 10 m⁴² (the detailed information on the spatial resolution and spectral characteristics is in Supplementary Table S3). According to the local climate and local crop phenology, the Sentinel-2 L2A imagery covering the study area from April to July in 2022 was selected to construct the experimental datasets (the June image was excluded because its high cloud content would affect the model performance evaluation). The Sentinel-2 images were acquired from the GEE platform (https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_SR_HARMONIZED).

Table 2 Image collection information.

In order to compare the performance of the black soil resource mapping model applied in different satellites, Landsat-8 data covering the study area from April to July 2022 were selected from GEE platform (https://developers.google.com/earth-engine/datasets/catalog/LANDSAT_LC08_C02_T1_L2). The spatial resolution of Landsat-8 data is 30 m, and the revisit period is 16 d (The detailed information of Landsat-8 data is in the Supplementary Table S4). Table 2 shows the information of acquisition date and cloud content for the Sentinel-2 data and Landsat-8 data used in this study.

All Sentinel-2 images were cloud masked using the ‘QA60’ band in the image to obtain a cloud-free time-series Sentinel-2 image collection. Among them, ‘Bit10’ and ‘Bit11’ were used to identify the opaque clouds and cirrus clouds. Then, the ‘QA_PIXEL’ and ‘QA_RADSAT’ bands were used to mask the images with clouds and shadows and obtain a cloud-free Landsat-8 image collection for the time period of interest. A median composite was performed based on each monthly image on the cloud-free Sentinel-2 image collection and cloud-free Landsat-8 image collection.
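The masking and compositing steps described above can be sketched for a single pixel in plain Python. This is only an illustration under stated assumptions, not the GEE workflow used in the study: the function names `is_clear` and `monthly_median` are hypothetical, and the input is assumed to be a list of (reflectance, QA60) pairs for one band of one pixel within one month.

```python
from statistics import median

OPAQUE_CLOUD_BIT = 10  # QA60 bit 10 flags opaque clouds
CIRRUS_BIT = 11        # QA60 bit 11 flags cirrus clouds


def is_clear(qa60: int) -> bool:
    """Return True when neither the opaque-cloud nor the cirrus bit is set."""
    cloud_mask = (1 << OPAQUE_CLOUD_BIT) | (1 << CIRRUS_BIT)
    return (qa60 & cloud_mask) == 0


def monthly_median(observations):
    """Median reflectance of the clear observations of one pixel in one month.

    `observations` is a list of (reflectance, qa60) pairs (assumed layout).
    Returns None when every observation in the month is flagged as cloudy.
    """
    clear = [reflectance for reflectance, qa60 in observations if is_clear(qa60)]
    return median(clear) if clear else None
```

Using the median rather than the mean keeps a single residual bright (unmasked cloud) or dark (shadow) observation from dominating the monthly value.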

Geographic covariates

Many previous studies have used supplemental topographic and climate data to build soil discrimination models, and the results have shown that the slope, average temperature and average precipitation are the most useful factors for improving the performance of such models with good interpretability⁴³. The accumulation of organic matter differs significantly between steep areas and flat terrain⁴⁴. The air temperature affects the soil temperature, and soils with different temperatures will have different spectral characteristics⁴⁵. Precipitation directly affects soil water content and further affects the soil reflectance spectral characteristics⁴⁶.

Three geographic covariates were selected in this study, with data from the GEE platform. For the slope data, we used NASADEM Digital Elevation 30-m data to calculate “SLOPE” as the slope input of the study area. The average temperature and precipitation data were calculated from the “ERA5 Monthly Aggregates”. All geographic covariate data were resampled to ensure uniform spatial resolution (10–30 m) as input to the model.

Selection of feature variables

Table 3 describes in detail the feature variables extracted in this study, which consist of two main parts: band-features and spectral index-features.

Table 3 Feature variable set (variables in ATSAVI mean that X = 0.08, a = 1.22, b = 0.03).

The color of black soil and the SOMC significantly affect the band reflectance. In this study, the reflectance of 10 bands of multi-temporal images was selected as the band-features. Adding the water index not only enhances the spectral response of soil moisture and amplifies the subtle differences in the spectral properties of the black soil, but also suppresses the effect of floodplains on soil reflectance⁵⁸. Because soil reflectance indexes are closely related to the type of soil and the SOC content, three typical soil reflectance indexes were included in the feature variable set. According to the land cover types in the study area, a vegetation index and a building index were extracted from the preprocessed Sentinel-2 data. In addition, the red-edge index was taken as a separate category of feature in this study.
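As an illustration of how such index features are derived from band reflectance, the sketch below computes two widely used normalized-difference indices, NDVI and MNDWI, from their standard definitions. The exact formulas and band assignments used in this study are those of Table 3; the Sentinel-2 band pairings noted in the comments (B8/B4 for NDVI, B3/B11 for MNDWI) are an assumption made here for illustration, and the function names are hypothetical.

```python
def normalized_difference(a: float, b: float) -> float:
    """Generic normalized difference (a - b) / (a + b).

    Returns 0.0 when the denominator is zero, e.g. for masked pixels.
    """
    denominator = a + b
    return (a - b) / denominator if denominator != 0 else 0.0


def ndvi(nir: float, red: float) -> float:
    """Vegetation index; assumed bands: NIR = B8, red = B4."""
    return normalized_difference(nir, red)


def mndwi(green: float, swir1: float) -> float:
    """Water index; assumed bands: green = B3, SWIR1 = B11."""
    return normalized_difference(green, swir1)
```

Because each index is a ratio of band differences to band sums, a multiplicative change in illumination that affects both bands equally cancels out, which is one reason index features are less sensitive to atmospheric and surface conditions than raw reflectance.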

Algorithm descriptions

Random Forest (RF), first proposed in 2001, is a regression and classification decision tree approach widely used in digital soil mapping^59,60. RF uses multiple trees to train and predict samples, and its essence is to use multiple decision tree classifiers to determine the final classification result. This model has obvious advantages in processing high-dimensional data compared with other decision tree classification algorithms. The RF classification model, implemented with ‘ee.Classifier.smileRandomForest()’, was employed in GEE to map black-soil resource. After repeated trials, the parameters were set as follows: the number of trees was set to 100, mainly to balance accuracy and operation efficiency; for variablesPerSplit, we used the default value of ‘null’; for minLeafPopulation, we used the default value of 1; for bagFraction, we used the default value of 0.5; for maxNodes, we used the default value of ‘null’; and for seed, we used the default value of 0.
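How an ensemble of trees arrives at a single label can be sketched in a few lines. This is a conceptual illustration only, not the GEE implementation: `bag_sample` and `majority_vote` are hypothetical helpers, and sampling with replacement is an assumption made here, since the text only states that a bag fraction of 0.5 was used.

```python
import random
from collections import Counter


def bag_sample(samples, bag_fraction=0.5, rng=None):
    """Draw the training subset for one tree.

    Assumption: sampling is done with replacement; the subset size is
    bag_fraction times the number of samples (at least one sample).
    """
    rng = rng or random.Random(0)
    size = max(1, int(len(samples) * bag_fraction))
    return [rng.choice(samples) for _ in range(size)]


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For example, if three trees predict 'black-soil', 'other-soil' and 'black-soil' for a pixel, `majority_vote` assigns the pixel to 'black-soil'.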

Recursive Feature Elimination (RFE) is a feature selection method commonly used in machine learning⁶¹. It works by recursively removing features and building models on the remaining features until the optimal subset of features is found. It initializes the feature set to all available features and calculates the importance of each feature⁶². In this study, the random forest feature importance, i.e., the tree-based feature importance from Mean Decrease in Impurity (MDI), was calculated. At each iteration, the least significant features are removed based on their importance scores determined by the model. The feature elimination process continues until the specified number of features is reached or until the performance of the model no longer improves. In order to identify the most relevant features for mapping while reducing overfitting and improving model interpretability⁶³, this study improved the RF model by combining it with the RFE method to develop a multi-feature ensemble random forest model.
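The elimination loop can be sketched generically as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: `importance_fn` and `score_fn` are hypothetical callables standing in for training an RF model on a feature subset and returning its MDI importances and its overall accuracy, respectively, and one feature is removed per iteration.

```python
from typing import Callable, Dict, List, Tuple


def recursive_feature_elimination(
    features: List[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    score_fn: Callable[[List[str]], float],
    min_features: int = 1,
) -> Tuple[List[str], float]:
    """Remove the least important feature until the score drops.

    Stops when `min_features` is reached or when removing the weakest
    feature would lower the score of the model.
    """
    current = list(features)
    current_score = score_fn(current)
    while len(current) > min_features:
        importances = importance_fn(current)
        weakest = min(current, key=lambda name: importances[name])
        candidate = [name for name in current if name != weakest]
        candidate_score = score_fn(candidate)
        if candidate_score < current_score:
            break  # further elimination no longer helps
        current, current_score = candidate, candidate_score
    return current, current_score
```

In practice the stopping rule could also be expressed as a minimum improvement threshold; the simple "stop when the score drops" rule is used here to keep the sketch short.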

In order to explore the optimum black-soil resource mapping model, four models were designed based on the features in Table 3. The fundamental feature combination model (NBF-RF) was constructed using the mono-temporal band reflectance. In order to eliminate the influence of other factors, such as atmosphere, on the information of image bands, we added spectral index features to design the mono-temporal ensemble features model (NEF-RF). Multi-temporal ensemble features model (MEF-RF) was constructed to demonstrate the validity of time series features. In addition, multi-temporal band features model (MBF-RF) was constructed to compare with the MEF-RF model to demonstrate the necessity of multispectral features. Information on the models is shown in Table 4. We evaluated the accuracy of these models to investigate the effect of different feature variables and obtain the best model performance.

Table 4 Description of the models.

Accuracy assessment

We evaluated the models via independent verification methods. First, we used a filter algorithm to divide the sample points into training samples and verification samples in a ratio of 7:3. Ultimately, 70% of the samples (209) were utilized to develop the mapping model, and the remaining 30% were utilized to assess the performance of the model. A confusion matrix was developed using the outputs and verification samples for this comparative analysis. The confusion matrix is a widely used method for accuracy evaluation in land cover classification. In the confusion matrix, the values of verification samples are compared with the values of the corresponding result data to provide categories such as true positive (TP), true negative (TN), false positive (FP), and false negative (FN)^64,65. Finally, parameters such as precision, recall and F1 score can be computed to evaluate model performance (Eqs. (1)–(3)).

$$precision=\frac{{TP}}{{TP+FP}}=\frac{{TP}}{{Total~predicted~positive}}~$$

(1)

$$recall=\frac{{TP}}{{TP+FN}}=\frac{{TP}}{{Total~actual~positive}}~$$

(2)

$$F1=2*\frac{{precision*recall}}{{precision+recall}}$$

(3)

$$OA=\frac{{TP+TN}}{T}$$

(4)

In order to visually and concisely compare the overall performance of different models (Table 4), the Overall Accuracy (OA) was employed in this study. Equation (4) states the OA computed for further statistical analysis, where T is the total number of pixels in the image. OA is the ratio of the number of correctly classified image elements to the total number of image elements, representing the classification accuracy of the overall image elements. Several field expeditions were conducted for ground-truthing of different land use and land cover types, as described in Table 1 (Section “Data acquisition and treatment”). The ground truth of different land cover types is presented in Supplementary Fig. 2.
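The four accuracy measures can be computed directly from the confusion-matrix counts of one class. The sketch below uses the standard definitions, with precision as TP over all predicted positives and recall as TP over all actual positives. It assumes a one-vs-rest view in which T = TP + TN + FP + FN, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Precision, recall, F1 and overall accuracy from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0.
    """
    predicted_positive = tp + fp
    actual_positive = tp + fn
    total = tp + tn + fp + fn  # T, assuming a one-vs-rest confusion matrix

    precision = tp / predicted_positive if predicted_positive else 0.0
    recall = tp / actual_positive if actual_positive else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    oa = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "oa": oa}
```

With purely hypothetical counts of TP = 8, TN = 80, FP = 2 and FN = 10, precision is 0.8 while recall is only about 0.44, which shows why F1 is reported alongside OA: a class can be rarely over-predicted yet frequently missed.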

After obtaining the map, we conducted another field survey of the study area to verify the accuracy of the map through field research (the survey route is shown in Supplementary Fig. 3). Meanwhile, a reference distribution map for the different classes was derived from all the field sampling data using random forest interpolation, and it was used in an overlay analysis of the reference map against the map obtained from the model. In addition, we conducted statistical analyses for black-soil resource using a categorized statistical tool in QGIS.

Results and discussion

Spatial distribution of black-soil resource

The model with the highest accuracy among the algorithms described in the previous section was used to map the spatial distribution of the black-soil resource in the study area. The overlay of matching classes shows that the accuracy of the map obtained by the model reached 90%. In addition, the areas in the reference and simulated maps obtained by the categorized statistical tool indicated a consistency of more than 90% in the black-soil areas. These comparisons support the conclusion that the map we acquired has reliable accuracy.

Black-soil is primarily concentrated in the southern part of the study area, with a high concentration in DX, XGT, FJT, DGT, and ENSK in the southwestern region and KP, SL in the central region (Fig. 2). However, the distribution of black-soil in the northwest part of the study area, specifically in Horqin District, and in XCZ, HZWB, BSJZ, and LJZ at the northern part of Kangping town, appears to be fragmented. This may be due to the impact of soil erosion at the southern edge of the Horqin Sandy land (The towns mentioned can be found in Supplementary Table S1).

Model performance evaluation

Model performance based on NBF-RF

The evaluation results of the model based on NBF-RF are shown in Table 5. The model constructed based on the image in May for discriminating black-soil resource had the highest accuracy, with an OA of 79.7%. The confusion matrix generated using the NBF-RF model (Fig. 3) showed that several land cover classes were misinterpreted. In particular, the model based on April had more misidentification of ‘black-soil’, and the model based on May showed significant improvement. However, a significant amount of ‘black-soil’ was still misinterpreted as ‘building’. Statistical analysis was carried out to evaluate the accuracy of the land cover map using the F-1 score. The F-1 score of ‘black-soil’ based on the model in May presented the highest value. The OA of the NBF-RF model still varies between months, so it is necessary to explore the effect of the inclusion of spectral indices on model accuracy and stability.

Table 5 Model performance based on mono-temporal features.

To better understand the impact of band features on accurately determining black-soil resources, we obtained the importance of band features from each mono-temporal image. As shown in Fig. 4, the band features B2, B9, B11, and B12 contributed more to the identification of black-soil resource. This is because the spectral profile of black soil has significant absorption valleys in the red visible band and shortwave infrared band (350–700 nm, 1400–2052 nm), which correspond to these bands. The low importance of B8A and B7 indicates that the vegetation red-edge bands do not have a significant effect on improving the discrimination of black-soil resource.

Model performance based on NEF-RF

The overall performance of the NEF-RF model in each month improved significantly compared to that of the NBF-RF model, with OA increased by 2.9, 1.4, and 0.5, respectively (Table 6). This indicates that the inclusion of spectral indices can suppress the influence of the atmosphere and other factors and thereby improve the model performance. The NEF-RF model constructed in May performed as well as the NBF-RF model in terms of accuracy, which indicates that May is the optimal time for discriminating black-soil resource. The misinterpretation of land cover classes was significantly reduced when the NEF-RF model was applied. For instance, the ‘black-soil’ class is misinterpreted as the ‘vegetation’ or ‘building’ class when using the NBF-RF model. The NEF-RF model provided better differentiation between these land cover classes, which were misclassified in the NBF-RF model. The results of the statistical analysis carried out for this model are presented in Table 6. Compared to the NBF-RF model, the F-1 score of ‘black-soil’ in the NEF-RF model increased, but by a smaller amount.

The importance order of the ensemble features in different periods is slightly different. But the importance of the band features (B2, B5, B6, B12) and the index features (MNDWI, NDSI, NDVI) extracted from different phase images are at a high level (For a clearer presentation, the 16 feature variables with higher importance in each month are shown in Fig. 5). From the view of the top ten features ranked in the combination of mono-temporal ensemble features, the ratio of band features has reached more than 60%. Band reflectance of satellite images can effectively reflect the characteristic difference information of black-soil resource, which is very important for the identification of black-soil resource.

Model performance based on multi-temporal features

The evaluation results of the MEF-RF model with and without feature filtering are shown in Table 6. Compared with the model performance using only band features or mono-temporal ensemble features, the performance of the MEF-RF model improved significantly. Before feature filtering, the OA of the black-soil resource discrimination model reached 91.4%. The data redundancy was then reduced using the RFE feature selection method. The iteration ended when the increase in OA was less than 1%. The model was considered to have reached its most stable state when 33 features were selected (Supplementary Fig. 4). The accuracy of the black-soil resource identification model was higher after feature filtering, with an OA of 94.6%. Extracting the same band features and index features from images of different periods can provide richer image information and effectively address the problem that the sensitivity of reflectance spectra varies with time.

Table 6 Model performance on multi-temporal features model.

Figure 6 illustrates that, among the various time-phase ensemble features with high importance order, the spectral index features, especially the soil reflectance index and vegetation index, carry more significance. Spectral indexes derived from the combination of reflectance values across multiple bands can effectively enhance the correlation between black soil characteristics and sensitive band features. The bands comprising the soil reflectance index exhibit heightened sensitivity to the properties of black soil, which makes the soil reflectance index a significant contributor to enhance the model performance in mapping black-soil resource.

We further compared the performance of the MEF-RF model with that of the MBF-RF model. The MEF-RF model performed better (Table 7). After feature filtering, 33 selected features were used in the MBF-RF model, achieving an OA of 90.4% (Supplementary Fig. 3 records the relationship between the MBF-RF model performance and the number of features). The confusion matrices generated using the multi-temporal feature models (Fig. 7) showed a significant reduction in misinterpretation compared to the results based on the mono-temporal models. There was no significant overall difference between the confusion matrices of the MEF-RF model and the MBF-RF model. However, there was a reduction in the misinterpretation of ‘black soil’ as ‘other soil’ in the MEF-RF model compared to the MBF-RF model. The F-1 score of the MEF-RF model also increased compared to the MBF-RF model (Table 7).

In terms of the importance of multi-temporal band features, those from May had a high percentage. It’s due to the fact that spring plowing of cropland in the Northeast was mostly complete in May, cropland straw cover was low, and the high soil moisture period from snowmelt had ended. Although the importance of vegetation red-edge band ranked low in mono-temporal band features, its contribution in July was higher due to the high vegetation cover, improving the model performance (Fig. 6).

Effects of geographic covariate data on model performance

We evaluated the model performance by adding geographic covariates based on the best black-soil resource discriminating model. The model performance varied with different geographic covariates, indicating their differing roles in black-soil discrimination. Including precipitation and slope covariates decreased model accuracy, indicating these variables do not significantly contribute to black-soil discrimination and introduce data redundancy. The addition of temperature covariate to the MEF-RF model resulted in a slight improvement in model performance, with OA increasing from 94.62 to 95.71% (Table 7). However, the F1 scores of ‘black soil’ in the model decreased slightly after adding the covariates. This illustrates that the model with geographic covariates indirectly improved the performance in mapping black-soil resources by improving the ability to identify other land cover classes.

Table 7 F-1 score on models.

To further discuss the effect of geographic covariate data on model performance, we obtained the importance of band features after adding geographic covariates. The result in Fig. 8 shows that temperature covariate is of high importance in black-soil discriminations.

Model robustness evaluation

We evaluated model performance with Landsat-8 data to validate the robustness and transferability of the MEF-RF model for discriminating black-soil resource. Comparing the models built on Landsat-8 and Sentinel-2 data, the accuracy of the MEF-RF model based on Landsat-8 data was slightly lower than that of the MEF-RF model based on Sentinel-2 data, with an OA of 90.01% (Table 7). This is mainly due to the difference in spatial resolution between the two data sources. Nevertheless, the MEF-RF model based on Landsat-8 data performed better than the other models in this study. The confusion matrix generated using the MEF-RF model with Landsat-8 data also showed a low misclassification rate (Fig. 9). The model developed in this study therefore still performs well at coarser spatial resolution, which suggests that the MEF-RF model is more widely applicable to black-soil discrimination.

According to the black soil distribution results in Fig. 10, the 30 m resolution map of black soil distribution obtained from Landsat-8 is prone to losing detail compared with the 10 m resolution map from Sentinel-2. For example, buildings and vegetation within soil distribution areas tend to be under-detected, and black soils within large areas of other soil types are easily misclassified as other soil types. However, the extent of black soil identified from the two data sources is basically the same, which supports the robustness of the model developed in this study.
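The loss of detail at 30 m can be illustrated with a simplified sketch: one 30 m pixel covers a 3 × 3 block of 10 m pixels, so small features inside a block are easily absorbed by the dominant class. The code below uses majority aggregation of a hypothetical label grid as a stand-in for this effect. This is an assumption made for illustration, since a real sensor mixes spectra within a pixel rather than voting on labels, and it is not the procedure used in the study.

```python
from collections import Counter


def aggregate_majority(grid, factor=3):
    """Downsample a 2-D label grid by taking the most common label per block."""
    rows, cols = len(grid), len(grid[0])
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [
                grid[i][j]
                for i in range(r, min(r + factor, rows))
                for j in range(c, min(c + factor, cols))
            ]
            coarse_row.append(Counter(block).most_common(1)[0][0])
        coarse.append(coarse_row)
    return coarse


# Hypothetical 6 x 6 map at 10 m: B = black soil, O = other soil, X = building.
fine = [
    list("BBBOOO"),
    list("BXBOOO"),
    list("BBBOBO"),
    list("BBBOOO"),
    list("BBBOOO"),
    list("BBXOOO"),
]

coarse = aggregate_majority(fine)  # 2 x 2 map at 30 m
for row in coarse:
    print("".join(row))
```

In the coarse grid the two building pixels and the isolated black-soil pixel inside the other-soil area disappear, while the overall extent of black soil is unchanged, which is consistent with the pattern described for Fig. 10.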

Discussion

The multi-temporal ensemble features extracted in this paper could remove invalid information from the images. The common features formed by compressing and reducing the dimensionality of multiple remote sensing images could effectively reveal the surface information of black-soil resources. However, this study gave limited consideration to alternative features. Spectral indices should be constructed by selecting sensitive bands based on the spectral characteristics of black-soil resources. Furthermore, since the accuracy of black-soil mapping is determined by the selection of classification features, it remains to be verified whether the classification features selected in this study can be applied successfully in other study areas.

Model performance differed with the geographic covariates used. Although the temperature covariate improved model performance in this paper, the relationship between the covariates and the model in mapping black-soil resource is complex and requires further investigation. Finally, this study assumed that higher spatial resolution imagery yields higher mapping accuracy. Applying the model to Landsat-8 data still gave good performance and confirmed its robustness. However, the model has not yet been applied to satellites with higher spatial resolution. These aspects need further attention and exploration.

Conclusions

This study proposed a comprehensive approach based on multispectral features and a machine learning algorithm, the multi-temporal ensemble features model (MEF-RF), for delineating the spatial extent of black-soil resources. The results showed that ensemble features from multi-temporal images provide complementary information that improves the accuracy of black-soil discrimination. In addition, the effects of adding geographic covariates to the MEF-RF model were evaluated. The results indicate that adding the temperature covariate is another key means of improving model performance in mapping black-soil resource in the study area. Finally, we assumed that higher spatial resolution imagery gives better black-soil mapping results. Applied to images of slightly lower spatial resolution, the MEF-RF model still performed well. The results indicate that the model developed in this study is generalizable to commonly used remote sensing images, which could be helpful for the management of black-soil resources.

Jianfang Hu conceived and designed the study and drafted the manuscript. Yulei Tang and Jianfang Hu carried out the experiments. All authors revised and edited the manuscript and approved the final version of the manuscript.

Data availability

All data relevant to the study are included in the article or uploaded as supplementary information. In addition, the datasets used or analyzed during the current study are available from the corresponding author on reasonable request.

Acknowledgements

This study was supported by the Program of China Geological Survey (DD 20243191, DD20220862), the National Key Research and Development Program (2023YFD1500104), and the Science and Technology Innovation Foundation of Command Center of Integrated Natural Resources Survey Center (KC20220005).

Author information

Authors and Affiliations

Center for Geophysical Survey, China Geological Survey, Langfang, 065000, China
Jianfang Hu, Yulei Tang, Jiapan Yan & Zhansheng Chen
Technology Innovation Center for Earth Near Surface Detection, China Geological Survey, Langfang, 065000, China
Jianfang Hu, Yulei Tang, Jiapan Yan & Zhansheng Chen
China Aero Geophysical Survey and Remote Sensing Center for Natural Resources, Beijing, 100083, China
Jiahong Zhang
College of Resources, Environment and Tourism, Capital Normal University, Beijing, 100048, China
Yuxin Zhao

Contributions

Hu, J. F. and Tang, Y. L. conceived and designed the study. Hu, J. F. wrote the main manuscript text and carried out the experiments. Tang, Y. L., Zhang, J. H., and Zhao, Y. X. substantively revised the manuscript. All authors revised and edited the manuscript and approved the final version of the manuscript.

Corresponding author

Correspondence to Yulei Tang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Hu, J., Tang, Y., Yan, J. et al. Improving model performance in mapping black-soil resource with machine learning methods and multispectral features. Sci Rep 15, 1199 (2025). https://doi.org/10.1038/s41598-024-82399-3

Received: 26 March 2024
Accepted: 05 December 2024
Published: 07 January 2025
Version of record: 07 January 2025
DOI: https://doi.org/10.1038/s41598-024-82399-3
