Background & Summary

Rapid industrialization and economic development in China over the past decades have generated a colossal amount of solid waste and posed heavy risks to the environment1. At present, waste streams from agricultural, construction, industrial, and post-use sources may together exceed 10 gigatonnes per annum (Gt/a)2,3,4,5. Among them, industrial solid waste (ISW) in China accounts for some 4 Gt/a, which is equivalent to more than half of the world's coal production or nearly double its iron ore production5,6,7.

According to an updated solid waste code, there are 17 main categories and hundreds of subcategories of ISW in China (excluding hazardous industrial waste). The top six categories are metallurgical slags, fly ash, furnace slags, coal gangue, tailings, and desulfurization gypsum. ISW can be utilized as secondary materials, landfilled, incinerated, or temporarily stockpiled for future treatment. Although the 13th Five Year Plan of China (2016–2020) set the goal of a 73% comprehensive utilization rate, only 57.7% of ISW was reused by 2022 (2022 China’s Ecological Environment Status Bulletin). The cumulative ISW stocks awaiting environmentally safe treatment have reached some 60–70 Gt8, leading to a variety of environmental degradation across the country, from soil erosion and groundwater contamination to habitat destruction and biodiversity loss.

Despite the urgent challenge, a complete spatio-temporal dataset of city-level ISW generation in China has not been established, which prevents a comprehensive understanding and further investigation of waste management and recycling. A well-functioning environmental monitoring and data collection system can statistically track and report industrial waste factory by factory and city by city. However, this takes time and requires expensive investment in environmental infrastructure. Until high-resolution industrial waste information is fully recorded, machine learning can provide a powerful tool for estimating regional and sub-provincial industrial waste generation patterns in China and filling the gaps.

Machine learning has been increasingly applied in environmental management because of its cost-efficiency, predictive accuracy, and robustness. It is already widely used in making predictions, extracting features, detecting anomalies, and discovering new materials or chemicals9. Machine learning has also been employed in a variety of solid waste studies10,11,12. Lin et al.13 conducted a critical review of the application of deep learning in solid waste management, and found that the method had been very effective in predicting waste quantity and composition. The amount of waste can be estimated with one machine learning method or with several combined. Kannangara et al.14 employed a neural network and a decision tree to model and estimate regional waste generation and diversion in Canada. Ma et al.15 also used an artificial neural network to investigate the composition of municipal solid waste (MSW). Zhang et al.16 predicted MSW generation with five machine learning approaches.

The present research aims to establish a city-level ISW dataset for China from 1990 to 2022 by employing a non-interpolation-based machine learning approach. It covers industrial waste from all 337 administrative divisions at the prefectural level and above in China, including 293 prefecture-level cities, 7 prefecture-level regions, 30 autonomous prefectures, 3 prefecture-level leagues, and 4 municipalities (as seen in the dataset17). For simplicity, we use “city” to refer to these administrative divisions in the text below. In addition to the aggregate ISW flow, we also attempt to develop a city-level inventory of six major ISW subcategories for the year 2022, including metallurgical slags, fly ash, furnace slags, coal gangue, tailings, and desulfurization gypsum.

ISW is one of the largest and most widespread solid waste flows, creating environmental and ecosystem risks in many regions. Regional and meso-scale ISW generation, however, often fluctuates owing to local industrial changes and irregular operational factors. Such dynamics can hardly be captured by conventional spatial interpolation methods. Therefore, we have developed a data-driven, non-interpolation machine learning framework to enhance predictive accuracy and spatial resolution. To the best of our knowledge, this work accomplishes a complete and consistent estimation of ISW in Chinese cities. The methodology and practice offered by the present work may also help to inform waste generation patterns in the future and in other areas.

Methods

We first collect ISW data from available waste statistics, then set up six machine learning models and use Bayesian optimization to find the best model, and finally fill in the missing data. The methodology is portrayed in Fig. 1.

Fig. 1 Methods of ISW data completion.

Collection of ISW data

The ISW data for 337 cities across China are collected from more than 500 sources, including national, provincial, and city statistical yearbooks and waste statistical bulletins, as listed in the dataset17. These statistical data make up the original ISW dataset, in which 33% of the values are missing and need to be filled. Most of the missing values occur in less developed areas or in early years, and thus they are less significant in the dataset (as seen in Fig. 2).

Fig. 2 Heat map of ISW in the 30 top-generating cities in China.

Machine learning models

When machine learning is used to estimate absent data in an ISW inventory, the selection of appropriate models is critical to ensure reasonable predictions. We employ six models chosen for their individual advantages and complementary effects. KNeighborsRegressor (K-Nearest Neighbour, KNN) leverages the proximity of neighbouring data points to predict missing values. It is particularly adept at handling datasets characterized by intricate relationships and non-linear patterns18. KNN is one of the most widely used models for estimating solid waste generation because of its flexibility in accommodating different data types and its ability to fill missing data19,20. LGBMRegressor (Light Gradient Boosting Machine, LGBM) is a gradient boosting framework designed for efficient and accurate computation. Notably, it is prominent in managing sizable datasets with minimal risk of overfitting21, as well as in identifying complex connections and capturing the significance of features22. RandomForestRegressor (Random Forest, RF) employs an ensemble learning approach, constructing multiple decision trees and amalgamating their predictions. It is resilient, and can adeptly manage non-linear relationships while mitigating the risk of overfitting23. RF can predict waste generation even with limited data24. MLPRegressor (Multilayer Perceptron, MLP) is a form of artificial neural network esteemed for its proficiency in discerning complex data patterns, making it well-suited for datasets characterized by non-linear relationships and large data volumes25. The Extreme Gradient Boosting (XGB) regressor is another gradient boosting approach, offering great efficiency, accuracy, and resilience against outliers in regression tasks26. The Decision Tree (DT) is a straightforward yet potent modelling technique that divides data based on feature values. It is known for interpretability, versatility in handling both numerical and categorical data, and efficacy in capturing non-linear relationships27. A mix of different models has been applied to estimate solid waste generation or its features in a number of studies28,29.
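The neighbour-based idea behind KNN can be illustrated with a minimal sketch in plain Python. This is a didactic stand-in, not the KNeighborsRegressor implementation used in this work; the function name knn_predict, the two features (year and a hypothetical industrial output index), and all values are assumptions made only for illustration.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict a target as the mean of the k nearest training targets."""
    # Euclidean distance from the query row to every training row.
    ranked = sorted(
        (math.dist(row, query), target) for row, target in zip(train_X, train_y)
    )
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical feature rows: (year, industrial output index).
train_X = [(2000, 1.0), (2001, 1.2), (2002, 1.4), (2010, 3.0)]
# Hypothetical ISW generation values for those rows.
train_y = [10.0, 12.0, 14.0, 30.0]

# Estimate a value for a city-year whose features resemble the early 2000s rows.
estimate = knn_predict(train_X, train_y, query=(2001, 1.2), k=3)
```

In practice the features would normally be rescaled before distances are computed, since otherwise the feature with the largest numeric range dominates the neighbour search.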

Bayesian optimization

To enhance the performance and resilience of the predictive models, Bayesian optimization is integrated with the selected models. It tunes the hyperparameters that hold significant influence over model performance, such as learning rate, number of estimators, and tree depth, while exploring the search space efficiently and withstanding noise interference. Bayesian optimization systematically explores the hyperparameter space to pinpoint the optimal configuration, consequently refining model accuracy and generalization. By employing probabilistic models, it guides the search for optimal hyperparameters, balancing the exploration of new regions in the hyperparameter space against the exploitation of promising areas, thereby ensuring efficient and effective optimization. This strategy of exploration and exploitation is pivotal, especially when dealing with intricate models and datasets containing incomplete information30. Moreover, Bayesian optimization is robust to noise in the evaluation of objective functions. When machine learning models are trained on incomplete data, missing values can introduce disruptive noise into the optimization process. By modelling the uncertainty associated with the objective function, Bayesian optimization mitigates the impact of noise, yielding more dependable hyperparameter estimates31.

Moreover, Bayesian optimization can automate the process of hyperparameter tuning, reducing the burden on data scientists and practitioners32. By iteratively selecting hyperparameters based on past evaluations, Bayesian optimization efficiently converges to the optimal solution. This automation accelerates the model development pipeline and enables faster experimentation with different machine learning algorithms and configurations.
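The exploration and exploitation loop can be sketched as follows. The text does not state which surrogate model, acquisition function, or software was used, so everything here is an assumption: a Gaussian-process surrogate with a squared-exponential kernel, the expected-improvement acquisition function, a single hyperparameter rescaled to [0, 1], and a hypothetical validation_error function standing in for the cross-validated error of a real model.

```python
import math
import random


def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and standard deviation of the surrogate at x_new."""
    y_mean = sum(ys) / len(ys)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    alpha = solve(K, [y - y_mean for y in ys])
    v = solve(K, k)
    mean = y_mean + sum(ki * ai for ki, ai in zip(k, alpha))
    var = max(1.0 - sum(ki * vi for ki, vi in zip(k, v)), 1e-12)
    return mean, math.sqrt(var)


def expected_improvement(mean, std, best):
    """Expected improvement over the best observed value (minimisation)."""
    z = (best - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mean) * cdf + std * pdf


def bayes_opt(objective, n_init=3, n_iter=8, seed=0):
    """Evaluate random starting points, then points that maximise EI."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_init)]
    ys = [objective(x) for x in xs]
    grid = [i / 100 for i in range(101)]
    for _ in range(n_iter):
        best = min(ys)
        x_next = max(
            grid,
            key=lambda x: expected_improvement(*gp_posterior(xs, ys, x), best),
        )
        xs.append(x_next)
        ys.append(objective(x_next))
    return xs, ys


def validation_error(x):
    """Hypothetical stand-in for a model's validation error at setting x."""
    return (x - 0.3) ** 2


xs, ys = bayes_opt(validation_error)
best_setting = xs[ys.index(min(ys))]
```

Each iteration refits the surrogate to all past evaluations and picks the next setting where expected improvement is largest, which is high either where the predicted error is low (exploitation) or where the surrogate is uncertain (exploration).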

Data Records

“Industrial solid waste dataset in China 1990–2022” can be accessed on Figshare17. It includes the following data sheets:

  • List of prefectural cities, showing 333 prefectural level administrative divisions across 27 provincial level divisions in mainland China. It includes an inventory of 293 prefecture-level cities, 7 prefecture-level regions, 30 autonomous prefectures, and 3 prefecture-level leagues.

  • ISW at city level (original), presenting the original data of ISW generation in 333 prefectural divisions and 4 municipalities in China in the period 1990–2022. The proportion of missing data is 33%.

  • ISW at city level (computed), presenting the data that are computationally developed for 333 prefectural divisions and 4 municipalities in China in the period 1990–2022.

  • ISW at city level (complete), presenting the complete data of ISW generation in 333 prefectural divisions and 4 municipalities in China in the period 1990–2022. The missing values in the original dataset have been filled with the best estimates of the machine learning models.

  • Six major ISW at city level, presenting the original data of the generation of six subcategories of ISW in 173 prefectural divisions and 3 municipalities in China in 2022, including metallurgical slags, fly ash, furnace slags, coal gangue, tailings, and desulfurization gypsum.

  • Data sources, listing the sources of the original data in the sheets of “ISW at city level (original)” and “Six major ISW at city level”.
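The relationship between the original, computed, and complete sheets can be sketched as a simple fill operation. The snippet below is illustrative only: the nested-dictionary layout, the function names, the city labels, and all values are hypothetical and do not reflect the actual file structure, and it assumes the computed sheet holds an estimate for every cell that is missing in the original sheet.

```python
def missing_share(table):
    """Fraction of city-year cells that hold no reported value (None)."""
    cells = [value for row in table.values() for value in row.values()]
    return sum(value is None for value in cells) / len(cells)


def complete_table(original, computed):
    """Keep reported values and fill each gap with the model estimate."""
    return {
        city: {
            year: value if value is not None else computed[city][year]
            for year, value in row.items()
        }
        for city, row in original.items()
    }


# Hypothetical toy values (not taken from the dataset).
original = {
    "City A": {1990: None, 1991: 5.0, 1992: 6.0},
    "City B": {1990: 2.0, 1991: None, 1992: 3.0},
}
computed = {
    "City A": {1990: 4.5},
    "City B": {1991: 2.4},
}

share_before = missing_share(original)
complete = complete_table(original, computed)
share_after = missing_share(complete)
```

Reported values are never overwritten in this sketch; only cells without a reported value take the model estimate.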

Based on the complete ISW dataset, Fig. 3 shows the growth of ISW in all Chinese cities in 2022 relative to 1990. Cities in resource-rich or economically advanced areas showed the largest increases in ISW. This suggests that ISW generation was largely driven by industrial development coupled with rapid urbanization. From the 1990s, industrialization surged as the government encouraged investment in manufacturing and infrastructure, leading to the emergence of bustling urban centers and the expansion of industrial zones across the country. Concurrently, urbanization accelerated as rural residents migrated to cities in search of employment opportunities. Since the beginning of reform and opening up in the late 1970s, the industrial value added in China had increased nearly 200-fold33 by 2022, and the urbanization rate had climbed from 17.92% to 65.2%34. ISW generation, by contrast, grew by a factor of 8.4 between 1980 and 2022.

Fig. 3 Increase of ISW in 1990–2022 in cities of China.

Specifically, among all ISW flows, metallurgical slags, fly ash, furnace slags, coal gangue, tailings, and desulfurization gypsum are the major subcategories in China, coded as SW01 to SW06, respectively, under the recently updated classification of the Ministry of Ecology and Environment of China35. They are produced alongside industrial activities including mining, metallurgy, and thermal power generation. Owing to data limitations, a time-series dataset for the subcategories of ISW cannot be provided so far. We compiled the generation of six ISW subcategories in major Chinese cities in 2022, as illustrated in Fig. 4 and presented in the dataset17.

Fig. 4 Generation of ISW in cities of China in 2022. The grey area indicates total ISW generation, and the colored columns represent the six subcategories of ISW.

Technical Validation

Model evaluation

To assess the effectiveness of these models, two metrics were utilized: the coefficient of determination (R²) and the mean squared error (MSE). These metrics offer insights into the precision and predictive capability of the models. R² quantifies the share of the variance in the actual values that is explained by the predictions. Its maximum is 1, which signifies a perfect fit, and it becomes negative when a model performs worse than simply predicting the mean of the actual values. MSE gauges the average squared difference between predicted and actual values, with lower values indicating superior performance. R² and MSE are calculated with the following equations, respectively.

$${R}^{2}=1-\frac{{\sum }_{i=1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}}{{\sum }_{i=1}^{n}{({y}_{i}-\bar{y})}^{2}}$$
(1-1)
$$MSE=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}$$
(1-2)

where \({y}_{i}\) and \({\hat{y}}_{i}\) represent the actual and predicted values of ISW for observation \(i\), \(\bar{y}\) is the mean of the actual values, and \(n\) is the number of observations. The values of R² and MSE for each model are shown in Table 1.
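Equations (1-1) and (1-2) translate directly into code. The following is a minimal plain-Python sketch; the function names and the toy numbers are hypothetical and serve only to show how the two metrics behave, including the case in which R² turns negative.

```python
def mse(actual, predicted):
    """Mean squared error, Eq. (1-2)."""
    n = len(actual)
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n


def r2(actual, predicted):
    """Coefficient of determination, Eq. (1-1)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


# Hypothetical toy values.
actual = [1.0, 2.0, 3.0, 4.0]

perfect = r2(actual, [1.0, 2.0, 3.0, 4.0])        # predictions match exactly
mean_only = r2(actual, [2.5, 2.5, 2.5, 2.5])      # always predict the mean
reversed_fit = r2(actual, [4.0, 3.0, 2.0, 1.0])   # worse than the mean
reversed_mse = mse(actual, [4.0, 3.0, 2.0, 1.0])
```

Predicting the mean everywhere gives R² = 0, and a model that does worse than that gives a negative R², which is why negative scores can appear for cities with sparse or highly variable data.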

Table 1 Prediction performance of test data.

To further validate the data quality, we compared predicted data with actual data for all the cities in 2022. Figure 5 shows the scatter plot of predicted and actual values of the testing set, which indicates the prediction performance of the different models. The bold cells in Table 1 mark the model applied to each province or city, namely the one with the best performance (R² closest to 1 and the smallest MSE among all models). Finally, the missing ISW values are filled by the best prediction model for each city.
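The selection rule described above, choosing for each region the model with the highest R² and the lowest MSE, can be sketched as follows. The function name, region labels, and scores are hypothetical and are not the values reported in Table 1.

```python
def pick_best_models(scores_by_region):
    """Map each region to the model with the highest R² (ties: lowest MSE).

    scores_by_region: {region: {model_name: (r2, mse)}}
    """
    return {
        region: max(scores, key=lambda name: (scores[name][0], -scores[name][1]))
        for region, scores in scores_by_region.items()
    }


# Hypothetical test-set scores as (R², MSE) pairs.
scores_by_region = {
    "Province A": {"KNN": (0.91, 12.0), "RF": (0.95, 7.5), "XGB": (0.88, 15.1)},
    "Province B": {"KNN": (0.40, 3.2), "RF": (-0.10, 5.9), "XGB": (0.35, 3.5)},
}

best = pick_best_models(scores_by_region)
```

When all models are scored on the same test set, ranking by R² and ranking by MSE give the same order, so the MSE term here only serves as a tie-breaker.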

Fig. 5 Predicted values and actual values of the testing set of machine learning models. Dots in the same colour represent cities from the same province.

Limitations and perspectives

This study attempts to complete an ISW dataset across 337 cities in China from 1990 to 2022, yet certain limitations remain. The quality and consistency of the input data, retrieved primarily from statistical yearbooks and waste bulletins and reports, may affect the predictive accuracy because of regional and temporal inconsistencies. Especially in small cities, even the opening or closure of a single coal mine or steel plant may substantially change ISW generation. Such business changes are largely random and are therefore hard for machine learning models to predict. Additionally, the feature selection process does not fully account for complex socio-economic factors such as policy shifts, business movement, and technological advancement, all of which could significantly change ISW patterns. The models also demonstrate limited adaptability in regions with sparse or highly variable data, as evidenced by lower or even negative R² scores in some cities and provinces. Furthermore, the study predominantly focuses on static predictions based on historical data, offering limited insights into the dynamic evolution of waste in the future, including the effects of increasingly rigorous environmental regulations and the adoption of innovative green technologies.

Despite these constraints, the findings lay a solid foundation for enhancing waste management research. Future studies could integrate real-time monitoring data obtained from IoT and remote sensing to improve data coverage and accuracy. Incorporating advanced deep learning frameworks, such as graph neural networks or transformer architectures, could improve the models’ capability to handle spatial and temporal complexities. Scenario-based dynamic modelling, combined with interdisciplinary approaches, could provide a deeper understanding of ISW generation trends and their environmental implications, enabling policymakers and industrial stakeholders to develop more effective waste management strategies and promote sustainable and low-carbon development pathways.

This dataset of ISW generation across prefectural divisions and municipalities in China holds potential for diverse applications. It can constitute a foundation to estimate future waste streams, identify regional disparities, and evaluate waste reduction programs. Furthermore, the dataset can support the development of predictive models for solid waste under various economic and regulatory scenarios. Policymakers and environmental agencies may also leverage these data to design targeted interventions, optimize resource recycling, and promote circular economy initiatives. Future research may integrate this dataset with socioeconomic, environmental, and industrial structure indicators and yield more informative insights into waste management and recycling in specific regions.