Abstract
Research on China’s environmental problems has grown rapidly over the past decade. However, the existing environmental statistics fall far short of researchers’ requirements. The biggest problem is that official environmental statistics are provided only at either the regional level or the sectoral level. Given the large heterogeneity across regions and sectors, researchers are unable to conduct comprehensive policy evaluations. In this study, we construct a time-series industrial environmental database for China (CIED) at both the regional and sectoral levels. The database covers 31 regions and four types of pollutants: chemical oxygen demand (COD), sulphur dioxide (SO2), ammonia-nitrogen (NH3-N), and nitrogen oxide (NOX). This study also clarifies several important concepts to help researchers better understand China’s official environmental statistics.
Measurement(s) | environmental pollution |
Technology Type(s) | computational modeling technique |
Sample Characteristic - Location | China |
Background & Summary
China’s environmental problems are among the most important issues accompanying its rapid industrialization. Although the situation has improved greatly in the past decade, the nation as a whole, and certain regions in particular, still face severe environmental problems such as air and water pollution. Since the 2000s, a growing body of evidence-based literature has focused on China’s environmental problems. Topics in this literature include discussions of China’s environmental Kuznets curve1,2,3,4,5, analyses of environmental protection performance6,7,8,9 and evaluations of environmental policies10,11. Most of these studies use either regional-level or sector-level environmental data retrieved from official statistics, and therefore fail to take regional and sectoral heterogeneity into account simultaneously. The main reason is that official statistics such as the China Statistical Yearbook on Environment report only provincially aggregated and sectorally aggregated pollution data. One way to partly overcome this problem is to use micro-level data such as firm-level or plant-level data, as in several recent studies12,13,14,15. These studies provide insights into firms’ polluting behaviours and offer new evidence on regional and sectoral variations in China.
There are two main drawbacks to using micro-level data, especially firm-level data, to study environmental issues. Firstly, a certain number of observations are dropped when constructing a balanced panel dataset, which plays an important role in most empirical studies. Sample representativeness is further weakened if researchers match different firm-level datasets in order to obtain more variables. As a result, other important topics such as the structure effect, entry effect and exit effect cannot be investigated at the macro level. Secondly, many important influencing factors are available and applicable only at the regional or sectoral level, whereas micro-level data contain only firm-specific characteristics. Yet most current research interests and policy needs are at the macro level, so firm-level analysis may be unable to deliver the desired results and policy implications. For these two reasons, estimating environmental data at both the regional and sectoral levels contributes to the existing empirical work. For example, two-dimensional data have great potential to support environmental policy evaluation through treatment effect analyses such as the difference-in-differences (DID) model16,17,18 and the synthetic control method (SCM)4,19,20, which have gained much popularity in this field in the past decade. A second application is environmental efficiency analysis, such as data envelopment analysis (DEA) and stochastic frontier analysis (SFA), in which pollutants are viewed as undesirable outputs21,22,23,24. Furthermore, two-dimensional environmental data can be used to construct environmentally extended multi-regional input-output (EE-MRIO) tables25,26,27 and to provide more accurate parameter estimates28, which are extremely useful in complex economic system analyses such as computable general equilibrium (CGE) modelling.
The China Industrial Environmental Database (CIED) covers 31 regions (excluding Hong Kong, Macao, and Taiwan) and 39 industrial sectors. For each sector in each region, we provide environmental data for four types of pollutants: chemical oxygen demand (COD), sulfur dioxide (SO2), ammonia-nitrogen (NH3-N), and nitrogen oxide (NOX). The database contains discharge and removal data for all four pollutants. Owing to data availability, data for COD and SO2 are available from 1998 to 2015, data for NH3-N from 2001 to 2015 and data for NOX from 2006 to 2015. The time period covers the last three years of the Ninth Five-Year Plan and the three complete Five-Year Plan periods that followed. Since Five-Year Plans play an important role in policy-making, the database can be used to analyze a wide range of institutional and policy adjustment issues during this period.
Methods
In this study, we combine a bottom-up method and a top-down method to construct the two-dimensional environmental database. In the first step, the bottom-up method uses firm-level environmental data to calculate pollutants’ discharge and removal amounts by region and by sector. In the second step, the top-down method collects the total amounts of pollutants discharged and removed for each region or for each sector. In the final step, the cross-entropy method is used to balance the two-dimensional matrix in each year.
Aggregation of micro-level environmental data
Micro-level environmental data are retrieved from China’s Environmental Statistics Database (CESD). The CESD is a micro-level database compiled by China’s Ministry of Environmental Protection (MEP) and is the data basis for China’s official environmental statistics publications, such as the China Statistical Yearbook on Environment, the China Environment Yearbook and the Annual Statistic Report on Environment in China. This study uses the industrial firm-level data from the CESD, which we call China’s Industrial Environmental Statistics Database (CIESD). The CIESD is a newly released database available to researchers that covers all major industrial emission sources in China. It has already been used and explained in detail in several recent studies14,15,29,30.
Table 1 reports the number of firms covered in our database by region and year. Some caution is needed when using the database. Across all regions, 20 cells in Table 1 are zero or close to zero, which means that these observations are completely or almost completely missing. The overall data coverage rates in 1998 and 1999 are 75.38% and 91.97% respectively, which are imperfect but acceptable for this study. In 2006, a total of 1784 firms cannot be assigned to any region using the 6-digit administrative division code. After checking these observations manually, we find that they are all duplicated observations for Anhui province. After comparing the data of these duplicates, we keep one unique observation with useful information. For the period from 2006 to 2010, environmental statistics for thermal power plants were collected in a separate database, which is not included in ours. The total number of thermal power plants is also reported in Table 1. Therefore, if we subtract the number of thermal power plants for 2006 to 2010, the overall data coverage rates from 2000 to 2014 are close to 100%, which indicates good sample representativeness.
Before aggregating the micro-level environmental data, we examined the whole dataset in order to revise abnormal records and improve data quality. Abnormal records typically contain two types of error: missing data and measurement error. Missing data lead to underestimation of pollutants’ values, whereas measurement error can lead to either underestimation or overestimation. We manually checked each firm’s time series to correct both issues. For missing data, if a water-polluting firm has both COD and NH3-N discharge data in all other years but lacks NH3-N discharge data in one specific year, the missing NH3-N value is identified and filled by interpolation. For measurement error, firms may report incorrect data by using the wrong unit. For example, if a firm reports data in grams instead of the kilograms required by the regulations, the value recorded in the database will be 1,000 times higher than the true value. Such a value is identified as abnormal and revised to a reasonable level.
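The two checks described above can be illustrated with a minimal Python sketch. It is not the manual procedure used in this study: the function names, the midpoint interpolation rule, the comparison against the series median and the `tolerance` threshold are all hypothetical choices made for illustration only.

```python
from statistics import median


def fill_single_gap(series):
    """Fill an isolated missing value (None) with the midpoint of its two neighbours."""
    filled = list(series)
    for t in range(1, len(filled) - 1):
        if filled[t] is None and filled[t - 1] is not None and filled[t + 1] is not None:
            filled[t] = (filled[t - 1] + filled[t + 1]) / 2
    return filled


def fix_unit_errors(series, factor=1000.0, tolerance=0.5):
    """Rescale values that sit roughly `factor` times above the firm's typical level.

    Assumption: a value reported in grams instead of kilograms shows up as
    about 1,000 times the median of the firm's own time series.
    """
    known = [v for v in series if v is not None]
    typical = median(known)
    fixed = []
    for v in series:
        if v is not None and typical > 0 and abs(v / typical / factor - 1) <= tolerance:
            fixed.append(v / factor)  # convert the suspected gram value back to kg
        else:
            fixed.append(v)
    return fixed


# Hypothetical firm-level series (kg per year)
nh3n = fill_single_gap([12.0, 13.0, None, 15.0, 16.0])
cod = fix_unit_errors([100.0, 110.0, 105000.0, 95.0])
```

Gaps longer than one year are deliberately left untouched in this sketch, since a midpoint rule says little about them.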
The raw two-dimensional environmental data can be calculated from the revised CIESD as follows:
\[{\bar{X}}_{r,j}^{c}=\sum _{i}{x}_{r,j,i}^{c}\qquad (1)\]
where \({x}_{r,j,i}^{c}\) is revised environmental data of firm i of sector j in region r. \({\bar{X}}_{r,j}^{c}\) represents raw macro-level environmental data of sector j in region r. Superscript c represents type of indicator, i.e. discharge or removal. We estimate the missing \({\bar{X}}_{r,j}^{c}\) by extrapolating the ratio of each sector among all regions, and the ratio is defined as follows:
Since micro-level environmental data for 2015 are not available, we use the 2014 structure ratios as the prior structure information for 2015.
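The bottom-up aggregation step, in which firm-level values are summed within each region and sector, can be sketched in a few lines of Python. The record layout `(region, sector, value)` and the function names are assumptions for illustration; the actual CIESD field names are not given here.

```python
from collections import defaultdict


def aggregate_firms(records):
    """Sum firm-level values into a {(region, sector): total} mapping."""
    totals = defaultdict(float)
    for region, sector, value in records:
        totals[(region, sector)] += value
    return dict(totals)


def to_matrix(totals, regions, sectors):
    """Arrange the totals as a region-by-sector matrix, with 0.0 for empty cells."""
    return [[totals.get((r, j), 0.0) for j in sectors] for r in regions]


# Hypothetical firm records: (region, coordinated sector number, discharge in kg)
records = [
    ("Beijing", 1, 10.0),
    ("Beijing", 1, 5.0),
    ("Beijing", 2, 3.0),
    ("Anhui", 1, 7.5),
]
raw = to_matrix(aggregate_firms(records), ["Beijing", "Anhui"], [1, 2])
```

Filling empty cells with 0.0 is a simplification: in the database, missing cells are estimated from sector ratios instead.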
Collection of macro-level environmental data
In this study, we collect macro-level environmental data from the China Statistical Yearbook on Environment for the years 2001 to 2015 and from the China Environment Yearbook for the years 1998 to 2000. Environmental data in the yearbooks are compiled from the CESD, which is collected and processed according to the environmental statistical system (ESS). The ESS was adjusted at the beginning of each Five-Year Plan period to better support environmental statistical work and to improve the quality of environmental statistical data. Within the time period of our database, the ESS has been adjusted four times. These adjustments raise three major challenges for constructing the CIED.
The first challenge is that not all pollutants are included in the CESD from 1998 to 2015. For example, NH3-N has been reported since the Tenth Five-Year Plan (starting in 2001) and NOX since the Eleventh Five-Year Plan (starting in 2006). Owing to data availability, environmental data for these two pollutants are covered from the aforementioned periods onward. Table 2 reports whether each pollutant is reported and whether it is listed as a major pollutant in each of the four Five-Year Plan periods.
The second challenge is that pollutant indicators are not consistent over the whole data period. According to the ESS, industrial emission sources are divided into major sources and non-major sources. Only major sources are covered by the CESD, and they are expected to account for at least 85% of total annual discharges of major pollutants. In the environmental statistics compiled from 1998 to 2010, regional removal data, sectoral discharge data and sectoral removal data are simply sums over major sources. However, regional discharge data are adjusted to include non-major sources’ discharges, which are estimated. As a result, the national data aggregated from regional data are typically larger than those aggregated from sectoral data. In addition, removal data are no longer reported under the new ESS from the beginning of the Twelfth Five-Year Plan. Instead, the amounts of pollutants produced are reported in the CESD as well as in the yearbooks. Both the new production data and the discharge data are adjusted to include non-major sources. Table 3 compares regional aggregation with sectoral aggregation. For the period from 2011 to 2015, we calculate pollutants’ removal data by subtracting discharge data from production data. The inconsistent treatment of non-major sources in regional and sectoral data is balanced using the cross-entropy method in this study.
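For 2011 to 2015, the removal figures are derived rather than reported. A minimal Python sketch of that subtraction follows; the dictionary layout, the sample cells and the decision to raise an error when discharge exceeds production are illustrative assumptions, not a description of how the database handles such cases.

```python
def removal_from_production(production, discharge):
    """Return removal = production - discharge for each (region, sector) cell, in kg."""
    removal = {}
    for cell, produced in production.items():
        discharged = discharge[cell]  # a missing cell raises KeyError
        if discharged > produced:
            # Assumption: treat this as a data error instead of returning a negative removal.
            raise ValueError(f"discharge exceeds production for {cell}")
        removal[cell] = produced - discharged
    return removal


# Hypothetical cells keyed by (region, coordinated sector number)
production = {("Zhejiang", 17): 500.0, ("Anhui", 3): 80.0}
discharge = {("Zhejiang", 17): 120.0, ("Anhui", 3): 80.0}
removal = removal_from_production(production, discharge)
```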
The third challenge is that the sector classifications reported in official statistics changed three times within the data period. Sector classifications are based on the Classification Standards of National Economic Industries (CSNEI), which includes 2-digit, 3-digit and 4-digit codes for all sectors and sub-sectors. The three versions of the CSNEI used for classification are GB/T 4754-1994, GB/T 4754-2002 and GB/T 4754-2011. From 1998 to 2000, all firms are classified into 20 sectors (including cement manufacturing as a sub-sector). The classification changed for the first time in 2001 and 2002, when all firms are classified into 43 sectors (including cement manufacturing and thermal power as sub-sectors). It changed for the second time from 2003 to 2010, when all firms are classified into 42 sectors (including cement manufacturing and thermal power as sub-sectors). Finally, it changed for the third time from 2011 to 2015, when all firms are classified into 42 sectors (including four separate sub-sectors with regional data). To make the data comparable across years, we map all sectors onto 39 sectors labelled from 1 to 39. Table 4 reports the concordance of industrial sectors; its first column is the coordinated sector number. We also find that some firms have different 2-digit sector codes in the CIESD and in the Annual Survey of Industrial Enterprises Database (ASIED). Since the 2-digit sector codes in the ASIED are more consistent with the sector classifications of the official statistics released by the National Bureau of Statistics of China, we match the firms in the CIESD and the ASIED and revise firms’ 2-digit sector codes according to the ASIED before applying the sector concordance. In addition, firms classified as “Others” in the CIESD are all checked and revised manually. The chord diagram in Fig. 1(a) illustrates the changes in all firms’ sector codes in the CIESD.
We can see that most firms’ sector codes in the CIESD are either not revised or keep the same 2-digit sector code after revision. Looking closely at firms whose 2-digit sector codes are revised, interchanges occur in most sectors except for the “Others” sector, whose 2-digit sector code is 39. Since all firms originally classified in the “Others” sector have been manually checked and revised, they are now assigned the correct 2-digit sector codes.
Moreover, since China conducted its first National Census of Pollution Sources (NCPS) in 2007, the ESS has changed significantly in line with the census results. As a result, two special concerns must be kept in mind when using the environmental data. Firstly, before 2011 the ESS relied heavily on firms’ self-reported data to determine the major emission sources. After the complete emission source list provided by the NCPS was adopted, the total number of firms included in the CESD increased significantly in 2011 and has remained stable thereafter. It is therefore not surprising that total discharge data also jump in 2011. Secondly, a large number of discharge coefficients were updated by the NCPS and adopted in the ESS adjusted in 2011. Consequently, some sectors’ pollution levels also changed greatly. If researchers conduct empirical studies using panel models, these systematic changes can be captured by sector and time fixed effects. If researchers instead use methods such as index decomposition or data envelopment analysis, the aforementioned concerns should be treated carefully.
Balancing the environmental data
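We use a two-step procedure to adjust the raw two-dimensional environmental data and construct the final CIED. In the first step, the raw value is adjusted using the following formula:
\[{\widetilde{X}}_{r,j}^{c}={\bar{X}}_{r,j}^{c}\times \frac{{Y}_{r}^{c}}{\sum _{j}{\bar{X}}_{r,j}^{c}}\qquad (3)\]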
where \({\widetilde{X}}_{r,j}^{c}\) represents the adjusted raw value and \({Y}_{r}^{c}\) represents the aggregated macro-level environmental data for region r. The second term on the right-hand side of formula (3) is the region-specific average scale ratio that adds non-major emission sources. Sectors in the same region are assumed to be scaled up proportionally. This step adjusts the aggregated regional raw data and makes them consistent with the macro-level environmental data.
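The proportional scaling described above can be sketched in Python as follows. The sketch assumes that the scale ratio is the published regional total divided by the sum of the region's raw sectoral values; the function name and the sample numbers are hypothetical.

```python
def scale_to_regional_totals(raw, regional_totals):
    """Scale each region's row so that it sums to the published regional total.

    raw: list of rows, one per region, each holding the raw sectoral values.
    regional_totals: one macro-level total per region, in the same order.
    """
    adjusted = []
    for row, target in zip(raw, regional_totals):
        row_sum = sum(row)
        # Region-specific average scale ratio; rows with no raw data stay at zero.
        ratio = target / row_sum if row_sum > 0 else 0.0
        adjusted.append([value * ratio for value in row])
    return adjusted


# Two hypothetical regions and two sectors (kg)
raw = [[30.0, 10.0], [5.0, 15.0]]
regional_totals = [50.0, 20.0]
adjusted = scale_to_regional_totals(raw, regional_totals)
```

Because every sector in a region is multiplied by the same ratio, the sectoral structure within each region is unchanged by this step.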
In the second step, we use the cross-entropy method to balance the two-dimensional data matrix31. In this paper, we assume the regional distributions of the environmental data within each sector are more reliable. Thus, we define the prior coefficients \({\widetilde{a}}_{r,j}^{c}\) and the new coefficients to be estimated \({\widehat{a}}_{r,j}^{c}\) as follows:
where \({\widehat{X}}_{r,j}^{c}\) is the new environmental data to be estimated. Then the objective function is defined as follows:
subject to:
The solution for \({\widehat{a}}_{r,j}^{c}\) is obtained by solving the optimization problem in formulas (6)–(8), and the final estimated environmental data are obtained as follows:
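As a rough, self-contained illustration of what balancing a region-by-sector matrix involves, the Python sketch below uses biproportional scaling (also known as RAS or iterative proportional fitting). This is a stand-in technique, not the coefficient-based cross-entropy problem solved in GAMS in this study; it is shown because biproportional scaling is known to minimize a cross-entropy distance to the prior matrix when row and column totals are fixed. The sketch assumes that the regional and sectoral targets add up to the same grand total; the function name and the sample numbers are hypothetical.

```python
def biproportional_balance(prior, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Rescale rows and columns of `prior` in turn until both sets of totals match."""
    matrix = [list(row) for row in prior]
    n_rows, n_cols = len(matrix), len(matrix[0])
    for _ in range(max_iter):
        # Row step: match the regional totals.
        for r in range(n_rows):
            row_sum = sum(matrix[r])
            if row_sum > 0:
                factor = row_targets[r] / row_sum
                matrix[r] = [v * factor for v in matrix[r]]
        # Column step: match the sectoral totals.
        for j in range(n_cols):
            col_sum = sum(matrix[r][j] for r in range(n_rows))
            if col_sum > 0:
                factor = col_targets[j] / col_sum
                for r in range(n_rows):
                    matrix[r][j] *= factor
        # Converged once the row totals still hold after the column step.
        gap = max(abs(sum(matrix[r]) - row_targets[r]) for r in range(n_rows))
        if gap < tol:
            break
    return matrix


# Hypothetical prior matrix with consistent targets (both sum to 120)
prior = [[40.0, 10.0], [20.0, 30.0]]
balanced = biproportional_balance(prior, [60.0, 60.0], [70.0, 50.0])
```

Cells that are zero in the prior stay zero in the result, so the prior also fixes which region and sector combinations can carry emissions.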
Data Records
Our data records are available through Figshare in Excel format from the repository: https://doi.org/10.6084/m9.figshare.16846966 (ref. 32). Table 5 presents the structure of the environmental data for each year by region and by sector. Each matrix includes 31 regions and 39 sectors. In total, 122 matrices are included in the database. Of these,
- 72 matrices are discharge and removal data for COD and SO2 from 1998 to 2015;
- 30 matrices are discharge and removal data for NH3-N from 2001 to 2015;
- 20 matrices are discharge and removal data for NOX from 2006 to 2015.
The unit for all environmental data is the kilogram (kg).
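The matrix counts above follow directly from the pollutant coverage periods: one matrix per pollutant, per indicator (discharge or removal) and per year. The short Python check below reproduces that arithmetic; the variable and function names are illustrative only.

```python
# (pollutants, first year, last year) as stated in the Data Records section
coverage = [
    (("COD", "SO2"), 1998, 2015),
    (("NH3-N",), 2001, 2015),
    (("NOX",), 2006, 2015),
]
indicators = ("discharge", "removal")


def count_matrices(coverage, indicators):
    """Count one matrix per pollutant, indicator and year."""
    counts = []
    for pollutants, first, last in coverage:
        years = last - first + 1
        counts.append(len(pollutants) * len(indicators) * years)
    return counts


counts = count_matrices(coverage, indicators)
total_matrices = sum(counts)
cells_per_matrix = 31 * 39  # regions x sectors
```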
Technical Validation
Credibility of firm-level data
The prior information used to construct the environmental data matrix is aggregated from the firm-level data, so it is important to validate the credibility of these micro-level data. Since the yearbooks are the only authoritative data sources and the sectoral data in the yearbooks cover only major emission sources, we compare the data in our database with the yearbooks’ data. Supplementary Fig. 1 provides the comparison for all pollutants by sector and by year. For the years 2000 to 2014, the aggregated value for each sector in our database is very close to that in the yearbook. For 1998 and 1999, although the total number of observations is lower than in the yearbook, the ratios of the aggregated value in our database to the total value in the yearbook are similar across sectors. Therefore, the firm-level data in our database are credible, and it is appropriate to calculate the coefficients in formula (4) by sector.
Validation of balanced results
Our estimation procedure aims to adjust the environmental data and make them consistent with the regional aggregations. Since regional aggregations include non-major emission sources, the estimated environmental data have been systematically scaled up. For each year, we run the following two-way fixed effect regression to investigate the adjustments from the original data to the estimated data:
where βc is the slope coefficient and \({\varepsilon }_{r,j}^{c}\) is the noise term, which follows a normal distribution, \({\varepsilon }_{r,j}^{c} \sim N\left(0,{\sigma }_{c}^{2}\right)\). Table 6 presents the estimated slope coefficients and standard deviations for the different pollutants. All coefficients are statistically significant at the 1% level. Table 7 presents the adjusted R2 for the different pollutants, most of which are greater than 0.99. Figure 2 compares the unbalanced and balanced values for the four types of pollutants.
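A slope of this kind can be estimated with a within transformation, sketched below in Python. The sketch assumes that the two fixed effects are region and sector effects, that the regression is run in levels on a complete region-by-sector matrix for one year, and that the original value is the only regressor; none of these details is confirmed by the text above, and the function name and sample numbers are hypothetical.

```python
def two_way_fe_slope(x, y):
    """OLS slope of y on x after removing region (row) and sector (column) means."""

    def within(m):
        n_r, n_c = len(m), len(m[0])
        row_mean = [sum(row) / n_c for row in m]
        col_mean = [sum(m[r][j] for r in range(n_r)) / n_r for j in range(n_c)]
        grand = sum(row_mean) / n_r
        # Two-way demeaning for a complete (balanced) matrix.
        return [
            [m[r][j] - row_mean[r] - col_mean[j] + grand for j in range(n_c)]
            for r in range(n_r)
        ]

    xw, yw = within(x), within(y)
    num = sum(a * b for rx, ry in zip(xw, yw) for a, b in zip(rx, ry))
    den = sum(a * a for rx in xw for a in rx)
    return num / den


# Hypothetical original values, and estimated values built with a known slope of 1.2
x = [[1.0, 2.0, 4.0], [3.0, 5.0, 9.0], [2.0, 8.0, 7.0]]
region_fx = [0.5, -1.0, 2.0]
sector_fx = [3.0, 0.0, -2.0]
y = [[1.2 * x[r][j] + region_fx[r] + sector_fx[j] for j in range(3)] for r in range(3)]
slope = two_way_fe_slope(x, y)
```

A slope above 1 in such a regression would correspond to the systematic scaling-up described above.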
Comparison with provincial official statistics
Among all provinces in China, only Zhejiang province reports sectoral-level environmental data. We collect sectoral-level data for the four types of pollutants from the Zhejiang Statistical Yearbook on Environment and the Zhejiang Natural Resources and Statistical Yearbook on Environment, and coordinate all industrial sectors according to Table 4. Figure 3 compares the official values with our values for the four types of pollutants. Observations for all pollutants are well fitted, and the OLS results in Fig. 3 are highly consistent with those in Fig. 2.
Comparison with national aggregation data
The national aggregation data for the four types of pollutants have been revised because of the corrections to the micro-level firm data. Figure 4 compares the original and revised national aggregation data. The revised national aggregation data are smoother over time than the original data. Abnormal data such as the national COD removal data in 2002 and 2012 have been corrected. Almost all discharge and removal data for the four types of pollutants have been adjusted downward in all years. Removal data show higher reduction rates than discharge data.
Code availability
In this study, we use the General Algebraic Modeling System (GAMS) to conduct the cross-entropy estimation, with the MINOS solver used for the nonlinear optimization. All code used for the analysis is available in the public GitHub repository: https://github.com/qianhaoqi/China-Industrial-Environmental-Database.
References
1. Shen, J. A simultaneous estimation of Environmental Kuznets Curve: Evidence from China. China Econ. Rev. 17, 383–394 (2006).
2. Brajer, V., Mead, R. W. & Xiao, F. Searching for an Environmental Kuznets Curve in China’s air pollution. China Econ. Rev. 22, 383–397 (2011).
3. Lan, J., Kakinaka, M. & Huang, X. Foreign Direct Investment, Human Capital and Environmental Pollution in China. Environ. Resour. Econ. 51, 255–275 (2012).
4. Zhang, C. et al. An integrated specification for the nexus of water pollution and economic growth in China: Panel cointegration, long-run causality and environmental Kuznets curve. Sci. Total Environ. 609, 319–328 (2017).
5. Xu, T. Investigating Environmental Kuznets Curve in China–Aggregation bias and policy implications. Energy Policy 114, 315–322 (2018).
6. Liu, Q. & Wang, Q. How China achieved its 11th Five-Year Plan emissions reduction target: A structural decomposition analysis of industrial SO2 and chemical oxygen demand. Sci. Total Environ. 574, 1104–1116 (2017).
7. Hang, Y., Wang, Q., Wang, Y., Su, B. & Zhou, D. Industrial SO2 emissions treatment in China: A temporal-spatial whole process decomposition analysis. J. Environ. Manage. 243, 419–434 (2019).
8. Shao, L., Yu, X. & Feng, C. Evaluating the eco-efficiency of China’s industrial sectors: A two-stage network data envelopment analysis. J. Environ. Manage. 247, 551–560 (2019).
9. Qian, Y., Cao, H. & Huang, S. Decoupling and decomposition analysis of industrial sulfur dioxide emissions from the industrial economy in 30 Chinese provinces. J. Environ. Manage. 260, 110142 (2020).
10. Li, X., Wu, X. & Zhang, F. A method for analyzing pollution control policies: Application to SO2 emissions in China. Energy Econ. 49, 451–459 (2015).
11. Shi, X. & Xu, Z. Environmental regulation and firm exports: Evidence from the eleventh Five-Year Plan in China. J. Environ. Econ. Manage. 89, 187–200 (2018).
12. Jiang, L., Lin, C. & Lin, P. The determinants of pollution levels: Firm-level evidence from Chinese manufacturing. J. Comp. Econ. 42, 118–142 (2014).
13. Kahn, M. E., Li, P. & Zhao, D. Water Pollution Progress at Borders: The Role of Changes in China’s Political Promotion Incentives. Am. Econ. J.-Econ. Policy 7, 223–242 (2015).
14. Wang, C., Wu, J. & Zhang, B. Environmental regulation, emissions and productivity: Evidence from Chinese COD-emitting manufacturers. J. Environ. Econ. Manage. 92, 54–73 (2018).
15. Zhang, B., Chen, X. & Guo, H. Does central supervision enhance local environmental enforcement? Quasi-experimental evidence from China. J. Public Econ. 164, 70–90 (2018).
16. Greenstone, M. & Hanna, R. Environmental Regulations, Air and Water Pollution, and Infant Mortality in India. Am. Econ. Rev. 104, 3038–3072 (2014).
17. Gehrsitz, M. The effect of low emission zones on air pollution and infant health. J. Environ. Econ. Manage. 83, 121–144 (2017).
18. Zhu, J., Fan, Y., Deng, X. & Xue, L. Low-carbon innovation induced by emissions trading in China. Nat. Commun. 10, 4088 (2019).
19. Maamoun, N. The Kyoto protocol: Empirical evidence of a hidden success. J. Environ. Econ. Manage. 95, 227–256 (2019).
20. Andersson, J. J. Carbon Taxes and CO2 Emissions: Sweden as a Case Study. Am. Econ. J.-Econ. Policy 11, 1–30 (2019).
21. Bi, G., Luo, Y., Ding, J. & Liang, L. Environmental performance analysis of Chinese industry from a slacks-based perspective. Ann. Oper. Res. 228, 65–80 (2015).
22. Wu, J., Song, M. & Yang, L. Advances in energy and environmental issues in China: theory, models, and applications. Ann. Oper. Res. 228, 1–8 (2015).
23. Zhang, Z. & Ye, J. Decomposition of environmental total factor productivity growth using hyperbolic distance functions: A panel data analysis for China. Energy Econ. 47, 87–97 (2015).
24. Wu, J., Zhu, Q., Yin, P. & Song, M. Measuring energy and environmental performance for regions in China by using DEA-based Malmquist indices. Oper. Res. 17, 715–735 (2017).
25. Hawkins, J., Ma, C., Schilizzi, S. & Zhang, F. Promises and pitfalls in environmentally extended input–output analysis for China: A survey of the literature. Energy Econ. 48, 81–88 (2015).
26. Mi, Z. et al. Chinese CO2 emission flows have reversed since the global financial crisis. Nat. Commun. 8, 1–10 (2017).
27. Chen, L. et al. Trans-provincial health impacts of atmospheric mercury emissions in China. Nat. Commun. 10, 1–12 (2019).
28. Qian, H. & Wu, L. Avoiding Mis-estimation of the CES Function: Unit Matters. Appl. Econ. 52, 2056–2062 (2020).
29. Liu, M., Shadbegian, R. & Zhang, B. Does environmental regulation affect labor demand in China? Evidence from the textile printing and dyeing industry. J. Environ. Econ. Manage. 86, 277–294 (2017).
30. Qian, H. et al. Air pollution reduction and climate co-benefits in China’s industries. Nat. Sustain. 4, 417–425 (2021).
31. Robinson, S., Cattaneo, A. & El-Said, M. Updating and Estimating a Social Accounting Matrix Using Cross Entropy Methods. Econ. Syst. Res. 13, 47–64 (2001).
32. Qian, H. et al. China Industrial Environmental Database 1998–2015. figshare https://doi.org/10.6084/m9.figshare.16846966 (2021).
Acknowledgements
This work was supported by the National Key Research and Development Program of China (2020YFA0608600), the National Natural Science Foundation of China (71925010, 72121002, 71703027, 71904125) and the Shanghai Talent Development Fund (2021098).
Author information
Contributions
H.Q. and L.W. conceived the study. H.Q. and F.R. performed analysis. All authors (H.Q., F.R., Y.G., R.M., W.W. and L.W.) interpreted the data. H.Q. prepared the manuscript. H.Q. and L.W. revised the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Qian, H., Ren, F., Gong, Y. et al. China industrial environmental database 1998–2015. Sci Data 9, 259 (2022). https://doi.org/10.1038/s41597-022-01362-x