Abstract
With the rapid proliferation of climate policies in both number and scope, there is an increasing demand for a global-level dataset that provides multi-indicator information on policy elements and their implementation contexts. To address this need, we developed the Global Climate Change Mitigation Policy Dataset (GCCMPD) using a semisupervised hybrid machine learning approach, drawing upon policy information from global, regional, and sector-specific sources. Differing from existing climate policy datasets, the GCCMPD covers a large range of policies, amounting to 73,625 policies of 216 entities. Through the integration of expert knowledge-based dictionary mapping, probability statistics methods, and advanced natural language processing technology, the GCCMPD offers detailed classification of multiple indicators and consistent information on sectoral policy instruments. This includes insights into objectives, target sectors, instruments, legal compulsion, administrative entities, etc. By aligning with the sector classification of the Intergovernmental Panel on Climate Change (IPCC) emission datasets, the GCCMPD serves to help policy-makers, researchers, and social organizations gain a deeper understanding of the similarities and distinctions among climate activities across countries, sectors, and entities.
Similar content being viewed by others
Background & Summary
With the first World Climate Conference in 1979 marking the start of an international focus on climate change issues, the number and scope of climate-related policies have increased substantially. At the global level, an international governance scheme formulated the Kyoto Protocol in 1997 and the Paris Agreement in 2015 to set up national mitigation targets and ancillary mechanisms. National, subnational and sectoral policies have also flourished in recent years, with some local consideration of other objectives such as environmental protection, economic development, equity thinking and sustainable development. Quantitative policy studies have emerged accordingly, either assessing the specific policy effect empirically or simulating the policy impacts virtually.
The literature focusing on the analysis of a single policy typically involves only a single country (or even a specific industry in a certain country)1,2,3,4,5,6 or a comparison of a few countries7,8, thus requiring a limited amount of policy data. A more detailed and consistent climate policy dataset enables large-scale quantitative analysis of global climate policies and provides new information for policy-makers. For example, through detailed classification based on direct and indirect laws, the differences in the coverage scope of greenhouse gas targets can be analysed9. Through the classification of instruments, various patterns of policy convergence and divergence across countries and the driving factors behind them can be compared10.
On the other hand, the differences in the circumstances of policy formulation and implementation are driving wide variations with similar policy instruments, which have not been fully captured for better evaluating the policy effects. In addition, some rising concerns, such as the constraints of policy adoption in developing economies, the synergistic effects of the policy mix, and the spillover effects of market-based instruments across regions, call for more comparative studies with global or cross-regional perspectives. Such demands in policy science require a more finely categorized, multi-indicator climate policy dataset that allows for convincing investigation of policy design under the premise of clarifying the implementation context.
Furthermore, the climate policy literature has diversified significantly with the emergence of new climate practices and the availability of more advanced data processing methods. According to SciVal (see Supplementary Information (SI1) for detailed search terms), the number of articles in climate policy-related research areas has been increasing annually since 1997, reaching 6,946 articles in 2022. Concurrently, the number of topics within the global climate policy research domain has also expanded, rising from 499 in 2018 to 751 in 2022 (Supplementary Figs. 1-2). Climate policy evolution represents one of the emerging branches within this research domain, encompassing various aspects such as policy coverage9,11,12, factors influencing policy adoption10,13,14, policy diffusion15,16, and policy themes and trends17,18,19,20,21. The utilization of policy quantity as the primary variable has broadened the scope of research on policy effects and their relationship with greenhouse gases22,23,24. At the same time, there has been a gradual shift towards more detailed and consistent exploration of policy design and comparison, adopting a mixed policy perspective25,26,27,28,29,30. In addition, research on the trade-offs between multiple goals31,32,33,34 also requires the support of multi-indicator policy datasets.
To the best of our knowledge, the existing global climate policy datasets are still not sufficient to meet the above research demands. The following points are taken as an example. (1) The lack of policy instruments for specific sectors prevents in-depth research on the policy mix for specific sectors. The lack of linkage between sectoral policies and sectoral carbon emissions, coupled with the varying focus of existing datasets on sectors35, further compounds these limitations. (2) Variations in policy versions and coverage can also undermine the robustness of research that relies on policy density as a metric35. (3) The separation of the law from a large number of supportive and weaker policies will also cause the role of supportive policies to be ignored36. The above shortcomings can be summarized as follows. First, the numbers and scopes of policies are still not complete, and updates cannot be made in time due to manual methods of data collection and processing. Second, there is a lack of consistent and more detailed policy categorization in comparable standards, partly due to the overlapping scope of sectors, complexity of entities involved, and difficulty in identifying indirect policies. These limitations also make it impractical to conduct comparative analysis for the identification of additional features across datasets. Additionally, studies comparing existing mainstream global datasets have shown that differences in data coverage and inconsistent classification standards lead to inconsistent results when using different datasets35. Finally, legal enforcement characteristics and information about administrative entities are crucial in driving changes in policy design. The aforementioned features may impact the practical outcomes of policy implementation and serve as a complement to policy ambitions35,36.
We attempted to construct the Global Climate Change Mitigation Policy Dataset (GCCMPD) to fill the gap between the existing climate policy databsets and the increasing demand for research by harmonizing the existing datasets with a hybrid machine learning approach. The GCCMPD currently includes 73,625 policies of 216 entities, with a substantial increase in the numbers and scopes of policies through a complete search of available sources. For each policy in the dataset, we extracted important policy features such as target sector, policy instrument, objective, binding force, executive/legislative, jurisdiction, etc. We mainly used semisupervised machine learning and combined various natural language processing methods to achieve consistent classification of each feature of the policy, which enhances the objectivity and extensibility of the data.
Compared with the existing datasets, our dataset can better respond to the demands of empirical studies and cross-national comparisons of policies. First, our dataset can be used to build more accurate policy indicators. On the one hand, our dataset includes more policy records than any individual dataset and has fewer omitted policies, which can ease the underestimation of the efforts of the entities to mitigate climate change. Coordinated climate policies can also enrich single or small n-case country case studies30,35,37. On the other hand, our dataset provides features of the policies, including the binding force, the executive/legislative and the objectives. Since the differences in the circumstances of policy formulation and implementation can vary greatly among policies with different binding forces (for example, laws and preparative instruments) and strongly influence the output of the policies, future studies can weigh the number of policies with their binding force or add the number of soft and hard laws into their model as two separate variables rather than simply counting the number of policies23,24,35,36. Second, our dataset enables us to measure the output of sectoral policies. Our dataset categorizes the policies into sectors according to the IPCC Sixth Assessment Report (AR6) standard and can be easily linked to emission datasets with the same standard, such as the Emissions Database for Global Atmospheric Research (EDGAR). Thus, with the sectoral emission data from these datasets and climate policy indicators calculated by sector instruments with our dataset, the sectoral emission reduction performance of the policy system (rather than the emission reduction performance of single policies38,39,40,41) can be measured. Third, the policy similarity information of our dataset supports a better exploration of climate policy diffusion. We provide information on the similarity of the context of climate policies, facilitating the inference of policy diffusion using both temporal and contextual data. This approach has the potential to outperform existing studies42, which primarily rely on inferring the probability of diffusion from one entity to another based solely on the time gap between the adoption of climate policies in the two entities. Finally, our dataset includes information on the binding force and policy objectives, enabling studies to explore the relationship between these features and other factors. For example, researchers can investigate how various conditions, such as economic or geographic factors, may influence the binding force of policies, as well as the differences in policy outcomes associated with different levels of binding force. Moreover, our dataset enables studies to focus on the dynamics between hard law and soft law, such as examining how soft law transitions into hard law and to what extent different types of laws influence emissions43,44.
Methods
We constructed the GCCMPD using a multilevel process and framework (Fig. 1). Specifically, this entailed source identification and selection of policy data, selection of policy characteristic indicators, data collection, data processing, manual checks, verification, annotation, dataset expansion and some possible uses. The final data are also stored in different ways to facilitate various types of users.
Data identification
Before identifying the data source, it is important to define the boundaries of the dataset described in this paper. At present, there is no clear definition of climate policy, and existing attempts to define climate policy have adopted very broad and vague scopes20. The reason is that climate policy covers a wide range of areas. For example, greenhouse gases and pollutants have common sources45, resulting in considerable overlap between climate policy and energy and environmental policy. Sometimes climate policy is directly defined as policies to reduce greenhouse gases and air pollution. Our purpose is to construct a detailed classification of multiple-indicator multipurpose datasets, so we use a relatively broad definition to define climate policy as a set of laws, regulations, strategies, and other measures related to climate change, which can be further divided into climate mitigation and climate adaptation policies. Dataset users can filter by attributes such as sources to obtain more targeted subsets of data, despite the broad boundaries of the dataset.
According to the above definition, the GCCMPD roughly consists of three sources (see Table 1). First, global datasets, the Climate Change Laws of the World (CCLW), the International Energy Agency (IEA) Policies and Measures Dataset, and the Climate Policy dataset (CP) were used. Second, regional datasets, such as the Asia Pacific Energy Portal (APEP) and the European Environment Agency (EEA), cover a certain area and have a considerable number of policies. The third category is datasets that focus on specialized fields such as technology, policy instruments, and legislation. These data are extremely specialized but narrow in scope. For the ECOLEX dataset, we selected climate-related subjects (see Supplementary Table 1 for details).
We selected IEA, CP, and CCLW as the training sets for manual annotation, accounting for approximately 16% of all data sources (excluding the legal part of ECOLEX, which accounts for approximately 70%). Since the GCCMPD is a multi-indicator dataset, it is appropriate to choose authoritative and diverse data sources, and these three datasets have gradually been recognized by the academic community18,22,23. We first harmonized the three datasets to form a “core” dataset via the natural language processing technique and manual checking. The “core” dataset can support the demand of analysing global climate mitigation policies since these three datasets cover various types of climate policies (laws, commitment agreements, goals; national level, subnational level) from almost all of the entities in the world, which can basically guarantee the diversity and representativeness of the core dataset. Additionally, we noticed that the data coverage of the “core” dataset can be further improved by including data from other data sources. Then, we used the machine learning models trained on the manually checked core dataset to label data from other sources and formed a full dataset. The IEA, CP, and CCLW datasets have great potential to complement each other in terms of country coverage, sector coverage, laws and nonlaws, policy versions, policy scope, etc. CCLW is a specialized climate law dataset that collects climate laws or equally important policies23, with the highest policy importance but a relatively small number of policies. The IEA was the first of the three datasets to collect policy data on energy-related policies in the building sector. CP is a synthesized dataset containing policies from other datasets (including CCLW and IEA) and various reporting sources, characterized by detailed information on policy targets9,12, instruments and sectors11. After harmonizing with the IEA and CP datasets, the following complements were formed: (1) subnational policies and early policy versions; (2) nonlegal supportive “weaker” policies; (3) specific sector policies (such as the building sector); and (4) more balanced national policies. A detailed comparison of the main characteristics of the three datasets is given in Supplementary Table 2. The main purpose of comparison is to find which classification information (the fields in bold in the table) of these three datasets can help us classify.
Indicator selection and description
The GCCMPD aims to provide a multi-indicator dataset that can be researched and consulted by scholars in various fields, policy-makers, and various social organizations. We mainly chose the following indicators to meet different research needs: sector, instrument, objective, legally binding force, enforcement settings and jurisdiction. Due to the diverse sources of the GCCMPD, choosing a unified standard and definition is conducive to maintaining the consistency of the dataset.
-
Sector: One of the characteristics of the GCCMPD is that it can be closely integrated with the carbon emission dataset, which requires sectoral classification of policies to match them with GHG emissions. Most mainstream GHG emissions datasets46,47 refer to the IPCC 2006 reporting guidelines48 for sectoral classification and generally attribute global GHG emission sources to five broad sectors49 of energy systems, industry, buildings, transport and AFOLU (agriculture, forestry and other land uses). For the above reasons, the GCCMPD also divides sectors into these five sectors, adopting the same sectoral classification standards as the carbon emissions dataset based on the EDGAR50, and the classification standards and descriptions are shown in Supplementary Table 3.
-
Instrument: A taxonomy of policy instruments can help policy-makers formulate policy packages, provide possible options for policy practice, and better evaluate climate policies. Some taxonomies, such as the NATO scheme51 (A widely-used standard that categorizes policies based on the types of governing resources they rely on: Nodality, Authority, Treasure, and Organization.), represent general criteria and are less specific to a certain domain of climate mitigation. We chose the classification standard of the IPCC Fifth Assessment Report (AR5)52, which has the following advantages. First, the classification is authoritative. The classification criteria are summarized by climate experts based on a large amount of literature, reflecting the climate policies that the academic community focuses on. Second, after the modification from the Third Assessment Report (TAR)53 to the AR5, the classification has strong completeness and timeliness. Third, there are sector-level classifications of policy instruments for the above five sectors, which can help to understand and use them in conjunction with sectoral classifications. As a result, policy instruments are divided into seven categories: taxes, tradable allowances, subsidies, government provisions of public goods or services, information programmes, regulatory approaches, and voluntary actions. Detailed sector-level standards and examples can be found in Supplementary Table 4.
-
Objective: The mitigation of climate change is only one of many public issues, and sometimes it is necessary to judge whether climate change has co-benefits and adverse side effects on society, the economy, and the environment. For almost the same reasons as “instruments”, we chose the AR5 taxonomy, which categorizes climate mitigation impacts on economic, social, and environmental objectives/issues. The detailed subobjective classification can be found in Supplementary Table 5.
-
Binding force: Whether the same policy instrument is legally binding affects its effectiveness, which is also the core difference between hard law and soft law. The concept of soft law was originally common in the EU and international law54. At present, countries need to consider multiple goals while mitigating climate change. Therefore, soft law has become a flexible choice when making climate commitments cautiously. Both soft and hard laws have advantages and disadvantages. Hard law usually balances the interests of all parties and is more democratic but has a long promulgation cycle and high costs. Soft law is formulated and adopted more quickly, but at the same time, there are problems of poor consideration and they can be overly ambitious55. In addition, the soft law is a good complement to the hard law, and its functions can be divided into prelaw functions, postlaw functions, and para-law functions55. For soft law, we refer to the classification based on function (as an alternative to legislation) proposed by Senden, L55, which divides soft law into three main categories: preparatory and informative instruments, interpretative and decisional instruments, and steering instruments. For hard laws, we mainly consider the priority when two hard laws conflict, that is, the legal hierarchy56. We refer to the legal hierarchy summarized by Clegg, M. et al.56, combined with the legal hierarchy of the European Union, the United States and other countries, and divide hard law into three main categories: constitution, statutes/legislation, and regulations/rules. The classification rules and descriptions of soft law and hard law can be found in Supplementary Table 6, and some examples are shown in Supplementary Table 7.
-
Executive/legislative: The enactment of policies by executive or legislative bodies can partially measure the durability of policies (assuming that executive measures can more easily be removed by new governments, while laws are more durable, given that their removal requires the approval of the legislative body), referring to Iacobuta et al.9 Policy durability has attracted attention9 because it affects the expectations of policy implementation objects. For example, expectations regarding the durability of policies may affect short-term and long-term decision-making judgements of enterprises, thereby impacting the effectiveness of policy implementation. What is slightly different is that, like the CCLW, the GCCMPD considers whether policies are enacted by the legislative or executive branch of the government on the basis of law or strategy. For example, China’s 14th five-year plan is classified as executive in the CCLW and GCCMPD but not legislative.
-
Jurisdiction: In addition to classifying the scope of policy from an industrial perspective, such as sector, the classification of policy jurisdiction can help to better analyse the interaction between central and local governments57, international competition and multilateral cooperation58, and regional emissions governance59. Therefore, we divided policy jurisdictions into international, national, subnational area (the broader category with multiple provinces), and subnational (single state/provinces, municipality/city) jurisdictions.
Data collection
Some challenges of data collection are that there are few websites that can be directly packaged and downloaded, the webpage rules of each data source are different, and the rules are constantly updated. First, important policy features were manually selected based on the characteristics of each website (all policies include at least title, content, and year). Then, Python toolkits such as Requests (an elegant and simple library allows you to send HTTP requests) and Selenium (is used to automate web browser interaction from Python) were used to obtain the data. All the raw data are stored in document form (CSV, EXCEL) and data platform form (MongoDB).
Data processing
Through steps 1-12 of the data processing shown in Fig. 1, we constructed a reference (original) version of the training set data as of December 31, 2021. To ensure the consistency of manual annotation and reduce manual annotation errors, the classification information of the IEA, CP, and CCLW datasets was used as much as possible. As shown in Supplementary Table 2, among the main indicators of the GCCMPD, the preliminary classifications of sector, instrument, objective (including detailed classification of subsector, sector-instrument and subobjective), and jurisdiction can refer to the IEA, CP, and CCLW classification information. The two indicators of binding force and executive/legislative can only partially refer to the IEA, CP, and CCLW classification information. Therefore, additional legal keywords were constructed with reference to legal dictionaries, such as Black’s law dictionary (https://thelawdictionary.org/) and Duhaime’s law dictionary (http://www.duhaime.org/), to assist in classification. Therefore, these steps were intuitively divided into three categories: IEA, CP, and CCLW classification information (steps 2, 4, 6, and 11); IEA, CP, and CCLW information (steps 7, 8, and 9); and simple rules and mappings (remaining steps).
A detailed description of the processing steps can be found in the Supplementary Information (SI2); the mapping dictionary built based on IEA, CP, and CCLW can be found in Supplementary Table 8-16; and the keywords built with reference to the legal dictionaries can be found in Supplementary Table 17-18.
Although there are differences in the definitions of mitigation and adaptation, thus, the two are strongly related60. Many climate policies often involve both mitigation and adaptation policies, and different datasets differ in terms of whether the same climate policy involves mitigation or adaptation. The GCCMPD adopts a negative list method to construct keywords (Supplementary Table 19) that significantly characterizes adaptation policies and remove them (One of the keywords in Supplementary Table 19 appears in the policy title or content, then the policy is judged to be climate adaptation.) from the dataset.
Below are the descriptive statistics of the training set. The training set contains climate policy data from 199 entities (198 countries and the European Union). Table 2 illustrates the distribution of entities and policies on each continent. According to Table 2, the number of African, Asian, and European entities is the largest. For the number of policies, although the number of European entities is not the largest, European entities issued the most policies among the six continents. The average number of policies issued by each entity is the highest in Europe (89.96) and lowest in Africa (16.09).
There are 10088 policies ranging from 1927 to 2021 (among them, several policies in the IEA dataset are scheduled to enter into force in 2025) included in our core dataset. As Table 3 shows, national climate policies make up approximately 90% of our sample, while international policies account for 3% and subnational policies for 7%. We divided the policies into five sectors and found that most of the policies targeted energy systems (3871) and that the fewest policies focused on AFOLU (838). There are also 3843 policies focused on the whole economic level rather than on a single sector or several other sectors. Policies are divided into seven types according to the instrument they apply. Most policies employ the government provision of public goods or services (7040) and regulatory approaches (4467), while only 121 policies use the tradable allowance instrument. For the binding force of the policies, soft laws constitute a greater share of the sample (69.73%), while hard laws constitute only 30.27% of the sample. Law/Act is the most popular hard law, while Preparatory Instruments is the most widely used soft law. The classification of co-benefit objectives is mainly economic objective, followed by social objective.
As Table 4 shows, by combining these three datasets, the number of policies in each sector has almost doubled. All four datasets have a high share of energy sector policies. In addition, the integration of the construction sector and the AFOLU sector shows the benefits of coordination. The IEA dataset also has a high proportion of construction sector policies, while the CCLW dataset has a relatively high proportion of AFOLU sector policies. GCCMPD generates complementary advantages through the integration of these datasets, further highlighting the significant potential for complementarity35 between the IEA, CP and CCLW.
Manual check, verification, annotation
Check for duplicates
Although merging the three datasets can combine their respective strengths, there will be some duplication of policies, as the three datasets share many of the same sources. Duplicate policies do not affect analysis, such as policy coverage9,11,12 and policy sequencing18,21, but do affect analysis based on policy strength23,24. Based on the Best Match 25 algorithm (BM2561, a ranking algorithm based on probabilistic relevance framework for generating the similarity score between two texts), we used all policy titles as a corpus, grouped the data by country, year, and jurisdiction, and calculated the text similarity score between policies. Human checks combined with text similarity scores eventually revealed duplicate policies (Supplementary Fig. 3).
Verification and annotation
It is also necessary to manually check, verify and annotate each indicator. Based on the reference version, the manual labelling bias was reduced. During manual verification, we also found that there were some policy instruments and sectors that cannot be found directly from the policy content, and the reference version was used to assist the manual work in making more accurate annotations. After manual checking and correcting the dictionary mapping several times, we found that even graduate students majoring in energy and climate were prone to label mistakes.
Here are a few illustrations of how to manually check policy sectors. We judged whether the policies belonged to the energy system or building sector based on whether the home solar PV system was connected to the grid because self-sufficiency alone cannot be considered the energy supply. Heating and cooling generally refer to household heating and cooling appliances (buildings) rather than district heating (energy systems). Fossil fuel extraction (energy systems) depends mainly on how the carbon emissions dataset is classified. Standards for building parking spaces to promote the use of new energy vehicles are categorized as transport and building policies. Equipment safety and photovoltaics are considered industry policies. Biofuels are considered transport policies, biomass power generation is considered for energy systems, and the manufacture of biofuels may involve industry and AFOLU. It should be noted that climate policy data differ from carbon emissions data. It is very common for a policy to involve multiple sectors.
For the instrument indicator, the tax category is relatively easy to mislabel. To distinguish it from subsidies, new taxes and tax increases are classified as taxes, and tax reductions and accelerated depreciation are classified as subsidies. Since the tax classification of the IEA is based on whether a certain type of tax appears in the policy text, the tax reduction will also be recorded as a tax, so it needs to be manually corrected. In addition, grants, awards, R&D subsidies, etc., also need to be classified as subsidies or/and government provisions of public goods or services according to the content of the policy. Since the government provision of public goods or services involves capacity building and the removal of barriers, it includes the establishment of institutions, the provision of financial support, etc. Another thing to point out is that subsectors (sector-instrument and subobjective) are more detailed but not completely classified; that is, this does not mean that a policy classified as a certain sector (instrument or objective) will necessarily be further classified into a certain subsector (sector-instrument and subobjective) (Supplementary Figs. 4-6). For example, some energy plans mentioned several sectors but did not involve specific subsectors. Some government funds support renewable energy power generation without specifying whether it is investment, R&D support, or subsidies.
Search, verification, and supplementation of policy content
The policy content also requires many manual searches and verification. According to Table 5, many policy contents in the CP and IEA are missing or too short, and this will affect the effectiveness of manual checking indicators, machine learning model training, and topic models. We tried to find the content of the policy by searching the source webpage. For some countries whose official language is English (e.g., Canada), the source web page was invalid or incorrect, and for countries whose official language is not English (e.g., Argentina, Spain, Germany, Japan, etc.), there were language problems. We used Google Search (when the source web page was invalid or incorrect) in addition to optical character recognition (OCR) technology and Google Translate for scanned documents and language issues. Through the above efforts, the missing value of policy content has been greatly reduced, and the quality of policy content has improved (Table 5). In addition, manual verification also included translating the policy title and content into English and correcting the policy year and jurisdiction.
Dataset expansion
After the above data processing and manual inspection, a refined training dataset containing 10088 policies was formed (GCCMPD-IEA-CP-CCLW). However, similar to other datasets, the GCCMPD-IEA-CP-CCLW still had shortcomings, such as high labour costs, slow update speed, and difficulty expanding (some data sources do not have relevant indicator information and cannot form dictionary maps). In this section, we used several state-of-the-art natural language processing (NLP) and machine learning techniques to expand the dataset. As shown in Table 6, through the application of machine translation, multilabel classification, single-label classification, named entity recognition, text similarity, and topic models, dataset construction was completed only by relying on policy titles and content.
Multilabel and single-label classification
The GCCMPD’s indicators, sector, instrument, and objective are multilabel classifications, while binding force and executive/legislative are single-label classifications. We chose the ClimateBERT62 algorithm, which is based on the Bidirectional Encoder Representations from Transformers (BERT) model, because it has high performance63,64, relatively low fine-tuning cost compared with large language models such as Generative pretrained transformer (GPT) or Large Language Model Meta AI (Llama), and concentration on climate change (pretrained on the text corpus of abstracts of climate-related research papers, corporate and general news, and company reports, and because it is currently the latest model in the field of climate change), and it is widely used in existing studies65,66. By comparing ClimateBERT and the traditional machine learning models logistic regression classifier (LR), naïve Bayes (NB), and support vector machine (SVM) with ClimateBERT, the results (Supplementary Table 21–25) demonstrate the advantages of ClimateBERT in the climate field. The detailed model training details and comparisons can be found in the Supplementary Information (SI3).
Named entity recognition (NER)
Judging climate policy jurisdictions by identifying states/provinces and countries that appear in policies through NER is an intuitive and highly interpretable method. Specifically, we first used the Roberta-based67 “en_core_web_trf” model to identify geopolitical entities and then use “countryinfo” to distinguish “state/province” or “country”. Furthermore, the “Subnational area” and “International” were determined according to the number of different cities and countries in a policy (Supplementary Table 26).
Text similarity
In the GCCMPD-IEA-CP-CCLW, after calculating the BM25 score, we determined the duplication of policies from different sources by manual checking in groups. However, for data expansion, two issues needed to be considered: 1. When the amount of data increases rapidly, methods based on a large amount of labour are no longer applicable. 2. How can we objectively judge whether two policies are the same? Some text similarity methods, such as bag-of-words/TFIDF combined with cosine similarity, can obtain a standardized value, but thresholds such as 0.8 still need to be set subjectively, and the methods are relatively rough. We still used the BM25 algorithm and combined the ideas of model training and evaluation to automatically find duplicates (Fig. 2). Finally, based on the optimal ranking of 6061 and an F1 score of 0.8 (Supplementary Fig. 7), the optimal BM25 score was determined. A policy with a similarity score greater than the optimal BM25 score was judged to be a duplicate policy. The technical details and detailed instructions can be found in the Supplementary Information (SI2).
Topic model
Policy topic analysis can aid in understanding policy evolution17 and has become a new branch of research in the field of public policy. Currently, BERTopic68 is a state-of-the-art topic model that effectively uses word embedding to transform input documents into numerical representations, dimensionality reduction to reduce the dimensionality of the embeddings to a workable dimensional space and clustering similar embeddings to obtain topic representations through c-TF-IDF (an adjusted TF-IDF to work on cluster/categorical/topic level instead of document level). The GCCMPD provides 4 topic models for different analysis needs, namely, GCCMPD-IEA-CP-CCLW-Topic, GCCMPD-EXPAND-Topic (counting multilateral policies as multiple), GCCMPD-Topic (counting multilateral policies as one) and GCCMPD-EXCEPT-ECOLEX-Topic (excluding ECOLEX). The specific model details can be found in the Supplementary Information (SI2).
Data Records
To facilitate the use of various scientific researchers, government staff, and personnel of international organizations, the GCCMPD provides three forms of data storage: MySQL, MongoDB, and EXCEL. All data records have been uploaded through Figshare: https://doi.org/10.6084/m9.figshare.22590028.v269. The relational data management system (MySQL) can well reflect the composition of core data (Fig. 3):
-
Policy: contains all deduplicated climate mitigation policies (policy ID, policy title and content translated into English, policy source).
-
Additional information: includes the year the policy was published, the original title and content of the policy, and the policy source document web page.
-
Characteristic: contains indicators, sector, objective, instrument, executive/legislative, and binding force, for each policy.
-
Similar information: provides the total number of policies in each policy group, the index, the BM25 score, and the title of the most similar policy in the group.
-
Country and Jurisdiction: provides information about the country where the policy was adopted. The table is also linked to external countries’ economic (e.g., GDP, imports and exports), social (e.g., population, democracy), political (e.g., left-wing or right-wing, federal), and energy (e.g., primary energy usage) and characteristic dataset media.
-
Topic. provides the topic category of each policy, topic name, the most representative words under this topic, the probability that the policy belongs to this topic, and whether it is a representative policy of this topic.
In addition to the core data, the GCCMPD also retains the processing results of each step. Supplementary Table 27 lists some important result files and their descriptions. These files are all stored in EXCEL.
Technical Validation
The data quality of the GCCMPD is affected by several factors. The first is the data source. The comprehensiveness of information from different data sources varies greatly, which directly affects the model training ability. The second is the reliability of the annotation and training sets and the training ability of the model. Since the construction idea of the GCCMPD is to first form a training set through manual labelling and then extrapolate to a richer data source, the quality of the annotation and training set is highly important. We took the following steps to enhance the data quality:
-
Manual data retrieval: The determination of data sources was performed through high-quality literature, Google searches, and data sources from authoritative datasets. The requirements for the data sources were as follows: 1. Include the title and content of the policy, and there should not be too many missing values; 2. Include the year of the policy; 3. The entity information of the policy release is contained as much as possible.
-
Manual annotations and checks: We compared the annotations of PhD students in multiple climate and energy fields (with each person annotating a small number of fields) and found that the differences in their annotations were concentrated in several places (Manual check, verification, annotation section). Although we had given classification criteria and explained where it is easy to misclassify, manual annotation was inevitably subjective. Fortunately, objectivity and consistency can be guaranteed through dictionary mapping. It can be considered that IEA, CP, and CCLW data are the results of authoritative experts. We regard them as real values and the GCCMPD data as predicted values and compare the results (It is called comparison because the prediction result is not based on the model, but only draws on the metrics to give a more intuitive comparison result) of the two (Tables 7–11, for the comparison of subsectors, sector instruments, and subobjectives. See Supplementary Table 28-30 for additional information). The resulting recall rates were generally high, even close to 1, which means that we fully refer to the results of dictionary mapping. Some categories with lower recall results (such as Other Strategy Plan or Target) indicate that we later focused on checking such results. Some results with low precision rates were mostly because the original policy data source did not have corresponding keyword labels, and we annotated them during manual verification and inspection. The precision rate of the taxes category was not high because we checked and found that the classification standard of the IEA dataset was different from that of the GCCMPD dataset, while the precision rate of the subsidies category was not high because keywords such as “Award” and “Grants” cannot determine whether they are subsidies. Therefore, we did not construct a dictionary for relevant keywords and instead we added annotations through manual inspection. For the manual inspection of duplicate data, we also tried to be as detailed as possible and record the specific corresponding relationship of policy duplication.
Table 7 Comparison between dictionary mapping and manual verification of policy instrument results. Table 8 Comparison between dictionary mapping and manual verification of policy sector results. Table 9 Comparison between dictionary mapping and manual verification of policy objective results. Table 10 Comparison between dictionary mapping and manual verification of binding force results. Table 11 Comparison between dictionary mapping and manual verification of executive/legislative results. -
Enhanced model training capabilities: Based on the use of state-of-the-art models, we improved model learning by translating non-English policies into English and checking and completing policy content. In terms of manually checking policy content, we selected content that could be used to summarize the policy, such as goals, purposes, and objectives, as well as information that could be used to assist in judging the legal category.
After the GCCMPD data are made public, we will also enhance the data sources and correct errors through interactions with data users.
Nonetheless, our dataset still has flaws, one of which is that our dataset does not contain adaptation policies. Climate adaptation is a very important field70, especially for countries with high climate vulnerability and high mitigation costs. Future research can improve our study by constructing an adaptation policy dataset.
Code availability
Code, dataset and some intermediate results are freely available on the following GitHub repository: https://github.com/HUANGZHIHAO1994/GCCMPD-climate-policy-dataset.
References
Zhu, J., Ge, Z., Wang, J., Li, X. & Wang, C. Evaluating regional carbon emissions trading in China: effects, pathways, co-benefits, spillovers, and prospects. Climate Policy 22, 918–934 (2022).
Zhu, J., Fan, Y., Deng, X. & Xue, L. Low-carbon innovation induced by emissions trading in China. Nat Commun 10, 4088 (2019).
Cui, J., Wang, C., Zhang, J. & Zheng, Y. The effectiveness of China’s regional carbon market pilots in reducing firm emissions. Proc. Natl. Acad. Sci. USA 118, e2109912118 (2021).
Yamazaki, A. Jobs and climate policy: Evidence from British Columbia’s revenue-neutral carbon tax. Journal of Environmental Economics and Management 83, 197–216 (2017).
Maestre-Andrés, S., Drews, S., Savin, I. & van den Bergh, J. Carbon tax acceptability with information provision and mixed revenue uses. Nat Commun 12, 7017 (2021).
Liski, M. & Tahvonen, O. Can carbon tax eat OPEC’s rents? Journal of Environmental Economics and Management 47, 1–12 (2004).
Demailly, D. & Quirion, P. European Emission Trading Scheme and competitiveness: A case study on the iron and steel industry. Energy Economics 30, 2009–2027 (2008).
Nong, D., Simshauser, P. & Nguyen, D. B. Greenhouse gas emissions vs CO2 emissions: Comparative analysis of a global carbon tax. Applied Energy 298, 117223 (2021).
Iacobuta, G., Dubash, N. K., Upadhyaya, P., Deribe, M. & Höhne, N. National climate change mitigation legislation, strategy and targets: a global update. Climate Policy 18, 1114–1132 (2018).
Lachapelle, E. & Paterson, M. Drivers of national climate policy. Climate Policy 13, 547–571 (2013).
Nascimento, L. et al. Twenty years of climate policy: G20 coverage and gaps. Climate Policy 22, 158–174 (2022).
Dubash, N. K., Hagemann, M., Höhne, N. & Upadhyaya, P. Developments in national climate change mitigation legislation and strategy. Climate Policy 13, 649–664 (2013).
Fankhauser, S., Gennaioli, C. & Collins, M. Do international factors influence the passage of climate change legislation? Climate Policy 16, 318–331 (2016).
Fankhauser, S., Gennaioli, C. & Collins, M. The political economy of passing climate change legislation: Evidence from a survey. Global Environmental Change 35, 52–61 (2015).
Tews, K., Busch, P.-O. & Jorgens, H. The diffusion of new environmental policy instruments1. Eur J Political Res 42, 569–600 (2003).
Busch, P. & Jörgens, H. The international sources of policy convergence: explaining the spread of environmental policy innovations. Journal of European Public Policy 12, 860–884 (2005).
Biesbroek, R., Wright, S. J., Eguren, S. K., Bonotto, A. & Athanasiadis, I. N. Policy attention to climate change impacts, adaptation and vulnerability: a global assessment of National Communications (1994–2019). Climate Policy 22, 97–111 (2022).
Linsenmeier, M., Mohommad, A. & Schwerhoff, G. Policy sequencing towards carbon pricing among the world’s largest emitters. Nat. Clim. Chang. 12, 1107–1110 (2022).
Averchenkova, A., Fankhauser, S. & Nachmany, M. Trends in Climate Change Legislation. (Edward Elgar Publishing, Cheltenham, UK; Northampton, MA, USA, 2017).
Meckling, J. & Allan, B. B. The evolution of ideas in global climate policy. Nat. Clim. Chang. 10, 434–438 (2020).
Meckling, J., Sterner, T. & Wagner, G. Policy sequencing toward decarbonization. Nat Energy 2, 918–922 (2017).
Le Quéré, C. et al. Drivers of declining CO2 emissions in 18 developed economies. Nat. Clim. Chang. 9, 213–217 (2019).
Eskander, S. M. S. U. & Fankhauser, S. Reduction in greenhouse gas emissions from national climate legislation. Nature Climate Change 10, 750–756 (2020).
Chen, P. et al. The heterogeneous role of energy policies in the energy transition of Asia–Pacific emerging economies. Nature Energy 7, 588–596 (2022).
Lamb, W. F. & Minx, J. C. The political economy of national climate policy: Architectures of constraint and a typology of countries. Energy Research & Social Science 64, 101429 (2020).
Peñasco, C., Anadón, L. D. & Verdolini, E. Systematic review of the outcomes and trade-offs of ten types of decarbonization policy instruments. Nat. Clim. Chang. 11, 257–265 (2021).
Rogge, K. S. & Reichardt, K. Policy mixes for sustainability transitions: An extended concept and framework for analysis. Research Policy 45, 1620–1635 (2016).
van den Bergh, J. et al. Designing an effective climate-policy mix: accounting for instrument synergy. Climate Policy 21, 745–764 (2021).
Schmidt, T. S. & Sewerin, S. Measuring the temporal dynamics of policy mixes – An empirical analysis of renewable energy policy mixes’ balance and design features in nine countries. Research Policy 48, 103557 (2019).
Schmidt, N. M. & Fleig, A. Global patterns of national climate policies: Analyzing 171 country portfolios on climate policy integration. Environmental Science & Policy 84, 177–185 (2018).
Fankhauser, S., Hepburn, C. & Park, J. Combining multiple climate policy instruments: how not to do it. Clim. Change Econ. 01, 209–225 (2010).
Viguié, V. & Hallegatte, S. Trade-offs and synergies in urban climate policies. Nature Clim Change 2, 334–337 (2012).
Persha, L., Agrawal, A. & Chhatre, A. Social and Ecological Synergy: Local Rulemaking, Forest Livelihoods, and Biodiversity Conservation. Science 331, 1606–1608 (2011).
Bryan, B. A. et al. Designer policy for carbon and biodiversity co-benefits under global change. Nature Clim Change 6, 301–305 (2016).
Schaub, S., Tosun, J., Jordan, A. & Enguer, J. Climate Policy Ambition: Exploring A Policy Density Perspective. Politics and Governance 10, 226–238 (2022).
Nascimento, L. & Höhne, N. Expanding climate policy adoption improves national mitigation efforts. npj Clim. Action 2, 12 (2023).
Schaffrin, A., Sewerin, S. & Seubert, S. Toward a Comparative Measure of Climate Policy Output. Policy Studies Journal 43, 257–282 (2015).
CLÒ, S. The effectiveness of the EU Emissions Trading Scheme. Climate Policy 9, 227–241 (2009).
Sandoff, A. & Schaad, G. Does EU ETS lead to emission reductions through trade? The case of the Swedish emissions trading sector participants. Energy Policy 37, 3967–3977 (2009).
Castro, P. Does the CDM discourage emission reduction targets in advanced developing countries? Climate Policy 12, 198–218 (2012).
Lin, B. & Li, X. The effect of carbon tax on per capita CO2 emissions. Energy Policy 39, 5137–5146 (2011).
Desmarais, B. A., Harden, J. J. & Boehmke, F. J. Persistent Policy Pathways: Inferring Diffusion Networks in the American States. American Political Science Review 109, 392–406 (2015).
Lawrence, P. & Wong, D. Soft law in the Paris Climate Agreement: Strength or weakness? Review of European. Comparative & International Environmental Law 26, 276–286 (2017).
Vihma, A. Analyzing Soft Law and Hard Law in Climate Change. in Climate Change and the Law (eds. Hollo, E. J., Kulovesi, K. & Mehling, M.) 143–164 (Springer Netherlands, Dordrecht, 2013). https://doi.org/10.1007/978-94-007-5440-9_7.
Thompson, T. M., Rausch, S., Saari, R. K. & Selin, N. E. A systems approach to evaluating the air quality co-benefits of US carbon policies. Nature Clim Change 4, 917–923 (2014).
Gütschow, J. et al. The PRIMAP-hist national historical emissions time series. Earth Syst. Sci. Data 8, 571–603 (2016).
Jeffery, M. L., Gütschow, J., Gieseke, R. & Gebel, R. PRIMAP-crf: UNFCCC CRF data in IPCC 2006 categories. Earth Syst. Sci. Data 10, 1427–1438 (2018).
Eggleston, H. S, Buendia, L, Miwa, K, Ngara, T. & Tanabe, K. 2006 IPCC Guidelines for National Greenhouse Gas Inventories. (2006).
Lamb, W. F. et al. A review of trends and drivers of greenhouse gas emissions by sector from 1990 to 2018. Environ. Res. Lett. 16, 073005 (2021).
Minx, J. C. et al. A comprehensive and synthetic dataset for global, regional, and national greenhouse gas emissions by sector 1970–2018 with an extension to 2019. Earth Syst. Sci. Data 13, 5213–5252, https://doi.org/10.5194/essd-13-5213-2021 (2021).
Hood, C., Margetts, H. & Hood, C. The Tools of Government in the Digital Age. (Palgrave Macmillan, Basingstoke, 2007).
IPCC. Climate Change 2014: Mitigation of Climate Change: Working Group III Contribution to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. (Cambridge University Press, New York, NY, 2014).
Metz, B., Davidson, O., Swart, R. & Pan, J. Climate Change 2001: Mitigation: Contribution of Working Group III to the Third Assessment Report of the Intergovernmental Panel on Climate Change. (Cambridge university press, 2001).
Robilant, A. D. Genealogies of Soft Law. The American Journal of Comparative Law 54, 499–554 (2006).
Senden, L. Soft Law in European Community Law. vol. 1 (Hart publishing, 2004).
Clegg, M., Ellena, K., Ennis, D. & Vickery, C. The hierarchy of laws: understanding and implementing the legal frameworks that govern election. International Foundation for Electoral Systems, Arlington, VA (2016).
Di Gregorio, M. et al. Multi-level governance and power in climate change policy networks. Global Environmental Change 54, 64–77 (2019).
Doukas, H., Karakosta, C. & Psarras, J. RES technology transfer within the new climate regime: A “helicopter” view under the CDM. Renewable and Sustainable Energy Reviews 13, 1138–1143 (2009).
Bulkeley, H. Cities and the Governing of Climate Change. Annual Review of Environment and Resources 35, 229–253 (2010).
Parry, M. L., Canziani, O., Palutikof, J., Van der Linden, P. & Hanson, C. Climate Change 2007-Impacts, Adaptation and Vulnerability: Working Group II Contribution to the Fourth Assessment Report of the IPCC. vol. 4 (Cambridge University Press, 2007).
Robertson, S. & Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. FNT in Information Retrieval 3, 333–389 (2009).
Webersinke, N., Kraus, M., Bingler, J. & Leippold, M. ClimateBERT: A Pretrained Language Model for Climate-Related Text. arXiv preprint arXiv:2110.12010 (2021).
Gök, A., Antai, R., Milošević, N. & Al-Nabki, W. Building the European Social Innovation Database with Natural Language Processing and Machine Learning. Sci Data 9, 697 (2022).
Minaee, S. et al. Deep learning–based text classification: a comprehensive review. ACM computing surveys (CSUR) 54, 1–40 (2021).
Kölbel, J. F., Leippold, M., Rillaerts, J. & Wang, Q. Ask BERT: How Regulatory Disclosure of Transition and Physical Climate Risks Affects the CDS Term Structure*. Journal of Financial Econometrics 22, 30–69 (2024).
Bingler, J. A., Kraus, M., Leippold, M. & Webersinke, N. Cheap talk and cherry-picking: What ClimateBert has to say on corporate climate risk disclosures. Finance Research Letters 47, 102776 (2022).
Liu, Y. et al. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794 (2022).
Huang, Z., Wu, L., Zhang, X. & Wang, Y. Global Climate Change Mitigation Policy Database. Figshare https://doi.org/10.6084/m9.figshare.22590028.v2 (2023).
Lesnikowski, A., Ford, J., Biesbroek, R., Berrang-Ford, L. & Heymann, S. J. National-level progress on adaptation. Nature Clim Change 6, 261–264 (2016).
Acknowledgements
This study acknowledges financial support from the National Natural Science Foundation of China (71925010).
Author information
Authors and Affiliations
Contributions
L.W. conceived the study. Z.H. performed analysis. All authors (Z.H., Y.W., X.Z. and L.W.) interpreted the data. Z.H. prepared the manuscript. Z.H., Y.W. and L.W. revised the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wu, L., Huang, Z., Zhang, X. et al. Harmonizing existing climate change mitigation policy datasets with a hybrid machine learning approach. Sci Data 11, 580 (2024). https://doi.org/10.1038/s41597-024-03411-z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41597-024-03411-z
This article is cited by
-
Occupational carbon footprints and exposure to climate transition risks
Nature Communications (2025)
-
Measuring China’s Policy Stringency on Climate Change for 1954–2022
Scientific Data (2025)