Abstract
The accurate and transparent estimation of greenhouse gas emissions is essential for corporate sustainability reporting and machine learning applications. Existing emission-factor datasets have restrictive licenses, insufficient spatiotemporal granularity, or outdated information, limiting their reproducibility and utility across disciplines. We present ExioML, an open-source dataset derived from Exiobase 3.8.2. It integrates environmentally extended multi-regional input-output tables with a graphics processing unit (GPU)-accelerated computational toolkit, facilitating compatibility with and extensibility to other datasets. ExioML encompasses sector-level emission factor data for 49 regions and 28 years from 1995 to 2022, structured into two aggregation schemes: a product-by-product format covering 200 categories, and an industry-by-industry format covering 163 categories. To validate dataset usability and establish a reproducible baseline, we define a regression task for predicting sectoral greenhouse gas emissions. The task is evaluated using tree-based and neural-network-based models, with mean squared error as the evaluation metric. ExioML provides openly accessible emission-factor tables and a reproducible baseline intended to support reuse and benchmarking across sustainability and machine-learning studies.
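As a rough illustration of the evaluation metric named above, the following sketch computes mean squared error for a trivial constant predictor. The numbers are made-up toy values, not ExioML data, and the constant predictor is only a stand-in for illustration; it is not one of the tree-based or neural-network-based models evaluated in the paper.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy targets (hypothetical values, NOT taken from ExioML).
y_train = [1.0, 2.0, 3.0]
y_test = [2.0, 4.0]

# Stand-in model: always predict the mean of the training targets.
train_mean = sum(y_train) / len(y_train)
y_pred = [train_mean] * len(y_test)

mse = mean_squared_error(y_test, y_pred)
print(f"constant-predictor MSE: {mse}")
```

Any fitted regressor can be scored the same way by passing its predictions as `y_pred`; a lower value indicates a closer fit on the held-out targets.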
Data availability
The ExioML dataset [40], including the Factor Accounting and Footprint Network tables, is publicly available on the Zenodo repository (https://doi.org/10.5281/zenodo.10604610). The repository provides four CSV files: ExioML_factor_accounting_PxP.csv, ExioML_factor_accounting_IxI.csv, ExioML_footprint_network_PxP.csv, and ExioML_footprint_network_IxI.csv, covering 49 regions from 1995 to 2022. The PxP/IxI suffixes distinguish product-by-product and industry-by-industry variants, and the two components correspond to the tabular factor tables and footprint edge lists described in Data Records. ExioML redistributes only derived emission factors and footprint summaries computed from the openly licensed EXIOBASE 3.8.2 dataset (CC BY-SA 4.0) [15]; no proprietary MRIO inputs are included.
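The file naming scheme above can be sketched in a few lines of Python. This is a minimal example that assumes the four CSV files have already been downloaded into a local directory; the helper names (`exioml_filename`, `all_filenames`, `read_header`) are hypothetical and are not part of the ExioML toolkit. No column names are assumed: the last helper only returns whatever header row the file contains.

```python
import csv
from pathlib import Path

# The two components and two aggregation schemes named in the text.
COMPONENTS = ("factor_accounting", "footprint_network")
SCHEMES = ("PxP", "IxI")  # product-by-product, industry-by-industry


def exioml_filename(component: str, scheme: str) -> str:
    """Build a file name following the pattern listed in the text."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component!r}")
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme!r}")
    return f"ExioML_{component}_{scheme}.csv"


def all_filenames() -> list[str]:
    """Return the four file names, one per component and scheme."""
    return [exioml_filename(c, s) for c in COMPONENTS for s in SCHEMES]


def read_header(data_dir: str, component: str, scheme: str) -> list[str]:
    """Read only the first row of a locally downloaded CSV file."""
    path = Path(data_dir) / exioml_filename(component, scheme)
    with path.open(newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle))


if __name__ == "__main__":
    for name in all_filenames():
        print(name)
```

Calling `read_header("path/to/download", "factor_accounting", "PxP")` on a local copy is a quick way to inspect the available columns before loading a full table.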
Code availability
The code for constructing ExioML can be found on GitHub (https://github.com/Yvnminc/ExioML).
References
Dumit, A. et al. Atlas: A spend classification benchmark for estimating scope 3 carbon emissions. In: NeurIPS 2024 Workshop on Tackling Climate Change with Machine Learning https://www.climatechange.ai/papers/neurips2024/70 (2024).
Balaji, B. et al. Flamingo: Environmental impact factor matching for life cycle assessment with zero-shot machine learning. ACM Journal on Computing and Sustainable Societies 1(2), 1–23 (2023).
Jain, A., Padmanaban, M., Hazra, J., Godbole, S. & Weldemariam, K. Scope 3 emission estimation using large language models. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2023).
Balaji, B., Vunnava, V. S. G., Guest, G. & Kramer, J. Caml: Carbon footprinting of household products with zero-shot semantic text similarity. In: Proceedings of the ACM Web Conference 2023, pp. 4004–4014 (2023).
Rao, N. D., Riahi, K. & Grubler, A. Climate impacts of poverty eradication. Nature Climate Change 4(9), 749–751, https://doi.org/10.1038/nclimate2340 (2014).
Jorgenson, A. K. Economic development and the carbon intensity of human well-being. Nature Climate Change 4(3), 186–189, https://doi.org/10.1038/nclimate2110 (2014).
Rolnick, D. et al. Tackling climate change with machine learning. ACM Computing Surveys (CSUR) 55(2), 1–96, https://doi.org/10.1145/3485128 (2022).
Lam, R. et al. Learning skillful medium-range global weather forecasting. Science 382(6677), 1416–1421, https://doi.org/10.1126/science.adi2336 (2023).
Stanimirova, R. et al. A global land cover training dataset from 1984 to 2020. Sci. Data 10(1), 879, https://doi.org/10.1038/s41597-023-02798-5 (2023).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016).
Zheng, X. et al. A multi-scale time-series dataset with benchmark for machine learning in decarbonized energy grids. Sci. Data 9(1), 359, https://doi.org/10.1038/s41597-022-01455-7 (2022).
Nangini, C. et al. A global dataset of CO2 emissions and ancillary data related to emissions for 343 cities. Sci. Data 6(1), 1–29, https://doi.org/10.1038/sdata.2018.280 (2019).
Zhu, B. et al. Carbonmonitor-power near-real-time monitoring of global power generation on hourly to daily scales. Sci. Data 10(1), 217, https://doi.org/10.1038/s41597-023-02094-2 (2023).
Ballarin, A. S. et al. Climbra-climate change dataset for Brazil. Sci. Data 10(1), 47, https://doi.org/10.1038/s41597-023-01956-z (2023).
Stadler, K. Exiobase 3: Developing a time series of detailed environmentally extended multi-regional input-output tables. Journal of Industrial Ecology 22(3), 502–515, https://doi.org/10.1111/jiec.12715 (2018).
Leontief, W. & Strout, A. Multiregional input-output analysis. In: Structural Interdependence and Economic Development: Proceedings of an International Conference on Input-Output Techniques, Geneva, September 1961, pp. 119–150 https://doi.org/10.1007/978-1-349-81634-7_8 (1963).
Wang, S., Zhao, Y. & Wiedmann, T. Carbon emissions embodied in China–Australia trade: A scenario analysis based on input–output analysis and panel regression models. Journal of Cleaner Production 220, 721–731, https://doi.org/10.1016/j.jclepro.2019.02.071 (2019).
Sun, C., Chen, L. & Zhang, F. Exploring the trading embodied CO2 effect and low-carbon globalization from the international division perspective. Environmental Impact Assessment Review 83, 106414, https://doi.org/10.1016/j.eiar.2020.106414 (2020).
Steinberger, J. K., Roberts, J. T., Peters, G. P. & Baiocchi, G. Pathways of human development and carbon emissions embodied in trade. Nature Climate Change 2(2), 81–85, https://doi.org/10.1038/nclimate1371 (2012).
Jakob, M. & Marschinski, R. Interpreting trade-related CO2 emission transfers. Nature Climate Change 3(1), 19–23, https://doi.org/10.1038/nclimate1630 (2013).
Isard, W. Interregional and regional input-output analysis: a model of a space-economy. The Review of Economics and Statistics, 318–328 (1951).
Chenery, H. B. & Watanabe, T. International comparisons of the structure of production. Econometrica: Journal of the Econometric Society, 487–521 (1958).
Hoekstra, R. & Bergh, J. C. Comparing structural decomposition analysis and index. Energy economics 25(1), 39–64, https://doi.org/10.1016/S0140-9883(02)00059-2 (2003).
Peters, G. P. et al. Key indicators to track current progress and future ambition of the Paris Agreement. Nature Climate Change 7(2), 118–122, https://doi.org/10.1038/nclimate3202 (2017).
Duan, Y. & Yan, B. Economic gains and environmental losses from international trade: A decomposition of pollution intensity in China’s value-added trade. Energy economics 83, 540–554, https://doi.org/10.1016/j.eneco.2019.08.002 (2019).
Kitzes, J. An introduction to environmentally-extended input-output analysis. Resources 2(4), 489–503 (2013).
Peters, G. P. & Hertwich, E. G. Pollution embodied in trade: The Norwegian case. Global Environmental Change 16(4), 379–387 (2006).
Hertwich, E. G. & Peters, G. P. Carbon footprint of nations: a global, trade-linked analysis. Environmental science & technology 43(16), 6414–6420 (2009).
Meng, J. et al. The narrowing gap in developed and developing country emission intensities reduces global trade’s carbon leakage. Nature Communications 14(1), 3775, https://doi.org/10.1038/s41467-023-39449-7 (2023).
Tian, K. et al. Regional trade agreement burdens global carbon emissions mitigation. Nature Communications 13(1), 408, https://doi.org/10.1038/s41467-022-28004-5 (2022).
Akbari, M. & Do, T. N. A. A systematic review of machine learning in logistics and supply chain management: current trends and future directions. Benchmarking: An International Journal 28(10), 2977–3005, https://doi.org/10.1108/BIJ-10-2020-0514 (2021).
Rolnick, D. et al. Tackling Climate Change with Machine Learning https://arxiv.org/abs/1906.05433 (2019).
Abdella, G. M., Kucukvar, M., Onat, N. C., Al-Yafay, H. M. & Bulak, M. E. Sustainability assessment and modeling based on supervised machine learning techniques: The case for food consumption. Journal of Cleaner Production 251, 119661, https://doi.org/10.1016/j.jclepro.2019.119661 (2020).
Nilashi, M. et al. Measuring sustainability through ecological sustainability and human sustainability: A machine learning approach. Journal of Cleaner Production 240, 118162, https://doi.org/10.1016/j.jclepro.2019.118162 (2019).
He, Y. et al. Factors influencing carbon emissions from China’s electricity industry: Analysis using the combination of LMDI and k-means clustering. Environmental Impact Assessment Review 93, 106724, https://doi.org/10.1016/j.eiar.2021.106724 (2022).
Kijewska, A. & Bluszcz, A. Research of varying levels of greenhouse gas emissions in European countries using the k-means method. Atmospheric Pollution Research 7(5), 935–944, https://doi.org/10.1016/j.apr.2016.05.010 (2016).
Wiedmann, T. et al. Development of an embedded carbon emissions indicator – producing a time series of input–output tables for the UK by using a MRIO data optimisation system. Report to the UK Department for Environment, Food and Rural Affairs by Stockholm Environment Institute at the University of York and Centre for Integrated Sustainability Analysis at the University of Sydney, London, DEFRA (2007).
Stadler, K. Pymrio – a Python based multi-regional input-output analysis toolbox https://doi.org/10.5334/jors.251 (2021).
Ang, B. W. Decomposition analysis for policymaking in energy: which is the preferred method? Energy Policy 32(9), 1131–1139 (2004).
Guo, Y. & Ma, J. ExioML: Eco-economic Dataset for Machine Learning in Global Sectoral Sustainability. Zenodo https://doi.org/10.5281/zenodo.10604610, https://zenodo.org/records/10604610 (2024).
Sun, W. & Huang, C. Predictions of carbon emission intensity based on factor analysis and an improved extreme learning machine from the perspective of carbon emission efficiency. Journal of Cleaner Production 338, 130414, https://doi.org/10.1016/j.jclepro.2022.130414 (2022).
Riahi, K., Grübler, A. & Nakicenovic, N. Scenarios of long-term socio-economic and environmental development under climate stabilization. Technological forecasting and social change 74(7), 887–935, https://doi.org/10.1016/j.techfore.2006.05.026 (2007).
Matisoff, D. C. Different rays of sunlight: Understanding information disclosure and carbon transparency. Energy Policy 55, 579–592, https://doi.org/10.1016/j.enpol.2012.12.049 (2013).
Gorishniy, Y., Rubachev, I., Khrulkov, V. & Babenko, A. Revisiting deep learning models for tabular data. Advances in Neural Information Processing Systems 34, 18932–18943, https://doi.org/10.48550/arXiv.2106.11959 (2021).
Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems 30 (2017).
Zhang, O. Tips for data science competitions. https://datascience.stackexchange.com/questions/10839 (2016).
Joseph, M. Pytorch tabular: A framework for deep learning with tabular data. arXiv preprint arXiv:2104.13638 https://doi.org/10.48550/arXiv.2104.13638 (2021).
Dietzenbacher, E., Los, B., Stehrer, R., Timmer, M. & De Vries, G. The construction of world input–output tables in the wiod project. Economic systems research 25(1), 71–98, https://doi.org/10.1080/09535314.2012.761180 (2013).
Chepeliev, M. Gtap-power data base: Version 11. Journal of Global Economic Analysis 8(2) https://doi.org/10.21642/JGEA.080203AF (2023).
Lenzen, M., Moran, D., Kanemoto, K. & Geschke, A. Building eora: a global multi-region input–output database at high country and sector resolution. Economic Systems Research 25(1), 20–49, https://doi.org/10.1080/09535314.2013.769938 (2013).
Ingwersen, W. W., Li, M., Young, B., Vendries, J. & Birney, C. USEEIO v2.0, the US environmentally-extended input-output model v2.0. Sci. Data 9(1), 194 (2022).
Stadler, K. et al. Exiobase 3 (version 3.8.2). Zenodo https://doi.org/10.5281/zenodo.5589597 (2021).
Cover, T. & Hart, P. Nearest neighbor pattern classification. IEEE transactions on information theory 13(1), 21–27, https://doi.org/10.1109/TIT.1967.1053964 (1967).
Hoerl, A. E. & Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67, https://doi.org/10.1080/00401706.1970.10488634 (1970).
Quinlan, J. R. Induction of decision trees. Machine learning 1, 81–106, https://doi.org/10.1007/BF00116251 (1986).
Breiman, L. Random forests. Machine learning 45, 5–32, https://doi.org/10.1023/A:1010933404324 (2001).
Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189–1232 (2001).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521(7553), 436–444, https://doi.org/10.1038/nature14539 (2015).
Joseph, M. & Raj, H. Gate: Gated additive tree ensemble for tabular classification and regression. arXiv preprint arXiv:2207.08548 https://doi.org/10.48550/arXiv.2207.08548 (2022).
Acknowledgements
This work received no external funding.
Author information
Contributions
Y. Guo designed the study and produced the dataset, visualisations, and regression models for technical validation. J. Ma supervised the project. C. Guan participated in the project design discussion and helped improve the paper draft. All authors contributed to the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Guo, Y., Guan, C. & Ma, J. Global emission factor dataset for Scope 3 machine learning applications. Sci Data (2026). https://doi.org/10.1038/s41597-026-06699-1
DOI: https://doi.org/10.1038/s41597-026-06699-1


