Abstract
We present a dataset of over 3,000 global disaster events from 2014 to 2024, derived from the Emergency Events Database (EM-DAT). Events are extracted from news using a pipeline combining Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) for semantic extraction. The corpus is the Europe Media Monitor (EMM), aggregating content from millions of news outlets. For each event, structured storylines are automatically generated, summarizing hazard characteristics, drivers, impacts, and responses, and transformed into knowledge graphs. This enables analysis of relationships, inter-hazard dynamics, and human-environment interactions often missed in traditional records. A small subset of knowledge graphs was evaluated by domain experts in a workshop, while a larger sample of extracted triplets was independently assessed to quantify precision and inter-annotator agreement. The dataset supports retrospective analysis and multi-hazard risk assessment, complementing resources like the Hazard Information Profiles (HIPs). All data, code, and workflows are openly available, with an interactive dashboard for exploration. This resource advances data-driven approaches to disaster scenario modeling, impact analysis, and decision support in disaster risk management.
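As an illustration of the triplet assessment mentioned above, the following is a minimal sketch of how precision and inter-annotator agreement could be computed from two annotators' judgements. It assumes binary correct/incorrect labels per triplet and uses Cohen's kappa as the agreement statistic; the abstract does not name the statistic, and the labels, function names, and data layout here are hypothetical, not the authors' evaluation code.

```python
from collections import Counter


def precision(labels):
    """Fraction of extracted triplets judged correct (1 = correct, 0 = incorrect)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(labels) / len(labels)


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical judgements for four extracted triplets (illustrative only).
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(precision(annotator_1), cohen_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most triplets fall into one class.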
Data availability
The dataset containing both storylines and KGs is available in CSV format within the Joint Research Centre Data Catalogue at https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/ETOHA/storylines/DisasterStory.csv (ref. 21). To maximize access and visibility, we also make all data and code available on Zenodo in a single repository, which can be downloaded at https://doi.org/10.5281/zenodo.18598183.
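A first look at the CSV file can be taken with the Python standard library alone. The sketch below makes no assumption about the column names, which are not listed here: it only reports the header and the number of records. The helper names `fetch_text` and `inspect_csv` are hypothetical, and the UTF-8 encoding and comma delimiter are assumptions that may need adjusting.

```python
import csv
import io
import urllib.request

# URL as given in the data availability statement.
DATASET_URL = (
    "https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/"
    "ETOHA/storylines/DisasterStory.csv"
)


def fetch_text(url, encoding="utf-8"):
    """Download a remote file and return it as text (encoding is an assumption)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)


def inspect_csv(text, delimiter=","):
    """Return (column names, record count) for CSV text without assuming a schema."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    rows = list(reader)
    return list(reader.fieldnames or []), len(rows)
```

A typical call would be `columns, n_rows = inspect_csv(fetch_text(DATASET_URL))`; inspecting `columns` first avoids guessing which fields hold the storylines and the knowledge graphs.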
Code availability
All Python code used for data processing, RAG pipeline, and knowledge graph extraction is available at https://github.com/jrcf7/crisesStorylinesRAG. The code for the Gradio application can be found at https://huggingface.co/spaces/roncmic/crisesStorylinesRAG. All data and code are also made available on Zenodo in a single repository, which can be downloaded at https://doi.org/10.5281/zenodo.18598183.
References
Jacot des Combes, H. et al. Hazard definition and classification review: Technical report (2025). United Nations Office for Disaster Risk Reduction https://doi.org/10.24948/2025.05 (2025).
De Angeli, S. et al. A multi-hazard framework for spatial-temporal impact analysis. International Journal of Disaster Risk Reduction 73, 102829, https://doi.org/10.1016/j.ijdrr.2022.102829 (2022).
Gill, J. C. & Malamud, B. D. Reviewing and visualizing the interactions of natural hazards. Reviews of Geophysics 52, 680–722, https://doi.org/10.1002/2013RG000445 (2014).
Šakić Trogrlić, R. et al. Challenges in assessing and managing multi-hazard risks: A european stakeholders perspective. Environmental Science & Policy 157, 103774, https://doi.org/10.1016/j.envsci.2024.103774 (2024).
Thomas, D. S. K., Jang, S. & Scandlyn, J. The chasms conceptual model of cascading disasters and social vulnerability: The covid-19 case example. International Journal of Disaster Risk Reduction 51, 101828, https://doi.org/10.1016/j.ijdrr.2020.101828 (2020).
Tilloy, A., Malamud, B., Winter, H. & Joly-Laugel, A. A review of quantification methodologies for multi-hazard interrelationships. Earth-Science Reviews 196, 102881, https://doi.org/10.1016/j.earscirev.2019.102881 (2019).
Gallina, V. et al. A review of multi-risk methodologies for natural hazards: Consequences and challenges for a climate change impact assessment. Journal of Environmental Management 168, 123–132, https://doi.org/10.1016/j.jenvman.2015.11.011 (2016).
Rokhideh, M., Fearnley, C. & Budimir, M. Multi-hazard early warning systems in the sendai framework for disaster risk reduction: Achievements, gaps, and future directions. International Journal of Disaster Risk Science 16, 103–116, https://doi.org/10.1007/s13753-025-00622-9 (2025).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems, 5999–6009 (2017).
Brown, T. B. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, vol. 2020-December (2020).
Lei, Z. et al. Harnessing large language models for disaster management: A survey. Findings of the Association for Computational Linguistics: ACL https://doi.org/10.18653/v1/2025.findings-acl.750 (2025).
Xu, F., Ma, J., Li, N. & Cheng, J. C. P. Large language model applications in disaster management: An interdisciplinary review. International Journal of Disaster Risk Reduction 127, 105642, https://doi.org/10.1016/j.ijdrr.2025.105642 (2025).
Jeba, S. M., Aurpa, T. T. & Adib, M. R. S. From facebook posts to news headlines: Using transformer models to predict post-disaster impact on mass media content. Social Network Analysis and Mining 14, 200 (2024).
Hou, J. & Xu, S. Near-real-time seismic human fatality information retrieval from social media with few-shot large-language models. In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, 1141–1147 (2022).
Balashankar, A. et al. Predicting food crises using news streams. Science Advances 9, eabm3449, https://doi.org/10.1126/sciadv.abm3449 (2023).
Delforge, D. et al. EM-DAT: The Emergency Events Database. International Journal of Disaster Risk Reduction 124, 105509, https://doi.org/10.1016/j.ijdrr.2025.105509 (2025).
Sodoge, J., Kuhlicke, C., Mahecha, M. D. & de Brito, M. M. Text mining uncovers the unique dynamics of socio-economic impacts of the 2018-2022 multi-year drought in germany. Natural Hazards and Earth System Sciences 24, 1757–1777, https://doi.org/10.5194/nhess-24-1757-2024 (2024).
Alencar, P. H. L., Sodoge, J., Paton, E. N. & de Brito, M. M. Flash droughts and their impacts-using newspaper articles to assess the perceived consequences of rapidly emerging droughts. Environmental Research Letters 19, 074048, https://doi.org/10.1088/1748-9326/ad58fa (2024).
Firmansyah, H. B. et al. Enhancing disaster response with automated text information extraction from social media images. In 2023 IEEE Ninth International Conference on Big Data Computing Service and Applications (BigDataService), 71–78, https://doi.org/10.1109/BigDataService58306.2023.00017 (2023).
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33, 9459–9474 (2020).
Ronco, M. et al. crisesStorylinesRAG [Data set]. Zenodo, https://doi.org/10.5281/zenodo.18598183 (2026).
Steinberger, R. et al. EMM: Supporting the analyst by turning multilingual text into structured data. In Transparenz aus Verantwortung: neue Herausforderungen für die digitale Datenanalyse (Erich Schmidt Verlag, 2017).
Ji, S., Pan, S., Cambria, E., Marttinen, P. & Yu, P. S. A Survey on Knowledge Graphs: Representation, Acquisition, and Applications. IEEE Transactions on Neural Networks and Learning Systems 33, https://doi.org/10.1109/TNNLS.2021.3070843 (2022).
Auer, S. et al. Towards a Knowledge Graph for Science. In Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, WIMS ’18 (Association for Computing Machinery, New York, NY, USA, 2018).
Hogan, A. et al. Knowledge graphs. ACM Computing Surveys 54, https://doi.org/10.1145/3447772 (2021).
Tiwari, S., Ortíz-Rodriguez, F., Abbés, S. B., Usip, P. U. & Hantach, R. Semantic AI in Knowledge Graphs (Taylor & Francis, Boca Raton, US, 2023).
Heath, T. & Bizer, C. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology 1, 1–121 (2011).
Steinberger, R., Pouliquen, B. & van der Goot, E. An introduction to the Europe Media Monitor family of applications (2013).
Touvron, H. et al. LLaMA: Open and Efficient Foundation Language Models (2023).
Dubey, A. et al. The Llama 3 Herd of Models (2024).
Yang, J., Han, S. C. & Poon, J. A survey on extraction of causal relations from natural language text. Knowl. Inf. Syst. 64, 1161–1186 (2022).
Yerkhassym, A., Pak, A. A., Akhmetov, I., Yelenov, A. & Gelbukh, A. On causality problem in natural language processing field. Computacion y Sistemas 26, 1549–1556 (2022).
Coletta, V. R. et al. Causal loop diagrams for supporting nature based solutions participatory design and performance assessment. Journal of Environmental Management 280, 111668 (2021).
Inam, A., Adamowski, J., Halbe, J. & Prasher, S. Using causal loop diagrams for the initialization of stakeholder engagement in soil salinity management in agricultural watersheds in developing countries: A case study in the rechna doab watershed, pakistan. Journal of Environmental Management 152, 251–267 (2015).
Dong, Q. et al. A survey on in-context learning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing https://doi.org/10.18653/v1/2024.emnlp-main.64 (2024).
Peng, C., Xia, F., Naseriparsa, M. et al. Knowledge graphs: Opportunities and challenges. Artificial Intelligence Review 56, 13071–13102 (2023).
Consoli, S., Coletti, P., Markov, P. V. et al. An epidemiological knowledge graph extracted from the world health organization’s disease outbreak news. Scientific Data 12, 970 (2025).
Bertolini, L., Hulsman, R., Consoli, S., Puertas Gallardo, A. & Ceresa, M. On constructing biomedical text-to-graph systems with large language models. In CEUR Workshop Proceedings, vol. 3747 (2024).
Yang, R., Zhu, J., Man, J., Fang, L. & Zhou, Y. Enhancing text-based knowledge graph completion with zero-shot large language models: A focus on semantic enhancement. Knowl. Based Syst. 300, 112155 (2023).
Antonucci, A., Piqué, G. & Zaffalon, M. Zero-shot causal graph extrapolation from text via llms. arXiv preprint (2023).
Yang, R. et al. Graphusion: A RAG Framework for Scientific Knowledge Graph Construction with a Global Perspective. In WWW ’25: The ACM Web Conference, https://dl.acm.org/doi/10.1145/3701716.3717821 (2025).
Long, S., Schuster, T. & Piché, A. Can large language models build causal graphs? arXiv preprint (2023).
Samarajeewa, C., De Silva, D., Osipov, E., Alahakoon, D. & Manic, M. Causal reasoning in large language models using causal graph retrieval augmented generation. In 2024 16th International Conference on Human System Interaction (HSI), 1–6, https://doi.org/10.1109/HSI61632.2024.10613566 (2024).
Jiralerspong, T., Chen, X., More, Y., Shah, V. & Bengio, Y. Efficient causal graph discovery using large language models. arXiv preprint (2024).
Krippendorff, K. Content analysis: An introduction to its methodology (1980).
Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 37–46 (1960).
Author information
Contributions
Michele Ronco: Conceptualization, data curation, formal analysis, software development, writing - original draft preparation. Luca Bandelli: Conceptualization, data curation, software development, writing. Lorenzo Bertolini: Conceptualization, data curation, supervision - review and editing. Sergio Consoli: Conceptualization, data curation, software development, writing. Damien Delforge: Writing - review. Alessio Spadaro: Conceptualization, review. Marco Verile: Conceptualization, supervision. Christina Corbane: Conceptualization, supervision, writing - review and editing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ronco, M., Bandelli, L., Bertolini, L. et al. Disaster Storylines and Knowledge Graphs from Global News with Large Language Models and Retrieval-Augmented Generation. Sci Data (2026). https://doi.org/10.1038/s41597-026-07036-2