Abstract
Landslides are a major geological hazard causing significant casualties and economic losses. Reliable risk assessment requires high-quality spatiotemporal event data, yet no publicly available landslide catalogue with fine-grained precision exists for China. To address this, we developed a landslide event catalogue for mainland China covering 2008–2024, based on news reports. The dataset was generated via large-scale web crawling, information extraction using an open-source large language model (LLM), event deduplication, geocoding, and multi-stage validation. It contains 1,582 events with detailed spatiotemporal attributes, some with minute-level temporal precision and spatial resolution down to the county, village, or specific reported site. Evaluation shows that, while casualty-related information is less accurate, the LLM reliably captures key attributes such as time, location, and triggering factors. This demonstrates the feasibility of using LLMs to extract critical landslide data from news reports. Compared with existing catalogues, our dataset offers more events and improved spatiotemporal accuracy, providing a valuable resource for landslide hazard assessment, early warning model development, and disaster risk management in China.
Data availability
The landslide event catalogue is available on figshare at https://doi.org/10.6084/m9.figshare.29603420.
Code availability
The code used in this study is implemented in Python and publicly available at https://doi.org/10.6084/m9.figshare.31298212. It includes scripts for extracting landslide-related information from news reports using large language models, identifying and removing duplicate landslide event records, and performing geocoding to assign spatial coordinates to landslide events.
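To illustrate the kind of duplicate removal described above, here is a minimal sketch. It is not the authors' released code. It assumes that two news-derived records describe the same event when they share a county and their dates fall within a small window. The `EventRecord` fields, the `deduplicate` function, and the one-day default window are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EventRecord:
    # Hypothetical fields; the published catalogue may use different names.
    event_date: date
    county: str
    source_url: str


def deduplicate(records, max_day_gap=1):
    """Keep one record per county and time window.

    A record is treated as a duplicate when it has the same county as the
    most recently kept record and its date is at most max_day_gap days later.
    """
    kept = []
    last_kept = {}  # county -> date of the most recently kept record
    for rec in sorted(records, key=lambda r: (r.county, r.event_date)):
        previous = last_kept.get(rec.county)
        if previous is not None and (rec.event_date - previous).days <= max_day_gap:
            continue  # same county, close in time: assume a repeat report
        kept.append(rec)
        last_kept[rec.county] = rec.event_date
    return kept
```

A real pipeline would likely need a stricter rule, for example comparing village names or coordinates, because distinct landslides can occur in the same county on the same day.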
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 42571090) and the Natural Science Foundation of Higher Education Institutions of Jiangsu Province (Grant No. 25KJB170011).
Contributions
Binru Zhao led the design of the catalogue structure, implemented the data processing workflow, and reviewed and revised the manuscript prior to final submission. Zhenxia Liu contributed to large language model–based information extraction, record processing and revisions. Lulu Zhang contributed to data collection, data analysis and the initial drafting of the manuscript. Wenchao Ma and Jian Wang contributed to data curation through manual review and revision of the extracted landslide records prior to catalogue finalization. Qiang Sun provided independent validation data for quality assessment of the catalogue. Wen Luo, Zhaoyuan Yu, and Linwang Yuan provided overall supervision and guidance on dataset design.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhao, B., Zhang, L., Liu, Z. et al. A high-precision catalogue of landslide events in China based on news text mining with large language model. Sci Data (2026). https://doi.org/10.1038/s41597-026-07066-w
DOI: https://doi.org/10.1038/s41597-026-07066-w


