Abstract
Iron, the most abundant element on Earth by mass (34.6%), exists primarily as iron minerals because of its inherent reactivity. The transformation of iron mineral phases under changing environmental conditions remains an important research focus owing to its geological, environmental, and industrial significance. Yet the complexity of the system prevents the development of a universal principle for interpreting phase transformation behavior across diverse environmental conditions. An alternative is to use data-driven methods to obtain approximate predictions. However, data on iron-containing phase transformations remain fragmented because they lack standardized integration, which hinders related research. To address this gap, we developed an automated pipeline that extracts and curates iron-containing phase transformation pathways, creating the first text-mined dataset of 11,241 pathways. Each record includes the precursor and product phases, reaction category, procedures and associated parameters, extent of transformation, and reaction equations, providing a comprehensive foundation for data-driven research.
Data availability
The dataset is available on figshare at https://doi.org/10.6084/m9.figshare.30759095. It contains the main file pathways.jsonl as well as extended_mineral_glossary.json for indexing mineral phases.
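As a minimal sketch of how the main file might be inspected, the Python snippet below reads a JSON Lines file and counts how often each top-level key occurs. It assumes that pathways.jsonl stores one JSON object per line, as the file extension suggests; the record schema is not described in this section, so no field names are assumed, and the helper names iter_records and count_top_level_keys are illustrative rather than part of the released code.

```python
import json
from collections import Counter


def iter_records(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_top_level_keys(path):
    """Count how many records contain each top-level key.

    Assumes every line parses to a JSON object (a Python dict).
    """
    counts = Counter()
    for record in iter_records(path):
        counts.update(record.keys())
    return counts
```

After downloading the dataset from figshare, calling count_top_level_keys("pathways.jsonl") gives a quick overview of which fields are present and how complete they are before any further analysis.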
Code availability
All code developed in this work is publicly available at https://github.com/Laaery/feptp_pipeline. The best model checkpoint for topic filtering is available at https://huggingface.co/Laerry/feptp-topic-filter.
Acknowledgements
This work was supported by the Key Fund of the National Natural Science Foundation of China (No. 22336006), the Youth Fund of the National Natural Science Foundation of China (No. 22306204), and the Major Program of the National Natural Science Foundation of China (Nos. 22494680 and 22494681).
Author information
Contributions
L.L. designed the framework, developed the software, analyzed the data, and wrote the manuscript. C.R. collected search terms, reviewed, and edited the manuscript. Y.X. and J.N. performed manual inspection of the dataset. C.Q. and X.L. conducted the accuracy evaluation and dataset compilation. H.W. acquired funds, curated the dataset, reviewed and edited the manuscript. Z.L. supervised the project, acquired funds, and reviewed the manuscript. All authors discussed the results and contributed to the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lin, L., Ren, C., Xiao, Y. et al. FePTP: A text-mined dataset of transformation pathways among iron-containing phases. Sci Data (2026). https://doi.org/10.1038/s41597-026-07067-9
DOI: https://doi.org/10.1038/s41597-026-07067-9


