Scientific Data
FePTP: A text-mined dataset of transformation pathways among iron-containing phases
  • Data Descriptor
  • Open access
  • Published: 26 March 2026

  • Le Lin1,2,3,
  • Changhai Ren2,3,
  • Yang Xiao2,3,
  • Jingyu Nie2,3,
  • Chongchong Qi2,3,
  • Xiaoqin Li1,
  • Han Wang2,3 &
  • Zhang Lin2,3 (ORCID: orcid.org/0000-0002-6600-2055)

Scientific Data (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Chemical engineering
  • Cheminformatics
  • Geochemistry
  • Sustainability

Abstract

Iron, the most abundant element on Earth by mass (34.6%), primarily exists as iron minerals due to its inherent reactivity. The study of iron mineral phase transformations under changing environmental conditions remains an important research focus due to its geological, environmental, and industrial significance. Yet, the complexity of the system prevents the development of a universal principle to interpret phase transformation behaviors across diverse environmental conditions. An alternative approach is to employ data-driven methods to obtain approximate predictive results. Nevertheless, the data concerning iron-containing phase transformations remain fragmented due to a lack of standardized integration, hindering the advancement of related research. To address this gap, we have developed an automated pipeline that extracts and curates iron-containing phase transformation pathways, creating the first text-mined dataset of 11,241 pathways. Each record includes the precursor/product phases, reaction category, procedures, and associated parameters, as well as the extent of transformation and reaction equations, providing a comprehensive foundation for advancing data-driven research.

Data availability

The dataset is available on figshare at https://doi.org/10.6084/m9.figshare.30759095. It contains the main file pathways.jsonl as well as extended_mineral_glossary.json for indexing mineral phases.
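
As a minimal sketch of how the main file might be inspected after download, the Python snippet below reads a JSON Lines file record by record and counts the top-level keys it encounters. It assumes only that pathways.jsonl holds one JSON object per line, as the .jsonl extension suggests; it does not assume any particular field names, since the record schema is not listed here. The local path and the helper names iter_jsonl and count_top_level_keys are illustrative, not part of the released code.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err
            # Lines that parse to something other than an object are skipped.
            if isinstance(record, dict):
                yield record


def count_top_level_keys(path: Path) -> Counter:
    """Count how often each top-level key occurs, without assuming a schema."""
    counts: Counter = Counter()
    for record in iter_jsonl(path):
        counts.update(record.keys())
    return counts


if __name__ == "__main__":
    # Assumed local path: the file downloaded from the figshare record.
    data_file = Path("pathways.jsonl")
    if data_file.exists():
        for key, n in count_top_level_keys(data_file).most_common():
            print(f"{key}: {n}")
```

Listing the keys first is a cautious way to discover the actual field names before writing any analysis that depends on them.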

Code availability

All code created in this work is publicly available at https://github.com/Laaery/feptp_pipeline. The best model checkpoint for topic filtering is available at https://huggingface.co/Laerry/feptp-topic-filter.

Acknowledgements

This work was supported by the Key Fund of the National Natural Science Foundation of China (No. 22336006), the Youth Fund of the National Natural Science Foundation of China (No. 22306204) and the Major Program of the National Natural Science Foundation of China (Nos. 22494680 and 22494681).

Author information

Authors and Affiliations

  1. School of Environment and Energy, Guangdong Provincial Key Laboratory of Solid Wastes Pollution Control and Recycling, South China University of Technology, Guangzhou, 510006, China

    Le Lin & Xiaoqin Li

  2. School of Metallurgy and Environment, Central South University, Changsha, 410083, China

    Le Lin, Changhai Ren, Yang Xiao, Jingyu Nie, Chongchong Qi, Han Wang & Zhang Lin

  3. Chinese National Engineering Research Center for Control & Treatment of Heavy Metal Pollution, Changsha, 410083, China

    Le Lin, Changhai Ren, Yang Xiao, Jingyu Nie, Chongchong Qi, Han Wang & Zhang Lin

Contributions

L.L. designed the framework, developed the software, analyzed the data, and wrote the manuscript. C.R. collected search terms and reviewed and edited the manuscript. Y.X. and J.N. performed manual inspection of the dataset. C.Q. and X.L. conducted the accuracy evaluation and dataset compilation. H.W. acquired funds, curated the dataset, and reviewed and edited the manuscript. Z.L. supervised the project, acquired funds, and reviewed the manuscript. All authors discussed the results and contributed to the final manuscript.

Corresponding author

Correspondence to Han Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Lin, L., Ren, C., Xiao, Y. et al. FePTP: A text-mined dataset of transformation pathways among iron-containing phases. Sci Data (2026). https://doi.org/10.1038/s41597-026-07067-9

  • Received: 17 July 2025

  • Accepted: 10 March 2026

  • Published: 26 March 2026

  • DOI: https://doi.org/10.1038/s41597-026-07067-9

Scientific Data (Sci Data)

ISSN 2052-4463 (online)
