ReactionSeek: LLM-powered literature data mining and knowledge discovery in organic synthesis
  • Article
  • Open access
  • Published: 02 March 2026

  • Jiawei Li (ORCID: 0009-0007-8858-8310)1,2,
  • Minzhou Li1,2,
  • Qi Yang (ORCID: 0009-0005-3198-1619)1,3 &
  • Sanzhong Luo (ORCID: 0000-0001-8714-4047)1,2,3

Nature Communications (2026)


We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Cheminformatics
  • Organic chemistry

Abstract

The application of artificial intelligence (AI) to chemical discovery is critically hindered by the inaccessibility of data locked within unstructured scientific literature. Existing data acquisition methods are often manual, limited in scope, or require extensive custom software development, impeding progress in leveraging AI for chemistry. Here, we introduce ReactionSeek, a framework that synergistically combines large language models (LLMs) with established cheminformatics tools to automate multi-modal data mining from the organic synthesis literature. Through sophisticated prompt engineering with minimal custom code, ReactionSeek extracts and standardizes complex textual, graphical, and semantic chemical information. We validate this framework on the century-spanning Organic Syntheses collection, achieving over 95% precision and recall for key reaction parameters. This enables three applications: the generation of a large, AI-ready dataset; an interactive Synthetic Chatbot (SynChat) for natural language querying of chemical data; and an autonomous analysis that revealed decades-long trends in catalysis. ReactionSeek thus provides a general solution to the data curation bottleneck, representing a step forward for AI-driven archive mining and knowledge discovery across the chemical sciences.
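The prompt-driven extraction-and-standardization step described in the abstract can be sketched as follows. This is an illustrative sketch only: the field names and JSON schema below are assumptions for demonstration, not the authors' actual prompts or schema (those are in the ReactionSeek repository), and the LLM reply is mocked rather than produced by a model call.

```python
import json

# Hypothetical field schema for one reaction record; the paper's real schema
# and prompt templates live in the ReactionSeek codebase.
FIELDS = ["reactants", "reagents", "solvents", "temperature", "yield"]

def build_extraction_prompt(procedure_text: str) -> str:
    """Build a prompt asking an LLM to emit structured JSON for one procedure."""
    return (
        "Extract the following fields from the organic synthesis procedure "
        f"below and answer with JSON only, using keys {FIELDS}. "
        "Use null for any field not stated in the text.\n\n"
        f"Procedure:\n{procedure_text}"
    )

def parse_extraction(llm_reply: str) -> dict:
    """Parse the LLM's JSON reply into a normalized record.

    Keys outside the schema are dropped; missing keys become None, so
    downstream code always sees the same record shape.
    """
    record = json.loads(llm_reply)
    return {key: record.get(key) for key in FIELDS}

# Example with a mocked model reply (no API call is made here):
mock_reply = (
    '{"reactants": ["benzaldehyde"], "reagents": ["NaBH4"], '
    '"solvents": ["MeOH"], "temperature": "0 C", "yield": "92%"}'
)
record = parse_extraction(mock_reply)
```

A record missing a field (for example, no temperature reported) still parses cleanly, with that field set to None, which simplifies later aggregation.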


Data availability

The mined data, code, and evaluation datasets generated in this study have been deposited in the Zenodo database (https://doi.org/10.5281/zenodo.18523416).

The raw textual data utilized in this study are available under restricted access owing to copyright and licensing restrictions; access can be obtained by subscription to the original publishers. The processed evaluation data are available in the GitHub repository (https://github.com/DeepSynthesis/ReactionSeek.git) under the file path examples/OrganicSyntheses/reaction_extract/evaluate_raw.json. The literature list used in this study is available in the same repository under the file path examples/OrganicSyntheses/OrgS_cite.csv. The Source Data underlying the main text figures are provided in the cited Zenodo repository54.
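The reported precision and recall figures (>95% for key reaction parameters) can be reproduced against the deposited evaluation file with field-level scoring along the following lines. The record format here is a hypothetical stand-in for the actual structure of evaluate_raw.json, which readers should check in the repository.

```python
def field_precision_recall(predicted: list, gold: list, field: str):
    """Micro-averaged precision/recall for one field across paired records.

    Each extracted value is counted as a true positive if it appears in the
    gold annotation for the same record, a false positive otherwise; gold
    values never extracted count as false negatives.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        pred_vals = set(pred.get(field) or [])
        gold_vals = set(ref.get(field) or [])
        tp += len(pred_vals & gold_vals)
        fp += len(pred_vals - gold_vals)
        fn += len(gold_vals - pred_vals)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example: one spurious solvent in the second record.
pred = [{"solvents": ["MeOH"]}, {"solvents": ["THF", "H2O"]}]
gold = [{"solvents": ["MeOH"]}, {"solvents": ["THF"]}]
p, r = field_precision_recall(pred, gold, "solvents")
```

Micro-averaging is one reasonable choice here; per-record macro-averaging would weight short and long procedures equally instead.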

Code availability

The source code developed for this study, including all scripts and algorithms, has been deposited in a public GitHub repository (https://github.com/DeepSynthesis/ReactionSeek.git)54. All code is available under the MIT License. The primary results reported in this work were generated using the GLM-4 model; the vision language model for image recognition tasks was based on GLM-4V. A comprehensive list of all large language models used for the multi-model evaluation is provided in the main text. Our framework is designed to be model-agnostic; the specific models tested were chosen to benchmark performance and guide users in model selection. The provided codebase includes documentation to assist users in configuring and integrating their own models, including proprietary models requiring user-provided API keys or other services compatible with the OpenAI API format. For the chemical structure recognition task, the InDraw OCSR tool34 was employed; users may substitute any alternative chemical structure recognition tool suitable for their requirements.
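Because the framework accepts any service speaking the OpenAI API format, swapping models reduces to pointing the client at a different endpoint and model name. A minimal sketch of building such a request body with only the standard library is shown below; the model name and prompts are placeholders, not values taken from the ReactionSeek configuration.

```python
import json

def build_chat_request(model: str, system_prompt: str, user_text: str) -> str:
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions call.

    Any endpoint implementing this request shape (including GLM-4 served
    behind an OpenAI-compatible gateway) can consume the payload.
    """
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        # Deterministic decoding is a common choice for extraction tasks.
        "temperature": 0.0,
    }
    return json.dumps(payload)

body = build_chat_request(
    "glm-4",  # placeholder: substitute any OpenAI-compatible model name
    "You extract reaction data as JSON.",
    "To a stirred solution of benzaldehyde in MeOH was added NaBH4...",
)
```

In practice this body would be POSTed with an Authorization header carrying the user's API key; only the payload construction is shown here.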

References

  1. Jiang, X. et al. Artificial intelligence and automation to power the future of chemistry. Cell Rep. Phys. Sci. 5, 102049 (2024).
  2. Hong, X. et al. AI for organic and polymer synthesis. Sci. China Chem. 67, 2461–2496 (2024).
  3. Tan, Z., Yang, Q. & Luo, S. AI molecular catalysis: where are we now? Org. Chem. Front. 12, 2759–2776 (2025).
  4. Skoraczyński, G. et al. Predicting the outcomes of organic reactions via machine learning: are current descriptors sufficient? Sci. Rep. 7, 3582 (2017).
  5. Mercado, R., Kearnes, S. M. & Coley, C. W. Data sharing in chemistry: lessons learned and a case for mandating structured reaction data. J. Chem. Inf. Model. 63, 4253–4265 (2023).
  6. Strieth-Kalthoff, F. et al. Artificial intelligence for retrosynthetic planning needs both data and expert knowledge. J. Am. Chem. Soc. 146, 11005–11017 (2024).
  7. Ahneman, D. T., Estrada, J. G., Lin, S., Dreher, S. D. & Doyle, A. G. Predicting reaction performance in C–N cross-coupling using machine learning. Science 360, 186–190 (2018).
  8. Kearnes, S. M. et al. The open reaction database. J. Am. Chem. Soc. 143, 18820–18826 (2021).
  9. Su, Y. et al. Automation and machine learning augmented by large language models in a catalysis study. Chem. Sci. 15, 12200–12233 (2024).
  10. Krallinger, M., Rabal, O., Lourenço, A., Oyarzabal, J. & Valencia, A. Information retrieval and text mining technologies for chemistry. Chem. Rev. 117, 7673–7761 (2017).
  11. Dagdelen, J. et al. Structured information extraction from scientific text with large language models. Nat. Commun. 15, 1418 (2024).
  12. Kang, Y. et al. Harnessing large language models to collect and analyze metal–organic framework property data set. J. Am. Chem. Soc. 147, 3943–3958 (2025).
  13. Leong, S. X., Pablo-García, S., Zhang, Z. & Aspuru-Guzik, A. Automated electrosynthesis reaction mining with multimodal large language models (MLLMs). Chem. Sci. 15, 17881–17891 (2024).
  14. Wang, X., Huang, L., Xu, S. & Lu, K. How does a generative large language model perform on domain-specific information extraction? A comparison between GPT-4 and a rule-based method on band gap extraction. J. Chem. Inf. Model. 64, 7895–7904 (2024).
  15. Zhang, W. et al. Fine-tuning large language models for chemical text mining. Chem. Sci. 15, 10600–10611 (2024).
  16. Zheng, Z. et al. Image and data mining in reticular chemistry powered by GPT-4V. Digit. Discov. 3, 491–501 (2024).
  17. Zheng, Z., Zhang, O., Borgs, C., Chayes, J. T. & Yaghi, O. M. ChatGPT chemistry assistant for text mining and the prediction of MOF synthesis. J. Am. Chem. Soc. 145, 18048–18062 (2023).
  18. Ruan, Y. et al. An automatic end-to-end chemical synthesis development platform powered by large language models. Nat. Commun. 15, 10160 (2024).
  19. Ai, Q., Meng, F., Shi, J., Pelkie, B. & Coley, C. W. Extracting structured data from organic synthesis procedures using a fine-tuned large language model. Digit. Discov. 3, 1822–1831 (2024).
  20. Leong, S. X., Pablo-García, S., Wong, B. & Aspuru-Guzik, A. MERMaid: universal multimodal mining of chemical reactions from PDFs using vision-language models. Matter 8, 102331 (2025).
  21. Walker, N. et al. Extracting structured seed-mediated gold nanorod growth procedures from scientific text with LLMs. Digit. Discov. 2, 1768–1782 (2023).
  22. Chen, Y. et al. A multi-agent system enables versatile information extraction from the chemical literature. Preprint at https://ui.adsabs.harvard.edu/abs/2025arXiv250720230C (2025).
  23. Fan, V. et al. OpenChemIE: an information extraction toolkit for chemistry literature. J. Chem. Inf. Model. 64, 5521–5534 (2024).
  24. Rajan, K., Zielesny, A. & Steinbeck, C. DECIMER 1.0: deep learning for chemical image recognition using transformers. J. Cheminform. 13, 61 (2021).
  25. Staker, J., Marshall, K., Abel, R. & McQuaw, C. M. Molecular structure extraction from documents using deep learning. J. Chem. Inf. Model. 59, 1017–1029 (2019).
  26. Wilary, D. M. & Cole, J. M. ReactionDataExtractor: a tool for automated extraction of information from chemical reaction schemes. J. Chem. Inf. Model. 61, 4962–4974 (2021).
  27. Wilary, D. M. & Cole, J. M. ReactionDataExtractor 2.0: a deep learning approach for data extraction from chemical reaction schemes. J. Chem. Inf. Model. 63, 6053–6067 (2023).
  28. Guo, J. et al. Automated chemical reaction extraction from scientific literature. J. Chem. Inf. Model. 62, 2035–2045 (2022).
  29. Vangala, S. R. et al. Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature. J. Cheminform. 16, 131 (2024).
  30. Kende, A. S. & Freeman, J. P. Organic Syntheses (John Wiley and Sons, 2003).
  31. Clevert, D.-A., Le, T., Winter, R. & Montanari, F. Img2Mol – accurate SMILES recognition from molecular graphical depictions. Chem. Sci. 12, 14174–14181 (2021).
  32. Oldenhof, M., Arany, A., Moreau, Y. & Simm, J. ChemGrapher: optical graph recognition of chemical compounds by deep learning. J. Chem. Inf. Model. 60, 4506–4517 (2020).
  33. Qian, Y. et al. MolScribe: robust molecular structure recognition with image-to-graph generation. J. Chem. Inf. Model. 63, 1925–1934 (2023).
  34. Integle Inc. InDraw. https://indrawforweb.integle.com (2025).
  35. Li, Y. et al. The ChEMU 2022 evaluation campaign: information extraction in chemical patents. In Advances in Information Retrieval 400–407 (Springer, 2022).
  36. He, J. et al. ChEMU 2020: natural language processing methods are effective for information extraction from chemical patents. Front. Res. Metr. Anal. 6, 654438 (2021).
  37. Jang, Y. J. et al. Context-aware named entity recognition and relation extraction with domain-specific language model. In Conference and Labs of the Evaluation Forum (CEUR-WS.org, 2022).
  38. Li, Y. et al. Overview of ChEMU 2022 evaluation campaign: information extraction in chemical patents. In Experimental IR Meets Multilinguality, Multimodality, and Interaction 521–540 (Springer-Verlag, 2022).
  39. Jablonka, K. M. et al. 14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon. Digit. Discov. 2, 1233–1250 (2023).
  40. Kim, S. et al. PubChem 2025 update. Nucleic Acids Res. 53, D1516–D1525 (2025).
  41. Lowe, D. M., Corbett, P. T., Murray-Rust, P. & Glen, R. C. Chemical name to structure: OPSIN, an open source solution. J. Chem. Inf. Model. 51, 739–753 (2011).
  42. Zeng, A. et al. GLM-130B: an open bilingual pre-trained model. In Proc. 11th Int. Conf. Learn. Represent. (2023).
  43. Machlab, D. & Battle, R. LLM in-context recall is prompt dependent. Preprint at https://arxiv.org/abs/2404.08865 (2024).
  44. Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
  45. OpenAI et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
  46. Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).
  47. Jiang, A. Q. et al. Mixtral of experts. Preprint at https://arxiv.org/abs/2401.04088 (2024).
  48. Yang, A. et al. Baichuan 2: open large-scale language models. Preprint at https://arxiv.org/abs/2309.10305 (2023).
  49. Yang, A. et al. Qwen2 technical report. Preprint at https://arxiv.org/abs/2407.10671 (2024).
  50. DeepSeek-AI et al. DeepSeek-V3 technical report. Preprint at https://arxiv.org/abs/2412.19437 (2024).
  51. Google DeepMind. Gemini 2.5 Pro. https://deepmind.google/models/gemini/pro/ (2025).
  52. Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 33, 9459–9474 (2020).
  53. Chroma. https://www.trychroma.com/ (2025).
  54. Li, J., Li, M., Yang, Q. & Luo, S. DeepSynthesis/ReactionSeek: source code and data for Nature Communications. Zenodo https://doi.org/10.5281/zenodo.18523416 (2026).


Acknowledgements

We thank the Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China (JYB2025XDXM414), the Natural Science Foundation of China (22031006, 22271192, 22373056 and 22393891), the National Key R&D Program of China (2023YFA1506401, 2023YFA1506402) and the Haihe Laboratory of Sustainable Chemical Transformations (25HHWCSS00032, 24HHWCSS00018) for financial support. This work was supported by the High-Performance Computing Center, Tsinghua University, and the robotic AI-Scientist platform of the Chinese Academy of Sciences.

Author information

Authors and Affiliations

  1. Center of Basic Molecular Science, Department of Chemistry, Tsinghua University, Beijing, China

    Jiawei Li, Minzhou Li, Qi Yang & Sanzhong Luo

  2. Beijing Key Laboratory of Intelligent Machine for Organic Synthesis, Department of Chemistry, Tsinghua University, Beijing, China

    Jiawei Li, Minzhou Li & Sanzhong Luo

  3. Haihe Laboratory of Sustainable Chemical Transformations, Tianjin, China

    Qi Yang & Sanzhong Luo

Authors
  1. Jiawei Li
  2. Minzhou Li
  3. Qi Yang
  4. Sanzhong Luo

Contributions

S.L. and Q.Y. conceived and directed the project. J.L. designed the research workflow and developed the programs and prompt engineering for the data mining framework. M.L. built and deployed SynChat. All authors contributed to the writing and revision of the manuscript.

Corresponding authors

Correspondence to Qi Yang or Sanzhong Luo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Masaharu Yoshioka, Yanyan Xu, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Transparent Peer Review file (PDF)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Li, J., Li, M., Yang, Q. et al. ReactionSeek: LLM-powered literature data mining and knowledge discovery in organic synthesis. Nat Commun (2026). https://doi.org/10.1038/s41467-026-70180-1

Download citation

  • Received: 26 July 2025

  • Accepted: 20 February 2026

  • Published: 02 March 2026

  • DOI: https://doi.org/10.1038/s41467-026-70180-1

