Abstract
With the fast-paced development of artificial intelligence, large language models are increasingly used to tackle scientific challenges. A critical step in this process is converting domain-specific data into a sequence of tokens for language modelling. In chemistry, molecules are often represented by molecular linear notations, and chemical reactions are depicted as sequence pairs of reactants and products. However, this approach does not capture the atomic and bond changes that occur during reactions. Here, we present ReactSeq, a reaction description language that defines molecular editing operations for step-by-step chemical transformation. Language models based on ReactSeq consistently excel across all retrosynthesis-prediction benchmarks and demonstrate promising emergent abilities in human-in-the-loop interaction and explainable artificial intelligence. Moreover, ReactSeq has allowed us to obtain universal and reliable representations of chemical reactions, which enable navigation of the reaction space and aid in the recommendation of experimental procedures and the prediction of reaction yields. We foresee that ReactSeq can serve as a bridge to narrow the gap between chemistry and artificial intelligence.
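The sequence-pair encoding that the abstract contrasts with ReactSeq can be sketched as follows. This is an illustrative example only: the reaction SMILES and the regex tokenizer are common conventions in reaction language modelling, not the paper's own data or code.

```python
import re

# Conventional sequence-pair view of a reaction:
# reactant SMILES >> product SMILES (illustrative esterification,
# acetic acid + ethanol -> ethyl acetate; not from the paper's datasets).
reaction = "CC(=O)O.OCC>>CC(=O)OCC"

# A widely used regex-based SMILES tokenizer for reaction language models.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}|>>|[A-Za-z]|\d|[=#\-\+\(\)\.\/\\@])"
)

def tokenize(smiles: str) -> list[str]:
    """Split a reaction SMILES string into model-ready tokens."""
    return SMILES_TOKEN.findall(smiles)

tokens = tokenize(reaction)
# The ">>" token separates reactants from products; note that nothing in
# this token sequence explicitly marks which atoms or bonds change —
# the limitation that motivates an edit-operation language like ReactSeq.
```

Concatenating the tokens reproduces the original string, so the encoding is lossless, but the atom-level correspondence between the two sides of `>>` remains implicit.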
Data availability
All datasets used in this work are available at https://github.com/jiachengxiong/ReactSeq, https://drive.google.com/drive/folders/1a6NL5apcP_7isY3HccLjkSsjJGwp_FwD and via Zenodo at https://doi.org/10.5281/zenodo.13338263 (ref. 51). Source data are provided with this paper.
Code availability
All code for generating and transforming ReactSeq, as well as the code for model training and inference, is available at https://github.com/jiachengxiong/ReactSeq and via Zenodo at https://doi.org/10.5281/zenodo.13338263 (ref. 51). The webpage for our prompt learning model is available at https://huggingface.co/spaces/Oopstom/ReactSeq.
References
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
Grisoni, F. Chemical language models for de novo drug design: challenges and opportunities. Curr. Opin. Struct. Biol. 79, 102527 (2023).
Skinnider, M. A., Stacey, R. G., Wishart, D. S. & Foster, L. J. Chemical language models enable navigation in sparsely populated chemical space. Nat. Mach. Intell. 3, 759–770 (2021).
Moret, M. et al. Leveraging molecular structure and bioactivity with chemical language models for de novo drug design. Nat. Commun. 14, 114 (2023).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach. Learn. Sci. Technol. 1, 045024 (2020).
Kuenneth, C. & Ramprasad, R. polyBERT: a chemical language model to enable fully machine-driven ultrafast polymer informatics. Nat. Commun. 14, 4099 (2023).
Krenn, M. et al. SELFIES and the future of molecular string representations. Patterns 3, 100588 (2022).
Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
Liu, B. et al. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Cent. Sci. 3, 1103–1113 (2017).
Sun, Y. & Sahinidis, N. V. Computer-aided retrosynthetic design: fundamentals, tools, and outlook. Curr. Opin. Chem. Eng. 35, 100721 (2022).
Wang, X. et al. RetroPrime: a diverse, plausible and transformer-based method for single-step retrosynthesis predictions. Chem. Eng. J. 420, 129845 (2021).
Thakkar, A. et al. Unbiasing retrosynthesis language models with disconnection prompts. ACS Cent. Sci. 9, 1488–1498 (2023).
Huang, T. & Li, Y. Current progress, challenges, and future perspectives of language models for protein representation and protein design. The Innovation 4, 100446 (2023).
Min, B. et al. Recent advances in natural language processing via large pre-trained language models: a survey. ACM Comput. Surv. 56, 1–40 (2023).
Schwaller, P., Hoover, B., Reymond, J.-L., Strobelt, H. & Laino, T. Extraction of organic chemistry grammar from unsupervised learning of chemical reactions. Sci. Adv. 7, eabe4166 (2021).
Schwaller, P. et al. Mapping the space of chemical reactions using attention-based neural networks. Nat. Mach. Intell. 3, 144–152 (2021).
Strieth-Kalthoff, F. et al. Artificial intelligence for retrosynthetic planning needs both data and expert knowledge. J. Am. Chem. Soc. 146, 11005–11017 (2024).
Nugmanov, R. I. et al. CGRtools: Python library for molecule, reaction, and condensed graph of reaction processing. J. Chem. Inf. Model. 59, 2516–2521 (2019).
Shi, C., Xu, M., Guo, H., Zhang, M. & Tang, J. A graph to graphs framework for retrosynthesis prediction. In Proc. 37th International Conference on Machine Learning (eds Blei, D. et al.) 8818–8827 (PMLR, 2020).
Yan, C. et al. Retroxpert: decompose retrosynthesis prediction like a chemist. Adv. Neural Inf. Process. Syst. 33, 11248–11258 (2020).
Somnath, V. R., Bunne, C., Coley, C., Krause, A. & Barzilay, R. Learning graph models for retrosynthesis prediction. Adv. Neural Inf. Process. Syst. 34, 9405–9415 (2021).
Zhong, W., Yang, Z. & Chen, C. Y.-C. Retrosynthesis prediction using an end-to-end graph generative architecture for molecular graph editing. Nat. Commun. 14, 3009 (2023).
Wang, Y. et al. Retrosynthesis prediction with an interpretable deep-learning framework based on molecular assembly tasks. Nat. Commun. 14, 6155 (2023).
Saebi, M. et al. On the use of real-world datasets for reaction yield prediction. Chem. Sci. 14, 4997–5005 (2023).
Lu, J. & Zhang, Y. Unified deep learning model for multitask reaction predictions with explanation. J. Chem. Inf. Model. 62, 1376–1387 (2022).
Wan, Y., Hsieh, C.-Y., Liao, B. & Zhang, S. Retroformer: pushing the limits of end-to-end retrosynthesis transformer. In Proc. 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 22475–22490 (PMLR, 2022).
Dong, J. et al. Ketones and aldehydes as alkyl radical equivalents for C-H functionalization of heteroarenes. Sci. Adv. 5, eaax9955 (2019).
Peltzer, R. M., Gauss, J., Eisenstein, O. & Cascella, M. The Grignard reaction–unraveling a chemical puzzle. J. Am. Chem. Soc. 142, 2984–2994 (2020).
Heravi, M. M., Hashemi, E. & Nazari, N. Negishi coupling: an easy progress for C–C bond construction in total synthesis. Mol. Divers. 18, 441–472 (2014).
Kotha, S., Lahiri, K. & Kashinath, D. Recent applications of the Suzuki–Miyaura cross-coupling reaction in organic synthesis. Tetrahedron 58, 9633–9695 (2002).
Zhou, J., Zhao, Z. & Shibata, N. Transition-metal-free silylboronate-mediated cross-couplings of organic fluorides with amines. Nat. Commun. 14, 1847 (2023).
Vulovic, B., Cinderella, A. P. & Watson, D. A. Palladium-catalyzed cross-coupling of monochlorosilanes and Grignard reagents. ACS Catal. 7, 8113–8117 (2017).
Xu, W. Q., Xu, X. H. & Qing, F. L. Synthesis and properties of CF3(OCF3)CH-substituted arenes and alkenes. Chin. J. Chem. 38, 847–854 (2020).
Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Probst, D., Schwaller, P. & Reymond, J.-L. Reaction classification and yield prediction using the differential reaction fingerprint DRFP. Digit. Discov. 1, 91–97 (2022).
Schneider, N., Lowe, D. M., Sayle, R. A. & Landrum, G. A. Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity. J. Chem. Inf. Model. 55, 39–53 (2015).
Kajino, M., Hasuoka, A. & Nishida, H. 1-Heterocyclylsulfonyl, 2-aminomethyl, 5-(hetero-)aryl substituted 1-H-pyrrole derivatives as acid secretion inhibitors. Patent WO2007026916A1 (2007).
Yu, Q.-Y., Zeng, H., Yao, K., Li, J.-Q. & Liu, Y. Novel and practical synthesis of vonoprazan fumarate. Synth. Commun. 47, 1169–1174 (2017).
Chen, S. & Jung, Y. Deep retrosynthetic reaction prediction using local reactivity and global attention. JACS Au. 1, 1612–1620 (2021).
Tetko, I. V., Karpov, P., Van Deursen, R. & Godin, G. State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis. Nat. Commun. 11, 5575 (2020).
Zhong, Z. et al. Root-aligned SMILES: a tight representation for chemical reaction prediction. Chem. Sci. 13, 9023–9034 (2022).
Irwin, R., Dimitriadis, S., He, J. & Bjerrum, E. J. Chemformer: a pre-trained transformer for computational chemistry. Mach. Learn. Sci. Technol. 3, 015022 (2022).
Zdrazil, B. et al. The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res. 52, D1180–D1192 (2024).
Tingle, B. I. et al. ZINC-22—a free multi-billion-scale database of tangible compounds for ligand discovery. J. Chem. Inf. Model. 63, 1166–1176 (2023).
Chilingaryan, G. et al. BartSmiles: generative masked language models for molecular representations. J. Chem. Inf. Model. 64, 5832–5843 (2024).
zw-SIMM & Xiong, J. jiachengxiong/ReactSeq: ReactSeq (1.0). Zenodo https://doi.org/10.5281/zenodo.13338263 (2024).
Coley, C. W., Rogers, L., Green, W. H. & Jensen, K. F. Computer-assisted retrosynthesis based on molecular similarity. ACS Cent. Sci. 3, 1237–1245 (2017).
Segler, M. H. & Waller, M. P. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chem. Eur. J. 23, 5966–5971 (2017).
Dai, H., Li, C., Coley, C., Dai, B. & Song, L. Retrosynthesis prediction with conditional graph logic network. Adv. Neural Inf. Process. Syst. 32, 8872–8882 (2019).
Sacha, M. et al. Molecule edit graph attention network: modeling chemical reactions as sequences of graph edits. J. Chem. Inf. Model. 61, 3273–3284 (2021).
Chen, Z., Ayinde, O. R., Fuchs, J. R., Sun, H. & Ning, X. G2Retro as a two-step graph generative models for retrosynthesis prediction. Commun. Chem. 6, 102 (2023).
Yao, L. et al. Node-aligned graph-to-graph: elevating template-free deep learning approaches in single-step retrosynthesis. JACS Au. 4, 992–1003 (2024).
Liu, X. et al. RetroCaptioner: beyond attention in end-to-end retrosynthesis transformer via contrastively captioned learnable graph representation. Bioinformatics 40, btae561 (2024).
Zheng, S., Rao, J., Zhang, Z., Xu, J. & Yang, Y. Predicting retrosynthetic reactions using self-corrected transformer neural networks. J. Chem. Inf. Model. 60, 47–55 (2019).
Sun, R., Dai, H., Li, L., Kearnes, S. & Dai, B. Towards understanding retrosynthesis by energy-based models. Adv. Neural Inf. Process. Syst. 34, 10186–10194 (2021).
Acknowledgements
We gratefully acknowledge financial support from the National Natural Science Foundation of China (grant nos. T2225002 and 82273855, M.Z.), the National Key Research and Development Program of China (grant nos. 2022YFC3400504 and 2023YFC2305904, M.Z.), the Strategic Priority Research Program of the Chinese Academy of Sciences (grant no. XDB0830200, M.Z.), the open fund of the State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, China (grant no. KF-202301, M.Z.) and the Shanghai Post-doctoral Excellence Program (grant no. 2024707, J.X.).
Author information
Authors and Affiliations
Contributions
J.X. proposed the idea, conducted computational experiments and drafted the initial paper together with W.Z. Yinquan W., J.H., W.Z., M.X. and M.L. carried out the synthetic experiments. Y.S. developed the web application. Z.F. and J.H. contributed to the case analysis. Z.F., X.K. and M.Z. helped check and improve the paper. Yitian W. and Z.X. participated in the analysis of results. M.Z. led the project and designed the study. All authors read and approved the final paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Jannis Born, Xiangliang Zhang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Multistep retrosynthesis predictions by our model for (a) Vonoprazan, (b) Mitapivat, (c) Daridorexant.
The reaction centres and leaving groups are highlighted in different colours at each reaction step.
Supplementary information
Supplementary Information
Supplementary Figs 1–26, Tables 1–12, Additional Results and Discussion, and Methods.
Source data
Source Data Fig. 3
Raw data for Fig. 3.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xiong, J., Zhang, W., Wang, Y. et al. Bridging chemistry and artificial intelligence by a reaction description language. Nat Mach Intell 7, 782–793 (2025). https://doi.org/10.1038/s42256-025-01032-8
Received:
Accepted:
Published:
Issue date:
DOI: https://doi.org/10.1038/s42256-025-01032-8