  • Perspective

Making the collective knowledge of chemistry open and machine actionable

Abstract

Large amounts of data are generated in chemistry labs. Nearly all instruments record data in a digital form, yet a considerable proportion is also captured non-digitally and reported in ways that are inaccessible to both humans and their computational agents. Chemical research is still largely centred around paper-based lab notebooks, and the publication of data is often more an afterthought than an integral part of the process. Here we argue that a modular open-science platform for chemistry would be beneficial not only for data-mining studies but also, well beyond that, for the entire chemistry community. Much progress has been made over the past few years in developing technologies, such as electronic lab notebooks, that aim to address data-management concerns. This will help make chemical data reusable, but it is only one step. We highlight the importance of centring open-science initiatives around open, machine-actionable data and emphasize that most of the required technologies already exist: we only need to connect, polish and embrace them.

Fig. 1: The five core theses of this perspective.
Fig. 2: Overview of a possible importation procedure of the ELN.
Fig. 3: Example of the flow of data from an ELN to an interactive visualization for the reader of a paper.

References

  1. Heidorn, P. B. Shedding light on the dark data in the long tail of science. Libr. Trends 57, 280–299 (2008).

  2. Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).

  3. Prinz, F., Schlange, T. & Asadullah, K. Believe it or not: how much can we rely on published data on potential drug targets? Nat. Rev. Drug Discov. 10, 712–712 (2011).

  4. Wilkinson, M. D. et al. The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).

  5. Pietsch, W. & Wernecke, J. in Berechenbarkeit der Welt? (eds Pietsch, W., Wernecke, J. & Ott, M.) 37–57 (Springer, 2017).

  6. Hunter, M. Establishing the New Science: the Experience of the Early Royal Society (Boydell Press, 1989).

  7. McAlpine, J. B. et al. The value of universally available raw NMR data for transparency, reproducibility, and integrity in natural product research. Nat. Prod. Rep. 36, 35–107 (2019).

  8. Helliwell, J. R., McMahon, B., Guss, J. M. & Kroon-Batenburg, L. M. J. The science is in the data. IUCrJ 4, 714–722 (2017).

  9. Kwok, R. How to pick an electronic laboratory notebook. Nature 560, 269–270 (2018).

  10. Kanza, S. et al. Electronic lab notebooks: can they replace paper? J. Cheminformatics 9, 31 (2017).

  11. Rubacha, M., Rattan, A. K. & Hosselet, S. C. A review of electronic laboratory notebooks available in the market today. J. Lab. Autom. 16, 90–98 (2011).

  12. Guerrero, S. et al. Analysis and implementation of an electronic laboratory notebook in a biomedical research institute. PLoS ONE 11, e0160428 (2016).

  13. Dirnagl, U. & Przesdzing, I. A pocket guide to electronic laboratory notebooks in the academic life sciences. F1000Research 5, 2 (2016).

  14. Coley, C. W. in Artificial Intelligence in Drug Discovery (ed. Brown, N) 327–348 (Royal Society of Chemistry, 2020).

  15. Raccuglia, P. et al. Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76 (2016).

  16. Moosavi, S. M. et al. Capturing chemical intuition in synthesis of metal–organic frameworks. Nat. Commun. 10, 539 (2019).

  17. Ojea-Jiménez, I., Bastús, N. G. & Puntes, V. Influence of the sequence of the reagents addition in the citrate-mediated synthesis of gold nanoparticles. J. Phys. Chem. C 115, 15752–15757 (2011).

  18. Huang, Y. et al. Importance of reagent addition order in contaminant degradation in an Fe(II)/PMS system. RSC Adv. 6, 70271–70276 (2016).

  19. Lowe, D. M. Extraction of Chemical Structures and Reactions from the Literature. PhD thesis, Univ. Cambridge (2012).

  20. Jin, W., Coley, C. W., Barzilay, R. & Jaakkola, T. Predicting organic reaction outcomes with Weisfeiler-Lehman network. In Proc. 31st International Conference on Neural Information Processing Systems 2604–2613 (NIPS, 2017).

  21. Kim, E., Huang, K., Kononova, O., Ceder, G. & Olivetti, E. Distilling a materials synthesis ontology. Matter 1, 8–12 (2019).

  22. Roughley, S. D. & Jordan, A. M. The medicinal chemist’s toolbox: an analysis of reactions used in the pursuit of drug candidates. J. Med. Chem. 54, 3451–3479 (2011).

  23. Schneider, N., Lowe, D. M., Sayle, R. A., Tarselli, M. A. & Landrum, G. A. Big data from pharmaceutical patents: a computational analysis of medicinal chemists’ bread and butter. J. Med. Chem. 59, 4385–4402 (2016).

  24. Brown, D. G., Gagnon, M. M. & Boström, J. Understanding our love affair with p-chlorophenyl: present day implications from historical biases of reagent selection. J. Med. Chem. 58, 2390–2405 (2015).

  25. Brown, D. G. & Boström, J. Analysis of past and present synthetic methodologies on medicinal chemistry: where have all the new reactions gone? J. Med. Chem. 59, 4443–4458 (2015).

  26. Bird, C. L., Willoughby, C. & Frey, J. G. Laboratory notebooks in the digital era: the role of ELNs in record keeping for chemistry and other sciences. Chem. Soc. Rev. 42, 8157–8175 (2013).

  27. Oleksik, G., Milic-Frayling, N. & Jones, R. Study of electronic lab notebook design and practices that emerged in a collaborative scientific environment. In CSCW’14 Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (ACM Press, 2014).

  28. McDonald, R. S. & Wilks, P. A. JCAMP-DX: a standard form for exchange of infrared spectra in computer readable form. Appl. Spectrosc. 42, 151–162 (1988).

  29. Chalk, S. J. The open spectral database: an open platform for sharing and searching spectral data. J. Cheminformatics 8, 55 (2016).

  30. Mehr, S. H. M., Craven, M., Leonov, A. I., Keenan, G. & Cronin, L. A universal system for digitization and automatic execution of the chemical synthesis literature. Science 370, 101–108 (2020).

  31. Directorate General for Research and Innovation (European Commission) Turning FAIR into Reality: Final Report and Action Plan from the European Commission Expert Group on FAIR Data (Publications Office, 2018).

  32. Harrow, I. et al. Ontology mapping for semantically enabled applications. Drug Discov. Today 24, 2068–2075 (2019).

  33. Davies, A. & Patiny, L. NMRium browser-based nuclear magnetic resonance data processing. Spectrosc. Eur. https://doi.org/10.1255/sew.2021.a18 (2021).

  34. Bonney, R. et al. Next steps for citizen science. Science 343, 1436–1437 (2014).

  35. Nielsen, M. Reinventing Discovery: the New Era of Networked Science (Princeton Univ. Press, 2012).

  36. European Organization For Nuclear Research & OpenAIRE Zenodo https://www.zenodo.org/ (2013).

  37. Tremouilhac, P. et al. Chemotion repository, a curated repository for reaction information and analytical data. Chem. Methods 1, 8–11 (2020).

  38. Coudert, F.-X. The rise of preprints in chemistry. Nat. Chem. 12, 499–502 (2020).

  39. Bradley, J.-C. Open notebook science using blogs and wikis. Nat. Prec. https://doi.org/10.1038/npre.2007.39.1 (2007).

  40. Jablonka, K. M., Ongari, D., Moosavi, S. M. & Smit, B. Big-data science in porous materials: materials genomics and machine learning. Chem. Rev. 120, 8066–8129 (2020).

  41. Olson, M. The Logic of Collective Action; Public Goods and the Theory of Groups (Schocken Books, 1971).

  42. Strasser, B. GENETICS: genbank—natural history in the 21st century? Science 322, 537–538 (2008).

  43. Williamson, A. E. et al. Open source drug discovery: highly potent antimalarial compounds derived from the Tres Cantos arylpyrroles. ACS Centr. Sci. 2, 687–701 (2016).

  44. Chodera, J., Lee, A. A., London, N. & von Delft, F. Crowdsourcing drug discovery for pandemics. Nat. Chem. 12, 581–581 (2020).

  45. Perkmann, M. & Schildt, H. Open data partnerships between firms and universities: the role of boundary organizations. Res. Policy 44, 1133–1143 (2015).

  46. Jones, M. M. & Chataway, J. The structural genomics consortium: successful organisational technology experiment or new institutional infrastructure for health research? Technol. Anal. Strategic Manage. 33, 296–306 (2021).

  47. Edwards, A. M., Bountra, C., Kerr, D. J. & Willson, T. M. Open access chemical and clinical probes to support drug discovery. Nat. Chem. Biol. 5, 436–440 (2009).

  48. Jung, N., Deckers, A. & Bräse, S. Ein molekülarchiv als akademisch integrierte service-einrichtung. Biospektrum 23, 212–214 (2017).

  49. Jablonka, K. M., Patiny, L. & Smit, B. Making molecules vibrate: Interactive web environment for the teaching of infrared spectroscopy. J. Chem. Educ. https://doi.org/10.1021/acs.jchemed.1c01101 (2022).

  50. Herres-Pawlis, S., Koepler, O. & Steinbeck, C. NFDI4chem: shaping a digital and cultural change in chemistry. Angew. Chem. Int. Ed. 58, 10766–10768 (2019).

  51. Steinbeck, C. et al. NFDI4chem—towards a national research data infrastructure for chemistry in Germany. Res. Ideas Outcomes 6, e55852 (2020).

  52. Wulf, C. et al. A unified research data infrastructure for catalysis research—challenges and concepts. ChemCatChem 13, 3223–3236 (2021).

  53. Cooper, D. & Springer, R. Data Communities: A New Model for Supporting STEM Data Sharing Technical Report (Univ. Nebraska-Lincoln, 2019).

  54. Evans, J. D., Bon, V., Senkovska, I. & Kaskel, S. A universal standard archive file for adsorption data. Langmuir 37, 4222–4226 (2021).

  55. Siderius, D. NIST/ARPA-E Database of Novel and Emerging Adsorbent Materials (NIST, accessed 29 June 2020); https://doi.org/10.18434/T43882

  56. Ongari, D., Talirz, L., Jablonka, K. M., Siderius, D. W. & Smit, B. Data-driven matching of experimental crystal structures and gas adsorption isotherms of Metal–Organic frameworks. J. Chem. Eng. Data https://doi.org/10.1021/acs.jced.1c00958 (2022).

  57. Watson, M. When will ‘open science’ become simply ‘science’? Genome Biol. 16, 101 (2015).

  58. Tennant, J. Open science: Just science done right? https://figshare.com/articles/Open_Science_Just_science_done_right_/9759353/1 (2019).

  59. Long, M. & Schonfeld, R. Supporting the Changing Research Practices of Chemists Technical Report (Ithaca, 2013).

  60. Tremouilhac, P. et al. Chemotion ELN: an open source electronic lab notebook for chemists in academia. J. Cheminformatics 9, 54 (2017).

  61. Huang, Y.-C., Tremouilhac, P., Nguyen, A., Jung, N. & Bräse, S. ChemSpectra: a web-based spectra editor for analytical data. J. Cheminformatics 13, 8 (2021).

  62. Barillari, C. et al. openBIS ELN-LIMS: an open-source database for academic laboratories. Bioinformatics 32, 638–640 (2016).

  63. Patiny, L. et al. The c6h6 NMR repository: an integral solution to control the flow of your data from the magnet to the public. Magn. Reson. Chem. 56, 520–528 (2017).

  64. Badiola, K. A. et al. Experiences with a researcher-centric ELN. Chem. Sci. 6, 1614–1629 (2015).

  65. Woelfle, M., Olliaro, P. & Todd, M. H. Open science is a research accelerator. Nat. Chem. 3, 745–748 (2011).

  66. Carpi, N., Minges, A. & Piel, M. eLabFTW: an open source laboratory notebook for research labs. J. Open Source Softw. 2, 146 (2017).

  67. Rudolphi, F. Ein elektronisches laborjournal als open-source-software. Nachr. Chem. 58, 548–550 (2010).

  68. Brandt, N. et al. Kadi4mat: a research data infrastructure for materials science. Data Sci. J. 20, 8 (2021).

  69. Jablonka, K. M. et al. Connecting lab experiments with computer experiments: making ‘routine’ simulations routine. Preprint at ChemRxiv https://doi.org/10.26434/chemrxiv-2021-h3381-v2 (2021).

  70. Gray, A. J., Goble, C. A., Jimenez, R. et al. Bioschemas: from potato salad to protein annotation. In 16th International Semantic Web Conference (2017).

  71. Jablonka, K. M. et al. A data-driven perspective on the colours of metal–organic frameworks. Chem. Sci. 12, 3587–3598 (2021).

  72. Kratsios, M., Kent, S. & Rinat, O. Connecting Americans to coronavirus information online. Trump White House Archives https://trumpwhitehouse.archives.gov/articles/connecting-americans-coronavirus-information-online/ (2020).

  73. COVID-19 Announcements Structured Data (Google Search Central, 2021); https://developers.google.com/search/docs/advanced/structured-data/special-announcements

  74. Fletcher, G., Groth, P. & Sequeda, J. Knowledge scientists: unlocking the data-driven organization. Preprint at https://arxiv.org/abs/2004.07917 (2020).

  75. Kellogg, G., Champin, P.-A. & Longley, D. JSON-LD 1.1—A JSON-based Serialization for Linked Data. (W3C, 2020).

  76. Tennison, J. CSV on the Web: A Primer (W3C, 2016).

  77. Coles, S. J., Frey, J. G., Bird, C. L., Whitby, R. J. & Day, A. E. First steps towards semantic descriptions of electronic laboratory notebook records. J. Cheminformatics 5, 52 (2013).

  78. Lütjohann, D. S., Jung, N. & Bräse, S. Open source life science automation: design of experiments and data acquisition via ‘dial-a-device’. Chemometr. Intell. Lab. Syst. 144, 100–107 (2015).

  79. Chung, Y. G. et al. Advances, updates, and analytics for the computation-ready, experimental metal–organic framework database: CoRE MOF 2019. J. Chem. Eng. Data 64, 5985–5998 (2019).

  80. Gražulis, S. et al. Crystallography Open Database—an open-access collection of crystal structures. J. Appl. Crystallogr. 42, 726–729 (2009).

  81. Gražulis, S. et al. Crystallography Open Database (COD): an open-access collection of crystal structures and platform for world-wide collaboration. Nucleic Acids Res. 40, D420–D427 (2012).

  82. Chalk, S. J. SciData: a data model and ontology for semantic representation of scientific data. J. Cheminformatics 8, 54 (2016).


Acknowledgements

This work was partially supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 666983, MaGic) and the Swiss National Science Foundation (SNSF) through the National Center of Competence in Research (NCCR) and Materials’ Revolution: Computational Design and Discovery of Novel Materials (MARVEL). We thank M. Evans, L. Talirz, M. Moosavi, M. Asgari, N. Marzari, G. Pizzi and fellow EPFL Data Champions for discussion and inputs and thank the cheminfo and Zakodium developers (among others, M. Zasso, D. Kostro, J. Wiest, A. M. Castillo, A. Bolaños, J. Osorio and N. Pellet; also see https://cheminfo.github.io/team for a list of contributors) for their invaluable contributions (conceiving and implementing many of the examples discussed in this perspective). Of course, we also thank the chemists whose feedback about our ELN implementation shaped our Perspective.

Author information

Contributions

K.M.J. and B.S. wrote the manuscript with inputs from L.P. All the authors contributed to discussions.

Corresponding authors

Correspondence to Luc Patiny or Berend Smit.

Ethics declarations

Competing interests

L.P. is chief scientific officer of Zakodium Sàrl, a company dedicated to the development of tools for storing, processing and analysis of scientific information. All the authors are contributors to the cheminfo ecosystem.

Peer review

Peer review information

Nature Chemistry thanks Samantha Kanza, Matthew Todd and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Fragment of an NMR spectrum serialized to a classic standard format.

This is an example of a JCAMP-DX file. JCAMP-DX is a widely used, IUPAC-recommended format for spectra that is supported by, for example, the cheminfo and Chemotion ELNs. Spectra in many databases, such as the NIST WebBook or the Infrared & Raman Users Group (IRUG) Spectral Database, can also be downloaded in JCAMP-DX format. A JCAMP-DX file can contain multiple blocks of labelled data records (LDRs); that is, one can store multiple related spectra (such as repeated measurements) in the same file. All data blocks must contain a CORE header with basic metadata such as OWNER and DATATYPE. The IUPAC working group also provides a vocabulary of further global labels, for example for the temperature, pressure and CAS number. Data can also be compressed using various compression schemes. Note that JCAMP-DX is only one, rather old, standard, and many others have been proposed. It does, however, allow for the addition of an unlimited number of private labels by using the ##$ prefix, which allows every system to tailor the format to its own needs. Drawbacks of this format are that it does not come with native, standardised support for semantic web features (such as linking to a vocabulary) and that, in contrast to formats such as XML, CSV or JSON, it is not natively supported by many general-purpose tools.
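
To make the structure of labelled data records more concrete, the following is a minimal Python sketch of how the records of a single block could be read and how private labels (those starting with ##$) could be separated from standard ones. The sample text, the helper names (`normalise`, `parse_ldrs`) and the label-normalisation rule are assumptions made for illustration; the sketch ignores multiple blocks, comments and compressed data, and it is not the parser used by any of the ELNs mentioned here.

```python
import re

# Hypothetical JCAMP-DX fragment, invented for illustration
# (it is not the file shown in the figure).
SAMPLE = """\
##TITLE= example 1H spectrum
##JCAMP-DX= 5.01
##DATA TYPE= NMR SPECTRUM
##OWNER= example lab
##$SOLVENT NAME= CDCl3
##XYDATA= (X++(Y..Y))
10.0 1 2 3
9.9 4 5 6
##END=
"""

# A labelled data record starts with '##', followed by a label, '=' and a value.
LDR = re.compile(r"^##([^=]*)=(.*)$")


def normalise(label):
    """Upper-case a label and drop spaces, '-', '/' and '_' (assumed rule),
    so that 'DATA TYPE' and 'DATATYPE' compare equal."""
    return re.sub(r"[\s\-/_]", "", label).upper()


def parse_ldrs(text):
    """Return (standard, private) dictionaries for one block of records."""
    standard, private = {}, {}
    current = None  # (dictionary, key) of the record currently being read
    for line in text.splitlines():
        match = LDR.match(line)
        if match:
            label, value = match.group(1), match.group(2).strip()
            if label.startswith("$"):
                # Private label: strip the '$' and store it separately.
                target, key = private, normalise(label[1:])
            else:
                target, key = standard, normalise(label)
            target[key] = value
            current = (target, key)
        elif current is not None and line.strip():
            # Lines without '##' (for example tabular data) continue the last record.
            target, key = current
            target[key] += "\n" + line.rstrip()
    return standard, private


standard, private = parse_ldrs(SAMPLE)
```

With this sketch, `standard["DATATYPE"]` holds the data type from the header, whereas the system-specific solvent entry ends up in `private["SOLVENTNAME"]`, which mirrors how private labels let each system extend the format without touching the standard vocabulary.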

Extended Data Fig. 2 Fragment of an NMR spectrum serialized to a modern standard format.

We show another NMR dataset (taken from the SciData website of the Chalk Group at the University of North Florida) serialized to JSON-LD using the SciData data model (ref. 82). One important part of the JSON-LD file is the @context field. The values in this field link to the vocabularies that are used for naming things in this data file. For instance, for units, the vocabularies provided by QUDT are used, whereas the method is described using the chemical methods ontology (from which it is clear that, for instance, NMR spectroscopy is, like electron spin resonance spectroscopy, a magnetic resonance method). Importantly, almost all modern programming languages provide support for reading such JSON files. The @type field can describe the format of the data, for instance, to let a computer know that it can expect a list of doubles. Different parts of the file (such as the methodology or the dataset) can be accessed by their own address.
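
As a rough illustration of how the @context field connects short names to vocabularies, the Python sketch below reads a small JSON-LD document with the standard json module and expands prefixed terms into full addresses. The document, the example.org addresses, the field names (`frequency`, `unitref`) and the `expand` helper are all hypothetical placeholders; a real SciData file would point to vocabularies such as QUDT and the chemical methods ontology, and a dedicated JSON-LD processor would implement the full expansion rules rather than this simplified prefix lookup.

```python
import json

# Hypothetical JSON-LD fragment; the example.org addresses are placeholders
# and do not correspond to real vocabulary entries.
DOC = """
{
  "@context": {
    "unit": "https://example.org/units/",
    "method": "https://example.org/methods/"
  },
  "@id": "https://example.org/dataset/1",
  "@type": "method:NMRSpectroscopy",
  "frequency": {"value": 400.0, "unitref": "unit:MegaHZ"}
}
"""


def expand(term, context):
    """Expand a compact 'prefix:suffix' term using the prefixes in the context.

    Simplified on purpose: only plain string prefixes are handled.
    """
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix
    return term  # already a full address, or an unknown prefix


data = json.loads(DOC)
context = data["@context"]

method_iri = expand(data["@type"], context)
unit_iri = expand(data["frequency"]["unitref"], context)
```

Here `method_iri` and `unit_iri` become full addresses that a program can follow to look up what the method and the unit mean, which is the machine-actionable link that a plain text label cannot provide.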

Supplementary information

Supplementary Note 1 and glossary, Tables 1–4 and Fig. 1.

About this article

Cite this article

Jablonka, K.M., Patiny, L. & Smit, B. Making the collective knowledge of chemistry open and machine actionable. Nat. Chem. 14, 365–376 (2022). https://doi.org/10.1038/s41557-022-00910-7

  • DOI: https://doi.org/10.1038/s41557-022-00910-7
