Background & Summary

The quest for sustainable, efficient, and carbon-free energy storage solutions has brought hydrogenation reactions to the forefront of scientific research1,2,3,4,5,6. Hydrogen, with its high gravimetric energy density and clean combustion, is a promising candidate for future energy systems3,7,8,9,10. However, its storage and transport pose significant challenges due to its low volumetric energy density and the complexities of handling gaseous substances11,12,13. Thus, reactions involving the chemical storage and release of hydrogen, particularly those involved in Liquid Organic Hydrogen Carriers (LOHCs), emerge as vital components in efficient and safe hydrogen storage3,8,14,15,16,17,18. LOHC systems offer a promising approach to the chemical storage of hydrogen, thereby addressing the limitations of high-pressure hydrogen storage methods14,19,20.

LOHCs operate by absorbing hydrogen via hydrogenation reactions (adding hydrogen across unsaturated bonds) and releasing it via dehydrogenation reactions3,18. The efficacy of the technology depends on the identification of organic carrier molecules that can reversibly store hydrogen in high densities while being economically viable, catalytically stable, and environmentally benign17,19. Therefore, the study of hydrogenation reactions, specifically focusing on their thermodynamic and kinetic properties, is crucial for advancing LOHC technology3,14. Within the scope of LOHC research, several chemical systems have garnered particular interest due to their unique characteristics and potential applications. Four such categories are shown in Fig. 1 and described below:

  a. Conventional LOHC Systems: These systems employ a catalyst-mediated hydrogenation/dehydrogenation cycle, such as benzene/cyclohexane, toluene/methyl-cyclohexane, and N-ethylcarbazole (NEC)/dodecahydro-N-ethylcarbazole. Central to their operation is the catalytic efficiency and the ability to store hydrogen reversibly without significant degradation of the LOHC molecule. Representing one of the first approaches to chemical storage of hydrogen in liquid carriers, these conventional systems set the stage for modern LOHC technology, offering a proven, viable pathway for sustainable energy integration and marking a significant stride towards a cleaner, hydrogen-based economy14,18,21,22,23,24,25.

  b. Mixture of LOHCs: By controlling the mole ratios of different LOHC components, it is possible to tailor the physical properties of the system, such as its melting and boiling points. Such precise control enables the formulation of LOHC mixtures that remain liquid at ambient conditions, which is crucial for practical storage and transportation. Additionally, this method allows for the optimization of the system’s hydrogen storage capacity and thermal stability, while also enhancing the economic viability of the hydrogen carriers. This tailored approach promises to address both efficiency and cost concerns, paving the way for more adaptable and user-friendly hydrogen storage solutions3,26. Illustratively, Stark et al. reported that a mixture of 42% N-ethyl carbazole (NEC) and 58% N-propyl carbazole (NPC), which melt at 342 K and 320 K respectively as pure compounds, has a melting point of 297 K26.

  c. Electrochemical LOHCs: These systems are distinguished by their electrochemical reversibility, which enables them to be employed effectively for hydrogen storage and release. An illustrative example of such a system is the isopropanol/acetone pair, where electrochemical reactions are employed to either liberate or absorb hydrogen27. This feature can notably improve the efficiency and control of the hydrogenation and dehydrogenation processes, offering an alternative approach to LOHC technology27,28,29.

  d. Alkali Metal-LOHCs: These systems feature a substitution whereby an alkali metal replaces a proton in the carrier molecule, exemplified by systems such as Na-Phenoxide/Na-Cyclohexanolate30. This modification can considerably decrease the enthalpy of hydrogenation and dehydrogenation reactions, which may result in more efficient storage cycles and lower energy demands for releasing hydrogen. The potential to lower the overall reaction enthalpy of hydrogen storage makes these systems a compelling area of research within the field of LOHCs30,31,32.

Fig. 1

Four categories of Liquid Organic Hydrogen Carrier (LOHC) molecules, as described in the text: Conventional, Electrochemical, Alkali Metal, and Mixtures of LOHCs. MP stands for melting point, ∆H is dehydrogenation enthalpy.

These categories and diverse approaches to LOHC technology reflect the breadth of research aimed at overcoming the challenges of hydrogen storage. Each category offers unique advantages and research opportunities for developing and deploying efficient, sustainable, and cost-effective hydrogen storage solutions.

Data-driven approaches have had a transformative impact in chemistry33,34,35, enabling, for example, innovation in molecular synthesizability predictions36,37,38,39, energy storage applications40,41,42,43, and drug discovery44,45,46. Accurate molecular data is essential for these machine learning efforts. One popular small-molecule dataset is QM933,34, which covers roughly 133k molecules with 9 or fewer heavy (non-hydrogen) atoms (the GDB-9 subset of the larger GDB-1747 collection of molecules with 17 or fewer heavy atoms). Its properties were computed using the B3LYP48 method and later refined49 with the more accurate composite quantum chemical method G4MP250,51,52,53.

The QM9 dataset contains computed properties (e.g., optimized molecular geometries, enthalpies of formation, dipole moments, partial charges, and vibrational frequencies) for an extensive array of small organic molecules48,49. This dataset has facilitated the exploration and discovery of novel compounds and reaction pathways through computational methods54,55,56.

Recently we developed an in-silico discovery pipeline that employs cheminformatics and quantum chemical calculations to identify novel conventional LOHC molecules19. Using this pipeline, we screened the large GDB-17 database, containing 166 billion molecules, and the ZINC1557 database, containing 1.2 billion molecules, implementing a selection protocol that integrated machine learning to predict physical properties (melting/boiling points) and synthetic accessibility. Our efforts identified 41 novel LOHC molecules including benzofuran, benzoxazole groups, substituted quinolines and phenyl pyridines, all of which exhibit promising chemical properties for LOHC applications. In another study, Paragian et al. harnessed the QM9 database to predict hydrogenation enthalpies for over a million potential LOHC molecules from PubChem, identifying 37 prime candidates using machine learning models54. The success of both studies demonstrates that extensive datasets (e.g. GDB-1747, ZINC1557, QM948,49, and PubChem58) lay the groundwork to discover novel small molecules suitable for practical LOHC systems when they are effectively utilized33,34,35,59,60.

In this investigation, we compile a dataset of 10,373 (de)hydrogenation reactions derived from the QM9 dataset. Our selection process focused on chemical reactions that met two primary criteria: (a) the reactant and product pairs differ only by their levels of hydrogen saturation, and (b) the pairs exhibit a significant hydrogen storage capacity, specifically 5.5% by weight hydrogen or more, as set by the standards of the Department of Energy61. Additional constraints, such as an enthalpy of dehydrogenation (∆H) between 40 and 70 kJ/mol per H2 and being liquid at room temperature, along with safety, non-toxicity, stability, and reversibility, are essential for practical applications3,18,19. Several of these properties can nevertheless be tuned3,14,19,26,31 by shifting from conventional LOHC technology to the more innovative approaches described above (see Fig. 1). Such versatility broadens the scope of potential applications, accommodating a wider range of operational requirements.

To accurately determine the structures and enthalpies of dehydrogenation reactions, we performed 9,841 new quantum chemical calculations using the G4MP2 method. These calculations targeted the hydrogenated forms of unsaturated molecules that are present in the original QM9 dataset but whose hydrogenated counterparts are not. This effort, combined with existing data from the original QM9 dataset, culminated in the creation of the QM9-LOHC62 open-access dataset, which encompasses 10,373 dehydrogenation reactions. We propose the QM9-LOHC dataset as a reference dataset for hydrogen storage technologies using LOHCs.

The QM9-LOHC dataset, with its newly calculated G4MP2 energies, extends the utility of the original QM9 database for in-depth studies of hydrogenation reactions, which are crucial in the development of advanced energy storage applications. Furthermore, the database will serve as a vital asset for the application of machine learning techniques in quantum chemistry, fostering the development of innovative methods that can accurately predict reaction enthalpies, particularly catering to the needs of energy storage research. This augmented database, moreover, is a significant step toward enabling the accurate calculation of reaction energies and provides a robust foundation for the development of predictive machine learning models for molecular discovery.

As shown in Fig. 2, the search for LOHC candidates started within the QM9 database47,48,49. First, we used the RDKit63 library to analyze the SMILES representations of molecules in the QM9 dataset, allowing us to identify those with unsaturated sites such as double bonds, triple bonds, or aromatic rings. Second, we deduced the saturated counterpart of each unsaturated molecule from step 1 by simple manipulation of its SMILES string. These two steps allowed us to identify over 100,160 dehydrogenation reactions of organic molecules, with each reaction consisting of a pair of molecules: a hydrogen-lean molecule and its hydrogen-rich counterpart. Third, we down-selected by considering only reactions that have a hydrogen storage capacity of at least 5.5% wt. H2, as calculated with Eq. (1):

$$\%\,\mathrm{wt}\ \mathrm{H_2} = \frac{MW_{\text{H-rich}} - MW_{\text{H-lean}}}{MW_{\text{H-rich}}} \times 100$$
(1)

where MW_{H-rich} and MW_{H-lean} are the molar weights of the hydrogen-rich and the hydrogen-lean species, respectively. Applying this threshold yields a dataset of 10,373 reactions, each pairing an unsaturated organic molecule with its saturated counterpart.
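The three screening steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: it assumes SMILES limited to the QM9 elements (C, H, N, O, F) without charges or stereo markers, the helper names `is_unsaturated`, `saturate_smiles`, and `wt_percent_h2` are hypothetical, and the molar weights are supplied by the caller (in practice they could come from a cheminformatics toolkit such as RDKit).

```python
# Hypothetical helpers illustrating steps 1-3; not taken from the QM9-LOHC code.

# Aromatic (lowercase) atoms in the QM9 element set and their aliphatic forms.
_AROMATIC_TO_ALIPHATIC = str.maketrans({"c": "C", "n": "N", "o": "O"})


def is_unsaturated(smiles: str) -> bool:
    """Step 1: flag SMILES containing double/triple bonds or aromatic atoms."""
    return any(token in smiles for token in "=#cno")


def saturate_smiles(smiles: str) -> str:
    """Step 2: drop multiple-bond symbols and make aromatic atoms aliphatic."""
    stripped = smiles.replace("=", "").replace("#", "")
    return stripped.translate(_AROMATIC_TO_ALIPHATIC)


def wt_percent_h2(mw_h_rich: float, mw_h_lean: float) -> float:
    """Step 3: hydrogen storage capacity in % wt. H2, following Eq. (1)."""
    return (mw_h_rich - mw_h_lean) / mw_h_rich * 100.0


# Benzene -> cyclohexane, with molar weights (g/mol) from standard atomic masses.
lean = "c1ccccc1"
rich = saturate_smiles(lean)
capacity = wt_percent_h2(mw_h_rich=84.162, mw_h_lean=78.114)
passes_threshold = capacity >= 5.5  # 5.5% wt. H2 cutoff used in the text
```

A string-level conversion like this is only a rough stand-in: it does not verify that the resulting SMILES is chemically valid, so parsing the output with a toolkit before use would be advisable.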

Fig. 2

Schematic of the protocol for developing the QM9-LOHC Dataset from the GDB-9 and QM9 databases49. The dataset includes 10,373 reactions with a minimum hydrogen storage capacity of 5.5% wt. H2, derived from selecting unsaturated molecules and generating their corresponding saturated forms. Representative molecules from the dataset are shown.

When we attempted to retrieve G4MP2 enthalpy data from the QM9 dataset for the hydrogen-rich molecules, we found that only 532 of the 10,373 molecules were present. To fill in the missing 9,841 molecular enthalpies, we performed quantum chemical calculations using the Gaussian 16 software64 with the G4MP2 method50,51. G4MP2 is a composite method based on G4 theory that uses MP2 perturbation theory to obtain higher computational efficiency. A previous assessment of G4MP2 energies on a subset of QM9 molecules reported an accuracy of 0.79 kcal/mol (3.3 kJ/mol) with respect to accurate experimental enthalpies of formation49, showing that the G4MP2 method is highly accurate and reliable for molecules that are in or similar to the QM9 dataset.

Before running the quantum chemical calculations, the minimum energy conformers were obtained using the Universal Force Field (UFF) method in RDKit. The G4MP2 method employs B3LYP/6-31G(2df,p) optimized geometries for a series of single-point energy calculations at higher levels of theory. The zero-point energy (E(ZPE)) is computed using B3LYP/6-31G(2df,p) frequencies, which are scaled by a factor of 0.9854 to account for anharmonic effects. The nature of each located potential energy surface stationary point was confirmed as a minimum by the absence of imaginary frequencies. The initial energy calculation is performed at the coupled-cluster level of theory, CCSD(T), with the 6-31G(d) basis set. This energy is subsequently refined by applying a series of corrections, including those derived from MP2 and Hartree-Fock (HF) energies and high-level corrections. This multi-step approach allows for the calculation of highly accurate total energies, benefiting from the computational efficiency of MP2 and the accuracy of coupled-cluster methods.
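The composite scheme described above can be summarized schematically as a sum of a base energy and additive corrections. The symbols below are shorthand introduced here for illustration; the exact basis sets and definitions behind each correction term are given in the G4MP2 references50,51 and are not reproduced in this sketch.

```latex
E_{\mathrm{G4MP2}} \approx E\left[\mathrm{CCSD(T)/6\text{-}31G(d)}\right]
  + \Delta E_{\mathrm{MP2}} + \Delta E_{\mathrm{HF}}
  + E_{\mathrm{HLC}} + E_{\mathrm{ZPE}}
```

Here ΔE_MP2 and ΔE_HF stand for the MP2- and HF-derived corrections, E_HLC for the high-level correction, and E_ZPE for the scaled zero-point energy mentioned in the text.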

From the computed G4MP2 energies, the reaction enthalpies (ΔHrxn) of the 10,373 pairs are calculated using Eq. (2):

$$\Delta H_{\mathrm{rxn}} = \frac{\left[H^{\circ}_{\text{H-lean}} + n_{\mathrm{H_2}}\, H^{\circ}_{\mathrm{H_2}}\right] - H^{\circ}_{\text{H-rich}}}{n_{\mathrm{H_2}}}$$
(2)

where H° is the absolute enthalpy (at 298.15 K and 1 atm) and n_{H2} is the number of moles of H2 involved in the reaction. In what follows, ∆H refers to the standard gas-phase enthalpy of the dehydrogenation reaction and is reported in units of kJ/mol H2 unless stated otherwise.
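As a minimal sketch of Eq. (2), the function below converts absolute enthalpies into a dehydrogenation enthalpy per mole of H2. It assumes the absolute enthalpies are given in Hartree (the unit typically printed by quantum chemistry packages, not stated in the text); the function name and the input values are hypothetical placeholders, not G4MP2 results.

```python
HARTREE_TO_KJ_PER_MOL = 2625.4996  # standard Hartree -> kJ/mol conversion


def dehydrogenation_enthalpy(h_lean, h_rich, h_h2, n_h2):
    """Return Delta H_rxn in kJ/mol H2 from absolute enthalpies in Hartree."""
    if n_h2 <= 0:
        raise ValueError("n_h2 must be a positive number of H2 molecules")
    # Products (H-lean species plus n H2) minus reactant (H-rich species).
    delta_hartree = (h_lean + n_h2 * h_h2) - h_rich
    return delta_hartree / n_h2 * HARTREE_TO_KJ_PER_MOL


# Placeholder enthalpies (Hartree), chosen only to exercise the formula.
example = dehydrogenation_enthalpy(h_lean=-7.94, h_rich=-10.0, h_h2=-1.0, n_h2=2)
```

Dividing by the number of H2 molecules keeps values comparable across reactions that release different amounts of hydrogen, which is why the 40 to 70 kJ/mol target is quoted per H2.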

Data Records

The dataset is accessible via Zenodo at the following link62: https://doi.org/10.5281/zenodo.10926772. Contained within a zip file, the dataset comprises two CSV files: “QM9_G4MP2_all.csv” and “QM9-LOHC_new_molecules.csv”. The “QM9_G4MP2_all.csv” file contains the entire QM9-LOHC dataset of 10,373 reactions. Its columns are the unsaturated SMILES string (unsat_SMILE), the saturated SMILES string (sat_SMILE), the dehydrogenation enthalpy in kJ/mol H2 (delta_H), the number of H2 molecules involved in the dehydrogenation reaction (nH2), and the hydrogen storage capacity in %wt. H2 (pH2). The “QM9-LOHC_new_molecules.csv” file is the subset of the QM9-LOHC dataset whose saturated SMILES strings represent molecules not present in the original QM9 dataset. This file has the same columns as the first: unsat_SMILE, sat_SMILE, delta_H, nH2, and pH2. Additionally, the zip archive includes the source code (app.py) for an accompanying web application, available at https://qm9-lohc.streamlit.app/, and a Python script (query.py) for querying the database. The interactive web app allows the user to select a range of dehydrogenation enthalpy, a range of hydrogen storage capacity, and the number of desired results. The app then queries the dataset and displays molecules (in either 2D or 3D) along with their hydrogen storage capacity and dehydrogenation enthalpy.
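A query of the kind described above could be sketched with the Python standard library as follows. This is not the distributed query.py: the function name `query_reactions` is hypothetical, the column names are those listed in the text, and the sample rows are made-up stand-ins for the real file.

```python
import csv
import io


def query_reactions(csv_file, dh_min, dh_max, min_capacity, limit=None):
    """Return rows with dh_min <= delta_H <= dh_max and pH2 >= min_capacity."""
    hits = []
    for row in csv.DictReader(csv_file):
        delta_h = float(row["delta_H"])
        capacity = float(row["pH2"])
        if dh_min <= delta_h <= dh_max and capacity >= min_capacity:
            hits.append(row)
    # Highest storage capacity first.
    hits.sort(key=lambda r: float(r["pH2"]), reverse=True)
    return hits if limit is None else hits[:limit]


# Made-up rows for illustration only; values are not taken from the dataset.
sample = io.StringIO(
    "unsat_SMILE,sat_SMILE,delta_H,nH2,pH2\n"
    "c1ccccc1,C1CCCCC1,68.0,3,7.2\n"
    "C=O,CO,90.0,1,6.3\n"
    "c1ccncc1,C1CCNCC1,62.0,3,7.1\n"
)
results = query_reactions(sample, dh_min=40.0, dh_max=70.0, min_capacity=5.5)
```

To run the same filter on the real data, the in-memory sample would be replaced by a file handle opened on “QM9_G4MP2_all.csv” with `newline=""`, as recommended for the `csv` module.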

Technical Validation

The validation of our G4MP2 calculations is based on methodologies derived from prior research employing similar computational techniques49,50,51,59. Specifically, Narayanan et al.49 reported that G4MP2 calculations yielded a mean absolute error (MAE) of 1.04 kcal/mol (4.35 kJ/mol) when compared with experimental gas-phase enthalpies of formation, supporting their reliability and accuracy for the purposes of this study. Additionally, Rogers et al. employed Gn methods to calculate hydrogenation enthalpies of various small hydrocarbons, reporting MAEs between 3.5 and 5.0 kJ/mol65,66,67,68.

Due to the limited availability of experimental dehydrogenation enthalpy data for molecules within the QM9-LOHC dataset, the comparison was conducted on a set of 14 molecules (Table 1). The first nine molecules (Entries 1–9) in Table 1 are part of a benchmark set from our recent study19. The remaining entries (Entries 10–14) represent data obtained from the Pedley Dataset69 and the NIST Chemistry WebBook70, where we used the standard enthalpies of formation of the hydrogenated and dehydrogenated species to calculate the reaction enthalpies. Table 1 lists the experimental dehydrogenation enthalpies alongside the corresponding values derived from QM9-LOHC via G4MP2 calculations. The computed values align closely with the experimental benchmarks, with a root mean square deviation (RMSD) of 7.3 kJ/mol H2 and an MAE of 6.3 kJ/mol H2.
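For reference, the two error metrics quoted above can be computed as in the following sketch. The function name `error_metrics` is hypothetical, and the input values are arbitrary placeholders rather than entries from Table 1.

```python
import math


def error_metrics(computed, reference):
    """Return (MAE, RMSD) between computed and reference values."""
    if len(computed) != len(reference) or not computed:
        raise ValueError("inputs must be non-empty and of equal length")
    deviations = [c - r for c, r in zip(computed, reference)]
    mae = sum(abs(d) for d in deviations) / len(deviations)
    rmsd = math.sqrt(sum(d * d for d in deviations) / len(deviations))
    return mae, rmsd


# Placeholder enthalpies in kJ/mol H2, not taken from Table 1.
computed = [60.0, 70.0, 50.0]
experimental = [63.0, 66.0, 50.0]
mae, rmsd = error_metrics(computed, experimental)
```

Because RMSD squares each deviation before averaging, it is never smaller than the MAE and is more sensitive to a few large outliers, which is consistent with the RMSD exceeding the MAE in the reported validation.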

Table 1 A comparison of experimental, G4MP2, and G4 dehydrogenation enthalpies for known LOHC molecules.

To better contextualize these results, we also performed G4 calculations for the same validation set (Table 1), obtaining a reduced RMSD of 4.6 kJ/mol H2 and an MAE of 2.9 kJ/mol H2. While the G4 method demonstrates improved accuracy on this small benchmark set, it comes at a significantly higher computational cost, making it less feasible for large-scale datasets such as QM9-LOHC. In contrast, the G4MP2 dataset offers a practical balance between computational efficiency and accuracy, with deviations that remain well within acceptable limits for quantum chemistry-based reaction studies. This highlights the suitability of the G4MP2 dataset as a reliable resource for hydrogenation and dehydrogenation reaction modeling in LOHC research.

We note that the observed discrepancies (up to 14.3 kJ/mol H2 for aminobenzene) can be attributed to several factors. G4MP2, while reliable for many systems, is known to exhibit limitations for nitrogen-containing molecules, such as aniline derivatives71. Suntsova and Dorofeeva demonstrated that deviations for nitrogen species can exceed 10 kJ/mol, even with the higher-accuracy G4 method71. Furthermore, systematic underestimation of enthalpies of formation is observed for certain molecular classes, such as nitro compounds, with deviations between 5 and 15 kJ/mol, as these classes were underrepresented in the original test sets used for method parameterization71,72. Finally, large deviations (>20 kJ/mol) may also reflect uncertainties or errors in the experimental reference data, as suggested by the isodesmic reaction validation method72.

Figure 3a presents the dehydrogenation enthalpies derived from the original QM9-G4MP2 data alongside those for the full QM9-LOHC dataset. The left histogram includes 532 reactions from the original QM9-G4MP2 dataset that meet the Department of Energy’s 5.5% wt. H2 criterion. A majority of these reactions have dehydrogenation enthalpies above the optimal range for LOHC functionality, with most falling within 100–150 kJ/mol. In contrast, the right histogram, covering all 10,373 reactions in the QM9-LOHC dataset, identifies 3,040 reactions with enthalpies between 40 and 70 kJ/mol, the optimal range for LOHC applications. Additionally, there are 423 reactions with dehydrogenation enthalpies below 40 kJ/mol and 5,616 reactions with values between 70 and 120 kJ/mol (Fig. 3b). The QM9-LOHC dataset further reveals that the majority of molecules (8,615) contain 9 heavy atoms, and 1,114 contain 8 heavy atoms (Fig. 3c). This distribution is relevant because larger organic molecules, specifically those with 8 or 9 heavy atoms, are more likely to be liquid at room temperature, a key property for their function as LOHCs73.
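The enthalpy ranges used in this breakdown can be reproduced with a simple binning function such as the sketch below. The function names and the treatment of values exactly on a boundary (40, 70, or 120 kJ/mol H2) are assumptions, since the text does not specify them; the sample values are placeholders.

```python
from collections import Counter


def enthalpy_bin(delta_h):
    """Assign a dehydrogenation enthalpy (kJ/mol H2) to one of four ranges."""
    if delta_h < 40.0:
        return "<40"
    if delta_h <= 70.0:
        return "40-70"
    if delta_h <= 120.0:
        return "70-120"
    return ">120"


def bin_counts(values):
    """Count how many enthalpies fall into each range."""
    return Counter(enthalpy_bin(v) for v in values)


# Placeholder enthalpies, not taken from the dataset.
counts = bin_counts([35.2, 55.0, 70.0, 95.5, 130.1])

# If the four ranges partition all 10,373 reactions, the counts quoted in the
# text imply the size of the remaining (>120 kJ/mol) group by subtraction.
above_120 = 10373 - (3040 + 423 + 5616)
```

Under that partition assumption, the subtraction gives 1,294 reactions above 120 kJ/mol H2.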

Fig. 3

(a) Histograms showing the distribution of dehydrogenation enthalpies (∆H) in the QM9-LOHC dataset. (b) Pie chart showing the percentages and counts of reactions in the QM9-LOHC dataset that fall in the desired 40–70 kJ/mol H2 range, as well as reactions with enthalpies below 40 kJ/mol, between 70 and 120 kJ/mol, and above 120 kJ/mol. (c) Distribution of heavy-atom counts in the QM9-LOHC dataset.

Figure 4 highlights the interplay between dehydrogenation enthalpies and hydrogen storage capacities in the QM9-LOHC dataset. Figure 4a shows the distribution of dehydrogenation enthalpies against hydrogen storage capacities, with the gold-shaded area indicating the optimal ∆H range (40–70 kJ/mol per H2) for conventional LOHCs. This visualization shows that higher hydrogen storage capacities are generally associated with increased dehydrogenation enthalpies. Furthermore, fewer molecules reside within the gold-shaded region as storage capacity rises, illustrating a potential trade-off between storage capacity and enthalpic efficiency.

Fig. 4

(a) Scatter plot of hydrogen storage capacity and dehydrogenation enthalpy (∆H) across the dataset, with the preferred enthalpy range for conventional LOHCs (40–70 kJ/mol H2) highlighted. (b) Pie chart showing the distribution of hydrogen storage capacity across the QM9-LOHC dataset (given in %wt. H2). (c) Four randomly chosen dehydrogenation reactions (A–D), shown as the four red points in (a), with their dehydrogenation enthalpies and hydrogen storage capacities.

In Fig. 4b, the pie chart shows the hydrogen storage capacity distribution within the dataset: 62.5% of molecules have a capacity between 6 and 6.5%, and the next largest group (19.9%) has a capacity between 7 and 8%. The overwhelming majority of the dataset (95.5%) therefore exceeds the DOE’s storage capacity threshold by at least 0.5 percentage points (that is, reaches at least 6% wt. H2), affirming the dataset’s relevance in sourcing effective LOHCs. Representative reactions, marked as Points A, B, C, and D in Fig. 4a, fall within the desired ∆H range and are depicted in Fig. 4c with varying hydrogen storage capacities. These points are characterized by diverse hydrogenation sites, including six-membered nitrogen-containing rings, carbonyl groups, azides, and five-membered rings with nitrogen and oxygen, each contributing to the molecular variety suitable for hydrogenation (Fig. 5).

Fig. 5

Plots of selected features versus dehydrogenation enthalpies in the QM9-LOHC dataset (top left: number of oxygen atoms, top right: number of nitrogen atoms, bottom left: heavy atom count, bottom right: number of aromatic rings).