Background & Summary

History shows that access to high-quality, well-organized data can significantly advance a field. The ImageNet1 dataset is a prominent example: it provided a benchmark for image classification and supported the introduction of groundbreaking architectures such as AlexNet2, VGG3, and ResNet4. Electronic structure and property calculations have become essential in modern materials and drug discovery research and development (R&D) portfolios. Quantum mechanical (QM) methods such as Coupled Cluster (CC) and multi-configurational self-consistent field (MCSCF) offer the most accurate data but are computationally intensive. Density Functional Theory (DFT) offers a better compromise between accuracy and efficiency; however, its computational requirements still make it unsuitable for large-scale drug screening. A central challenge in modern theoretical chemistry is to develop and implement approximations that accelerate QM methods while maintaining accuracy. Recent advances in machine learning (ML) have proven very useful in addressing this challenge: ML can either minimize the need for extensive QM calculations or bypass them altogether5. However, the performance of ML methods, including graph neural networks (GNNs)6,7, large language models (LLMs)8,9, and generative models10, is heavily influenced by the size and quality of the training data. Whether currently available QM datasets provide the size and quality required for ML applications is questionable. We believe that a high-quality dataset will catalyze the application of ML techniques for predicting QM properties, coordinates, and bond strengths.

Despite its widespread use in drug discovery and materials science, the QM911 dataset has limitations. Composed solely of small molecules with a maximum of nine heavy atoms (C, O, N, and F), it fails to represent the full spectrum of chemical complexity in real-world applications, particularly drug discovery, where molecules are often much larger. QMugs12 offers a vast collection of drug-like molecules (over 665,000) and handles structures with up to 100 atoms, but these molecules were optimized at a semi-empirical level of theory, which is less computationally expensive but potentially less accurate. Table 1 summarizes selected QM datasets currently available. In addition, an analysis of about 2,600 FDA-approved drugs13 showed that the QM9 molecules capture only 10% of drug-relevant space, whereas molecules with up to 40 heavy atoms encompass 88%, as illustrated in Fig. 1b. We have therefore designed the QM40 database, which considerably expands the QM9 chemical space by incorporating molecules with up to 40 heavy atoms and by adding S and Cl to the element set (C, O, N, S, F, and Cl), making it a valuable training set for ML tasks predicting various QM parameters, as depicted in Fig. 1a. The QM40 dataset includes 162,954 molecules originally obtained from the ZINC14 dataset, which contains nearly 700 million drug-like molecules.

Table 1 QM Dataset Details: number of molecules, number of heavy atoms and level of theory.
Fig. 1 Statistical analysis of 2,584 FDA-approved drugs (DrugCentral 2023)44: (a) distribution of heavy atoms, (b) distribution of heavy atom count.

QM calculations were performed at the B3LYP/6-31G(2df,p) level of theory, consistent with the QM9 and Alchemy datasets. This method was chosen as the best compromise between accuracy and efficiency, following recent suggestions in the literature15,16,17. Because both datasets were generated with the same method, QM40, which covers molecules with more than 10 heavy atoms, can be seamlessly combined with QM9, which covers molecules with up to nine heavy atoms. Notably, QM40 offers a new feature: our local vibrational mode force constants as a quantitative measure of bond strength18,19. Normal vibrational modes are generally delocalized because of kinematic and electronic coupling20,21. A given normal vibrational mode cannot always be associated with an isolated bond because it can couple with stretching, bending, or torsional motions of other molecular fragments. This coupling hinders both a direct relationship between the normal stretching frequency (or the associated normal mode force constant) and bond strength, and the comparison of stretching modes in related molecules. Konkoli and Cremer addressed this problem by solving mass-decoupled Euler–Lagrange equations22,23,24, thereby introducing Local Vibrational Mode Theory. In particular, the local mode force constants ka have qualified as a quantitative measure of bond strength for both covalent bonds25,26,27,28 and weak chemical interactions29,30,31. The QM40 dataset is continuously updated with additional molecules and features. New information can be found in our Figshare repository32 and on our GitHub page (QM40 dataset for ML).

In this data descriptor, we provide an extended dataset beyond QM9 that accommodates up to 40 heavy atoms and thereby represents 88% of the FDA-approved drug chemical space, offering a closer reflection of drug-like chemical space. It also includes bond strength data for all bonds within the dataset. We therefore anticipate that the QM40 dataset will establish itself as a new standard benchmark for evaluating current and future methods in machine-learned potentials. Even more significantly, it provides a robust foundation for developing future general-purpose machine-learned potentials. The dataset offers a substantial head start on data generation, and its capabilities can be further enhanced by incorporating existing or future datasets that cover additional relevant regions of chemical space.

Methods

QM calculations

All electronic structure calculations, including geometry optimizations and frequency calculations, were carried out using the B3LYP/6-31G(2df,p) level of theory in the Gaussian1633 package. Local mode force constants were calculated with our LModeA34 software package and local vibrational mode parameters were automatically generated using our LModeAGen protocol35.

Molecular geometry generation

The QM40 dataset is a carefully chosen subset of molecules from the ZINC database, designed for drug discovery applications. To achieve this focus, QM40 excludes anions and cations and considers only neutral molecules with a maximum of 40 heavy atoms composed of C, N, O, S, F, and Cl. This choice of atom count and elements aligns with the analysis of FDA-approved drugs up to 2023. Figure 1 depicts the distribution of heavy atoms (a) and heavy atom count (b) in FDA-approved drugs.
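The selection criteria above can be sketched as a simple filter over SMILES strings. This is a minimal illustration, not the authors' code: it uses a small hand-written tokenizer instead of a cheminformatics toolkit, it assumes the 40-atom limit refers to heavy atoms, and it assumes that "neutral" means zero net formal charge (so a nitro group written as [N+]...[O-] passes). The names `parse_smiles_atoms` and `passes_selection` are hypothetical.

```python
import re

# Criteria quoted in the text; everything else in this sketch is an assumption.
ALLOWED_ELEMENTS = {"C", "N", "O", "S", "F", "Cl"}
MAX_HEAVY_ATOMS = 40

# A bracket atom such as [nH] or [N+], or an organic-subset atom outside brackets.
ATOM_TOKEN = re.compile(r"\[([^\]]+)\]|(Cl|Br|[BCNOPSFI]|[bcnops])")
# Inside a bracket: optional isotope, element symbol, then an optional charge.
BRACKET_ATOM = re.compile(r"\d*([A-Z][a-z]?|[a-z]{1,2})[^+-]*(?:([+-]+)(\d*))?")


def parse_smiles_atoms(smiles):
    """Return (heavy-atom element symbols, net formal charge) for a SMILES string."""
    elements, net_charge = [], 0
    for match in ATOM_TOKEN.finditer(smiles):
        bracket, plain = match.groups()
        if plain is not None:
            elements.append(plain.capitalize())  # aromatic 'c' becomes 'C'
            continue
        parsed = BRACKET_ATOM.match(bracket)
        if parsed is None:
            raise ValueError(f"cannot parse bracket atom [{bracket}]")
        symbol, signs, digits = parsed.groups()
        if signs:
            magnitude = int(digits) if digits else len(signs)
            net_charge += magnitude if signs[0] == "+" else -magnitude
        if symbol != "H":  # explicit hydrogens are not heavy atoms
            elements.append(symbol.capitalize())
    return elements, net_charge


def passes_selection(smiles):
    """True for a net-neutral molecule with 1-40 heavy atoms from C, N, O, S, F, Cl."""
    elements, net_charge = parse_smiles_atoms(smiles)
    return (
        net_charge == 0
        and 0 < len(elements) <= MAX_HEAVY_ATOMS
        and set(elements) <= ALLOWED_ELEMENTS
    )


if __name__ == "__main__":
    for smi in ["CC(=O)Nc1ccc(O)cc1", "C[N+](C)(C)C", "CCBr"]:
        print(smi, passes_selection(smi))
```

A production pipeline would more robustly rely on a toolkit such as RDKit for parsing; the regular expressions here only cover common SMILES syntax.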

Molecular SMILES strings from the ZINC database were converted into PDB files using RDKit36. This process incorporates atomic connectivity, atomic coordinates, and the addition of hydrogen atoms, resulting in charge-neutral singlet ground states. The initial geometries for the DFT calculations were obtained by pre-optimizing the structures with the extended tight-binding (xTB)37 method at the GFN2-xTB38 level of theory. Starting from the final optimized coordinates of the xTB calculations, DFT calculations were performed, followed by frequency calculations. LModeA calculations were performed for each molecule using the final checkpoint file generated by the corresponding frequency calculation. Any molecule that encountered a convergence failure, imaginary frequencies, or unphysical LModeA parameters at any stage was excluded from the dataset. Figure 2 illustrates the data generation workflow.
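The exclusion rule at the end of this workflow can be sketched as follows. This is an illustrative sketch only: the `CalculationResult` record and its field names are hypothetical, the numbers are toy values, and it assumes that imaginary frequencies are stored as negative numbers, as is common in quantum chemistry output.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class CalculationResult:
    """Hypothetical per-molecule summary of the workflow outputs."""
    smiles: str
    converged: bool                # did all optimization steps converge?
    frequencies_cm1: List[float]   # imaginary modes stored as negative values
    lmodea_unphysical: bool        # LModeA reported an unphysical parameter


def rejection_reason(result: CalculationResult) -> Optional[str]:
    """Return why a molecule is excluded, or None if it is kept."""
    if not result.converged:
        return "convergence failure"
    if any(freq < 0.0 for freq in result.frequencies_cm1):
        return "imaginary frequency"
    if result.lmodea_unphysical:
        return "unphysical LModeA parameter"
    return None


def filter_results(
    results: List[CalculationResult],
) -> Tuple[List[CalculationResult], Dict[str, str]]:
    """Split results into kept molecules and a {smiles: reason} map of rejections."""
    kept, rejected = [], {}
    for result in results:
        reason = rejection_reason(result)
        if reason is None:
            kept.append(result)
        else:
            rejected[result.smiles] = reason
    return kept, rejected


if __name__ == "__main__":
    toy = [
        CalculationResult("CCO", True, [250.0, 1100.0, 3000.0], False),
        CalculationResult("CC=O", True, [-45.0, 900.0], False),
        CalculationResult("CCN", False, [], False),
    ]
    kept, rejected = filter_results(toy)
    print([r.smiles for r in kept], rejected)
```

Recording the rejection reason per molecule, rather than silently dropping entries, makes it easier to audit how many structures were lost at each stage.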

Fig. 2 Scheme for generating optimized QM parameters, geometry and local vibrational mode frequencies of 162,954 molecules from the ZINC database.

Data Records

The QM40 dataset is archived in CSV file format and is publicly available through a Figshare data repository32. The dataset is organized into three separate sets of CSV files. The core information resides in the “QM40 Main Dataset” CSV file, which contains 162,954 SMILES strings and the corresponding QM parameters; these parameters are detailed in Table 2. The “QM40 xyz Dataset” stores the initial and optimized Cartesian coordinates of each molecule alongside Mulliken charges. The third file, the “QM40 bond Dataset”, contains the bond information with local mode force constants for every bond within each molecule. The QM40 xyz and bond datasets are further detailed in Tables 3 and 4, respectively.
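As a minimal sketch of how the three files relate to each other, the snippet below joins per-molecule properties, atoms, and bonds on a shared molecule identifier. The column names (`mol_id`, `smiles`, `energy`, `atom`, `ka`, and so on) and all values are invented placeholders, not the actual QM40 schema; the real column definitions are those given in Tables 2 to 4.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for the three CSV files; columns and values are invented.
MAIN_CSV = """mol_id,smiles,energy
m1,CO,-1.0
m2,CC,-2.0
"""
XYZ_CSV = """mol_id,atom,x,y,z
m1,C,0.0,0.0,0.0
m1,O,1.4,0.0,0.0
m2,C,0.0,0.0,0.0
m2,C,1.5,0.0,0.0
"""
BOND_CSV = """mol_id,atom_i,atom_j,ka
m1,0,1,5.0
m2,0,1,4.0
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def group_by_molecule(rows, key="mol_id"):
    """Collect rows that share the same molecule identifier."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return grouped


def assemble(main_rows, xyz_rows, bond_rows):
    """Build one record per molecule with its properties, atoms, and bonds."""
    atoms = group_by_molecule(xyz_rows)
    bonds = group_by_molecule(bond_rows)
    return {
        row["mol_id"]: {
            "properties": row,
            "atoms": atoms.get(row["mol_id"], []),
            "bonds": bonds.get(row["mol_id"], []),
        }
        for row in main_rows
    }


if __name__ == "__main__":
    dataset = assemble(read_rows(MAIN_CSV), read_rows(XYZ_CSV), read_rows(BOND_CSV))
    print(dataset["m1"]["properties"]["smiles"], len(dataset["m1"]["atoms"]))
```

For the real files, `io.StringIO(text)` would be replaced by an open file handle; note that `csv.DictReader` returns all values as strings, so numeric columns need explicit conversion.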

Table 2 Calculated properties at the B3LYP/6-31G(2df,p) level of theory.
Table 3 Calculated geometry at the B3LYP/6-31G(2df,p) level of theory.
Table 4 Calculated local vibrational mode force constants at the B3LYP/6-31G(2df,p) level of theory.

Technical Validation

Validation of geometric consistency

Geometry optimization of structures initially derived from SMILES strings can lead to changes that alter the type of molecule, causing inconsistencies between the optimized geometry and the original SMILES code. To address this, the consistency of the B3LYP optimization in the dataset was verified using LModeA to check for unphysical parameters. LModeA input files were generated from connectivity information derived from the initial geometry in the PDB files. The LModeA package then uses this connectivity information to retrieve optimized data from the formatted checkpoint file (FCHK) for local vibrational mode analysis. If the connectivity in the LModeA input file does not match that derived from the optimized FCHK file, the LModeA package returns the message “Unphysical parameter was detected.” Molecules with unphysical parameters were removed from the dataset, as they represent structures that do not correspond to the original molecule. Figure 2 provides a graphical representation of this procedure.
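The idea behind this check, comparing the expected connectivity with the connectivity of the optimized structure, can be illustrated with a purely geometric stand-in. This sketch does not reproduce LModeA's internal test, which is not described here: it assumes a simple distance criterion in which two atoms are bonded when their separation is at most 1.25 times the sum of their approximate covalent radii. The scale factor, function names, and coordinates are assumptions for illustration.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstrom; the 1.25 scale factor is an assumption.
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66,
                  "F": 0.57, "S": 1.05, "Cl": 1.02}


def bonds_from_geometry(symbols, coords, scale=1.25):
    """Infer bonds as atom-index pairs (i < j) from interatomic distances."""
    bonds = set()
    for i, j in combinations(range(len(symbols)), 2):
        cutoff = scale * (COVALENT_RADII[symbols[i]] + COVALENT_RADII[symbols[j]])
        if math.dist(coords[i], coords[j]) <= cutoff:
            bonds.add((i, j))
    return bonds


def connectivity_preserved(expected_bonds, symbols, optimized_coords):
    """True if the optimized geometry shows exactly the expected bonds."""
    expected = {tuple(sorted(bond)) for bond in expected_bonds}
    return expected == bonds_from_geometry(symbols, optimized_coords)


if __name__ == "__main__":
    symbols = ["O", "H", "H"]
    expected = [(0, 1), (0, 2)]  # connectivity taken from the initial structure
    intact = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
    broken = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (3.0, 0.0, 0.0)]
    print(connectivity_preserved(expected, symbols, intact))
    print(connectivity_preserved(expected, symbols, broken))
```

The check fails both when an expected bond is missing and when an unexpected bond has formed, which matches the goal of flagging optimized structures that no longer correspond to the original SMILES.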

Validation of quantum chemistry results

We modeled all 162,954 molecules at the B3LYP/6-31G(2df,p) level of DFT, in line with the methodology used for the QM9 dataset. We specifically focused on molecules containing more than 10 heavy atoms, which allows QM9 and QM40 to be concatenated into a combined dataset of approximately 300k molecules. The B3LYP/6-31G(2df,p) level was previously validated against higher-level methods (G4MP2, G4, and CBS-QB3) in the QM9 study.

Validation of the QM40 chemical space

The chemical space of QM40 was validated using two methods. The first method involved dividing the QM40 dataset into six classes based on the number of heavy atoms per molecule: 10-15, 15-20, 20-25, 25-30, 30-35 and 35-40. The number of molecules in each class was then calculated and visualized in Fig. 3. As shown in the figure, nearly 26% of the molecules belong to the 10-15 atom range, followed by 21% in the 25-30 atom range. The 20-25 atom range has the smallest representation, at 7%. Despite these variations, all classes contain over 12,000 molecules.
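The class breakdown described above can be sketched as a small binning routine. Because the stated ranges share their end points (15 appears in both 10-15 and 15-20), this sketch assumes each class excludes its lower edge and includes its upper edge; that convention, the function names, and the sample counts are assumptions.

```python
from collections import Counter

# Class edges from the text; each class is taken as (low, high], an assumption.
CLASS_EDGES = [10, 15, 20, 25, 30, 35, 40]


def size_class(n_heavy):
    """Map a heavy-atom count to its class label, e.g. 12 -> '10-15'."""
    for low, high in zip(CLASS_EDGES, CLASS_EDGES[1:]):
        if low < n_heavy <= high:
            return f"{low}-{high}"
    raise ValueError(f"{n_heavy} heavy atoms is outside the covered range")


def class_shares(heavy_atom_counts):
    """Return the percentage of molecules in each class, in class order."""
    counts = Counter(size_class(n) for n in heavy_atom_counts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no molecules supplied")
    labels = [f"{low}-{high}" for low, high in zip(CLASS_EDGES, CLASS_EDGES[1:])]
    return {label: 100.0 * counts[label] / total for label in labels}


if __name__ == "__main__":
    toy_counts = [11, 15, 16, 40]  # invented heavy-atom counts
    for label, share in class_shares(toy_counts).items():
        print(f"{label}: {share:.1f}%")
```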

Fig. 3 Composition of the QM40 dataset by number of heavy atoms.

To further validate the chemical space of QM40, the dataset was split into 16 subsets based on specific bond types (CC, CH, OH, NH, etc., detailed in Table 5). For each bond type, we calculated the number of molecules containing that bond, the total number of such bonds, and the maximum, minimum, average, and standard deviation of the local vibrational stretching force constant. This analysis confirms the consistency of the data with respect to the bond types present in the geometries of the dataset. It also verifies that all bonds are formed exclusively by the elements C, O, N, S, Cl, F, and H. Furthermore, the three largest maximum bond strengths were found for NN triple bonds, CN triple bonds, and CC triple bonds, consistent with their experimental bond dissociation enthalpy values27,39. Conversely, the analysis revealed a low prevalence of SS, NF, and SH bonds in the QM40 dataset, suggesting a natural scarcity of these bond types in drug-like compounds40.
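The per-bond-type summary described above can be sketched as a simple grouped aggregation. The input layout (tuples of molecule identifier, bond type, and local mode force constant), the function name, and the toy values are assumptions; the sketch also assumes the population standard deviation, since the text does not state which definition was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def bond_type_statistics(rows):
    """Summarize local mode force constants per bond type.

    rows: iterable of (mol_id, bond_type, ka) tuples.
    """
    values = defaultdict(list)    # bond type -> list of force constants
    molecules = defaultdict(set)  # bond type -> molecules containing it
    for mol_id, bond_type, ka in rows:
        values[bond_type].append(float(ka))
        molecules[bond_type].add(mol_id)
    return {
        bond_type: {
            "n_molecules": len(molecules[bond_type]),
            "n_bonds": len(ka_values),
            "max": max(ka_values),
            "min": min(ka_values),
            "mean": mean(ka_values),
            "std": pstdev(ka_values),  # population standard deviation (assumed)
        }
        for bond_type, ka_values in values.items()
    }


if __name__ == "__main__":
    toy_rows = [  # invented values, not taken from QM40
        ("A", "CC", 4.0),
        ("A", "CC", 6.0),
        ("A", "CH", 5.0),
        ("B", "CH", 5.0),
    ]
    for bond_type, stats in bond_type_statistics(toy_rows).items():
        print(bond_type, stats)
```

Tracking the set of molecule identifiers separately from the list of values keeps the two counts distinct: a molecule with several CC bonds contributes once to the molecule count but several times to the bond count.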

Table 5 Statistical analysis of QM40 bond types using the local vibrational mode force constant (ka) as a measure of bond strength.

Usage Notes

QM40 is accompanied by a GitHub repository. The repository includes a user-friendly Python application for generating the dataset, which can be installed with the pip package manager. In addition, the repository offers a Python module designed for interacting with the QM40 data. This module provides functionality for navigating the QM properties, geometries, and bond information, extracting specific information, and downloading subsets of interest. To ensure smooth exploration and use, the repository comes with a README file and tutorials that detail technical specifications and include usage examples.