Quantum mechanical electronic and geometric parameters for DNA k-mers as features for machine learning

Masuda, Kairi; Abdullah, Adib A.; Pflughaupt, Patrick; Sahakyan, Aleksandr B.

doi:10.1038/s41597-024-03772-5

Data Descriptor
Open access
Published: 22 August 2024

Kairi Masuda¹,
Adib A. Abdullah¹,
Patrick Pflughaupt¹ &
Aleksandr B. Sahakyan ORCID: orcid.org/0000-0002-8343-3594¹

Scientific Data volume 11, Article number: 911 (2024)

Abstract

We are witnessing a steep increase in model development initiatives in genomics that employ high-end machine learning methodologies. Of particular interest are models that predict certain genomic characteristics based solely on DNA sequence. These models, however, treat the DNA as a mere collection of four, A, T, G and C, letters, dismissing the past advancements in science that can enable the use of more intricate information from nucleic acid sequences. Here, we provide a comprehensive database of quantum mechanical (QM) and geometric features for all the permutations of 7-meric DNA in their representative B, A and Z conformations. The database is generated by employing the applicable high-cost and time-consuming QM methodologies. This can thus make it seamless to associate a wealth of novel molecular features to any DNA sequence, by scanning it with a matching k-meric window and pulling the pre-computed values from our database for further use in modelling. We demonstrate the usefulness of our deposited features through their exclusive use in developing a model for A->C mutation rates.

Background & Summary

Machine learning techniques are now being actively pursued in all fields, including genomics. The main driver of this phenomenon is the rapid development in computer performance and efficiency as predicted by Moore’s Law¹. The explosion of digital data necessary to feed the machine learning algorithms has also been a major contributor to the ever-increasing adoption of machine learning. In the field of genomics, big data, especially from high-throughput sequencing technologies, have been utilised to develop many successful machine learning-based models to address various biological problems². Models were built to predict G-quadruplex formation³, effective gene expression⁴, splicing events⁵,⁶, specificities of DNA- and RNA-binding proteins⁷, effects of non-coding variants⁸, epigenomic profiles⁹, transcription factor binding¹⁰, regulatory code of the accessible genome¹¹, DNA methylation¹², cancer driver genes¹³, and many other biological phenomena.

Most of these works focus on the underlying oligonucleotide letter strings to devise the features for machine learning, requiring a massive amount of data to decipher and extract information from nucleic acid sequences. These sequence-based initiatives, despite being successful, still overlook a decade’s worth of information and advancement accumulated on the inference of physicochemical properties of oligonucleotides in their varying sequence context and structure. Many such properties are commonly calculable from molecular modelling, at coarse and atomistic scales via molecular mechanics (MM) and quantum mechanical (QM) electronic techniques. To address this limitation, we propose incorporating features that capture the underlying electronic properties, as most molecular characteristics can be considered as derivatives of, and hence ultimately determined by, the underlying electronic/QM characteristics. Although computationally expensive, electronic, energetic, and structure-based calculations provide highly accurate results of molecular behaviours¹⁴,¹⁵ and hence have the potential to complement standard bioinformatics techniques based on genomic sequence analysis¹⁶. Thus, by pre-calculating electronic properties for different sequence contexts, we can parameterise and integrate them into the feature generation stage of machine learning models, still keeping the model purely sequence-based in terms of the required primary information.

Electronic properties can explain important DNA behaviours. For example, guanine is susceptible to oxidative damage, which has been associated with its low ionisation potential, facilitating electron loss and oxidation¹⁷,¹⁸. Prior works have also shown that the increased propensity to damage by reactive oxygen species (ROS) at the 5′ guanine within guanine-guanine dyads is driven by the reduced ionisation potential¹⁹. Moreover, the formation of 8-oxo-guanine at Z-DNA sites, Z being one of the three DNA conformations for which we have calculated the electronic properties, has been shown to modulate gene regulation through the electronic properties of the nucleobases and their sequence context²⁰. As such, incorporating these electronic properties can potentially benefit predictive modelling tasks and improve the interpretability of the subsequent model decision-making processes. This motivates our work on developing purely sequence-driven electronic properties for standard nucleic acid conformations.

To make the above possible, here we performed large-scale semi-empirical QM calculations for all possible DNA heptamers in their three representative conformations, B, A and Z (Fig. 1). We began with DNA 3D model development and geometry optimisation, followed by semi-empirical calculations. A number of geometric values of the optimised models were also measured and included as part of the dataset. The outcomes of these calculations can be supplied as feature values to machine learning algorithms, where their relevance and information content will be assessed accordingly during the learning process. As a proof of concept, we briefly demonstrate that our DNA heptamer semi-empirical properties, along with their geometric measurements, were able to predict A to C spontaneous mutation rates²¹ from DNA sequence when they were applied as the sole machine learning features with no direct sequence encoding.

We believe our deposited data presented in this work, comprised of 3D geometric and physicochemical properties of 24,576 non-redundant DNA 7-mer duplex structures (not counting the shorter-span 6-mer analogues) in B, A and Z conformations (Fig. 1), will be useful and applicable to enhance the machine learning process in solving DNA-driven biological problems. Such a dataset is unique for varying oligonucleotides, even though databases of many MM and QM properties exist for other types of molecules, many reported in the same, Scientific Data, journal. Examples of other QM-based datasets published in the same spirit are physico-chemical properties of 31,618 electroactive molecules for the development of aqueous redox flow batteries²², optimised molecular geometries and thermodynamic data of more than 665,000 biologically and pharmacologically relevant molecules²³, electronic charge density of crystalline materials from Materials Project database²⁴, molecular conformations of 450,000 small- and mid-sized organic molecules²⁵, molecular geometries and spectral properties of 61,489 crystal-forming organic molecules²⁶, equilibrium conformations for small organic molecules²⁷, QM calculations of over 200,000 organic radical species and 40,000 associated closed-shell molecules²⁸, all-atom force-field parameters, molecular dynamics trajectories, QM properties, and curated physicochemical descriptors of more than 300 antimicrobial compounds²⁹, excited state information of 173,000 organic molecules³⁰, conformational energies and geometries of di- and tripeptides³¹, and QM structures and properties of 134,000 small organic molecules³².

Methods

The schematic diagram of the generation of the dataset is shown in Fig. 2. The procedure is comprised of three stages: the building of the all-atom DNA models (a), geometry optimisation (b), and feature extraction with the corresponding single-point calculations (c). In the following subsections, we describe the details of each stage. The calculations were performed on the available Linux computing clusters hosted at the MRC Weatherall Institute of Molecular Medicine, University of Oxford (256 GB of RAM, dual Intel Xeon E5-2680v3 CPUs with 24 physical cores per node), and on our laboratory workstation (512 GB of RAM, Intel Xeon W-2295 CPUs with 18 physical cores). The dataset can be accessed through our GitHub page at https://github.com/SahakyanLab/DNAkmerQM or from Zenodo³³. The R programming language³⁴ was used as a front-end programming language in this work, and the code to generate the dataset can be retrieved from https://github.com/SahakyanLab/NucleicAcidsQM.

All-atom model building for 7-mer DNA

As the maximum context span and the baseline in this work, the heptameric range of DNA was considered due to its known major influence on nucleotide and derivative properties²¹. However, our dataset also includes an analogue for the lesser, hexameric context, mainly generated for the use cases when an even-numbered range is needed for modelling. The DNA structures for all the k-mer sequence permutations in their B, A and Z conformations (Fig. 1) were generated by using the Nucleic Acid Builder (NAB) suite of programmes³⁵ (Fig. 2a). NAB provides a function that replaces base pairs of a given template structure with any desired base pair, without altering the geometries of the backbone and sugar moieties. As templates for B, A and Z conformations of double-stranded DNA, representative X-ray crystallographic structures³⁶ were used as adapted from the PDB files provided in WEB-3DNA³⁷ (Fig. 2a). The end moieties for the 1st and 7th positions in the DNA models were capped by hydrogen atoms (at the O5′ and O3′ positions of deoxyribose for the 5′ and 3′ ends respectively). The total number of all the permutations for 7-mer sequences of four bases is 4⁷ = 16,384. However, considering the strand symmetry of double-stranded DNA, we can reduce this number, since, for instance, the sequence 5′-AAAAATT-3′ has 5′-AATTTTT-3′ as a complementary strand, hence the double-stranded DNA model for 5′-AAAAATT-3′ is the same as that for 5′-AATTTTT-3′. We therefore generated 8,192 DNA models for each of the B, A and Z conformations, resulting in a total of 24,576 DNA models for heptamers. On average, one DNA model in our generated set contains 443 atoms (281 heavy, and 162 hydrogen atoms).
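The strand-symmetry count above can be checked with a short sketch. This is an illustration rather than the authors' code: the function names are hypothetical, and the choice of keeping the lexicographically smaller member of each k-mer/reverse-complement pair is an assumption made only to pick one representative per duplex.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in the 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def non_redundant_kmers(k: int) -> list:
    """Keep one representative per duplex.

    Assumption: the representative is the lexicographically smaller of a
    k-mer and its reverse complement (a convention chosen for this sketch).
    """
    kept = []
    for letters in product("ACGT", repeat=k):
        seq = "".join(letters)
        if seq <= reverse_complement(seq):
            kept.append(seq)
    return kept


heptamers = non_redundant_kmers(7)
print(reverse_complement("AAAAATT"))  # AATTTTT
print(len(heptamers))                 # 8192 duplexes per conformation
print(3 * len(heptamers))             # 24576 models over B, A and Z
```

For odd k no sequence equals its own reverse complement, because the central base would have to pair with itself, so the count is exactly half of 4⁷. For even k, such as the hexamers, self-complementary sequences exist and the reduction is slightly less than half.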

Molecular mechanics optimisation of DNA

The above DNA structures were then geometry optimised via a molecular mechanics (MM) force field, by using AmberTools21³⁸ (Fig. 2b). The OL15 force field, specifically tuned and well tested for DNA³⁹, was used. To account for the shielding of the negatively charged phosphate backbones, Born implicit solvation⁴⁰ was used with the water environment dielectric constant defaulted to 78.5 in AmberTools21. We needed to relax the geometries in order to remove any tension and unrealistic arrangements upon NAB-driven base replacements. However, we still wanted to preserve the conformations in their desired original B, A or Z state. The electronic parameters provide a useful approximation rather than a definitive representation of all possible geometric states. As such, they serve as a foundational set of electronic parameters for the standard nucleic acid conformations, potentially emulating a wealth of derivative information useful for machine learning. We therefore had to pick an appropriate number of allowed optimisation steps. Fig. S1a1 shows the relationship between the root mean squared deviation (RMSD) of 5′-AAAAAAA-3′ B-DNA from its original NAB-generated structure and the number of optimisation steps. The RMSD calculation was done using the Bio3D library⁴¹ in R, based only on non-hydrogen atoms. We found that the RMSD converged within 5000 steps. Furthermore, no major conformational change or strand separation was observed in the structures before and after the convergence (Fig. S1a2). Figure S1a3 shows the RMSDs for all the heptamers of B-DNA. Note that the sequences are numbered lexicographically, that is, 5′-AAAAAAA-3′=1, 5′-AAAAAAC-3′=2, 5′-AAAAAAG-3′=3, and so on. The results show that the RMSD values for all modelled sequences are at around 1.0 Å (with an average of 0.72 Å and a standard deviation of 0.03 Å). The same is true for the A and Z conformations of DNA (see Fig. S1b,c).
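The lexicographic numbering mentioned above can be reproduced with a base-4 encoding, sketched below. The function names are hypothetical, and the sketch assumes the numbering runs over the full set of 4ᵏ sequences in A < C < G < T order; the supplementary figures may instead number only the 8,192 modelled duplexes, which the text does not specify.

```python
ALPHABET = "ACGT"  # lexicographic order of the four bases


def lexicographic_rank(seq: str) -> int:
    """1-based position of a sequence in the sorted list of all 4**k k-mers."""
    rank = 0
    for base in seq:
        rank = rank * 4 + ALPHABET.index(base)
    return rank + 1


def sequence_from_rank(rank: int, k: int) -> str:
    """Inverse of lexicographic_rank for k-mers of length k."""
    value = rank - 1
    bases = []
    for _ in range(k):
        value, digit = divmod(value, 4)
        bases.append(ALPHABET[digit])
    return "".join(reversed(bases))


print(lexicographic_rank("AAAAAAA"))  # 1
print(lexicographic_rank("AAAAAAC"))  # 2
print(lexicographic_rank("AAAAAAG"))  # 3
print(sequence_from_rank(16384, 7))   # TTTTTTT
```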

Semi-empirical quantum mechanics optimisation of DNA

We further optimised the DNA structures through quantum mechanics (QM) (Fig. 2b) by using the PM6-DH+ semi-empirical Hamiltonian under the restricted Hartree-Fock (RHF) approach, as implemented in the MOPAC2016 programme⁴². PM6-DH+, with its correction for dispersion interactions, benefits from the relatively low cost of semi-empirical QM methods and has successfully reproduced electronic properties of many systems as accurately as the costly QM methods⁴³. The water environment was accounted for through the intrinsic solvation with the Conductor-like Screening Model (COSMO)⁴⁴. The COSMO default dielectric constant of 78.4 was used for water, as implemented in MOPAC2016. For the termination of QM optimisation, we used the energy gradient criterion in MOPAC, rather than limiting the number of optimisation steps. Fig. S2a1 shows the RMSD of 5′-AAAAAAA-3′ B-DNA from its initial state, as a function of the energy gradient cutoff used to optimise the system before taking the structural snapshot. Similar to the MM case discussed above, we found that the RMSD plateaus at around 1.0 Å even if a strict convergence criterion is applied. Fig. S2a2 shows the B-DNA structure before and after optimisation until the energy gradient drops below 1.0 kcal/(mol ⋅ Å). No substantial conformational change, such as separation of strands, was observed upon such optimisation, keeping the structures within the designated B conformation. On the other hand, 10.0 kcal/(mol ⋅ Å) is recommended as the energy gradient criterion for large systems (http://openmopac.net/manual/gnorm.html), such as our heptameric double-stranded DNAs. Therefore, we adopted the maximum gradient of 10.0 kcal/(mol ⋅ Å) as our convergence criterion for the QM geometry optimisation. Similar to the MM case, low RMSD values were observed for all our DNA sequences in their B conformation (Fig. S2a3, with an average of 1.00 Å and a standard deviation of 0.13 Å). The same was true for the A and Z conformations (Fig. S2b,c).

Feature calculation and extraction

We next extracted the electronic and structural features from the obtained refined DNA structures (Fig. 2c1). For the electronic features, we conducted further single-point QM calculations on the optimised duplex B-, A- and Z-DNA structures. We used the same PM6-DH+ (RHF) with COSMO solvation, but with additional keywords to request more detailed outputs and a full listing of electronic parameters. From these calculations, as general features for each DNA model, heat of formation (Ehof), ionisation potential (IP), dipole moment, highest occupied molecular orbital (HOMO) energy, and lowest unoccupied molecular orbital (LUMO) energy were extracted. We also extracted Mulliken charges and populations for the constituent atoms. Since treating charges for all atoms is not realistic for many machine learning setups, we calculated the summarised maximum, minimum, and mean values of the charges and population density values for each of the base, sugar, and phosphate moieties from the 1st to the 7th nucleotides at both + and − strands. In the same manner, we extracted electrostatic potential fitted (ESP) charges and populations⁴⁵ and calculated the maximum, minimum, and mean values. Geometric parameters were calculated via Curves+ software⁴⁶, upon which inter- and intra-strand parameters were extracted for the base pair arrangements and backbone angles respectively. Further features were obtained by conducting additional single-point calculations for each strand separately, by masking the other strand in the optimised duplex B-, A- and Z-DNA structures (Fig. 2c2). The difference of Ehof between the duplex and the two separate single strands was calculated as a simple proxy for DNA hybridisation energy. We also considered the duplex state with the 4th central base removed and replaced by a hydrogen cap. The single-point calculation was conducted after optimising only this hydrogen position, with a stricter 1.0 kcal/(mol ⋅ Å) maximum energy gradient for the convergence criterion. Then, we calculated the difference of Ehof between the complete duplex and the base-removed states, as illustrated in Fig. 2c2. This difference should be related to how the central base is stabilised, through the stacking and hydrogen bonding interactions, within the context of the whole DNA sequence.
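The per-moiety summarisation of atomic charges described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the input record layout, and the toy charge values are hypothetical, and only the `phos` and `mean` labels and the overall name pattern are taken from the naming example given later in the text; the `max` and `min` labels are assumed by analogy.

```python
from collections import defaultdict
from statistics import mean


def summarise_charges(atoms, conformation="B", label="MullikenCharge"):
    """Collapse per-atom charges into max/min/mean per strand, position and moiety.

    `atoms` is an iterable of (strand, position, moiety, charge) records,
    a layout assumed for this sketch.
    """
    groups = defaultdict(list)
    for strand, position, moiety, charge in atoms:
        groups[(strand, position, moiety)].append(charge)

    features = {}
    for (strand, position, moiety), values in groups.items():
        prefix = f"{conformation}_ds_strand{strand}_{position}_{moiety}"
        features[f"{prefix}_max.{label}"] = max(values)
        features[f"{prefix}_min.{label}"] = min(values)
        features[f"{prefix}_mean.{label}"] = mean(values)
    return features


# Made-up charges for the phosphate group of the 4th plus-strand nucleotide.
toy_atoms = [
    ("Plus", 4, "phos", 1.0),
    ("Plus", 4, "phos", -0.5),
    ("Plus", 4, "phos", -0.5),
    ("Plus", 4, "phos", -1.0),
]
toy_features = summarise_charges(toy_atoms)
for name, value in sorted(toy_features.items()):
    print(name, value)
```

Reducing each moiety to three numbers keeps the feature count fixed across sequences, whereas the number and identity of atoms change with the base at each position.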

Overall calculation costs

For our 24,576 7-mer models, the MM optimisations utilised ~650 hours of CPU time (on average 95.2 seconds per model). The QM optimisations took ~11,052 CPU hours (averaging 1,618.9 seconds per model). For the subsequent single-point calculations, it took on average 74.1 seconds per model in CPU time. Since there were five such calculations per model, it took an overall 2,529 CPU hours. This amounted to 593 CPU days of calculations, which we were able to conduct within about 3 months by utilising up to six computing nodes.
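The totals quoted above follow from the per-model averages, as the short check below shows. The variable and function names are illustrative; the numbers are those given in the text.

```python
N_MODELS = 24_576      # 8,192 heptamer duplexes in each of B, A and Z
MM_SECONDS = 95.2      # average MM optimisation time per model
QM_SECONDS = 1_618.9   # average QM optimisation time per model
SP_SECONDS = 74.1      # average time of one single-point calculation
SP_RUNS = 5            # single-point calculations per model


def cpu_hours(seconds_per_model, runs=1):
    """Total CPU hours over all models for a given per-model cost."""
    return N_MODELS * seconds_per_model * runs / 3600


mm = cpu_hours(MM_SECONDS)
qm = cpu_hours(QM_SECONDS)
sp = cpu_hours(SP_SECONDS, SP_RUNS)
total_days = (mm + qm + sp) / 24

print(f"MM {mm:.0f} h, QM {qm:.0f} h, single-point {sp:.0f} h, total {total_days:.0f} days")
# MM 650 h, QM 11052 h, single-point 2529 h, total 593 days
```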

Data Records

File description

Table 1 shows the summary of the dataset obtained via the above procedure. Our dataset is comprised of 7 deposited dataset files for each k-mer range. The units of features are described in parentheses. The file “energy.txt” includes the overall parameters for the double-helical DNA in its B, A and Z states, that is Ehof (kcal/mol), dipole moment (debye), HOMO and LUMO energies (eV), and IP (eV). The file “denergy.txt” includes differences of Ehof upon de-hybridisation of the DNA, and the central base removal, calculated for B, A and Z conformations (units are the same as in “energy.txt”). The file “Mullik_Charge.txt” includes the maximum, minimum, and mean Mulliken charge values (in e units, where e = +1.602177 × 10⁻¹⁹ C) at the base, sugar and phosphate moieties for each nucleotide position in B-, A- and Z-DNA. The file “Mullik_Density.txt” similarly includes the maximum, minimum, and mean values of the Mulliken population density (dimensionless). The files “ESP_Charge.txt” and “ESP_Density.txt” contain datasets for the electrostatic potential fitted charges and populations respectively (units are the same as Mulliken charge and population). The file “Curves.txt” includes intra- and inter-strand geometric parameters for our sequences at their B, A and Z states (Å and degree are the units of distance and angle). The described dataset files include all our sequences, one row of features per sequence, with the first column indicating the sequence in the lexicographic order.
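The intended use of these files, scanning a sequence with a k-meric window and pulling pre-computed values, can be sketched as below. Several details are assumptions made for illustration: the tab delimiter, the `seq` header label, the toy values, and the function names are hypothetical, and the deposited files should be inspected for their actual layout. The fallback to the reverse complement reflects the fact that only one strand of each duplex is listed; strand-specific features (plus versus minus) would additionally need swapping in that case, which this sketch does not do.

```python
import csv
import io

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, read in the 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def read_feature_table(handle, delimiter="\t"):
    """Map each sequence (first column) to a dict of its numeric features."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    table = {}
    for row in reader:
        table[row[0]] = {name: float(value) for name, value in zip(header[1:], row[1:])}
    return table


def scan_sequence(sequence, table, k=7):
    """Slide a k-wide window along the sequence and look up each k-mer."""
    rows = []
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Fall back to the reverse complement when the k-mer itself is not listed.
        key = kmer if kmer in table else reverse_complement(kmer)
        rows.append((start, kmer, table.get(key)))
    return rows


# Toy stand-in for a file such as "energy.txt"; the values are made up.
toy_file = io.StringIO(
    "seq\tB_ds.Ehof\n"
    "AAAAAAA\t-1.0\n"
    "AAAAAAC\t-2.0\n"
)
table = read_feature_table(toy_file)
for start, kmer, features in scan_sequence("AAAAAAAC", table):
    print(start, kmer, features)
```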

Table 1 File names, contents, and the number of features of the generated and deposited dataset.

Table 2 shows our naming rules used to identify the features in the dataset files. Examples are described for (1) energy and difference of energy: B_ds.Ehof means heat of formation energy of duplex B-DNA. B_ds.dEhof_ds_ss means the difference of heat of formation energies between duplex B-DNA and its single-strand states. (2) As an example of charges and populations: here, the plus strand is the strand that has a given sequence, and the minus strand is the complementary strand. For example, when we consider the 5′-AAAAAAA-3′ sequence, the plus strand is the strand that has AAAAAAA nucleotides, while the minus strand is the complementary strand that has TTTTTTT nucleotides. By this rule, B_ds_strandPlus_4_phos_mean.MullikenCharge means the mean value of the Mulliken charge of the phosphate part of the 4th nucleotide in the plus strand of B-DNA. (3) An example of geometric parameters: B_ds_strandPlus_4.Curves_Xdisp means the X displacement of the base of the 4th nucleotide in the plus strand of B-DNA. Note that the meaning of geometric parameters is well summarised in the 3DNA paper⁴⁷ and Fig. S3 in this work.
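The naming rule can be unpacked programmatically. The sketch below is inferred only from the three examples given above, so the function name, the returned field names, and the handling of any pattern not shown in those examples are assumptions rather than a specification of the dataset.

```python
def parse_feature_name(name):
    """Split a feature name into its parts, following the examples in the text."""
    context, _, quantity = name.partition(".")
    parts = context.split("_")
    parsed = {
        "conformation": parts[0],  # B, A or Z
        "state": parts[1],         # e.g. ds for the duplex
        "strand": None,
        "position": None,
        "moiety": None,
        "statistic": None,
        "quantity": quantity,
    }
    if len(parts) >= 4 and parts[2].startswith("strand"):
        parsed["strand"] = parts[2][len("strand"):]
        parsed["position"] = int(parts[3])
    if len(parts) == 6:
        parsed["moiety"], parsed["statistic"] = parts[4], parts[5]
    return parsed


for example in (
    "B_ds.Ehof",
    "B_ds.dEhof_ds_ss",
    "B_ds_strandPlus_4_phos_mean.MullikenCharge",
    "B_ds_strandPlus_4.Curves_Xdisp",
):
    print(parse_feature_name(example))
```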

Table 2 Explanations of the abbreviations for the feature names used in our dataset.

Technical Validation

Accuracy

As a proof of concept and demonstration of a potential application of our developed dataset in an actual DNA sequence-based machine learning initiative, below we demonstrate the exclusive use of DNAkmerQM features in predicting context-dependent spontaneous mutation rate constants for A to C mutation via machine learning. Our dataset and the quality of its features reflect the state-of-the-art semi-empirical QM methodology that can still be applied to such large molecular systems. The software and packages we used, including AmberTools21, MOPAC2016 and the R language, have a long history of development and validation in their respective publications. Furthermore, in the above analyses, we did not find any anomalous behaviour such as outlier values. Instead, we found that the trends in the dataset are consistent with conventional knowledge. For example, the trends for B- and A-DNA (both right-handed) are similar to each other but differ from those for Z-DNA (left-handed).

Applicability

To demonstrate the intended power of our dataset in an actual DNA sequence-based machine learning initiative, here we showcase the exclusive use of DNAkmerQM features in predicting context-dependent spontaneous mutation rate constants for A to C mutation via machine learning.

Construction of a dataset for machine learning

The Trek (transposon exposed k-meric mutation rate constants) dataset provides comprehensive sequence-dependent mutation rates for the human genome, as obtained from LINE-1 remnants²¹. We combined our QM datasets, which contain features related to B-DNA in “energy.txt”, “denergy.txt”, and “Mullik_Charge.txt” with A to C mutation rate constants (k_A→C) from the Trek dataset (Fig. 3a), resulting in 4096 samples with 102 features.

Development of a machine learning model

We divided this dataset into 80% for training (3279 samples) and 20% for pure test (817 samples). Next, we constructed tree-based Gradient Boosting Machine (GBM) models, by which decision trees are consecutively generated to predict the residual values of the ensemble of prior learner trees (Fig. 3b)⁴⁸,⁴⁹,⁵⁰. GBMs are known to exhibit superior performance, often surpassing that of neural network-based models for tabular data⁵¹,⁵². GBMs can be flexibly tuned through five hyperparameters (interaction depth, minimum child weight, bag fraction (sampling rate), learning rate, and the number of trees). These five hyperparameters are related to the overall architecture of the GBM model and drastically affect the performance. The development of GBMs thus involves a careful selection of the optimal combination of hyperparameters to achieve the best performance. For this, a three-step procedure was employed in this study. (1) Construction of preliminary GBMs with a reasonable initial parameter set. (2) Feature reduction: GBMs provide the importance of features, that is, how much each feature contributes to the performance of GBMs. Based on the feature importance values of the preliminary GBM model, we excluded features that contribute little to the model performance. This drastically reduced the computation cost of the following procedure. (3) Grid search: after the reduction of features, we developed GBMs with various combinations of the five hyperparameters. The performance of each model was measured by the root mean squared error (RMSE) from 10-fold cross-validation. We summarise the employed final hyperparameters (interaction depth = 11, the number of trees = 5000, learning rate = 0.01, bag fraction = 0.8, and minimum child weight = 5), along with all the sampled ranges, in Table S1.
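To illustrate how boosting fits each new tree to the residuals of the current ensemble, here is a deliberately small sketch in pure Python. It is not the model used in this work: it uses depth-1 trees (stumps) on a single made-up feature instead of an interaction depth of 11 over many features, it omits the minimum child weight, and all names and toy data are hypothetical. The learning rate, bag fraction, and number of trees play the roles described above.

```python
import random
from math import sqrt


def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    overall = sum(residuals) / len(residuals)
    best = (float("inf"), max(xs), overall, overall)  # fallback: no split
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_gbm(xs, ys, n_trees=200, learning_rate=0.1, bag_fraction=0.8, seed=0):
    """Boost stumps on residuals, each fitted to a random bag of the samples."""
    rng = random.Random(seed)
    baseline = sum(ys) / len(ys)
    predictions = [baseline] * len(ys)
    stumps = []
    bag_size = max(2, int(bag_fraction * len(xs)))
    for _ in range(n_trees):
        bag = rng.sample(range(len(xs)), bag_size)
        residuals = [ys[i] - predictions[i] for i in bag]
        threshold, left, right = fit_stump([xs[i] for i in bag], residuals)
        stumps.append((threshold, left, right))
        # Shrink each tree's contribution by the learning rate.
        predictions = [p + learning_rate * (left if x <= threshold else right)
                       for x, p in zip(xs, predictions)]
    return baseline, learning_rate, stumps


def predict(model, x):
    baseline, learning_rate, stumps = model
    return baseline + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps)


def rmse(true, predicted):
    return sqrt(sum((t - p) ** 2 for t, p in zip(true, predicted)) / len(true))


# Toy one-feature regression problem (made-up data), evaluated on the training set.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
model = fit_gbm(xs, ys)
baseline_rmse = rmse(ys, [model[0]] * len(ys))
boosted_rmse = rmse(ys, [predict(model, x) for x in xs])
print(f"baseline RMSE {baseline_rmse:.3f}, boosted RMSE {boosted_rmse:.3f}")
```

A smaller learning rate needs more trees to reach the same fit, which is why these two hyperparameters are usually tuned together in a grid search.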

Validation of the machine learning model

From the production level GBM model obtained through the above procedure, we predicted A to C mutation rates from the pure test set. Fig. 3c shows a scatter plot between true values in the pure test set and the predicted values by our GBM model. The predicted mutation rate constants agreed very well with the actual values (Pearson’s R = 0.89, RMSE = 0.034). This thus demonstrates the potential of our dataset to provide a wealth of physicochemical features, which, even while used as sole features, are capable of generating a sophisticated machine learning model for a range of DNA sequence-based biological phenomena.
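The two reported metrics can be computed as in the sketch below. The function names and the toy numbers are illustrative and are not the values from the test set.

```python
from math import sqrt


def pearson_r(true, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(true)
    mean_t = sum(true) / n
    mean_p = sum(predicted) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(true, predicted))
    var_t = sum((t - mean_t) ** 2 for t in true)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_t * var_p)


def rmse(true, predicted):
    """Root mean squared error between true and predicted values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(true, predicted)) / len(true))


# Toy data: predictions offset by a constant 0.5 from the true values.
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.5, 3.5, 4.5]
print(pearson_r(toy_true, toy_pred), rmse(toy_true, toy_pred))
```

The toy case shows why both metrics are reported together: a constant offset leaves the correlation perfect while the RMSE still registers the error.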

Usage Notes

Since the 1st and the 7th nucleotides are located at the edge of the DNA segments used in the modelling, their features should preferably be excluded when building machine learning models. For example, our machine learning model mentioned above shows a lower performance when we include these edge nucleotide data as features. Differences in IP, HOMO, and LUMO (ΔIP, ΔHOMO, and ΔLUMO in “denergy.txt”) do not have clear physical meanings when taken between different molecular systems, and thus these features should be avoided too. However, we include them in the dataset for the estimation of IP, HOMO, and LUMO for single strands and base-deleted states as detailed in the caption of Fig. 2.
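Excluding the edge-nucleotide features can be done by filtering on the position encoded in each feature name, as in the sketch below. The function name is hypothetical, and the `strandMinus` label is assumed by analogy with the `strandPlus` examples in the naming rules; the column names in the deposited files should be checked before relying on this pattern.

```python
import re

# Position-specific features carry "_strandPlus_<n>" or "_strandMinus_<n>".
POSITION = re.compile(r"_strand(?:Plus|Minus)_(\d+)")


def drop_edge_features(names, k=7):
    """Keep features that are not tied to the first or last nucleotide."""
    kept = []
    for name in names:
        match = POSITION.search(name)
        if match and int(match.group(1)) in (1, k):
            continue
        kept.append(name)
    return kept


toy_names = [
    "B_ds.Ehof",
    "B_ds_strandPlus_1_phos_mean.MullikenCharge",
    "B_ds_strandPlus_4_phos_mean.MullikenCharge",
    "B_ds_strandMinus_7_phos_mean.MullikenCharge",
]
print(drop_edge_features(toy_names))
```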

Data availability

The dataset is publicly available on GitHub (https://github.com/SahakyanLab/DNAkmerQM) and Zenodo³³ under the CC-BY license. The code required to generate the dataset is freely accessible under the CC-BY license at https://github.com/SahakyanLab/NucleicAcidsQM.

Code availability

All software and packages used in this study are freely distributed and available through the references cited in the text.

References

Schaller, R. Moore’s law: past, present and future. IEEE Spectrum 34, 52–59 (1997).
Angermueller, C., Pärnamaa, T., Parts, L. & Stegle, O. Deep learning for computational biology. Molecular Systems Biology 12, 878, https://doi.org/10.15252/msb.20156651 (2016).
Sahakyan, A. B. et al. Machine learning model for sequence-driven DNA G-quadruplex formation. Scientific Reports 7, 14535, https://doi.org/10.1038/s41598-017-14017-4 (2017).
Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods 18, 1196–1203, https://doi.org/10.1038/s41592-021-01252-x (2021).
Leung, M. K. K., Xiong, H. Y., Lee, L. J. & Frey, B. J. Deep learning of the tissue-regulated splicing code. Bioinformatics 30, i121–i129, https://doi.org/10.1093/bioinformatics/btu277 (2014).
Xiong, H. Y. et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806, https://doi.org/10.1126/science.1254806 (2015).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33, 831–838, https://doi.org/10.1038/nbt.3300 (2015).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nature Methods 12, 931–934, https://doi.org/10.1038/nmeth.3547 (2015).
Toneyan, S., Tang, Z. & Koo, P. K. Evaluating deep learning for predicting epigenomic profiles. Nature Machine Intelligence 1–13 https://doi.org/10.1038/s42256-022-00570-9 (2022).
Zheng, A. et al. Deep neural networks identify sequence context features predictive of transcription factor binding. Nature Machine Intelligence 3, 172–180, https://doi.org/10.1038/s42256-020-00282-y (2021).
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research 26, 990–999, https://doi.org/10.1101/gr.200535.115 (2016).
Angermueller, C., Lee, H. J., Reik, W. & Stegle, O. DeepCpG: Accurate prediction of single-cell DNA methylation states using deep learning. Genome Biology 18, 67, https://doi.org/10.1186/s13059-017-1189-z (2017).
Rogers, M. F., Gaunt, T. R. & Campbell, C. Prediction of driver variants in the cancer genome via machine learning methodologies. Briefings in Bioinformatics 22, bbaa250, https://doi.org/10.1093/bib/bbaa250 (2021).
Chmiela, S., Sauceda, H. E., Müller, K.-R. & Tkatchenko, A. Towards exact molecular dynamics simulations with machine-learned force fields. Nature Communications 9, 3887, https://doi.org/10.1038/s41467-018-06169-2 (2018).
Kirkpatrick, J. et al. Pushing the frontiers of density functionals by solving the fractional electron problem. Science 374, 1385–1389, https://doi.org/10.1126/science.abj6511 (2021).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589, https://doi.org/10.1038/s41586-021-03819-2 (2021).
Russo, N., Toscano, M. & Grand, A. Theoretical determination of electron affinity and ionization potential of DNA and RNA bases. Journal of Computational Chemistry 21, 1243–1250, https://doi.org/10.1002/1096-987X(20001115)21:14 (2000).
Close, D. M. Calculation of the ionization potentials of the DNA bases in aqueous medium. J. Phys. Chem. A 108, 10376–10379, https://doi.org/10.1021/jp046660y (2004).
Saito, I. et al. Photoinduced DNA cleavage via electron transfer: demonstration that guanine residues located 5′ to guanine are the most electron-donating sites. J. Am. Chem. Soc. 117, 6406–6407, https://doi.org/10.1021/ja00128a050 (1995).
Fleming, A. M., Zhu, J., Ding, Y., Esders, S. & Burrows, C. J. Oxidative modification of guanine in a potential Z-DNA-forming sequence of a gene promoter impacts gene expression. Chemical Research in Toxicology 32, 899–909, https://doi.org/10.1021/acs.chemrestox.9b00041 (2019).
Sahakyan, A. B. & Balasubramanian, S. Single genome retrieval of context-dependent variability in mutation rates for human germline. BMC Genomics 18, 1–17, https://doi.org/10.1186/s12864-016-3440-5 (2017).
Sorkun, E., Zhang, Q., Khetan, A., Sorkun, M. C. & Er, S. RedDB, a computational database of electroactive molecules for aqueous redox flow batteries. Scientific Data 9, 718, https://doi.org/10.1038/s41597-022-01832-2 (2022).
Isert, C., Atz, K., Jiménez-Luna, J. & Schneider, G. QMugs, quantum mechanical properties of drug-like molecules. Scientific Data 9, 273, https://doi.org/10.1038/s41597-022-01390-7 (2022).
Shen, J.-X. et al. A representation-independent electronic charge density database for crystalline materials. Scientific Data 9, 661, https://doi.org/10.1038/s41597-022-01746-z (2022).
Axelrod, S. & Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Scientific Data 9, 185, https://doi.org/10.1038/s41597-022-01288-4 (2022).
Stuke, A. et al. Atomic structures and orbital energies of 61,489 crystal-forming organic molecules. Scientific Data 7, 58, https://doi.org/10.1038/s41597-020-0385-y (2020).
Smith, J. S., Isayev, O. & Roitberg, A. E. ANI-1, A data set of 20 million calculated off-equilibrium conformations for organic molecules. Scientific Data 4, 170193, https://doi.org/10.1038/sdata.2017.193 (2017).
St. John, P. C. et al. Quantum chemical calculations for over 200,000 organic radical species and 40,000 associated closed-shell molecules. Scientific Data 7, 244, https://doi.org/10.1038/s41597-020-00588-x (2020).
Gervasoni, S. et al. AB-DB: force-field parameters, MD trajectories, QM-based data, and descriptors of antimicrobials. Scientific Data 9, 148, https://doi.org/10.1038/s41597-022-01261-1 (2022).
Liang, J. et al. QM-symex, update of the QM-sym database with excited state information for 173 kilo molecules. Scientific Data 7, 400, https://doi.org/10.1038/s41597-020-00746-1 (2020).
Prasad, V. K., Otero-de-la-Roza, A. & DiLabio, G. A. PEPCONF, a diverse data set of peptide conformational energies. Scientific Data 6, 180310, https://doi.org/10.1038/sdata.2018.310 (2019).
Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1, 140022, https://doi.org/10.1038/sdata.2014.22 (2014).
Masuda, K., Abdullah, A. A., Pflughaupt, P. & Sahakyan, A. B. Quantum mechanical electronic and geometric parameters for DNA k-mers as features for machine learning. Zenodo https://doi.org/10.5281/zenodo.10866166 (2024).
R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2022).
Macke, T. J. & Case, D. A. Modeling unusual nucleic acid structures (American Chemical Society, Washington, DC, USA, 1998).
Neidle, S. Oxford handbook of nucleic acid structure (Oxford University Press, Oxford, UK, 1999).
Li, S., Olson, W. K. & Lu, X. J. Web 3DNA 2.0 for the analysis, visualization, and modeling of 3D nucleic acid structures. Nucleic Acids Res. 47, W26–W34, https://doi.org/10.1093/nar/gkz394 (2019).
Case, D. A. et al. Amber 2021. University of California, San Francisco, USA (2021).
Zgarbová, M. et al. Refinement of the sugar–phosphate backbone torsion beta for AMBER force fields improves the description of Z- and B-DNA. J. Chem. Theory Comput. 11, 5723–5736, https://doi.org/10.1021/acs.jctc.5b00716 (2015).
Tsui, V. & Case, D. A. Theory and applications of the generalized Born solvation model in macromolecular simulations. Biopolymers 56, 275–291 (2001).
Grant, B. J., Rodrigues, A. P. C., ElSawy, K. M., McCammon, J. A. & Caves, L. S. D. Bio3D: an R package for the comparative analysis of protein structures. Bioinformatics 22, 2695–2696, https://doi.org/10.1093/bioinformatics/btl461 (2006).
Stewart, J. J. P. MOPAC2016. Stewart Computational Chemistry, Colorado Springs, CO, USA (2016).
Korth, M. Third-generation hydrogen-bonding corrections for semiempirical QM methods and force fields. J. Chem. Theory Comput. 6, 3808–3816, https://doi.org/10.1021/ct100408b (2010).
Klamt, A. & Schüürmann, G. COSMO: a new approach to dielectric screening in solvents with explicit expressions for the screening energy and its gradient. J. Chem. Soc. Perkin Trans. 2, 799–805, https://doi.org/10.1039/P29930000799 (1993).
Besler, B. H., Merz Jr, K. M. & Kollman, P. A. Atomic charges derived from semiempirical methods. J. Comput. Chem. 11, 431–439, https://doi.org/10.1002/jcc.540110404 (1990).
Lavery, R., Moakher, M., Maddocks, J. H., Petkeviciute, D. & Zakrzewska, K. Conformational analysis of nucleic acids revisited: Curves+. Nucleic Acids Research 37, 5917–5929, https://doi.org/10.1093/nar/gkp608 (2009).
Lu, X. J. & Olson, W. K. 3DNA: a versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nat. Protoc. 3, 1213–1227, https://doi.org/10.1038/nprot.2008.104 (2008).
Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining https://doi.org/10.1145/2939672.2939785 (2016).
Friedman, J. H. Stochastic gradient boosting. Comput. Stat. Data Anal. 38, 367–378, https://doi.org/10.1016/S0167-9473(01)00065-2 (2002).
Natekin, A. & Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 7, 1–21, https://doi.org/10.3389/fnbot.2013.00021 (2013).
Caruana, R. & Niculescu-Mizil, A. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd ICML https://doi.org/10.1145/1143844.1143865 (2006).
Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence 2, 56–67, https://doi.org/10.1038/s42256-019-0138-9 (2020).

Acknowledgements

K.M. is supported by JSPS KAKENHI (Grant Number 21J10412) and a JSPS Overseas Research Fellowship. A.A.A. is supported by a MARA studentship. P.P. is supported by the UK Medical Research Council (MRC), Hertford College, the Clarendon Fund, and the Radcliffe Department of Medicine. The Sahakyan Laboratory has been supported by the UK MRC, MRC Strategic Alliance Funding (MC-UU-12025).

Author information

Authors and Affiliations

MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, Radcliffe Department of Medicine, University of Oxford, Oxford, OX3 9DS, UK
Kairi Masuda, Adib A. Abdullah, Patrick Pflughaupt & Aleksandr B. Sahakyan

Contributions

K.M., A.A.A. and A.B.S. designed the research; K.M. conducted the calculations and analysed the results; A.A.A. and P.P. contributed analytical techniques to the work; A.B.S. conceived and supervised the research. All authors wrote and reviewed the manuscript.

Corresponding author

Correspondence to Aleksandr B. Sahakyan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Masuda, K., Abdullah, A.A., Pflughaupt, P. et al. Quantum mechanical electronic and geometric parameters for DNA k-mers as features for machine learning. Sci Data 11, 911 (2024). https://doi.org/10.1038/s41597-024-03772-5

Received: 25 March 2024
Accepted: 13 August 2024
Published: 22 August 2024
Version of record: 22 August 2024
DOI: https://doi.org/10.1038/s41597-024-03772-5

This article is cited by

Pflughaupt, P. & Sahakyan, A. B. Prior knowledge on context-driven DNA fragmentation probabilities can improve de novo genome assembly algorithms. BMC Bioinformatics (2025).