Background & Summary

Heteroaryl groups are functional groups in organic molecules that contain a heteroaromatic ring with at least one heteroatom, such as nitrogen, oxygen, or sulphur. They are naturally abundant and widely utilized in functionalized organic molecules. For example, pyridine, one of the most prevalent nitrogen-containing heteroarenes, is present in 54 small molecule drugs approved by the FDA between 2013 and 2023 (ref. 1). Heteroaryls are also common structural motifs in both coupling partners and catalysts in transition metal catalysis and organocatalysis2,3,4,5. The prevalence of heteroaromatic compounds is attributed to several factors. First, different heteroarene cores have distinct electronic and steric properties that effectively alter the target compound’s chemical reactivity and biological function. One area where these properties have been leveraged is the design of covalent modifier drugs6,7,8 with warhead reactivity modulated by heteroaryl groups; examples include the FDA-approved drugs afatinib and selinexor, as well as other covalent modifiers such as roblitinib, the nitrofuran derivative C-176, and TC9-305 (Fig. 1a)7. Second, heteroarene cores can be further functionalized with electron-donating or electron-withdrawing substituents at different sites9,10, leading to a large number of regioisomers and a substantially expanded property space of heteroaromatic compounds. Third, structurally diverse functionalized heteroarenes can be synthesized from readily available starting materials via a number of established synthetic methods, including recently developed site-selective functionalization4,9,10,11 and skeletal editing12 strategies.

Fig. 1

Overview and background of this work. (a) Selected covalent modifiers possessing warheads modulated by heteroaryl groups7. (b) Descriptors for aryl and heteroaryl substituents. (c) This work: HArD (HeteroAryl Descriptors) database.

Quantitative description of the intrinsic steric and electronic properties of heteroaryl substituents is essential for establishing structure-activity relationships (SAR) and machine learning models for heteroaromatic compounds used in drug discovery and reaction design. DFT-computed descriptors are widely used in catalyst design13,14,15, materials science16,17, and reactivity and selectivity predictions18,19. Descriptors for heteroaromatic compounds such as HOMO/LUMO orbital coefficients/energies20,21 and atomic charges22,23 have been applied to various reaction types, including electrophilic20,22,23,24,25 and nucleophilic21,26 aromatic substitution, C–N cross-coupling27, and radical C–H functionalization28. Despite these advances, a systematic and comprehensive database that integrates various physical-organic descriptors for heteroaryl substituents is still lacking, which has hindered the development of reactivity and selectivity prediction models. In contrast to the broadly used Hammett substituent constants (σp and σm)29 to describe electronic properties of aryl substituents, similar universal electronic descriptors for heteroaryl substituents have not been developed (Fig. 1b). This is in part due to the lack of experimental data (i.e., pKa values of corresponding heteroaryl carboxylic acids)30 as well as the inherent complexity of heteroaryl groups with different ring types, heteroatom substitutions, and regioisomers. We expect that a database of steric and electronic descriptors of heteroaryl substituents could serve as a foundation for developing robust predictive models that expand to the entire chemical space of heteroaromatic compounds. These could streamline reaction and catalyst developments by enabling the rational selection of heteroaryl substituents based on their electronic and steric features, rather than relying on trial-and-error approaches. In addition, this database for intrinsic chemical reactivity factors would complement existing cheminformatics databases for heteroarene synthetic feasibility31 and ADMET properties32, which have been broadly used in drug discovery.

Here, we present HArD, the HeteroAryl Descriptors database of >31,500 heteroaryl substituents based on 238 commercially available parent heteroarene cores (Fig. 1c). To capture the structural diversity of heteroaryl groups, we included both 5- and 6-membered heteroaromatic rings as well as 5,6- and 6,6-fused ring systems with carbon, nitrogen, oxygen, and sulphur as possible heavy atoms in the ring scaffold (Fig. 2a). Each parent heteroarene was functionalized with commonly used electron-withdrawing and electron-donating substituents to give monosubstituted heteroaryl groups (Fig. 2b). For each heteroaryl substituent, 49 DFT-computed electronic, steric, and geometrical descriptors and 16 fingerprint-type descriptors were included (Fig. 2c,d). This database includes computed Hammett-type substituent constants for heteroaryls (σHet), which would allow straightforward extensions of existing SAR and ML models of aryl compounds based on Hammett substituent constants (σp and σm) into previously unexplored space of heteroaryl-containing compounds. These newly developed σHet electronic parameters were computed based on pKa values of corresponding heteroaryl carboxylic acids (Fig. 3a), in analogy to the original definition of Hammett constants for aryl substituents to enable backward compatibility. In addition, other previously used descriptors, such as HOMO/LUMO coefficients, HOMO/LUMO energies, and partial atomic charges have also been computed for all heteroaryl groups in the database. Overall, HArD not only bridges a critical gap in the quantitative characterization of heteroaryl substituents but also provides a practical tool to design and predict the properties of these key building blocks in drug discovery, catalysis, and materials science.

Fig. 2

Workflow to generate the database. The HArD database was created by collecting heteroaryl cores, enumerating possible substituents, and performing high-throughput DFT calculations to provide a set of steric and electronic descriptors. (a) 238 commercially available N-, O-, and S-containing heteroarenes from Reaxys®. (b) SMILES enumeration via RDKit to form approximately 31,500 monosubstituted groups. (c) High-throughput DFT calculations for various descriptors. (d) Descriptors included in the HArD database.

Fig. 3

Overview of selected electronic and steric descriptors. (a) Definition of Hammett-type substituent constant (σHet) for heteroaryl groups as an electronic descriptor. (b) Distribution of σHet in the database. (c) σHet of selected heteroaryl groups. (d) Examples demonstrating different steric and electronic properties between two heteroaryl regioisomers.

Methods

Establishing the heteroaryl library

Parent heteroarene cores were selected based on commercially available unsubstituted heteroaromatic compounds with 5- and 6-membered rings, as well as 5,6- and 6,6-fused ring systems from the Reaxys® database (reaxys.com) (Fig. 2a). Only compounds with C, N, O, and S atoms in the heteroaromatic rings were included. A total of 238 unsubstituted parent heteroarenes were selected, including 23 five-membered heteroarenes, 9 six-membered heteroarenes, 157 5,6-fused rings, 47 6,6-fused rings, plus benzene and naphthalene. This resulted in 812 regioisomers of unsubstituted heteroaryl groups. Next, each unsubstituted heteroaryl group was functionalized using the RDKit33 “ReactionFromSmarts” function to substitute a C–H bond on the heteroaromatic ring with a substituent to generate monosubstituted heteroaryl groups. The substituents used include 12 common electron-donating and electron-withdrawing groups: NMe2, NH2, OH, OMe, Me, TMS, F, Cl, Br, Ac, CN, and NO2. This resulted in approximately 31,500 unique heteroaryl groups (Fig. 2b). To calculate the steric and electronic properties of each heteroaryl group (ArHet), SMILES strings of three compounds were used: ArHet–H, ArHet–CO2H, and ArHet–CO2−. The RDKit Experimental-Torsion Distance Geometry (ETDG) method34 was used to generate 3D structures as Gaussian 1635 input files for subsequent DFT calculations.

Density functional theory (DFT) calculations

Geometries of all structures were optimized using the dispersion-corrected36,37 B3LYP-D3(BJ) functional38,39 with the 6-31+G(d) basis set using the Gaussian 16 program35 (Fig. 2c). Vibrational frequency calculations were performed at the same level of theory as the geometry optimization to confirm that each structure is a local minimum (i.e., with no imaginary frequencies). Single-point energy calculations were carried out using the M06-2X functional40 with the 6-31+G(d) basis set. Solvation energy corrections were calculated using the SMD solvation model41 in single-point energy calculations with water as the solvent. Carboxylic acids (ArHet–CO2H) and carboxylate anions (ArHet–CO2−) may have several conformers depending on whether the carboxylic acid or carboxylate group is coplanar with the heteroaromatic ring. The “SetDihedralDeg” function in RDKit was used to generate conformers of carboxylic acids and carboxylate anions by rotating about the Cipso–Ccarbonyl bond. Only the lowest energy conformer of each structure was used to compute the reported properties. The Automated Quantum Mechanical Environments (AQME) software42 was used in post-processing to check for self-consistent field (SCF) and geometry optimization convergence errors and imaginary frequencies. Calculations with convergence errors were resubmitted using, as the input geometry, the intermediate structure from the previous geometry optimization with the lowest root-mean-square gradient. In cases where imaginary frequencies were present, the geometry was slightly perturbed and the calculation was resubmitted with the keyword “opt=(calcfc,maxstep=5)”. This automated process was repeated twice, and any calculations still showing errors after the attempted recalculations were not included in the final database.

Descriptor acquisition

Hammett-type substituent constants for heteroaryl groups (σHet)

Hammett-type substituent constants for heteroaryls were calculated from the difference between the aqueous pKa values of the corresponding heteroaryl carboxylic acid, pKa(Het), and benzoic acid, pKa(Ph), as a reference (Fig. 3a).

$$\sigma_{\mathrm{Het}} = \log\left(\frac{K_{\mathrm{a}}(\mathrm{Het})}{K_{\mathrm{a}}(\mathrm{Ph})}\right) = \mathrm{p}K_{\mathrm{a}}(\mathrm{Ph}) - \mathrm{p}K_{\mathrm{a}}(\mathrm{Het})$$

The pKa values for benzoic acid and each heteroaryl carboxylic acid were calculated from

$$\mathrm{Ar_{Het}}-\mathrm{CO_{2}H}\,(aq) \xrightarrow{\Delta G_{aq}} \mathrm{Ar_{Het}}-\mathrm{CO}_{2}^{-}\,(aq) + \mathrm{H}^{+}\,(aq)$$
$$\mathrm{p}K_{\mathrm{a}} = \frac{\Delta G_{aq}}{2.303RT} = \frac{\Delta G_{\mathrm{Ar_{Het}-CO_{2}^{-}}} + \Delta G_{\mathrm{H}^{+}} - \Delta G_{\mathrm{Ar_{Het}-CO_{2}H}}}{2.303RT}$$

where R is the gas constant and T is 298.15 K. The value −270.29 kcal/mol was used for the Gibbs free energy of a proton in aqueous solution (\(\Delta G_{\mathrm{H}^{+}}\)). This is calculated from the sum of the gas-phase free energy of the proton (−6.28 kcal/mol), its hydration free energy43 (−265.9 kcal/mol), and a +1.89 kcal/mol correction for standard state conversion. The Gibbs free energies (∆G) of the carboxylate anion and the carboxylic acid were calculated using DFT at the M06-2X/6-31+G(d)/SMD(sSAS,H2O)//B3LYP-D3(BJ)/6-31+G(d) level of theory under standard conditions (i.e., 298.15 K, 1 mol/L). A modified scaled solvent-accessible surface (sSAS) approach was used in the SMD solvation model because it has been demonstrated to improve the accuracy of pKa calculations of carboxylic acids44. The computed σ values of substituted aryl groups have a good linear correlation with the experimentally derived σ values in the literature (R2 = 0.87; MAE = 0.11). Nonetheless, because of the error in the pKa calculations, σHet values should be compared with the computed σ values of substituted aryl groups, which are also included in our database, rather than with the experimental σ values when directly comparing the electronic properties of heteroaryl and aryl substituents.
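As a worked illustration of the two equations above, the following minimal Python sketch assembles the aqueous proton free energy from the three terms quoted in the text and converts free energies into pKa and σHet values. The function names, the value used for R, and the requirement that all inputs be supplied in kcal/mol are assumptions made for this sketch; it is not the script used to build the database.

```python
# Sketch only: pKa and sigma_Het from aqueous Gibbs free energies in kcal/mol.
# Function names and the numerical value of R are assumptions for illustration.

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)
T = 298.15            # temperature in K, as stated in the text

# Aqueous proton free energy built from the three terms quoted in the text.
G_PROTON_GAS = -6.28         # gas-phase free energy of the proton
G_PROTON_HYDRATION = -265.9  # hydration free energy of the proton
STANDARD_STATE_CORR = 1.89   # standard state conversion
G_PROTON_AQ = G_PROTON_GAS + G_PROTON_HYDRATION + STANDARD_STATE_CORR

# Free energy equivalent of one pKa unit (about 1.364 kcal/mol at 298.15 K).
KCAL_PER_PKA_UNIT = 2.303 * R_KCAL * T


def pka(g_anion, g_acid):
    """pKa from the free energies of the carboxylate anion and the acid."""
    delta_g_aq = g_anion + G_PROTON_AQ - g_acid
    return delta_g_aq / KCAL_PER_PKA_UNIT


def sigma_het(g_anion_het, g_acid_het, g_anion_ph, g_acid_ph):
    """sigma_Het = pKa(Ph) - pKa(Het), with benzoic acid as the reference."""
    return pka(g_anion_ph, g_acid_ph) - pka(g_anion_het, g_acid_het)
```

With this sign convention, a heteroaryl carboxylic acid that is more acidic than benzoic acid gives a positive σHet. Free energies read from quantum chemistry output are typically in hartree and would need to be converted to kcal/mol before being passed to these functions.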

Similar to Hammett constants for substituted aryl groups, a negative σHet value indicates a more electron-donating heteroaryl compared to phenyl, whereas a positive σHet indicates a more electron-withdrawing heteroaryl. While traditional aryl Hammett substituent constants often fall in a range of approximately −1 to +1, σHet values exhibit a broader range (Fig. 3b,c), showcasing the electronic diversity of heteroaryl groups. Regioisomers of the same parent heteroarene could offer significant variance in their σHet values (e.g., 0.91, 0.71, and 1.33 for 2-pyridyl, 3-pyridyl, and 4-pyridyl, respectively, Fig. 3c). Further, while 5-membered rings such as pyrrole are known to often be more electron-donating than 6-membered rings such as pyridine, these rings can be tuned using electron-withdrawing groups to alter their electronic properties (Fig. 3c).

Other electronic descriptors

Electronic properties such as HOMO and LUMO energies, total dipole moments, and quadrupole moments were extracted from Gaussian 16 single-point energy output files using the Python-based cclib library. From these values, the chemical potential, HOMO-LUMO gap, global electrophilicity, and global nucleophilicity were derived. The chemical potential was calculated from the average of the HOMO and LUMO energies45, while the HOMO-LUMO gap is the energy difference between these orbitals. Global electrophilicity (electrophilicity index) was determined using the formula \(\omega = \frac{\mu^{2}}{2\eta}\), where μ is the chemical potential and η is the HOMO-LUMO gap46. Global nucleophilicity (nucleophilicity index) was computed as the inverse of global electrophilicity (\(N = \frac{1}{\omega}\)). The HOMO and LUMO coefficients of the ipso carbon were extracted directly from the Gaussian 16 outputs (keyword: pop=(orbitals=10,ThreshOrbitals=5)). Partial atomic charges were computed using Natural Population Analysis (NPA)47, Hirshfeld48, and Charge Model 5 (CM5)49 charge schemes for the ipso carbon and the sum of the atomic charges of the heteroaryl group in the ArHet–H, ArHet–CO2H, and ArHet–CO2− compounds.
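The derived global descriptors defined above follow directly from the two frontier orbital energies. A minimal Python sketch of these definitions is given below; the function name and the dictionary keys are hypothetical, and both orbital energies are assumed to be supplied in the same energy unit.

```python
# Sketch only: global electronic descriptors from HOMO and LUMO energies,
# following the definitions in the text. Names are hypothetical.

def global_descriptors(e_homo, e_lumo):
    """Return chemical potential, gap, electrophilicity, and nucleophilicity."""
    mu = 0.5 * (e_homo + e_lumo)   # chemical potential: average of HOMO and LUMO
    eta = e_lumo - e_homo          # HOMO-LUMO gap
    if eta <= 0:
        raise ValueError("LUMO energy must lie above HOMO energy")
    omega = mu ** 2 / (2.0 * eta)  # global electrophilicity index
    if omega == 0:
        raise ValueError("electrophilicity is zero; nucleophilicity undefined")
    return {
        "chemical_potential": mu,
        "homo_lumo_gap": eta,
        "electrophilicity": omega,
        "nucleophilicity": 1.0 / omega,  # inverse of electrophilicity
    }
```

Because ω depends on μ squared divided by η, its numerical value depends on the energy unit chosen, so descriptors should only be compared when computed in the same unit.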

Steric and geometrical descriptors

All steric descriptors were computed using the MORFEUS program50. These include Sterimol parameters, buried volume, distal volume, Cipso–H bond length in ArHet–H, and Cipso–Ccarbonyl bond lengths in ArHet–CO2H and ArHet–CO2−. Sterimol parameters, developed by Verloop, describe substituent size51,52. The Sterimol length (L) is the vector length from the hydrogen on the ipso carbon of ArHet–H through the carbon to the tangent of the van der Waals (vdW) surface. B1 and B5 are the minimum and maximum widths, defined by the shortest and longest vectors from the ipso carbon to the vdW surface and perpendicular to L. Buried volume was originally developed to quantify the steric hindrance caused by ligands in transition metal complexes53,54. Here, the fraction buried volume (VBur) was calculated from the percentage of space occupied by the ArHet substituent within a sphere of 3.5 Å radius centred on the ipso carbon (Bondi radii, 0.10 Å mesh spacing, and excluding hydrogen atoms) (Fig. 3d). The distal volume describes the volume occupied by ArHet outside of this sphere. Using this approach, the computed steric descriptors could distinguish between different regioisomers (e.g., 3-bromo-2-pyridyl vs. 3-bromo-4-pyridyl, Fig. 3d). The universal quantitative dispersion descriptor (Pint) was derived by constructing a molecular vdW surface and calculating dispersion coefficients using Grimme’s D3 dispersion correction method36,55. The solvent-accessible surface area (SASA) of the ipso carbon of ArHet was determined using the double cubic lattice method (DCLM) algorithm, which applies a constant surface density of points using a 1.4 Å probe56. The volume enclosed by this solvent-accessible surface was also computed57.

HOMA aromaticity descriptor

The Harmonic Oscillator Model of Aromaticity (HOMA) descriptor58,59 was calculated using the bond lengths of the heteroaromatic rings (\({R}_{i}\)) from the DFT-optimized geometries:

$${\rm{HOMA}}=1-\frac{\alpha }{n}\mathop{\sum }\limits_{i=1}^{n}{({R}_{i}-{R}_{{\rm{opt}}})}^{2}$$

where Ropt is the optimal bond length for a reference aromatic bond60, n is the total number of bonds within the ring, and α is a normalization factor60 that scales the result to a value between 1 and 0, where 1 indicates perfectly aromatic benzene and 0 represents the hypothetical Kekulé structure of a nonaromatic 1,3,5-cyclohexatriene ring. The Ropt and α values for CC, CN, CO, CS, NN, and NO bonds were taken from the literature60. For the NS bond, the optimal bond length Ropt = 1.61 Å was also taken from the literature61, whereas α (71.875 Å−2) was calculated according to standard procedures58.
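The HOMA expression can be sketched in a few lines of Python. Because heteroaromatic rings contain several bond types, the sketch below applies the Ropt and α of each bond type to its own bond, which is an assumed generalization of the single-parameter formula above. The NS parameters are those quoted in the text; the CC parameters (1.388 Å, 257.7 Å−2) are commonly quoted literature values and are an assumption here, as are the function and variable names.

```python
# Sketch only: HOMA index from ring bond lengths (in angstrom).
# Per-bond-type parameters are (R_opt in angstrom, alpha in angstrom^-2).
# The CC entry is an assumed literature value; the NS entry is from the text.
HOMA_PARAMS = {
    "CC": (1.388, 257.7),
    "NS": (1.61, 71.875),
}


def homa(bonds, params=HOMA_PARAMS):
    """HOMA for a ring given as a list of (bond_type, bond_length) pairs."""
    if not bonds:
        raise ValueError("at least one ring bond is required")
    penalty = 0.0
    for bond_type, length in bonds:
        r_opt, alpha = params[bond_type]
        # Each bond contributes alpha * (R_i - R_opt)^2 to the penalty term.
        penalty += alpha * (length - r_opt) ** 2
    return 1.0 - penalty / len(bonds)
```

A ring whose bonds all equal the optimal length returns exactly 1, while strongly alternating single and double bond lengths drive the value towards 0.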

Fingerprint descriptors

In addition to DFT-computed descriptors, we included fingerprint-type descriptors relevant to drug discovery applications using RDKit33. These include the Wildman-Crippen partition coefficient (Log P)62, topological polar surface area (TPSA)63, number of hydrogen bond donors/acceptors, molecular weight, number of heavy atoms, and fraction of sp2- and sp3-hybridized carbons.

Data Records

The HeteroAryl Descriptors (HArD) database, along with a Python script for data processing, is available on a publicly accessible FigShare repository64 and GitHub repository (github.com/turkiAlturaifi/HArD). The repository includes the scripts used to generate the database, as well as an Excel file (hard.xlsx) and a database file (hard.db) listing the SMILES representations of monosubstituted heteroaryls along with their associated descriptors. The Excel file (hard.xlsx) contains two sheets. The “database” sheet includes 72 columns, 65 of which correspond to molecular descriptors: 38 electronic, 11 geometric/steric, and 16 fingerprint-type descriptors. The remaining columns provide structural identity information, including the molecule ID, the name of the parent heteroarene, and SMILES strings for the heteroaryl group (ArHet), the unsubstituted heteroarene (ArHet–H), as well as the ArHet–CO2H and ArHet–CO2− species used to compute the σHet descriptors. The “descriptors” sheet provides detailed descriptions and units for each of the 65 descriptors. The repository is organized into two main folders: (i) the database processing folder, which contains a Python script (hard.py) for end-users to perform SMILES-based searches of the database and extract descriptor data from the database file (hard.db); and (ii) the database generation folder, which includes scripts and files for developers to further extend the existing data points, including scripts to filter SMILES strings from Reaxys® query search, generate substituted heteroaryls, perform high-throughput calculations, analyse the results, and extract descriptors. Additionally, the Cartesian coordinates of all optimized geometries (provided as XYZ files) are available on FigShare64. Finally, we provide a simple website (hard.pengliugroup.com) to search for the descriptors by SMILES strings or via a graphical interface (see Usage notes).

Technical Validation

In our automated DFT-calculation/descriptor extraction workflow, several validation checks were performed, including post-processing using the AQME software to address convergence issues (vide supra) and connectivity validation to exclude intrinsically unstable compounds. Similar to traditional Hammett substituent constants, which only included meta- and para-substituted benzenes to avoid influences of steric effects of ortho substituents, we excluded σHet values for all heteroaryl groups with another substituent at an ortho position and those with A1,3-type interactions (e.g., 4-benzothiophenyl with another substituent at the 3-position). These procedures ensure that the reported σHet values describe electronic properties only.

Next, we analysed all DFT-computed descriptors in the database using both linear and non-linear dimensionality reduction techniques, specifically Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP)65 (Fig. 4a,b). In the PCA analysis, the principal components with the greatest variances consist of atomic charge descriptors (PC1, explaining 28% of variance) and quadrupole moment, dispersion, and steric descriptors (PC2, 19% of variance). The PCA plots of PC1 versus PC2 show broader distributions of properties across each heteroaryl class (purple, red, yellow, and green colours for 5-membered, 6-membered, 5,6-fused, and 6,6-fused rings, respectively) than those of aryl groups (black colour) (Fig. 4a). Similarly, the UMAP projection (Fig. 4b) revealed that many heteroaryl groups occupy distinct property space not accessible by aryl substituents. Next, we used z-scores (standard scores) to illustrate the differences among subsets of the database, including aryls, 5- and 6-membered rings, 5,6- and 6,6-fused rings, unsubstituted heteroaryls, and monosubstituted heteroaryls with either electron-donating groups (EDGs) or electron-withdrawing groups (EWGs). For each steric and electronic descriptor shown in Fig. 4c, the z-score for each subset was calculated as (x − μ)/σ from the average descriptor value of the subset (x), the average descriptor value of the entire dataset (μ), and the standard deviation of the entire dataset (σ). The z-score analysis indicates that, in many cases, different subsets of the database have distinct steric and electronic properties. For example, 5-membered rings are less sterically hindered than other subsets as indicated by their lower z-scores for fraction buried volume (−0.94) and Sterimol length (−1.20). Additionally, these different electronic descriptors do not strongly correlate with each other and thus could provide complementary descriptions of different types of electronic effects on reactivity.
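The subset z-score described above can be sketched as follows. The function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) for the entire dataset is an assumption, since the text does not specify which was used.

```python
# Sketch only: z-score of a subset mean relative to the entire dataset.
# Using the population standard deviation is an assumption.
from statistics import mean, pstdev


def subset_z_score(subset_values, all_values):
    """(mean of subset - mean of dataset) / standard deviation of dataset."""
    mu = mean(all_values)
    sigma = pstdev(all_values)
    if sigma == 0:
        raise ValueError("descriptor has zero variance across the dataset")
    return (mean(subset_values) - mu) / sigma
```

A negative value indicates that the subset average lies below the dataset average for that descriptor, as reported for the fraction buried volume of 5-membered rings.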

Fig. 4

Statistical analysis of the heteroaryl database. (a) Principal component analysis (PCA). (b) Uniform manifold approximation and projection (UMAP) of DFT-computed descriptors showing broader distributions of heteroaryl properties than those of aryl groups. (c) z-score analysis to indicate the normalized average deviation of subsets of data points from the average values of the entire dataset.

Heteroaromatic compounds are known to have distinct electronic and steric properties affected by their heteroarene cores66. We selected a subset of common heteroaryl groups and plotted their average steric and electronic properties based on the heteroarene core (Fig. 5). We chose a commonly-used steric descriptor, fraction buried volume (VBur), and our newly developed electronic descriptor, the Hammett-type substituent constant (σHet), to illustrate the properties of the heteroaryl groups. Each point represents the average value of all heteroaryls with the same heteroarene core shown in the top panel of Fig. 5. This plot revealed several general trends that could be qualitatively validated by previous experimental observations. In terms of steric effects, five-membered heteroaryls are generally less hindered due to the smaller size of the ring67, and sulphur-containing heteroaryls often have a larger fraction buried volume due to the longer C–S bond (e.g., compare thiophenyl 10 with other five-membered heteroaryls). As expected, substituting a C–H moiety in phenyl or naphthyl with a nitrogen atom decreases the fraction buried volume (e.g., benzene 14 > pyridine 15 > pyridazine 16 > triazines 18 and 19). In terms of electronic properties, five-membered heteroaryls exhibited more diverse electronic effects than other ring sizes. Based on σHet values, pyrrole (1) is more electron-donating than benzene (14), consistent with its higher reactivity in electrophilic aromatic substitution reactions68. On the other hand, 1,3,4-thiadiazole (13) and 1,2,4-oxadiazole (9) are among the most electron-withdrawing heteroaryls, which is consistent with previous experimental observations69,70.

Fig. 5

Steric and electronic properties of heteroaryls based on the heteroarene core. Visualization of fraction buried volume (a steric descriptor) versus the Hammett-type substituent constant (an electronic descriptor) describes the intrinsic properties of each type of heteroaryl group.

With a wide array of DFT-computed electronic, steric, and geometrical descriptors for diverse heteroaryl groups in the HArD database, we set out to explore whether the computed descriptors correlate with previously reported experimental reactivity and selectivity data. In 2022, the Leitch group published an extensive set of experimentally determined free energies of activation (ΔGSNAr) of nucleophilic aromatic substitution (SNAr) of benzyl alcohol reacting with 2-chloropyridines, 4-chloropyridines, and other chloro-substituted (hetero)aryl compounds (Fig. 6a)21. In another recent work, the Baud group performed systematic kinetic studies of 2-sulphonylpyrimidines as thiol-reactive covalent warheads that can undergo SNAr reactions with biological thiols to achieve selective protein arylation (Fig. 6b)71. The authors achieved a wide range of reactivity for these warheads with the glutathione (GSH) nucleophile by altering the substituents at the 4- and 5-positions of the pyrimidinyl group and exchanging the pyrimidine ring for different parent heteroaryl groups. Both sets of experimental reactivity data strongly correlate with the DFT-computed CM5 charge of the heteroaryl group, q(Het)-CM5 (Fig. 6a,b), illustrating the capability of the q(Het)-CM5 descriptor for quantitative SNAr reactivity prediction. Next, we examined whether the computed electronic descriptors correlate with experimental site-selectivity trends in radical-mediated heteroaryl C–H functionalization from Baran et al. (Fig. 6c)10. The computed ipso carbon LUMO coefficients in the HArD database could qualitatively predict the preferred sites for alkyl radical addition in the majority of the examples studied. Taken together, these preliminary examinations suggest that the computed descriptors in the HArD database could potentially be applied in various types of reactivity and selectivity prediction models.

Fig. 6

Utilizing heteroaryl descriptors to predict experimental reactivity and selectivity. (a) Reactivity prediction of nucleophilic aromatic substitution with chloro-substituted heteroaryl electrophiles with experimental reactivity data from Leitch et al.21 q(Het)-CM5: Charge Model 5 (CM5) charge of the heteroaryl group. (b) Reactivity prediction of 2-sulphonylpyrimidine warheads with experimental thiol reactivity data from Baud et al.71. (c) Site-selectivity prediction for radical-mediated C–H functionalization of electron-deficient heteroarenes with selectivity reported by Baran et al.10.

Usage Notes

The database can be accessed in three ways: (i) via the interactive interface provided on the website (hard.pengliugroup.com) to search descriptors by entering a SMILES string or drawing the molecule in the JSME editor72 (Fig. 7), (ii) by downloading the Excel file (hard.xlsx) from the repository, which contains descriptors that can be utilized with various data analysis software libraries, and (iii) by using the Python script (hard.py) to perform searches by SMILES and by similarity, which is recommended when searching for a large list of SMILES to study correlations with reactivity/selectivity or to build machine learning models. README files provide guidance for reproducing and expanding this database. Example bash and Slurm scripts for high-throughput calculations on HPC systems are also included.
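For readers who prefer to query descriptors programmatically for a list of SMILES strings, the general pattern of a batch lookup is sketched below with Python's built-in sqlite3 module. This is not the interface of hard.py: the assumption that the data can be held in an SQLite table, the table name, the column names, and the attachment-point SMILES notation are all hypothetical and chosen only for illustration, so the sketch builds a small in-memory stand-in populated with the three pyridyl σHet values quoted earlier. The actual schema is described in the repository README files.

```python
# Sketch only: batch descriptor lookup by SMILES with sqlite3.
# Table name, column names, and SMILES notation are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in, not the real hard.db
conn.execute(
    "CREATE TABLE descriptors (smiles TEXT PRIMARY KEY, name TEXT, sigma_het REAL)"
)
conn.executemany(
    "INSERT INTO descriptors VALUES (?, ?, ?)",
    [
        ("[*]c1ccccn1", "2-pyridyl", 0.91),
        ("[*]c1cccnc1", "3-pyridyl", 0.71),
        ("[*]c1ccncc1", "4-pyridyl", 1.33),
    ],
)


def lookup(connection, smiles_list):
    """Return {smiles: (name, sigma_het)} for every SMILES found in the table."""
    smiles_list = list(smiles_list)
    if not smiles_list:
        return {}
    # One placeholder per query string keeps the query parameterized.
    placeholders = ",".join("?" for _ in smiles_list)
    query = (
        "SELECT smiles, name, sigma_het FROM descriptors "
        f"WHERE smiles IN ({placeholders})"
    )
    rows = connection.execute(query, smiles_list).fetchall()
    return {smiles: (name, sigma) for smiles, name, sigma in rows}
```

The lookup matches strings exactly, so in practice query SMILES would need to be written in the same canonical form as the stored entries before searching.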

Fig. 7

Annotated screenshots of hard.pengliugroup.com. A structure search can be performed by either using a SMILES string for the heteroarene or drawing the molecule directly in the editor. The site then returns all regioisomers of the specified heteroaryl group, along with their computed descriptors.