Table 2 Summary of the results for optimal feature choice using PLS, RFR, SVR with linear and RBF kernels, and GPR with the RBF kernel.

From: Aqueous pKa prediction for tautomerizable compounds using equilibrium bond lengths

| Property/Metric | Marvin | PLS | RFR | SVR [linear] | SVR [RBF] | GPR [RBF] |
| --- | --- | --- | --- | --- | --- | --- |
| Features used | | C–O, C=C, C–C, C=O | C–O, C–C, C=O | C–O, C=O | C–O, C–C, C=O | C–O, C–C, C=O |
| Hyperparameters | | LV = 3 | Max depth = 6; n_est = 25 | C = 1000; ε = 0.01 | C = 1000; ε = 0.1; γ = 5 | ℓ = −8.21, −6.150, −12.851 |
| MAE (7-fold CV) (train) | | 0.41 | 0.46 | 0.43 | 0.40 | 0.30 |
| RMSEE (7-fold CV) (train) | | 0.53 | 0.57 | 0.57 | 0.53 | 0.39 |
| MAE (test) | 1.21 (4.70) | 0.31 | 0.39 | 0.29 | 0.29 | 0.43 |
| RMSEP (test) | 1.63 (6.32) | 0.36 | 0.49 | 0.40 | 0.36 | 0.59 |
| s.d. (test) | 1.12 (4.32) | 0.19 | 0.31 | 0.28 | 0.22 | 0.36 |
| r² obs vs pred (test) | 0.61 (0.55) | 0.86 | 0.74 | 0.90 | 0.86 | 0.67 |

  1. The “Marvin” column gives statistics for predictions made without considering tautomers/resonance (values outside parentheses) and with tautomers/resonance considered (values in parentheses). The “Features used” row lists the combination of features that minimized the RMSEE of the training set for each method; these features were then used in the model that makes predictions for the test-set compounds. The “Hyperparameters” row lists the values obtained by minimizing the RMSEE of the training set during 7-fold cross-validation (RFR and SVR). For PLS, the number of latent variables (LV) was varied up to the number of features, and the final number, which is also shown, was chosen by minimizing the RMSEE of the training set. For the GPR model, feature selection was carried out using 7-fold cross-validation of each combination/subset of features on the training set, and 100 restarts were used to locate the global maximum of the log likelihood of the y-values. The MAE, RMSEP, standard deviation of absolute errors (s.d.) and r² of observed vs predicted values are shown for the test set.
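The test-set statistics reported in the table (MAE, RMSEP, s.d. of absolute errors, r² of observed vs predicted) can be computed from paired observed and predicted values roughly as in the minimal Python sketch below. This is an illustration, not the authors' code: the function name `error_metrics`, the toy data, the use of the population (divide by n) standard deviation, and the reading of r² as the squared Pearson correlation are all assumptions. Under the population convention the identity RMSE² = MAE² + s.d.² holds exactly, which appears consistent, to rounding, with for example the PLS test column (0.31² + 0.19² ≈ 0.36²).

```python
import math


def error_metrics(observed, predicted):
    """Return MAE, RMSE, s.d. of absolute errors and squared Pearson r.

    Hypothetical helper for illustration; conventions are assumptions:
    population (divide-by-n) s.d., and r2 as squared Pearson correlation.
    """
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    # Absolute prediction errors, one per compound
    abs_err = [abs(o - p) for o, p in zip(observed, predicted)]

    mae = sum(abs_err) / n
    rmse = math.sqrt(sum(e * e for e in abs_err) / n)
    # Population standard deviation of the absolute errors
    sd = math.sqrt(sum((e - mae) ** 2 for e in abs_err) / n)

    # Squared Pearson correlation between observed and predicted values
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("r2 is undefined when either sequence is constant")
    r2 = cov * cov / (var_o * var_p)

    return {"mae": mae, "rmse": rmse, "sd": sd, "r2": r2}


# Toy values (hypothetical, not taken from the paper)
observed = [4.0, 5.0, 6.0, 7.0]
predicted = [4.2, 4.6, 6.1, 7.5]
metrics = error_metrics(observed, predicted)
print(metrics)
```

If a sample (divide by n − 1) standard deviation is preferred, only the `sd` line changes, and the identity above then holds only approximately.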