Table 1 Performance of baseline, D-MPNN, combined ML + QM and TD-DFT approaches to calculate transition energies (in eV)

From: DyeDactic workflow to predict halochromism of biosynthetic colourants

| ID | Method | | R² | MAE (eV) | R² (sys. corr.) | MAE (sys. corr.) (eV) | R² (sys. corr.)ᶜ | MAE (sys. corr.)ᶜ (eV) |
|---|---|---|---|---|---|---|---|---|
| 1 | Baseline HOMO-LUMO gap (GFN2-xTB) | | −6.27 | 1.271 | 0.277 | 0.308 | | |
| 2 | Application of a published D-MPNN trained only on artificial colourants | | 0.181 | 0.321 | 0.324 | 0.267 | | |
| 6 | D-MPNN trained using the collected natural colourantsᵃ | | | | | | 0.581 | 0.184 |
| 3 | Linear regression + QM descriptorsᵇ | ωB97X-D4 | | | | | 0.443 | 0.191 |
| 4 | | ωB97X-D4 + CPCM | | | | | 0.441 | 0.191 |
| 5 | TD/TDA-DFT calculation | PBE0 | 0.430 | 0.278 | 0.504 | 0.244 | 0.650 | 0.201 |
| 6 | | PBE0 + CPCM | 0.450 | 0.263 | 0.509 | 0.247 | 0.653 | 0.205 |
| 7 | | ωB97X-D4 | −0.798 | 0.615 | 0.619 | 0.209 | 0.766 | 0.166 |
| 8 | | ωB97X-D4 + CPCM | −0.302 | 0.514 | 0.630 | 0.208 | 0.776 | 0.164 |
| 9 | | ωB97X-D4 + TDA | −1.79 | 0.779 | 0.600 | 0.214 | 0.754 | 0.169 |
| 10 | | ωB97X-D4 + TDA + CPCM | −1.01 | 0.659 | 0.623 | 0.210 | 0.772 | 0.164 |
| 11 | | BMK | 0.037 | 0.423 | 0.569 | 0.227 | 0.727 | 0.181 |
| 12 | | BMK + CPCM | 0.290 | 0.364 | 0.577 | 0.229 | 0.724 | 0.184 |
| 13 | | CAM-B3LYP | 0.247 | 0.347 | 0.566 | 0.226 | 0.727 | 0.180 |
| 14 | | CAM-B3LYP + CPCM | 0.434 | 0.299 | 0.579 | 0.227 | 0.728 | 0.182 |
| 15 | | M06-2X | −0.11 | 0.46 | 0.596 | 0.217 | 0.757 | 0.172 |
| 16 | | M06-2X + CPCM | 0.204 | 0.382 | 0.607 | 0.217 | 0.759 | 0.171 |
| 17 | | B2PLYP | 0.461 | 0.262 | 0.553 | 0.222 | 0.717 | 0.175 |
| 18 | | B2PLYP + CPCM | 0.551 | 0.220 | 0.593 | 0.212 | 0.757 | 0.163 |
| 19 | | SCS-PBE-QIDH | 0.269 | 0.346 | 0.597 | 0.207 | 0.761 | 0.159 |
| 20 | | SCS-PBE-QIDH + CPCM | 0.518 | 0.255 | 0.657ᵈ | 0.183 | 0.822 | 0.133 |
| 21 | | SCS-ωPBEPP86 | 0.404 | 0.304 | 0.603 | 0.206 | 0.764 | 0.158 |
| 22 | | SCS-ωPBEPP86 + CPCM | 0.605 | 0.220 | 0.657 | 0.183 | 0.816 | 0.135 |

  1. The coefficient of determination (R²) and the mean absolute error (MAE) are used to rank the approaches before and after removal of systematic error components. The last two columns compare the ML model trained on the cleaned dataset, obtained from the five-fold cross-validation, with the TD-DFT calculation results.
  2. ᵃ Calculated in cross-validation for a cleaned natural colourants dataset containing 595 compounds, versus 647 molecules in the original set.
  3. ᵇ 10-fold CV.
  4. ᶜ Dataset with the outliers removed, which was used to train the ML models.
  5. ᵈ Bold font marks the best-performing techniques.
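
The two metrics in the table, and the way a systematic correction can turn a negative R² into a positive one, can be illustrated with a minimal Python sketch. The table does not state the functional form of the correction, so a least-squares linear fit of experimental against calculated energies is assumed here. The function names and the five data points are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the energies below are made up, not values from Table 1.

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the inputs (eV here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot.
    It is negative when the predictions fit worse than the mean of y_true."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def linear_correction(y_true, y_pred):
    """Assumed systematic correction: least-squares fit y_true ~ a * y_pred + b.
    Returns the corrected predictions a * y_pred + b."""
    n = len(y_true)
    mean_p = sum(y_pred) / n
    mean_t = sum(y_true) / n
    sxx = sum((p - mean_p) ** 2 for p in y_pred)
    sxy = sum((p - mean_p) * (t - mean_t) for p, t in zip(y_pred, y_true))
    a = sxy / sxx
    b = mean_t - a * mean_p
    return [a * p + b for p in y_pred]

# Hypothetical experimental and calculated transition energies (eV);
# the calculated values are consistently too high.
experimental = [2.0, 2.4, 2.8, 3.2, 3.6]
calculated = [2.9, 3.2, 3.9, 4.1, 4.7]

corrected = linear_correction(experimental, calculated)

print(f"raw:       R2 = {r2(experimental, calculated):.3f}, "
      f"MAE = {mae(experimental, calculated):.3f} eV")
print(f"corrected: R2 = {r2(experimental, corrected):.3f}, "
      f"MAE = {mae(experimental, corrected):.3f} eV")
```

In this toy case the raw predictions track the experimental trend but are shifted upwards, so R² is negative even though the correlation is strong. The same pattern appears in the uncorrected columns for several methods in the table. The fit here is done in-sample for brevity; a cross-validated fit would be needed to mirror a cross-validation protocol.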