Table 2 Number of active and inactive compounds and year thresholds used for the time split. ChEMBL data were temporally split into training, update1, update2 and holdout sets based on publication year. Models for the micro nucleus test and liver toxicity endpoints were trained on public data, while the in-house data were split into update and holdout sets based on the internal measurement date.

From: Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data

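| Target (ID) | Training Thresh* | Training Inactive | Training Active | Update1 Thresh* | Update1 Inactive | Update1 Active | Update2 Thresh* | Update2 Inactive | Update2 Active | Holdout Thresh* | Holdout Inactive | Holdout Active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CHEMBL220 | 2014 | 802 | 840 | 2016 | 211 | 248 | 2017 | 217 | 138 | 2020 | 104 | 113 |
| CHEMBL4078 | 2014 | 1031 | 1008 | 2015 | 259 | 275 | 2016 | 267 | 202 | 2020 | 499 | 270 |
| CHEMBL5763 | 2015 | 1125 | 600 | 2016 | 302 | 75 | 2017 | 307 | 95 | 2020 | 137 | 114 |
| CHEMBL203 | 2012 | 1660 | 433 | 2014 | 526 | 213 | 2016 | 428 | 291 | 2020 | 341 | 167 |
| CHEMBL206 | 2006 | 437 | 325 | 2012 | 117 | 63 | 2016 | 114 | 97 | 2020 | 158 | 105 |
| CHEMBL279 | 2010 | 1955 | 649 | 2013 | 523 | 307 | 2014 | 618 | 137 | 2020 | 686 | 299 |
| CHEMBL230 | 2010 | 475 | 542 | 2013 | 218 | 78 | 2015 | 237 | 80 | 2020 | 218 | 172 |
| CHEMBL340 | 2012 | 1272 | 496 | 2014 | 439 | 153 | 2015 | 341 | 59 | 2020 | 449 | 107 |
| CHEMBL240 | 2012 | 797 | 1938 | 2014 | 301 | 413 | 2016 | 265 | 526 | 2020 | 238 | 498 |
| CHEMBL2039 | 2014 | 710 | 645 | 2015 | 189 | 192 | 2017 | 380 | 212 | 2020 | 134 | 72 |
| CHEMBL222 | 2009 | 231 | 673 | 2011 | 61 | 227 | 2015 | 40 | 206 | 2020 | 74 | 54 |
| CHEMBL228 | 2009 | 242 | 858 | 2011 | 97 | 373 | 2014 | 31 | 235 | 2020 | 79 | 196 |
| Micro nucleus test | - | 1475 | 316 | 2005 | 70 | 134 | | | | 2020 | 98 | 50 |
| Liver toxicity | - | 247 | 445 | 2011 | 42 | 48 | | | | 2020 | 35 | 15 |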

*Thresh: Data points published (ChEMBL) or measured (micro nucleus test, liver toxicity) until this year threshold are included in the corresponding subset.
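
The year thresholds in Table 2 can be read as cut-offs that assign each data point to the earliest subset whose threshold it does not exceed. The following is a minimal sketch of such a temporal split, not the authors' implementation. It assumes that "until" includes the threshold year itself; the function name `assign_subset` and the example records are hypothetical. Only the CHEMBL220 thresholds are taken from the table.

```python
from bisect import bisect_left


def assign_subset(year, thresholds):
    """Return the name of the earliest subset whose year threshold is >= year.

    `thresholds` is a list of (subset_name, year_threshold) pairs sorted by
    year. Assumption: a data point dated exactly at a threshold belongs to
    that subset. Returns None for data newer than the last threshold.
    """
    years = [limit for _, limit in thresholds]
    idx = bisect_left(years, year)  # index of first threshold >= year
    return thresholds[idx][0] if idx < len(thresholds) else None


# Year thresholds for CHEMBL220 as listed in Table 2.
chembl220 = [
    ("training", 2014),
    ("update1", 2016),
    ("update2", 2017),
    ("holdout", 2020),
]

# Hypothetical records: (compound_id, publication_year, active_flag).
records = [
    ("cpd_a", 2009, 1),
    ("cpd_b", 2014, 0),
    ("cpd_c", 2015, 1),
    ("cpd_d", 2017, 0),
    ("cpd_e", 2019, 1),
]

# Group compound IDs by the subset their publication year falls into.
split = {name: [] for name, _ in chembl220}
for compound_id, year, _active in records:
    subset = assign_subset(year, chembl220)
    if subset is not None:
        split[subset].append(compound_id)
```

For the in-house endpoints, the same function could be applied with only an update and a holdout threshold and the internal measurement date in place of the publication year.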