Table 1 Combined results for all compounds in the TTD database for beam size B = 1000 and the different transformer models

From: Exhaustive local chemical space exploration using a transformer model

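| TASK | VALIDITY | UNIQUENESS (P) | UNIQUENESS (C) | UNIQUENESS (NS) | TOP IDENTICAL (P) | TOP IDENTICAL (C) | TOP IDENTICAL (NS) | RANK SCORE | CORRELATION |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \({\mathcal{D}}\), λ = 0 | 1.00 | 1.00 | 0.97 | 0.60 | 0.04 | 0.04 | 0.62 | 0.24 ± 0.27 | 0.37 ± 0.13 |
| \({\mathcal{D}}\), λ = 10 | 0.99 | 1.00 | 0.95 | 0.53 | 0.31 | 0.30 | 0.93 | 0.35 ± 0.25 | 0.56 ± 0.17 |
| \({{\mathcal{D}}}^{c}\), λ = 0 | 1.00 | 1.00 | 0.97 | 0.59 | 0.06 | 0.07 | 0.66 | 0.29 ± 0.25 | 0.39 ± 0.14 |
| \({{\mathcal{D}}}^{c}\), λ = 10 | 0.99 | 1.00 | 0.95 | 0.53 | 0.31 | 0.31 | 0.93 | 0.44 ± 0.24 | 0.60 ± 0.19 |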

  1. A higher value is better for all columns, and the best results are highlighted in bold. \({\mathcal{D}}\) and \({{\mathcal{D}}}^{c}\) represent the training sets generated with ECFP4 fingerprints without and with counts, respectively. The sub-columns P, C, and NS under UNIQUENESS and TOP IDENTICAL denote the different types of post-processing applied to the generated target compounds. The sub-columns give the fraction of unique SMILES strings (P), the fraction of unique SMILES strings after canonicalization (C), and the fraction of unique SMILES strings after removing stereochemical information and canonicalization (NS). λ = 0 denotes the absence and λ = 10 the presence of the regularization term when training the transformer models.
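
As a rough illustration of how the P, C, and NS uniqueness fractions differ, the following minimal sketch computes the fraction of unique strings under an interchangeable normalization step. It is not the authors' implementation. The names `unique_fraction` and `strip_stereo_markers` are hypothetical, the denominator is assumed to be the number of generated strings, and the stereo stripping is a crude character-level stand-in. Real canonicalization (C, and the second half of NS) needs a cheminformatics toolkit and is not performed here.

```python
from typing import Callable, Iterable


def unique_fraction(
    smiles: Iterable[str],
    normalize: Callable[[str], str] = lambda s: s,
) -> float:
    """Fraction of distinct strings after applying `normalize` to each one.

    Assumption: the denominator is the total number of generated strings.
    """
    items = [normalize(s) for s in smiles]
    if not items:
        return 0.0
    return len(set(items)) / len(items)


def strip_stereo_markers(s: str) -> str:
    """Crude stand-in for stereo removal: drop '@', '/', and '\\' characters.

    This does NOT canonicalize the string; a toolkit would be needed for that.
    """
    return s.translate(str.maketrans("", "", "@/\\"))


# Toy beam output: two alanine enantiomers, alanine without stereo, ethanol.
generated = [
    "C[C@H](N)C(=O)O",
    "C[C@@H](N)C(=O)O",
    "CC(N)C(=O)O",
    "CCO",
]

fraction_p = unique_fraction(generated)                         # raw strings (P)
fraction_ns = unique_fraction(generated, strip_stereo_markers)  # stereo stripped only
print(fraction_p, fraction_ns)
```

In this toy list the two enantiomers collapse to one string once the stereo markers are dropped, while `C[CH](N)C(=O)O` and `CC(N)C(=O)O` still count as distinct even though they describe the same molecule. That remaining mismatch is the reason canonicalization is applied in the C and NS columns.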