Table 3 Comparison of SELFIES, SMILES, t-SMILES and fragSMILES, across different augmentation levels and based on various properties, for a set of generated strings (using a ChEMBL subset, across five cross-validation folds)

From: fragSMILES as a chemical string notation for advanced fragment and chirality representation

 

Column groups:

6000 (×5 fold) sampled strings: Validity, Uniqueness, Novelty
6000 (×5 fold) sampled novel molecules: FCD·10¹, ΔlogP·10¹, ΔSA·10², ΔQED·10², ΔMW
6000 (×5 fold) sampled strings (chiral set): Invalidity, Validity, Uniqueness, Novelty (columns marked "chiral")

| Notation | Validity (↑) | Uniqueness (↑) | Novelty (↑) | FCD·10¹ (↓) | ΔlogP·10¹ (↓) | ΔSA·10² (↓) | ΔQED·10² (↓) | ΔMW (↓) | Invalidity, chiral (↓) | Validity, chiral (↑) | Uniqueness, chiral (↑) | Novelty, chiral (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SMILES 1× | 4930 ± 70* (82%)* | 4920 ± 70* (100%) | 4770 ± 60* (97%) | 8 ± 1* | 0.8 ± 0.3 | 5 ± 3 | 2 ± 1 | 14 ± 4 | 400 ± 40* (22%)* | 1370 ± 40 (78%)* | 1370 ± 40 (100%) | 1320 ± 40* (96%) |
| SELFIES 1× | 6000 ± 0* (100%)* | 5999 ± 2* (100%)* | 5971 ± 2* (100%)* | 55 ± 2* | 2.0 ± 0.9 | 74 ± 4* | 1.9 ± 0.3 | 5 ± 3 | 670 ± 40* (37%)* | 1150 ± 20* (63%)* | 1150 ± 20* (100%)* | 1140 ± 20* (99%)* |
| t-SMILES 1× | 6000 ± 0* (100%)* | 5880 ± 10* (98%)* | 5860 ± 10* (100%)* | 15.6 ± 0.8* | 2 ± 1 | 5 ± 1 | 3.8 ± 0.5* | 38 ± 3* | 1010 ± 50* (55%)* | 830 ± 50* (45%)* | 830 ± 50* (100%)* | 830 ± 50* (100%)* |
| fragSMILES 1× | 5280 ± 20 (88%) | 5270 ± 30 (100%) | 5110 ± 40 (97%) | 6.9 ± 0.5 | 1.1 ± 0.6 | 5 ± 3 | 1 ± 1 | 9 ± 5 | 330 ± 30 (19%) | 1440 ± 70 (81%) | 1440 ± 60 (100%) | 1400 ± 60 (97%) |
| SMILES 5× | 5300 ± 40* (88%)* | 5300 ± 40* (100%)* | 5280 ± 40 (100%)* | 9.9 ± 0.7* | 1.1 ± 0.4 | 6 ± 2 | 2 ± 2 | 15 ± 9 | 320 ± 50 (17%)* | 1500 ± 100 (83%)* | 1500 ± 100 (100%)* | 1500 ± 100 (100%)* |
| SELFIES 5× | 6000 ± 0* (100%)* | 6000 ± 0* (100%)* | 5997 ± 1* (100%)* | 34 ± 1* | 1.2 ± 0.5 | 53 ± 2* | 1.7 ± 0.5 | 5 ± 2 | 520 ± 40* (27%)* | 1380 ± 80* (73%)* | 1380 ± 80* (100%)* | 1370 ± 80* (100%)* |
| t-SMILES 5× | 6000 ± 0* (100%)* | 5930 ± 10* (99%)* | 5880 ± 10* (99%)* | 13.7 ± 0.6* | 1.4 ± 0.6 | 5 ± 2 | 3 ± 1* | 36 ± 4* | 1000 ± 100* (53%)* | 890 ± 60* (47%)* | 890 ± 60* (100%)* | 880 ± 60* (99%)* |
| fragSMILES 5× | 5420 ± 60 (90%) | 5410 ± 60 (100%) | 5300 ± 60 (98%) | 7.2 ± 0.6 | 1.5 ± 0.7 | 5 ± 2 | 1.5 ± 0.7 | 7 ± 4 | 290 ± 30 (15%) | 1700 ± 100 (85%) | 1700 ± 100 (100%) | 1600 ± 100 (98%) |

  1. For each metric, the string sampling strategy is reported. FCD = Fréchet ChemNet Distance; logP = octanol-water partition coefficient; SA = Synthetic Accessibility; QED = Quantitative Estimation of Drug-likeness; MW = molecular weight; Δ = Wasserstein-1 distance to the training set. * indicates a statistically significant difference (t-test, α = 0.05) from the corresponding fragSMILES value. The best value of each metric is indicated in boldface.
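
To illustrate what the Δ columns measure, the following minimal sketch computes a one-dimensional Wasserstein-1 distance between two samples of a scalar property, for example the molecular weights of generated molecules versus those of the training set. The function name `wasserstein_1` and the example values are hypothetical and are not taken from the article, which does not state how the distance was implemented.

```python
from bisect import bisect_right


def wasserstein_1(sample_a, sample_b):
    """Wasserstein-1 distance between two 1-D empirical distributions.

    Computed as the integral of |CDF_a(x) - CDF_b(x)| over x, which for
    empirical samples reduces to a finite sum over the pooled sample points.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")

    # Between two consecutive pooled points, both empirical CDFs are constant.
    points = sorted(set(a) | set(b))
    distance = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        distance += abs(cdf_a - cdf_b) * (right - left)
    return distance


if __name__ == "__main__":
    # Hypothetical molecular weights (illustrative values only).
    training_mw = [310.4, 285.3, 402.5, 356.4]
    generated_mw = [298.3, 290.1, 415.0, 360.2, 333.7]
    print(wasserstein_1(generated_mw, training_mw))
```

A value of 0 means the two samples have identical empirical distributions. Larger values indicate that the generated property distribution has drifted further from the training set, which is why the Δ columns are marked (↓).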