Table 3 Comparison of SELFIES, SMILES, t-SMILES and fragSMILES, across different augmentation levels and based on various properties, for a set of generated strings (using a ChEMBL subset, across five cross-validation folds)

From: fragSMILES as a chemical string notation for advanced fragment and chirality representation

	6000 (x5 fold) sampled strings			6000 (x5 fold) sampled novel molecules					6000 (x5 fold) sampled strings (chiral set)
Notation	Validity (↑)	Uniqueness (↑)	Novelty (↑)	FCD•10¹ (↓)	ΔlogP•10¹ (↓)	ΔSA•10² (↓)	ΔQED•10² (↓)	ΔMW (↓)	Invalidity (↓)	Validity (↑)	Uniqueness (↑)	Novelty (↑)
SMILES 1×	4930 ± 70* (82%)*	4920 ± 70* (100%)	4770 ± 60* (97%)	8 ± 1*	0.8 ± 0.3	5 ± 3	2 ± 1	14 ± 4	400 ± 40* (22%)*	1370 ± 40 (78%)*	1370 ± 40 (100%)	1320 ± 40* (96%)
SELFIES 1×	6000 ± 0* (100%)*	5999 ± 2* (100%)*	5971 ± 2* (100%)*	55 ± 2*	2.0 ± 0.9	74 ± 4*	1.9 ± 0.3	5 ± 3	670 ± 40* (37%)*	1150 ± 20* (63%)*	1150 ± 20* (100%)*	1140 ± 20* (99%)*
t-SMILES 1×	6000 ± 0* (100%)*	5880 ± 10* (98%)*	5860 ± 10* (100%)*	15.6 ± 0.8*	2 ± 1	5 ± 1	3.8 ± 0.5*	38 ± 3*	1010 ± 50* (55%)*	830 ± 50* (45%)*	830 ± 50* (100%)*	830 ± 50* (100%)*
fragSMILES 1×	5280 ± 20 (88%)	5270 ± 30 (100%)	5110 ± 40 (97%)	6.9 ± 0.5	1.1 ± 0.6	5 ± 3	1 ± 1	9 ± 5	330 ± 30 (19%)	1440 ± 70 (81%)	1440 ± 60 (100%)	1400 ± 60 (97%)
SMILES 5×	5300 ± 40* (88%)*	5300 ± 40* (100%)*	5280 ± 40 (100%)*	9.9 ± 0.7*	1.1 ± 0.4	6 ± 2	2 ± 2	15 ± 9	320 ± 50 (17%)*	1500 ± 100 (83%)*	1500 ± 100 (100%)*	1500 ± 100 (100%)*
SELFIES 5×	6000 ± 0* (100%)*	6000 ± 0* (100%)*	5997 ± 1* (100%)*	34 ± 1*	1.2 ± 0.5	53 ± 2*	1.7 ± 0.5	5 ± 2	520 ± 40* (27%)*	1380 ± 80* (73%)*	1380 ± 80* (100%)*	1370 ± 80* (100%)*
t-SMILES 5×	6000 ± 0* (100%)*	5930 ± 10* (99%)*	5880 ± 10* (99%)*	13.7 ± 0.6*	1.4 ± 0.6	5 ± 2	3 ± 1*	36 ± 4*	1000 ± 100* (53%)*	890 ± 60* (47%)*	890 ± 60* (100%)*	880 ± 60* (99%)*
fragSMILES 5×	5420 ± 60 (90%)	5410 ± 60 (100%)	5300 ± 60 (98%)	7.2 ± 0.6	1.5 ± 0.7	5 ± 2	1.5 ± 0.7	7 ± 4	290 ± 30 (15%)	1700 ± 100 (85%)	1700 ± 100 (100%)	1600 ± 100 (98%)

For each metric, the string sampling strategy is reported. (FCD = Fréchet ChemNet Distance; logP = octanol-water partition coefficient; SA = Synthetic Accessibility; QED = Quantitative Estimate of Drug-likeness; MW = molecular weight; Δ = Wasserstein-1 distance to the training set). * indicates a statistically significant difference (t-test, α = 0.05) from the corresponding fragSMILES value. The best value of each metric is indicated in boldface.
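The Δ columns are defined above as the Wasserstein-1 distance between a property's distribution in the generated molecules and its distribution in the training set. As a rough illustration only, the sketch below computes that distance for two one-dimensional samples by integrating the absolute difference of their empirical cumulative distribution functions. The function name `wasserstein_1` and the sample values are hypothetical and are not taken from the article, and the authors' actual implementation may differ (for example, they may have used a library routine).

```python
from bisect import bisect_right


def wasserstein_1(sample_a, sample_b):
    """Wasserstein-1 distance between two 1-D empirical distributions.

    Computed as the integral of |CDF_a(x) - CDF_b(x)| over x, where both
    CDFs are step functions built from the samples (uniform weights).
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")

    a = sorted(sample_a)
    b = sorted(sample_b)
    # The CDFs only change at observed values, so it is enough to
    # evaluate them on the merged, sorted set of points.
    points = sorted(a + b)

    distance = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)  # fraction of a <= left
        cdf_b = bisect_right(b, left) / len(b)  # fraction of b <= left
        distance += abs(cdf_a - cdf_b) * (right - left)
    return distance


# Hypothetical logP values, for illustration only (not data from the table).
training_logp = [1.0, 2.0, 3.0]
generated_logp = [1.5, 2.5, 3.5]
print(f"delta logP = {wasserstein_1(training_logp, generated_logp):.3f}")
```

Note that the table reports some of these distances after scaling (for example, ΔlogP is multiplied by 10 and ΔSA by 100), so a raw distance from a sketch like this would need the same scaling before comparison with the tabulated values.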
