Table 3 Results for the distribution-learning benchmarks on ChEMBL using GPT

From: t-SMILES: a fragment-based molecular representation framework for de novo ligand design

	Model	Valid	Unique	Novelty	KLD	FCD	Nov./Uni.
Baseline Graph	Graph MCTS^46,56	1.000	1.000	0.994	0.522	0.015	N/A
	hG2G¹⁸	1.000	0.995	0.940	0.888	0.506	N/A
	MGM⁴⁷	0.849	1.000	0.722	0.987	0.845	N/A
Baseline SMILES	LSTM^40,46	0.959	1.000	0.912	0.991	0.913	N/A
	CharacterVAE^2,46	0.870	0.999	0.974	0.982	0.863	N/A
	AAE⁴⁶	0.822	1.000	0.998	0.886	0.529	N/A
	ORGAN^3,46	0.379	0.841	0.687	0.267	0.000	N/A
	Transformer Reg^31,47	0.961	1.000	0.846	0.977	0.883	N/A
	MolGPT⁴	0.981	0.998	1.000	0.992	0.907	N/A
	FASMIFRA⁴⁵	1.000	0.994	0.702	0.959	0.814	N/A
String	SMILES_[R10]	0.980	0.979	0.907	0.992	0.906	0.926
	DSMILES_[R10]	0.898	0.897	0.836	0.989	0.893	0.933
	DSMILES_[R15]	0.910	0.908	0.845	0.992	0.896	0.930
	SELFIES_[R10]	1.000	1.000	0.958	0.979	0.857	0.959
	SELFIES_[R15]	1.000	0.999	0.953	0.983	0.865	0.954
t-SMILES Family	TS_Vanilla_[R10]	1.000	0.999	0.914	0.993	0.901	0.915
	TS_Vanilla_[R15]	1.000	0.998	0.907	0.994	0.907	0.909
	TSSA_J_[R10]	1.000	0.993	0.969	0.971	0.712	0.975
	TSSA_B_[R20]	1.000	0.995	0.956	0.972	0.708	0.961
	TSSA_M_[R50]	1.000	0.996	0.970	0.982	0.808	0.974
	TSSA_S_[R50]	1.000	0.998	0.977	0.966	0.795	0.979
	TSSA_HJBMSV_[R20]	1.000	0.998	0.970	0.964	0.825	0.971
	TSDY_B_[R15]	1.000	0.999	0.960	0.977	0.854	0.961
	TSDY_M_[R15]	1.000	0.998	0.970	0.960	0.852	0.972
	TSDY_S_[R15]	1.000	0.999	0.955	0.982	0.878	0.956
	TSDY_HBV_[R15]	1.000	0.999	0.943	0.988	0.897	0.944
	TSDY_HMV_[R10]	1.000	0.998	0.962	0.973	0.872	0.963
	TSDY_HSV_[R10]	1.000	0.999	0.950	0.985	0.891	0.951
	TSDY_HBMSV_[R10]	1.000	0.999	0.964	0.973	0.883	0.966
	TSID_B_[R10]	1.000	0.999	0.941	0.989	0.909	0.942
	TSID_M_[R10]	1.000	0.998	0.942	0.968	0.892	0.945
	TSID_S_[R10]	1.000	0.999	0.933	0.991	0.909	0.935
	TSID_HBV_[R15]	1.000	0.999	0.941	0.989	0.883	0.941
	TSID_HBMSV_[R10]	1.000	0.999	0.953	0.982	0.893	0.954

The results for ORGAN^3,46, LSTM^40,46, CharacterVAE^2,46, AAE⁴⁶ and Graph MCTS^46,56 are taken from GuacaMol⁴⁶; those for Transformer Reg^31,47 and MGM⁴⁷ are taken from ref. ⁴⁷; those for MolGPT⁴ are taken from ref. ⁴; and those for FASMIFRA⁴⁵ are taken from ref. ⁴⁵. The results for hgraph2graph¹⁸ (hG2G) were calculated by us. CReM³⁸ is not included as a baseline because its FCD score is nearly zero, even though its novelty score is close to 1. All other models were trained by us. Models based on TSSA, TSDY and TSID were trained for different numbers of epochs; “R” followed by a number, as in “[R10]”, gives the number of training rounds. The letter “H” in a t-SMILES code name indicates a hybrid model; the letters “J”, “B”, “M” and “S” indicate the fragmentation algorithm (JTVAE, BRICS, MMPA and Scaffolds, respectively); and “V” indicates the TS_Vanilla code. “KLD” stands for Kullback–Leibler divergence, “FCD” for Fréchet ChemNet Distance, and “Nov./Uni.” for the ratio of the novelty score to the uniqueness score. Refer to SI Table 8 for repeatability.
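The naming convention and the “Nov./Uni.” column described in the note can be illustrated with a minimal Python sketch. The function names `decode_name` and `nov_per_uni` are hypothetical helpers introduced here for illustration and are not part of the t-SMILES code. The parsing rule (an all-uppercase suffix is read as algorithm letters) is an assumption inferred from the code names in the table.

```python
import re

# Fragmentation letters as listed in the table note.
FRAGMENTERS = {"J": "JTVAE", "B": "BRICS", "M": "MMPA", "S": "Scaffolds"}


def decode_name(name):
    """Split a code name such as 'TSDY_HBMSV_[R10]' into its parts.

    Hypothetical helper: the parsing rule is inferred from the table,
    not taken from the authors' implementation.
    """
    match = re.fullmatch(r"(.+?)_\[R(\d+)\]", name)
    if match is None:
        raise ValueError(f"unrecognised code name: {name!r}")
    stem, rounds = match.group(1), int(match.group(2))

    # An all-uppercase suffix (e.g. 'HBMSV') is treated as algorithm letters;
    # otherwise (e.g. 'TS_Vanilla', 'SMILES') the whole stem is the family.
    family, _, letters = stem.partition("_")
    if not letters.isupper():
        family, letters = stem, ""

    return {
        "family": family,
        "rounds": rounds,
        "hybrid": "H" in letters,
        "fragmenters": [FRAGMENTERS[c] for c in letters if c in FRAGMENTERS],
        "vanilla": "V" in letters,
    }


def nov_per_uni(novelty, unique):
    """Ratio of the novelty score to the uniqueness score."""
    return novelty / unique


if __name__ == "__main__":
    print(decode_name("TSDY_HBMSV_[R10]"))
    # TS_Vanilla_[R10] row: Novelty 0.914, Unique 0.999
    print(round(nov_per_uni(0.914, 0.999), 3))
```

Recomputing the ratio from the rounded table entries can differ from the published column in the last digit, presumably because the published values were derived from unrounded scores.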
