Table 3 Results for the distribution-learning benchmarks on ChEMBL using GPT

From: t-SMILES: a fragment-based molecular representation framework for de novo ligand design

 

| Model | Valid | Unique | Novelty | KLD | FCD | Nov./Uni. |
|---|---|---|---|---|---|---|
| Baseline Graph | | | | | | |
| Graph MCTS (refs. 46, 56) | 1.000 | 1.000 | 0.994 | 0.522 | 0.015 | N/A |
| hG2G (ref. 18) | 1.000 | 0.995 | 0.940 | 0.888 | 0.506 | N/A |
| MGM (ref. 47) | 0.849 | 1.000 | 0.722 | 0.987 | 0.845 | N/A |
| Baseline SMILES | | | | | | |
| LSTM (refs. 40, 46) | 0.959 | 1.000 | 0.912 | 0.991 | 0.913 | N/A |
| CharacterVAE (refs. 2, 46) | 0.870 | 0.999 | 0.974 | 0.982 | 0.863 | N/A |
| AAE (ref. 46) | 0.822 | 1.000 | 0.998 | 0.886 | 0.529 | N/A |
| ORGAN (refs. 3, 46) | 0.379 | 0.841 | 0.687 | 0.267 | 0.000 | N/A |
| Transformer Reg (refs. 31, 47) | 0.961 | 1.000 | 0.846 | 0.977 | 0.883 | N/A |
| MolGPT (ref. 4) | 0.981 | 0.998 | 1.000 | 0.992 | 0.907 | N/A |
| FASMIFRA (ref. 45) | 1.000 | 0.994 | 0.702 | 0.959 | 0.814 | N/A |
| String | | | | | | |
| SMILES_[R10] | 0.980 | 0.979 | 0.907 | 0.992 | 0.906 | 0.926 |
| DSMILES_[R10] | 0.898 | 0.897 | 0.836 | 0.989 | 0.893 | 0.933 |
| DSMILES_[R15] | 0.910 | 0.908 | 0.845 | 0.992 | 0.896 | 0.930 |
| SELFIES_[R10] | 1.000 | 1.000 | 0.958 | 0.979 | 0.857 | 0.959 |
| SELFIES_[R15] | 1.000 | 0.999 | 0.953 | 0.983 | 0.865 | 0.954 |
| t-SMILES Family | | | | | | |
| TS_Vanilla_[R10] | 1.000 | 0.999 | 0.914 | 0.993 | 0.901 | 0.915 |
| TS_Vanilla_[R15] | 1.000 | 0.998 | 0.907 | 0.994 | 0.907 | 0.909 |
| TSSA_J_[R10] | 1.000 | 0.993 | 0.969 | 0.971 | 0.712 | 0.975 |
| TSSA_B_[R20] | 1.000 | 0.995 | 0.956 | 0.972 | 0.708 | 0.961 |
| TSSA_M_[R50] | 1.000 | 0.996 | 0.970 | 0.982 | 0.808 | 0.974 |
| TSSA_S_[R50] | 1.000 | 0.998 | 0.977 | 0.966 | 0.795 | 0.979 |
| TSSA_HJBMSV_[R20] | 1.000 | 0.998 | 0.970 | 0.964 | 0.825 | 0.971 |
| TSDY_B_[R15] | 1.000 | 0.999 | 0.960 | 0.977 | 0.854 | 0.961 |
| TSDY_M_[R15] | 1.000 | 0.998 | 0.970 | 0.960 | 0.852 | 0.972 |
| TSDY_S_[R15] | 1.000 | 0.999 | 0.955 | 0.982 | 0.878 | 0.956 |
| TSDY_HBV_[R15] | 1.000 | 0.999 | 0.943 | 0.988 | 0.897 | 0.944 |
| TSDY_HMV_[R10] | 1.000 | 0.998 | 0.962 | 0.973 | 0.872 | 0.963 |
| TSDY_HSV_[R10] | 1.000 | 0.999 | 0.950 | 0.985 | 0.891 | 0.951 |
| TSDY_HBMSV_[R10] | 1.000 | 0.999 | 0.964 | 0.973 | 0.883 | 0.966 |
| TSID_B_[R10] | 1.000 | 0.999 | 0.941 | 0.989 | 0.909 | 0.942 |
| TSID_M_[R10] | 1.000 | 0.998 | 0.942 | 0.968 | 0.892 | 0.945 |
| TSID_S_[R10] | 1.000 | 0.999 | 0.933 | 0.991 | 0.909 | 0.935 |
| TSID_HBV_[R15] | 1.000 | 0.999 | 0.941 | 0.989 | 0.883 | 0.941 |
| TSID_HBMSV_[R10] | 1.000 | 0.999 | 0.953 | 0.982 | 0.893 | 0.954 |

The results for ORGAN (refs. 3, 46), LSTM (refs. 40, 46), CharacterVAE (refs. 2, 46), AAE (ref. 46) and Graph MCTS (refs. 46, 56) are taken from GuacaMol (ref. 46); those for Transformer Reg (refs. 31, 47) and MGM (ref. 47) are taken from ref. 47; those for MolGPT are taken from ref. 4; and those for FASMIFRA (ref. 45) are taken from its reference. The results for hgraph2graph (hG2G, ref. 18) were calculated by us. CReM (ref. 38) is not included as a baseline because its FCD score is nearly zero, even though its novelty score is close to 1. All other models were trained by us. Models based on TSSA, TSDY and TSID were trained for different numbers of epochs; “R” gives the number of training rounds, as in “[R10]”. In t-SMILES code names, the letter “H” indicates a hybrid model; the letters “J”, “B”, “M” and “S” indicate the fragmentation algorithm (JTVAE, BRICS, MMPA and Scaffolds, respectively); and “V” indicates TS_Vanilla code. “KLD” stands for Kullback–Leibler divergence, “FCD” for Fréchet ChemNet Distance, and “Nov./Uni.” for the ratio of the novelty score to the uniqueness score. Refer to SI Table 8 for repeatability.
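The "Nov./Uni." column can be checked against the Novelty and Unique columns. The following is a minimal sketch in Python, not the authors' code: the function name `nov_uni_ratio` and the choice of rows are illustrative assumptions, and the values are copied from the table. Because the table reports scores rounded to three decimals, the recomputed ratio can differ from the reported one in the last digit; the likely explanation (an assumption here) is that the reported ratios were computed from unrounded scores.

```python
# Rows copied from Table 3: (model, unique, novelty, reported Nov./Uni.).
rows = [
    ("SMILES_[R10]", 0.979, 0.907, 0.926),
    ("SELFIES_[R10]", 1.000, 0.958, 0.959),
    ("TS_Vanilla_[R10]", 0.999, 0.914, 0.915),
    ("TSDY_B_[R15]", 0.999, 0.960, 0.961),
]


def nov_uni_ratio(novelty: float, unique: float) -> float:
    """Ratio of the novelty score to the uniqueness score (hypothetical helper)."""
    return novelty / unique


# Largest gap between the recomputed and the reported ratio over the sample rows.
max_abs_diff = max(
    abs(nov_uni_ratio(novelty, unique) - reported)
    for _, unique, novelty, reported in rows
)

for model, unique, novelty, reported in rows:
    print(f"{model}: recomputed {nov_uni_ratio(novelty, unique):.3f}, reported {reported:.3f}")
```

A tolerance of about 0.002 is enough to absorb the rounding of the three-decimal inputs when comparing recomputed and reported values.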