**Table 1** Comparison of classification performance (AUC-ROC) between fine-tuned MLM-FG and existing pre-trained/self-supervised baselines on multiple classification benchmarks
From: *Pre-trained molecular language models with random functional group masking*
Model | BBBP | BACE | ClinTox | Tox21 | SIDER | HIV | MUV |
---|---|---|---|---|---|---|---|
No. molecules | 2039 | 1513 | 1478 | 7831 | 1427 | 41,127 | 93,087 |
No. prediction tasks | 1 | 1 | 2 | 12 | 27 | 1 | 17 |
**Pre-trained models from existing literature** | | | | | | | |
MolCLR-GIN | 0.9307 | 0.7873 | 0.8005 | 0.7644 | 0.5826 | 0.7768 | 0.7386 |
MolCLR-GCN | 0.8432 | 0.7194 | 0.7997 | 0.7179 | 0.5353 | 0.7616 | 0.6701 |
GROVER-base | 0.9022 | 0.7700 | 0.6847 | 0.7187 | 0.5579 | 0.6950 | 0.6265 |
GROVER-large | 0.8861 | 0.7795 | 0.6082 | 0.7155 | 0.5283 | 0.6956 | 0.5132 |
GEM | 0.9103 | 0.8603 | 0.8506 | 0.7791 | 0.6279 | 0.7500 | 0.7253 |
MoLFormer | 0.9037 | 0.8275 | 0.9451 | 0.7734 | 0.5826 | 0.7630 | 0.7599 |
**MoLFormer and RoBERTa models without pre-training** | | | | | | | |
MoLFormer (from scratch) | 0.8636 | 0.7728 | 0.7317 | 0.7461 | 0.5667 | 0.6991 | 0.6863 |
RoBERTa (from scratch) | 0.8711 | 0.7445 | 0.8858 | 0.7369 | 0.5285 | 0.5575 | 0.6674 |
**RoBERTa models pre-trained by random subsequence masking** | | | | | | | |
RoBERTa (10M, rand. subseq.) | 0.8572 | 0.8253 | 0.9284 | 0.7533 | 0.6111 | 0.7006 | 0.6234 |
RoBERTa (20M, rand. subseq.) | 0.9068 | 0.8135 | 0.9011 | 0.7635 | 0.5799 | 0.7477 | 0.6481 |
RoBERTa (100M, rand. subseq.) | 0.9048 | 0.8248 | 0.9167 | 0.7852 | 0.5860 | 0.7683 | 0.6909 |
**MoLFormer and RoBERTa models pre-trained by MLM-FG** | | | | | | | |
MLM-FG (MoLFormer, 10M) | 0.8980 | 0.8044 | 0.9669 | 0.7765 | 0.5811 | 0.7633 | 0.6829 |
MLM-FG (MoLFormer, 20M) | 0.8976 | 0.8088 | 0.9436 | 0.7793 | 0.5992 | 0.7801 | 0.7185 |
MLM-FG (MoLFormer, 100M) | 0.9055 | 0.8040 | 0.9270 | 0.7893 | 0.5786 | 0.7690 | 0.6017 |
MLM-FG (RoBERTa, 10M) | 0.8870 | 0.8265 | 0.9258 | 0.7545 | 0.6054 | 0.7106 | 0.6103 |
MLM-FG (RoBERTa, 20M) | 0.9378 | 0.8458 | 0.8919 | 0.7603 | 0.5908 | 0.7594 | 0.6428 |
MLM-FG (RoBERTa, 100M) | 0.9237 | 0.7981 | 0.9606 | 0.7896 | 0.6042 | 0.7807 | 0.7990 |
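Several of these benchmarks bundle multiple prediction tasks (e.g., 12 for Tox21, 27 for SIDER), yet the table reports a single AUC-ROC per dataset. The standard MoleculeNet convention is to compute ROC-AUC per task, skip molecules with missing labels, and average across tasks. The sketch below illustrates that aggregation; the `multitask_auc_roc` helper and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def multitask_auc_roc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Mean per-task AUC-ROC for a multi-label benchmark such as Tox21.

    y_true:  (n_molecules, n_tasks) binary labels; NaN marks a missing label.
    y_score: (n_molecules, n_tasks) predicted probabilities.
    """
    aucs = []
    for task in range(y_true.shape[1]):
        labels, scores = y_true[:, task], y_score[:, task]
        mask = ~np.isnan(labels)        # MoleculeNet datasets have label gaps
        labels, scores = labels[mask], scores[mask]
        if np.unique(labels).size < 2:  # AUC is undefined with one class
            continue
        aucs.append(roc_auc_score(labels, scores))
    return float(np.mean(aucs))

# Toy example: 12 Tox21-style tasks with ~20% missing labels.
# Random scores should land near the 0.5 chance baseline.
rng = np.random.default_rng(0)
y_true = (rng.random((7831, 12)) < 0.1).astype(float)
y_true[rng.random((7831, 12)) < 0.2] = np.nan
y_score = rng.random((7831, 12))
print(f"mean AUC-ROC: {multitask_auc_roc(y_true, y_score):.4f}")
```

Under this convention, a single scalar such as 0.7893 for Tox21 is a mean over 12 task-level AUCs, so small differences between models can hide task-level variation.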