Figure 2

MolT5 and custom tokenizers: The MolT5 tokenizer uses the default English-language tokenization and splits the input text into subwords. The intuition is that SMILES strings are composed of characters that also appear in English text, so pretraining on large-scale English corpora may be helpful. The custom tokenizer, in contrast, exploits the grammar of SMILES and decomposes the input into grammatically valid components (e.g., atom symbols, bond symbols, branches, and ring-closure digits), as illustrated in the sketch below.
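To make the contrast concrete, the following sketch tokenizes one example SMILES string both ways. It assumes the Hugging Face transformers library, an assumed MolT5 checkpoint id (laituan245/molt5-base), and a commonly used SMILES regex as a stand-in for the custom grammar-based tokenizer; it is an illustration, not the paper's actual implementation.

```python
import re

from transformers import AutoTokenizer

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin

# Subword tokenization: MolT5 keeps T5's default English SentencePiece vocabulary,
# so splits follow English subword statistics rather than SMILES grammar.
# "laituan245/molt5-base" is an assumed checkpoint id; any T5-style tokenizer
# shows the same behavior, and the exact splits depend on the vocabulary.
subword_tokenizer = AutoTokenizer.from_pretrained("laituan245/molt5-base")
print(subword_tokenizer.tokenize(smiles))

# Grammar-aware tokenization: a commonly used SMILES regex that treats bracket
# atoms, two-letter elements, bonds, branches, and ring-closure digits as
# single tokens, so every token is a valid SMILES unit.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)
print(SMILES_REGEX.findall(smiles))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']
```

The regex output keeps each atom, bond, and ring-closure digit intact, whereas the subword tokenizer may merge or split them according to frequencies learned from English text.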