Pre-training data quality is paramount to the performance of foundation models. With that in mind, the authors present SMI-TED289M, a family of molecular encoder-decoder foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, and demonstrate state-of-the-art results on a range of downstream tasks, including reaction-yield prediction.
- Eduardo Soares
- Emilio Vital Brazil
- Kristin Schmidt
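For context, SMILES strings are linear text encodings of molecular graphs, which makes them natural inputs for a sequence encoder-decoder. Below is a minimal, illustrative sketch of regex-based SMILES tokenization; this is a common approach in chemical language modeling, not the authors' actual tokenizer, which this summary does not specify:

```python
import re

# A widely used SMILES tokenization pattern (illustrative only; the
# model's actual tokenizer is not described in this summary).
# It keeps bracket atoms, two-letter elements, ring-bond labels, and
# stereo markers as single tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|@@|@|%\d{2}|[BCNOPSFIbcnops]|\d|[-=#$/\\().+])"
)

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN.findall(smiles)

# Example: aspirin, written as CC(=O)Oc1ccccc1C(=O)O
print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))
```

Tokens produced this way (atoms, bonds, branch parentheses, ring-closure digits) would form the vocabulary a model like this consumes during pre-training.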