Extended Data Table 2 Data-efficiency comparison of best-performing GPT-3-based approaches with best-performing baselines

From: Leveraging large language models for predictive chemistry

  1. For the most direct comparison, we split the baselines into (pre-trained) deep-learning (DL)-based baselines (here, MolCLR68, ModNet69, CrabNet62 and TabPFN70) and baselines not using (pre-trained) deep-learning approaches (n-Gram, Gaussian process regression, XGBoost, random forests, and automated machine learning optimized for materials science25) on hand-tuned feature sets. For the analysis in this table, we fit learning curves for the GPT-3 models and for the baselines and measure where the learning curves intersect; that is, we determine by what factor more (or less) data the best baseline would need to match the performance of the GPT-3 models in the low-data regime of the learning curves. Full learning curves for all models can be found in Supplementary Note 6. In parentheses, we indicate the baseline considered for each case study, using the following acronyms: t for TabPFN70, m for MolCLR68, n for n-Gram, g for GPR71, x for XGBoost72 on molecular descriptors such as fragprints71, xmo for an XGBoost model similar to the one in Moosavi et al.73, xj for an XGBoost model similar to the one in Jablonka et al.45, mo for the atom-centered model from Moosavi et al.74, c for CrabNet62, prf for the random-forest model reported by Pei et al.24, a for automatminer25, mod for ModNet69, and drfp for differentiable reaction fingerprints75 used as input to a GPR71. For the case studies on reaction datasets, we did not consider a deep-learning baseline; the corresponding values are therefore omitted from the table. This analysis has several caveats. First, the low-data regime may not always be the most relevant perspective. Second, we consider only the binary classification setting in this table. Third, we report the F1 macro score (all cases are class-balanced). Fourth, we take the performance of the GPT-3 model at ten training data points as the reference. We provide more details in Supplementary Note 6.
The version of GPT-3 used in this work was trained on data up to October 2019, drawn mostly from web scrapes (Common Crawl76 and WebText77) along with books corpora and Wikipedia; structured datasets were not part of the training. Note also that our approach works well on representations that were not used in the original datasets (for example, SELFIES and InChI). For computing the table, we used data reported in refs. 78,79,80,81,82,83,84,85,86,87,88.
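The learning-curve intersection analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual pipeline: the F1 values below are invented placeholder numbers, and the power-law form s(n) = 1 − a·n^(−b) is one common choice for fitting classification learning curves.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b):
    # Model the score as s(n) = 1 - a * n**(-b), a common learning-curve form.
    return 1.0 - a * n ** (-b)

# Hypothetical learning-curve data (F1 macro vs. training-set size);
# these are NOT the numbers from the paper.
n_train = np.array([10.0, 20.0, 50.0, 100.0, 200.0])
f1_gpt3 = np.array([0.70, 0.75, 0.80, 0.83, 0.85])
f1_base = np.array([0.55, 0.62, 0.70, 0.76, 0.80])

# Fit one power law per model.
p_gpt3, _ = curve_fit(power_law, n_train, f1_gpt3, p0=[1.0, 0.5])
p_base, _ = curve_fit(power_law, n_train, f1_base, p0=[1.0, 0.5])

# Reference point: GPT-3 performance at ten training data points.
target = power_law(10.0, *p_gpt3)

# Invert the baseline fit: 1 - a*n^(-b) = target  =>  n = (a / (1 - target))**(1/b)
a, b = p_base
n_needed = (a / (1.0 - target)) ** (1.0 / b)
factor = n_needed / 10.0
print(f"Baseline needs ~{n_needed:.0f} points ({factor:.1f}x more) "
      f"to match GPT-3 at 10 points.")
```

With these placeholder curves, the baseline requires several times more data to reach the GPT-3 reference score, which is the kind of data-efficiency factor reported in the table.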