Extended Data Table 2 Data-efficiency comparison of best-performing GPT-3-based approaches with best-performing baselines

From: Leveraging large language models for predictive chemistry

  1. For the most direct comparison, we split the baselines into (pre-trained) deep-learning (DL)-based baselines (here, MolCLR68, ModNet69, CrabNet62 and TabPFN70) and baselines not using (pre-trained) deep-learning approaches (n-Gram, Gaussian process regression, XGBoost, random forests, and automated machine learning optimized for materials science25) on hand-tuned feature sets. For the analysis in this table, we fit learning curves for the GPT-3 models and for the baselines and measure where the learning curves intersect; that is, we determine by what factor more (or less) data the best baseline would need to match the performance of the GPT-3 models in the low-data regime of the learning curves. Full learning curves for all models can be found in Supplementary Note 6. In parentheses, we indicate the baseline considered for each case study, using the following acronyms: t for TabPFN70, m for MolCLR68, n for n-Gram, g for GPR71, x for XGBoost72 on molecular descriptors such as fragprints71, xmo for an XGBoost model similar to the one in Moosavi et al.73, xj for an XGBoost model similar to the one in Jablonka et al.45, mo for the atom-centered model from Moosavi et al.74, c for CrabNet62, prf for the random-forest model reported by Pei et al.24, a for automatminer25, mod for ModNet69, and drfp for differentiable reaction fingerprints75 used as input to a GPR71. For the case studies on reaction datasets, we did not consider a deep-learning baseline; the corresponding values are therefore omitted from the table. This analysis has several caveats. First, the low-data regime may not always be the most relevant perspective. Second, we consider only the binary classification setting in this table. Third, we report the F1 macro score (all cases are class-balanced). Fourth, we take the performance of the GPT-3 model at ten training data points as the reference. We provide more details in Supplementary Note 6.
The version of GPT-3 used in this work was trained on data up to October 2019, drawn mostly from web scrapes (Common Crawl76 and WebText77) along with books corpora and Wikipedia; structured datasets were not part of the training. Note also that our approach works well on representations that were not used in the original datasets (for example, SELFIES and InChI). For computing the table, we used data reported in refs. 78,79,80,81,82,83,84,85,86,87,88.
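The learning-curve intersection analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual pipeline: the F1 values below are invented placeholder numbers, and the power-law form s(n) = 1 − a·n^(−b) is one common choice for fitting classification learning curves.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b):
    # Model the score as s(n) = 1 - a * n**(-b), a common learning-curve form.
    return 1.0 - a * n ** (-b)

# Hypothetical learning-curve data (F1 macro vs. training-set size);
# these are NOT the numbers from the paper.
n_train = np.array([10.0, 20.0, 50.0, 100.0, 200.0])
f1_gpt3 = np.array([0.70, 0.75, 0.80, 0.83, 0.85])
f1_base = np.array([0.55, 0.62, 0.70, 0.76, 0.80])

# Fit one power law per model.
p_gpt3, _ = curve_fit(power_law, n_train, f1_gpt3, p0=[1.0, 0.5])
p_base, _ = curve_fit(power_law, n_train, f1_base, p0=[1.0, 0.5])

# Reference point: GPT-3 performance at ten training data points.
target = power_law(10.0, *p_gpt3)

# Invert the baseline fit: 1 - a*n^(-b) = target  =>  n = (a / (1 - target))**(1/b)
a, b = p_base
n_needed = (a / (1.0 - target)) ** (1.0 / b)
factor = n_needed / 10.0
print(f"Baseline needs ~{n_needed:.0f} points ({factor:.1f}x more) "
      f"to match GPT-3 at 10 points.")
```

With these placeholder curves, the baseline requires several times more data to reach the GPT-3 reference score, which is the kind of data-efficiency factor reported in the table.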