Introduction

The ability to predict or, at least, estimate the yields of organic reactions would be of tremendous value for synthetic chemistry, limiting the number of unproductive experiments, minimizing the use of solvents and reagents, and lowering the overall monetary and environmental cost of chemical production. Not surprisingly, there have been many attempts to develop algorithms for that purpose. In our own work, we evaluated both thermodynamic models based on optimized free-energy group contributions (assuming thermodynamic control)1 and machine-learning (ML) methods2,3; others have since focused on various ML approaches. Despite some early optimism4,5, subsequent studies revealed relatively low correlations between experimental and predicted yield values – not only in collections of diverse reaction types2,6 but also within larger sets of same-type reactions3,6,7,8. Pondering the reasons for this unsatisfactory performance, we observe that all efforts to date have trained on full, substrate-to-product reactions, as typically reported in the literature and/or electronic notebooks. Such full-reaction data do not capture reactions’ mechanistic intricacies – in particular, they carry no explicit knowledge of possible side reactions that can lead to undesired outcomes and lower the yields. Some of this knowledge could, in principle, be captured through adequately large9 numbers of examples of failed reactions, but these are typically not published, and the distributions of yields in literature datasets are heavily skewed toward higher values (with a mean approaching 80%2).

Given these limitations, we recently began to teach the computer mechanistic transformations which, when applied to desired substrates, propagate large networks of mechanistic steps. In ref. 10, we encoded some 400 mechanistic steps specific to carbocations and used the network approach to predict the mechanisms of complex carbocationic rearrangements. Therein, we parametrized the heights of kinetic barriers (based on quantum-mechanical calculations) and used this knowledge of kinetics to predict product distributions and yields. More recently, we deployed a much larger set of ~8000 general-scope mechanistic transforms (cf. below) and applied them to multiple small-molecule substrates. This effort was intended to trace mechanistic pathways defining new multicomponent reactions, MCRs, which are particularly appealing because they offer high atom-economy, minimize separation and purification operations, and can yield complex scaffolds that are often less prone to follow-up or side reactions than those made in non-MCR reactions. Indeed, in ref. 11 we described how such analyses enable the systematic discovery of plausible MCR candidates, several of which we validated by experiment. An essential part of this effort has been the ability to estimate the yields of these MCRs – would multicomponent substrate mixtures result in a mixture of low-yielding products, or would they lead to a major product in good yield? Unfortunately, given the number and diversity of mechanistic steps in the 8000-transform set, QM calculations of kinetic barriers have proven prohibitive – instead, we pursued, and describe here, a physical-organic approach in which the kinetics of mechanistic steps are approximated using nucleophilicity and electrophilicity indices and linear free-energy relationships. We train this model on the mechanistic networks of 20 known MCRs (chosen to span a wide range of yields, Fig. 1a, c) and then apply it to predict the yields of our newly discovered MCRs (Fig. 
1b, d) that not only use different substrates but are also based on unprecedented mechanisms. Despite such fundamental mechanistic differences, the model transfers between the training and testing MCRs, achieving similar – and, in light of previous efforts, quite satisfactory – performance metrics (e.g., mean absolute errors, MAE = 10.5% and 7.3%, respectively). These results suggest that a mechanistic-level approach to yield estimation may be a useful alternative to models derived from full-reaction data, although – as we also emphasize in our discussion – this conclusion awaits future extensions of the ~8000 rule set and validation on larger sets of mechanistic networks.

Fig. 1: Multicomponent reactions, MCRs, used to train and test the model.
figure 1

a Schemes of 20 MCRs used to train the yield-prediction model. Percent numbers are the highest/optimized yields for these reactions reported in the references whose numbers are given in square brackets. b Schemes of 10 MCR and one-pot reactions discovered by the Allchemy algorithm, validated by experiment, and reported by us in ref. 11. Percentage values are experimental yields. These reactions were used to test the yield-prediction model. They were committed to validation based on mechanistic novelty, conciseness of synthesis (compared to traditional routes for making similar targets), and, in several cases, potential applicability. For example, the MCR at the top of the left column in b (57% yield) leads to a scaffold that can be further reacted with phenyl hydrazine to give substituted pyrazoles, popular motifs of many drugs. The third-from-the-top entry in the same column (also 57%) produces scaffolds akin to oblongolide natural products studied as potential algicide, herbicide55, and antiviral56 agents. The fourth-from-the-top entry (82%) leads to branched diallylic ethers. After metathesis (not compatible with one-pot conditions), they can cyclize into enol ether scaffolds used in various medicinal syntheses57,58,59,60. The bottom MCR in the same column (31%) gives a substituted hexahydro-2(3H)-benzofuranone, a motif found in various natural products and bioactive compounds61. Turning to the right column, the second entry from the top (34%) produces a spiro system akin to that found in some drugs and bioactive agents62,63,64,65. The MCR just below it (97%) gives a 1-(1-cyclohexenyl)naphthalene, an atropisomeric scaffold familiar from various types of drugs66,67. The reaction marked by a yellow star is discussed in the main text; reactions marked by pink stars are discussed in Supplementary Section S1. Distributions of experimental yields in c, the literature-derived training set from a (blue bars), and d, our testing set from b (pink bars).

Results

Choice of MCRs for training and testing

We began by selecting a set of MCRs on which to parametrize the model. As we showed in ref. 11, there are only ~630 distinct MCR types (differing in the reaction core) and the majority of those are reported with good yields which, as we argued before2,3, severely limits the predictivity of data-derived models. Accordingly, we sought a training set of mechanistically diverse MCRs (Fig. 1a and, for mechanistic details, refs. 12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31) for which the highest, optimized yields reported in the literature are roughly uniformly distributed between 33 and 100% (Fig. 1c). The size of this set (20 MCRs) was determined by the availability of published examples reporting lower yields (eight examples with yields ≤ 55%). As the test set, we used the aforementioned 10 MCRs and one-pot sequences recently discovered11 by the Allchemy algorithm and subsequently validated by experiment (Fig. 1b). The yields of these test reactions also span a broad range of values (Fig. 1d).

Mechanistic transforms and networks

For each of these MCRs, we used mechanistic transforms (expert-coded in the SMARTS notation, akin to our previous works on retro- and forward-synthesis32,33,34,35,36,37) to propagate mechanistic networks from substrates to products. As detailed in refs. 10,11, these transforms are roughly at the level of the so-called arrow-pushing steps and encompass a broad range of chemistries (though not yet exhaustive, see later in the text and the Methods section of ref. 11). Each transform is broader than any specific literature precedent and delineates the scope of substituents admissible or prohibited at various positions. It is also accompanied by information about general reaction conditions (strongly basic, basic, mildly basic, neutral, mildly acidic, strongly acidic, and whether the reaction requires a Lewis acid), solvent class (protic-aprotic and/or polar-nonpolar), temperature range (<−20 °C, −20 to 20 °C, r.t., 40 to 150 °C, > 150 °C), water tolerance (yes, no, water is required), typical speeds (very slow, slow, fast, very fast, or unknown if conflicting literature data have been reported), and more.
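To make the transform annotations concrete, the metadata described above can be pictured as a simple record. The following snippet is a hypothetical, stdlib-only illustration – the field names, the example SMARTS pattern, and the compatibility check are ours, not the actual Allchemy encoding:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MechanisticTransform:
    """Illustrative stand-in for one expert-coded mechanistic transform."""
    name: str
    smarts: str                # arrow-pushing step as reaction SMARTS
    conditions: frozenset      # admissible condition classes
    solvent_classes: frozenset # admissible solvent classes
    temperature: str           # one of "<-20C", "-20..20C", "rt", "40..150C", ">150C"
    water_tolerant: str        # "yes", "no", or "required"
    speed: str                 # "very_slow", "slow", "fast", "very_fast", "unknown"

# hypothetical example: enolate addition to a carbonyl group
aldol_addition = MechanisticTransform(
    name="aldol addition (enolate to carbonyl)",
    smarts="[C-:1].[C:2]=[O:3]>>[C:1][C:2][O-:3]",
    conditions=frozenset({"basic", "strongly_basic"}),
    solvent_classes=frozenset({"polar_aprotic", "polar_protic"}),
    temperature="-20..20C",
    water_tolerant="yes",
    speed="fast",
)

def compatible(t1, t2):
    """Two consecutive steps must share at least one condition class
    and one solvent class (a simplified version of the matching rules)."""
    return bool(t1.conditions & t2.conditions) and \
           bool(t1.solvent_classes & t2.solvent_classes)
```

In the production system, each such record additionally carries substituent scope restrictions and other annotations discussed in ref. 11.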

Propagation of the mechanistic networks starts with a set of substrates (denoted as synthetic generation G0 in Fig. 2a) either specified by the user or systematically selected from an expert-curated collection of ca. 2400 simple molecules featuring combinations of functional groups promoting diverse modes of reactivity (for a detailed list, see ref. 11). To these substrates, the algorithm applies the matching mechanistic transforms – under all possible conditions – to create the first generation, G1, of products and by-products, which are then iteratively reacted10,11 to give generations G2 and higher (here, up to G8) to reach the reported product (Fig. 2a and https://mcrchampionship.allchemy.net for examples studied here; an interactive version for these and other mechanistic calculations is available at https://mech.allchemy.net).
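The generation-by-generation propagation loop can be sketched in a few lines of Python. Here, transforms are toy functions on strings rather than SMARTS operations on molecules, so the snippet illustrates only the iteration scheme, not the chemistry or the condition bookkeeping:

```python
from itertools import combinations_with_replacement

def expand_network(substrates, transforms, max_generation):
    """Iteratively apply transforms to all known species to build
    generations G1, G2, ... (a minimal sketch of the propagation loop;
    the real algorithm also tracks conditions and by-products)."""
    known = set(substrates)          # everything seen so far (G0 and later)
    generations = [set(substrates)]  # generations[g] holds species first made in Gg
    for _g in range(1, max_generation + 1):
        new = set()
        # try every transform on every unordered pair of known species
        for a, b in combinations_with_replacement(sorted(known), 2):
            for t in transforms:
                new.update(t(a, b))
        new -= known                 # keep only species not seen before
        if not new:
            break
        generations.append(new)
        known |= new
    return generations

# toy transform: "condense" two distinct fragments into a joined product
def toy_condensation(a, b):
    return {a + "." + b} if a != b else set()

gens = expand_network(["amine", "ketone"], [toy_condensation], max_generation=2)
```

Running this gives G0 = {amine, ketone}, G1 containing the condensate, and G2 containing its further condensations with the starting materials.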

Fig. 2: Forward and sideways expansion of mechanistic networks.
figure 2

a The algorithm described in ref. 11 first applies ~8000 mechanistic, SMARTS-encoded transforms to the reaction substrates (here, benzyl isocyanide, phenylphosphinic acid, and phenylpropionaldehyde) corresponding to the bottom row of molecule nodes in the zeroth synthetic generation, G0. Matching transforms are applied to these starting materials to generate intermediates in generation G1, then to G0 and G1 species to generate G2, and so on. The network shown is expanded to G5, with the reaction product colored blue and overlaid on the network. During network expansion, transforms under all possible conditions are applied. However, for a valid reaction sequence, the individual mechanistic steps must be mutually compatible and must meet several conditions. For instance, as detailed in ref. 11, such sequences cannot combine solvents of different classes (categorized as protic/aprotic and polar/non-polar), cannot combine steps requiring oxidative and reductive conditions, may apply water-sensitive steps only before water-requiring ones, cannot toggle between basic and acidic conditions, and more. Of note, some transforms may have more than one categorization and, if so, these are considered as logical alternatives when checking step-matching along a sequence. In the current example, the thicker blue line traces the only matching mechanistic pathway connecting the starting materials and the products. b The mechanism corresponding to this matching pathway is shown as Allchemy’s screenshot and agrees with the pathway proposed in ref. 18. The major MCR pathway thus identified constitutes the first level of analysis (Level 1). c Once found, this Level 1 solution is expanded sideways to account for the by-products of individual mechanistic steps as well as side reactions possible under similar reaction conditions (Level 2). The analysis can be expanded to higher levels to account for further reactions of by-products and products of side reactions (see scheme in Fig. 3). The network presented in the panel corresponds to Level 3 analysis. d The same graph as in (c) but redrawn as a bipartite graph, with molecule nodes represented as circles and reaction nodes as diamonds.

Within the networks thus constructed, the algorithm traces the substrates-to-product sequences of mechanistic steps that are mutually compatible (see caption to Fig. 2a and, for a detailed description, ref. 11). For each of the MCRs, the pathway fulfilling this mutual compatibility of steps and also closest-matching the class of literature-reported conditions is taken as Level 1 of our network analysis (Fig. 2b). Importantly, for the MCRs studied here, the sequences the algorithm concatenates from individual mechanistic steps agree with the mechanisms proposed by the authors in the original publications12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31. Next, the Level 1 routes are expanded sideways to account for by-products of the main route and products of any side reactions possible under the main-pathway conditions (Level 2 of analysis). Level 3 accounts for reactions in which species from Level 2 react among themselves or with species from Level 1. Such expansion can then be iterated to higher levels and is illustrated schematically in Fig. 3 and in Fig. 2c, d for a specific MCR example. In Fig. 2c, a simple molecular graph format is used (with one type of node corresponding to molecules), whereas in Fig. 2d the so-called bipartite graph is applied, in which there are two types of nodes: one corresponding to molecules and the other to the reactions in which these molecules engage. As we discussed elsewhere38, the bipartite representation is better suited to capture all causal relationships between substrates and products and is the one used in the kinetic analyses to which we now turn.
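A minimal sketch of the bipartite representation (molecule nodes plus reaction nodes, with directed edges encoding substrate-of and product-of relationships) might look as follows; the data layout is ours and purely illustrative:

```python
def build_bipartite(reactions):
    """Build a bipartite graph with molecule nodes and reaction nodes;
    edges run substrate -> reaction and reaction -> product.
    `reactions` is a list of (substrates, products) tuples. A sketch of
    the representation discussed in the text, not the actual code."""
    molecules, rxn_nodes, edges = set(), [], set()
    for i, (subs, prods) in enumerate(reactions):
        r = f"rxn_{i}"                 # one node per reaction
        rxn_nodes.append(r)
        for s in subs:
            molecules.add(s)
            edges.add((s, r))          # s is consumed by reaction r
        for p in prods:
            molecules.add(p)
            edges.add((r, p))          # r produces p
    return molecules, rxn_nodes, edges

# toy network: A + B -> C, then C + B -> D
mols, rxns, edges = build_bipartite([({"A", "B"}, {"C"}), ({"C", "B"}, {"D"})])
```

Because each reaction is its own node, the graph preserves which co-substrates a step needs – exactly the causal information that a plain molecule-to-molecule graph loses.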

Fig. 3: Scheme of a side-expansion of an MCR pathway.
figure 3

The main MCR pathway is colored in blue. This route must meet the requirements of Level 1 analysis – that is, the conditions of all individual steps must be mutually compatible (different steps may not intermingle oxidative and reductive conditions, solvents should belong to the same class, etc.; for details, see ref. 11). Level 2 corresponds to reactions branching out from the main path and leading to by-products (grey) or products of competing/side reactions possible under the same class of reaction conditions (red). Reactions between species from Level 2, or between species from Level 2 and Level 1, lead to products at Level 3 (light orange). This expansion can be iteratively continued. For instance, products at Level 4 (dark orange) form when species from Level 3 react among themselves or with molecules from Levels 1 or 2. Naturally, only reactions compatible with the conditions of the main MCR pathway are considered (e.g., if the multicomponent mixture is under basic conditions, there is no point in analyzing side reactions that require, say, acidic conditions).

Development of the kinetic model

With the ultimate objective of estimating MCR yields, it is first necessary to approximate the equilibrium constants or kinetic rate constants of the individual steps in a mechanistic graph expanded to some Level n. Since published kinetic data are extremely sparse, it is impossible to assign experimental values to the vast majority of steps – hence, we have aimed to develop a heuristic model based on the extension of Mayr’s nucleophilicity and electrophilicity indices39,40 and linear free-energy relationships.

To begin with, and as discussed in detail in ref. 11, individual mechanistic steps can have multiple classes of plausible conditions assigned to them (e.g., a Diels-Alder cycloaddition can be carried out either under neutral conditions at high temperature or at lower temperatures using a Lewis acid catalyst). With this in mind, if multiple steps along the main/Level 1 MCR route have overlapping condition ranges as logical alternatives, we choose the conditions that minimize the overall number of condition changes along the route. For example, if some Step 1 could be carried out under neutral or mildly acidic conditions and the subsequent Step 2 requires either mildly acidic or acidic conditions, the Step 1/Step 2 sequence is assigned the common, mildly acidic conditions. Along the entire route, such unification of conditions is performed using a greedy algorithm on the topologically sorted sequence of steps (topological sorting was performed with Kahn’s algorithm implemented in the NetworkX library41).
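The two ingredients of this unification – Kahn's topological sort and greedy merging of overlapping condition sets – can be illustrated with a stdlib-only sketch (the production code uses NetworkX and richer condition labels):

```python
from collections import deque

def kahn_toposort(steps, edges):
    """Kahn's algorithm (the text uses NetworkX's implementation;
    this stdlib version is equivalent for illustration)."""
    indeg = {s: 0 for s in steps}
    succ = {s: [] for s in steps}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(s for s in steps if indeg[s] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order

def unify_conditions(ordered_conditions):
    """Greedily group consecutive steps whose admissible condition sets
    overlap, assigning each group its common intersection; this
    minimizes the number of condition changes along the route."""
    assigned, i = [], 0
    while i < len(ordered_conditions):
        common = set(ordered_conditions[i])
        j = i + 1
        while j < len(ordered_conditions) and common & ordered_conditions[j]:
            common &= ordered_conditions[j]
            j += 1
        assigned.extend([common] * (j - i))  # whole run shares one condition set
        i = j
    return assigned

order = kahn_toposort(["s1", "s2", "s3"], [("s1", "s2"), ("s2", "s3")])
# the Step 1/Step 2 example from the text:
segments = unify_conditions([{"neutral", "mildly_acidic"},
                             {"mildly_acidic", "acidic"}])
```

For the worked example from the text, both steps are assigned the common, mildly acidic conditions.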

Next, we analyze the entire Level-n mechanistic graph and identify all acid-base equilibria. We stipulate that (i) under basic conditions, protonated forms cannot exist and, (ii) conversely, under acidic conditions, deprotonated forms are excluded from the graph. Naturally, these are simplifications; in reality, the fraction of protonated/deprotonated forms is not binary and depends on the specific pKa values, the number of acid/base equivalents used, and the solvent environment. For all other acid-base equilibria, and in an effort to minimize the number of free parameters, we make the very rough assumption that their equilibrium constants, K, are always the same. The value of K is a parameter for the global optimization over all 20 MCRs in the training set. Similarly, for all tautomerizations, we assume one global equilibrium constant, Ktau, to be optimized. Note that the treatment of tautomers as separate species within the reaction graph is necessitated by the SMILES/SMARTS notation, which considers tautomers as distinct structures. The limitations of this notation also require that resonance structures of enolate anions be treated as two separate entities (C- and O-anions, here taken in a 1:1 ratio). Reaction classes categorized as in-principle reversible but, for some particular substrates, resulting in aromatization (e.g., beta-elimination leading to a thiazole) are treated as irreversible.
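The binary acid-base pruning rule can be expressed as a short filter; the species tags below are illustrative stand-ins for the protonation-state bookkeeping in the actual graph:

```python
def prune_acid_base(species, conditions):
    """Under basic conditions, drop protonated forms; under acidic
    conditions, drop deprotonated forms (the binary simplification
    described in the text). `species` maps SMILES-like strings to an
    illustrative protonation-state tag."""
    if "basic" in conditions:
        return {s for s, form in species.items() if form != "protonated"}
    if "acidic" in conditions:
        return {s for s, form in species.items() if form != "deprotonated"}
    return set(species)

# acetone, its enolate, and its protonated form (tags are hypothetical)
pool = {
    "CC(=O)C": "neutral",
    "CC(=O)[CH2-]": "deprotonated",
    "CC(=[OH+])C": "protonated",
}
kept = prune_acid_base(pool, {"basic"})
```

Under basic conditions, the protonated ketone is removed while the neutral and deprotonated forms survive.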

Furthermore, the graph is simplified by preventing reactions of some very reactive species (e.g., acyl chlorides, organolithium compounds) present in earlier synthetic generations with species formed in later generations. For instance, if an acyl chloride is present in G0 and can engage in reactions with some other species from the same generation, it cannot react with species formed in G1 or higher. Colloquially put, such reactive species are not allowed to sit around and wait until multiple other steps happen and some downstream reaction partners present themselves – instead, they react as rapidly as they can with immediately available suitors. Of course, if in the experimental procedure some reagents are added only at later stages of the sequence, this delayed addition is reflected in the structure of the graph (by connecting the incoming chemical to the step in which it is used).

Regioselectivity of C-H deprotonation is assessed for motifs prone to non-equivalent deprotonation (asymmetric ketones, 1,3-di(thio)carbonyls, 1,3-ketoesters, or other active methylene compounds) using pKa values pre-calculated with the graph convolutional neural network pKa predictor we described in ref. 42. If one equivalent of base is used, deprotonation and subsequent reaction are allowed only at the most acidic position. With excess base and possible formation of a di-anion, the reaction is also allowed to proceed at the second most acidic locus.
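This site-selection rule reduces to ranking positions by predicted pKa. A sketch follows, with hypothetical pKa values standing in for the neural-network predictions of ref. 42:

```python
def allowed_deprotonation_sites(site_pkas, base_equivalents):
    """Pick deprotonation loci by predicted pKa: with one equivalent of
    base, only the most acidic site reacts; with excess base (possible
    di-anion), the second most acidic site is also allowed. A sketch of
    the rule described in the text."""
    ranked = sorted(site_pkas, key=site_pkas.get)  # lowest pKa = most acidic first
    n_sites = 1 if base_equivalents <= 1 else 2
    return ranked[:n_sites]

# hypothetical pKa values for a 1,3-ketoester: the position between the
# carbonyls ("C2") vs a terminal methyl ("C4")
sites = allowed_deprotonation_sites({"C2": 11.0, "C4": 25.0}, base_equivalents=1)
```

With one equivalent of base only the most acidic C2 position reacts; with excess base, C4 becomes accessible as well.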

With the graph thus processed, we proceed to assign kinetic rates to individual steps. We do this by extending Mayr’s popular reactivity parameters available from refs. 39,40. According to the so-called Mayr–Patz equation39, the logarithm of a second-order reaction rate constant at 20 °C can be related to the nucleophilicity parameter N and electrophilicity parameter E as

$$\log k=s(N+E)$$
(1)

where N varies between −8.8 and +32, E between −30 and +8, and s is a nucleophile-dependent slope parameter. Here, we do not use the values of the slope parameters, \(s\), from Mayr’s tables because they can depend quite strongly on specific solvents (and temperatures). Moreover, in some cases, the values tabulated for specific substrates lead to problematic predictions – e.g., predicting that \(s\) be higher for a reaction between aniline and an aldehyde than for the reaction between a primary amine and the same aldehyde. In light of these problems, for all reactions, we set the slope parameter to a common value, \(s\) = 1/2.303, such that Eq. 1 simplifies to the natural logarithm more commonly used in the linear free-energy relationships, LFER, we will use later:

$${{{\mathrm{ln}}}}\;k=N+E$$
(2)

Since Mayr’s collection encompasses only 1273 specific nucleophiles and 344 electrophiles, it is very unlikely that they would coincide with the species along our pathways. To remedy this, for any nucleophile-electrophile transformation present in a given mechanistic network, we search Mayr’s compendium for compounds that share the same reacting groups and are the most similar (by Tanimoto similarity based on ECFP4 fingerprints) to the substrates of our particular transformation (Fig. 4); if there are no examples with matching reactive groups in the Mayr set, we assign to such reactions a default value of the rate constant, \({{{\mathrm{ln}}}}\;k=1\). We strive to select parameters for solvents that are closest-matching those predicted by our algorithm. Specifically, if for a given nucleophile and electrophile, multiple Mayr’s data are available in different solvents, we retain only those entries that match the predicted class of solvents (polar/nonpolar, protic/aprotic). Then, we take the solvent that is common to the electrophile and nucleophile (rarely, if multiple such common solvents are available, we choose the one most popular in Mayr’s tables). If there are no common solvents for a given nucleophile/electrophile pair, the algorithm selects Mayr’s data corresponding to solvents that are most similar (with similarity defined as the number of common qualifiers, e.g., MeOH and EtOH have two common qualifiers, ‘protic_solvents’ and ‘alcohols’, whereas MeOH and water have only one, ‘protic_solvents’). If Mayr’s data are available only for solvents not matching the predicted solvent class, we take \({{{\mathrm{ln}}}}\;k=1\). Finally, we limit the sum \(N+E\) to some maximal absolute value, later taken as a free parameter of the model; this is to avoid extremely high or low reaction rates (which may be unphysical, especially given that they extrapolate to our system(s) from some specific solvent and only one temperature tabulated by Mayr).
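The nearest-neighbor lookup can be illustrated with fingerprints represented as sets of on-bits (the actual implementation computes ECFP4 fingerprints with a cheminformatics toolkit). The E value for 1-benzylpiperidin-4-one is the one quoted in Fig. 4; the second entry and all fingerprints are invented for illustration:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints represented as sets of
    on-bits (ECFP4 fingerprints in the actual implementation)."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def closest_mayr_entry(query_fp, mayr_entries, default_ln_k=1.0):
    """Pick the most similar compound from (a toy stand-in for) Mayr's
    compendium; fall back to the default ln k = 1 when nothing matches."""
    if not mayr_entries:
        return None, default_ln_k
    best = max(mayr_entries, key=lambda e: tanimoto(query_fp, e["fp"]))
    return best["name"], best["E"]

entries = [
    {"name": "1-benzylpiperidin-4-one", "fp": {1, 2, 3, 5}, "E": -18.4},  # E from Fig. 4
    {"name": "other_electrophile",      "fp": {7, 8, 9},    "E": -7.5},   # invented entry
]
name, E = closest_mayr_entry({1, 2, 3, 4}, entries)
```

For this toy query, the first entry wins with a Tanimoto similarity of 3/5 = 0.6, and its tabulated E parameter is adopted for the step.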

Fig. 4: A similarity-based assignment of N,E parameters to molecules not present in Mayr’s collection.
figure 4

The example is part of one of the MCRs from our training set24 (not all mechanistic steps are shown). As the E parameter for the second substrate for imine formation is not directly available, the algorithm selects the most similar entry in Mayr’s compendium (here, 1-benzylpiperidin-4-one with E = −18.4).

Obviously, such Mayr-like rates provide only a very rough approximation – indeed, when the parameters of the model defined thus far were optimized against the 20 literature MCRs, the correlations between calculated and experimental yields were very poor (vide infra). This called for the inclusion of additional terms to better approximate the rate constants, in effect following the LFER philosophy known and developed for decades43,44. In this spirit, we treat the rate constant of step i as a linear combination of several heuristic corrections,

$${{{\mathrm{ln}}}}\;{k}_{{{{\rm{i}}}}}={{{\mathrm{ln}}}}\;{k}_{{{{\rm{i}}}}}^{{{{\rm{Mayr}}}}}+\sum {{{\rm{corrections}}}}({r}_{{{{\rm{i}}}}})$$
(3)

These corrections are of eight types. First, as mentioned above and detailed in ref. 11, all mechanistic transforms come with a general classification of their rates (e.g., very slow, slow, fast), to which we assign numerical values (to be optimized) defining the correction \({r}_{{{{\rm{i}}}}}^{{{{\rm{rate\; class}}}}}\). Second, some transforms are known to proceed as side-reactions but are never dominant (and do not proceed in high yields). For instance, propargyl organolithium compounds can eliminate to allenes, but allenes are virtually never purposefully prepared in this manner. Such reactions, as well as quenching reactions, have their own, lower values of the \({r}_{{{{\rm{i}}}}}^{{{{\rm{class}}}}}\) correction. This type of down-correction also applies to reactions which, in order to proceed in good yields, require specific reagents not present in the reaction mixture (e.g., during thiol alkylation, a potential side reaction is disulfide formation – however, it can occur in good yield only if oxygen is present). Third, we penalize (by the parameter \({r}_{{{{\rm{i}}}}}^{{{{\rm{water}}}}}\), assigned a value lower than the default for other reactions) those side steps that ideally require aqueous conditions (e.g., hydrolysis) but, in reality, can use only small amounts of water supplied to them as a by-product of some other reaction in the graph (say, imine formation). Fourth, there is also a penalizing conditions correction, \({r}_{{{{\rm{i}}}}}^{{{{\rm{cond}}}}}\), for side-reactions which can no longer take place after a putative change of conditions along the main MCR pathway. For instance, if an MCR is started under basic conditions but, at some later point, the reaction mixture is neutralized, then deprotonation side-reactions are not possible after this change of conditions.
Fifth, there is a global correction \({r}_{{{{\rm{i}}}}}^{{{{\rm{rev}}}}}\) assigned to reversible reactions such as imine formation or Michael addition of a thiol (i.e., equilibria other than the acid-base and tautomerization equilibria discussed earlier). The rationale for this term is that it is often possible to adjust the reaction conditions so as to shift the equilibrium in the desired direction. Sixth, the correction \({r}_{{{{\rm{i}}}}}^{{{{\rm{ring}}}}}\) promotes intramolecular reactions forming 3-, 5-, 6-, or 7-membered rings. Seventh, the rate of bimolecular reactions is scaled based on the concentration of the non-limiting substrate, via \({r}_{{{{\rm{i}}}}}^{{{{\rm{bimolecular}}}}}\). In practice, this means that if one molecule can react in two bimolecular reactions characterized by the same rate class, the reaction with the more abundant second substrate is favored. Eighth and last, we implement a correction inspired by the so-called Evans–Polanyi rule, which stipulates that in a series of homologous reactions, the activation energy is proportional to the reaction enthalpy \(\Delta H\) (thus promoting exothermic reactions, \(\Delta H\) < 0). In our case, we take \({r}_{{{{\rm{i}}}}}^{{{{\rm{Polanyi}}}}}\) as proportional to \(\exp (-\Delta H)\), where the enthalpy is approximated by the reaction energy \(\Delta E\) (calculated at the PM6 level in MOPAC45), and extreme values of \(\Delta E\) (with the threshold being a free parameter of the model) are rejected to avoid artifacts from, for instance, reactions with strong solvation effects not captured by gas-phase energy calculations.
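Eq. 3 amounts to summing a handful of additive terms in ln k space. The sketch below wires up three of the eight correction types with hypothetical parameter values; in the actual model, all of these are free parameters fitted on the 20 training MCRs:

```python
# Hypothetical correction values; in the actual model these are free
# parameters optimized on the training set.
CORRECTIONS = {
    "rate_class": {"very_slow": -4.0, "slow": -2.0, "fast": 0.0, "very_fast": 2.0},
    "water_penalty": -3.0,  # step needs stoichiometric water absent from the pot
    "ring_bonus": 1.5,      # intramolecular closure of a 3-, 5-, 6-, or 7-ring
}

def corrected_ln_k(ln_k_mayr, rate_class, needs_water=False, ring_size=None,
                   delta_e=None, delta_e_cutoff=50.0, polanyi_scale=0.1):
    """ln k_i = ln k_i^Mayr + sum of corrections (Eq. 3); only some of
    the eight correction types are sketched here, with made-up values."""
    ln_k = ln_k_mayr + CORRECTIONS["rate_class"][rate_class]
    if needs_water:
        ln_k += CORRECTIONS["water_penalty"]
    if ring_size in (3, 5, 6, 7):
        ln_k += CORRECTIONS["ring_bonus"]
    # Evans-Polanyi-like term favoring exothermic steps; extreme
    # reaction energies are rejected to avoid artifacts
    if delta_e is not None and abs(delta_e) <= delta_e_cutoff:
        ln_k += -polanyi_scale * delta_e
    return ln_k
```

For example, a "slow" step that additionally needs stoichiometric water starts from the Mayr estimate and is down-corrected by both terms.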

Having defined the rate constants, we quantify the changes in the extent \({\xi }_{{{{\rm{i}}}}}\) of reaction i (Eq. 4) and the concentrations \({C}_{{{{\rm{x}}}}}\) of species \({{{\rm{x}}}}\) (Eq. 5). In both equations, \({\nu }_{{{{\rm{x}}}},{{{\rm{i}}}}}\) stands for the stoichiometric coefficient of compound x in reaction i; in Eq. 4, the subscripts s and substr refer to the limiting substrate of reaction i.

$$d{\xi }_{{{{\rm{i}}}}}=({k}_{{{{\rm{i}}}}}/{\nu }_{{{{\rm{s}}}},{{{\rm{i}}}}}){{dC}}_{{{{\rm{substr}}}},{{{\rm{i}}}}}$$
(4)
$${{dC}}_{{{{\rm{x}}}}}={\sum}_{{{{\rm{incoming}}}},{{{\rm{i}}}}}{\nu }_{{{{\rm{x}}}},{{{\rm{i}}}}}d{\xi }_{{{{\rm{i}}}}}-{\sum}_{{{{\rm{outgoing}}}},{{{\rm{j}}}}}{\nu }_{{{{\rm{x}}}},{{{\rm{j}}}}}d{\xi }_{{{{\rm{j}}}}}$$
(5)

For a given set of the model’s 23 free parameters, these equations are numerically integrated using a finite-difference method. Integration is initiated with the concentrations of the starting materials equal to their stoichiometric coefficients in a given MCR and the concentrations of all other species set to zero. The yield of a particular MCR is then calculated as the product-to-initial-substrate concentration ratio at the end of integration. The length of integration is defined as a global parameter N (to be optimized) multiplied by the number of steps in a given reaction pathway.
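The finite-difference integration of Eqs. 4 and 5 can be illustrated on a toy two-reaction network – a fast main reaction competing with a slow side reaction. This is a sketch of the scheme under simple mass-action kinetics, not the production integrator:

```python
def integrate_network(reactions, c0, n_steps, dt=1e-3):
    """Explicit finite-difference integration of Eqs. (4)-(5) for a toy
    network. `reactions` is a list of (rate constant k, {species: nu})
    with negative nu for substrates; mass-action kinetics assumed."""
    c = dict(c0)
    for _ in range(n_steps):
        dc = {x: 0.0 for x in c}
        for k, stoich in reactions:
            # mass-action rate from current substrate concentrations
            rate = k
            for x, nu in stoich.items():
                if nu < 0:
                    rate *= max(c[x], 0.0) ** (-nu)
            dxi = rate * dt                  # extent increment, cf. Eq. (4)
            for x, nu in stoich.items():
                dc[x] += nu * dxi            # Eq. (5): incoming minus outgoing
        for x in c:
            c[x] = max(c[x] + dc[x], 0.0)
    return c

# A + B -> C (fast main reaction) competing with A -> D (slow side reaction)
final = integrate_network(
    [(5.0, {"A": -1, "B": -1, "C": 1}), (0.5, {"A": -1, "D": 1})],
    {"A": 1.0, "B": 1.0, "C": 0.0, "D": 0.0},
    n_steps=5000,
)
yield_C = final["C"] / 1.0  # product-to-initial-substrate concentration ratio
```

Because the main reaction is ten times faster, most of A ends up in the product C, with a minor fraction diverted to the side product D; the atoms of A are conserved across A, C, and D throughout the integration.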

Parameters’ optimization

The model’s 23 parameters are optimized on the training set of 20 reaction networks underlying the known MCRs (from Fig. 1a; in total, spanning 993 mechanistic steps). This optimization (i) aims to maximize the correlation coefficient between the calculated and experimental yields and (ii) is performed using the Bayesian optimization algorithm with a Gaussian process as the surrogate model, as implemented in the OpenBox library46. Specifically, we used expected improvement as the acquisition function and a radial basis function (RBF) kernel. The search space comprised continuous variables along with their ranges and starting values (see argsParser.py for details of the space definition) as well as constraints defining relations between variables (e.g., to enforce that a “Fast” reaction is faster than a “Slow” one, and a “Slow” reaction faster than a “Very Slow” one; function buildOpenBoxSpaceFromDict in the optscan.py file). For each model considered, five independent runs were performed for at least 100 optimization steps each. Each optimization aimed to maximize the coefficient of determination of a linear regression (without intercept, i.e., passing through (0,0)) between experimentally reported yields and yields predicted by the model (see the optFunctionOpenBox function in the optscan.py file for implementation details). The model offering the highest correlation was then taken. With the parameters thus optimized, the model was used to estimate the yields of the 10 MCRs from the test set (Fig. 1b).
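The objective function – the coefficient of determination of a regression forced through the origin – differs from the usual centered R² and is worth spelling out; a minimal implementation:

```python
def r2_through_origin(y_true, y_pred):
    """Coefficient of determination for a linear fit without intercept
    (y_true ~ b * y_pred, forced through (0, 0)). With no intercept,
    the total sum of squares is uncentered."""
    b = sum(t * p for t, p in zip(y_true, y_pred)) / sum(p * p for p in y_pred)
    ss_res = sum((t - b * p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum(t * t for t in y_true)  # uncentered total sum of squares
    return 1.0 - ss_res / ss_tot

# perfectly proportional predictions (slope b = 2) give R^2 = 1
r2 = r2_through_origin([10, 20, 30], [5, 10, 15])
```

A score of this form is what the optimizer maximizes over the 23 parameters; the yields themselves come from the kinetic integration described above.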

Model’s performance and limitations

As shown in Fig. 5 and further quantified in Fig. 6, optimization on Level 3 networks yields a Pearson correlation coefficient, \({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.800, a coefficient of determination, \({R}_{{{{\rm{train}}}}}^{2}\,\)= 0.432, and a mean absolute error, \({{{\rm{MAE}}}}=10.5\%\). On the test set, the model achieves \({\rho }_{{{{\rm{test}}}}}^{2}=\) 0.861, \({R}_{{{{\rm{test}}}}}^{2}\) = 0.852, and \({{{\rm{MAE}}}}=7.3\%\). This means that the model extrapolates well to unseen mechanisms and reaction types. Indeed, Fig. 7 shows that only 32 reaction classes are shared between the 115 classes in the training-set MCRs and the 71 classes in the test set (for the list of reaction classes, see Supplementary Section S2). Of note, the performance of the model is likely as good as it can get given the small size of the training dataset and the model’s assumptions. This is corroborated by the analyses detailed in Supplementary Section S1, in which we compared the best model described here with an ensemble of 331 models that showed similar performance. Indeed, the differences between (i) the predicted yields averaged over the 331-model ensemble and (ii) the predicted yields from the best model are small (Supplementary Fig. S1), and so are the standard deviations of the yields predicted from the ensemble of models. Also, Supplementary Fig. S2 evidences that these standard deviations do not correlate with experimental yields, suggesting that the models have low diversity (i.e., give similar errors/results on new data).

Fig. 5: Correlation between experimental and predicted reaction yields.
figure 5

The training set consists of mechanistic reaction networks for 20 MCRs reported in the literature; the test set comprises mechanistic networks for 10 new, computer-discovered (and experimentally validated) MCRs and one-pot reactions described in ref. 11. Summary of reactions is shown in Fig. 1a, b. The trend line (orange) is a fit to all data points, trend lines fitted to training and test sets are largely overlapping and are not shown for clarity. The red line shows the ideal relationship between experimental and predicted yields (y = x).

Fig. 6: Performance of the full model and sub-models without selected correction(s).
figure 6

a Mean Absolute Error (MAE) of yield prediction, b square of the Pearson correlation coefficient (\({\rho }^{2}\)), c Coefficient of determination (R2).

Fig. 7: Reaction types used in test and training sets.
figure 7

83 reaction types were unique to the training set, 39 reaction types were present only in the test set, and 32 reaction types were common to both sets. Examples of reaction types from each set are listed next to the corresponding sectors of the pie chart. For the full list of reaction types, see Supplementary Section S2.

Next, we investigated which parameters of the model are crucial to its performance. As already mentioned, the \({{\mathrm{ln}}}{{k}}_{{{{\rm{i}}}}}^{{{{\rm{Mayr}}}}}\) term by itself performs poorly – on the training set, it achieves \({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.678, \({R}_{{{{\rm{train}}}}}^{2}\,\)= −1.10, and \({{{\rm{MAE}}}}=22.1\%\), and on the test set, \({\rho }_{{{{\rm{test}}}}}^{2}=\) 0.404, \({R}_{{{{\rm{test}}}}}^{2}\) = −0.059, and \({{{\rm{MAE}}}}=20.7\%\). This is quantified by the second-from-the-left pair of bars in the histogram in Fig. 6. The remaining pairs of bars in this figure correspond to the full model with individual correction terms removed. As seen, the most important correction is \({r}_{{{{\rm{i}}}}}^{{{{\rm{water}}}}}\), penalizing steps that require stoichiometric water but belong to pathways not run under aqueous conditions. Without this correction, the model achieves only \({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.601, \({R}_{{{{\rm{train}}}}}^{2}\,\)= −0.439, and \({{{\rm{MAE}}}}=17.9\%\) and, on the test set, \({\rho }_{{{{\rm{test}}}}}^{2}=\) 0.462, \({R}_{{{{\rm{test}}}}}^{2}\) = 0.166, and \({{{\rm{MAE}}}}=17.3\%\). In turn, the contribution from \({r}_{{{{\rm{i}}}}}^{{{{\rm{class}}}}}\) is important for model generalization: without it, performance on the training set is comparable to that of the full model, but correlation on the test set is significantly worse. On the flip side, some of the corrections may be spurious, implying that models with fewer parameters work equally well; for instance, models without \({r}_{{{{\rm{i}}}}}^{{{{\rm{Polanyi}}}}}\) or \({r}_{{{{\rm{i}}}}}^{{{{\rm{cond}}}}}\) perform comparably to the full model.
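The \({{\mathrm{ln}}}{{k}}_{{{{\rm{i}}}}}^{{{{\rm{Mayr}}}}}\) term rests on Mayr’s linear free-energy relationship, log10 k = sN(N + E), which combines a nucleophilicity parameter N, a nucleophile-specific sensitivity sN, and an electrophilicity parameter E. A minimal sketch of this relationship (the numerical values in the test below are hypothetical, chosen only for illustration):

```python
import math


def mayr_log10_k(N, sN, E):
    """Mayr relationship: log10 of the second-order rate constant from
    nucleophilicity N, nucleophile sensitivity sN, and electrophilicity E."""
    return sN * (N + E)


def mayr_ln_k(N, sN, E):
    """Natural-log form, convenient for additive free-energy-like scores."""
    return math.log(10.0) * mayr_log10_k(N, sN, E)
```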

We also analyzed how detailed the mechanistic knowledge has to be to assure accurate yield predictions. On the one hand, training the full model (i.e., with all corrections) only on the main reaction pathways with immediate side-reactions (Level 2, L2) is insufficient, as the metrics of accuracy (\({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.345, \({R}_{{{{\rm{train}}}}}^{2}\,\)= −0.127 and \({{{\rm{MAE}}}}=16.1\%;\) \({\rho }_{{{{\rm{test}}}}}^{2}=\)0.042, \({R}_{{{{\rm{test}}}}}^{2}\) = −0.590 and \({{{\rm{MAE}}}}=27.1\%\)) are much lower than for the Level 3, L3, analysis described above. On the other hand, expansion to Level 4, L4, allowing for downstream reactions of the L3 species, also worsens performance (\({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.709, \({R}_{{{{\rm{train}}}}}^{2}\,\)= 0.053 and \({{{\rm{MAE}}}}=14.1\%;\) \({\rho }_{{{{\rm{test}}}}}^{2}=\)0.436, \({R}_{{{{\rm{test}}}}}^{2}\) = −0.061 and \({{{\rm{MAE}}}}=20.5\%\)). We believe this effect is reasonably explained by the model’s inherent error (due to the simplified treatment of kinetics) propagating through the rapidly growing networks. In fact, the L4 networks are ca. 80% larger than the L3 ones (Fig. 8).
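The level-by-level growth of the networks can be illustrated with a toy breadth-first expansion; here, `apply_transforms` stands in for the mechanistic transforms and is a placeholder of ours, not part of the study’s code:

```python
def expand_network(substrates, apply_transforms, max_level):
    """Toy level-wise expansion of a mechanistic network: level 1 holds
    the substrates; each subsequent level applies all transforms to the
    species first discovered at the previous level."""
    species = set(substrates)   # all species seen so far
    frontier = set(substrates)  # species added at the current level
    for _ in range(2, max_level + 1):
        new_species = set()
        for s in frontier:
            for product in apply_transforms(s):
                if product not in species:
                    new_species.add(product)
        species |= new_species
        frontier = new_species
    return species
```

With a toy transform that doubles an integer, `expand_network([1], lambda s: [2 * s], 3)` returns the three-level set `{1, 2, 4}`; each extra level multiplies the deepest species, mirroring how network size balloons from L3 to L4.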

Fig. 8: Mean sizes of mechanistic networks at different levels of calculations for test and training sets.
figure 8

Data were derived from bipartite mechanistic graphs in which both reactions and compounds are represented as nodes, and edges connect compounds to the reactions in which they participate (see Fig. 2d). Networks from the test set are consistently larger than those from the training set, and the difference grows with calculation level (up to 15% for L2, 3-23% for L3, and 36-55% for L4).
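The bipartite representation described above (compounds and reactions both as nodes) can be sketched minimally; the network "size" compared here is then the total node count. The data layout below is our assumption, chosen for illustration:

```python
def network_size(reactions):
    """Size of a bipartite mechanistic graph in which every reaction and
    every compound is a node. `reactions` maps a reaction id to a pair
    (substrates, products) of compound-name lists; edges (implicit here)
    link each reaction to its substrates and products."""
    compounds = set()
    for substrates, products in reactions.values():
        compounds.update(substrates)
        compounds.update(products)
    return len(compounds) + len(reactions)
```

For example, two chained steps `{"r1": (["A", "B"], ["C"]), "r2": (["C"], ["D"])}` give 4 compound nodes plus 2 reaction nodes, i.e., a size of 6; shared intermediates such as C are counted once.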

While this trend could be expected, it also points to the main drawback of the mechanistic approach – namely, that it can be quite sensitive to missing parts of the mechanistic picture. For example, had the algorithm used to construct the mechanistic networks not known the substitution of bromide with thiolate, it would not have been able to predict the formation of the 2-[(2-methoxy-2-oxoethyl)sulfanyl]hept-2-enoate by-product in the beta-elimination step of the last reaction in Fig. 1b (marked with a yellow star), and would have grossly overestimated the yield of the desired product (81% instead of 47%). In this respect, we emphasize that even though our current 8000-transform set covers a broad range of mechanistic steps (acid- and base-catalyzed, substitutions, eliminations, additions, rearrangements, pericyclic reactions, basic transformations catalyzed by transition metals), it is not without notable omissions. For instance, radical mechanistic steps are still to be included, since their proper generalization (from specific precedents47 into transforms applicable to different scaffolds) is challenging; this effort may require a separate study, akin to our recent work on carbocationic rearrangements10. Also, the currently available data are insufficient to predict how reaction rates depend on specific choices of catalysts. In the fullness of time, this information may become available either from high-end quantum-mechanical calculations or from the promising marriage of experiments with AI48,49,50,51,52,53,54.

Discussion

In summary, the approach we described is a union of computer-assisted analysis of mechanistic reaction networks with rate approximations and corrections grounded in physical-organic chemistry. The model, as is, works well in predicting the yields of MCRs based on reactions between various nucleophiles and electrophiles, and transfers well between training and test sets built on markedly different and diverse reaction mechanisms. Recognizing that more work is needed to incorporate other classes of reactivity, we feel that approaches like this one should be pursued as an alternative to chemical AI methods based on full-reaction data, which, despite being trained on thousands to millions of literature examples, do not offer satisfactory accuracy of yield prediction.

Methods

The optimization procedure to identify the best model was repeated 10 times from different random starting parameters. This was done using Bayesian optimization with two surrogate models: (i) a Gaussian Process (GP) and (ii) a probabilistic random forest (PRF). During these optimization campaigns, 331 sets of parameters with R2 > 0.40 were found. Average predicted yields from this ensemble were then compared against those from the best model, with the differences being generally small (2.7% on average over all reactions, see Supplementary Fig. S1). To quantify the model’s uncertainty, we considered the standard deviation of the yield predictions of the individual models within the ensemble of all 331 models. For the test and training sets, the standard deviations were small, 2% and 5%, respectively. Moreover, standard deviation and prediction error were not correlated (Supplementary Fig. S2). We also verified that all models have comparable accuracy, and that the results from the ensemble are not significantly better than those of a single model. This suggests that diversity between models is low, and all optimizations result in models with similar accuracies/errors on the test data. Colloquially put, the performance metrics described in the main text are likely as good as they can get, given the dataset and the model’s key assumptions.
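The ensemble analysis above amounts to averaging each reaction’s predicted yield over the models and taking the standard deviation as the uncertainty. A generic sketch (with hypothetical numbers in the example, not the study’s data):

```python
from statistics import mean, pstdev


def ensemble_summary(predictions):
    """predictions: one list of predicted yields per model, all in the
    same reaction order. Returns per-reaction ensemble means and
    (population) standard deviations, the latter serving as the
    uncertainty measure."""
    per_reaction = list(zip(*predictions))  # transpose: models -> reactions
    means = [mean(p) for p in per_reaction]
    stds = [pstdev(p) for p in per_reaction]
    return means, stds
```

For three models predicting two reactions, `ensemble_summary([[40, 60], [44, 62], [48, 64]])` yields per-reaction means of 44 and 62 with small standard deviations, the kind of ensemble spread reported in Supplementary Fig. S1.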