Abstract
This work describes the estimation of yields of complex multicomponent reactions (MCRs) from modeled networks of mechanistic steps spanning the main reaction pathway as well as immediate and downstream side reactions. Because experimental values of the kinetic rate constants for individual mechanistic transforms are extremely sparse, these constants are approximated here using Mayr’s nucleophilicity and electrophilicity parameters, fine-tuned by correction terms grounded in linear free-energy relationships. With this formalism, the model trained on the mechanistic networks of only 20 MCRs (albeit mechanistically diverse and spanning a wide range of yields) transfers well to newly discovered MCRs that are based on markedly different mechanisms and types of individual mechanistic transforms. These results suggest that a mechanistic-level approach to yield estimation may be a useful alternative to models that are derived from full-reaction data and lack information about yield-lowering side reactions.
Introduction
The ability to predict or, at least, estimate the yields of organic reactions would be of tremendous value for synthetic chemistry, limiting the number of unproductive experiments, minimizing the use of solvents and reagents, and lowering the overall monetary and environmental cost of chemical production. Not surprisingly, there have been many attempts to develop algorithms for that purpose. In our own work, we evaluated both thermodynamic models based on optimized free-energy group contributions (assuming thermodynamic control)1 as well as machine-learning, ML, methods2,3; others have since focused on various ML approaches. Despite some early optimism4,5, subsequent studies revealed relatively low correlations between experimental and predicted yield values – not only in collections of diverse reaction types2,6 but also within larger sets of same-type reactions3,6,7,8. Pondering the reason for this unsatisfactory performance, we observe that all efforts to date learned on full, substrate-to-product reactions, as typically reported in the literature and/or electronic notebooks. Such full-reaction data does not capture reactions’ mechanistic intricacies – in particular, it has no explicit knowledge of possible side reactions that can lead to undesired outcomes and lower the yields. Some of this knowledge could, in principle, be captured through adequately large9 numbers of examples of failed reactions but these are typically not published, and the distributions of yields in literature datasets are heavily skewed toward higher values (with a mean approaching 80%2).
Given these limitations, we recently began to teach the computer mechanistic transformations which, when applied to desired substrates, propagate large networks of mechanistic steps. In ref. 10, we encoded some 400 mechanistic steps specific to carbocations and used the network approach to predict the mechanisms of complex carbocationic rearrangements. Therein, we parametrized the heights of kinetic barriers (based on quantum-mechanical calculations) and used this knowledge of kinetics to predict product distributions and yields. More recently, we deployed a much larger set of ~8000 general-scope mechanistic transforms (cf. below) and applied them to multiple small-molecule substrates. This effort was intended to trace mechanistic pathways defining new multicomponent reactions, MCRs, which are particularly appealing because they offer high atom-economy, minimize separation and purification operations, and can yield complex scaffolds that are often less prone to follow-up or side reactions than non-MCR reactions. Indeed, in ref. 11 we described how such analyses enable systematic discovery of plausible MCR candidates and validated multiple of them by experiment. An essential part of this effort has been the ability to estimate the yields of these MCRs: would multicomponent substrate mixtures result in a mixture of low-yielding products, or would they lead to a major product in good yield? Unfortunately, given the number and diversity of mechanistic steps in the 8000 set, QM calculations of kinetic barriers have proven prohibitive. Instead, we pursued and describe here a physical-organic approach in which the kinetics of mechanistic steps is approximated by using nucleophilicity and electrophilicity indices and linear free-energy relationships. We train this model on the mechanistic networks of 20 known MCRs (chosen to span a wide range of yields, Fig. 1a, c) and then apply it to predict the yields of our newly discovered MCRs (Fig. 1b, d) that not only use different substrates but are also based on unprecedented mechanisms. Despite such fundamental mechanistic differences, the model transfers between the training and testing MCRs, achieving similar and, in the light of previous efforts, quite satisfactory performance metrics (e.g., mean absolute errors, MAE = 10.5% and 7.3%, respectively). These results suggest that a mechanistic-level approach to yield estimation may be a useful alternative to models derived from full-reaction data, although, as we also emphasize in our discussion, it still awaits future extensions of the 8000 rule set and validation on larger sets of mechanistic networks.
a Schemes of 20 MCRs used to train the yield-prediction model. Percent numbers are highest/optimized yields for these reactions reported in the references whose numbers are given in square brackets. b Schemes of 10 MCR and one-pot reactions discovered by the Allchemy algorithm, validated by experiment, and reported by us in ref. 11. Percentage values are experimental yields. These reactions were used to test the yield-prediction model. They were committed to validation based on mechanistic novelty, conciseness of synthesis (compared to traditional routes for making similar targets), and, in several cases, for potential applicability. For example, MCR at the top of the left column in b (57% yield) leads to a scaffold that can be further reacted with phenyl hydrazine to give substituted pyrazoles, popular motifs of many drugs. Third-from-the top entry in the same column (also 57%), produces scaffolds akin to oblongolide natural products studied as potential algicide, herbicide55, and antiviral56 agents. The fourth-from-the-top entry (82%) leads to branched diallylic ethers. After metathesis (not compatible with one-pot conditions), they can cyclize into enol ether scaffolds used in various medicinal syntheses57,58,59,60. The bottom MCR in the same column (31%) gives a substituted hexahydro-2(3H)-benzofuranone, which is a motif found in various natural products and bioactive compounds61. Turning to the right column, the second entry from the top (34%) produces a spiro system akin to that found in some drugs and bioactive agents62,63,64,65. The MCR just below it (97%) gives a 1-(1-cyclohexenyl)naphthalene, atropisomeric scaffold familiar from various types of drugs66,67. Reaction marked by a yellow star is discussed in the main text; reactions marked by pink stars are discussed in Supplementary Section S1. Distribution of experimental yields in c, the literature-derived training set from a (blue bars), and d our testing set from b (pink bars).
Results
Choice of MCRs for training and testing
We began by selecting a set of MCRs on which to parametrize the model. As we showed in ref. 11, there are only ~630 distinct MCR types (differing in the reaction core) and the majority of those are reported with good yields which, as we argued before2,3, severely limits the predictivity of data-derived models. Accordingly, we sought a training set of mechanistically diverse MCRs (Fig. 1a and, for mechanistic details, refs. 12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31) for which the highest, optimized yields reported in the literature are roughly uniformly distributed between 33 and 100% (Fig. 1c). The size of this set (20 MCRs) was determined by the availability of published examples reporting lower yields (eight examples with yields ≤ 55%). As the test set, we used the aforementioned 10 MCRs and one-pot sequences recently discovered11 by the Allchemy algorithm and subsequently validated by experiment (Fig. 1b). The yields of these test reactions also span a broad range of values (Fig. 1d).
Mechanistic transforms and networks
For each of these MCRs, we used mechanistic transforms (expert-coded in the SMARTS notation, akin to our previous works on retro- and forward-synthesis32,33,34,35,36,37) to propagate mechanistic networks from substrates to products. As detailed in refs. 10,11, these transforms are roughly at the level of the so-called arrow-pushing steps and encompass a broad range of chemistries (though not yet exhaustive, see later in the text and the Methods section of ref. 11). Each transform is broader than any specific literature precedent and delineates the scope of substituents admissible or prohibited at various positions. It is also accompanied by information about general reaction conditions (strongly basic, basic, mildly basic, neutral, mildly acidic, strongly acidic, and whether the reaction requires a Lewis acid), solvent class (protic/aprotic and/or polar/nonpolar), temperature range (<−20 °C, −20 to 20 °C, r.t., 40 to 150 °C, >150 °C), water tolerance (yes, no, water is required), typical speeds (very slow, slow, fast, very fast, and unknown if conflicting literature data have been reported), and more.
Propagation of the mechanistic networks starts with a set of substrates (denoted as synthetic generation G0 in Fig. 2a) either specified by the user or systematically selected from an expert-curated collection of ca. 2400 simple molecules featuring combinations of functional groups promoting diverse modes of reactivity (for detailed list, see ref. 11). To these substrates, the algorithm applies the matching mechanistic transforms – under all possible conditions – to create the first-generation, G1, of products and by-products, which are then iteratively reacted10,11 to give generations G2 and higher (here, up to G8) to reach the reported product (Fig. 2a and https://mcrchampionship.allchemy.net for examples studied here; interactive version for these and other mechanistic calculations available at https://mech.allchemy.net).
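The generation-by-generation propagation described above can be illustrated with a minimal Python sketch. It is not the Allchemy implementation: species are plain string labels rather than molecules, the two "transforms" are hypothetical stand-ins for SMARTS-encoded rules, and only bimolecular steps are considered.

```python
from itertools import combinations

def imine_formation(a, b):
    # Hypothetical toy transform: amine + aldehyde -> imine + water.
    return {"imine", "water"} if {a, b} == {"amine", "aldehyde"} else set()

def nitrilium_formation(a, b):
    # Hypothetical toy transform: imine + isocyanide -> nitrilium.
    return {"nitrilium"} if {a, b} == {"imine", "isocyanide"} else set()

def expand_generations(substrates, transforms, max_generation):
    """Return {species: generation in which it first appears}."""
    generation_of = {s: 0 for s in substrates}  # G0 = starting materials
    for g in range(1, max_generation + 1):
        pool = sorted(generation_of)
        new_species = set()
        for a, b in combinations(pool, 2):
            # Pairs made only of older species were already tried earlier.
            if max(generation_of[a], generation_of[b]) != g - 1:
                continue
            for transform in transforms:
                new_species |= transform(a, b)
        new_species -= set(generation_of)
        if not new_species:
            break  # network stopped growing
        for s in new_species:
            generation_of[s] = g
    return generation_of

generations = expand_generations(
    ["amine", "aldehyde", "isocyanide"],
    [imine_formation, nitrilium_formation],
    max_generation=8,
)
```

In this toy run the imine and water appear in G1 and the nitrilium species in G2, after which no new species are produced and the expansion stops early.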
a The algorithm described in ref. 11 first applies ~8000 mechanistic, SMARTS-encoded transforms to the reaction substrates (here, benzyl isocyanide, phenylphosphinic acid, and phenylpropionaldehyde) corresponding to the bottom row of molecule nodes in the zero-th synthetic generation, G0. Matching transforms are applied to these starting materials to generate intermediates in generation G1, then to G0 and G1 species to generate G2, and so on. The network shown is expanded to G5 with the reaction product colored blue and overlaid on the network. During network expansion, transforms under all possible conditions are applied. However, for a valid reaction sequence, the individual mechanistic steps must be matching and must meet several conditions. For instance, as detailed in ref. 11, such sequences cannot combine solvents of different classes (categorized as protic/aprotic and polar/non-polar), cannot combine steps requiring oxidative and reductive conditions, may apply water-sensitive steps only before water-requiring ones, cannot toggle between basic and acidic conditions, and more. Of note, some transforms may have more than one categorization and, if so, these are considered as logical alternatives when considering step-matching along a sequence. In the current example, the thicker blue line traces the only matching mechanistic pathway connecting the starting materials and the products. b The mechanism corresponding to this matching pathway is shown as Allchemy’s screenshot and agrees with the pathway proposed in ref. 18. The major MCR pathway thus identified is the first level of analysis (Level 1). c Once found, this Level 1 solution is expanded sideways to account for the by-products of individual mechanistic steps as well as side reactions possible under similar reaction conditions (Level 2). The analysis can be expanded to higher levels, to account for further reactions of by-products and products of side reactions (see scheme in Fig. 3). The network presented in the panel corresponds to Level 3 analysis. d The same graph as in (c) but redrawn as a bipartite graph with molecule nodes represented as circles and reaction nodes, as diamonds.
Within the networks thus constructed, the algorithm traces the substrates-to-product sequences of mechanistic steps that are mutually compatible (see caption to Fig. 2a and, for detailed description, ref. 11). For each of the MCRs, the pathway fulfilling this mutual compatibility of steps and also closest-matching the class of literature-reported conditions is taken as Level 1 of our network analysis (Fig. 2b). Importantly, for the MCRs studied here, the sequences the algorithm concatenates from individual mechanistic steps agree with the mechanisms proposed by the authors in the original publications12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31. Next, the Level 1 routes are expanded sideways to account for by-products of the main-route and products of any side-reactions possible under the main-pathway conditions (Level 2 of analysis). Level 3 accounts for reactions, in which species from Level 2 react between themselves or with species from Level 1. Such expansion can then be iterated to higher levels and is illustrated schematically in Fig. 3 and in Fig. 2c, d for a specific MCR example. In Fig. 2c, a simple molecular graph format is used (with one type of nodes corresponding to molecules), whereas in Fig. 2d the so-called bi-partite graph is applied in which there are two types of nodes: one type corresponding to molecules and the other, to reactions in which these molecules engage. As we discussed elsewhere38, the bipartite representation is better suited to capture all causal relationships between substrates and products and is the one used in kinetic analyses to which we now turn.
The main MCR pathway is colored in blue. This route must meet the requirements of Level 1 analysis – that is, conditions of all individual steps must be mutually compatible (different steps may not intermingle oxidative and reductive conditions, solvents should belong to the same class, etc.; for details, see ref. 11). Level 2 corresponds to reactions branching-out from the main path and leading to by-products (grey) or products of competing/side reactions possible under the same class of reaction conditions (red). Reactions between species from Level 2 or between species from Level 2 and Level 1 lead to products at Level 3 (light-orange). This expansion can be iteratively continued. For instance, products at Level 4 (dark-orange) form when species from Level 3 react between themselves or with molecules from Levels 1 or 2. Naturally, only reactions compatible with the conditions of the main MCR pathway are considered (e.g., if the multicomponent mixture is under basic conditions, there is no point to analyze side reactions that require, say, acidic conditions).
Development of the kinetic model
With the ultimate objective of estimating MCR yields, it is first necessary to approximate the equilibrium constants or kinetic rate constants of the individual steps in a mechanistic graph expanded to some Level n. Since published kinetic data are extremely sparse, it is impossible to assign experimental values to the vast majority of steps – hence, we have aimed to develop a heuristic model based on the extension of Mayr’s nucleophilicity and electrophilicity indices39,40 and linear free-energy relationships.
To begin with, and as discussed in detail in ref. 11, individual mechanistic steps can have multiple classes of plausible conditions assigned to them (e.g., a Diels-Alder cycloaddition can be carried out either under neutral conditions at high temperature or at lower temperatures using a Lewis acid catalyst). With this in mind, if multiple steps along the main/Level 1 MCR route have overlapping condition ranges as logical alternatives, we choose the conditions that minimize the overall number of condition changes along the route. For example, if some Step 1 could be carried out under neutral or mildly acidic conditions and the subsequent Step 2 requires either mildly acidic or acidic conditions, the Step 1/Step 2 sequence is assigned the common, mildly acidic conditions. Along the entire route, such unification of conditions is performed using a greedy algorithm on a topologically sorted sequence of steps (topological sorting was performed with Kahn’s algorithm as implemented in the NetworkX library41).
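A minimal sketch of this unification is given below. It is an illustration under stated assumptions, not the authors' code: Kahn's algorithm is written out by hand instead of calling NetworkX, the greedy rule (keep extending a run of steps while their allowed conditions still intersect) is one plausible reading of "greedy algorithm", and the step names and tie-breaking by alphabetical order are hypothetical.

```python
from collections import deque

def kahn_topological_sort(steps, edges):
    """Order steps so that every step follows all of its predecessors."""
    indegree = {s: 0 for s in steps}
    successors = {s: [] for s in steps}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    queue = deque(s for s in steps if indegree[s] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(steps):
        raise ValueError("step graph contains a cycle")
    return order

def unify_conditions(order, allowed):
    """Greedily assign one condition per run of compatible consecutive steps."""
    assigned, segment, common = {}, [], None
    for step in order:
        options = set(allowed[step])
        if common is None or not (common & options):
            # No shared condition left: close the current run, start a new one.
            for s in segment:
                assigned[s] = sorted(common)[0]  # arbitrary tie-break
            segment, common = [step], options
        else:
            segment.append(step)
            common &= options
    for s in segment:
        assigned[s] = sorted(common)[0]
    return assigned

steps = ["step1", "step2", "step3"]
edges = [("step1", "step2"), ("step2", "step3")]
allowed = {
    "step1": {"neutral", "mildly acidic"},
    "step2": {"mildly acidic", "acidic"},
    "step3": {"basic"},
}
conditions = unify_conditions(kahn_topological_sort(steps, edges), allowed)
```

For the Step 1/Step 2 example from the text, both steps receive the shared mildly acidic conditions; the hypothetical third step forces a single change to basic conditions.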
Next, we analyze the entire Level-n mechanistic graph and identify all acid-base equilibria. We stipulate that (i) under basic conditions, protonated forms cannot exist and, (ii) conversely, under acidic conditions, deprotonated forms are to be excluded from the graph. Naturally, these are simplifications, and in reality the fraction of protonated/deprotonated forms is not binary and depends on the specific pKa values, the number of acid/base equivalents used, and the solvent environment. For all other acid-base equilibria and in an effort to minimize the number of free parameters, we make a very rough assumption that their equilibrium constants, K, are always the same. The value of K is a parameter for the global optimization over all 20 MCRs in the training set. Similarly, for all tautomerizations, we assume one global equilibrium constant, Ktau, to be optimized. Note that the treatment of tautomers as separate species within the reaction graph is necessitated by the SMILES/SMARTS notation that considers tautomers as distinct structures. The limitations of this notation also require that resonance structures of enolate anions be treated as two separate entities (C- and O-anions, here taken in 1:1 ratio). Reaction classes categorized as in-principle reversible but, for some particular substrates, resulting in aromatization (e.g., beta-elimination leading to a thiazole) are assigned as irreversible.
Furthermore, the graph is simplified by preventing reactions of some very reactive species (e.g., acyl chlorides, organolithium compounds) present in earlier synthetic generations with species formed in later generations. For instance, if an acyl chloride is present in G0 and can engage in reactions with some other species from the same generation, it cannot react with species formed in G1 or higher. Colloquially put, such reactive species are not allowed to sit around and wait until multiple other steps happen and some downstream reaction partners present themselves; instead, they react as rapidly as they can with immediately available suitors. Of course, if in the experimental procedure some reagents are added only at later stages of the sequence, this delayed addition is reflected in the structure of the graph (connecting the incoming chemical to the step in which it is used).
Regioselectivity of C-H deprotonation is assessed for motifs prone to non-equivalent deprotonation (asymmetric ketones, 1,3-di(thio)carbonyls, 1,3-ketoesters or other active methylene compounds) using pKa values pre-calculated using the graph convolutional neural network pKa predictor we described in ref. 42. If one equivalent of base is used, deprotonation and subsequent reaction are allowed only at the most acidic position. With excess base and possible formation of a di-anion, the reaction is allowed to proceed at the second most acidic locus.
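The site-selection rule can be written down in a few lines. The sketch below is an illustration only: the function name, the site labels, and the pKa numbers are hypothetical placeholders (in the work described here the pKa values come from the predictor of ref. 42), and "excess base" is simplified to "more than one equivalent".

```python
def reactive_site(site_pka, base_equivalents):
    """Pick the C-H position at which the subsequent reaction is allowed.

    site_pka: {site label: predicted pKa}; a lower pKa means a more acidic site.
    """
    ranked = sorted(site_pka, key=site_pka.get)  # most acidic first
    if not ranked:
        return None
    if base_equivalents > 1 and len(ranked) > 1:
        # Di-anion assumed possible: react at the second most acidic locus.
        return ranked[1]
    return ranked[0]

# Hypothetical 1,3-ketoester with two non-equivalent C-H positions.
pka = {"C2": 11.0, "C4": 24.0}  # placeholder values, not predictions
site_one_equiv = reactive_site(pka, base_equivalents=1.0)
site_excess = reactive_site(pka, base_equivalents=2.2)
```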
With the graph thus processed, we proceed to assign kinetic rates to individual steps. We do this by extending the popular Mayr’s reactivity parameters available from refs. 39,40. According to the so-called Mayr–Patz equation39, the logarithm of a second-order reaction rate constant at 20 °C can be related to the nucleophilicity parameter N and electrophilicity parameter E as

\[\log k_{20\,^{\circ}\mathrm{C}} = s\,(N+E) \tag{1}\]

where N varies between −8.8 and +32, E between −30 and +8, and \(s\) is a nucleophile-dependent slope parameter. Here, we do not use the values of slope parameters, \(s\), from Mayr’s tables because they can depend quite strongly on specific solvents (and temperatures). Moreover, in some cases, the values tabulated for specific substrates lead to problematic predictions, e.g., that \(s\) be higher for a reaction between aniline and an aldehyde than for the reaction of a primary amine with the same aldehyde. In light of these problems, for all reactions, we set the value of the slope parameter to an arbitrary value, say \(s\) = 2.303, such that Eq. 1 simplifies to a natural logarithm more commonly used in the linear free energy relationships, LFER, that we will use later:

\[\ln k^{\mathrm{Mayr}} = N+E \tag{2}\]
Since Mayr’s collection encompasses only 1273 specific nucleophiles and 344 electrophiles, it is very unlikely that they would coincide with the species along our pathways. To remedy this, for any nucleophile-electrophile transformation present in a given mechanistic network, we search Mayr’s compendium for compounds that share the same reacting groups and are the most similar (by Tanimoto similarity based on ECFP4 fingerprints) to the substrates of our particular transformation (Fig. 4); if there are no examples with matching reactive groups in the Mayr set, we assign to such reactions a default value of the rate constant, \(\ln k=1\). We strive to select parameters for solvents that most closely match those predicted by our algorithm. Specifically, if for a given nucleophile and electrophile, multiple Mayr’s data are available in different solvents, we retain only those entries that match the predicted class of solvents (polar/nonpolar, protic/aprotic). Then, we take the solvent that is common to the electrophile and nucleophile (in the rare cases when multiple such common solvents are available, we choose the one most popular in Mayr’s tables). If there are no common solvents for a given nucleophile/electrophile pair, the algorithm selects Mayr’s data corresponding to solvents that are most similar (with similarity defined as the number of common qualifiers, e.g., MeOH and EtOH have two common qualifiers, ‘protic_solvents’ and ‘alcohols’, whereas MeOH and water have only one, ‘protic_solvents’). If Mayr’s data are available only for solvents not matching the predicted solvent class, we take \(\ln k=1\). Finally, we limit the sum \(N+E\) to some maximal absolute value, later taken as a free parameter of the model; this is to avoid extremely high or low reaction rates (which may be unphysical, especially given that they extrapolate to our system(s) from some specific solvent and only one temperature tabulated by Mayr).
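The lookup logic can be sketched as follows. This is an illustration under several assumptions: fingerprints are represented as plain sets of "on" bit indices rather than real ECFP4 vectors, the table entries and their names are hypothetical (only E = −18.4 is taken from the Fig. 4 example), solvent filtering is omitted, and "limiting" the sum \(N+E\) is interpreted as clipping it to a symmetric interval.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 0.0

def nearest_parameter(query_bits, query_group, table):
    """Value of the most similar table entry sharing the reacting group."""
    candidates = [e for e in table if e["group"] == query_group]
    if not candidates:
        return None  # no matching reactive group in the compendium
    best = max(candidates, key=lambda e: tanimoto(query_bits, e["bits"]))
    return best["value"]

def ln_k_mayr(n_param, e_param, max_abs_sum, default=1.0):
    """ln k = N + E, clipped; falls back to the default when data are missing."""
    if n_param is None or e_param is None:
        return default
    return max(-max_abs_sum, min(max_abs_sum, n_param + e_param))

# Hypothetical electrophile table; bit sets are toy fingerprints.
electrophiles = [
    {"name": "ketone_A", "group": "ketone", "bits": {1, 2, 3, 4}, "value": -18.4},
    {"name": "ketone_B", "group": "ketone", "bits": {7, 8, 9}, "value": -20.0},
    {"name": "iminium_C", "group": "iminium", "bits": {1, 2, 3}, "value": -9.0},
]
e_value = nearest_parameter({1, 2, 3, 5}, "ketone", electrophiles)
ln_k = ln_k_mayr(13.0, e_value, max_abs_sum=20.0)  # N = 13.0 is a placeholder
```

Note that the iminium entry is the most similar fingerprint overall but is excluded because its reacting group does not match, which mirrors the "same reacting groups first, similarity second" order described in the text.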
The example is part of one of the MCRs from our training set24 (not all mechanistic steps are shown). As the E parameter for the second substrate for imine formation is not directly available, the algorithm selects the most similar entry in Mayr’s compendium (here, 1-benzylpiperidin-4-one with E = −18.4).
Obviously, such Mayr-like rates provide only a very rough approximation: when the parameters of the model defined thus far were optimized against the 20 literature MCRs, the correlations between calculated and experimental yields were very poor (vide infra). This called for the inclusion of additional terms to better approximate the rate constants, in effect following the LFER philosophy known and developed for decades43,44. In this spirit, we treat the rate constant of step i as a linear combination of several heuristic corrections.
These corrections are of eight types. First, as mentioned above and detailed in ref. 11, all mechanistic transforms come along with the general classification of their rates (e.g., very slow, slow, fast) to which we assign numerical values (to be optimized) defining correction \(r_{\mathrm{i}}^{\mathrm{rate\;class}}\). Second, some transforms are known to proceed as side-reactions but are never dominant (and do not proceed in high yields). For instance, propargyl organolithium compounds can eliminate to allenes but allenes are virtually never purposefully prepared in this manner. Such reactions, as well as quenching reactions, have their own, lower values of the \(r_{\mathrm{i}}^{\mathrm{class}}\) correction. This type of down-correction also applies to reactions which, in order to proceed in good yields, require specific reagents not present in the reaction mixture (e.g., during thiol alkylation, a potential side reaction is disulfide formation; however, it can occur in good yield only if oxygen is present). Third, we penalize (by parameter \(r_{\mathrm{i}}^{\mathrm{water}}\) assigned a value lower than a default value for other reactions) those side steps that ideally require aqueous conditions (e.g., hydrolysis) but, in reality, can use only small amounts of water supplied to them as a by-product of some other reaction in the graph (say, imine formation). Fourth, there is also a penalizing conditions correction, \(r_{\mathrm{i}}^{\mathrm{cond}}\), for side-reactions which can no longer take place after a putative change of conditions along the main MCR pathway. For instance, if an MCR is started under basic conditions but, at some later point, the reaction mixture is neutralized, then deprotonation side-reactions are not possible after this change of conditions. Fifth, there is a global correction \(r_{\mathrm{i}}^{\mathrm{rev}}\) assigned to reversible reactions such as imine formation or Michael addition of a thiol (i.e., equilibria other than acid-base and tautomerizations discussed earlier). The rationale for this term is that it is often possible to adjust the reaction conditions so as to shift the equilibrium in the desired direction. Sixth, correction \(r_{\mathrm{i}}^{\mathrm{ring}}\) promotes intramolecular reactions forming 3-, 5-, 6-, or 7-membered rings. Seventh, the rate of bimolecular reactions is scaled based on the concentration of the non-limiting substrate, \(r_{\mathrm{i}}^{\mathrm{bimolecular}}\). In practice, this means that if one molecule can react in two bimolecular reactions characterized by the same class of rate, the reaction with the more abundant second substrate is favored. Eighth and last, we implement a correction inspired by the so-called Evans–Polanyi rule stipulating that, in a series of homologous reactions, activation energy is proportional to reaction enthalpy \(\Delta H\) (thus promoting exothermic reactions, \(\Delta H\) < 0). In our case, we take \(r_{\mathrm{i}}^{\mathrm{Polanyi}}\) as proportional to \(\exp (-\Delta H)\), where enthalpy is approximated by reaction energy \(\Delta E\) (calculated at the PM6 level in MOPAC45), and extreme values of \(\Delta E\) (with the threshold being a free parameter of the model) are rejected to avoid artifacts of, for instance, reactions with strong solvation effects not captured by gas-phase energetic calculations.
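For orientation, one way to write the corrected rate constant that is consistent with the eight terms listed above is shown below. This is a sketch resting on an assumption: that the corrections enter additively, with unit weights, on the logarithmic scale next to the Mayr term. The text itself only states that the rate constant is a linear combination of the corrections, so the actual weighting may differ.

```latex
\[
\ln k_{\mathrm{i}} \approx \ln k_{\mathrm{i}}^{\mathrm{Mayr}}
  + r_{\mathrm{i}}^{\mathrm{rate\;class}}
  + r_{\mathrm{i}}^{\mathrm{class}}
  + r_{\mathrm{i}}^{\mathrm{water}}
  + r_{\mathrm{i}}^{\mathrm{cond}}
  + r_{\mathrm{i}}^{\mathrm{rev}}
  + r_{\mathrm{i}}^{\mathrm{ring}}
  + r_{\mathrm{i}}^{\mathrm{bimolecular}}
  + r_{\mathrm{i}}^{\mathrm{Polanyi}}
\]
```

In this reading, removing a correction (as in the ablation analyses reported later) simply corresponds to dropping one summand.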
Having defined the rate constants, we quantify the changes in the extent \({\xi }_{{{{\rm{i}}}}}\) of reaction i (Eq. 4) and concentrations \({C}_{{{{\rm{x}}}}}\) of species \({{{\rm{x}}}}\) (Eq. 5). In both equations, \({\nu }_{{{{\rm{x}}}},{{{\rm{i}}}}}\) stands for the stoichiometric coefficient of compound x in reaction i.
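As a reading aid, a standard mass-action form of these two equations is sketched below. It is an assumption rather than a reproduction: it takes \(\nu_{\mathrm{x},\mathrm{i}}\) to be negative for the substrates and positive for the products of reaction i, and uses \(k_{\mathrm{i}}\) for the rate constant of step i defined earlier.

```latex
\[
\frac{\mathrm{d}\xi_{\mathrm{i}}}{\mathrm{d}t}
  = k_{\mathrm{i}} \prod_{\mathrm{x}\,:\,\nu_{\mathrm{x},\mathrm{i}}<0}
    C_{\mathrm{x}}^{\,|\nu_{\mathrm{x},\mathrm{i}}|}
\]
\[
\frac{\mathrm{d}C_{\mathrm{x}}}{\mathrm{d}t}
  = \sum_{\mathrm{i}} \nu_{\mathrm{x},\mathrm{i}}\,
    \frac{\mathrm{d}\xi_{\mathrm{i}}}{\mathrm{d}t}
\]
```

The first expression gives the rate at which reaction i advances; the second sums the contributions of all reactions that produce or consume species x.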
For a given set of the model’s 23 free parameters, these equations are numerically integrated using the finite difference method. Integration is initiated with the concentrations of the starting materials equal to their stoichiometric coefficients in a given MCR and the concentrations of all other species set to zero. The yield for a particular MCR is then calculated as the product-to-initial substrate concentration ratio at the end of integration. The length of integration is defined as some global parameter N (to be optimized; not to be confused with the nucleophilicity parameter) multiplied by the number of steps in a given reaction pathway.
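A minimal sketch of such an integration is shown below. It assumes a forward (explicit Euler) finite-difference scheme with mass-action rates, which is one common choice and not necessarily the scheme used by the authors; the two-reaction network, rate constants, time step, and number of iterations are hypothetical.

```python
def integrate(reactions, initial, dt, n_steps):
    """Forward-Euler integration of mass-action kinetics.

    reactions: list of (k, nu) where nu maps species to a stoichiometric
               coefficient (negative for substrates, positive for products).
    """
    conc = dict(initial)
    for _ in range(n_steps):
        delta = {}
        for k, nu in reactions:
            rate = k
            for species, coeff in nu.items():
                if coeff < 0:  # only substrates enter the rate law
                    rate *= conc.get(species, 0.0) ** (-coeff)
            for species, coeff in nu.items():
                delta[species] = delta.get(species, 0.0) + coeff * rate * dt
        for species, change in delta.items():
            # Guard against small negative values caused by a coarse step.
            conc[species] = max(0.0, conc.get(species, 0.0) + change)
    return conc

# Toy network: main step A + B -> P competing with side reaction A -> S.
reactions = [
    (1.0, {"A": -1, "B": -1, "P": 1}),
    (0.2, {"A": -1, "S": 1}),
]
final = integrate(reactions, {"A": 1.0, "B": 1.0}, dt=0.01, n_steps=2000)
yield_fraction = final["P"] / 1.0  # product over initial substrate
```

Even in this toy case, the side reaction that drains substrate A caps the yield of P well below 100%, which is the effect the Level 2 and higher networks are meant to capture.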
Parameters’ optimization
The model’s 23 parameters are optimized on the training set of 20 reaction networks underlying the known MCRs (from Fig. 1a; in total, spanning 993 mechanistic steps). This optimization (i) aims to maximize the correlation coefficient between the calculated and experimental yields; and (ii) is performed using the Bayesian optimization algorithm with a Gaussian process as the surrogate model, as implemented in the OpenBox library46. Specifically, we used expected improvement as the acquisition function and a radial basis function (RBF) kernel. The search space comprised continuous variables along with their ranges and starting values (see argsParser.py for details of the space definition) as well as constraints defining relations between variables (e.g., to enforce that a “Fast” reaction is faster than a “Slow” one, and a “Slow” one is faster than a “Very Slow” one; see the buildOpenBoxSpaceFromDict function in the optscan.py file). For each model considered, five independent runs were performed for at least 100 optimization steps each. Each optimization aimed to maximize the coefficient of determination of linear regression (without intercept, i.e., passing through (0,0)) between experimentally reported yields and yields predicted by the model (see the optFunctionOpenBox function in the optscan.py file for implementation details). The model offering the highest correlation was then taken. With the parameters thus optimized, the model was used to estimate the yields of 10 MCRs from the test set (Fig. 1b).
Model’s performance and limitations
As shown in Fig. 5 and further quantified in Fig. 6, optimization on Level 3 networks yields Pearson correlation coefficient, \({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.800, coefficient of determination, \({R}_{{{{\rm{train}}}}}^{2}\,\)= 0.432 and mean absolute error, \({{{\rm{MAE}}}}=10.5\%\). On the test set, the model achieves \({\rho }_{{{{\rm{test}}}}}^{2}=\) 0.861, \({R}_{{{{\rm{test}}}}}^{2}\) = 0.852, and \({{{\rm{MAE}}}}=7.3\%\). This means that the model can extrapolate well to unseen mechanisms and reaction types. Indeed, Fig. 7 shows that only 32 reaction classes are common between 115 classes in the training-set MCRs and 71 classes in the test set (for the list of reaction classes, see Supplementary Section S2). Of note, the performance of the model is likely as good as it can get given the small size of the training dataset and model’s assumptions. This is corroborated by the analyses detailed in Supplementary Section S1 in which we compared the best model described here with the ensemble of 331 models that showed similar performance. Indeed, the differences between (i) the predicted yields averaged over the 331-model ensemble and (ii) predicted yields from the best model are small (Supplementary Fig. S1), and so are standard deviations of yield predicted from the ensemble of models. Also, Supplementary Fig. S2 evidences that these standard deviations do not correlate with experimental yields, suggesting that all models have low diversity (i.e., give similar errors/results on new data).
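To make the three metrics concrete, the sketch below computes a squared Pearson correlation, a coefficient of determination, and the MAE for a small set of made-up yields. The definition of \(R^{2}\) used here (one minus the ratio of residual to total sum of squares, with predictions compared directly against experiment) is an assumption; it is chosen because, unlike a squared correlation, it can become negative, as some of the values reported in this section do.

```python
import math

def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between experiment and prediction."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return (cov / math.sqrt(var_t * var_p)) ** 2

def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up yields (%): predictions follow the trend but are systematically high.
experimental = [30.0, 50.0, 70.0, 90.0]
predicted = [80.0, 85.0, 90.0, 95.0]

rho2 = pearson_r2(experimental, predicted)
r2 = r2_score(experimental, predicted)
mae = mean_absolute_error(experimental, predicted)
```

In this toy case the predictions are perfectly correlated with experiment yet systematically too high, so the correlation-based metric looks ideal while \(R^{2}\) is negative and the MAE is large. This is why the three metrics are best read together.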
The training set consists of mechanistic reaction networks for 20 MCRs reported in the literature; the test set comprises mechanistic networks for 10 new, computer-discovered (and experimentally validated) MCRs and one-pot reactions described in ref. 11. Summary of reactions is shown in Fig. 1a, b. The trend line (orange) is a fit to all data points, trend lines fitted to training and test sets are largely overlapping and are not shown for clarity. The red line shows the ideal relationship between experimental and predicted yields (y = x).
83 reaction types were unique to the training set, 39 reaction types were present only in the test set, and 32 reaction types were common to both sets. Examples of reaction types from each set are listed next to the corresponding sectors of the pie chart. For the full list of reaction types, see Supplementary Section S2.
Next, we investigated which parameters of the model are crucial to its performance. As already mentioned, \({{\mathrm{ln}}}{{k}}_{{{{\rm{i}}}}}^{{{{\rm{Mayr}}}}}\) term by itself performs poorly – on the training set, it achieves \({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.678, \({R}_{{{{\rm{train}}}}}^{2}\,\)= −1.10 and \({{{\rm{MAE}}}}=22.1\%\), and on the test set, \({\rho }_{{{{\rm{test}}}}}^{2}=\) 0.404, \({R}_{{{{\rm{test}}}}}^{2}\) = −0.059 and \({{{\rm{MAE}}}}=20.7\%\). This is quantified by the second-to-the-left set of bars in the histogram in Fig. 6. The remaining pairs of bars in this figure are for the full model with individual correction terms removed. As seen, the most important correction is \({r}_{{{{\rm{i}}}}}^{{{{\rm{water}}}}}\), penalizing steps which require stoichiometric water but are part of pathways not using aqueous conditions. Without the correction, the model achieves only \({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.601, \({R}_{{{{\rm{train}}}}}^{2}\,\)= −0.439 and \({{{\rm{MAE}}}}=17.9\%\) and, on the test set, \({\rho }_{{{{\rm{test}}}}}^{2}=\) 0.462, \({R}_{{{{\rm{test}}}}}^{2}\) = 0.166 and \({{{\rm{MAE}}}}=17.3\%\). In turn, contribution from \({r}_{{{{\rm{i}}}}}^{{{{\rm{class}}}}}\) is important for better model generalization. Without this correction, performance on the training set is comparable to that of the full model but correlation on the test set is significantly worse. On the flipside of the coin, some of the corrections may be spurious, implying that models with lower numbers of parameters work equally well. For instance, models without \({r}_{{{{\rm{i}}}}}^{{{{\rm{Polanyi}}}}}\) or \({r}_{{{{\rm{i}}}}}^{{{{\rm{cond}}}}}\) perform comparably to the full model.
We also analyzed how detailed the mechanistic knowledge has to be to ensure accurate yield predictions. On one hand, training the full model (i.e., with all corrections) only on the main reaction pathways with immediate side-reactions (Level 2, L2) is insufficient, as the accuracy metrics (\({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.345, \({R}_{{{{\rm{train}}}}}^{2}\,\)= −0.127 and \({{{\rm{MAE}}}}=16.1\%;\) \({\rho }_{{{{\rm{test}}}}}^{2}=\)0.042, \({R}_{{{{\rm{test}}}}}^{2}\) = −0.590 and \({{{\rm{MAE}}}}=27.1\%\)) are much worse than for the Level 3 (L3) analysis described above. On the other hand, expansion to Level 4 (L4), which allows for downstream reactions of the L3 species, also worsens accuracy (\({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.709, \({R}_{{{{\rm{train}}}}}^{2}\,\)= 0.053 and \({{{\rm{MAE}}}}=14.1\%;\) \({\rho }_{{{{\rm{test}}}}}^{2}=\)0.436, \({R}_{{{{\rm{test}}}}}^{2}\) = −0.061 and \({{{\rm{MAE}}}}=20.5\%\)). We believe this effect can be reasonably explained in terms of the model’s inherent error (due to the simplified treatment of kinetics) propagating through the rapidly growing networks. In fact, the L4 networks are ca. 80% larger than the L3 ones (Fig. 8).
Data were derived from bipartite mechanistic graphs in which both reactions and compounds are represented as nodes whereas edges are connections between compounds and reactions (see Fig. 2d). Networks from the test set are always bigger than those from the training set and the difference grows with calculation level (up to 15 % for L2, 3-23 % for L3, and 36-55 % for L4).
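The bipartite representation described in this caption can be sketched in a few lines. The example below is only an illustration: it assumes directed edges (substrate to reaction, reaction to product) and measures network size as the total number of nodes, neither of which is specified here. The class `MechanisticGraph`, its methods, and the two Ugi-like steps are hypothetical placeholders, not the data structures used in the study.

```python
class MechanisticGraph:
    """Toy bipartite graph: compounds and reactions are both nodes."""

    def __init__(self):
        self.compounds = set()
        self.reactions = set()
        self.edges = set()  # directed (source, target) pairs

    def add_step(self, name, substrates, products):
        """Register one mechanistic step and connect it to its compounds."""
        self.reactions.add(name)
        for s in substrates:
            self.compounds.add(s)
            self.edges.add((s, name))   # substrate -> reaction
        for p in products:
            self.compounds.add(p)
            self.edges.add((name, p))   # reaction -> product

    def size(self):
        """Network size taken here as the total node count (an assumption)."""
        return len(self.compounds) + len(self.reactions)

    def is_bipartite(self):
        """Every edge must join exactly one compound node and one reaction node."""
        return all((u in self.compounds) != (v in self.compounds)
                   for u, v in self.edges)


def relative_size_difference(larger, smaller):
    """Percentage by which one network exceeds another in node count."""
    return 100.0 * (larger.size() - smaller.size()) / smaller.size()


# Hypothetical two-step fragment of a multicomponent pathway.
g = MechanisticGraph()
g.add_step("imine formation", ["aldehyde", "amine"], ["imine", "water"])
g.add_step("isocyanide addition", ["imine", "isocyanide"], ["nitrilium"])
print(g.size(), len(g.edges), g.is_bipartite())
```

Because compounds never connect directly to compounds, shared intermediates such as the imine above naturally link consecutive steps, and percentage differences in network size of the kind quoted in the caption can be computed with a helper like `relative_size_difference`.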
While this trend could be expected, it also points to the main drawback of the mechanistic approach, namely that it can be quite sensitive to missing parts of the mechanistic picture. For example, had the algorithm used to construct the mechanistic networks not known the substitution of bromide by thiolate, it would not have been able to predict the formation of the 2-[(2-methoxy-2-oxoethyl)sulfanyl]hept-2-enoate by-product in the beta-elimination step of the last reaction in Fig. 1b (marked with a yellow star), and would have grossly overestimated the yield of that reaction (81% instead of 47%). In this respect, we emphasize that even though our current 8000-set covers a broad range of mechanistic steps (acid-base catalyzed, substitutions, eliminations, additions, rearrangements, pericyclic reactions, basic transformations catalyzed by transition metals), it is not without notable omissions. For instance, radical mechanistic steps are still to be included since their proper generalization (from specific precedents47 into transforms applicable to different scaffolds) is challenging. This effort may require a separate study, akin to our recent work on carbocationic rearrangements10. Also, the currently available data are insufficient to predict how reaction rates depend on specific choices of catalysts. In time, this information may become available either from high-end quantum-mechanical calculations or from the promising combination of experiments with AI48,49,50,51,52,53,54.
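The sensitivity to a missing side reaction can be illustrated with a deliberately simplified picture: an intermediate consumed by parallel first-order steps, where the fraction ending up as product is the main rate constant divided by the sum of all competing rate constants. This is a textbook branching-ratio sketch, not the kinetic treatment used in this work, and the function name and rate constants below are hypothetical.

```python
def branching_yield(k_main, k_side=()):
    """Percent of an intermediate converted to product under parallel first-order steps."""
    total = k_main + sum(k_side)
    return 100.0 * k_main / total


# Hypothetical rate constants (arbitrary units), chosen only for illustration.
with_side_reaction = branching_yield(1.0, [1.0])   # network knows the side step
without_side_reaction = branching_yield(1.0, [])   # side step missing from the network

print(with_side_reaction, without_side_reaction)
```

When the competing step is absent from the network, all flux is attributed to the main pathway and the estimated yield is inflated, which is qualitatively the kind of overestimation described above.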
Discussion
In summary, the approach we described is a union of computer-assisted analysis of mechanistic reaction networks with rate approximations and corrections grounded in physical-organic chemistry. The model as-is works well in predicting yields of MCRs based on reactions between various nucleophiles and electrophiles, and transfers well between training and test sets based on markedly different and diverse reaction mechanisms. Recognizing that more work is needed to incorporate other classes of reactivities, we feel that approaches like this one should be pursued as an alternative to chemical AI methods based on full-reaction data which, despite being trained on thousands to millions of literature examples, do not offer satisfactory accuracy of yield prediction.
Methods
The optimization procedure to identify the best model was repeated 10 times from different random starting parameters. This was done using Bayesian optimization with two surrogate models: (i) Gaussian Process (GP), and (ii) probabilistic random forest (PRF). During these optimization campaigns, 331 sets of parameters with \({R}^{2}\) > 0.40 were found. Average predicted yields from this ensemble were then compared against those from the best model, and the differences were generally small (2.7% on average over all reactions, see Supplementary Fig. S1). To quantify the model’s uncertainty, we considered the standard deviation of yield predictions across the individual models within the ensemble of all 331 models. For the test and training sets, the standard deviations were small, 2% and 5%, respectively. Moreover, standard deviation and prediction error were not correlated (Supplementary Fig. S2). We also verified that all models have comparable accuracy, and that the results from the ensemble are not significantly better than those of a single model. This suggests that diversity between models is low, and that all optimizations result in models with similar accuracies/errors on the test data. Colloquially put, the performance metrics described in the main text are likely as good as they can get, given the dataset and the model’s key assumptions.
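The ensemble analysis described above can be sketched as follows. This is a minimal illustration assuming that each model in the ensemble returns one predicted yield per reaction and that the sample standard deviation is used (the exact estimator is not stated here). The function names and the three-model, two-reaction data are hypothetical.

```python
from statistics import mean, stdev


def ensemble_summary(predictions):
    """Per-reaction mean and standard deviation across models.

    predictions: list of per-model lists, one predicted yield (%) per reaction.
    """
    per_reaction = list(zip(*predictions))       # regroup values by reaction
    means = [mean(col) for col in per_reaction]
    spreads = [stdev(col) for col in per_reaction]
    return means, spreads


def mean_abs_difference(ensemble_means, best_model):
    """Average absolute gap between ensemble-mean and best-model predictions."""
    return mean(abs(m - b) for m, b in zip(ensemble_means, best_model))


# Hypothetical predictions (%) from three models for two reactions.
preds = [
    [40.0, 70.0],   # treated here as the "best" model
    [44.0, 70.0],
    [42.0, 70.0],
]

means, spreads = ensemble_summary(preds)
gap = mean_abs_difference(means, preds[0])
print(means, spreads, gap)
```

A small spread across models, together with a small gap between the ensemble mean and the best model, is the pattern that would indicate low diversity among the optimized parameter sets.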
Data availability
The reaction networks data generated in this study (along with condition classification, rate categorization, and optimized rate parameters) have been deposited in the publicly available repository https://zenodo.org/records/13381381. All data are also available from the corresponding authors upon request. User manuals are available in Supplementary Section S3.
Code availability
Codes for the optimization of kinetic parameters and for the calculation of yields are deposited at https://zenodo.org/records/13381381. These codes are accompanied by all 30 digitized mechanistic networks we describe in the text (of course, other networks can be used as input). Codes for network expansion and analysis, from ref. 11, are deposited at https://zenodo.org/records/13381201. Interactive Allchemy MECH web-app is freely available at https://mech.allchemy.net (given server capacity, to twenty concurrent academic users on a rolling basis and two-week slots).
References
Emami, F. S. et al. A priori estimation of organic reaction yields. Angew. Chem. Int. Ed. 54, 10797–10801 (2015).
Skoraczyński, G. et al. Predicting the outcomes of organic reactions via machine learning: are current descriptors sufficient? Sci. Rep. 7, 3582 (2017).
Beker, W. et al. Machine learning may sometimes simply capture literature popularity trends: A case study of heterocyclic Suzuki-Miyaura coupling. J. Am. Chem. Soc. 144, 4819–4827 (2022).
Ahneman, D. T., Estrada, J. G., Lin, S., Dreher, S. D. & Doyle, A. G. Predicting reaction performance in C–N cross-coupling using machine learning. Science 360, 186–190 (2018).
Schwaller, P., Vaucher, A. C., Laino, T. & Reymond, J.-L. Prediction of chemical reaction yields using deep learning. Mach. Learn. Sci. Technol. 2, 015016 (2021).
Saebi, M. et al. On the use of real-world datasets for reaction yield prediction. Chem. Sci. 14, 4997–5005 (2023).
Liu, Z., Moroz, Y. S. & Isayev, O. The challenge of balancing model sensitivity and robustness in predicting yields: a benchmarking study of amide coupling reactions. Chem. Sci. 14, 10835–10846 (2023).
Fitzner, M., Wuitschik, G., Koller, R., Adam, J.-M. & Schindler, T. Machine learning C–N couplings: Obstacles for a general-purpose reaction yield prediction. ACS Omega 8, 3017–3025 (2023).
Strieth-Kalthoff, F. et al. Machine learning for chemical reactivity: The importance of failed experiments. Angew. Chem. Int. Ed. 61, e202204647 (2022).
Klucznik, T. et al. Computational prediction of complex cationic rearrangement outcomes. Nature 625, 508–515 (2024).
Roszak, R. et al. Systematic, computational discovery of multicomponent and one-pot reactions. Nat. Commun. (2024) In press.
Kolb, J., Beck, B. & Dömling, A. Simultaneous assembly of the β-lactam and thiazole moiety by a new multicomponent reaction. Tetrahedron Lett. 43, 6897–6901 (2002).
Polikarchuk, V. et al. Novel variants of the multicomponent reaction for the synthesis of 1,2,4-triazolo[1,5-a]pyrimidines and pyrido[3,4-e][1,2,4]triazolo[1,5-a]pyrimidines. Chem. Heterocycl. Compd. 56, 1054–1061 (2020).
Boddaert, T., Coquerel, Y. & Rodriguez, J. N-heterocyclic carbene-catalyzed Michael additions of 1,3-dicarbonyl compounds. Chemistry 17, 2266–2271 (2011).
Marcaccini, S. et al. One-pot synthesis of quinolin-2-(1H)-ones via tandem Ugi–Knoevenagel condensations. Tetrahedron Lett. 45, 3999–4001 (2004).
Marcos, C. F., Marcaccini, S., Pepino, R., Polo, C. & Torroba, T. Studies on isocyanides and related compounds; A facile synthesis of functionalized 3(2H)-pyridazinones via Ugi four-component condensation. Synthesis 2003, 0691–0694 (2003).
Nguyen, H. H., Palazzo, T. A. & Kurth, M. J. Facile one-pot assembly of imidazotriazolobenzodiazepines via indium(III)-catalyzed multicomponent reactions. Org. Lett. 15, 4492–4495 (2013).
Soeta, T., Matsuzaki, S. & Ukaji, Y. A one-pot O-phosphinative Passerini/Pudovik reaction: Efficient synthesis of highly functionalized α-(phosphinyloxy)amide derivatives. Chem. Eur. J. 20, 5007–5012 (2014).
Pirrung, M. C. & Sarma, K. D. Multicomponent reactions are accelerated in water. J. Am. Chem. Soc. 126, 444–445 (2004).
Yoshida, H. et al. Three-component coupling of arynes and organic bromides. Angew. Chem. Int. Ed. 50, 9676–9679 (2011).
Wang, R., Tian, P. & Lin, G. Stereoselective total synthesis of tubulysin V. Chin. J. Chem. 31, 40–48 (2013).
Opatz, T. & Ferenc, D. An unexpected three-component condensation leading to amino-(3-oxo-2,3-dihydro-1H-isoindol-1-ylidene)-acetonitriles. J. Org. Chem. 69, 8496–8499 (2004).
Okuma, K., Hino, H., Sou, A., Nagahora, N. & Shioji, K. Cascade approach to trichloroalkyl phenyl ethers from benzyne, epoxides, and chloroform. Chem. Lett. 38, 1030–1031 (2009).
Barrow, J. C. et al. Discovery and X-ray crystallographic analysis of a spiropiperidine iminohydantoin inhibitor of beta-secretase. J. Med. Chem. 51, 6259–6262 (2008).
Onitsuka, K., Suzuki, S. & Takahashi, S. A novel route to 2,3-disubstituted indoles via palladium-catalyzed three-component coupling of aryl iodide, o-alkenylphenyl isocyanide and amine. Tetrahedron Lett. 43, 6197–6199 (2002).
Tietze, L. F., Böhnke, N. & Dietz, S. Synthesis of the deoxyaminosugar (+)-D-forosamine via a novel domino-Knoevenagel-hetero-Diels-Alder reaction. Org. Lett. 11, 2948–2950 (2009).
Kulkarni, A. M., Pandit, K. S., Chavan, P. V., Desai, U. V. & Wadgaonkar, P. P. Cobalt ferrite nanoparticles: a magnetically separable and reusable catalyst for Petasis-Borono–Mannich reaction. RSC Adv. 5, 70586–70594 (2015).
Keating, T. A. & Armstrong, R. W. A Remarkable two-step synthesis of diverse 1,4-benzodiazepine-2,5-diones using the Ugi four-component condensation. J. Org. Chem. 61, 8935–8939 (1996).
Kim, J. W. & Chung, Y. K. Pauson-Khand reaction catalyzed by Co4(CO)12. Synthesis 1998, 142–144 (1998).
Betancort, J. M., Sakthivel, K., Thayumanavan, R., Tanaka, F. & Barbas, C. F. III Catalytic direct asymmetric Michael reactions: Addition of unmodified ketone and aldehyde donors to alkylidene malonates and nitro olefins. Synthesis 2004, 1509–1521 (2004).
Zeng, Y. et al. Silver-mediated trifluoromethylation-iodination of arynes. J. Am. Chem. Soc. 135, 2955–2958 (2013).
Szymkuć, S. et al. Computer-assisted synthetic planning: The end of the beginning. Angew. Chem. Int. Ed. 55, 5904–5937 (2016).
Klucznik, T. et al. Efficient syntheses of diverse, medicinally relevant targets planned by computer and executed in the laboratory. Chem 4, 522–532 (2018).
Mikulak-Klucznik, B. et al. Computational planning of the synthesis of complex natural products. Nature 588, 83–88 (2020).
Wołos, A. et al. Synthetic connectivity, emergence, and self-regeneration in the network of prebiotic chemistry. Science 369, aaw1955 (2020).
Wołos, A. et al. Computer-designed repurposing of chemical wastes into drugs. Nature 604, 668–676 (2022).
Molga, K., Gajewska, E. P., Szymkuć, S. & Grzybowski, B. A. The logic of translating chemical knowledge into machine-processable forms: a modern playground for physical-organic chemistry. React. Chem. Eng. 4, 1506–1521 (2019).
Gothard, C. M. et al. Rewiring chemistry: algorithmic discovery and experimental validation of one-pot reactions in the network of organic chemistry. Angew. Chem. Int. Ed. 51, 7922–7927 (2012).
Mayr, H. & Patz, M. Scales of nucleophilicity and electrophilicity: A system for ordering polar organic and organometallic reactions. Angew. Chem. Int. Ed. 33, 938–957 (1994).
Mayr’s Database of Reactivity Parameters. Available at: https://www.cup.lmu.de/oc/mayr/reaktionsdatenbank/ (accessed 6 December 2023).
Hagberg, A., Schult, D., Swart, P. & Hagberg, J. M. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (eds Varoquaux, G. et al.) 11–15 (2008).
Roszak, R., Beker, W., Molga, K. & Grzybowski, B. A. Rapid and accurate prediction of pKa values of C–H acids using graph convolutional neural networks. J. Am. Chem. Soc. 141, 17142–17149 (2019).
Hammett, L. P. The effect of structure upon the reactions of organic compounds. Benzene derivatives. J. Am. Chem. Soc. 59, 96–103 (1937).
Edwards, J. O. Correlation of relative rates and equilibria with a double basicity scale. J. Am. Chem. Soc. 76, 1540–1547 (1954).
Moussa, J. E. & Stewart, J. J. P. MOPAC software https://doi.org/10.5281/zenodo.6511958.
Jiang, H. et al. OpenBox: A Python toolkit for generalized black-box optimization. arXiv [cs.LG] http://arxiv.org/abs/2304.13339 (2023).
Tavakoli, M., Chiu, Y. T. T., Baldi, P., Carlton, A. M. & Van Vranken, D. RMechDB: A public database of elementary radical reaction steps. J. Chem. Inf. Model. 63, 1114–1123 (2023).
Reid, J. P. & Sigman, M. S. Holistic prediction of enantioselectivity in asymmetric catalysis. Nature 571, 343–348 (2019).
Dotson, J. J. et al. Data-driven multi-objective optimization tactics for catalytic asymmetric reactions using bisphosphine ligands. J. Am. Chem. Soc. 145, 110–121 (2023).
Gensch, T. et al. A Comprehensive discovery platform for organophosphorus ligands for catalysis. J. Am. Chem. Soc. 144, 1205–1217 (2022).
Zahrt, A. F. et al. Prediction of higher-selectivity catalysts by computer-driven workflow and machine learning. Science 363, eaau5631 (2019).
Rinehart, N. I. et al. A machine-learning tool to predict substrate-adaptive conditions for Pd-catalyzed C-N couplings. Science 381, 965–972 (2023).
Tsuji, N. et al. Predicting highly enantioselective catalysts using tunable fragment descriptors. Angew. Chem. Int. Ed. 62, e202218659 (2023).
Hueffel, J. A. et al. Accelerated dinuclear palladium catalyst identification through unsupervised machine learning. Science 374, 1134–1140 (2021).
Dai, J. et al. New oblongolides isolated from the endophytic fungus Phomopsis sp. from Melilotus dentata from the shores of the Baltic Sea. Eur. J. Org. Chem. 2005, 4009–4016 (2005).
Bunyapaiboonsri, T., Yoiprommarat, S., Srikitikulchai, P., Srichomthong, K. & Lumyong, S. Oblongolides from the endophytic fungus Phomopsis sp. BCC 9789. J. Nat. Prod. 73, 55–59 (2010).
Ireland, R. E., Armstrong, I., Lebreton, J. D. J., Meissner, R. S. & Rizzacasa, M. A. Convergent synthesis of polyether ionophore antibiotics: synthesis of the spiroketal and tricyclic glycal segments of monensin. J. Am. Chem. Soc. 115, 7152–7165 (1993).
Danishefsky, S. J., DeNinno, S. & Lartey, P. A concise and stereoselective route to the predominant stereochemical pattern of the tetrahydropyranoid antibiotics: an application to indanomycin. J. Am. Chem. Soc. 109, 2082–2089 (1987).
Parker, K. A. & Georges, A. T. Reductive aromatization of quinols: synthesis of the C-arylglycoside nucleus of the papulacandins and chaetiacandin. Org. Lett. 2, 497–499 (2000).
Gurjar, M. K., Krishna, L. M., Reddy, B. S. & Chorghade, M. S. A versatile approach to anti-asthmatic compound CMI-977 and its six-membered analogue. Synthesis 2000, 557–560 (2000).
Hur, J., Jang, J. & Sim, J. A Review of the pharmacological activities and recent synthetic advances of γ-butyrolactones. Int. J. Mol. Sci. 22, 2769 (2021).
Chavan, S. R. et al. Iminosugars spiro-linked with morpholine-fused 1,2,3-triazole: Synthesis, conformational analysis, glycosidase inhibitory activity, antifungal assay, and docking studies. ACS Omega 2, 7203–7218 (2017).
Tanaka, N. et al. Isolation and structures of attenols A and B. Novel bicyclic triols from the Chinese bivalve Pinna attenuata. Chem. Lett. 28, 1025–1026 (1999).
Chen, D. et al. Discovery, structural insight, and bioactivities of BY27 as a selective inhibitor of the second bromodomains of BET proteins. Eur. J. Med. Chem. 182, 111633 (2019).
Teiji, K. et al. Multi-cyclic cinnamide derivatives. Patent US 2007219181A1 (2007).
Banwell, M. G. et al. Small molecule glycosaminoglycan mimetics. Patent WO 2006135973A1 (2006).
Mattson, R. J. & Catt, J. D. Piperazinyl-cyclohexanes and cyclohexenes. Patent US 6153611A (2000).
Acknowledgements
Development of the MECH module within the Allchemy platform (by R.R., A.W., S.S.) was supported by internal funds of Allchemy, Inc. Analysis of pathways and writing of the paper by B.A.G. was supported by the Institute for Basic Science, Korea (Project Code IBS-R020-D1).
Author information
Contributions
S.S. and A.W. selected the literature data, assisted in developing the algorithm, formulated chemical heuristics, and analyzed the results. R.R. developed the algorithm. B.A.G. conceived and supervised research and wrote the paper with help from other authors.
Ethics declarations
Competing interests
The authors declare the following competing interests: S.S., A.W., R.R., and B.A.G. are consultants and/or stakeholders of Allchemy, Inc. Allchemy software and its MECH module are the property of Allchemy, Inc. USA. All queries about access options to Allchemy, including academic collaborations, should be sent to saraszymkuc@allchemy.net.
Peer review
Peer review information
Nature Communications thanks Xin Hong and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Szymkuć, S., Wołos, A., Roszak, R. et al. Estimation of multicomponent reactions’ yields from networks of mechanistic steps. Nat Commun 15, 10286 (2024). https://doi.org/10.1038/s41467-024-54550-1
DOI: https://doi.org/10.1038/s41467-024-54550-1