Introduction

The ability to predict or, at least, estimate the yields of organic reactions would be of tremendous value for synthetic chemistry, limiting the number of unproductive experiments, minimizing the use of solvents and reagents, and lowering the overall monetary and environmental cost of chemical production. Not surprisingly, there have been many attempts to develop algorithms for that purpose. In our own work, we evaluated both thermodynamic models based on optimized free-energy group contributions (assuming thermodynamic control)1 and machine-learning (ML) methods2,3; others have since focused on various ML approaches. Despite some early optimism4,5, subsequent studies revealed relatively low correlations between experimental and predicted yield values – not only in collections of diverse reaction types2,6 but also within larger sets of same-type reactions3,6,7,8. Pondering the reasons for this unsatisfactory performance, we observe that all efforts to date have trained on full, substrate-to-product reactions, as typically reported in the literature and/or electronic notebooks. Such full-reaction data do not capture reactions’ mechanistic intricacies – in particular, they carry no explicit knowledge of possible side reactions that can lead to undesired outcomes and lower the yields. Some of this knowledge could, in principle, be captured through adequately large9 numbers of examples of failed reactions, but these are typically not published, and the distributions of yields in literature datasets are heavily skewed toward higher values (with a mean approaching 80%2).

Given these limitations, we recently began to teach the computer mechanistic transformations which, when applied to desired substrates, propagate large networks of mechanistic steps. In ref. 10, we encoded some 400 mechanistic steps specific to carbocations and used the network approach to predict the mechanisms of complex carbocationic rearrangements. Therein, we parametrized the heights of kinetic barriers (based on quantum-mechanical calculations) and used this knowledge of kinetics to predict product distributions and yields. More recently, we deployed a much larger set of ~8000 general-scope mechanistic transforms (cf. below) and applied them to multiple small-molecule substrates. This effort was intended to trace mechanistic pathways defining new multicomponent reactions, MCRs, which are particularly appealing because they offer high atom-economy, minimize separation and purification operations, and can yield complex scaffolds that are often less prone to follow-up or side reactions than those made in non-MCR reactions. Indeed, in ref. 11 we described how such analyses enable the systematic discovery of plausible MCR candidates, several of which we validated by experiment. An essential part of this effort has been the ability to estimate the yields of these MCRs – would multicomponent substrate mixtures result in a mixture of low-yielding products, or would they lead to a major product in good yield? Unfortunately, given the number and diversity of mechanistic steps in the 8000-transform set, QM calculations of kinetic barriers have proven prohibitive – instead, we pursued, and describe here, a physical-organic approach in which the kinetics of mechanistic steps are approximated using nucleophilicity and electrophilicity indices and linear free-energy relationships. We train this model on the mechanistic networks of 20 known MCRs (chosen to span a wide range of yields, Fig. 1a, c) and then apply it to predict the yields of our newly discovered MCRs (Fig. 
1b, d) that not only use different substrates but are also based on unprecedented mechanisms. Despite such fundamental mechanistic differences, the model transfers between the training and testing MCRs, achieving similar – and, in light of previous efforts, quite satisfactory – performance metrics (e.g., mean absolute errors, MAE = 10.5% and 7.3%, respectively). These results suggest that a mechanistic-level approach to yield estimation may be a useful alternative to models derived from full-reaction data, although – as we also emphasize in our discussion – this conclusion awaits future extensions of the ~8000 rule set and validation on larger sets of mechanistic networks.

Fig. 1: Multicomponent reactions, MCRs, used to train and test the model.
figure 1

a Schemes of 20 MCRs used to train the yield-prediction model. Percent numbers are the highest/optimized yields for these reactions reported in the references whose numbers are given in square brackets. b Schemes of 10 MCR and one-pot reactions discovered by the Allchemy algorithm, validated by experiment, and reported by us in ref. 11. Percentage values are experimental yields. These reactions were used to test the yield-prediction model. They were committed to validation based on mechanistic novelty, conciseness of synthesis (compared to traditional routes for making similar targets), and, in several cases, potential applicability. For example, the MCR at the top of the left column in b (57% yield) leads to a scaffold that can be further reacted with phenyl hydrazine to give substituted pyrazoles, popular motifs of many drugs. The third-from-the-top entry in the same column (also 57%) produces scaffolds akin to oblongolide natural products studied as potential algicide, herbicide55, and antiviral56 agents. The fourth-from-the-top entry (82%) leads to branched diallylic ethers. After metathesis (not compatible with one-pot conditions), they can cyclize into enol ether scaffolds used in various medicinal syntheses57,58,59,60. The bottom MCR in the same column (31%) gives a substituted hexahydro-2(3H)-benzofuranone, a motif found in various natural products and bioactive compounds61. Turning to the right column, the second entry from the top (34%) produces a spiro system akin to that found in some drugs and bioactive agents62,63,64,65. The MCR just below it (97%) gives a 1-(1-cyclohexenyl)naphthalene, an atropisomeric scaffold familiar from various types of drugs66,67. The reaction marked by a yellow star is discussed in the main text; reactions marked by pink stars are discussed in Supplementary Section S1. Distributions of experimental yields in c, the literature-derived training set from a (blue bars), and d, our testing set from b (pink bars).

Results

Choice of MCRs for training and testing

We began by selecting a set of MCRs on which to parametrize the model. As we showed in ref. 11, there are only ~630 distinct MCR types (differing in the reaction core) and the majority of those are reported with good yields which, as we argued before2,3, severely limits the predictivity of data-derived models. Accordingly, we sought a training set of mechanistically diverse MCRs (Fig. 1a and, for mechanistic details, refs. 12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31) for which the highest, optimized yields reported in the literature are roughly uniformly distributed between 33 and 100% (Fig. 1c). The size of this set (20 MCRs) was determined by the availability of published examples reporting lower yields (eight examples with yields ≤ 55%). As the test set, we used the aforementioned 10 MCRs and one-pot sequences recently discovered11 by the Allchemy algorithm and subsequently validated by experiment (Fig. 1b). The yields of these test reactions also span a broad range of values (Fig. 1d).

Mechanistic transforms and networks

For each of these MCRs, we used mechanistic transforms (expert-coded in the SMARTS notation, akin to our previous works on retro- and forward-synthesis32,33,34,35,36,37) to propagate mechanistic networks from substrates to products. As detailed in refs. 10,11, these transforms are roughly at the level of the so-called arrow-pushing steps and encompass a broad range of chemistries (though not yet exhaustive, see later in the text and the Methods section of ref. 11). Each transform is broader than any specific literature precedent and delineates the scope of substituents admissible or prohibited at various positions. It is also accompanied by information about general reaction conditions (strongly basic, basic, mildly basic, neutral, mildly acidic, strongly acidic, and whether the reaction requires a Lewis acid), solvent class (protic-aprotic and/or polar-nonpolar), temperature range (<−20 °C, −20 to 20 °C, r.t., 40 to 150 °C, > 150 °C), water tolerance (yes, no, water is required), typical speeds (very slow, slow, fast, very fast, or unknown if conflicting literature data have been reported), and more.
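To make the transform annotations concrete, the metadata described above can be pictured as a simple record. The following snippet is a hypothetical, stdlib-only illustration – the field names, the example SMARTS pattern, and the compatibility check are ours, not the actual Allchemy encoding:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MechanisticTransform:
    """Illustrative stand-in for one expert-coded mechanistic transform."""
    name: str
    smarts: str                # arrow-pushing step as reaction SMARTS
    conditions: frozenset      # admissible condition classes
    solvent_classes: frozenset # admissible solvent classes
    temperature: str           # one of "<-20C", "-20..20C", "rt", "40..150C", ">150C"
    water_tolerant: str        # "yes", "no", or "required"
    speed: str                 # "very_slow", "slow", "fast", "very_fast", "unknown"

# hypothetical example: enolate addition to a carbonyl group
aldol_addition = MechanisticTransform(
    name="aldol addition (enolate to carbonyl)",
    smarts="[C-:1].[C:2]=[O:3]>>[C:1][C:2][O-:3]",
    conditions=frozenset({"basic", "strongly_basic"}),
    solvent_classes=frozenset({"polar_aprotic", "polar_protic"}),
    temperature="-20..20C",
    water_tolerant="yes",
    speed="fast",
)

def compatible(t1, t2):
    """Two consecutive steps must share at least one condition class
    and one solvent class (a simplified version of the matching rules)."""
    return bool(t1.conditions & t2.conditions) and \
           bool(t1.solvent_classes & t2.solvent_classes)
```

In the production system, each such record additionally carries substituent scope restrictions and other annotations discussed in ref. 11.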

Propagation of the mechanistic networks starts with a set of substrates (denoted as synthetic generation G0 in Fig. 2a) either specified by the user or systematically selected from an expert-curated collection of ca. 2400 simple molecules featuring combinations of functional groups promoting diverse modes of reactivity (for a detailed list, see ref. 11). To these substrates, the algorithm applies the matching mechanistic transforms – under all possible conditions – to create the first generation, G1, of products and by-products, which are then iteratively reacted10,11 to give generations G2 and higher (here, up to G8) to reach the reported product (Fig. 2a and https://mcrchampionship.allchemy.net for examples studied here; an interactive version for these and other mechanistic calculations is available at https://mech.allchemy.net).
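The generation-by-generation propagation loop can be sketched in a few lines of Python. Here, transforms are toy functions on strings rather than SMARTS operations on molecules, so the snippet illustrates only the iteration scheme, not the chemistry or the condition bookkeeping:

```python
from itertools import combinations_with_replacement

def expand_network(substrates, transforms, max_generation):
    """Iteratively apply transforms to all known species to build
    generations G1, G2, ... (a minimal sketch of the propagation loop;
    the real algorithm also tracks conditions and by-products)."""
    known = set(substrates)          # everything seen so far (G0 and later)
    generations = [set(substrates)]  # generations[g] holds species first made in Gg
    for _g in range(1, max_generation + 1):
        new = set()
        # try every transform on every unordered pair of known species
        for a, b in combinations_with_replacement(sorted(known), 2):
            for t in transforms:
                new.update(t(a, b))
        new -= known                 # keep only species not seen before
        if not new:
            break
        generations.append(new)
        known |= new
    return generations

# toy transform: "condense" two distinct fragments into a joined product
def toy_condensation(a, b):
    return {a + "." + b} if a != b else set()

gens = expand_network(["amine", "ketone"], [toy_condensation], max_generation=2)
```

Running this gives G0 = {amine, ketone}, G1 containing the condensate, and G2 containing its further condensations with the starting materials.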

Fig. 2: Forward and sideways expansion of mechanistic networks.
figure 2

a The algorithm described in ref. 11 first applies ~8000 mechanistic, SMARTS-encoded transforms to the reaction substrates (here, benzyl isocyanide, phenylphosphinic acid, and phenylpropionaldehyde) corresponding to the bottom row of molecule nodes in the zeroth synthetic generation, G0. Matching transforms are applied to these starting materials to generate intermediates in generation G1, then to G0 and G1 species to generate G2, and so on. The network shown is expanded to G5, with the reaction product colored blue and overlaid on the network. During network expansion, transforms under all possible conditions are applied. However, for a valid reaction sequence, the individual mechanistic steps must be mutually compatible and must meet several conditions. For instance, as detailed in ref. 11, such sequences cannot combine solvents of different classes (categorized as protic/aprotic and polar/non-polar), cannot combine steps requiring oxidative and reductive conditions, may apply water-sensitive steps only before water-requiring ones, cannot toggle between basic and acidic conditions, and more. Of note, some transforms may have more than one categorization and, if so, these are considered as logical alternatives when checking step-matching along a sequence. In the current example, the thicker blue line traces the only matching mechanistic pathway connecting the starting materials and the products. b The mechanism corresponding to this matching pathway is shown as Allchemy’s screenshot and agrees with the pathway proposed in ref. 18. The major MCR pathway thus identified constitutes the first level of analysis (Level 1). c Once found, this Level 1 solution is expanded sideways to account for the by-products of individual mechanistic steps as well as side reactions possible under similar reaction conditions (Level 2). The analysis can be expanded to higher levels to account for further reactions of by-products and products of side reactions (see scheme in Fig. 3). The network presented in the panel corresponds to Level 3 analysis. d The same graph as in (c) but redrawn as a bipartite graph, with molecule nodes represented as circles and reaction nodes as diamonds.

Within the networks thus constructed, the algorithm traces the substrates-to-product sequences of mechanistic steps that are mutually compatible (see caption to Fig. 2a and, for a detailed description, ref. 11). For each of the MCRs, the pathway fulfilling this mutual compatibility of steps and also closest-matching the class of literature-reported conditions is taken as Level 1 of our network analysis (Fig. 2b). Importantly, for the MCRs studied here, the sequences the algorithm concatenates from individual mechanistic steps agree with the mechanisms proposed by the authors in the original publications12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31. Next, the Level 1 routes are expanded sideways to account for by-products of the main route and products of any side reactions possible under the main-pathway conditions (Level 2 of analysis). Level 3 accounts for reactions in which species from Level 2 react among themselves or with species from Level 1. Such expansion can then be iterated to higher levels and is illustrated schematically in Fig. 3 and in Fig. 2c, d for a specific MCR example. In Fig. 2c, a simple molecular graph format is used (with one type of node corresponding to molecules), whereas in Fig. 2d the so-called bipartite graph is applied, in which there are two types of nodes: one corresponding to molecules and the other to the reactions in which these molecules engage. As we discussed elsewhere38, the bipartite representation is better suited to capture all causal relationships between substrates and products and is the one used in the kinetic analyses to which we now turn.
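A minimal sketch of the bipartite representation (molecule nodes plus reaction nodes, with directed edges encoding substrate-of and product-of relationships) might look as follows; the data layout is ours and purely illustrative:

```python
def build_bipartite(reactions):
    """Build a bipartite graph with molecule nodes and reaction nodes;
    edges run substrate -> reaction and reaction -> product.
    `reactions` is a list of (substrates, products) tuples. A sketch of
    the representation discussed in the text, not the actual code."""
    molecules, rxn_nodes, edges = set(), [], set()
    for i, (subs, prods) in enumerate(reactions):
        r = f"rxn_{i}"                 # one node per reaction
        rxn_nodes.append(r)
        for s in subs:
            molecules.add(s)
            edges.add((s, r))          # s is consumed by reaction r
        for p in prods:
            molecules.add(p)
            edges.add((r, p))          # r produces p
    return molecules, rxn_nodes, edges

# toy network: A + B -> C, then C + B -> D
mols, rxns, edges = build_bipartite([({"A", "B"}, {"C"}), ({"C", "B"}, {"D"})])
```

Because each reaction is its own node, the graph preserves which co-substrates a step needs – exactly the causal information that a plain molecule-to-molecule graph loses.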

Fig. 3: Scheme of a side-expansion of an MCR pathway.
figure 3

The main MCR pathway is colored in blue. This route must meet the requirements of Level 1 analysis – that is, the conditions of all individual steps must be mutually compatible (different steps may not intermingle oxidative and reductive conditions, solvents should belong to the same class, etc.; for details, see ref. 11). Level 2 corresponds to reactions branching out from the main path and leading to by-products (grey) or products of competing/side reactions possible under the same class of reaction conditions (red). Reactions between species from Level 2, or between species from Level 2 and Level 1, lead to products at Level 3 (light orange). This expansion can be iteratively continued. For instance, products at Level 4 (dark orange) form when species from Level 3 react among themselves or with molecules from Levels 1 or 2. Naturally, only reactions compatible with the conditions of the main MCR pathway are considered (e.g., if the multicomponent mixture is under basic conditions, there is no point in analyzing side reactions that require, say, acidic conditions).

Development of the kinetic model

With the ultimate objective of estimating MCR yields, it is first necessary to approximate the equilibrium constants or kinetic rate constants of the individual steps in a mechanistic graph expanded to some Level n. Since published kinetic data are extremely sparse, it is impossible to assign experimental values to the vast majority of steps – hence, we have aimed to develop a heuristic model based on the extension of Mayr’s nucleophilicity and electrophilicity indices39,40 and linear free-energy relationships.

To begin with, and as discussed in detail in ref. 11, individual mechanistic steps can have multiple classes of plausible conditions assigned to them (e.g., a Diels-Alder cycloaddition can be carried out either under neutral conditions at high temperature or at lower temperatures using a Lewis acid catalyst). With this in mind, if multiple steps along the main/Level 1 MCR route have overlapping condition ranges as logical alternatives, we choose the conditions that minimize the overall number of condition changes along the route. For example, if some Step 1 could be carried out under neutral or mildly acidic conditions and the subsequent Step 2 requires either mildly acidic or acidic conditions, the Step 1/Step 2 sequence is assigned the common, mildly acidic conditions. Along the entire route, such unification of conditions is performed using a greedy algorithm on the topologically sorted sequence of steps (topological sorting was performed with Kahn’s algorithm implemented in the NetworkX library41).
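The two ingredients of this unification – Kahn's topological sort and greedy merging of overlapping condition sets – can be illustrated with a stdlib-only sketch (the production code uses NetworkX and richer condition labels):

```python
from collections import deque

def kahn_toposort(steps, edges):
    """Kahn's algorithm (the text uses NetworkX's implementation;
    this stdlib version is equivalent for illustration)."""
    indeg = {s: 0 for s in steps}
    succ = {s: [] for s in steps}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(s for s in steps if indeg[s] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order

def unify_conditions(ordered_conditions):
    """Greedily group consecutive steps whose admissible condition sets
    overlap, assigning each group its common intersection; this
    minimizes the number of condition changes along the route."""
    assigned, i = [], 0
    while i < len(ordered_conditions):
        common = set(ordered_conditions[i])
        j = i + 1
        while j < len(ordered_conditions) and common & ordered_conditions[j]:
            common &= ordered_conditions[j]
            j += 1
        assigned.extend([common] * (j - i))  # whole run shares one condition set
        i = j
    return assigned

order = kahn_toposort(["s1", "s2", "s3"], [("s1", "s2"), ("s2", "s3")])
# the Step 1/Step 2 example from the text:
segments = unify_conditions([{"neutral", "mildly_acidic"},
                             {"mildly_acidic", "acidic"}])
```

For the worked example from the text, both steps are assigned the common, mildly acidic conditions.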

Next, we analyze the entire Level-n mechanistic graph and identify all acid-base equilibria. We stipulate that (i) under basic conditions, protonated forms cannot exist and, (ii) conversely, under acidic conditions, deprotonated forms are excluded from the graph. Naturally, these are simplifications; in reality, the fraction of protonated/deprotonated forms is not binary and depends on the specific pKa values, the number of acid/base equivalents used, and the solvent environment. For all other acid-base equilibria, and in an effort to minimize the number of free parameters, we make the very rough assumption that their equilibrium constants, K, are always the same. The value of K is a parameter for the global optimization over all 20 MCRs in the training set. Similarly, for all tautomerizations, we assume one global equilibrium constant, Ktau, to be optimized. Note that the treatment of tautomers as separate species within the reaction graph is necessitated by the SMILES/SMARTS notation, which considers tautomers as distinct structures. The limitations of this notation also require that resonance structures of enolate anions be treated as two separate entities (C- and O-anions, here taken in a 1:1 ratio). Reaction classes categorized as in-principle reversible but, for some particular substrates, resulting in aromatization (e.g., beta-elimination leading to a thiazole) are treated as irreversible.
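The binary acid-base pruning rule can be expressed as a short filter; the species tags below are illustrative stand-ins for the protonation-state bookkeeping in the actual graph:

```python
def prune_acid_base(species, conditions):
    """Under basic conditions, drop protonated forms; under acidic
    conditions, drop deprotonated forms (the binary simplification
    described in the text). `species` maps SMILES-like strings to an
    illustrative protonation-state tag."""
    if "basic" in conditions:
        return {s for s, form in species.items() if form != "protonated"}
    if "acidic" in conditions:
        return {s for s, form in species.items() if form != "deprotonated"}
    return set(species)

# acetone, its enolate, and its protonated form (tags are hypothetical)
pool = {
    "CC(=O)C": "neutral",
    "CC(=O)[CH2-]": "deprotonated",
    "CC(=[OH+])C": "protonated",
}
kept = prune_acid_base(pool, {"basic"})
```

Under basic conditions, the protonated ketone is removed while the neutral and deprotonated forms survive.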

Furthermore, the graph is simplified by preventing reactions of some very reactive species (e.g., acyl chlorides, organolithium compounds) present in earlier synthetic generations with species formed in later generations. For instance, if an acyl chloride is present in G0 and can engage in reactions with some other species from the same generation, it cannot react with species formed in G1 or higher. Colloquially put, such reactive species are not allowed to sit around and wait until multiple other steps happen and some downstream reaction partners present themselves – instead, they react as rapidly as they can with immediately available suitors. Of course, if in the experimental procedure some reagents are added only at later stages of the sequence, this delayed addition is reflected in the structure of the graph (by connecting the incoming chemical to the step in which it is used).

Regioselectivity of C-H deprotonation is assessed for motifs prone to non-equivalent deprotonation (asymmetric ketones, 1,3-di(thio)carbonyls, 1,3-ketoesters, or other active methylene compounds) using pKa values pre-calculated with the graph convolutional neural network pKa predictor we described in ref. 42. If one equivalent of base is used, deprotonation and subsequent reaction are allowed only at the most acidic position. With excess base and possible formation of a di-anion, the reaction is also allowed to proceed at the second most acidic locus.
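This site-selection rule reduces to ranking positions by predicted pKa. A sketch follows, with hypothetical pKa values standing in for the neural-network predictions of ref. 42:

```python
def allowed_deprotonation_sites(site_pkas, base_equivalents):
    """Pick deprotonation loci by predicted pKa: with one equivalent of
    base, only the most acidic site reacts; with excess base (possible
    di-anion), the second most acidic site is also allowed. A sketch of
    the rule described in the text."""
    ranked = sorted(site_pkas, key=site_pkas.get)  # lowest pKa = most acidic first
    n_sites = 1 if base_equivalents <= 1 else 2
    return ranked[:n_sites]

# hypothetical pKa values for a 1,3-ketoester: the position between the
# carbonyls ("C2") vs a terminal methyl ("C4")
sites = allowed_deprotonation_sites({"C2": 11.0, "C4": 25.0}, base_equivalents=1)
```

With one equivalent of base only the most acidic C2 position reacts; with excess base, C4 becomes accessible as well.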

With the graph thus processed, we proceed to assign kinetic rates to individual steps. We do this by extending Mayr’s popular reactivity parameters available from refs. 39,40. According to the so-called Mayr–Patz equation39, the logarithm of a second-order reaction rate constant at 20 °C can be related to the nucleophilicity parameter N and electrophilicity parameter E as

$$\log k=s(N+E)$$
(1)

where N varies between −8.8 and +32, E between −30 and +8, and s is a nucleophile-dependent slope parameter. Here, we do not use the values of the slope parameters, \(s\), from Mayr’s tables because they can depend quite strongly on specific solvents (and temperatures). Moreover, in some cases, the values tabulated for specific substrates lead to problematic predictions – e.g., predicting that \(s\) be higher for a reaction between aniline and an aldehyde than for the reaction between a primary amine and the same aldehyde. In light of these problems, for all reactions, we set the slope parameter to a common value, \(s\) = 1/2.303, such that Eq. 1 simplifies to the natural logarithm more commonly used in the linear free-energy relationships, LFER, we will use later:

$${{{\mathrm{ln}}}}\;k=N+E$$
(2)

Since Mayr’s collection encompasses only 1273 specific nucleophiles and 344 electrophiles, it is very unlikely that they would coincide with the species along our pathways. To remedy this, for any nucleophile-electrophile transformation present in a given mechanistic network, we search Mayr’s compendium for compounds that share the same reacting groups and are the most similar (by Tanimoto similarity based on ECFP4 fingerprints) to the substrates of our particular transformation (Fig. 4); if there are no examples with matching reactive groups in the Mayr set, we assign to such reactions a default value of the rate constant, \({{{\mathrm{ln}}}}\;k=1\). We strive to select parameters for solvents that are closest-matching those predicted by our algorithm. Specifically, if for a given nucleophile and electrophile, multiple Mayr’s data are available in different solvents, we retain only those entries that match the predicted class of solvents (polar/nonpolar, protic/aprotic). Then, we take the solvent that is common to the electrophile and nucleophile (rarely, if multiple such common solvents are available, we choose the one most popular in Mayr’s tables). If there are no common solvents for a given nucleophile/electrophile pair, the algorithm selects Mayr’s data corresponding to solvents that are most similar (with similarity defined as the number of common qualifiers, e.g., MeOH and EtOH have two common qualifiers, ‘protic_solvents’ and ‘alcohols’, whereas MeOH and water have only one, ‘protic_solvents’). If Mayr’s data are available only for solvents not matching the predicted solvent class, we take \({{{\mathrm{ln}}}}\;k=1\). Finally, we limit the sum \(N+E\) to some maximal absolute value, later taken as a free parameter of the model; this is to avoid extremely high or low reaction rates (which may be unphysical, especially given that they extrapolate to our system(s) from some specific solvent and only one temperature tabulated by Mayr).
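The nearest-neighbor lookup can be illustrated with fingerprints represented as sets of on-bits (the actual implementation computes ECFP4 fingerprints with a cheminformatics toolkit). The E value for 1-benzylpiperidin-4-one is the one quoted in Fig. 4; the second entry and all fingerprints are invented for illustration:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints represented as sets of
    on-bits (ECFP4 fingerprints in the actual implementation)."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def closest_mayr_entry(query_fp, mayr_entries, default_ln_k=1.0):
    """Pick the most similar compound from (a toy stand-in for) Mayr's
    compendium; fall back to the default ln k = 1 when nothing matches."""
    if not mayr_entries:
        return None, default_ln_k
    best = max(mayr_entries, key=lambda e: tanimoto(query_fp, e["fp"]))
    return best["name"], best["E"]

entries = [
    {"name": "1-benzylpiperidin-4-one", "fp": {1, 2, 3, 5}, "E": -18.4},  # E from Fig. 4
    {"name": "other_electrophile",      "fp": {7, 8, 9},    "E": -7.5},   # invented entry
]
name, E = closest_mayr_entry({1, 2, 3, 4}, entries)
```

For this toy query, the first entry wins with a Tanimoto similarity of 3/5 = 0.6, and its tabulated E parameter is adopted for the step.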

Fig. 4: A similarity-based assignment of N,E parameters to molecules not present in Mayr’s collection.
figure 4

The example is part of one of the MCRs from our training set24 (not all mechanistic steps are shown). As the E parameter for the second substrate for imine formation is not directly available, the algorithm selects the most similar entry in Mayr’s compendium (here, 1-benzylpiperidin-4-one with E = −18.4).

Obviously, such Mayr-like rates provide only a very rough approximation – indeed, when the parameters of the model defined thus far were optimized against the 20 literature MCRs, the correlations between calculated and experimental yields were very poor (vide infra). This called for the inclusion of additional terms to better approximate the rate constants, in effect following the LFER philosophy known and developed for decades43,44. In this spirit, we treat the rate constant of step i as a linear combination of several heuristic corrections,

$${{{\mathrm{ln}}}}\;{k}_{{{{\rm{i}}}}}={{{\mathrm{ln}}}}\;{k}_{{{{\rm{i}}}}}^{{{{\rm{Mayr}}}}}+\sum {{{\rm{corrections}}}}({r}_{{{{\rm{i}}}}})$$
(3)

These corrections are of eight types. First, as mentioned above and detailed in ref. 11, all mechanistic transforms come with a general classification of their rates (e.g., very slow, slow, fast), to which we assign numerical values (to be optimized) defining the correction \({r}_{{{{\rm{i}}}}}^{{{{\rm{rate\; class}}}}}\). Second, some transforms are known to proceed as side-reactions but are never dominant (and do not proceed in high yields). For instance, propargyl organolithium compounds can eliminate to allenes, but allenes are virtually never purposefully prepared in this manner. Such reactions, as well as quenching reactions, have their own, lower values of the \({r}_{{{{\rm{i}}}}}^{{{{\rm{class}}}}}\) correction. This type of down-correction also applies to reactions which, in order to proceed in good yields, require specific reagents not present in the reaction mixture (e.g., during thiol alkylation, a potential side reaction is disulfide formation – however, it can occur in good yield only if oxygen is present). Third, we penalize (by the parameter \({r}_{{{{\rm{i}}}}}^{{{{\rm{water}}}}}\), assigned a value lower than the default for other reactions) those side steps that ideally require aqueous conditions (e.g., hydrolysis) but, in reality, can use only small amounts of water supplied to them as a by-product of some other reaction in the graph (say, imine formation). Fourth, there is also a penalizing conditions correction, \({r}_{{{{\rm{i}}}}}^{{{{\rm{cond}}}}}\), for side-reactions which can no longer take place after a putative change of conditions along the main MCR pathway. For instance, if an MCR is started under basic conditions but, at some later point, the reaction mixture is neutralized, then deprotonation side-reactions are not possible after this change of conditions.
Fifth, there is a global correction \({r}_{{{{\rm{i}}}}}^{{{{\rm{rev}}}}}\) assigned to reversible reactions such as imine formation or Michael addition of a thiol (i.e., equilibria other than the acid-base and tautomerization equilibria discussed earlier). The rationale for this term is that it is often possible to adjust the reaction conditions so as to shift the equilibrium in the desired direction. Sixth, the correction \({r}_{{{{\rm{i}}}}}^{{{{\rm{ring}}}}}\) promotes intramolecular reactions forming 3-, 5-, 6-, or 7-membered rings. Seventh, the rate of bimolecular reactions is scaled based on the concentration of the non-limiting substrate, via \({r}_{{{{\rm{i}}}}}^{{{{\rm{bimolecular}}}}}\). In practice, this means that if one molecule can react in two bimolecular reactions characterized by the same rate class, the reaction with the more abundant second substrate is favored. Eighth and last, we implement a correction inspired by the so-called Evans–Polanyi rule, which stipulates that in a series of homologous reactions, the activation energy is proportional to the reaction enthalpy \(\Delta H\) (thus promoting exothermic reactions, \(\Delta H\) < 0). In our case, we take \({r}_{{{{\rm{i}}}}}^{{{{\rm{Polanyi}}}}}\) as proportional to \(\exp (-\Delta H)\), where the enthalpy is approximated by the reaction energy \(\Delta E\) (calculated at the PM6 level in MOPAC45), and extreme values of \(\Delta E\) (with the threshold being a free parameter of the model) are rejected to avoid artifacts from, for instance, reactions with strong solvation effects not captured by gas-phase energy calculations.
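Eq. 3 amounts to summing a handful of additive terms in ln k space. The sketch below wires up three of the eight correction types with hypothetical parameter values; in the actual model, all of these are free parameters fitted on the 20 training MCRs:

```python
# Hypothetical correction values; in the actual model these are free
# parameters optimized on the training set.
CORRECTIONS = {
    "rate_class": {"very_slow": -4.0, "slow": -2.0, "fast": 0.0, "very_fast": 2.0},
    "water_penalty": -3.0,  # step needs stoichiometric water absent from the pot
    "ring_bonus": 1.5,      # intramolecular closure of a 3-, 5-, 6-, or 7-ring
}

def corrected_ln_k(ln_k_mayr, rate_class, needs_water=False, ring_size=None,
                   delta_e=None, delta_e_cutoff=50.0, polanyi_scale=0.1):
    """ln k_i = ln k_i^Mayr + sum of corrections (Eq. 3); only some of
    the eight correction types are sketched here, with made-up values."""
    ln_k = ln_k_mayr + CORRECTIONS["rate_class"][rate_class]
    if needs_water:
        ln_k += CORRECTIONS["water_penalty"]
    if ring_size in (3, 5, 6, 7):
        ln_k += CORRECTIONS["ring_bonus"]
    # Evans-Polanyi-like term favoring exothermic steps; extreme
    # reaction energies are rejected to avoid artifacts
    if delta_e is not None and abs(delta_e) <= delta_e_cutoff:
        ln_k += -polanyi_scale * delta_e
    return ln_k
```

For example, a "slow" step that additionally needs stoichiometric water starts from the Mayr estimate and is down-corrected by both terms.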

Having defined the rate constants, we quantify the changes in the extent \({\xi }_{{{{\rm{i}}}}}\) of reaction i (Eq. 4) and the concentrations \({C}_{{{{\rm{x}}}}}\) of species \({{{\rm{x}}}}\) (Eq. 5). In both equations, \({\nu }_{{{{\rm{x}}}},{{{\rm{i}}}}}\) stands for the stoichiometric coefficient of compound x in reaction i; in Eq. 4, the subscripts s and substr refer to the limiting substrate of reaction i.

$$d{\xi }_{{{{\rm{i}}}}}=({k}_{{{{\rm{i}}}}}/{\nu }_{{{{\rm{s}}}},{{{\rm{i}}}}}){{dC}}_{{{{\rm{substr}}}},{{{\rm{i}}}}}$$
(4)
$${{dC}}_{{{{\rm{x}}}}}={\sum}_{{{{\rm{incoming}}}},{{{\rm{i}}}}}{\nu }_{{{{\rm{x}}}},{{{\rm{i}}}}}d{\xi }_{{{{\rm{i}}}}}-{\sum}_{{{{\rm{outgoing}}}},{{{\rm{j}}}}}{\nu }_{{{{\rm{x}}}},{{{\rm{j}}}}}d{\xi }_{{{{\rm{j}}}}}$$
(5)

For a given set of the model’s 23 free parameters, these equations are numerically integrated using a finite-difference method. Integration is initiated with the concentrations of the starting materials equal to their stoichiometric coefficients in a given MCR and the concentrations of all other species set to zero. The yield of a particular MCR is then calculated as the product-to-initial-substrate concentration ratio at the end of integration. The length of integration is defined as a global parameter N (to be optimized) multiplied by the number of steps in a given reaction pathway.
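The finite-difference integration of Eqs. 4 and 5 can be illustrated on a toy two-reaction network – a fast main reaction competing with a slow side reaction. This is a sketch of the scheme under simple mass-action kinetics, not the production integrator:

```python
def integrate_network(reactions, c0, n_steps, dt=1e-3):
    """Explicit finite-difference integration of Eqs. (4)-(5) for a toy
    network. `reactions` is a list of (rate constant k, {species: nu})
    with negative nu for substrates; mass-action kinetics assumed."""
    c = dict(c0)
    for _ in range(n_steps):
        dc = {x: 0.0 for x in c}
        for k, stoich in reactions:
            # mass-action rate from current substrate concentrations
            rate = k
            for x, nu in stoich.items():
                if nu < 0:
                    rate *= max(c[x], 0.0) ** (-nu)
            dxi = rate * dt                  # extent increment, cf. Eq. (4)
            for x, nu in stoich.items():
                dc[x] += nu * dxi            # Eq. (5): incoming minus outgoing
        for x in c:
            c[x] = max(c[x] + dc[x], 0.0)
    return c

# A + B -> C (fast main reaction) competing with A -> D (slow side reaction)
final = integrate_network(
    [(5.0, {"A": -1, "B": -1, "C": 1}), (0.5, {"A": -1, "D": 1})],
    {"A": 1.0, "B": 1.0, "C": 0.0, "D": 0.0},
    n_steps=5000,
)
yield_C = final["C"] / 1.0  # product-to-initial-substrate concentration ratio
```

Because the main reaction is ten times faster, most of A ends up in the product C, with a minor fraction diverted to the side product D; the atoms of A are conserved across A, C, and D throughout the integration.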

Parameters’ optimization

The model’s 23 parameters are optimized on the training set of 20 reaction networks underlying the known MCRs (from Fig. 1a; in total, spanning 993 mechanistic steps). This optimization (i) aims to maximize the correlation coefficient between the calculated and experimental yields and (ii) is performed using the Bayesian optimization algorithm with a Gaussian process as the surrogate model, as implemented in the OpenBox library46. Specifically, we used expected improvement as the acquisition function and a radial basis function (RBF) kernel. The search space comprised continuous variables along with their ranges and starting values (see argsParser.py for details of the space definition) as well as constraints defining relations between variables (e.g., to enforce that a “Fast” reaction is faster than a “Slow” one, and a “Slow” reaction faster than a “Very Slow” one; function buildOpenBoxSpaceFromDict in the optscan.py file). For each model considered, five independent runs were performed for at least 100 optimization steps each. Each optimization aimed to maximize the coefficient of determination of a linear regression (without intercept, i.e., passing through (0,0)) between experimentally reported yields and yields predicted by the model (see the optFunctionOpenBox function in the optscan.py file for implementation details). The model offering the highest correlation was then taken. With the parameters thus optimized, the model was used to estimate the yields of the 10 MCRs from the test set (Fig. 1b).
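The objective function – the coefficient of determination of a regression forced through the origin – differs from the usual centered R² and is worth spelling out; a minimal implementation:

```python
def r2_through_origin(y_true, y_pred):
    """Coefficient of determination for a linear fit without intercept
    (y_true ~ b * y_pred, forced through (0, 0)). With no intercept,
    the total sum of squares is uncentered."""
    b = sum(t * p for t, p in zip(y_true, y_pred)) / sum(p * p for p in y_pred)
    ss_res = sum((t - b * p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum(t * t for t in y_true)  # uncentered total sum of squares
    return 1.0 - ss_res / ss_tot

# perfectly proportional predictions (slope b = 2) give R^2 = 1
r2 = r2_through_origin([10, 20, 30], [5, 10, 15])
```

A score of this form is what the optimizer maximizes over the 23 parameters; the yields themselves come from the kinetic integration described above.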

Model’s performance and limitations

As shown in Fig. 5 and further quantified in Fig. 6, optimization on Level 3 networks yields a Pearson correlation coefficient, \({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.800, a coefficient of determination, \({R}_{{{{\rm{train}}}}}^{2}\,\)= 0.432, and a mean absolute error, \({{{\rm{MAE}}}}=10.5\%\). On the test set, the model achieves \({\rho }_{{{{\rm{test}}}}}^{2}=\) 0.861, \({R}_{{{{\rm{test}}}}}^{2}\) = 0.852, and \({{{\rm{MAE}}}}=7.3\%\). This means that the model extrapolates well to unseen mechanisms and reaction types. Indeed, Fig. 7 shows that only 32 reaction classes are shared between the 115 classes in the training-set MCRs and the 71 classes in the test set (for the list of reaction classes, see Supplementary Section S2). Of note, the performance of the model is likely as good as it can get given the small size of the training dataset and the model’s assumptions. This is corroborated by the analyses detailed in Supplementary Section S1, in which we compared the best model described here with an ensemble of 331 models that showed similar performance. Indeed, the differences between (i) the predicted yields averaged over the 331-model ensemble and (ii) the predicted yields from the best model are small (Supplementary Fig. S1), and so are the standard deviations of the yields predicted from the ensemble of models. Also, Supplementary Fig. S2 evidences that these standard deviations do not correlate with experimental yields, suggesting that the models have low diversity (i.e., give similar errors/results on new data).

Fig. 5: Correlation between experimental and predicted reaction yields.
figure 5

The training set consists of mechanistic reaction networks for 20 MCRs reported in the literature; the test set comprises mechanistic networks for 10 new, computer-discovered (and experimentally validated) MCRs and one-pot reactions described in ref. 11. Summary of reactions is shown in Fig. 1a, b. The trend line (orange) is a fit to all data points, trend lines fitted to training and test sets are largely overlapping and are not shown for clarity. The red line shows the ideal relationship between experimental and predicted yields (y = x).

Fig. 6: Performance of the full model and sub-models without selected correction(s).
figure 6

a Mean Absolute Error (MAE) of yield prediction, b square of the Pearson correlation coefficient (\({\rho }^{2}\)), c Coefficient of determination (R2).

Fig. 7: Reaction types used in test and training sets.
figure 7

83 reaction types were unique to the training set, 39 reaction types were present only in the test set, and 32 reaction types were common to both sets. Examples of reaction types from each set are listed next to the corresponding sectors of the pie chart. For the full list of reaction types, see Supplementary Section S2.

Next, we investigated which parameters of the model are crucial to its performance. As already mentioned, the \({{\mathrm{ln}}}{{k}}_{{{{\rm{i}}}}}^{{{{\rm{Mayr}}}}}\) term by itself performs poorly – on the training set, it achieves \({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.678, \({R}_{{{{\rm{train}}}}}^{2}\,\)= −1.10, and \({{{\rm{MAE}}}}=22.1\%\), and on the test set, \({\rho }_{{{{\rm{test}}}}}^{2}=\) 0.404, \({R}_{{{{\rm{test}}}}}^{2}\) = −0.059, and \({{{\rm{MAE}}}}=20.7\%\). This is quantified by the second-from-the-left pair of bars in the histogram in Fig. 6. The remaining pairs of bars in this figure correspond to the full model with individual correction terms removed. As seen, the most important correction is \({r}_{{{{\rm{i}}}}}^{{{{\rm{water}}}}}\), penalizing steps that require stoichiometric water but belong to pathways not run under aqueous conditions. Without this correction, the model achieves only \({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.601, \({R}_{{{{\rm{train}}}}}^{2}\,\)= −0.439, and \({{{\rm{MAE}}}}=17.9\%\) and, on the test set, \({\rho }_{{{{\rm{test}}}}}^{2}=\) 0.462, \({R}_{{{{\rm{test}}}}}^{2}\) = 0.166, and \({{{\rm{MAE}}}}=17.3\%\). In turn, the contribution from \({r}_{{{{\rm{i}}}}}^{{{{\rm{class}}}}}\) is important for model generalization: without it, performance on the training set is comparable to that of the full model, but correlation on the test set is significantly worse. On the flip side, some of the corrections may be spurious, implying that models with fewer parameters work equally well; for instance, models without \({r}_{{{{\rm{i}}}}}^{{{{\rm{Polanyi}}}}}\) or \({r}_{{{{\rm{i}}}}}^{{{{\rm{cond}}}}}\) perform comparably to the full model.
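The \({{\mathrm{ln}}}{{k}}_{{{{\rm{i}}}}}^{{{{\rm{Mayr}}}}}\) term rests on Mayr’s linear free-energy relationship, log10 k = sN(N + E), which combines a nucleophilicity parameter N, a nucleophile-specific sensitivity sN, and an electrophilicity parameter E. A minimal sketch of this relationship (the numerical values in the test below are hypothetical, chosen only for illustration):

```python
import math


def mayr_log10_k(N, sN, E):
    """Mayr relationship: log10 of the second-order rate constant from
    nucleophilicity N, nucleophile sensitivity sN, and electrophilicity E."""
    return sN * (N + E)


def mayr_ln_k(N, sN, E):
    """Natural-log form, convenient for additive free-energy-like scores."""
    return math.log(10.0) * mayr_log10_k(N, sN, E)
```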

We also analyzed how detailed the mechanistic knowledge has to be to assure accurate yield predictions. On the one hand, training the full model (i.e., with all corrections) only on the main reaction pathways with immediate side-reactions (Level 2, L2) is insufficient, as the metrics of accuracy (\({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.345, \({R}_{{{{\rm{train}}}}}^{2}\,\)= −0.127 and \({{{\rm{MAE}}}}=16.1\%;\) \({\rho }_{{{{\rm{test}}}}}^{2}=\)0.042, \({R}_{{{{\rm{test}}}}}^{2}\) = −0.590 and \({{{\rm{MAE}}}}=27.1\%\)) are much lower than for the Level 3, L3, analysis described above. On the other hand, expansion to Level 4, L4, allowing for downstream reactions of the L3 species, also worsens performance (\({\rho }_{{{{\rm{train}}}}}^{2}=\) 0.709, \({R}_{{{{\rm{train}}}}}^{2}\,\)= 0.053 and \({{{\rm{MAE}}}}=14.1\%;\) \({\rho }_{{{{\rm{test}}}}}^{2}=\)0.436, \({R}_{{{{\rm{test}}}}}^{2}\) = −0.061 and \({{{\rm{MAE}}}}=20.5\%\)). We believe this effect is reasonably explained by the model’s inherent error (due to the simplified treatment of kinetics) propagating through the rapidly growing networks. In fact, the L4 networks are ca. 80% larger than the L3 ones (Fig. 8).
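The level-by-level growth of the networks can be illustrated with a toy breadth-first expansion; here, `apply_transforms` stands in for the mechanistic transforms and is a placeholder of ours, not part of the study’s code:

```python
def expand_network(substrates, apply_transforms, max_level):
    """Toy level-wise expansion of a mechanistic network: level 1 holds
    the substrates; each subsequent level applies all transforms to the
    species first discovered at the previous level."""
    species = set(substrates)   # all species seen so far
    frontier = set(substrates)  # species added at the current level
    for _ in range(2, max_level + 1):
        new_species = set()
        for s in frontier:
            for product in apply_transforms(s):
                if product not in species:
                    new_species.add(product)
        species |= new_species
        frontier = new_species
    return species
```

With a toy transform that doubles an integer, `expand_network([1], lambda s: [2 * s], 3)` returns the three-level set `{1, 2, 4}`; each extra level multiplies the deepest species, mirroring how network size balloons from L3 to L4.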

Fig. 8: Mean sizes of mechanistic networks at different levels of calculations for test and training sets.
figure 8

Data were derived from bipartite mechanistic graphs in which both reactions and compounds are represented as nodes, and edges connect compounds to the reactions in which they participate (see Fig. 2d). Networks from the test set are consistently larger than those from the training set, and the difference grows with calculation level (up to 15% for L2, 3-23% for L3, and 36-55% for L4).
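The bipartite representation described above (compounds and reactions both as nodes) can be sketched minimally; the network "size" compared here is then the total node count. The data layout below is our assumption, chosen for illustration:

```python
def network_size(reactions):
    """Size of a bipartite mechanistic graph in which every reaction and
    every compound is a node. `reactions` maps a reaction id to a pair
    (substrates, products) of compound-name lists; edges (implicit here)
    link each reaction to its substrates and products."""
    compounds = set()
    for substrates, products in reactions.values():
        compounds.update(substrates)
        compounds.update(products)
    return len(compounds) + len(reactions)
```

For example, two chained steps `{"r1": (["A", "B"], ["C"]), "r2": (["C"], ["D"])}` give 4 compound nodes plus 2 reaction nodes, i.e., a size of 6; shared intermediates such as C are counted once.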

While this trend could be expected, it also points to the main drawback of the mechanistic approach – namely, that it can be quite sensitive to missing parts of the mechanistic picture. For example, had the algorithm used to construct the mechanistic networks not known the substitution of bromide with thiolate, it would not have been able to predict the formation of the 2-[(2-methoxy-2-oxoethyl)sulfanyl]hept-2-enoate by-product in the beta-elimination step of the last reaction in Fig. 1b (marked with a yellow star), and would have grossly overestimated the yield of the desired product (81% instead of 47%). In this respect, we emphasize that even though our current 8000-transform set covers a broad range of mechanistic steps (acid- and base-catalyzed, substitutions, eliminations, additions, rearrangements, pericyclic reactions, basic transformations catalyzed by transition metals), it is not without notable omissions. For instance, radical mechanistic steps are still to be included, since their proper generalization (from specific precedents47 into transforms applicable to different scaffolds) is challenging; this effort may require a separate study, akin to our recent work on carbocationic rearrangements10. Also, the currently available data are insufficient to predict how reaction rates depend on specific choices of catalysts. In the fullness of time, this information may become available either from high-end quantum-mechanical calculations or from the promising marriage of experiments with AI48,49,50,51,52,53,54.

Discussion

In summary, the approach we described is a union of computer-assisted analysis of mechanistic reaction networks with rate approximations and corrections grounded in physical-organic chemistry. The model, as is, works well in predicting the yields of MCRs based on reactions between various nucleophiles and electrophiles, and transfers well between training and test sets built on markedly different and diverse reaction mechanisms. Recognizing that more work is needed to incorporate other classes of reactivity, we feel that approaches like this one should be pursued as an alternative to chemical AI methods based on full-reaction data, which, despite being trained on thousands to millions of literature examples, do not offer satisfactory accuracy of yield prediction.

Methods

The optimization procedure to identify the best model was repeated 10 times from different random starting parameters. This was done using Bayesian optimization with two surrogate models: (i) a Gaussian Process (GP) and (ii) a probabilistic random forest (PRF). During these optimization campaigns, 331 sets of parameters with R2 > 0.40 were found. Average predicted yields from this ensemble were then compared against those from the best model, with the differences being generally small (2.7% on average over all reactions, see Supplementary Fig. S1). To quantify the model’s uncertainty, we considered the standard deviation of the yield predictions of the individual models within the ensemble of all 331 models. For the test and training sets, the standard deviations were small, 2% and 5%, respectively. Moreover, standard deviation and prediction error were not correlated (Supplementary Fig. S2). We also verified that all models have comparable accuracy, and that the results from the ensemble are not significantly better than those of a single model. This suggests that diversity between models is low, and all optimizations result in models with similar accuracies/errors on the test data. Colloquially put, the performance metrics described in the main text are likely as good as they can get, given the dataset and the model’s key assumptions.
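The ensemble analysis above amounts to averaging each reaction’s predicted yield over the models and taking the standard deviation as the uncertainty. A generic sketch (with hypothetical numbers in the example, not the study’s data):

```python
from statistics import mean, pstdev


def ensemble_summary(predictions):
    """predictions: one list of predicted yields per model, all in the
    same reaction order. Returns per-reaction ensemble means and
    (population) standard deviations, the latter serving as the
    uncertainty measure."""
    per_reaction = list(zip(*predictions))  # transpose: models -> reactions
    means = [mean(p) for p in per_reaction]
    stds = [pstdev(p) for p in per_reaction]
    return means, stds
```

For three models predicting two reactions, `ensemble_summary([[40, 60], [44, 62], [48, 64]])` yields per-reaction means of 44 and 62 with small standard deviations, the kind of ensemble spread reported in Supplementary Fig. S1.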