MechFind: a computational framework for de novo prediction of enzyme mechanisms

Hartley, Austin D.; Upadhyay, Vikas; Boorla, Veda Sheersh; Maranas, Costas D.

doi:10.1038/s41467-026-71957-0

Article
Open access
Published: 29 April 2026

Nature Communications volume 17, Article number: 3903 (2026)

Abstract

Fewer than one thousand cataloged mechanistic annotations can be found in the open literature and databases. Here, we introduce MechFind, a computational tool that generates element- and charge-balanced putative enzyme mechanisms using only the reaction stoichiometry. Unlike methods requiring structural data, MechFind abstracts reaction steps as changes in chemical moieties. It identifies the most parsimonious mechanistic descriptions and re-ranks them based on similarity to known mechanisms. MechFind recovers the validated mechanism for 85% of a curated training dataset within the top ten predictions and is indirectly validated on enzymes absent from the training set. When applied to 14,931 reactions from the Rhea database, MechFind identifies plausible mechanisms for 57% of all entries, generating over 18,000 hypotheses. This resource significantly expands mechanistic annotation and provides detailed reaction steps to support de novo enzyme design and engineering. All code, curated datasets, and results are available at https://github.com/maranasgroup/MechFind.git (Commit Hash: fcc0896).

Introduction

Enzymes catalyze nearly every biochemical transformation in living systems, mediating both the catabolic oxidation of carbon substrates to harvest energy and the anabolic assembly of small molecules into the macromolecular machinery of the cell. Their remarkable chemoselectivity and rate enhancements have been harnessed in biotechnology to produce a broad array of compounds, from bulk biochemicals such as amino acids¹, vitamins², and polymer monomers³ to high-value pharmaceuticals⁴, vaccine adjuvants⁵, and pheromones⁶. The ability to engineer existing enzymes or design biocatalysts is therefore of significant value, but has traditionally depended on time- and cost-intensive methods such as random mutagenesis⁷ and directed evolution⁸. However, recent breakthroughs in de novo protein modeling⁹,¹⁰, backbone generation¹¹, and sequence design¹² have begun to accelerate progress towards effective enzyme redesigns. Successful examples of computationally driven enzyme design include functional luciferases¹³, Kemp eliminases¹⁴, and serine hydrolases¹⁵, with many more highlighted in review articles¹⁶,¹⁷,¹⁸. These examples use knowledge of the reaction’s transition state to design possible “theozymes,” which are computational models of the active site. According to transition state theory¹⁹,²⁰, an enzyme achieves its catalytic power by preferentially stabilizing this transition state relative to the substrate and product complexes. Knowledge of the transition state can, in turn, be derived from the enzyme’s mechanism, as each mechanistic step describes the flow of electrons and the rearrangement of atoms and bonds.

Therefore, a complete, stepwise understanding of enzyme mechanisms is essential for illuminating the transition states needed to engineer active sites for enzymatic reactions. Yet, a significant “mechanism gap” exists, as fewer than one thousand enzymatically catalyzed reaction entries collected in major biochemical repositories carry complete mechanistic annotation. For example, the Mechanism and Catalytic Site Atlas (M-CSA)²¹ curates thousands of examples, but only 734 entries document every bond-breaking and bond-forming event in a complete mechanism. This scarcity stands in stark contrast to resources such as USPTO²², BRENDA²³, KEGG²⁴, Rhea²⁵, or MetaNetX²⁶, which catalog tens of thousands of balanced reactions, though without any mechanistic detail. This “mechanism gap” is a significant obstacle to expanding enzyme design beyond a few model systems. Prior computational efforts, such as MechSearch²⁷ and EzMechanism²⁸, partially address this challenge but are limited by their requirement for user-supplied active site residues and, in the case of EzMechanism, a high-resolution protein structure. MechSearch has been used to propose mechanisms for fewer than a thousand reactions from the Rhea database. Moreover, their graph-based representations of elementary steps can become computationally prohibitive when mined exhaustively across millions of hypothetical enzyme-substrate pairs.

Herein, we introduce MechFind, a computational framework that predicts detailed enzymatic mechanisms using only the chemical structures of the starting materials and final products. At its core lies a moiety-based encoding²⁹ in which every non-hydrogen atom is classified by its local bonding environment, and each reaction is summarized as a vector of “moiety gains” and “moiety losses.” This approach was inspired by the novoStoic³⁰ framework, which used a similar encoding to design multi-step metabolic pathways; MechFind adapts this concept to predict multi-step enzyme mechanisms. The original optimization formulation carries out both the identification and ordering of the elementary steps. However, we found that it is more computationally efficient to split these two tasks into two separate optimization problems (minRules and OrderRules). To distinguish the validated mechanism from other chemically plausible, parsimonious alternatives, we also developed a re-ranking procedure that scores each candidate based on its similarity to known mechanisms. This method identifies the most plausible candidate by finding the one that most closely resembles a validated mechanism from our curated database. Through this hybrid approach, MechFind recovers the validated mechanism among the top ten predictions for 85% of the 661 entries in the M-CSA²¹ database deemed valid inputs. MechFind is indirectly validated on a set of six recently published mechanisms absent from the training set³¹,³²,³³,³⁴,³⁵,³⁶. When deployed at scale on 14,931 reactions from the Rhea database²⁵, MechFind identifies a plausible mechanism for 57% of all entries, generating over 18,000 mechanistic hypotheses and systematically characterizing failures as either limitations in computational tractability or gaps in the underlying moiety description set. Finally, we demonstrate MechFind’s utility not just as a prediction tool but as an exploratory engine to map the network of competing catalytic strategies for a given reaction, providing a rich set of testable hypotheses for de novo enzyme design.

Results

We begin by benchmarking the performance of the initial parsimony-based approach on the curated M-CSA database²¹. The parsimony-based framework, on its own, successfully recovered the validated mechanism as the top-ranked prediction in 64% of cases and within the top ten predictions in 85% of cases. We then test the framework’s capacity for discovery by validating it against a set of six recently characterized mechanisms absent from the training data, successfully identifying the validated mechanism in all cases. Having assessed its accuracy, we then deploy MechFind at a large scale to annotate tens of thousands of reactions from the Rhea²⁵ and MetaNetX²⁶ databases, generating over 18,000 mechanistic hypotheses and representing a more than tenfold increase in available mechanisms. Finally, we contextualize our results by comparing MechFind’s performance directly with existing tools and demonstrate its utility as an exploratory engine for mapping the landscape of mechanistic diversity for a given reaction.

Moiety-based representation of reactions

Each M-CSA²¹ entry enumerates substrates, products, catalytic residues/cofactors, and formal arrow-pushing schemes for every step. MechFind abstracts this information into a moiety-based representation. For simplicity, we represent each moiety using a canonical SMILES string³⁷. Every elementary rule is defined by its changes in moiety counts from the set of all unique moieties M. This representation formally supports polar and radical mechanisms, strictly conditional on the specific heterolytic or homolytic bond cleavage or bond formation events being represented in the training set. Each unique rule is placed in a matrix denoted T, where entry ${T}_{m,r}$ represents the gain (positive value) or loss (negative value) of moiety m according to rule r. The overall reaction stoichiometry is similarly encoded as a vector ${T}^{o}$, whose entry ${T}_{m}^{o}$ describes the net change in moiety m between all products and reactants. The fundamental requirement of any valid mechanism is that the cumulative action of all its elementary steps must match the overall reaction stoichiometry. Figure 1 illustrates this encoding process, where the sum of the moiety change vectors for the elementary steps (Fig. 1b) equals the moiety change vector for the overall reaction (Fig. 1a).
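The balance requirement described above can be illustrated with a minimal sketch. The moiety labels `A`, `B`, and `C` and the helper `net_change` are hypothetical placeholders chosen for illustration; they are not actual SMILES moieties or functions from the MechFind code base.

```python
from collections import Counter

def net_change(steps):
    """Sum per-step moiety change vectors (dict: moiety label -> signed count)."""
    total = Counter()
    for step in steps:
        for moiety, delta in step.items():
            total[moiety] += delta
    # Drop moieties whose gains and losses cancel across the mechanism.
    return {m: d for m, d in total.items() if d != 0}

# Hypothetical two-step mechanism; labels are placeholders, not M-CSA moieties.
step_1 = {"A": -1, "B": +1}   # lose moiety A, gain intermediate moiety B
step_2 = {"B": -1, "C": +1}   # lose intermediate B, gain product moiety C
overall = {"A": -1, "C": +1}  # plays the role of the overall vector T^o

is_balanced = net_change([step_1, step_2]) == overall
print(is_balanced)  # True
```

The intermediate moiety `B` cancels out of the sum, which is why a mechanism can pass through moieties that appear in neither the substrates nor the products.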

**Fig. 1: Example of the moiety-based representation of (S)-hydroxynitrile lyase, entry 217 from the M-CSA database.**

Benchmarking of MechFind on M-CSA entries

Having established a curated set of 4091 unique elementary rules from the M-CSA database²¹, we first perform a comprehensive validation test to assess MechFind’s ability to recapitulate the known mechanisms from the 661 M-CSA entries that could be used as input. This analysis excludes 73 of the 734 reactions because they are primarily isomerizations and racemizations that result in no net change to the moiety fingerprint. For each entry, MechFind is tasked with finding the most parsimonious (i.e., fewest steps) mechanism using only the overall reaction stoichiometry as input.

The framework recovered the validated mechanism (as curated in the M-CSA database) as the top-ranked prediction (rank 1) in 429 of the 661 cases (Fig. 2). Thus, simply by invoking parsimony, the validated mechanism is recovered in nearly two-thirds of cases (64%). The recovery rate rose to 78% when the top three predictions were considered and reached 85% within the top ten predictions. Invoking the criterion of parsimony and retaining the top handful of predictions therefore provides strong confidence in recovering the validated mechanism.
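For readers reproducing this kind of benchmark, a top-k recovery rate can be computed from per-entry ranks as sketched below. The function name `top_k_recovery` and the toy rank list are assumptions for illustration only; `None` marks an entry whose validated mechanism was not returned at all.

```python
def top_k_recovery(ranks, k):
    """Fraction of entries whose validated mechanism has rank <= k.

    `ranks` holds one rank per entry; None means the mechanism was not recovered.
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)

# Toy data, not taken from the paper.
ranks = [1, 1, 3, 7, None, 12]

for k in (1, 3, 10):
    print(k, top_k_recovery(ranks, k))
```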

**Fig. 2: Frequency at which the curated validated mechanism was found at a given rank for 661 M-CSA entries using the parsimony-based *minRules* and *OrderRules* framework.**

However, in 15% of cases, the validated mechanism was not recovered within the top ten predictions. This failure had two causes: either the search for solutions exceeded the 20-min computational time limit, or the framework returned a full list of ten putative mechanisms that did not include the known validated one. The latter occurs in cases where the principle of parsimony breaks down, as the validated mechanism is relatively long even though there exist many shorter, chemically plausible alternatives.

Indirect validation on recently characterized mechanisms unseen in training set

To challenge the framework and demonstrate its potential for genuine discovery, we performed an indirect validation using six recently characterized mechanisms absent from the M-CSA training set³¹,³²,³³,³⁴,³⁵,³⁶. For each case, we generated the top parsimonious predictions and then re-ranked them using our similarity-scoring method. These six enzymes represent a diverse set of catalytic functions, spanning EC classes 2 (Transferases), 3 (Hydrolases), and 4 (Lyases).

The results, detailed in Table 1, show a marked improvement in ranking after applying the similarity-based scoring. For example, the validated mechanism for methylisocitrate lyase improved from rank 2 to become the top-ranked prediction. Similarly, the mechanism for glycosaminoglycan lyase saw a significant improvement from rank 8 to rank 4. In four of the six cases, the similarity re-ranker either improved the rank or maintained an already high rank. Crucially, the method never performed worse than the parsimony baseline and successfully identified the validated mechanism within the top 6 predictions for all tested enzymes.
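The general idea of similarity-based re-ranking can be sketched as follows. This section does not specify the similarity score MechFind uses, so the Jaccard overlap of rule sets below is purely an assumed stand-in, and `jaccard` and `rerank` are hypothetical helper names. Ties keep the original parsimony order because Python's sort is stable.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of rule IDs (assumed similarity metric)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def rerank(candidates, known):
    """Sort candidates by their best similarity to any known mechanism.

    `candidates` is assumed to arrive in parsimony order; ties preserve that order.
    """
    def best(candidate):
        return max((jaccard(candidate, k) for k in known), default=0.0)
    return sorted(candidates, key=best, reverse=True)

# Toy rule-ID lists, not taken from the paper.
candidates = [[1, 2], [3, 4], [3, 5]]
known = [[3, 4, 6]]
print(rerank(candidates, known))  # [[3, 4], [3, 5], [1, 2]]
```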

Table 1 Performance of MechFind on an indirect set of recently published mechanisms

Notably, the validated mechanistic steps were often constructed by combining elementary rules derived from enzymes in distantly related organisms, highlighting that the fundamental chemical logic of elementary step catalysis can be interoperable across species. For instance, the validated mechanism for a Peptidoglycan lysozyme from the protist D. discoideum³⁵ was assembled using a combination of elementary rules derived from enzymes in honeybee (A. mellifera), a virus (S. phage), a bacterium (B. circulans), and humans (H. sapiens) (see Table 1). Similarly, the mechanism for a human ubiquitin-conjugating enzyme³¹ was predicted using rules from both human and yeast acetyltransferases (Fig. 3). Although MechFind has no explicit knowledge of ubiquitin or its specific chemistry, it identified the reactive moieties involved (a thioester on acetyl-CoA and a lysine residue’s amine group) and applied a known transformation rule to predict the correct product connectivity. This ability to recognize analogous chemical patterns in vastly different molecular contexts is a key strength of the moiety-based abstraction. A detailed description of the mechanism predictions is shown in Table 1. The lowest parsimony rank among the six validated mechanisms was rank 8, for Glycosaminoglycan lyase. The fact that the validated mechanism was found within a relatively narrow range of predictions provides confidence that MechFind can serve as a reliable hypothesis-generation tool for enzyme mechanism of action.

**Fig. 3: A comparison of the ubiquitin conjugation mechanism from Wathan et al. and a proposed mechanism from MechFind.**

Large-scale mechanistic annotation of the Rhea and MetaNetX databases

Having established its predictive power and generalizability, we next deploy MechFind to address the growing gap between reaction entries and provided mechanisms in popular reaction databases. We processed 14,931 unique elementally balanced reactions from the Rhea database²⁵ and 23,112 from the MetaNetX database²⁶, the vast majority of which had no prior mechanistic annotation. The outcomes of this high-throughput screen are summarized in Table 2. MechFind successfully generates at least one putative, parsimonious mechanism for 8452 (56.6%) of the Rhea entries and 13,658 (59.1%) of the MetaNetX entries. We identified over 18,000 mechanistic solutions from the Rhea database. The number of generated hypotheses exceeds the number of annotated entries because MechFind generates up to ten putative mechanisms per reaction. This effort represents a more than tenfold increase in the number of biochemical reactions with available mechanistic hypotheses over prior efforts, creating a public repository that could significantly expand the mechanistic map of known enzyme chemistry. In comparison, the previous state-of-the-art tool, MechSearch, found mechanisms for only 942 (11%) of the reactions it processed from the Rhea database. All predicted mechanisms from the two databases are available in Supplementary Data 1.

Table 2 Performance of MechFind on Rhea and MetaNetX Database

This analysis also helped to identify some of the current limitations of the MechFind framework. As many as 17.0% of Rhea reactions²⁵ processed by MechFind did not converge within the time limit. While the minRules optimization formulation repeatedly proposed sets of elementary steps, the OrderRules formulation failed to find a feasible ordering, causing the search to exhaust the time limit without a solution. For the 26.4% of reactions where a mechanism could not be constructed, the limitation lies in the scope of our M-CSA-derived rule set²¹, as these reactions involve moieties absent in the training data. Figure 4 shows four such examples from the Rhea database, including cyanate (Rhea:11121) and benzyl isothiocyanate (Rhea:10005). While MechFind cannot currently process these reactions, this analysis highlights a clear path for expansion: curating mechanisms involving these out-of-scope moieties and adding them to MechFind’s rule set would directly broaden the applicability of our framework to previously unannotated classes of enzymatic transformations.

**Fig. 4: Four compounds involved in reactions for which MechFind could not propose a mechanism.**

Comparisons with existing computational tools

To contextualize the performance of our framework, we conducted a direct comparison with MechSearch²⁷, the most similar prior method for computational mechanism prediction. This comparison aimed to quantify the differences in input requirements, underlying data, and large-scale performance. The most salient features and outcomes of this analysis are summarized in Table 3. The most fundamental difference lies in the Required Inputs. MechFind requires only reaction stoichiometry, making it applicable to any elementally balanced reaction, whereas MechSearch’s need for user-supplied active site residues limits its use to enzymes with known or hypothesized structures. MechFind leverages a more comprehensive set of rules derived from 734 curated M-CSA mechanisms²¹, nearly double the 368 used by MechSearch. This broader rule set, combined with the removal of the active site residue requirement, allows MechFind to be applied to a significantly wider range of reactions. On a comparable set of reactions from the Rhea database, MechFind processed 12,370 unique reactions with a Success Rate of 60.3%. This represents a more than fivefold improvement over the 11% success rate reported for MechSearch on a smaller subset of the database.

Table 3 Comparison of Input requirements and scale of applicability between MechFind and MechSearch

Application of MechFind to explore mechanistic diversity

Beyond identifying a single most parsimonious mechanism, MechFind can be used as an exploratory engine to map the network of plausible catalytic pathways for a given transformation. We applied this capability to a canonical esterase reaction, constructing the mechanistic network shown in Fig. 5 by compiling the unique elementary steps from the top-ten predicted mechanisms. This network reveals a landscape of competing catalytic strategies, such as the formation of a serine covalent intermediate versus a direct nucleophilic attack by water, demonstrating that multiple distinct mechanistic routes are often chemically plausible for a single overall reaction.
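One way to compile such a network from a list of predicted mechanisms is sketched below. It assumes each mechanism is an ordered list of rule identifiers and labels each intermediate state by the sorted tuple of rules applied so far, which is reasonable in moiety space because moiety changes are additive. The function `build_network` and the rule names are hypothetical and are not taken from the MechFind repository.

```python
def build_network(mechanisms):
    """Merge ordered rule lists into a set of unique (state, rule, next_state) edges."""
    edges = set()
    for mechanism in mechanisms:
        state = ()  # substrate state: no rules applied yet
        for rule in mechanism:
            # Order-independent label: the same multiset of rules gives the same state.
            next_state = tuple(sorted(state + (rule,)))
            edges.add((state, rule, next_state))
            state = next_state
    return edges

# Toy mechanisms: the first two reach the same intermediate by different orders.
mechanisms = [["r1", "r2", "r3"], ["r2", "r1", "r3"], ["r4", "r3"]]
network = build_network(mechanisms)
print(len(network))  # 7
```

Shared edges are stored once, so branch points in the resulting edge set correspond to places where competing catalytic strategies diverge or reconverge.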

**Fig. 5: Mechanistic network for a canonical esterase reaction.**

This landscape of mechanistic diversity presents a powerful opportunity for de novo enzyme design. The bridge between a predicted mechanism and a functional enzyme lies in the principle of transition state stabilization¹⁹. Each elementary step in the network proceeds through a unique transition state. By using computational chemistry methods to find the lowest-energy path between the predicted intermediates of a given step, the highest-energy point can be identified as the putative 3D transition state structure. These structures are the precise, actionable targets required by advanced generative models like RFdiffusion All-Atom to build a stabilizing protein scaffold (Krishna et al., 2024). MechFind, therefore, provides a menu of these potential transition states, transforming the abstract challenge of “designing an esterase” into a series of concrete, testable engineering tasks, such as, “design a protein active site that stabilizes the tetrahedral transition state formed by Rule 640.” In this way, MechFind’s ability to map the mechanistic landscape provides a rich foundation of parallel hypotheses to guide the rational design and engineering of enzymes with tailored catalytic functions.

Discussion

In this work, we introduced MechFind, a computational framework designed to address the significant and growing gap in mechanistic information for catalogued biochemical reactions. By leveraging a curated set of elementary chemical rules from the M-CSA database²¹, MechFind predicts detailed, multi-step enzyme mechanisms using only overall reaction stoichiometry as input, employing a hybrid approach that combines mixed-integer linear programming with a similarity-based re-ranker that scores candidate mechanisms against our entire database of validated mechanisms. The framework’s efficacy was first established by recapitulating the validated mechanism within the top ten predictions for 85% of the 661 known M-CSA entries, with the validated mechanism identified as the top-ranked, most parsimonious solution in 64% of cases. Its capacity for genuine discovery was then demonstrated on an indirect validation set of six enzymes, where it successfully identified the validated mechanism in all cases, often by combining chemical rules from distantly related organisms. When deployed at scale, MechFind generates over 18,000 putative mechanisms for 57% of the 14,931 reactions tested from the Rhea database²⁵, representing a more than tenfold increase in the number of reactions with available mechanistic hypotheses and a nearly sixfold performance improvement over existing tools. Finally, we demonstrate MechFind’s utility not just as a prediction tool, but as an exploratory engine for generating a comprehensive network of testable hypotheses about the diverse catalytic strategies available for a given reaction, as shown for a canonical esterase. We emphasize that these outputs are computational mechanistic hypotheses; they serve as promising starting points for expert review or downstream QM/MM simulations to rigorously assess mechanistic feasibility.

The primary advance of MechFind lies in its ability to bypass the main bottleneck that has limited previous computational approaches. Methods such as MechSearch²⁷ and EzMechanism²⁸ require user-supplied active site residues or high-resolution protein structures, respectively, restricting their application to a small subset of well-characterized enzymes. This prerequisite makes the high-throughput analysis of entire databases infeasible. By removing this requirement, MechFind enables high-throughput mechanistic annotation of entire reaction databases. This capability begins to bridge the “mechanism gap” between the tens of thousands of known biochemical reactions and the fewer than one thousand with detailed catalytic steps. The resulting library of over 18,000 mechanistic hypotheses provides a critical starting point for downstream rational enzyme engineering efforts. It fundamentally changes the paradigm from a low-throughput, hypothesis-driven process, where a researcher must first pinpoint which residues are involved, to a high-throughput, data-driven discovery process that formalizes the generation of hypotheses.

Our large-scale application of MechFind yields non-obvious insights into both the conservation of enzymatic strategies and the nature of the challenges that remain. The indirect validation results strongly suggest the existence of a universal “chemical grammar” that is conserved across species. For example, the mechanism for a Mycoloyltransferase from the bacterium C. glutamicum was constructed using elementary rules derived from enzymes in a kiwi plant and a mouse (Table 1). Similarly, the mechanism for a human ubiquitin-conjugating enzyme was predicted using rules from yeast acetyltransferases (Fig. 3). These findings demonstrate that fundamental catalytic principles are modular and can be recognized and redeployed by the model, regardless of the evolutionary context or the overall molecular scaffold. This insight supports our core moiety-based abstraction, indicating that the local chemical environment is a powerful descriptor for predicting enzymatic reaction mechanisms across the kingdoms of life. It implies that the vast diversity of enzymatic function arises not from an unbounded set of chemical possibilities, but from the recombination of a relatively small set of elementary reaction rules.

While this work demonstrates a significant step forward, it is essential to acknowledge the framework’s current limitations, which define clear directions for future development. Our analysis of the 43% of Rhea²⁵ reactions for which no mechanism was found reveals that the problem is twofold. We identified that for 17% of reactions, the challenge stems from a limitation in our decomposed optimization strategy, where the minRules formulation proposes rule sets for which OrderRules cannot find a valid sequence within the time limit. This points to the need for more efficient optimization algorithms or heuristics for particularly large and complex substrates. In addition, we found that for 26% of reactions, the limitation is the scope of our M-CSA-derived rule set²¹, which lacks the necessary chemical moieties to describe the transformation. This highlights that the model’s predictive power is limited to its library of known chemistry, as it identifies pathways by combining existing reaction steps rather than generating new transformations. The clear path forward is to continue expanding the rule set by curating more diverse mechanisms. Furthermore, our current implementation deliberately aggregates stereoisomers. Given training data, the incorporation of stereo-specific rules would unlock more complex mechanisms for MechFind.

Perhaps the key impact of this work, however, lies in proposing a path to close the loop for automated enzyme design. As established in the introduction, the goal of de novo enzyme design is to create a protein active site that stabilizes the transition state of a desired reaction¹⁹. The newest generation of protein design tools, such as RFdiffusion All-Atom¹¹ and ProteinMPNN¹², is very effective at generating protein structures but requires a precise structural target to aim towards. The mechanistic hypotheses generated by MechFind provide the putative transition states that these advanced design tools require. This transforms the abstract challenge of “designing an enzyme for this reaction” into the concrete and actionable task of “designing a protein scaffold to stabilize this specific proposed mechanism.” This is particularly powerful when considering the diversity of possible mechanisms (see Fig. 5) that MechFind can uncover, offering a roadmap of candidate transition states that can serve as targets for parallel design efforts. In conclusion, MechFind provides both a foundational tool and a public resource that significantly expands the map of known enzyme chemistry, paving the way for the next generation of rational enzyme engineering.

Methods

Database curation and rule generation

A prerequisite for building a predictive framework based on chemical transformations is a dataset that is internally consistent and adheres to mass and charge conservation laws. Curation began with the 734 fully annotated mechanisms in the M-CSA database²¹, which form the basis of our elementary rule set. Initial analysis revealed that a significant fraction of these entries contained inconsistencies, such as incorrect protonation states, inconsistent stereochemical assignments, or mis-drawn homolytic versus heterolytic bond events. These issues resulted in violations of elemental and charge balance for either individual steps or the overall reaction. To rectify this, we manually curated the entire set of 734 mechanisms. Due to stereochemical ambiguities in most catalogued mechanisms, we opted to bypass molecular chirality and simply represent all compounds by their constitutional isomers. Beyond this simplification, other corrections were designed to be as minimal as possible, primarily addressing inconsistencies in protonation states and ensuring that arrow-pushing schemes correctly represented bond reorganizations. In total, 463 entries (63%) required at least one correction. For example, in M-CSA entry 66 (see Fig. 6a), the protonation state of the product was inconsistent with the elementary steps, requiring a correction to the overall reaction stoichiometry (Fig. 6b). Following the curation process a final, self-consistent set of 3235 reversible mechanistic steps was assembled. Each step was treated as reversible, creating a forward and backward version, and combined with protonation steps of common moieties to generate the 4091 unique elementary rules that underpin MechFind. Note that moieties are simply classifications of all non-hydrogen atoms based on their first bonding shell present in the metabolites forming the database. Reaction rules abstract the overall reaction as the gain or loss of specific moieties.
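The reversibility expansion described above (each curated step yields a forward and a backward rule, with duplicates collapsed) can be sketched in a few lines. The moiety labels and the helper `expand_reversible` are hypothetical; the actual rule generation also adds protonation steps of common moieties, which this sketch omits.

```python
def expand_reversible(steps):
    """Return the set of unique forward and backward rules.

    Each step is a dict mapping a moiety label to its signed count change;
    rules are stored as frozensets of (moiety, change) pairs so duplicates collapse.
    """
    rules = set()
    for step in steps:
        forward = frozenset(step.items())
        backward = frozenset((moiety, -delta) for moiety, delta in step.items())
        rules.update((forward, backward))
    return rules

# Toy steps: the second is the exact reverse of the first, so it adds no new rule.
steps = [
    {"A": -1, "B": +1},
    {"B": -1, "A": +1},
    {"C": -1, "D": +1},
]
rules = expand_reversible(steps)
print(len(rules))  # 4
```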

**Fig. 6: Examples of common adjustments made in the molecules in the M-CSA database with the original state circled in red and their corrected state in blue.**

While our current framework uses constitutional isomers, we developed a proof-of-concept workaround for reactions involving the inversion of a single chiral center (e.g., racemases, epimerases). For the 13 relevant M-CSA entries, we defined a reaction from the substrate to a non-chiral intermediate and used MechFind to predict the first half of the mechanism, inferring the second half by reversing the steps. This approach successfully generated plausible mechanisms for these cases, though full stereochemical integration remains future work. For more information on the 13 relevant entries, see Supplementary Information.
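The mirroring step of this workaround can be sketched as follows, assuming the first half of the mechanism (substrate to non-chiral intermediate) is available as an ordered list of moiety-change dicts. The helper `mirror_mechanism` and the moiety labels are hypothetical placeholders, not names from the MechFind code.

```python
def mirror_mechanism(first_half):
    """Append the reversed, sign-flipped steps to a substrate-to-intermediate half-mechanism."""
    second_half = [
        {moiety: -delta for moiety, delta in step.items()}
        for step in reversed(first_half)
    ]
    return first_half + second_half

# Toy first half: deprotonation by a base, then a rearrangement (placeholder labels).
first_half = [
    {"C_sp3": -1, "C_anion": +1, "Base": -1, "BaseH": +1},
    {"C_anion": -1, "C_planar": +1},
]
full = mirror_mechanism(first_half)
print(len(full))  # 4
```

Because the framework works with constitutional isomers, the net moiety change of the mirrored mechanism is zero, which is consistent with why these entries cannot be handled directly from the overall stoichiometry.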

Optimization formulation for mechanism prediction

To assemble plausible, stepwise enzymatic mechanisms, the fundamental challenge is to simultaneously identify the necessary set of elementary reaction rules and establish a feasible order. Conceptually, this can be addressed by a single mixed-integer linear programming (MILP) formulation, which we term minOrderRules. This formulation minimizes the total number of steps subject to constraints, ensuring moiety balance and a valid sequencing order that prevents the use of moieties not present in the reactants or synthesized beforehand.

minOrderRules

The minOrderRules formulation uses an integer variable ${y}_{r}$ to represent the number of times a rule r is used, and a binary variable ${z}_{k,r}$ to assign rule r to a specific step k in the mechanism.

Objective function

$$\min {\sum}_{r\in R}{y}_{r}$$

(1)

- This objective minimizes the total number of elementary steps, enforcing the principle of parsimony.

Subject to:

Moiety balance:

$$\begin{array}{c}{\sum }_{r\in R}{T}_{m,r}{y}_{r}={T}_{m}^{o}\\ \forall m\in M\end{array}$$

(2)

- This constraint ensures that the sum of moiety changes from all applied rules equals the net change associated with the overall reaction.

Cumulative sum:

$$\begin{array}{c}{C}_{m}^{o}+{\sum }_{r\in R}{\sum }_{{k}^{{\prime} }=1}^{k}{T}_{m,r}{z}_{{k}^{{\prime} },r}\ge 0\\ \forall k\in K\forall m\notin {M}^{*}\end{array}$$

(3)

- This key constraint ensures chemical plausibility by preventing the consumption of any substrate or intermediate moiety before it has been created or was present in the initial reactants. It accomplishes this by ensuring that the cumulative balance of these specific moieties remains non-negative at every step k of the mechanism. Here, ${C}_{m}^{o}$ is the initial count of moiety m in the substrates.

Crucially, this constraint is applied to all moieties m that are not part of the set M^*, which represents all labeled moieties originating from catalytic residues and cofactors. This distinction is essential for modeling general acid/base catalysis, allowing catalytic groups to act as reusable proton donors (sources) or acceptors (sinks) without violating mass balance. For example, this allows a protonated histidine to be consumed in one step and a deprotonated form to be generated in another without violating the constraint, as both histidine moieties are in M^* and thus exempt from this rule. This gives the model the necessary flexibility to utilize the different states of catalytic residues and cofactors as required throughout the mechanism, reflecting their dynamic role in the active site. A complete list of the labeled moieties in M^*, which includes amino acid residues (e.g., His, Asp) and catalytically relevant metal ions (e.g., Mg, Fe, Mo), can be found in Supplementary Information.
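The cumulative-sum check in Eq. (3) can be illustrated with a minimal Python sketch. The moiety names, the two toy rules, and the helper `first_violation` are hypothetical and chosen only for illustration; they are not taken from the MechFind code base. The sketch walks through an ordered list of rules, tracks running moiety counts, and reports the first step at which a moiety outside M^* would go negative.

```python
from typing import Dict, List, Set

Rule = Dict[str, int]  # moiety name -> net change caused by one rule (a column of T)


def first_violation(initial: Dict[str, int], ordered_rules: List[Rule], exempt: Set[str]) -> int:
    """Return the 1-based step at which a non-exempt moiety count first goes
    negative, or 0 if the whole order satisfies the cumulative-sum constraint."""
    counts = dict(initial)  # plays the role of C_m^o
    for k, rule in enumerate(ordered_rules, start=1):
        for moiety, delta in rule.items():
            counts[moiety] = counts.get(moiety, 0) + delta
        # Only moieties outside the exempt set (M^*) must stay non-negative.
        if any(c < 0 for m, c in counts.items() if m not in exempt):
            return k
    return 0


# Hypothetical two-step mechanism S -> I -> P with a histidine shuttling a proton.
rule_a = {"S": -1, "I": +1, "HisH+": -1, "His": +1}
rule_b = {"I": -1, "P": +1, "His": -1, "HisH+": +1}
initial = {"S": 1}
catalytic = {"His", "HisH+"}  # stands in for M^*

print(first_violation(initial, [rule_a, rule_b], catalytic))  # valid order
print(first_violation(initial, [rule_b, rule_a], catalytic))  # consumes I too early
print(first_violation(initial, [rule_a, rule_b], set()))      # no exemption for histidine
```

In this toy case the order (a, b) passes only because the histidine moieties are exempt; with an empty exempt set the same order fails at the first step, which is the behavior the M^* exemption is meant to avoid.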

One rule per step:

$$\begin{array}{c}{\sum }_{r\in R}{z}_{k,r}\le 1\\ \forall k\in K\end{array}$$

(4)

- This constraint enforces that each step k in the mechanism uses at most one elementary rule.

Step utilization limit:

$$\begin{array}{c}k{\sum }_{r\in R}{z}_{k,r}\le {\sum }_{r\in R}{y}_{r}\\ \forall k\in K\end{array}$$

(5)

- This constraint ensures that steps are used only if they are within the total number of steps required for the mechanism. It effectively forces ${z}_{k,r}$ to be zero for any step k that is greater than the total number of steps $\sum {y}_{r}$.

Contiguous steps:

$$\begin{array}{c}{\sum }_{r\in R}{z}_{k,r}\ge {\sum }_{r\in R}{z}_{k+1,r}\\ \forall k\in \{1,\ldots,|K|-1\}\end{array}$$

(6)

- This ensures that the sequence of steps is contiguous, with no empty (i.e., rule unassigned) steps between steps in use.

Linking variables:

$$\begin{array}{c}{\sum }_{k\in K}{z}_{k,r}={y}_{r}\\ \forall r\in R\end{array}$$

(7)

- This constraint links the step assignment variable ${z}_{k,r}$ to the rule count variable y_r, ensuring that the total number of times a rule is assigned to a step equals its total count.
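To make the roles of Eqs. (4) to (7) concrete, the following Python sketch checks a given step assignment against them. The data layout (a list of per-step dictionaries for z and a dictionary for y) and the function name `check_assignment` are assumptions made for this illustration and are not part of the published code.

```python
from typing import Dict, List


def check_assignment(z: List[Dict[str, int]], y: Dict[str, int]) -> List[str]:
    """z[k-1][r] is 1 if rule r is assigned to step k; y[r] is the rule count.
    Returns a list of human-readable constraint violations (empty if none)."""
    problems = []
    total_steps = sum(y.values())
    used = [sum(step.values()) for step in z]  # number of rules placed at each step

    for k, n in enumerate(used, start=1):
        if n > 1:  # Eq. (4): at most one rule per step
            problems.append(f"Eq. (4): step {k} has {n} rules")
        if k * n > total_steps:  # Eq. (5): no step beyond the total count
            problems.append(f"Eq. (5): step {k} exceeds {total_steps} total steps")

    for k in range(len(used) - 1):
        if used[k] < used[k + 1]:  # Eq. (6): no gap before a used step
            problems.append(f"Eq. (6): empty step {k + 1} precedes used step {k + 2}")

    rules = set(y).union(*(step.keys() for step in z))
    for r in sorted(rules):
        assigned = sum(step.get(r, 0) for step in z)
        if assigned != y.get(r, 0):  # Eq. (7): assignments match rule counts
            problems.append(f"Eq. (7): rule {r} assigned {assigned} times, expected {y.get(r, 0)}")
    return problems


y = {"a": 1, "b": 2}
z_good = [{"a": 1}, {"b": 1}, {"b": 1}, {}]
z_gap = [{"a": 1}, {}, {"b": 1}, {"b": 1}]

print(check_assignment(z_good, y))
print(check_assignment(z_gap, y))
```

The second assignment leaves step 2 empty, so it is flagged both by the contiguity constraint and by the step utilization limit, since step 4 lies beyond the three steps that y calls for.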

Integer cuts:

$$\begin{array}{c}{\sum }_{r\in {S}^{l}}{y}_{r}\le \left|{S}^{l}\right|-1\\ \forall l\in L\end{array}$$

(8)

- To generate a ranked list of the ten most parsimonious mechanisms, specialized integer cut constraints³⁸ are applied iteratively. After finding a solution l, we define S^l as the set of rules r with y_r > 0. The corresponding integer cut constraint excludes the prior solution from the set of feasible choices without excluding any other combination. By successively appending integer cut constraints and re-solving the optimization problem, a ranked list of optima (i.e., first, second, third, etc.) is obtained.

Decomposed formulation for computational tractability

Early trials revealed that solving the minOrderRules formulation is computationally taxing for all but the simplest reactions, with solution times often reaching several minutes per candidate mechanism. To create a high-throughput and scalable tool, we decomposed this complex task into two separate MILP problems: minRules, which first identifies the needed rules, and OrderRules, which subsequently finds a feasible ordering. For all analyses presented in this work (M-CSA, Rhea, and MetaNetX), a computational time limit of 20 min was imposed per reaction.

In the OrderRules formulation, the total number of steps in the mechanism is no longer a variable to be optimized but is instead a fixed integer, defined by the sum of the y_r values in the solution set S^l provided by minRules. This dramatically reduces the combinatorial search space for the second, more complex ordering problem. Consequently, the constraints for limiting step utilization and ensuring step contiguity (Eqs. (5) and (6)) are no longer required. Conversely, in the minRules formulation, the maximum possible number of rules must be directly specified (Eq. (11)).

minRules

This first formulation, minRules, identifies the minimal set of rules required to satisfy the overall reaction stoichiometry.

Objective function:

$$\min {\sum }_{r\in R}{y}_{r}$$

(9)

- This objective function is identical to Eq. (1).

Subject to:

Moiety balance:

$$\begin{array}{c}{\sum }_{r\in R}{T}_{m,r}{y}_{r}={T}_{m}^{o}\\ \forall m\in M\end{array}$$

(10)

- This constraint is identical to Eq. (2).

Maximum number of rules:

$${\sum }_{r\in R}{y}_{r}\le W$$

(11)

- This constraint limits the total number of elementary steps to a user-defined value W (set here to 20 because the longest validated mechanism in the M-CSA database, naringenin-chalcone synthase (entry 355), consists of 20 steps).

Integer cuts:

$$\begin{array}{c}{\sum }_{r\in {S}^{l}}{y}_{r}\le \left|{S}^{l}\right|-1\\ \forall l\in L\end{array}$$

(12)

- This constraint is identical to Eq. (8) and is used to generate multiple unique sets of rules.
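The minRules formulation with iterative integer cuts can be sketched on a toy problem. The example below replaces the MILP solver with brute-force enumeration, which is feasible only for a handful of rules and is not how MechFind solves the problem; the rule names, the matrix `T`, and the function `min_rules` are hypothetical. Candidates are scanned in order of increasing total rule count, so each accepted solution is the most parsimonious one that satisfies all cuts added so far.

```python
from collections import Counter
from itertools import combinations_with_replacement
from typing import Dict, List

RuleMatrix = Dict[str, Dict[str, int]]  # rule -> {moiety: net change}, i.e. columns of T


def min_rules(T: RuleMatrix, target: Dict[str, int], W: int, n_solutions: int) -> List[Dict[str, int]]:
    """Enumerate up to n_solutions rule-count vectors y in order of increasing sum(y)."""
    cuts = []       # supports S^l of the solutions found so far
    solutions = []
    for total in range(1, W + 1):  # Eq. (9): fewest rules first; Eq. (11): at most W
        for combo in combinations_with_replacement(sorted(T), total):
            y = Counter(combo)
            # Eq. (12): reject y if the sum over a previous support S^l exceeds |S^l| - 1
            if any(sum(y[r] for r in S) > len(S) - 1 for S in cuts):
                continue
            net = Counter()
            for r, n in y.items():
                for m, d in T[r].items():
                    net[m] += n * d
            # Eq. (10): net moiety change must equal the overall reaction vector
            if all(net[m] == target.get(m, 0) for m in set(net) | set(target)):
                solutions.append(dict(y))
                cuts.append(frozenset(y))
                if len(solutions) == n_solutions:
                    return solutions
    return solutions


# Hypothetical rules over moieties S (substrate), I and J (intermediates), P (product).
T = {
    "a": {"S": -1, "I": +1},
    "b": {"I": -1, "P": +1},
    "c": {"S": -1, "P": +1},
    "d": {"I": -1, "J": +1},
    "e": {"J": -1, "P": +1},
}
target = {"S": -1, "P": +1}

for rank, y in enumerate(min_rules(T, target, W=4, n_solutions=3), start=1):
    print(rank, y)
```

Because minRules ignores ordering, each returned rule set is only a candidate; whether it can be sequenced without consuming a moiety before it exists is decided separately by OrderRules.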

OrderRules

For each set of rules S^l found by minRules, this second formulation determines whether a feasible sequence exists. Because this is a pure feasibility problem, its objective function is a dummy function that plays no role in the solution.

Subject to:

Cumulative sum:

$$\begin{array}{c}{C}_{m}^{o}+{\sum }_{r\in {S}^{l}}{\sum }_{{k}^{{\prime} }=1}^{k}{T}_{m,r}{z}_{{k}^{{\prime} },r}\ge 0\\ \forall k\in K,\ \forall m\notin {M}^{*}\end{array}$$

(13)

- This constraint is analogous to Eq. (3).

One rule per step:

$$\begin{array}{c}{\sum }_{r\in {S}^{l}}{z}_{k,r}\le 1\\ \forall k\in K\end{array}$$

(14)

- This constraint is analogous to Eq. (4).

Each rule must be used:

$$\begin{array}{c}{\sum }_{k\in K}{z}_{k,r}={y}_{r}\\ \forall r\in {S}^{l}\end{array}$$

(15)

- This constraint, analogous to Eq. (7), ensures every rule from the minRules solution is used in the sequence.
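As a rough illustration of the OrderRules feasibility question, the sketch below searches permutations of a fixed rule set until it finds one that never drives a non-exempt moiety count negative. Exhaustive permutation search is an assumption made here for clarity and scales factorially; the published method solves an MILP instead. The rule names, the matrix `T`, and the function `order_rules` are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Optional, Set

RuleMatrix = Dict[str, Dict[str, int]]  # rule -> {moiety: net change}


def order_rules(T: RuleMatrix, y: Dict[str, int], initial: Dict[str, int],
                exempt: Set[str]) -> Optional[List[str]]:
    """Return one feasible ordering of the rules in y, or None if no ordering exists."""
    # Eq. (15): every rule appears exactly y_r times in the sequence.
    steps = [r for r, n in sorted(y.items()) for _ in range(n)]
    for order in permutations(steps):  # Eq. (14): one rule per step
        counts = dict(initial)
        feasible = True
        for r in order:
            for m, d in T[r].items():
                counts[m] = counts.get(m, 0) + d
            # Eq. (13): moieties outside M^* must never go negative.
            if any(c < 0 for m, c in counts.items() if m not in exempt):
                feasible = False
                break
        if feasible:
            return list(order)
    return None


# Hypothetical rule set whose alphabetical order is not a valid sequence.
T = {
    "r1": {"J": -1, "P": +1},
    "r2": {"S": -1, "I": +1},
    "r3": {"I": -1, "J": +1},
}

print(order_rules(T, {"r1": 1, "r2": 1, "r3": 1}, {"S": 1}, set()))
print(order_rules(T, {"r1": 1}, {"S": 1}, set()))
```

A rule set for which the function returns None corresponds to a minRules solution that balances the overall stoichiometry but cannot be arranged into a chemically plausible sequence.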

Similarity-based Re-ranking

To improve the recovery rate of validated mechanisms beyond simple parsimony, we first attempted to train a deep neural network to re-rank the top candidates. This approach, however, could not improve upon the initial parsimony-based rankings (see Supplementary Information for details), likely due to the limited size (734 entries) and high dimensionality of the training data preventing the model from learning complex chemical patterns. We therefore developed an alternative method to re-rank the top-ten candidate mechanisms generated by the minRules and OrderRules formulations. This successful method scores each candidate based on its similarity to the known mechanisms in the M-CSA database²¹.

The re-ranking process begins with the list of the top-ten most parsimonious mechanisms for a given reaction. Each of these candidate mechanisms undergoes a pairwise comparison against every validated mechanism in the curated M-CSA database. The similarity is quantified using the “unordered” score variant described by Ribeiro et al.³⁹, which is calculated over the sets of elementary chemical steps that constitute each mechanism and ranges from zero (no shared steps) to one (identical sets of steps). Each candidate mechanism is then assigned a single score corresponding to the maximum similarity value obtained from all its pairwise comparisons, which represents its similarity to its closest known analog in the database. This scoring logic guarantees that for the M-CSA benchmark, if the validated mechanism is found within the parsimonious set, it will achieve a perfect similarity score of one by matching its own entry in the reference database, thereby ensuring it is re-ranked to the top position. Finally, the ten candidates are re-ranked in descending order of their assigned similarity scores, with the candidate having the highest score ranked first.
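The re-ranking logic can be sketched as follows. The exact "unordered" score of Ribeiro et al.³⁹ is not reproduced here; the Jaccard index over sets of steps is used as a stand-in because it has the same range (zero for no shared steps, one for identical sets), and that substitution is an assumption of this example. Breaking ties by the original parsimony order is also an assumption, and all names and data are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Stand-in similarity: 0.0 for disjoint step sets, 1.0 for identical ones."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rerank(candidates: List[Tuple[str, FrozenSet[str]]],
           reference: Dict[str, FrozenSet[str]]) -> List[Tuple[str, float]]:
    """Score each candidate by its best match in the reference set, then sort
    by descending score. sorted() is stable, so ties keep the input order."""
    scored = [
        (name, max(jaccard(steps, ref) for ref in reference.values()))
        for name, steps in candidates
    ]
    return sorted(scored, key=lambda item: -item[1])


# Hypothetical reference mechanisms and candidates, each a set of step labels.
reference = {
    "entry_1": frozenset({"a", "b"}),
    "entry_2": frozenset({"a", "d", "e"}),
}
candidates = [  # listed in parsimony order
    ("cand_1", frozenset({"c"})),
    ("cand_2", frozenset({"a", "b"})),
    ("cand_3", frozenset({"a", "d"})),
]

for name, score in rerank(candidates, reference):
    print(name, round(score, 3))
```

Here the second candidate matches a reference entry exactly and moves to the top, mirroring the guarantee described above for validated mechanisms that appear in the parsimonious set.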

Data acquisition from public databases

To apply MechFind on a large scale, we integrated reaction data from Rhea²⁵ and MetaNetX²⁶. Rhea supplies a standardized list of 34,575 biochemical reactions; removing all non-unique reactions reduces the total to 14,931 reactions for application in MechFind. A similar process was performed for the MetaNetX database, which contains 23,586 balanced non-transport reactions. After removing all non-unique reactions, a final set of 22,461 reactions from MetaNetX was used for testing. A significant overlap exists between these two curated datasets, with 8196 (35%) of the reactions in our final MetaNetX set also being present in the Rhea set. The distribution of these reactions across the seven main Enzyme Commission (EC) classes is shown in Fig. 7, highlighting that both databases are predominantly composed of oxidoreductases (EC 1), transferases (EC 2), and hydrolases (EC 3). Notably, a significant fraction of the reactions in both databases (54% in Rhea and 29% in MetaNetX) have no associated EC number. A complete list of the reactions included from both databases is available in Supplementary Data 1.

**Fig. 7: Frequency of reactions across the seven main EC classes.**

Adaptive radius

To address specific transferase reactions where the standard radius 1 moiety encoding results in no net change (i.e., the reaction vector is zero despite a chemical transformation taking place), we implemented an adaptive resolution strategy. For these reactions, the moiety radius is incrementally increased until a net moiety change is detected, at which point MechFind is executed using a rule set re-derived at this higher, consistent radius.
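The adaptive radius loop can be illustrated with a deliberately simplified toy model. In the sketch below, molecules are plain strings and a "moiety" is the window of characters within a given radius of each position; this string encoding is an assumption made purely for illustration and is not the graph-based atom environment used by MechFind. The function names and the `max_radius` cap are likewise hypothetical, and the sketch stops at detecting the radius; it does not re-derive a rule set.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def moiety_counts(molecules: List[str], radius: int) -> Counter:
    """Count the radius-r window around every character of every toy molecule."""
    counts = Counter()
    for mol in molecules:
        padded = "_" * radius + mol + "_" * radius  # pad so end atoms get full windows
        for i in range(len(mol)):
            counts[padded[i:i + 2 * radius + 1]] += 1
    return counts


def net_change(reactants: List[str], products: List[str], radius: int) -> Dict[str, int]:
    """Reaction vector: product moiety counts minus reactant counts, zeros dropped."""
    diff = moiety_counts(products, radius)
    diff.subtract(moiety_counts(reactants, radius))
    return {m: d for m, d in diff.items() if d != 0}


def adaptive_radius(reactants: List[str], products: List[str], start: int = 1,
                    max_radius: int = 4) -> Optional[Tuple[int, Dict[str, int]]]:
    """Increase the radius until the reaction vector is non-zero."""
    for radius in range(start, max_radius + 1):
        change = net_change(reactants, products, radius)
        if change:
            return radius, change
    return None


# Toy "group transfer": the end groups Y and W swap between two chains.
reactants = ["XABCY", "ZABCW"]
products = ["XABCW", "ZABCY"]

print(net_change(reactants, products, 1))
print(adaptive_radius(reactants, products))
```

In this toy case every radius 1 window appears equally often on both sides, so the reaction vector is empty, while radius 2 windows span enough context to distinguish the swapped end groups.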

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The curated M-CSA mechanisms, elementary rules matrix (Unique_Rules.csv), and arrow environment data (M-CSA_arrow_rules_r0.json) used in this study are available in the GitHub repository. The full list of reactions from the Rhea and MetaNetX databases that were processed, along with the predicted mechanisms, is available in Supplementary Data 1. The source data underlying Figs. 2, 7, and Supplementary Information Figs. S2–S8 are provided as a Source Data file. Source data are provided with this paper.

Code availability

The source code for the MechFind framework and the Jupyter notebook to reproduce the example analysis are publicly available on GitHub at https://github.com/maranasgroup/MechFind.git⁴⁰ (Commit Hash: fcc0896).

References

1. Mitsuhashi, S. Current topics in the biotechnological production of essential amino acids, functional amino acids, and dipeptides. Curr. Opin. Biotechnol. 26, 38–44 (2014).
2. Survase, S. A., Bajaj, I. B. & Singhal, R. S. Biotechnological production of vitamins. Food Technol. Biotechnol. 44, 381–396 (2006).
3. Chung, H. et al. Bio-based production of monomers and polymers by metabolically engineered microorganisms. Curr. Opin. Biotechnol. 36, 73–84 (2015).
4. Ahmad, A. L., Oh, P. C. & Shukor, S. R. A. Sustainable biocatalytic synthesis of L-homophenylalanine as pharmaceutical drug precursor. Biotechnol. Adv. 27, 286–296 (2009).
5. Liu, Y. Z. et al. Complete biosynthesis of QS-21 in engineered yeast. Nature 629, 937–944 (2024).
6. Petkevicius, K., Löfstedt, C. & Borodina, I. Insect sex pheromone production in yeasts and plants. Curr. Opin. Biotechnol. 65, 259–267 (2020).
7. Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807 (2014).
8. Turner, N. J. Directed evolution drives the next generation of biocatalysts. Nat. Chem. Biol. 5, 568–574 (2009).
9. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
10. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
11. Krishna, R. et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 384, 6693 (2024).
12. Dauparas, J. et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).
13. Yeh, A. H. et al. De novo design of luciferases using deep learning. Nature 614, 774–780 (2023).
14. Listov, D. et al. Complete computational design of high-efficiency Kemp elimination enzymes. Nature 643, 1421–1427 (2025).
15. Lauko, A. et al. Computational design of serine hydrolases. Science 388, 6744 (2025).
16. Hossack, E. J., Hardy, F. J. & Green, A. P. Building enzymes through design and evolution. ACS Catal. 13, 12436–12444 (2023).
17. Zhou, J. H. & Huang, M. L. Navigating the landscape of enzyme design: from molecular simulations to machine learning. Chem. Soc. Rev. 53, 8202–8239 (2024).
18. Wen, S. X. et al. Generative artificial intelligence for enzyme design: recent advances in models and applications. Curr. Opin. Green Sustain. Chem. 52, 101010 (2025).
19. Schramm, V. L. Enzymatic transition states and transition state analog design. Annu. Rev. Biochem. 67, 693–720 (1998).
20. Amyes, T. L. & Richard, J. P. Specificity in transition state binding: the Pauling model revisited. Biochemistry 52, 2021–2035 (2013).
21. Ribeiro, A. J. M. et al. Mechanism and Catalytic Site Atlas (M-CSA): a database of enzyme reaction mechanisms and active sites. Nucleic Acids Res. 46, D618–D623 (2018).
22. Lowe, D. Chemical reactions from US patents (1976–2016). Figshare https://doi.org/10.6084/m9.figshare.5104873 (2017).
23. Chang, A. et al. BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic Acids Res. 49, D498–D508 (2021).
24. Kanehisa, M. et al. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 51, D587–D592 (2023).
25. Bansal, P. et al. Rhea, the reaction knowledgebase in 2022. Nucleic Acids Res. 50, D693–D700 (2022).
26. Moretti, S. et al. MetaNetX/MNXref: unified namespace for metabolites and biochemical reactions in the context of metabolic models. Nucleic Acids Res. 49, D570–D574 (2021).
27. Andersen, J. L. et al. Graph transformation for enzymatic mechanisms. Bioinformatics 37, i392–i400 (2021).
28. Ribeiro, A. J. M. et al. EzMechanism: an automated tool to propose catalytic mechanisms of enzyme reactions. Nat. Methods 20, 1516–1522 (2023).
29. Kumar, A. & Maranas, C. D. CLCA: maximum common molecular substructure queries within the MetRxn database. J. Chem. Inf. Model. 54, 3417–3438 (2014).
30. Kumar, A. et al. Pathway design using de novo steps through uncharted biochemical spaces. Nat. Commun. 9, 184 (2018).
31. Wathan, A. J. et al. The lysine deprotonation mechanism in a ubiquitin conjugating enzyme. J. Phys. Chem. B 129, 4962–4968 (2025).
32. Lesur, E. et al. Synthetic mycolates derivatives to decipher protein mycoloylation, a unique post-translational modification in bacteria. J. Biol. Chem. 301, 108243 (2025).
33. Stuart, W. S. et al. Structure and catalytic mechanism of methylisocitrate lyase, a potential drug target against Coxiella burnetii. J. Biol. Chem. 301, 108517 (2025).
34. Osika, K. R., Gaynes, M. N. & Christianson, D. W. Crystal structure and catalytic mechanism of drimenol synthase, an unusual bifunctional terpene cyclase-phosphatase. Proc. Natl. Acad. Sci. USA 122, e2506584122 (2025).
35. Ortjohann, M. & Leippe, M. Molecular characterization of two newly recognized lysozymes of the protist Dictyostelium discoideum. Dev. Comp. Immunol. 164, 105334 (2025).
36. Wei, L. et al. Crystal structure and catalytic mechanism of PL35 family glycosaminoglycan lyases with an ultrabroad substrate spectrum. eLife 13, RP102422 (2025).
37. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
38. Glover, F. Stronger cuts in integer programming. Oper. Res. 15, 1174–1177 (1967).
39. Ribeiro, A. J. M. et al. Measuring catalytic mechanism similarity–a new approach to study enzyme function and evolution. FEBS J. 292, 4200–4210 (2025).
40. Hartley, A. D., Upadhyay, V., Boorla, V. S. & Maranas, C. D. MechFind source code v1.0.0. Zenodo https://doi.org/10.5281/zenodo.18674529 (2026).

Acknowledgements

This material is based upon work supported by the Center for Bioenergy Innovation (CBI), U.S. Department of Energy, Office of Science, Biological and Environmental Research Program under Award Number ERKP886 (A.D.H., V.S.B., C.D.M.). Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the U.S. Department of Energy. This work was also supported by the U.S. National Science Foundation funded Molecule Maker Lab Institute (MMLI), award number 2019897 (A.D.H., V.U., C.D.M.), supported by National AI Research Institutes Program of the Directorate for Computer and Information Science and Engineering (CISE), in collaboration with the Division of Chemistry (CHE) and the Division of Chemical, Bioengineering, and Environmental Transport Systems (CBET), awarded to C.D.M. This publication was supported by the National Institutes of Health (NIH) Training Grant Number 5T32GM149417 (A.D.H., C.D.M.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors of this work recognize the Penn State Institute for Computational and Data Sciences (RRID:SCR_025154) for providing access to computational research infrastructure within the Roar Core Facility (RRID: SCR_026424).

Author information

Authors and Affiliations

Department of Chemical Engineering, The Pennsylvania State University, University Park, PA, USA
Austin D. Hartley, Vikas Upadhyay, Veda Sheersh Boorla & Costas D. Maranas
The Center for Bioenergy Innovation, Oak Ridge, TN, USA
Austin D. Hartley, Veda Sheersh Boorla & Costas D. Maranas

Contributions

A.D.H. Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing (original draft), visualization. V.U. Conceptualization, software, supervision. V.S.B. Conceptualization, software, supervision. C.D.M. Conceptualization, writing (review & editing), supervision, project administration, funding acquisition.

Corresponding author

Correspondence to Costas D. Maranas.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks António J M Ribeiro and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary File

Supplementary Data 1

Reporting Summary

Transparent Peer Review file

Source data

Source data file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Hartley, A.D., Upadhyay, V., Boorla, V.S. et al. MechFind: a computational framework for de novo prediction of enzyme mechanisms. Nat Commun 17, 3903 (2026). https://doi.org/10.1038/s41467-026-71957-0

Received: 17 September 2025
Accepted: 01 April 2026
Published: 29 April 2026
Version of record: 29 April 2026
DOI: https://doi.org/10.1038/s41467-026-71957-0
