Main

Enzymes evolve as part of the metabolic network, a large, interconnected system that possesses a topology dependent on evolution and the chemical properties of its metabolites3,4,5. Because of the central role of metabolism, enzymes are important drug targets, biomarkers and a focus of bioengineering6,7,8. Despite its critical role across disciplines, our understanding of the global biochemical constraints that shape enzyme function—and therefore its evolution—remains incomplete.

Comparing enzyme sequences across evolution has revealed various constraints that act at the amino acid level, such as the chemical identity of side chains and epistatic interactions9,10,11. Moreover, the sequence of enzymes is shaped by the costs of enzyme production12. Indeed, metabolic cost optimization is observed at the species and molecular levels. For example, under many conditions, cells prefer cost-effective fermentation over oxidative metabolism, despite the latter producing stoichiometrically higher ATP amounts13. Furthermore, especially high-abundance enzymes have evolved by incorporating less energetically costly amino acids14,15,16.

We hypothesized that the systematic accessibility of protein structures enabled by structural prediction1 would allow us to integrate structural biology with evolutionary genomics and expand our understanding of the relationship between metabolism and protein evolution. We leveraged the extensive characterization of the Saccharomycotina subphylum, which represents 400 million years of evolution and includes Saccharomyces cerevisiae and Candida albicans2,17,18,19, and examined 11,269 AlphaFold2-predicted and experimentally determined enzyme structures that belong to 424 orthologue groups (orthogroups) associated with 361 metabolic reactions in 224 metabolic pathways. Linking these structures with phenotypic data, enzyme properties and metabolic network reconstructions, we identified structural changes that are associated with metabolic constraints. We report how structural evolution depends on metabolic properties across species, pathways and molecular levels.

Mapping evolution in enzyme structures

We selected 26 out of 332 highly phylogenetically diverse yeast species of the Saccharomycotina subphylum19,20 (Supplementary Note 1) and included the model fission yeast Schizosaccharomyces pombe as an outgroup to root protein trees (Fig. 1a and Supplementary Table 1.1). Then, for enzymes present in the YeastPathways database, we initially assigned orthologues on the basis of sequence-based clusters19. From these sequences, we obtained 1,301 structures from AlphaFoldDB21 and a further n = 9,968 structures that we predicted using AlphaFold v.2.0.1 at the start of our project (Supplementary Note 2). Our final dataset consisted of 11,269 enzyme structures organized in 424 orthogroups.

Fig. 1: Divergence in structurally conserved regions corresponds to metabolic properties acting at the species, pathway and molecular levels.
figure 1

a, Phylogenetic tree of the Saccharomycotina yeast subphylum highlighting the 26 species (+1 outgroup species) for which metabolic enzyme structures were generated and systematically compared. Colours indicate phylogenetic order, numbering counterclockwise starting at C. albicans. Branch lengths and topology are from the species time tree as calculated in ref. 2, except for the branch for the outgroup species S. pombe, which is not drawn to scale. b, Illustration of our analysis pipeline. c, Example alignment for 5-formyltetrahydrofolate cyclo-ligase structures in five species (S. cerevisiae, C. albicans, Kluyveromyces lactis, Kluyveromyces pastoris and S. pombe). The black line denotes the reference structure in S. cerevisiae (Fau1p). Insets show the orthologue from C. albicans with mapped residues (M, orange) and unmapped residues (N, cyan), as well as residues conserved (C, purple) between S. cerevisiae and C. albicans (purple). d, Mean mapping ratio (M/(M + N)) to mean conservation ratio (C/M) for the 529 reference structures that passed our filters. Dotted line denotes the identity line, the dashed line denotes the axis median and the solid line indicates the best linear fit. WGH, whole-genome hybridization.

Source data

Prediction quality was assessed using the predicted local distance difference test (pLDDT) score22,23, which revealed that our dataset included well-structured proteins (mean pLDDT = 90.4, mean coefficient of variation (c.v.) = 0.15). Compared with the overall structures, the terminal regions were predicted with a lower quality (first 10% of the sequence: mean pLDDT = 79.1, mean c.v. = 0.29; central 80%: mean pLDDT = 92.2, mean c.v. = 0.13; last 10%: mean pLDDT = 87.3, mean c.v. = 0.18) (Extended Data Fig. 1a). Then, we validated and refined orthogroup assignments using hierarchical clustering on the basis of a bidirectional template modelling score24 (Extended Data Fig. 1b,c). On the basis of a linkage cut-off of 0.2, 29 sequence-based orthogroups were split into distinct structural orthogroups (Supplementary Table 1.5), which improved the average template modelling score (from 0.71 to 0.77).

To benefit from the extensive characterization of S. cerevisiae, we calculated pairwise alignments for each orthogroup to S. cerevisiae enzyme structures using the matchmaker algorithm of UCSF Chimera25. To link these structures to metabolic constraints (Fig. 1b), we calculated averaged mapping ratios (MRs) and conservation ratios (CRs). The MR quantifies the percentage of amino acids that are 1:1 mappable to a S. cerevisiae enzyme structure, whereas the CR quantifies the percentage of mapped residues identical to those of the reference structure. The CR was tightly correlated with a CR based on the amino acid types (Extended Data Fig. 1d), thus capturing physicochemical properties. In agreement with this, a strong inverse relationship to the change of the octanol–water partition coefficient is observed (Extended Data Fig. 1e; P < 1 × 10−31, Spearman’s ρ = −0.49).

We illustrate the MR and CR for the orthogroup of 5-formyltetrahydrofolate cyclo-ligase (27 structures, CR = 0.40, MR = 0.87; Fig. 1c and Extended Data Fig. 1f; Fau1p in S. cerevisiae). The core structures of the enzymes map well and the secondary structural elements (helical or extended) revealed a mean MR of 95.4%, whereas the regions without secondary structures (random coils), which have a higher conformational flexibility, have a mean MR of 77.3%. The median MR for all orthogroups was 87.4% (interquartile range (IQR) = 78.3–93.9%) (Fig. 1d), with missing mapping mainly in low-pLDDT scoring regions (Extended Data Fig. 1g), that is, the carboxy and amino termini (Extended Data Fig. 1a) and random coil regions (60% of the unmapped and 36% of the mapped regions). Although the MR generally correlated with the CR (Spearman’s ρ = 0.55, adjusted P (Padj.) < 1 × 10−41) (Fig. 1d), both the MR and the CR reflect different properties of structural divergence. As the larger structural rearrangements reflected by the MR were less frequent in our orthogroups, we focused on the CR, for which we observed a high degree of diversity (median CR = 62.9%, IQR = 53.6%, 71.2%, total range = 24.5%, 89.2%) that could be linked to metabolic evolution. Here we refer to sequence divergence in structurally mapped regions as divergence (low CR) and sequence similarity in structurally mapped regions as conservation (high CR).

Biochemical constraints

The impact of metabolic specialization

We asked whether metabolic specializations at the species level are reflected in the protein structures and linked divergence to the growth properties of yeast in 21 different carbon sources17,18,19,26 (Extended Data Fig. 2a). Enzymes of species able to ferment glucose, raffinose, galactose and sucrose exhibited the smallest P values for differences in average CR between subgroups, alongside enzymes from species that grew aerobically on d-xylose (Padj. < 1 × 10−83, two-sided Wilcoxon signed-rank test). Enzymes from anaerobically fermenting species had a higher conservation relative to the structure from S. cerevisiae, which also ferments (Fig. 2a and Extended Data Fig. 2b). Although this finding corresponds to their closer phylogenetic relationship, some of the largest differences in CR were detected in the orthogroups of enzymes involved in central carbon metabolism and the electron transport chain (ETC), for example, Kgd2p (tricarboxylic acid (TCA) cycle) and Cox7p (respiratory chain). These more divergent orthogroups also included Met10p (methionine and sulfur cycle), Ath1p (trehalose metabolism) and Erg1p (ergosterol biosynthesis). We also observed cases in which the CR in non-fermenting species was higher than in the fermenting species. Again, the enzymes were directly related to oxidative metabolism, including Ndi1p, Ald5p, Idp1p and Ilv6p. Moreover, gene ontology (GO)-slim terms ‘membrane’, ‘lipid metabolism’, ‘endoplasmic reticulum’ and ‘endomembrane system’ were enriched in the first quartile of orthogroups with the largest differences in CR between subgroups for glucose fermentation (Padj. < 1 × 10−2, Fisher’s exact test) (Extended Data Fig. 2c).

Fig. 2: Metabolic network organization constrains the structural evolution of enzymes in Saccharomycotina.
figure 2

a,b, Mean conservation ratio per protein calculated for species that do or do not ferment glucose (a) and species that can or cannot grow on d-xylose (b), in refs. 17,18. For this analysis, orthogroups were temporarily subdivided on the basis of the phenotype of the species. For a, proteins with the largest differences in both directions, and, for b, pathways with remarkable changes, are highlighted. The dotted line denotes the identity line, the solid line denotes the linear fit and the dashed line denotes the axis median. c, Conservation ratio projected onto the yeast metabolic map of iPath3. d, Receiver operating characteristic (ROC) curve for four highly enriched pathways covering the TCA cycle and glucose metabolism. The number in the legend indicates the AUC. The dashed line denotes the identity line representing random sampling. e, Distribution of the mean conservation ratio of orthogroups assigned to oxidoreductases (red) or hydrolases (blue). The average distribution for all orthogroups is shown in black. f, Conservation ratio of enzymes known to bind metals (n = 158) and those not known to bind metals (n = 371), expressed as a box plot. The line denotes the median and the boxes denote the first and third quartile, the whiskers extend up to 1.5 times the IQR. Each dot represents an orthogroup. ***P < 1 × 10−4, two-sided Wilcoxon–Mann–Whitney U-test.

Source data

To study an example of metabolic specialization, we focused on the xylose use pathway. Of the 26 species examined, 12 can grow on d-xylose, 8 cannot and 6 have a conditional phenotype (Extended Data Fig. 2a). Several enzymes required for xylose use, such as transketolase, enzymes in the thiamine biosynthetic pathway and the ETC, were among those with the highest change in CR (Fig. 2b). Notably, the CR measured in relation to the two S. cerevisiae acetyl-CoA synthase paralogues, the aerobic (Acs1p) or the anaerobic (Acs2p), behaved differently depending on the capacity to use xylose (Fig. 2b, Extended Data Fig. 2d,e and Supplementary Note 3.1), indicating specialization for Acs1p. Thus, species that specialize metabolically show different patterns of divergence in enzymes that are related to the relevant metabolic traits.

The impact of pathway membership

Next, we projected diversity in the orthogroups onto a genome-scale reconstruction of the metabolic network (Fig. 2c). Then, we performed a pathway enrichment analysis on the 25% most divergent and conserved enzymes (Extended Data Fig. 3a,b). The most conserved enzyme structures belonged to pathways for purine biosynthesis, specific amino acid biosynthesis as well as central metabolism (Padj. < 0.05, Fisher’s exact test; Extended Data Fig. 3a,b). The same pathways also showed early enrichment in the receiver operating characteristic curve and high area under the curve (AUC) values (Fig. 2d and Extended Data Fig. 3c). Consistently, these orthogroups were enriched in the GO-slim terms ‘generation of precursor metabolites and energy’ and ‘nucleobase-containing small molecule metabolic process’. The most divergent enzymes tended to be more broadly distributed across the metabolic network, with the exception that enzymes belonging to ‘lipid metabolic process’ were enriched. For instance, Oar1p (CR = 0.384), Tes1p (CR = 0.391) and Eci1p (CR = 0.441) for fatty acid oxidation and Tsc10p (CR = 0.411) for sphingolipid synthesis were within the top-5%-quantile of divergence. We speculate that lipidomes of cells have a greater flexibility to adapt to the metabolic environment. Furthermore, the UDP-N-acetylglucosamine transferases Alg13p and Alg14p, related to N-linked glycosylation, were also among the most divergent orthogroups.

The impact of molecular function

Next, we annotated the biochemical activities by extracting enzyme classification (EC) identifiers from UniProtDB. Of 468 matching EC entries, 238 (51%) were supported by a direct PubMed Evidence code. Our enzyme structures with functional annotation covered 44% (361 out of 817) of all EC classifiers, distributed across 224 metabolic pathways (S. cerevisiae annotation; Supplementary Table 1.4) and encompassed all major enzyme classes, including 119 oxidoreductases, 191 transferases, 55 hydrolases, 49 lyases, 21 isomerases and 29 ligases. The dataset also contains two translocases, but, because of their low number, we refrained from making general conclusions for this EC class.

We noted that in each orthogroup, most enzymes seem to catalyse the same or highly similar reactions, and observed no changes in EC function at levels 1, 2 or 3. A change in the EC level 4 classification was detected in only five orthogroups. For example, in OG1390, containing Hsu1p, Str2p and YML082p in S. cerevisiae, a slight change in the catalysed reaction occurs27,28, which might be facilitated by changes in a domain close to the binding site (Supplementary Note 3.2).

We detected a clear relationship between enzyme class and diversity. Oxidoreductases are enriched for high CR, and hydrolases for low CR (Fig. 2e and Extended Data Fig. 3b). The high conservation of oxidoreductases was explained by their prominent role in glycolysis, gluconeogenesis and the TCA cycle. After excluding these central metabolic pathways, oxidoreductases were not more conserved than other enzymes (Extended Data Fig. 3d). Conversely, the increased diversity of hydrolases was not explained by their role in specific metabolic pathways (Extended Data Fig. 3e).

Furthermore, we report a role for non-catalytic protein–small molecule interactions. First, metal-binding enzymes are more conserved than non-metal-dependent enzymes (two-sided Wilcoxon–Mann–Whitney U-test, P < 1 × 10−4, 7.6% decrease in median conservation, Cliff’s Δ = 0.23; Fig. 2f). Furthermore, we observed that enzymes with a higher number of intracellular inhibitors29 are more conserved (Kendall τ = 0.22, Padj. < 1 × 10−4; Extended Data Fig. 3f). For instance, Gnd2p, a central enzyme of the pentose phosphate pathway, is inhibited by at least 45 cellular metabolites29, and its orthogroup was one of the least divergent. We thus concluded that characteristics of enzymes, such as pathway membership, dependency on metal ions and the number of small molecule interactions, constrain divergence.

Abundance and flux diversity constraints

Protein abundance is important for sequence conservation30,31. Given that enzyme expression is contingent on the specific activity and flux of the enzyme, we hypothesized that abundance could be a mechanism through which metabolism influences structural evolution. We obtained nine of the examined species from the UK National Collection of Yeast Cultures20 (Fig. 3a) and used proteomics to estimate their protein abundance (Fig. 3a, Extended Data Fig. 4a,b and Supplementary Methods). We found that low-abundance enzymes were more diverse than high-abundance enzymes (Fig. 3b; Spearman’s ρ = 0.48, Padj. <1 × 10−27). This relationship was dependent on the enzyme class; abundance and diversity were highly interdependent for isomerases, but not for hydrolases (Fig. 3c).

Fig. 3: Roles of enzyme abundance and flux in structural evolution.
figure 3

a, Protein abundance was determined using proteomics with data-independent acquisition (Supplementary Methods). b, Mean conservation ratio and the mean log2-transformed protein abundances for 9 of the 27 investigated species measured during exponential growth in minimal medium (n = 491). The orthogroup containing the Thi5/11/12/13p family is highlighted in red, other enzymes of the thiamine biosynthetic pathway are highlighted in purple. The solid line indicates the best linear fit, the dashed line denotes the axis median. c, Spearman’s correlation between protein abundance and conservation ratio for all tested enzymes as well as broken down according to enzyme class (numbers are adjusted according to the presence of protein abundance data). *Padj. < 0.05. d, Correlation between different measures of predicted flux (median, c.v. and number of species (n = 329) with flux through a given orthogroup) and conservation ratio for all tested enzymes, broken down in each column according to the enzyme class. For the flux median and c.v., the Spearman’s correlation was used, and the Kendall τ correlation was used for the number of species. The number in brackets indicates the number of enzymes per class. *Padj. < 0.05. e, Violin plot of the c.v. of the fluxes for the orthogroups in the first and last quartile of conservation ratio (blue) or protein abundance (orange). f, The thiamine biosynthesis pathway is shown. Heat maps underneath enzymes indicate the Z-scores of the mean conservation ratio, mean log2-transformed protein abundance and averaged cost. The Thi5/11/12/13p family and Thi4p undergo suicide reactions in which they lose a histidine and cysteine residue, respectively. Illustrations in a were created using BioRender. Heineike, B. (2025) https://BioRender.com/r831qhq. HMP, 4‐amino‐2‐methyl‐5‐pyrimidine; HMP-P, 4‐amino‐2‐methyl‐5‐pyrimidine phosphate; HMP-PP, 4‐amino‐2‐methyl‐5‐pyrimidine diphosphate; LC–MS, liquid chromatography–mass spectrometry; NAD+, nicotinamide adenine dinucleotide; TDP, thiamine diphosphate; TMP, thiamine phosphate.

Source data

Next, we estimated metabolic flux through each pathway on the basis of genome-scale metabolic models for each of the 26 species18. We benchmarked the fluxes to 13C flux measurements for central metabolites, which are captured by both approaches, and obtained good agreement (Pearson’s r = 0.8 for median (13C flux); Extended Data Fig. 4c). Notably, we revealed only a weak correlation between flux and CR (Fig. 3d; Spearman’s ρ = 0.19, Padj. < 1 × 10−4). Instead, a stronger signal was detected for flux variability (Spearman’s ρ = −0.27, Padj. < 1 × 10−8). Consistently, orthogroups with low CR exhibited a wide range of fluxes (Fig. 3e and Extended Data Fig. 4d). There were exceptions to this trend, such as in the orthogroup containing the sphingosine kinase Ysr3p, indicating highly variable flux and high diversity (low CR), despite high abundance. Moreover, we also detected a dependence on the enzyme class. Although divergence was strongly linked to the flux carried by oxidoreductases, for which all tested measures (median flux, variability of the flux and species in which flux was present for the orthogroup) correlated with conservation, other enzyme classes, especially hydrolases (Fig. 3d; Padj. > 0.7 for all three measures), lacked these relationships. We also estimated enzyme processivity (kcat) for each protein sequence in a species-specific manner32, and found a weak relationship between conservation and kcat. However, we report that the log-transformed variability (standard deviation) of kcat is higher in orthogroups that are diverse (Extended Data Fig. 4e; Spearman’s ρ = −0.27, Padj. < 1 × 10−8). Thus, both flux and kcat seem to be associated with structural evolution primarily through their variability, rather than their absolute values.

In parallel, we noticed that enzymes can escape the typical relationship between abundance and conservation due to unique functional constraints. Our attention was drawn to the thiamine (vitamin B1) biosynthetic pathway, for which the orthogroup containing Thi5p, Thi11p, Thi12p and Thi13p was highly conserved, despite low abundance (Fig. 3b). Notably, thiamine is extremely energetically costly to synthesize, as two of its reaction steps are catalysed by suicide enzymes that lose an essential amino acid after catalysis33 (Fig. 3f). This orthogroup thus illustrated that evolutionary diversification is not necessarily directly constrained by abundance, but more likely by cost.

Hierarchy of cost optimization

We thus calculated the average cost per amino acid for each protein averaged over the orthogroup using established cost metrics12,14,15,34,35,36 (Supplementary Note 4). Previous studies indicated that high-abundance proteins evolve a less costly amino acid composition14,15,16. Consistently, low-abundant, diverse enzymes had more costly amino acid compositions than high-abundant enzymes (Fig. 4a). However, the overall protein chain length, a proxy for the total protein cost, showed only a weak correlation with the mean averaged cost per amino acid (Spearman’s ρ = −0.13, Padj. < 1 × 10−2), indicating that cost optimization might be due to selection acting on specific structural elements. Notably, these relationships were indicated by cost models that are both dependent and independent of species-specific metabolic networks (Supplementary Note 4).

Fig. 4: Evolutionary cost optimization acts differently depending on the structural element and amino acid properties.
figure 4

a,b, Spearman’s correlation of the average cost per amino acid for the entire protein compared with the mean conservation ratio, log2-transformed protein abundance and length of the protein chain (a) and compared with the mean CR of selected structural features (b). Cost measures have been described previously12,14,34,36. *Padj. < 0.05. c, Median normalized cost (Rel. cost) of each amino acid sorted in descending order. Error bars denote the median absolute deviation. Bar colouring indicates the Spearman’s correlation between the mean CR and the relative amino acid content of the entire protein. df, Spearman’s correlation between the relative amino acid content for various structural features of proteins compared with the mean overall CR (e), scatter plot illustrating the relationship between mean overall CR and the relative (Rel.) alanine content of the entire mapped region (d) and relative surface glycine content of the mapped region (f). Solid line indicates the best linear fit and the dashed line denotes the axis median (d and f). *Padj. < 1 × 10−4 (e). The colour bar shown in e applies to ac and e.

Source data

To test the role of structural elements, we next identified small-molecule-binding sites using two orthogonal strategies: (1) binding-site information deposited in UniProt to account for directly coordinating residues (UPb) and (2) residues near small molecules from experimentally determined structures obtained from the RCSB Protein Data Bank (PDB) to also account for the physicochemical environment (CSb). Next, to distinguish between surface and core residues, we used the relative amino acid solvent-accessible surface area and designated each residue as either a surface-exposed or core residue37. We found that the amino acid composition is optimized differently in these structural elements. In general, the average cost of core residues was higher than the cost of surface residues, whereas binding sites had a more variable cost. Among surface residues, we also observed increased costs for membrane-bound proteins and short proteins (<100 amino acids), such as Qcr8p or Kti11p in S. cerevisiae, that might be part of larger complexes (Extended Data Fig. 5a).

Moreover, we observed a hierarchy of structural evolution. Enzymes with more conserved surfaces tended to be less costly. This effect, although less strong, was also present for core residues, but was markedly diminished for binding sites, suggesting that surface residues are the primary sites for cost optimization (Fig. 4b). Moreover, the more expensive aromatic amino acids, such as phenylalanine and tyrosine, were more frequent in the variable and low-abundance enzymes. By contrast, the conserved, high-abundant enzymes contained higher amounts of the least expensive amino acids, glycine, glutamate and alanine (Fig. 4c,d and Extended Data Fig. 5b,c).

Addressing the molecular level, we detected a significant association between CR and amino acid content in at least one structural feature, for 11 of the 20 canonical amino acids (Padj. < 1 × 10−4; Fig. 4e and Extended Data Fig. 5d). For instance, surfaces and coil regions of highly conserved enzymes had higher glycine content (Fig. 4f), whereas the core and helical regions had higher alanine content. Presumably, these small and inexpensive amino acids can replace many other amino acids, and their physicochemical properties are more suited to either the core or the surface. Furthermore, the surfaces of the variable enzyme structures had a higher cysteine content. This is presumably because cysteine often performs dedicated functions related to its high reactivity, such as catalytic activity or stabilization through disulfide bridges38. Notably, there were no significant associations between the presence of any amino acid in binding sites and the CR of the enzymes.

Conservation of structural features

Hierarchy of structural evolution

At the substructure level, the core of the enzyme was more conserved than the surface (Fig. 5a). Furthermore, the binding sites had the highest CR (Fig. 5b and Extended Data Fig. 6a), and orthogroups containing fewer variable enzymes were significantly enriched for possessing a known binding site (AUC = 0.62 and 0.66; Fig. 5c). An exception was the orthogroup of the fatty acid synthetase subunit Fas2p, in which the binding site (n = 21 (UPb), 95 (CSb) amino acids) was more variable than the overall protein (Extended Data Fig. 6b,c).

Fig. 5: Structural features and enzyme function are linked to conservation.
figure 5

a, Mean conservation ratios of the core and the surface. Dotted line denotes the identity line and the dashed line denotes the axis median. Bottom right, depiction of the core (black) and surface (orange) of the protein based on relative solvent-accessible surface area for Fau1p. b, Distribution of the mean conservation ratio for all mapped residues (blue), the core (black), the surface (orange) and the binding-site residues (purple, CSb). c, Enrichment of orthogroups containing the 25% most conserved enzymes for structural features and catalytic properties. Dot size indicates the number of metabolic enzymes associated with the relevant list in our selection, colour indicates the adjusted P value. d, Mean conservation ratios of the extended and the helical parts. Lines as in a. e, Percentage of residues in a column on the surface of the related structure for residues under purifying selection (Pur. sel.) or neutral drift (N. drift) (site model). ***P < 1 × 10−4, two-sided Wilcoxon signed-rank test. f, Depiction of the conserved regions in Cdc19p. Each non-white region denotes a fully conserved cluster of amino acids. The spheres indicate the ligand (phosphoenolpyruvate (PEP), associated clusters: dark blue, light blue, green and cyan) and the activator (fructose-1,6-bisphosphate (F-1,6-BP), associated cluster: orange) (crystal structure: PDB 1A3W). Homotetrameric subunits are highlighted (associated cluster: dark blue). Top right, clustered network (nodes, residues; edges, Cα atoms within 10 Å; colours, cluster labels). Bottom right, summary of the clustered network (node sizes, adjusted cluster sizes; edges, connectivity; edge widths, edge number). g, Percentage of either all or only fully conserved residues identified as binding sites (CSb, n = 140). The line denotes the median and the boxes denote the first and third quartile, the whiskers extend up to 1.5 times the IQR. ***P < 1 × 10−4, two-sided Wilcoxon signed-rank test.

Source data

At the level of the secondary structure, most helical portions (for example, α-helices) varied more than did extended portions (for example, β-sheets) (Fig. 5d). This result is robust to the tendency of AlphaFold2 to wrongly predict random coil regions (2.28%) as helical more than as extended39 (Supplementary Note 2). To our surprise, helical portions were also more variable than mapped random coils, which are assumed to have a higher dynamic flexibility (Extended Data Fig. 6d and Supplementary Note 5). One explanation for this might be the presence of turns in the random coil regions, which show high conservation (Extended Data Fig. 7a,b).

Another factor that can influence structural evolution is larger architectures (protein folds). To assess this, we extracted fold information for those proteins in our dataset that were present in The Encyclopedia of Domains database40. Indeed, amino acids in these folds are slightly more conserved (two-sided Wilcoxon signed-rank test, P < 1 × 10−44, adjusted Cliff’s Δ = 0.64). Consistently, the two most dominant folds, the Rossmann fold (CATH fold 3.40.50) and TIM barrel (CATH fold 3.20.20), also exhibit higher conservation (two-sided Wilcoxon signed-rank test, P < 1 × 10−4, adjusted Cliff’s Δ = 0.36 (Rossmann fold) and P < 1 × 10−7, adjusted Cliff’s Δ = 0.86 (TIM barrel); Extended Data Fig. 6e). The orthogonal bundle (CATH architecture 1.10) was enriched in the most conserved enzymes (Fisher’s exact test, Padj. < 1 × 10−3).

Conserved clusters capture interactions

To identify important functional residues, we conducted evolutionary selection analysis by estimating non-synonymous (dN) and synonymous (dS) substitution rates at both the whole-protein and site levels41 (Extended Data Fig. 8 and Supplementary Note 6.1). We observed a higher probability of neutral drift (dN/dS < 1) in surface residues (Fig. 5e). Moreover, in specific cases with (1) low conservation and high whole-protein dN/dS values (ETC proteins and Pox1p); or (2) the presence of sites under positive selection across all species, for example the GAPDH orthogroup containing Tdh1/2/3p, we observed residues with evidence of positive selection in specific branches using a branch-site model42,43. In Pox1p, positively selected residues are found in the homodimer interface (Extended Data Fig. 9a,b), whereas in GAPDH, positively selected residues were primarily on surface-exposed residues, and did not affect the multimeric protein–protein interaction (PPI) sites. Moreover, one positively selected residue in GAPDH is within distance of the substrate–cofactor binding interface (Extended Data Fig. 9c,d). Finally, in the ETC supercomplex, 96 out of 157 unique residues with signatures of positive selection were on internal PPI sites (Extended Data Fig. 9e–g). Thus, despite surfaces being most divergent, followed by core residues and low diversification in the substrate binding sites, positive selection can occur in all structural elements (Supplementary Note 6.2).

In parallel, we noticed that fully conserved structural regions are often organized in clusters. Generating a network of fully conserved residues, resulted in a median number of 4.00 (IQR = 2.00–6.00) clusters with a median size of 18.55 amino acids (IQR = 14.00–23.64) per cluster (Supplementary Table 5.1). For example, in the pyruvate kinase orthogroup (Cdc19p of S. cerevisiae), these clusters correspond to the allosteric activation site for fructose-1,6-bisphosphate, the substrate-binding site for phosphoenolpyruvate and to a symmetric PPI site (Fig. 5f). We thus speculated that these clusters could correspond to metabolite-binding or other interaction sites. Indeed, they contain twice as many substrate- and ligand-binding-site residues (Fig. 5g and Extended Data Fig. 6f). Moreover, 91% (CSb) and 97% (UPb) of the annotated binding sites overlapped with at least one cluster. On the other hand, 27% (CSb) or 50% (UPb) of these clusters did not overlap with annotated binding sites. To test whether clusters can be identified as containing known binding sites, we trained a histogram-based gradient boosting classification tree on physicochemical properties. Evaluating ten fivefold cross-validations, we obtained an average balanced accuracy of 0.63 and average AUC of 0.68 on test data removed before training (Extended Data Fig. 6g), outperforming two random models (balanced accuracy = 0.57, AUC =  0.60 (random binding site) and balanced accuracy = 0.53, AUC = 0.54 (randomized labels); Extended Data Fig. 6h). Projecting the prediction confidence onto individual amino acids confirms that residues in known binding sites are more likely to be predicted as part of a binding site (P < 1 × 10−38, two-sided Wilcoxon signed-rank test, adjusted Cliff’s Δ = 0.78). Moreover, we tested the degree to which these clusters can correspond to other types of interaction site and observed a slight over-representation of known PPI sites as extracted from the RCSB PDB (Extended Data Fig. 6i; Cliff’s Δ = 0.23 (compared with small-molecule-binding sites with 0.84 (CSb) and 0.92 (UPb)), which is consistent with a high conservation of PPI sites (Extended Data Fig. 6j). In agreement, enzymes with more physical PPIs (Extended Data Fig. 6k; Kendall τ = 0.32; Padj. < 1 × 10−24) and enzymes involved in protein complexes44 (ComplexomeDB; Fig. 5c) were more highly conserved. Thus, although highly conserved clusters are dominated by small-molecule-binding sites, they also reflect PPI sites. Notably, some known small-molecule-binding sites were not in highly conserved clusters. This situation might be explained by the small size of some small-molecule-binding sites (such as metal interaction sites), but also points to the need for better annotation of existing sites.

Discussion

The metabolic network is evolutionarily ancient, and its origins are commonly explained by two prevailing hypotheses. One suggests that its topological organization emerged as a consequence of enzyme evolution, whereas the other proposes that the metabolic network structure originated from non-enzymatic reactions45. Although metabolic evolution has probably experienced elements of both, increasing amounts of experimental evidence favour the second scenario. For example, many enzyme-catalysed reactions resemble non-enzymatic reactions as promoted by metal ions found frequently in Archaean sediment46,47. Moreover, despite considerable divergence in the enzyme sequences, the basic structure of the metabolic network remains conserved3,48. At the same time, modern metabolic pathways are highly efficient and respond to the environment, which suggests that they are optimized during evolution49.

Studies into enzyme evolution support the model in which metabolic pathways evolve alongside chemical topologies, followed by their evolutionary optimization. Comparative genomics and detailed structural and functional investigations have described enzyme evolution as a dynamic process shaped by genetic innovations, biological constraints, cost and ecological interactions5,50,51,52. Functional clustering of enzymes and their domains into families has provided insights into the extent of gene duplication and divergence across the tree of life40,53,54,55,56. Furthermore, studies on niche adaptation have demonstrated how metabolic capabilities are lost and gained through both the expansion of gene families and the functional diversification of promiscuous enzymes2,18,19. These findings suggest that enzyme evolution follows selective pressures that balance catalytic efficiency with cellular metabolic demands and resource allocation.

We speculated that the ability to generate protein structures systematically and across species barriers1 enables the integration of high-resolution structural data into functional genomic approaches, aiding the understanding of evolutionary processes. Such an approach is certainly constrained by the accuracy of structural prediction39,57. We focused on 26 diverse species selected from the Saccharomycotina subphylum to build on well-annotated genome sequences, enzyme functional annotations, genome-scale metabolic network reconstructions and proteomic data. The 11,269 enzyme structures examined cover most metabolic pathways, GO terms and enzyme classes present in their metabolic networks. We asked whether metabolic properties that differ between species, pathways and enzyme classes can help to explain the range of sequence diversity in structurally conserved regions observed in the different orthogroups. In this way, we identified metabolic constraints that influence structural evolution at different biological scales, from the level of the organism to those of pathways, individual enzymes and enzyme substructures. We also identified pathways, enzyme classes and, in some instances, structural elements that contribute to these relationships.

At the scale of the organism, we found that structural evolution is influenced by niche specialization, such as nutritional preferences. This included changes from oxidative to fermentative metabolism, which are a dominant metabolic module in yeast, other microorganisms and higher organisms. Changes from fermentation to respiration represent a major metabolic shift as fermentation is faster and less costly in terms of resource allocation, but oxidative metabolism has a better stoichiometry for ATP production, and imposes constraints on antioxidant metabolism13,58,59.

At the pathway scale, our data suggest that enzyme structural evolution depends on pathway membership, the type of reaction catalysed and interactions with other metabolites. Enzymes involved in central carbon metabolism, oxidoreductases and metal-binding enzymes were the most constrained, whereas hydrolases and enzymes of more peripheral metabolic pathways, such as those functioning in lipid metabolism or protein glycosylation, diversified. Interestingly, we report an interdependency between enzyme conservation, metabolic flux and processivity, but find that variability of flux and processivity was more important than the total amount, indicating for these properties that the dynamic nature of metabolism constrains enzyme evolution more than do static properties.

At the structural level, our study confirmed that high-abundance enzymes evolve towards a more cost-efficient amino acid composition14,15,16 with cost optimization depending on enzyme class and structural element, and distinct amino acid substitutions prevailing in different contexts. Most optimization occurs on surface regions, with the least in binding sites. Furthermore, structural features exhibit specific trends in amino acid substitutions. For example, alanine is more prevalent in the core, whereas glycine residues are more common on the surface of highly conserved enzymes.

It has been proposed that the diversification of enzyme structures can lead to new metabolic capabilities4,5,55. Our dataset is consistent with this possibility but highlights that most of the structure-driven diversification results in shifts between chemically similar reactions. Our dataset identifies no example of a higher order change in enzyme function, and at the molecular level, binding sites are highly conserved and not optimized for costs. Therefore, we observe the formation of clusters of high structural conservation in small-molecule-binding sites. Our data suggest that these clusters can be used to annotate previously undescribed binding sites, whereas some serve other functions such as PPI.

Across all our analyses, we obtain a consistent picture that the relationship between structure and evolutionary constraints is dominated by catalytic function. Notably, hydrolases differ in several properties (Extended Data Fig. 10a–e) and escape most of the otherwise highly pronounced relationships. This is not due to overall structural differentiation, as the MR was not significantly different to other enzyme classes (Extended Data Fig. 10f), neither was it due to changes in binding site conservation (Extended Data Fig. 10g,h). We speculate that a contributing factor to this situation is that hydrolases do not require a cofactor60, that their reaction mechanism is thus less constrained in evolution, and that they participate in a diverse spectrum of metabolic processes. By contrast, the high conservation of oxidoreductases is predominantly explained by their role in central metabolism.

Overall, this dominance of the catalytic constraints across all layers investigated is consistent with a model in which metabolic enzymes evolve alongside the chemical topology of the metabolic network, with structural components involved in catalysis changing the least. An alternative hypothesis, that constraints on enzyme structure drive changes in metabolism, would result in more flexible binding sites in conserved structures. This notion does not rule out that structural changes in enzymes could cause changes in substrate specificity and amplification of promiscuous activities that could pave the way for the evolution of metabolic pathways, for instance, through the selection of a promiscuous reaction. These findings illuminate the relationship between enzyme function, metabolic environment and structural evolution, providing innovative strategies for enzyme annotation and metabolic network engineering.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.