Abstract
The differential cyclization and rearrangement of 2,3-oxidosqualene controlled by oxidosqualene cyclases (OSCs) represents one of the most complex single enzyme transformations in nature and gives rise to a vast array of triterpenoid diversity in the plant kingdom. Here we systematically mine 599 plant genomes representing 387 species and investigate OSC diversity across different plant lineages. From the OSC sequences identified, 20 were selected for functional evaluation. Through analysis of these enzymes, we discover product profiles within clades previously believed to be functionally conserved and OSCs producing triterpenes for which no enzymatic source was known. We also discover OSCs with product profiles that yield mechanistic insights into the control of specific reaction pathways. Our study reveals lineage-specific blooms of OSC subgroups suggestive of adaptation to different environmental niches, opens up previously inaccessible chemistry and provides a framework for systematic investigations of metabolic diversification and underlying enzymatic mechanisms in the plant kingdom.

Similar content being viewed by others
Main
The terpenoids are among the largest classes of plant natural products, with >80,000 reported to date1. Their scaffolds are biosynthesized from linear substates containing multiple five-carbon (C5) isoprene units to give cyclized molecules of varying complexity, for example monoterpenes (C10), sesquiterpenes (C15), diterpenes (C20) and triterpenes (C30). The triterpenoids are the most diverse and structurally complex of all of the terpenoids2. These compounds have many important functions in plants, providing protection against pests and pathogens, shaping the root microbiome and influencing crop quality3,4. They are also precursors to essential steroid membrane components and hormones and serve as scaffolds for the biosynthesis of steroidal specialized metabolites (for example, steroidal glycoalkaloids)5. The triterpenoids are a particularly rich source of bioactive molecules and are of considerable commercial interest for medicinal and other uses3,6. Examples include the vaccine adjuvant QS-21 produced by the Chilean soapbark tree Quillaja saponaria7,8, the anti-inflammatory compound escin from horse chestnut9,10 and bee-friendly insecticidal limonoids produced by neem (Azadiracta indica) and other members of the Meliaceae family11.
Over 20,000 triterpenoids have been reported so far from nature2, the vast majority of these from plants. The triterpene scaffolds that form the basis for this plethora of plant triterpenoids are generated from a single linear C30 precursor 2,3-oxidosqualene, by a family of enzymes known as oxidosqualene cyclases (OSCs). Collectively, OSCs are capable of cyclizing 2,3-oxidosqualene into an array of diverse triterpene scaffolds (>200 reported so far) through a complex origami-like process involving sequential carbocation cascades (Fig. 1)5,12,13. These scaffolds can then be modified by cytochromes P450 and other tailoring enzymes (for example glycosyltransferases and acyltransferases) to give enormous structural diversity2,6. Around 170 plant OSCs have been functionally characterized to date, collectively accounting for ~60 of the >200 triterpene scaffolds represented in the natural product databases14. The OSCs for the remaining ~140 reported triterpene scaffolds (orphan scaffolds) are as yet unknown. The amount of triterpene scaffold diversity represented in plants is in reality likely to be much greater than this, as a number of examples of OSCs that make entirely new triterpene scaffolds have recently been reported, even for well-characterized species such as rice15,16.
a, The protosteryl–dammarenyl cation dichotomy in triterpene reaction pathways and their relationship to lanosterol and euphol. b, Examples of structural diversity originating from E-ring expansion of the dammarenyl cation, showing the relationship between the final scaffolds and the shared intermediatory carbocations. c, Carbon numbering and ring nomenclature used in this Article. The red asterisk in b indicates a cation referred to in Fig. 5.
The conversion of 2,3-oxidosqualene to different triterpene scaffolds is one of the most complex enzymatic transformations observed in nature and has fascinated chemists for the best part of a century. In the 1950s, attempts to mechanistically rationalize the stereochemical divergence of two constitutionally identical tetracyclic triterpenes, lanosterol and euphol, led to the formulation of the biogenic isoprene rule17,18. This rule postulated that the stereoselectivity of initial ring-forming electrophilic additions is controlled through the prior conformational folding of the substrate to either a chair–boat–chair (CBC) or chair–chair–chair (CCC) arrangement with respect to the A, B and C rings (Fig. 1a). Subsequent formation of the D ring would result in two different tetracyclic cation intermediates, now termed the protosteryl and dammarenyl cations, respectively. Resolution of these two cations through an analogous series of 1,2-hydride and 1,2-alkyl shifts terminating in an elimination yields the neutral alkenes, lanosterol (from the protosteryl cation) and euphol (from the dammarenyl cation) (Fig. 1a). The protosteryl cation can in addition give rise to other CBC conformation products such as parkeol, cycloartenol and cucurbitadienol (Fig. 1a), while the dammarenyl cation is the gateway to a huge array of CCC conformation products that in turn form the basis of a swathe of plant specialized triterpenes (Fig. 1b). The paradigm that all triterpenes comply with this stereochemical conformity and can be rationalized mechanistically as originating from one of these two common intermediates is not, however, absolute. For example, the recent discovery of a divergent OSC from rice that produces a pentacyclic triterpene (orysatinol), the stereochemistry of which cannot be reasoned to originate from either the protosteryl or dammarenyl cation, provides an exception to this rule15,16. Although multiple amino acid residues have been shown by mutational analysis to be involved in OSC function14,19,20, the features determining product specificity and mechanisms of triterpene cyclization are not well understood.
It is clear that the natural products reported from plants to date represent only the tip of the iceberg, as sequencing has revealed that around 20% of the genes within plant genomes are predicted to have yet-uncharacterized functions in specialized metabolism21. The amount of available plant genome sequence data has expanded rapidly over the last 10 years and is set to explode with the advent of large-scale sequencing initiatives such as the Earth BioGenome Project22. This opens up opportunities to investigate the potential of plant genomes to encode enzymes that make different chemistry through systematic genome mining and functional analysis of OSCs, to understand the mechanisms of triterpene cyclization and to reveal how the ability to synthesize different types of triterpenes has arisen in different plant lineages suggestive of adaptation to environmental niches.
There are currently over 4,800 land plant species represented in the National Center for Biotechnology Information (NCBI) genome database. However, fewer than 15% of these genomes have been submitted with accompanying gene annotation data (www.ncbi.nlm.nih.gov/genome/). In contrast to microbes, predicting accurate gene models without transcriptome mapping in plants is nontrivial because of variability in genome size and complexity, splice site frequency, intron size and intergenic distances23,24,25. In this study, we use the highly accurate and easy to use Selenoprofiles26 in a sequential pipeline with PSI-tBLASTn27, Exonerate28 and GeneWise29 as a targeted approach for the identification of putative OSC gene models from unannotated plant genome sequences.
Results
Mining OSC sequence diversity across the plant kingdom
We used a total of 599 plant genome sequences of varying annotation quality representing 387 species within the Viridiplantae (Supplementary Data 1). The order-level taxonomy of the plant species used is shown in Supplementary Fig. 1, with species with multiple genomes counted as one. The computational workflow for high-throughput systematic mining for OSC genes is shown in Extended Data Fig. 1. A total of 2,068 unique, nonoverlapping predicted OSC genes were identified from genomes of 258 species. These were filtered on the basis of sequence length and gene model confidence to give 1,405 high-quality OSC sequences (Methods), of which 809 sequences were derived from unannotated genome data using Selenoprofiles.
A maximum-likelihood tree of the amino acid sequences of the 1,405 high-quality OSC genes, along with 83 functionally characterized OSCs (Supplementary Data 2), is shown in Fig. 2. The structures of the products of these enzymes are shown in Supplementary Fig. 2. The OSC clades are labeled A–N following the nomenclature used in previously published analyses5,30,31, with cycloartenol synthases from green algae and early diverging land plants forming the root (group A). Eudicot OSCs that produce protosteryl-derived products form groups B and C; group B contains both cycloartenol and cucurbitadienol synthases, while group C appears to contain cycloartenol synthases only. Monocot OSCs that make protosteryl-derived and dammarenyl-derived products fall primarily into groups D and E, respectively. Group F consists of lanosterol synthases, which are specific to eudicots32,33. Group G is a monocot-specific clade from which a single OSC producing the bicyclic product poaceatapetol has been characterized in rice34. All characterized OSCs in group H are lupeol synthases, although lupeol synthase activity is also found in other groups. The remaining groups (I–N) represent a large monophyletic dicot clade that contains β-amyrin synthases and other diverse types of plant OSCs5,30,31. Groups K and N are specific to brassicas, corresponding respectively to the clades I and II reported by Liu et al35. Group J is of particular note, as all eudicot genomes (with the exception of Kalanchoe spp. and Brassica spp.) have at least one OSC in this group. Groups J and K contain characterized β-amyrin synthases (primarily in group J) and also multifunctional OSCs that make α-amyrin and other mixed products. As such, these two groups appear to form a (paraphyletic) core group of OSCs that likely function as β-amyrin synthases or as other types of dammarenyl-derived triterpene synthases. Outside of these core groups, groups I, L, M and N all contain OSCs that produce dammarenyl-derived compounds, although there is little indication of conservation of function within the groups. Duplication 1 (D1) (Fig. 2) represents the ancient gene duplication of the ancestral cycloartenol synthase (ACS) and the ancestral lanosterol synthase-like (ALSL) OSCs (>140 million years ago)31. D2 and D3 represent the origins of dicot (D2) and monocot (D3) OSCs that make dammarenyl-derived products. It is evident that the monocots have independently evolved dammarenyl-derived biosynthetic function through the ACS clade, as opposed to the ALSL clade as in eudicots. A summary of the representation of the different OSC groups across the Viridiplantae is provided in Extended Data Fig. 2.
Maximum-likelihood tree of 1,405 OSC sequences mined from plant genomes, along with 83 previously characterized OSCs (Supplementary Data 2). Symbols on the tree indicate the products made by these characterized OSCs (key, bottom left). The structures of compounds named here are presented in Supplementary Fig. 2. Top left: the branch colors indicate the plant clades to which the OSCs belong. The OSCs are grouped into clades A–N, following nomenclature published previously5,30,31, with some OSC groups sharing apparent functional and/or plant clade specificity. D1 represents the ancient gene duplication of the ACS and ALSL OSCs as in ref. 31. D2 and D3 represent the origins of dicot (D2) and monocot (D3) OSCs that make dammarenyl-derived products. The OSCs selected for functional analysis here (OSCs 1–20) are indicated by yellows stars.
Current Pfam classification groups all eukaryotic OSCs into a single family together with prokaryotic squalene-hopene cyclases (SQHop_cyclase_N/C: PF13249, PF13243). Having established the above model of phylogenetic diversity of OSCs across plants, we next generated profile hidden Markov models (pHMMs) (Supplementary Data 3) for each of OSC groups A–N to enable classification of OSCs without the need for alignments and tree building (an approach that we term OSC profiling). To demonstrate the utility of our pHMM-based profiling approach, we applied this to the genomes listed in Supplementary Data 1 (Supplementary Fig. 3). Figure 3 shows a subset of these OSC profiles for species across a range of plant clades, while profiles for selected plant clades can be found in Supplementary Fig. 4. This analysis again demonstrates how the diversity of OSCs differs across plant lineages and allows rapid assessment of potentially interesting or unusual OSCs in a given species. It is also possible to quickly assess clade-specific or species-specific blooms of OSCs of different groups (as defined by a mean copy number of ≥4 (Supplementary Fig 5), for example, group A OSCs in spikemosses (Selaginella spp.), group L OSCs in widow’s thrill stonecrops (Kalanchoe spp.) and the expansion of the apparently lupeol-specific group H OSCs in the Fabales species walnut (Juglans regia) and the peanut relative Arachis duranensis. Some caveats apply to the interpretation of the outputs of this approach. For example, high overall OSC numbers can in some cases be because of polyploidy (for example, bread wheat, Triticum aestivum), while relatively low numbers may be indicative of poorer quality genome assemblies (for example, for hop, Humulus lupulus, and bottle gourd, Lagenaria siceraria) (Supplementary Fig. 3). OSC groups can also be observed to have variable propensity to occur in putative biosynthetic gene clusters (BGCs; as detected by plantiSMASH)36, such as the notable clustering seen with group N OSCs (that is, those identified as clade II OSCs by Liu et al.35) (Supplementary Fig. 6). There is enrichment of the apparently conserved OSC group F (lanosterol) and group H (lupeol) in BGCs. No BGCs have been characterized thus far that use OSCs from these groups, suggesting that undiscovered metabolic pathways might be found in plants using OSCs from these groups.
Functional analysis of selected OSC sequences
We next selected a total of 20 previously uncharacterized OSC sequences for functional analysis (labeled OSCs 1–20 in Figs. 2 and 3; alignment shown in Supplementary Fig. 7). These OSCs were prioritized because they are either located in divergent branches of the phylogenetic tree or in clades that have not been well characterized. Further information about these 20 OSCs, along with the reason for their selection, is provided in Supplementary Table 1. OSC coding sequences were synthesized and cloned into the pEAQ-HT-DEST1 expression vector before Agrobacterium-mediated transient expression in Nicotiana benthamiana37. The OSCs were coexpressed with a feedback insensitive mutant variant of HMG-CoA reductase (tHMGR) that boosts precursor supply and, hence, triterpene yields38,39. Then, 5 days after agroinfiltration, the leaves were isolated and extracts were analyzed by gas chromatography–mass spectrometry (GC–MS). The total ion chromatograms (TICs) of extracts from leaves expressing each OSC were compared to those of leaves infiltrated with A. tumefaciens containing the tHMGR expression construct alone (which served as a negative control) (Fig. 4a). The triterpenes present in these extracts were monitored on the basis of retention time and estimated abundance relative to an internal standard (IS; coprostanol, 100 ppm). Unique OSC products were subsequently indexed by their electron impact (EI) mass spectra and numbered in order of increasing relative retention time (Supplementary Data 4). Putative structural assignments of the indexed triterpenes were initially made on the basis of a comparison of the fragmentation patterns observed in each EI spectrum to those reported in the literature for known triterpene compounds. These putative assignments were then confirmed by comparison to authentic standards where available. Remaining unconfirmed assignments were selectivity targeted for further investigation and either confirmed or revised on the basis of nuclear magnetic resonance (NMR) analysis (Fig. 4b and Supplementary Figs. 8–28).
a, Representative TICs for OSCs 1–20 and the control (tHMGR expressed alone). Retention times are relative to an IS (coprostanol, 100 ppm). The product peaks are numbered 1–40 in order of increasing retention time (EI spectrum in Supplementary Data 4). S, squalene; OS, oxidosqualene. b, Summary of key data (relative retention times and structures where known) for the products shown in a (in numerical order from top left, reading down the columns). Red numbers, assignments confirmed by comparison with authentic standards or NMR. Blue numbers, putative assignments based on comparison of EI spectrum to those reported in the literature (details in Supplementary Figs. 8–28). Gray numbers, unassigned products along with their respective relative retention times. Compound 36 (†) degrades into two peaks during extraction (Supplementary Fig. 12); compound 39 (‡) partially degrades to compound 21 during GC–MS (Supplementary Fig. 8). *The relative retention time scale is approximate. Source TICs are provided in Supplementary Data 4.
Triterpene products were detected by GC–MS for 16 of the 20 OSCs tested (Fig. 4a). Overall, a total of 41 distinct triterpene products were detected. These included 19 known or putatively assigned triterpenes and a further 22 unassigned triterpenes (Fig. 4b and Supplementary Table 1). The product profiles of the 16 functional OSCs are shown in Extended Data Figs. 3 and 4. Five of the OSCs produced one major detectable product, while the remaining 11 produced multiple products. For the four OSCs that did not produce any detectable products, this may be because of errors in the publicly available sequences, failure to express or fold correctly, lack of catalytic activity or conversion of products to undetected compounds in N. benthamiana.
OSC16 produces a unique E-ring methyl group arrangement
The group L enzyme OSC16 from Kalanchoe fedtschenkoi (Fig. 2) was found to be a highly multifunctional enzyme producing a total of 13 compounds. Most of these, however, were minor side products (including dammarenyl-derived triterpenes eupha-7,24-dien-3β-ol (11) and α-amyrin (20)) and the product profile was dominated by two coeluting unassigned triterpenes (compounds 24a and 24b) (Fig. 4, Extended Data Figs. 3 and 4 and Supplementary Table 1). These two compounds were subsequently resolved and isolated by preparative high-performance liquid chromatography (HPLC) following infiltration of 30 N. benthamiana plants. Extensive one-dimensional and two-dimensional NMR analysis of the resulting fractions (Supplementary Figs. 29–36) revealed their structures to have the connectivity of pentacyclic triterpene scaffolds, proposed herein, and named kalanchoeol (24a) and spirokalanchoeol (24b) (Fig. 5). Spirokalanchoeol (24b) has a 6,6,6,6,5 scaffold containing an unusual spirocyclized D/E-ring junction. Although this basic ring architecture is present in some previously reported triterpenoid natural products such as cleistanone (produced by the plant species Cleistanthus indochinensis)40, the methyl group arrangement in the E ring of spirokalanchoeol (24b) is not. The presence of the methyl group at C19, which is vicinal to an intact pair of geminal methyl groups at C20, would appear to require a rarely invoked 1,3-alkyl shift before the ring contraction that we propose affords the spirocyclized D/E-ring junction (Fig. 5). The structure of the product kalanchoeol (24a) supports this proposed mechanism; it represents an alternative resolution of the C17 cation that would be generated from this proposed 1,3-alkyl shift, with C16 proton elimination instead of the ring contraction affording the D-ring alkene of kalanchoeol (24a). In contrast, the triterpene precursor of the cleistanone-type triterpenoids is proposed to form through D-ring expansion followed by a Wagner–Meerwein rearrangement (Fig. 5)40. The stereochemistry of kalanchoeol (24a) and spirokalanchoeol (24b) is presented in its most likely configuration based on the stereospecific rearrangements required from intermediatory dammarenyl-derived cations. The group L clade to which OSC16 belongs is unique to Kalanchoe. The C19 cation that precedes the 1,3-alkyl shift in the formation of kalanchoeol (24a) and spirokalanchoeol (24b) is shared with the reaction pathways of other previously characterized group L OSCs (Fig. 2) that produce known pentacyclic 6,6,6,6,6 triterpenes such as taraxerol, friedelin and glutinol (Figs. 1 and 5). A detailed phylogenetic tree showing the group L OSCs along with the triterpene products of OSC16 and other characterized group L enzymes is shown in Supplementary Fig. 37. Detailed computational and experimental investigations of the mechanism of action of OSC16 will form the basis of future studies.
The structures of kalanchoeol (24a) and spirokalanchoeol (24b) produced by OSC16 and their proposed reaction pathway relationship to a common cationic intermediate shared with pentacyclic triterpenes such as β-amyrin (13) and friedelin (cation marked with the red asterisk in Fig. 1). The difference between the likely reaction mechanisms of these scaffolds and the known spriocyclized cleistanone-type triterpenoids is shown. The structure of kalanchoeol (24a) can be seen to support the proposed 1,3-shift of the C28 methyl group before ring contraction in the formation of spirokalanchoeol (24b). The stereochemistry of kalanchoeol (24a) and spirokalanchoeol (24b) is presented in its most likely configuration based on the stereospecific rearrangements required from intermediatory dammarenyl-derived cations and may be open to future revision.
Broader insights into OSC function and diversification
The products of all 16 OSCs characterized in this study are summarized in Supplementary Table 1, along with their species of origin and the specific names that we have assigned them. Our analyses revealed a wealth of additional insights into OSC function and diversification, of which several are highlighted here. For example, OSC6 and OSC7 both belong to group H, a phylogenetic group previously believed to be a functionally conserved group of committed lupeol synthases (Fig. 2). OSC10 belongs to group I, which contains a previously characterized near-committed α-amyrin synthase (Fig. 2). These three enzymes were all found to be mixed-amyrin synthases, producing β-amyrin (13) along with several other closely related products that all share the lupanyl cation (the terminal cation in the formation of lupeol) as a common intermediate in their respective reaction pathways (Extended Data Fig. 5). Thus, as for OSC16, the structural diversity seen in the product profiles of these OSCs represents an expansion of previously known cladal functionality.
OSC8 and OSC20 both produce the same monocyclic product, putatively assigned as achilleol A (1) (Fig. 4), and are rare examples of committed OSCs that produce a monocyclic product, the only other example being the near-committed Arabidopsis thaliana OSC CAM1 (ref. 41), which produces the monocyclic triterpene, camelliol C (Fig. 2). OSC8 and OSC20 are distantly related (group I and group N, from Gossypium bardense and Brassica rapa, respectively) demonstrating apparent independent evolution. OSC19 represents a first-in-class enzyme in the production of a committed 3β,20-dihydroxylupane synthase (39). This diol has only previously been reported as a minor side product of a highly multifunctional group K OSC (LUP1) from A. thaliana42. OSC19 is a group N OSC from Tarenaya hassleriana (a close sister species of A. thaliana). OSC11 (a group J OSC from Pyrus × bretschneideri) is a committed friedelin synthase. OSC11 may, therefore, be an example of independent evolution of friedelin synthase functionality compared to the previously characterized committed or near-committed friedelin synthases from Kalanchoe daigremontiana (group L)43, Maytenus ilicifolia (group M)44 and Tripterygium wilfordii (group M)45.
OSC1 from agarwood (Aquillaria agallochum, syn. A. malaccensis) produced >90% cucurbitadienol (6), effectively representing a committed enzyme (Extended Data Figs. 3 and 4)—a cucurbitadienol from outside the Cucurbitales. Known cucurbitadienol synthases form a conserved subset of group B OSCs that are closely related to eudicot cycloartenol synthases (Fig. 2). OSC1 is also located within this clade and, thus, may share a common evolutionary origin with the other previously characterized cucurbitadienol synthases. This suggests that the functional conservation of this clade likely extends to a broader range of species and inclusion may be highly predictive of cucurbitadienol synthase activity.
The product of OSC4 (from the monocot Asparagus officinalis) was putatively assigned as the bicyclic triterpene (9R,10S)-polypoda-8(26),13E,17E,21-tetraen-β-ol (17), a closely related isomer of poaceatapetol (Extended Data Fig. 6), and is a committed OSC outside of the Poaceae that produces a bicyclic poaceatapetol-like triterpene. Previously reported poaceatapetol synthases from grass species are pollen specific. The gene encoding OSC4 is specifically expressed in the stamen of A. officinalis, suggesting a conserved biological role of poaceatapetol-like synthases within the monocots, extending beyond the Poaceae (Extended Data Fig. 6).
Mutation of OSC9 reveals residues critical for product specificity
OSC9 from Japanese morning glory (Ipomoea nil) produces four tetracyclic triterpenes: euphol (5), eupha-7,24-dien-3β-ol (11), tirucallol (7) and tirucalla-7,24-diene-3β-ol (18) (Figs. 4 and 6a and Extended Data Figs. 3 and 4), with euphanes (5 and 11) being favored in an approximately 2:1 ratio over the tirucallanes (7 and 18) and the respective Δ9 (5 and 7) and Δ7 (11 and 18) alkene isomers of each class being produced in a roughly 1:1 ratio. It has traditionally been thought that the euphane and tirucallane scaffolds are derived from 1,2-shift migrations of the dammarenyl cation, with the differing stereochemical configuration at C20 resulting from prior differential side-chain rotation around the C17–C20 bond (Extended Data Fig. 7)12. The discovery that OSC9 produces both euphanes and tirucallanes supports the derivation of these scaffolds from the same intermediatory tetracyclic cation. The respective alkene isomers represent resolution of the same terminal C8 cation through proton elimination at either C7 or C9 (Extended Data Fig. 7). Therefore, OSC9 shows a committed sequence of 1,2-shift migrations, with different products being derived only from the initiation and termination of this cascade. Given the refined control of scaffold rearrangement observed, it is tempting to speculate that OSC9 represents an OSC captured in the process of evolving through the duplication and diversification of a committed euphane synthase ancestor to an enzyme that can produce the tirucallane scaffold (or vice versa).
a, Structures of eupha-7,24-dien-3β-ol (11), euphol (5), tirucallol (7) and tirucalla-7,24-diene-3β-ol (18). b, Lowest-energy (affinity = −16.5 kcal mol−1) flexible-residue docking model of the dammarenyl cation with the OSC9 homolog model (Supplementary Data 5, PDB file of the OSC9 homology model; Supplementary Data 6, PDBQT file of the lowest-energy binding model of the dammarenyl cation and OSC9 flexible residues). The position of the C20 cationic carbon and the residues targeted for substitution in this study are highlighted. c, Representative extracted ion chromatograms (EICs, selected ion 411) for the wild-type OSC9 enzyme and mutant variants (scaled to the largest peak) for key substitutions.
Because of the mechanistic importance of this product profile, we next investigated the control of euphane versus tirucallane scaffold production through docking-directed mutagenesis studies of OSC9. The positioning of the dammarenyl cation in the active site of OSC9 was explored by in silico flexible-residue docking to a homology model of the enzyme (Supplementary Data 5). Inspection of the lowest-energy binding model (Fig. 6b and Supplementary Data 6) revealed that the dammarenyl cation adopted a conformation closest to that required for the first hydride shift to result in the production of euphanes, requiring just 38° of clockwise side-chain rotation compared to 142° of counterclockwise rotation for the production of tirucallanes (Extended Data Fig. 8). This is consistent with a conformational equilibrium that favors euphanes. The amino acid sequence of OSC9 was next aligned with two known committed tirucallane synthases (Supplementary Fig. 38), tirucallol synthase (CrOSC) from Capsella rubella35 and tirucalla-7,24-diene-3β-ol synthase (AiOSC1) from Azadirachta indica46. Active-site amino acid residues were then selected for investigation on the basis of their position in the docking model and/or conserved sequence divergence from the tirucallane synthases (Fig. 6b and Supplementary Fig. 38). Residues F549, S724, T561, Y256, Y725, W216 and W254 were prioritized for mutational analysis. The rationale for this was as follows. Inspection of the docking model suggested that F549 occupied an area of the active site where steric effects could influence side-chain rotation and it was found to be substituted for residues of smaller molar volume in both the tirucallane synthases (leucine or isoleucine)47. As such, a series of OSC9 mutants differing in the molar volume of residue 549 were designed for investigation (F549G, F549A, F549Q, F549I, F549Y and F549W). Conserved active-site substitutes were observed at S724 and T561, being valine and cysteine, respectively, in both tirucallane synthases (S724V and T561C). Two tryptophan residues situated in the side-chain binding pocket were also individually substituted for glycine (W216G and W254G). All were selected to investigate any potential effect on euphane versus tirucallane production. The docking analysis also showed that the cationic carbon (C20) was sandwiched between two potentially stabilizing aromatic tyrosine residues that were conserved across all three enzymes. These were individually substituted for glycine to investigate their potential importance in favoring tetracyclic compounds (Y256G and Y725G). This initial series of 12 OSC9 mutants (F549G, F549A, F549Q, F549I, F549Y, F549W, S724V, T561C, Y256G, Y725G, W216G and W254G) was subjected to functional analysis. Mean euphane content as a percentage of total euphane and tirucallane content, mean Δ7 isomer content as a percentage of total Δ9 and Δ7 isomer content and the percentage change in total accumulation of euphanes and tirucallanes for each mutant compared to the wild type are shown in Extended Data Fig. 9.
The product profiles of mutants F549G, F549A, F549Q, F549I, F549Y and F549W revealed a potential positive relationship between the molar volume of residue 549 and percentage euphane content (Extended Data Fig. 9a,d). However, the smaller molar volume of isoleucine at this position (seen in the tirucallol synthase) was not sufficient to favor tirucallanes in OSC9. The greatest effect on percentage euphane content was observed with mutant S724V (Fig. 6c). This mutant clearly favored production of tirucallanes, with the percentage euphane content dropping from 67.7% (95% confidence interval (CI): 66.6% to 68.8%) to 13.6% (95% CI: 12.6% to 14.7%) (Extended Data Fig. 9a). This was a conserved substitution seen in both tirucallane synthases (Supplementary Fig. 38). However it is not clear why S724V favors tirucallanes from a three-dimensional (3D) structural perspective because this residue is not situated in the side-chain binding pocket and instead sits over the B/C-ring junction of the dammarenyl cation (Fig. 6b).
These initial results prompted the investigation of additional mutants. Broader inspection of the sequence alignment, allowing consideration of neighboring residues that did not directly occupy the active site in the docking model, revealed that residue 548 of OSC9 is leucine, whereas it is phenylalanine in both tirucallane synthases. Thus, the conserved difference between OSC9 and both tirucallane synthases may be better described as a switch in positioning of residue 548 and 549, rather than simply a single substitution at the latter (Supplementary Fig. 38). L548 sits behind the active-site-orientated face of F549 in the docking model (Fig. 6b). To investigate whether switching residues 548 and 549 had a different effect to single substitutions at residue 549, both tirucallol and tirucalla-7,24-diene-3β-ol synthase donor switch mutants were tested (F549I;L548F and F549L;L548F), along with the tirucalla-7,24-diene-3β-ol synthase donor single 549 substitution (F549L) that was not tested in the first set of mutants studied. Both switch mutants clearly favored production of tirucallanes, with the percentage euphane content dropping to 8.8% (95% CI: 7.6% to 9.9%), and 10.3% (95% CI: 8.2% to 12.3%) for F549I;L548F and F549L;L548F, respectively, compared to 70.7% (95% CI: 69.9% to 71.4%) for the wild-type enzyme (Extended Data Fig. 9a and Fig. 6c). The F549L mutant also displayed a similar drop in percentage euphane content to 13.6% (95% CI: 12.9% to 14.3%), which was not seen with the F549I single substitution. However, in the first round of substitutions, mutant F549I showed an increase in production of Δ7 isomers (Extended Data Fig. 9b and Fig. 6c). Thus, double mutants F549L;S724V, and F549I;S724V were also tested to probe a potential role of leucine versus isoleucine in the control of Δ7 versus Δ9 isomers. These both showed the expected favoring of tirucallanes attributed to the S724V substitution. However no differential effect in the production of Δ7 isomers was observed, with both F549L and F549I substitutions resulting in a small increase in Δ7 isomer production compared to the wild type (Extended Data Fig. 9). Taken together these results indicate that there are multiple substitutions that can independently lead to the favoring of tirucallane products. Both valine in place of serine at position 724 and a switch in positioning of phenylalanine with either leucine or isoleucine at positions 549 and 548 may be diagnostic of tirucallane synthases (Fig. 6c).
Discussion
The development of a scalable and robust high-throughput genome mining pipeline has made available direct access to the coding sequences of thousands of plant OSC genes, thereby providing a framework for functional characterization. Many of these genes were previously hidden by the lack of publicly available genome annotations. Large-scale phylogenetic analysis, incorporating previously functionally characterized OSCs, affords the opportunity for the targeted selection of enzymes for study. This potential was demonstrated through the functional screening carried out in this work. Functional analysis of just 20 OSCs (selected because they were either located in divergent branches of the phylogenetic tree or in clades that have not been well characterized) resulted in the discovery of triterpene scaffolds, product profiles within clades previously believed to be functional conserved, and committed OSCs that produce previously reported triterpenes for which the producing enzyme was not known or exemplars were rare (Supplementary Table 1). It revealed examples of the independent evolution of identical product profiles. Critically, it also led to the discovery of OSCs with product profiles that yield mechanistic insights into the control of specific reaction pathways. An updated version of the phylogenetic tree shown in Fig. 2 annotated with functional enzymatic information for the 16 functional OSCs characterized in this study can be found in Extended Data Fig. 10. Of the 170 previously characterized OSCs, 50 (that is, 29%) were multifunctional (14). In our study, we found that 12 of 16 OSCs (that is, 75%) were multifunctional. It is conceivable that the proportion of multifunctional OSCs within our study may have been skewed by the selection of divergent OSCs, although our sample size is too small to robustly make this conclusion. Analysis of a subset of previously characterized OSCs using our workflow gave results consistent with the literature, confirming that any discrepancy in the proportion of dedicated/multifunctional OSCs is unlikely to be because of the methods used (Supplementary Fig. 39). Multifunctional OSCs potentially represent plastic enzymes that are in the process of probing evolutionary niches through duplication and diversification.
The reliability of previously characterized cladal functions to predict the product specificity of related OSCs was found to be highly variable. However, chemistries appear to be predictable on the basis of expansion of reaction pathways that are already established with the clade. Again, this is consistent with a process of duplication and diversification. There is a pressing need for more extensive functional characterization of the OSC enzyme family. More examples of sequences of known function are required to allow comprehensive investigation of sequence–function relationships and improve the accuracy and specificity of automated functional gene annotation tools. It is clear that focusing functional characterization efforts based on the wider phylogeny of OSCs has the potential to greatly accelerate our understanding of how nature controls and exploits this complicated and highly plastic cyclization process. In summary, this study provides a framework for systematic investigation of triterpene metabolic diversification and underlying enzymatic mechanisms in the plant kingdom and, ultimately, for understanding the developmental, physiological and ecological roles of this vast array of structurally diverse specialized metabolites.
Methods
Alignments and phylogenetics
Unless otherwise stated, all alignments were carried out with MAFFT (version 7.394)48 using the global pairwise alignment model and a maximum of 1,000 iterations. Phylogenetic trees were generated with RaXML (version 8.2)49 using the PROTGAMMAAUTO model, with 100 runs and bootstraps.
Genome mining for OSCs
A total of 599 publicly available Viridiplantae genomes representing 387 species were sourced from the NCBI genome database (https://www.ncbi.nlm.nih.gov/datasets/genome/), Phytozome (https://phytozome-next.jgi.doe.gov/) and CoGe (https://genomevolution.org/coge/). Summary data for these genomes are provided in Supplementary Data 1. In some cases, both RefSeq and GenBank genome assemblies were used, as preliminary mining showed that variations in annotation pipeline led to some OSCs being missed depending on the parameters of structural gene modeling. In cases where multiple genomes assemblies were present for a single species, the genome that yielded the highest number of high-quality OSCs was used. Characterized OSCs were obtained from the literature (summarized in Supplementary Data 2). These were aligned and used to make a profile for input to Selenoprofiles (version 3)26.
To adapt Selenoprofiles for functionality with plant genomes and to provide filtering to produce high-quality putative sequences, the nonstandard parameters used were as follows (with explanation in italics):
-
p2g_filtering = len(x.protein()) > 40 or x.coverage()> 0.3
-
Perform an initial filtering pass using a minimum length of 40 aa and 30% coverage with the OSC profile.
-
p2g_refiltering = x.awsi_filter(awsi=0.2)
-
Refilter using an average weighted sequence identity of 0.2.
-
exonerate_opt = –score 300–maxintron 20000
-
pply a minimum exonerate score of 300 and an allowable intron length up to 20 kbp.
-
genewise_opt = -splice flat
-
Use GT/AG splice models.
-
blast_filtering = x.evalue < 1e-5
-
Filter results with a BLAST score < 1 × 10−5.
This mining approach was applied to all plant genomes available, including those with structural annotations. Wherever prior annotations overlapped with the generated putative annotations, the prior annotations were always selected. For HMMER (version 3.1b2)50 mining, the same alignment of the OSCs listed in Supplementary Data 2 was used as a custom pHMM and a bitscore cutoff of 500 was used to select putative OSC annotations. One genome per species was chosen for subsequent analysis on the basis of the number and quality of putative enzymes found. This resulted in the generation of 2,068 unique, nonoverlapping putative OSC sequences from 304 genomes representing 258 species.
Before alignment, high-quality putative protein sequences were filtered by requiring a minimum length of 650 aa and the removal of pseudogenes as flagged by Selenoprofiles (that is, if frameshifts or indels were required to generate the alignment of protein profile to nucleotides). In summary, the mining process produced 1,405 high-quality OSC sequences, as defined by a length of ≥650 aa and, using a custom profile of characterized OSCs, either a bitscore of >500 (for HMMER mining) or a filtered Selenoprofiles homolog result (according to the parameters above).
OSC profiling
pHMMs of representative sequences within each OSC group were generated by selecting up to 100 representative samples across each clade followed by alignment and building with HMMER. These profiles were then used for on-the-fly characterization of OSC sequences by choosing the profile that most closely matched the OSC in question. A minimum alignment span of 450 aa was selected, which was found to maintain accuracy whilst allowing putative classification of partial sequences not included in the phylogenetic analysis. It should be noted that not all groups were monophyletic; thus, accuracy was reduced when attempting to assign proteins to specific groups on the basis of sequence similarity alone for these groups.
BGC mining
plantiSMASH-release36 and its dependencies were installed according to the developer’s instructions (https://bitbucket.org/satriaphd/plantismash-release/src/master/). Genomes with structural annotations were put forward for BGC mining by plantiSMASH. Standard parameters were used, minus the maximum limit of 9,999 contigs. HTML and JavasScript outputs were parsed by Python. OSCs across the whole genome were defined by alignment to the Pfam profiles ‘SQHop_cyclase_C’ (PF13243) or ‘SQHop_cyclase_N’ (PF13249), with the highest-scoring sequence taken forward where multiple isoforms were present in the annotation.
Gene synthesis
OSCs 1–20 were selected manually, choosing those that appeared to be of interest on the basis of phylogenetic placement and branch length (Supplementary Table 1). Where possible, public RNA-sequencing data were screened to validate the expression of the OSC in the source plant. The native coding sequences of each gene flanked by the 5′ sequence (GGGGACAAGTTTGTACAAAAAAGCAGGCTTA) and 3′ sequence (TACCCAGCTTTCTTGTACAAAGTGGTCCCC) were synthesized as double-stranded linear gene fragments (eBlocks gene fragments service, Integrated DNA Technologies). OSC9 mutants: The amino acid sequence of each mutant was used to generate a coding sequence which was codon-optimized for plant expression and gene synthesis. These were synthesized flanked by the 5′ sequence (GGGGACAAGTTTGTACAAAAAAGCAGGCTTA) and 3′ sequence (TACCCAGCTTTCTTGTACAAAGTGGTCCCC) as clonal genes (pTwist, Amp High Copy, Twist Bioscience).
Cloning
The synthesized gene products were cloned into the pDONR207 vector using BP clonase II enzyme mix (Invitrogen) according to the manufacturer’s instructions. These entry vectors were verified for the correct gene insert by Sanger sequencing (Eurofins Genomics). Subsequently, these were cloned into pEAQ-HT-DEST1 (ref. 37) using LR clonase II enzyme (Invitrogen) according to the manufacturer’s instructions. These expression constructs were transformed into chemically competent A. tumefaciens (strain LBA4404) by flash-freezing in liquid nitrogen.
Transient expression in N. benthamiana
Transient expression was carried out as described previously38,39. A. tumefaciens strains were streaked from glycerol stock cultures onto selective Luria–Bertani (LB) agar plates (50 μg ml−1 rifampicin, 50 μg ml−1 kanamycin and 100 μg ml−1 streptomycin) and incubated overnight at 28 °C. These cultured streaks were used to individually inoculate 10 ml of selective LB medium (50 μg ml−1 rifampicin, 50 μg ml−1 kanamycin and 100 μg ml−1 streptomycin) and incubated overnight at 28 °C with shaking. Cells were pelleted by centrifugation and resuspended in freshly prepared MMA buffer (10 mM 2-(N-morpholino)-ethanesulphonic acid pH 5.6 (KOH), 10 mM MgCl2, 150 μM 3′5′-dimethoxy-4′-acetophenone) at an optical density at 600 nm of 0.2. Each prepared suspension was mixed with an equal volume of one carrying the expression construct for tHMGR. For each condition, three leaves of 6-week-old N. benthamiana plants were fully infiltrated by hand using a needless syringe. Following infiltration, the plants were grown for an additional 5 days under greenhouse conditions (25 °C with 16 h of lighting). The leaves were then individually isolated and freeze-dried.
Extraction and GC–MS analysis
Each freeze-dried leaf was ground into a fine powder. Approximately 20 mg of leaf powder was transferred to a 1.5-ml Eppendorf tube and the weight transferred recorded. The leaf powder was suspended in 300 μl of ethyl acetate containing coprostanol as an IS (the concentration of coprostanol was 100 ppm for the initial functional screening and 200 ppm for the analysis of the OSC9 mutagenesis experiments). The suspensions were heated at 55 °C with sonication by floating the Eppendorf tubes in a sonicating water bath. After 1 h, the leaf powder was pelleted by centrifugation. The clarified supernatant was transferred to an HPLC vial for direct analysis by GC–MS. GC parameters were as follows: injection, 1 μl; pulsed split (20:1 split ratio), 30 psi; inlet temperature, 250 °C; column, Phenomenex Zebron ZB5-HT Inferno; carrier gas, He; flow rate, 1.2 ml min−1; temperature gradient, 0 min (170 °C), 2 min (170 °C), 8.5 min (300 °C) and 20 min (300 °C); instrument, Agilent 7890B. MS parameters were as follows: ionization, EI (70 eV); detector, Agilent 5977A; scans, 7.2 scans per s at 60–800 m/z. Data collection was carried out in MassHunter Workstation (version 10.0) Estimation of abundance relative to the IS was calculated as follows: (area analyte/area IS) × concentration IS (µg mg−1 dry weight). For the functional screening experiments, this was calculated from integration of TIC peaks at half peak height (to allow untargeted analysis). For the targeted analysis of the relative quantities of euphanes and tirucallanes between OSC9 mutants, this was calculated from integration of full peak areas from generated extracted ion chromatograms, 411 for compounds 5, 7, 11 and 18 and 388 for the IS (coprostanol).
Isolation and identification of 3β,20-dihydroxylupane (39) in the product profile of OSC19
Through isolation and NMR analysis, the chromatographic peaks at relative retention times of 1.68 and 3.57 min in the product profile OSC19 were found to both correspond to 3β,20-dihydroxylupane (39), with these chromatographic peaks representing degradation artifacts during GC–MS analysis. Transient expression of OSC19, as described above, was repeated in 60 leaves. The isolated and freeze-dried leaves were ground and the resulting dry leaf powder was subjected to batchwise pressurized solvent extraction with ethyl acetate (instrument, Büchi SpeedExtractor E-914; extraction cell size, 120 ml; cycles, three cycles at 100 °C and 130 bar of pressure; hold times, cycle 1 for 0 min and cycles 2 and 3 for 5 min; dispersant, one part by volume quartz sand). Combined extracts were concentrated to dryness by rotatory evaporation under vacuum and dissolved in a minimum volume of ethanol. To this, 50 ml of basic ion-exchange resin (Ambersep 900 OH) was added and the mixture was shaken at room temperature. Every 30 min, an additional 50 ml of basic ion-exchange resin was added until the extract changed from green to murky orange in color. The extract was collected by filtration through a short column of diatomaceous earth and subjected to repeated automated flash chromatography using a Biotage Isolera instrument. The first flash chromatography proceeded with the following parameters: loading method, dry-loaded on silica gel; column, 10-g SnapUltra silica Biotage cartridge; solvent A, hexane; solvent B, ethyl acetate; gradient, 0% B to 100% B over 720 ml; flow rate, 36 ml min−1, with 20-ml fractions. The second flash chromatography proceeded with the following parameters: loading method, dry-loaded on silica gel; column, 10-g SnapUltra silica Biotage cartridge; solvent A, hexane; solvent B, ethyl acetate; gradient: 0% B to 50% B over 720 ml; flow rate, 36 ml min−1, with 20-ml fractions. The third flash chromatography proceeded with the following parameters: loading method, dry-loaded on silica gel; column, 10-g SnapUltra silica Biotage cartridge; solvent A, hexane; solvent B, ethyl acetate; gradient, 0% B to 100% B over 720 ml; flow rate, 36 ml min−1, with 20-ml fractions. Fractions were monitored by GC–MS. The chromatographic peaks at relative retention times of 1.68 and 3.57 min were inseparable. A sample containing these chromatographic peaks was isolated in a trace amount as a white solid and subjected to 1H-NMR analysis in CDCl3 (600 MHz). This showed the sample to contain one triterpene that matched the literature for 3β,20-dihydroxylupane (39) (Supplementary Fig. 8).
Confirmation of the presence of bauerenol (28) and taraxasterol (29) in the product profiles of OSC6 and OSC7 through a metabolomic NMR approach
The EI mass spectra of 28 and 29 were consistent with the known triterpenes bauerenol and taraxasterol, respectively (Supplementary Figs. 26 and 27). The suspected presence of the compounds bauerenol (28) and taraxasterol (29) were confirmed in the product profiles of OSC6 and OSC7, respectively, by a metabolomic NMR approach. For each OSC, transient expression was carried out using six leaves. The isolated and freeze-dried leaf material was ground and the resulting dry leaf powder (approximately 600 mg) was extracted with 20 ml of ethanol at 55 °C with sonication for 99 min. Next, 10 ml of basic ion-exchange resin (Ambersep 900 OH) was added and the mixture was shaken at 60 °C until the extract changed from green to orange in color (approximately 30 min). The extract was filtered and dried onto 1 g of silica gel (40–75-μm particle size, 70-Å pore size) by rotatory evaporation under vacuum before elution with 10 ml of 15% ethyl acetate in hexane. The resulting eluent was treated with 1 g of decolorizing charcoal and recovered by filtration. Solvent was removed by evaporation under a stream of nitrogen to give a colorless oil. The oil was dissolved in 500 μl of CDCl3 and concentrated under of stream of nitrogen three times. Finally, the recovered oil was dissolved again in 500 μl of CDCl3 and subjected to a 32-scan distortionless enhancement by polarization transfer (DEPT)-edited heteronuclear single quantum coherence (HSQC) NMR experiment (1H, 600 MHz; 13C, 125 MHz). Characteristic correctly phased HSQC cross-peaks corresponding to the 1H and 13C chemical shifts for the alkene of each compound were identified, along with other identifiable signals (Supplementary Figs. 9 and 10).
Isolation and identification of kalanchoeol (24a) and spirokalanchoeol (24b) in the product profile of OSC16
Transient expression of OSC16 was carried out using 90 leaves. The isolated and freeze-dried leaves were ground and the resulting dry powder subjected to batchwise pressurized solvent extraction with ethyl acetate (instrument, Büchi SpeedExtractor E-914; extraction cell size, 120 ml; cycles, three cycles at 100 °C and 130 bar of pressure; hold times, cycle 1 for 0 min and cycles 2 and 3 for 5 min; dispersant, one part by volume quartz sand). Combined extracts were concentrated to dryness by rotatory evaporation under vacuum and dissolved in a minimum volume of ethanol. To this, 50 ml of basic ion-exchange resin (Ambersep 900 OH) was added and the mixture shaken at room temperature. Every 30 min, an additional 50 ml of basic ion-exchange resin was added until the extract changed from green to murky orange in color. The extract was collected by filtration through a short column of diatomaceous earth and subjected to repeated automated flash chromatography (using a Biotage Isolera instrument). The first flash chromatography proceeded with the following parameters: loading method, dry-loaded on silica gel; column, 10-g SnapUltra silica Biotage cartridge; solvent A, hexane; solvent B, ethyl acetate; gradient, 0% B to 100% B over 720 ml; flow rate, 36 ml min−1, with 20-ml fractions. The second flash chromatography proceeded with the following parameters: loading method, dry-loaded on silica gel; column, 10-g SnapUltra silica Biotage cartridge; solvent A, hexane; solvent B, ethyl acetate; gradient: 0% B to 25% B over 720 ml; flow rate, 36 ml min−1, with 20-ml fractions. The third flash chromatography proceeded with the following parameters: loading method, dry-loaded on silica gel; column, 10-g SnapUltra silica Biotage cartridge; solvent A, hexane; solvent B, ethyl acetate; gradient, 0% B to 25% B over 720 ml; flow rate, 36 ml min−1, with 20-ml fractions. Fractions were monitored by GC–MS. A 28 mg sample enriched in 24a and 24b was isolated as a pale-yellow oil. This sample was dissolved in 1 ml of tetrahydrofuran and subjected to preparative HPLC (Instrument, Agilent Technologies Infinity system equipped with a 1290 Infinity II fraction collector and a 1290 Infinity II preparative pump; column, 250 × 21.2-mm Luna 5-µm C18(2), 100 Å (Phenomonex); flow rate, 25 ml min−1; gradient, isocratic 100% acetronitrile containing 0.1% formic acid; run time, 30 min; injection volume, 1 ml; fraction size, 21 ml). Fractions were monitored by GC–MS. Under these conditions, 24a and 24b separated and were collected in different fractions. After solvent removal under a steam of nitrogen, both were isolated in trace amounts as white solids. These were directly dissolved in CDCl3 and subjected to extensive NMR analysis (1H, 13C, correlation spectroscopy, DEPT-135, DEPT-edited HSQC and heteronuclear multiple bond correlation) (Supplementary Figs. 29–36).
In silico docking
A homology model of OSC9 was generated using the Phyre2 server51 operating in normal mode. The top-ranked model was selected for further study (Supplementary Data 5). The template structure for this model was the OSC human lanosterol synthase in complex with lanosterol (Protein Data Bank (PDB) 1W6K) (98% alignment coverage, 38% sequence identity). Next, lanosterol from PDB 1W6K was combined with the OSC9 model. OSC9 residues within 5 Å of the transplanted lanosterol were identified (F126, W216, W254, Y256, C257, T260, E368, T408, F409, W414, V479, A481, C482, V530, F549, I552, T561, W609, Y615, S724, Y725, M726, L731 and Y733). These were selected as flexible residues for subsequent in silico docking using AutoDock Vina52. A 3D model of the dammarenyl cation was built and then geometry-optimized by molecular mechanics (force field, MMFF94s; number of steps, 500; algorithm, steepest descent; convergence, 10−7) using Avogadro (version 1.2.0)53. The rigid and flexible receptor files of OSC9 and the ligand file of the dammarenyl cation for use with Autodock Vina (version 1.1.2) in flexible mode were prepared using the accompanying AutoDock Tools package following the default parameters. Docking was performed using the following configuration parameters: center x = 31.09, center y = 70.677, center z = 6.968, size x = 30, size y = 30, size z = 30, exhaustiveness = 24, seed = 166819220, energy range = 10 and num modes = 10. Inspection of the lowest-energy binding model (affinity = −16.5 kcal mol−1) showed the dammarenyl cation to be correctly orientated in the active site compared to lanosterol in PDB 1W6K and, thus, it was selected for use in this study (Supplementary Data 6).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The 16 functional OSC sequences described in the paper were submitted to NCBI GenBank with the following accession codes: OSC1 (Aquilaria agallochum), PV548881; OSC2 (Elaeis guineensis), PV548882; OSC4 (A. officinalis), PV548883; OSC6 (J. regia), PV548884; OSC7 (I. nil), PV548885; OSC8 (G. bardense), PV548886; OSC9 (I. nil), PV548887; OSC10 (Daucus carota), PV548888; OSC11 (Pyrus × bretschneideri), PV548889; OSC12 (Vigna radiata), PV548890; OSC14 (Capsella bursa-pastoris), PV548891; OSC15 (Arabis alpina), PV548892; OSC16 (K. fedtschenkoi), PV548893; OSC17 (H. lupulus), PV548894; OSC19 (T. hassleriana), PV548895; OSC20 (B. rapa), PV548896. The set of 1,405 mined full-length OSCs is provided in Supplementary Data 7. Data are available from the corresponding authors upon request. Source data are provided with this paper.
References
Christianson, D. W. Structural and chemical biology of terpenoid cyclases. Chem. Rev. 117, 11570–11648 (2017).
Hill, R. A. & Connolly, J. D. Triterpenoids. Nat. Prod. Rep. 29, 780–818 (2012).
Moses, T., Papadopoulou, K. K. & Osbourn, A. Metabolic and functional diversity of saponins, biosynthetic intermediates and semi-synthetic derivatives. Crit. Rev. Biochem. Mol. Biol. 49, 439–462 (2014).
Huang, A. C. et al. A specialized metabolic network selectively modulates Arabidopsis root microbiota. Science 364, eaau6389 (2019).
Thimmappa, R. et al. Triterpene biosynthesis in plants. Annu. Rev. Plant Biol. 65, 225–257 (2014).
Dinday, S. & Ghosh, S. Recent advances in triterpenoid pathway elucidation and engineering. Biotechnol. Adv. 68, 108214 (2023).
Kensil, C. R., Patel, U., Lennick, M. & Marciani, D. Separation and characterization of saponins with adjuvant activity from Quillaja saponaria Molina cortex. J. Immunol. 146, 431–437 (1991).
Martin, L. B. B. et al. Complete biosynthesis of the potent vaccine adjuvant QS-21. Nat. Chem. Biol. 20, 493–502 (2024).
Sirtori, C. R. Aescin: pharmacology, pharmacokinetics and therapeutic profile. Pharmacol. Res. 44, 183–193 (2001).
Sun, W. et al. Characterization of the horse chestnut genome reveals the evolution of aescin and aesculin biosynthesis. Nat. Commun. 14, 6470 (2023).
Morgan, E. D. Azadirachtin, a scientific gold mine. Bioorg. Med. Chem. 17, 4096–4105 (2009).
Xu, R., Fazio, G. C. & Matsuda, S. P. T. On the origins of triterpenoid skeletal diversity. Phytochemistry 65, 261–291 (2004).
Miettinen, K. et al. The TriForC database: a comprehensive up-to-date resource of plant triterpene biosynthesis. Nucleic Acids Res. 46, D586–D594 (2018).
Chen, K., Zhang, M., Ye, M. & Qiao, X. Site-directed mutagenesis and substrate compatibility to reveal the structure–function relationships of plant oxidosqualene cyclases. Nat. Prod. Rep. 38, 2261–2275 (2021).
Xue, Z. et al. Identification of key amino acid residues determining product specificity of 2,3-oxidosqualene cyclase in Oryza species. New Phytol. 218, 1076–1088 (2018).
Stephenson, M. J., Field, R. A. & Osbourn, A. The protosteryl and dammarenyl cation dichotomy in polycyclic triterpene biosynthesis revisited: has this ‘rule’ finally been broken? Nat. Prod. Rep. 36, 1044–1052 (2019).
Eschenmoser, A., Ruzicka, L., Jeger, O. & Arigoni, D. Zur kenntnis der triterpene. 190. Mitteilung. Eine stereochemische interpretation der biogenetischen isoprenregel bei den triterpenen. Helv. Chim. Acta 38, 1890–1904 (1955).
Eschenmoser, A. & Arigoni, D. Revisited after 50 years: the ‘stereochemical interpretation of the biogenetic isoprene rule for the triterpenes’. Helv. Chim. Acta 88, 3011–3050 (2005).
Hoshino, T. & Sato, T. Squalene–hopene cyclase: catalytic mechanism and substrate recognition. Chem. Commun. 4, 291–301 .
Hoshino, T. β-Amrin biosynthesis: catalytic mechanism and substrate recognition. Org. Biomol. Chem. 15, 2869–2891 (2017).
Afendi, F. M. et al. KNApSAcK family databases: integrated metabolite–plant species databases for multifaceted plant research. Plant Cell Physiol. 53, e1 (2011).
Kress, W. J. et al. Green plant genomes: what we know in an era of rapidly expanding opportunities. Proc. Natl Acad. Sci. USA 119, e2115640118 (2022).
Bolger, M. E., Arsova, B. & Usadel, B. Plant genome and transcriptome annotations: from misconceptions to simple solutions. Brief. Bioinform. 19, 437–449 (2017).
Wendel, J. F., Jackson, S. A., Meyers, B. C. & Wing, R. A. Evolution of plant genome architecture. Genome Biol. 17, 37 (2016).
Kellogg, E. A. & Bennetzen, J. L. The evolution of nuclear genome structure in seed plants. Am. J. Bot. 91, 1709–1725 (2004).
Mariotti, M. & Guigó, R. Selenoprofiles: profile-based scanning of eukaryotic genome sequences for selenoprotein genes. Bioinformatics 26, 2656–2663 (2010).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a novel generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005).
Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14, 988–995 (2004).
Wang, J. et al. Diverse triterpene skeletons are derived from the expansion and divergent evolution of 2,3-oxidosqualene cyclases in plants. Crit. Rev. Biochem. Mol. Biol. 57, 113–132 (2022).
Xue, Z. et al. Divergent evolution of oxidosqualene cyclases in plants. New Phytol. 193, 1022–1038 (2012).
Sawai, S. et al. Plant lanosterol synthase: divergence of the sterol and triterpene biosynthetic pathways in eukaryotes. Plant Cell Physiol. 47, 673–677 (2006).
Suzuki, M. et al. Lanosterol synthase in dicotyledonous plants. Plant Cell Physiol. 47, 565–571 (2006).
Xue, Z. et al. Deficiency of a triterpene pathway results in humidity-sensitive genic male sterility in rice. Nat. Commun. 9, 604 (2018).
Liu, Z. et al. Drivers of metabolic diversification: how dynamic genomic neighbourhoods generate new biosynthetic pathways in the Brassicaceae. New Phytol. 227, 1109–1123 (2020).
Kautsar, S. A., Suarez Duran, H. G., Blin, K., Osbourn, A. & Medema, M. H. plantiSMASH: automated identification, annotation and expression analysis of plant biosynthetic gene clusters. Nucleic Acids Res. 45, W55–W63 (2017).
Sainsbury, F., Thuenemann, E. C. & Lomonossoff, G. P. pEAQ: versatile expression vectors for easy and quick transient expression of heterologous proteins in plants. Plant Biotechnol. J. 7, 682–693 (2009).
Reed, J. et al. A translational synthetic biology platform for rapid access to gram-scale quantities of novel drug-like molecules. Metab. Eng. 42, 185–193 (2017).
Stephenson, M. J., Reed, J., Brouwer, B. & Osbourn, A.Transient expression in Nicotiana benthamiana leaves for triterpene production at a preparative scale. J. Vis. Exp. 16, 58169 (2018).
Thanh, V. T. T. et al. Cleistanone: a triterpenoid from Cleistanthus indochinensis with a new carbon skeleton. Eur. J. Org. Chem. 2011, 4108–4111 (2011).
Kolesnikova, M. D., Wilson, W. K., Lynch, D. A., Obermeyer, A. C. & Matsuda, S. P. Arabidopsis camelliol C synthase evolved from enzymes that make pentacycles. Org. Lett. 9, 5223–5226 (2007).
Segura, M. J. R., Meyer, M. M. & Matsuda, S. P. T. Arabidopsis thaliana LUP1 converts oxidosqualene to multiple triterpene alcohols and a triterpene diol. Org. Lett. 2, 2257–2259 (2000).
Wang, Z., Yeats, T., Han, H. & Jetter, R. Cloning and characterization of oxidosqualene cyclases from Kalanchoe daigremontiana: enzymes catalyzing up to 10 rearrangement steps yielding friedelin and other triterpenoids. J. Biol. Chem. 285, 29703–29712 (2010).
Souza-Moreira, T. M. et al. Friedelin synthase from Maytenus ilicifolia: leucine 482 plays an essential role in the production of the most rearranged pentacyclic triterpene. Sci. Rep. 6, 36858 (2016).
Zhou, J. et al. Friedelane-type triterpene cyclase in celastrol biosynthesis from Tripterygium wilfordii and its application for triterpenes biosynthesis in yeast. N. Phytol. 223, 722–735 (2019).
Hodgson, H. et al. Identification of key enzymes responsible for protolimonoid biosynthesis in plants: opening the door to azadirachtin production. Proc. Natl Acad. Sci. USA 116, 17096–17104 (2019).
Zamyatnin, A. A. Protein volume in solution. Prog. Biophys. Mol. Biol. 24, 107–123 (1972).
Katoh, K. & Standley, D. M. MAFFT Multiple Sequence Alignment Software Version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N. & Sternberg, M. J. E. The Phyre2 web portal for protein modeling, prediction and analysis. Nat. Protoc. 10, 845–858 (2015).
Trott, O. & Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31, 455–461 (2010).
Hanwell, M. D. et al. Avogadro: an advanced semantic chemical editor, visualization, and analysis platform. J. Cheminform. 4, 17 (2012).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Acknowledgements
We acknowledge the following funding sources: Biotechnological and Biological Sciences Research Council (BBSRC) OpenPlant Synthetic Biology Research Center grant BB/ L014130/1 (M.S. and A.O.), BBSRC grant BB/S016023/1 (M.S. and A.O.), National Institutes of Health grant 1R01AT010593-01 (J.R. and A.O.), the Novozymes Prize 2023 (Novo Nordisk Foundation; A.O.), Wellcome Discovery Award 227375/Z/23/Z (A.O.), the BBRSC Institute Strategic Program Grant ‘Harnessing biosynthesis for sustainable food and health’ (BB/X01097X/1; A.O.) and the John Innes Foundation (C.O. and A.O.). We thank John Innes Center (JIC) Horticultural Services for assistance with plant cultivation and the JIC Metabolomics and NMR platforms for assistance with instruments and method development.
Author information
Authors and Affiliations
Contributions
M.J.S., C.O. and A.O. conceptualized and designed the project. C.O. performed the genome mining and bioinformatic analysis. M.S. carried out the OSC cloning, transient plant expression, metabolite analysis and purification and NMR analysis of OSC products. J.R. performed the analysis of previously characterized OSCs. M.S., C.O. and A.O. wrote the paper.
Corresponding author
Ethics declarations
Competing interests
A.O. and J.R. are a cofounders and shareholders of HotHouse Therapeutics. A.O. serves as a consultant for this company. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Chemical Biology thanks Ranran Gao, Sumit Ghosh, Marco Mariotti and the other, anonymous reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Summary of computational workflow for high-throughput, systematic OSC mining from plant genomes of varying annotation quality.
HMMER50 and Selenoprofiles26 were used in conjunction with characterised OSCs in order to fully utilise the available plant genome sequence data, despite the absence or low quality of genome annotations. After filtering the discovered OSCs to ensure only high-quality sequences were assessed, a phylogenetic analysis was carried out, which was then used to define distinct OSC groups and pHMM generation for on-the-fly OSC characterisation.
Extended Data Fig. 2 Diversity and distribution of OSCs across plant clades.
a, Cladogram summarizing OSC evolutionary pathways across plant clades, with D1-3 labelled. Branch colours are as used in Fig. 2. The evolutionary origin of eukaryotic OSCs, unlike other terpene synthases, is shared with the bacterial squalene-hopene synthases (dotted line). Conserved residues from the ‘MWCYCR’ motif are shown5. b, Absence/presence of OSC groups in different plant clades.
Extended Data Fig. 3 Estimated accumulation of products generated by OSCs.
Levels were estimated by GC-MS relative to an internal standard (coprostanol 100 ppm) following transient expression in N. benthamiana. Error bars represent standard error of the mean (three biological replicates). NB: Compound 36 (†) degrades into two peaks during extraction; compound 39 (‡) partially degrades to compound 21 during GC-MS (See, Fig. 4a and Supplementary Figs. 8 and 12). Names of OSCs that make multiple products (multifunctional OSCs) are in black, while those of OSCs that make one major product (that is are committed) are in red. See Supplementary Data 4 – Supplementary Database for source Total Ion Chromatograms and Source Data for this figure.
Extended Data Fig. 4 Summary matrix showing the overall mean estimated accumulation of triterpene compounds 1–40 (µg/mg dry weight) for each OSC tested.
For compounds 36 and 39, the sum of their artefactual degradation peaks are used. Compound 36 (†) degrades into two peaks during extraction; compound 39 (‡) partially degrades to compound 21 during GC-MS (See Fig. 4a and Supplementary Figs. 8 and 12). See Supplementary Data 4 – Supplementary Database for source Total Ion Chromatograms.
Extended Data Fig. 5 Products of OSCs 6, 7 and 10.
a, The reaction pathway relationship between the triterpene products of OSC6, OSC7 and OSC10, and their shared intermediatory cation (the lupanyl cation). b, Representative Total Ion Chromatograms of extracts from N. benthamiana leaves expressing OSC6 (red), OSC7 (blue), and OSC10 (purple); black, control (tHMGR only). IS, internal standard (coprostanol, 100 ppm).
Extended Data Fig. 6 Production of a bicyclic poaceatapetol-like triterpene from Asparagus officinalis.
a, OSC group G from Fig. 2., showing plant taxonomic clades where this OSC group was found, OSC4 from A. officinalis and the characterized poaceatapetol synthase from O. sativa. OSC4 produces a bicyclic poaceatapetol-like triterpene, (9R,10S)-polypoda-8(26),13E,17E,21-tetraen-β-ol (17). b, Heatmap of expression data from NCBI BioProject PRJNA435625 (Salmon TPM counts normalized with DESeq2)54 of the two OSC sequences found in the A. officinalis genome: OSC4 and a likely cycloartenol synthase. OSC4 has specific expression in the male stamens (unlike the expected, broad expression pattern of the likely CAS). This suggests a conserved function for group G OSCs as pollen specific compounds across wider monocot species than only the Poaceae. c, Structures of polypoda-8(26),13E,17E,21-tetraen-β-ol (17) and poaceatapetol.
Extended Data Fig. 7 Products of OSC9.
a, The reaction pathway relationship between the dammarenyl cation and the product profile of OSC9, highlighting the committed nature of the 1,2-shift migration sequence with variation arising only at the initiation and termination of this re-arrangement (differential sidechain rotation and position of final proton elimination). b, Representative Total Ion Chromatograms for leaf extract from N. benthamiana expressing OSC9 (red). Black, control (tHMGR only). IS, internal standard (coprostanol, 100 ppm).
Extended Data Fig. 8 Proto-euphane and proto-tirucallane confirmations.
a, The docking model conformation of the dammarenyl cation and the approximate sidechain rotation required around the C17-20 bond to access the proto-euphane or proto-tirucallane conformation allowing the first 1,2-hydride shift (cw = clockwise, ccw = counterclockwise). b, Approximate Newman projections viewing down the C17-C20 bond, for the conformations above in panel A, illustrating alignment of the empty p-orbital of the C20 cation and the C17 hydrogen (R=sidechain). C20 is colour coded between panels (light purple).
Extended Data Fig. 9 Euphane and tirucallane production resulting from mutations of OSC9.
a, Mean euphane content as a percentage of total euphane and tirucallane content for each OSC product profile. b, Mean Δ7 isomer content as a percentage of total Δ9 and Δ7 isomer content for each OSC product profile. c, Percent change in total accumulation of euphanes and tirucallanes compared to the mean wildtype value (µg/mg). d, Linear regression of % euphane content and the molar volume (mL/mole) of different amino acids at residue 549 (Gly = 36.1, Ala = 53.2, Gln = 86.3, Ile= 100.1, Phe = 113.9, Tyr = 116.2, Trp = 136.7, Values from Zamyatnin, A. A.44 Error bars for all are standard error of the mean (three biological replicates). **** = p ≤ 0.0001, *** = p ≤ 0.001, ** = p ≤ 0.01, * = p ≤ 0.05, ns = p > 0.05) - independent two-tailed T-Tests. Full source data with p values available in Source Data for this figure. See Supplementary Data 4 – Supplementary Database for source Extracted Ion Chromatograms.
Extended Data Fig. 10 Phylogenetic tree shown in Fig. 2 with the functions of the newly characterised OSCs from the study included.
Supplementary information
Supplementary Information
Supplemental Figs. 1–39, Table 1 and references.
Supplementary Data 1, 2 and 4
Supplementary Data 1. Table of all Viridiplantae genomes analyzed in this study. Supplementary Data 2. Functionally characterized OSCs. Supplementary Data 4. Supplementary database, containing the compound index with EI mass spectra and the chromatograms and EI mass spectra for each individual OSC sequence and OSC9 mutant studied.
Supplementary Data 3
pHMMs generated and used for OSC profiling.
Supplementary Data 5
PDB file of the homology model of OSC9.
Supplementary Data 6
PDBQT file of the lowest-energy binding model of the dammarenyl cation and OSC9 flexible residues.
Supplementary Data 7
FASTA file of 1,405 mined full-length OSC sequences, including source species name.
Source data
Source Data Extended Data Fig. 3
Table of values for yields.
Source Data Extended Data Fig. 9
Yields and exact P values.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Stephenson, M.J., Owen, C., Reed, J. et al. Large-scale mining of plant genomes unlocks the diversity of oxidosqualene cyclases. Nat Chem Biol 22, 249–259 (2026). https://doi.org/10.1038/s41589-025-02034-8
Received:
Accepted:
Published:
Version of record:
Issue date:
DOI: https://doi.org/10.1038/s41589-025-02034-8








