Fig. 2: Rationale and curation of a diverse α-KG NHI enzyme library, aKGLib1.
From: Connecting chemical and protein sequence space to predict biocatalytic reactions

a, Abbreviated catalytic cycle and enzymatic transformations with α-KG-dependent C–H functionalization in natural product biosynthesis. In the active site of α-KG-dependent enzymes, iron is complexed by two histidine residues and either a carboxylate-containing residue (R = Asp/Glu) or an environmentally sourced halide (R = Ala/Gly). On α-KG binding and oxidation by atmospheric oxygen, the active iron(IV)-oxo species can initiate hydrogen atom abstraction from the substrate to yield the iron(III)-hydroxy species and a radical intermediate. This intermediate can undergo structural rearrangements before being terminated by rebound hydroxylation, carbocation formation or halogenation (functionalization by α-KG NHI enzymes in natural product biosynthesis shown in green), with succinate generated as a by-product. b, Workflow to curate a bioinformatics-guided α-KG-dependent NHI enzyme library (aKGLib1). The enzyme library was selected by collecting sequences of characterized enzymes of interest, which led to the inclusion of protein families IPR008775, IPR005123, IPR027443, IPR026992 and IPR044861. These families were used as a seed for the generation of a sequence similarity network (SSN; e-value = 5, UniRef90), which, after filtering, resulted in the network shown containing 27,005 protein sequences (alignment score = 50, organic full layout). In total, 314 enzymes (purple), representing 1.16% of the total sequences (grey), were selected across 160 clusters to generate a diverse enzyme library. c, Trends in substrates within clusters of an SSN and efficacy of aKGLib1. The sequences in the SSN at alignment score = 75 contain 94 enzymes that have previously been characterized (purple diamonds) and 220 sequences that were previously uncharacterized (lavender circles, 70% of total library). In clusters containing several characterized proteins, the known compatible common scaffold is highlighted.
On computing a multisequence percent identity matrix, the sequences were found to share only 13.7% identity on average. After transformation and overexpression in E. coli, protein production was assessed by gel electrophoresis; 78% of aKGLib1 showed soluble protein overexpressed at the expected molecular weight.
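
To illustrate what an average over a percent identity matrix measures, the sketch below computes pairwise identities for a set of pre-aligned sequences and averages the off-diagonal entries. It is a minimal illustration, not the authors' pipeline: the caption does not state which alignment tool or identity definition was used, so the helper names (`percent_identity`, `identity_matrix`, `mean_pairwise_identity`), the choice to ignore columns where either sequence has a gap, and the toy sequences are all assumptions.

```python
from itertools import combinations


def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity between two rows of the same alignment.

    Assumption: identity is counted over columns where neither
    sequence has a gap; other conventions exist.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which both sequences carry a residue.
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def identity_matrix(aligned: list[str]) -> list[list[float]]:
    """Symmetric matrix of pairwise percent identities (diagonal set to 100)."""
    n = len(aligned)
    matrix = [[100.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = percent_identity(aligned[i], aligned[j])
    return matrix


def mean_pairwise_identity(matrix: list[list[float]]) -> float:
    """Average of the off-diagonal entries, counting each pair once."""
    n = len(matrix)
    values = [matrix[i][j] for i, j in combinations(range(n), 2)]
    return sum(values) / len(values)


# Toy alignment (hypothetical sequences, not taken from aKGLib1).
toy_alignment = ["MKT-AL", "MKSQAL", "-RTQGL"]
toy_matrix = identity_matrix(toy_alignment)
toy_mean = mean_pairwise_identity(toy_matrix)
```

A low mean, such as the 13.7% reported for aKGLib1, indicates that library members are on average distantly related, which is consistent with the stated goal of sampling many clusters of the SSN.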