Main

Plants, algae and photosynthetic bacteria together fix around 100 gigatons of carbon annually using ribulose-1,5-bisphosphate carboxylase/oxygenase (rubisco)—the most abundant enzyme on Earth8. Rubisco catalysis, which is slow compared with many other central carbon metabolic enzymes2, is thought to limit photosynthesis under common conditions9. Rubisco is also prone to a side reaction with oxygen, leading to the hypothesis that this apparent inefficiency is in fact a careful balance of several biochemical trade-offs between rate, affinity and promiscuity10,11,12,13.

Efforts to engineer improvements to rubisco have been hampered by the low throughput of obtaining accurate measurements for its parameters, including catalytic rate for carboxylation (kcat,C, called kcat here), CO2 affinity (KC) and specificity for CO2 versus O2 (SC/O). A concentrated effort across several decades has produced several hundred biochemical measurements of natural and mutant rubiscos10,11,12,13. Collection of these measurements has been biased towards vascular plant rubiscos, and the diversity of natural rubiscos remains undersampled. Library screens and rational mutations have been used in the past to increase rubisco activity. These efforts often resulted in improved expression5 but occasionally led to fundamental biochemical improvements14,15.

Protein engineering has benefited in recent years from the introduction of machine learning approaches. One goal of such efforts is to train models with labelled protein sequence–function data from high-throughput functional screens16,17,18,19,20,21. Enzyme engineering with machine learning presents a further challenge: ideally, functional data would be decomposed into individual catalytic parameters measured in high throughput either in vitro22 or in vivo20.

Here we have developed a selection assay in Escherichia coli to estimate the carboxylation fitness of more than 99% (8,760 of 8,835) of the single-amino acid mutants of the model Form II rubisco from Rhodospirillum rubrum (Extended Data Fig. 1). Ribose phosphate isomerase was knocked out to generate Δrpi—a strain that grows on glycerol only when it expresses functional rubisco (Extended Data Fig. 2a). We then generated a barcoded library of single-amino acid mutations of the R.rubrum rubisco, which we assayed in high throughput using Δrpi. By varying the CO2 concentrations of the growth environment, we were able to estimate the effective CO2 affinities of 65% (5,687) of the rubisco variants—a subset of which we went on to validate in vitro. This screen showed a very small minority of mutations that improved affinity for CO2 around threefold. These affinities have never before been observed among bacterial rubiscos, are more typical of the Form I rubiscos found in plants and algae, and indicate that non-trivial alterations to biochemical function are rare, yet readily accessible through mutation.

Characterization of rubisco variants

The rubisco-dependent E.coli strain, Δrpi, cannot grow when glycerol is provided as the only carbon source because ribulose-5-phosphate accumulates with no outlet7. The combined actions of phosphoribulokinase (PRK, which produces the five-carbon rubisco substrate) and rubisco rescue growth by converting this otherwise dead-end metabolite into 3-phosphoglycerate, which can feed back into central carbon metabolism (Fig. 1a and Extended Data Fig. 2a; for similar selection systems see refs. 23,24).

Fig. 1: A deep mutational scan individually characterizes all single-amino acid mutations in rubisco.
figure 1

a, Summary of the metabolism of Δrpi—the rubisco-dependent strain. b, Δrpi grows with a rate proportional to the flux through rubisco. c, Schematic of library selection. A library of rubisco single-amino acid mutants was transformed into Δrpi then selected in minimal medium supplemented with glycerol at elevated CO2. Samples were sequenced before and after selection and barcode counts were used to determine the relative fitness of each mutant. d, Correspondence between two example biological replicates; each point represents the median fitness among all barcodes for a given mutant. e, Fitness of 77 mutants with measurements in previous studies compared with the rate constants measured in those studies (kcat). The outlier is I190T (see Methods for discussion). Fitness error values are the s.e.m. of nine replicate enrichment measurements; kcat errors are from the literature, where available. f, Variant fitnesses (grey) were normalized between values of 0 and 1, with 0 representing the average of fitnesses of mutations at a panel of known active site positions (red distribution, average is plotted as a red dashed line) and 1 representing the average of wild-type (WT) barcodes (white dashed line). g, Heatmap of variant fitnesses. Conservation by position and sequence logo were determined from a MSA of all rubiscos. Black triangle, G186 (an example of a position with high conservation that is mutationally tolerant); grey triangles, active site positions. Ri5P, ribose 5-phosphate; Ru5P, ribulose-5-phosphate; RuBP, ribulose-1,5-bisphosphate; TIM, triosephosphate isomerase.

We first confirmed that the growth rate of Δrpi was related quantitatively to known in vitro enzyme behaviour (Fig. 1b and Extended Data Fig. 2b–l). Expression of rubisco driven by an inducible promoter demonstrated that growth rates increased with the rubisco concentration, indicating that increased enzyme concentration led to higher fitness (Extended Data Fig. 2b,d,g); at isopropyl-β-d-thiogalactopyranoside (IPTG) concentrations above 30 μM, growth yields began to decline, indicating that rubisco overexpression comes with a fitness cost. Similarly, we observed faster growth in the presence of higher CO2 concentrations (Extended Data Fig. 2c,d). We next assessed whether growth-based selection correlated with biochemical behaviour. Previous work on R.rubrum rubisco identified 77 mutants spanning from less than 1% to 100% of wild-type catalytic rate (Supplementary Data 1). Growth of a subset of these mutants was tested and found to correlate with reported catalytic rates (Extended Data Fig. 2i–k). Together, these results are consistent with glycerol growth of Δrpi being limited by rubisco carboxylation flux, which is determined by enzyme kinetics—kcat and KC—as well as enzyme and CO2 concentrations.

We next constructed a library of all single-amino acid substitutions to the model Form II rubisco from R.rubrum (Extended Data Fig. 3a). This library was cloned into a selection plasmid containing PRK, barcoded and bottlenecked to around 500,000 colonies. Long-read sequencing was used to map barcodes to mutants (Extended Data Figs. 3b and 4) and determined that the final library contained approximately 180,000 barcodes, representing 8,760 mutants or more than 99% of the designed library (Extended Data Fig. 4).

This library was transformed into Δrpi to assess mutant fitness (Fig. 1c). Mutant fitness is defined by the relative growth rate of Δrpi expressing that mutant. Three independent library transformations were grown in selective conditions and grown for around seven divisions in 5% CO2 (equivalent to approximately 1,200 μM CO2 in solution; wild-type KC = 150 μM). Selection was in the presence of 20 μM IPTG—a concentration at which rubisco is limiting and overexpression stress is minimized but growth is relatively robust (Extended Data Fig. 2b,d). Short-read sequencing quantified barcode abundance before and after selection (Methods). Mutant fitness was calculated by normalizing pre- and post-selection log10 read-count ratios to a panel of known catalytically dead mutants and all wild-type barcodes (Methods). Nine replicate experiments were performed with an average pairwise Pearson coefficient of 0.98 (Fig. 1d and Extended Data Fig. 5).

We compared mutant fitness measurements against 77 catalytic rate values taken from the literature (Fig. 1e and Supplementary Data 1), as well as 35 in vitro measurements from purified mutants (Extended Data Fig. 6a,b), and observed a linear relationship. Overall, we observed a bimodal distribution of mutant effects (Fig. 1f), with mutant fitnesses clustering near wild-type (neutral mutations) and catalytically dead variants18,25.

We measured fitness values for more than 99% (8,760 out of 8,835) of amino acid substitutions (Fig. 1g and Extended Data Figs. 4f and 7b). Fewer than 0.14% of mutations seemed more fit than wild type (and when they did it was by a small amount (Fig. 1f)) and 72% were found to be deleterious. In vitro analysis of 11 variants with improved fitness did not show higher kcat values (Extended Data Fig. 6b) indicating that those small fitness effects were probably related to protein expression (Extended Data Fig. 2f–h). Mutations at known active site positions had very low fitness (for example, K191, K166 and K329; residues with grey triangles in Fig. 1g, bottom), and mutations to proline were more deleterious on average than any other amino acid (Extended Data Fig. 7a). Phylogenetic conservation and average fitness at each position tended to anti-correlate (Figs. 1g (top tracks) and 2d and Extended Data Fig. 8a) consistent with previous studies26,27; however, several positions seemed to be both highly conserved and mutationally tolerant (Fig. 1g, black triangle).

Fitness variation across the structure

Our fitness assays showed that some regions of the rubisco structure are much more sensitive to mutation than others (Fig. 2a,b). For example, residues on the solvent-exposed faces of the structure are more tolerant to mutation, as expected, whereas active site and buried residues typically do not tolerate mutations well. A notable region of interest is Loop 6 of the triosephosphate isomerase barrel, which is known to fold over the active site during substrate binding and to participate in catalysis (Fig. 2c (inset) and Extended Data Fig. 1 (right panel)). Despite this key role in catalysis, some residues in this loop are highly tolerant to mutation (for example, E331 and E333), although the active site residue K329 is highly sensitive (Fig. 2c).

Fig. 2: Fitness values provide structural, functional and evolutionary insights into rubisco.
figure 2

a, Structure of R.rubrum rubisco homodimer (Protein Data Bank (PDB) 9RUB) coloured by the average fitness value of a substitution at every site. Asterisks denote active sites. b, Variant effects for amino acids in different parts of the homodimer complex. c, Close-up view of the active site and the mobile Loop 6 region. Radar plots show the fitness effects of all mutations at a given position. d, Comparison of average fitness at each position against phylogenetic conservation among all rubiscos. Positions coloured as in b. Positions 215 and 257 form a tertiary interaction (Extended Data Fig. 8c), position 186 is highly conserved with no known function.

We expected that conserved positions would not tolerate mutations well. Consistent with this common hypothesis, the average fitness value at each position was negatively correlated with sequence conservation (Fig. 2d and Extended Data Fig. 8a). There were, however, many outliers, with several positions being highly conserved yet showing high mutational tolerance (for example, G186 (Fig. 2d, top right corner)). Selection in alternative conditions may reveal which selective forces have maintained high conservation at those positions28. Positions with low conservation and low mutational tolerance may indicate a recently evolved, but critical, function26,27; for example, M215 and H257 (Fig. 2d) are in contact in the R. rubrum structure but are absent in Form I sequences (Extended Data Fig. 8a–c).

Affinity inferred by substrate titration

Enzyme fitness is determined by the underlying biochemical parameters, including catalytic rates and affinities. To measure these parameters individually, we performed a substrate titration on the whole library of mutations in tandem (Fig. 3a). Mutant fitness values varied overall with increasing [CO2] (Fig. 3b) and some mutants’ fitnesses were affected strongly (Fig. 3c). We fit the data to a Michaelis–Menten model of catalysis to estimate effective maximum rates \(({\mathop{V}\limits^{ \sim }}_{\max })\) and CO2 half-saturation constants \(({\mathop{K}\limits^{ \sim }}_{{\rm{C}}})\)20 (the tildes distinguish library-derived fit parameters from those measured in vitro). This fitting (Fig. 3d; Methods) generated \({\widetilde{V}}_{\max }\) and \({\widetilde{K}}_{{\rm{C}}}\) estimates for every mutant (Fig. 3g and Extended Data Fig. 8c,d). We judged the reliability of the estimates by the coefficient of variation (s.d. over the mean; σ/μ) of 1,100 bootstrap fits of the data for each mutation (Methods); we focus here on the 65% of the mutants (5,687) that had a coefficient of variation under 1 (ref. 26). The remaining 35% are primarily mutants with low fitness values (Extended Data Fig. 6e) that may fail to fold altogether, although at higher expression levels or in combination with other mutations it may yet be possible to produce reliable estimates of their effects on rate and affinity.

Fig. 3: \({\widetilde{{\boldsymbol{K}}}}_{{\bf{C}}}\) and \({\widetilde{{\boldsymbol{V}}}}_{{\bf{\max }}}\) can be inferred from fitness across a CO2 titration.
figure 3

a, Schematic of rubisco selection in [CO2] titration and some examples of inferred Michaelis–Menten curves of mutants with varying KC and Vmax. b, Variant fitnesses at different [CO2]. c, Measured fitnesses at different [CO2] for two mutants (error bars, s.d. of the mean for N = 3 biological replicates). d, The same data as in c plotted under the assumptions of the Michaelis–Menten equation (error bars, s.d. of the mean for N = 3 biological replicates). e, Individually measured rubisco kinetics for the same two mutants from c and d (points, medians of N = 3 measurements; error bars, s.d.). f, Comparison between rubisco KC values measured in vitro (spectrophotometric assay) and those inferred from fitness values \(({\widetilde{K}}_{{\rm{C}}})\). ρ is calculated from a Spearman correlation; P value reflects the result of a two-sided permutations test analysis. \({\widetilde{K}}_{{\rm{C}}}\) error bars, inner quartiles of the bootstrap fits (Methods); in vitro KC error bars, s.d. from N = 3 measurements. g, Heatmap of \({\widetilde{K}}_{{\rm{C}}}\) values for all mutants for which the coefficient of variation is less than 1 (N = 5,687 mutants, 65% of total). Two positions with high-affinity mutations are highlighted in the inset expanded below. Variants for which the \({\widetilde{K}}_{{\rm{C}}}\) fits had a coefficient of variation above 1 are in grey. h, Two-dimensional histogram of mutant \({\widetilde{K}}_{{\rm{C}}}\) and \({\widetilde{V}}_{\max }\) values from g with hexagonal bins. Dashed lines, WT values.

We validated our \({\widetilde{K}}_{{\rm{C}}}\) estimates by purifying a set of seven mutants chosen to span a range of predicted \({\widetilde{K}}_{{\rm{C}}}\) values and measuring their CO2 affinities in vitro (Fig. 3e,f). Unexpectedly, for several mutants, the KC values measured in vitro were substantially lower (higher affinity) than expected from our previous estimates on the basis of fitness data. For example, the \({\widetilde{K}}_{{\rm{C}}}\) of V266T was around 130 μM, but KC was determined to be roughly 80 μM CO2 (Fig. 3f,g; highlighted box). Four mutations stood out in our analysis for having especially low \({\widetilde{K}}_{{\rm{C}}}\): A102Y, V266T, A289C and A289T (Fig. 4a).

Fig. 4: Single-amino acid mutations can traverse the functional landscape.
figure 4

a, \({\widetilde{K}}_{{\rm{C}}}\) versus effect size for each mutant. Effect size is the difference between the mutant \({\widetilde{K}}_{{\rm{C}}}\) and WT KC divided by the coefficient of variation of \({\widetilde{K}}_{{\rm{C}}}\). b, PDB structure 9RUB; inset on the C2 symmetry axis is expanded below. Each position appears twice due to proximity to the C2 axis. c, kcat versus KC of the indicated mutants (as measured by 14C assay) versus all measured rubiscos from refs. 6,12). Shaded regions indicate known ranges of \({\widetilde{K}}_{{\rm{C}}}\) values for plants and algae in green and Form II bacterial rubiscos in pink. Star, WT R.rubrum; triangles, mutants A102Y and V266T.

Our estimates of \({\widetilde{V}}_{\max }\) correlated with fitness (r = 0.93; Extended Data Fig. 6h), indicating that it is the primary driver of rubisco flux. However, Vmax = kcat × [rubisco] so variation in Vmax can have two potential causes: rubisco expression level and kcat. \({\widetilde{V}}_{\max }\) estimates report the product of those two factors.

We further found that \({\widetilde{V}}_{\max }\) and \({\widetilde{K}}_{{\rm{C}}}\) estimates anti-correlate for variants with near-wild-type kinetics where the estimates are most reliable (Fig. 3h). This correlation implies that, in the absence of selective pressure, most single-amino acid mutations impair CO2 affinity and Vmax in tandem. It is important to note that, since the CO2 addition step in catalysis is thought to be irreversible29 and there is no binding site for CO2 in the enzyme30, all measured affinities reflect CO2 on-rates. The observed anticorrelation between \({\widetilde{V}}_{\max }\) and \({\widetilde{K}}_{{\rm{C}}}\) may therefore be related to subtle changes in the electronics of the active site or the geometry of the bound sugar substrate before or during bond formation with CO2. It is also possible that these effects are caused by changes to enzyme stability.

Mutations at three positions (A289C, A102Y, V266T, A289T) induced strong improvements in CO2 affinity in vivo (Figs. 3g and 4a). Other mutations at these same positions reduced affinity (for example, V266G, A102F and A289G; Fig. 3c–g). These three positions are not part of the active site and sit near the C2 axis of the rubisco homodimer interface (Fig. 4b). In this region of the structure, residues are in closest proximity to ‘themselves’, that is, to their counterpart residue in the other monomer of the homodimer. The role these amino acids play in CO2 entry into the active site, active site conformation or electrostatics remains unclear.

In vitro measurements confirmed that V266T and A102Y possess improved CO2 affinities (we were unable to purify A289C). This correspondence between \({\mathop{K}\limits^{ \sim }}_{{\rm{C}}}\) measured in vivo and KC measured in vitro stands in contrast to mutations with \({\widetilde{V}}_{\max }\), where follow-up biochemistry (Extended Data Fig. 8b and Supplementary Data File 1) did not show faster kcat values. Variants with improved \({\widetilde{V}}_{\max }\) were probably improved through higher protein expression. Unlike Vmax, the affinity parameter is independent of enzyme concentration so \({\widetilde{K}}_{{\rm{C}}}\) predictions are expected to be more accurate. V266T and A102Y both exhibit roughly proportional reductions in catalytic rate (Fig. 4c, Extended Data Fig. 9c and Extended Data Table 2). These mutations had no effect on CO2 versus O2 specificity (Extended Data Fig. 9a,c and Extended Data Table 2) indicating that the ‘cost’ of improved affinity is paid for in catalytic rate alone. A102Y had a reduced KM,RuBP, whereas that of V266T did not change from wild type. It is unclear what relationship, if any, there is in the shifts in KC and KM,RuBP. Overall, the kcat and KC measurements place these mutants outside the range heretofore measured among bacterial Form II variants and at the edge of the distribution of plants and algae.

Conclusion

Among the narrow range of sequences measured here, it was possible to identify mutants with substantially improved CO2 affinity, indicating that the enzyme parameter landscape is rugged, with apparent gain-of-function readily accessible. Form I plant rubiscos typically share less than 50% identity with Form II bacterial rubiscos (more than 200 mutations; Extended Data Fig. 8d) and are thought to have evolved under a different set of selective constraints. Furthermore, Form I and II rubiscos have different oligomeric states and Form II rubiscos lack the small subunit characteristic of Form I, so it is surprising that it is possible to traverse the functional space between them with just one amino acid change.

In this study, we were unable to account for two factors of metabolic flux through rubisco: protein expression and side-reactivity with oxygen. Fitness correlation with known kcat values (Fig. 1e and Extended Data Fig. 6a) and our in vitro measurements (Extended Data Fig. 6b) indicate that the data are predictive, even without knowledge of expression. However, mutations such as I164T cause differences in protein expression as a function of IPTG induction (Extended Data Fig. 2f,h) which has an effect on the relative growth rate as compared with wild type (Extended Data Fig. 2g). Indeed, when we examined mutations with fitness values higher than those of wild type, we observed a consistent regression in their kcat rates measured in vitro (Extended Data Fig. 6b, inset). We interpret this trend to indicate that some fraction of mutations have a small or no effect on kcat while modestly improving expression levels. Further work is required to measure and account for this effect16. The side reaction of rubisco is also important, as increasing the oxygen concentration from 10% to 20% causes Δrpi to decline in growth rate and yield (Extended Data Fig. 2e), presumably because of 2-phosphoglycolate production. The effect of oxygen on individual mutants may be determined through an oxygen titration and library selection.

In R.rubrum, the present-day sequence evolved under constraints that include endogenous regulation, environmental selective pressure and possible trade-offs between enzymatic parameters. Various trade-offs have been proposed in the catalytic mechanism of rubisco10,12, including one between catalytic rate and CO2 affinity11. The reductions in kcat (but not SC/O) observed in the mutants with the highest CO2 affinity is consistent with such a trade-off (Fig. 4c). A selection of a library of higher order mutants that spans a wider range of rubisco functional possibilities could confirm or reject a trade-off. The trade-offs in bacterial rubiscos may also constrain the evolution of plant rubiscos. However, previous work comparing the sequence-to-function map of related proteins found substantial context dependence on the effects of mutations18. Due to advancements in expressing plant rubiscos in E.coli31, it may be possible to use this assay to understand the biochemical constraints of the organisms responsible for nearly all of terrestrial photosynthesis1.

The overall space of rubiscos remains largely unexplored, raising the question of whether natural evolution has already produced rubiscos optimized for every environment. Δrpi may permit a higher throughput exploration of sequence space to find regions that are constrained by different trade-offs and produce substantial engineering improvements.

Methods

Strains

We cloned in a combination of E.coli TOP10 cells, DH5α and NEB Turbo cells. Protein expression was carried out using BL21(DE3). Δrpi was produced previously7 from the BW25113 strain. An rpiB knockout was obtained from the Keio collection. rpiA and the edd gene were knocked out through P1 transduction and subsequent curing of the kanamycin marker with pCP20 (ref. 32). The result of these three knockouts, ΔrpiABΔedd, was Δrpi. The EDD deletion makes the strain rubisco dependent when grown on gluconate—a feature we did not make use of in this study.

Plasmids

Sequences and further details about plasmids used in this study can be found in Supplementary Data 3.

pUC19_rbcL

The rubisco mutant library was assembled in a standard pUC19 vector. This plasmid was used as a PCR template for each of the 11 sub-library ligation destination sites.

NP-11-64-1

Selections were conducted using a plasmid designed for this study with a p15 origin, chloramphenicol resistance, LacI controlling rubisco expression, TetR controlling PRK expression and a barcode.

NP-11-63

Protein overexpression in BL21(DE3) cells was conducted using pET28 with a SUMO domain upstream of the expressed gene6. pSF1389 is the plasmid that expresses the necessary SUMOlase, bdSENP1, from Brachypodium distachyon.

Primers

All primers were purchased from IDT and the oligonucleotide pool was purchased from Twist Bioscience. For sequences, see Supplementary Data 3.

Library design and construction

The R.rubrum rubisco sequence was codon-optimized for E.coli and mutated systematically by means of the scheme outlined in Extended Data Fig. 3. The rubisco gene was split into 11 pieces. For each of those pieces (around 200 base pairs (bp) each) all point mutants were designed and synthesized as oligonucleotide pools. Eleven oligo sub-library pools, containing all single mutants within their respective region of around 200 bp, were purchased from Twist Bioscience and each sub-library was amplified individually using Kapa Hifi polymerase with a cycle number of 15. Each rubisco gene fragment was inserted into a corresponding linearized pUC19 destination vector, containing the remainder of the rubisco sequence flanking the insert, through golden gate assembly. This assembly generated 11 sub-libraries of the full-length R.rubrum rubisco gene, with each sub-library containing a region of approximately 200 bp including all single mutants. Each of these 11 rubisco libraries were transformed separately into E.coli TOP10 cells and, in each case, more than 10,000 transformants were scraped from agar plates to ensure oversampling of the roughly 1,000 variants in each sub-library. Plasmids were purified from each sub-library and mixed together at equal molar ratios to generate the full protein sequence library.

To produce the final library for assay, a selection plasmid containing an induction system for rubisco and PRK (Tac- and Tet-inducible, respectively) was amplified with primers that included a random 30 nucleotide barcode. The linearized plasmid amplicon and the library were cut with BsaI and BsmBI, respectively, ligated together and transformed into TOP10 cells. Plasmid was purified by scraping around 500,000 colonies and transformed in triplicate into Δrpi cells. These transformations were grown in 2× yeast extract tryptone medium to log phase (optical density (OD) = 0.6) and frozen as 25% glycerol stocks.

Bacterial growth analysis

Bacterial strains were grown overnight in 2× yeast extract tryptone medium to saturation and then backdiluted. Once cultures reached exponential growth (0.3 < OD600 < 0.8) they were diluted into 150 μl of M9 media in 96-well plates with 25 μg ml−1 chloramphenicol and the indicated concentrations of anhydrotetracycline and IPTG to a final OD600 of 0.005 or 0.0005. Growth was monitored in a Spark plate reader (Tecan) while maintaining 37 °C and the indicated O2 and CO2 concentrations. Shaking consisted of alternating 5 min of orbital and 5 min of double orbital modes and measurements were collected every 10 min. Growth yields were calculated up to 40 h and growth rates were calculated as the growth rate between OD600 values of 0.001 and 0.01 (the most consistently exponential range in our curves).

Long-read sequencing analysis

The plasmid library was cut with SacII and sent for Sequel II PacBio sequencing. Reads were aligned and grouped by their barcodes. All reads of a given barcode were aligned and a consensus sequence was obtained using SAMtools33. Consensus sequences were retained if they were WT or had one mutation that matched the designed library. Any mutation in the backbone invalidated a barcode. A lookup table was generated to link each barcode to its associated mutation. The data analysis methods described in this study are publicly available at https://github.com/SavageLab/rubiscodms.

Library characterization and screening

Selections were performed by diluting 200 μl of glycerol stock with an OD of around 0.25 into 5 ml of M9 minimal medium with added chloramphenicol (25 μg ml−1), glycerol (0.4%), 20 μM IPTG and 20 nM anhydrotetracycline. These cultures were grown in 11 ml culture tubes at 37 °C in a Percival AR-22 growth chamber at different CO2 concentrations on a New Brunswick Scientific Innova 2000 shaker at 250 rpm at an angle of 60°. Cultures were grown until they reached an OD at 5 ml of 1.2 ± 0.2. This corresponds to a 100-fold expansion of the cells, that is, between six and seven doublings.

Cultures before and after selection were spun down and we lysed the cells and performed a standard plasmid extraction protocol using QIAprep Spin Miniprep Kit (Qiagen). Illumina amplicons were generated by PCR of the barcode region. These amplicons were sequenced using a NextSeq P3 kit.

Calculation of variant enrichment

Variant enrichments were computed from the log ratio of barcode read counts. The enrichment calculations include two processing parameters: a minimum count threshold (cmin) and a pseudo-count constant (αp). The count threshold is the minimum number of barcode reads that must be observed either pre- or post-selection for the barcode to be included in the enrichment calculation. The pseudo-count constant is used to add a small positive value to each barcode count to circumvent division by zero errors. We use a pseudo-count value that is weighted by the total number of reads in each condition. For the jth variant and the individual barcodes, i, passing the threshold condition the variant enrichment is calculated as,

$${e}_{j}={\rm{median}}\left({\log }_{10}\left(\frac{{N}_{f,i}+{N}_{f,{\rm{tot}}}{\alpha }_{{\rm{p}}}}{{N}_{0,i}+{N}_{0,{\rm{tot}}}{\alpha }_{{\rm{p}}}}\right)-{\log }_{10}\left(\frac{{N}_{f,{\rm{tot}}}}{{N}_{0,{\rm{tot}}}}\right)\right)$$
(1)

To identify optimal values for these parameters, we computed the variant enrichments across a two-dimensional parameter sweep of cmin and αp to find the combination that resulted in the maximum mean Pearson correlation coefficient across all replicates at each condition. These were cmin = 5 and αp = 3.65 × 10−7 (average of 0.3 pseudo-counts after multiplying by the total number of reads in each experiment, N0,tot or Nf,tot), leading to a correlation coefficient of 0.978. Variant enrichment, ej, was then calculated for every mutant using equation (1).

The variant enrichments were then normalized such that wild type has an enrichment value of 1 in all conditions and catalytically dead mutants have a median enrichment of 0. For the ‘dead’ variant enrichment, we computed the median enrichment for all mutations at the catalytic positions K191, K166, K329, D193, E194 and H287. The normalized enrichments at each condition were computed as

$${e}_{j,{\rm{norm}}}=\frac{{e}_{j}-{\widetilde{e}}_{{\rm{dead}}}}{{e}_{{\rm{WT}}}-{\widetilde{e}}_{{\rm{dead}}}}$$
(2)

where ej is the enrichment of the jth variant as given in equation (1), eWT is the wild-type enrichment and \({\widetilde{e}}_{{\rm{dead}}}\) is the median enrichment across all mutants of the catalytic residues listed above.

Michaelis–Menten fits to enrichment data

The DMS library enrichments across different CO2 concentrations were used to estimate Michaelis–Menten kinetic parameters for every variant. Guided by the linear relationship between growth rate and kcat observed in Fig. 1e, we assume that the cell growth rate is proportional to the rubisco enzyme velocity to derive the CO2 titration fits (see ‘Derivation of Michaelis–Menten fit’, equation (S1))

$${e}_{{\rm{m}}{\rm{u}}{\rm{t}},{\rm{n}}{\rm{o}}{\rm{r}}{\rm{m}}}([{{\rm{C}}{\rm{O}}}_{2}])=\frac{{\mathop{V}\limits^{ \sim }}_{\text{max},{\rm{m}}{\rm{u}}{\rm{t}}}({\mathop{K}\limits^{ \sim }}_{{\rm{C}},{\rm{W}}{\rm{T}}}+[{{\rm{C}}{\rm{O}}}_{2}])}{{\mathop{V}\limits^{ \sim }}_{\text{max},{\rm{W}}{\rm{T}}}({\mathop{K}\limits^{ \sim }}_{{\rm{C}},{\rm{m}}{\rm{u}}{\rm{t}}}+[{{\rm{C}}{\rm{O}}}_{2}])}$$
(3)

\({\widetilde{V}}_{\text{max},{\rm{mut}}}/{\widetilde{V}}_{\text{max},{\rm{WT}}}\) is the ratio of mutant maximum velocity relative to wild type, \({\widetilde{K}}_{{\rm{C}},{\rm{WT}}}\) is the wild-type KC for which we used the value 149 μM, and \({\widetilde{K}}_{{\rm{C}},{\rm{mut}}}\) is the mutant KC. The titration curves in triplicate for each variant were fit to equation (3) using non-linear least squares curve fitting while requiring both Vmax and KC,mut to be positive.

We noted that the \({\widetilde{K}}_{{\rm{C}}}\) fits to certain variants—particularly ones with low \({\widetilde{V}}_{\max }\)—were sensitive to the choice of processing parameters cmin and αp. Given the semi-arbitrary nature of these parameters, this is clearly an undesirable dependence and engenders low confidence in the inferred \({\widetilde{K}}_{{\rm{C}}}\) values. To account for this uncertainty we conducted a parameter sweep (with 11 different cmin values linearly spaced between 0 and 50, and 10 αp values log spaced between 1 × 10−9 and 1 × 10−6), and computed the variant enrichments for all combinations of these parameters. We then performed ten subsamplings of the replicates for all parameter sets and performed the ratiometric Michaelis–Menten fit. From this set of 1,100 \({\widetilde{K}}_{{\rm{C}}}\) fit values for each variant we computed a quartile-based coefficient of variation that was used as a figure of merit for the \({\widetilde{K}}_{{\rm{C}}}\).

Multiple sequence alignment

A multiple sequence alignment (MSA) of the broader rubisco family beyond Form II rubiscos was created using the profile HMM homology search tool jackhmmer34. Starting with the R.rubrum rubisco sequence, jackhmmer applied five search iterations with a bit score threshold of 0.5 bits per residue against the UniRef100 database of non-redundant protein sequences35. To compute phylogenetic conservation at each position, for each possible amino acid we computed the fraction of the total sequences that had that amino acid at the corresponding position of the MSA. The phylogenetic conservation is the maximum fraction, where the maximum is taken over all possible amino acids. Thus, if a position has an alanine in 90% of the sequences of the MSA, the phylogenetic conservation will be 0.9.

Protein purification

E.coli BL21(DE3) cells were transformed with pET28 (encoding the desired rubisco with a 14× His and SUMO affinity tag) and pGro plasmids (Takara). Colonies were grown at 37 °C in 100 ml of 2× yeast extract tryptone medium under kanamycin selection (50 μg ml−1) to an OD of 0.3–1. Arabinose (1 mM) was added to each culture, which was then incubated at 16 °C for 30 min. Protein expression was induced with IPTG (Millipore) at 100 μM and cells were grown overnight at 18 °C. Cultures were spun down (15 min; 4,000g; 4 °C) and purified as reported6. In brief, cultures were spun down and lysed using BPER II (Thermo Fisher). Lysates were centrifuged to remove insoluble fraction. Rubisco was purified by His-tag purification using Ni-NTA resin (Thermo Fisher) and eluted by SUMO tag cleavage with bdSUMO protease (as produced in ref. 6). Purified proteins were concentrated and stored at 4 °C until kinetic measurement (within 24 h). Samples were resolved by SDS–polyacrylamide gel electrophoresis to ensure purity.

Rubisco spectrophotometric assay

Both kcat and KC measurements use the same coupled-enzyme mixture wherein the phosphorylation and subsequent reduction of 1,3-bisphosphoglycerate—the product of RuBP carboxylation—was coupled to NADH oxidation, which can be followed through 340-nm absorbance. Following Kubien et al.36 and Davidi et al.6, the reaction mixture (Extended Data Table 1) contains buffer at 100 mM, pH 8, 20 mM MgCl2, 0.5 mM dithiothreitol, 2 mM ATP, 10 mM creatine phosphate, 1.7 mM NADH, 1 mM EDTA and 20 U ml−1 each of phosphoglycerate kinase, glyceraldehyde-3-phosphate dehydrogenase and creatine phosphokinase. Reaction volumes are 150 μl and samples are shaken once before absorbance measurements begin. Absorbance measurements are collected on a SPARK plate reader with O2 and CO2 control (Tecan). The extinction coefficient of NADH in the plate reader was determined through a standard curve of NADH solutions of known concentration (determined by a Genesys 20 spectrophotometer with a standard 1-cm path length, Thermo Fisher). Absorbance decline over time gives a rate of NADH oxidation and therefore a carboxylation rate. Because rubisco produces two molecules of 3-phosphoglycerate for every carboxylation reaction, we assume a 2:1 ratio of NADH oxidation rate to carboxylation rate.

Spectrophotometric measurements of k cat

The carboxylation rate constant (kcat) of each rubisco was measured using methods established previously6. In brief, rubisco was activated by incubation for 15 min at room temperature with CO2 (4%) and O2 (0.4%) and added (final concentration of 80 nM) to aliquots of appropriately diluted assay mix (Extended Data Table 1) containing different 2-carboxy-d-arabinitol-1,5-bisphosphate (CABP) concentrations pre-equilibrated in a plate reader (Infinite 200 PRO; TECAN) at 30 °C, under the same gas concentrations. After 15 min, RuBP (final concentration of 1 mM) was added to the reaction mix and the absorbance at 340 nm was measured to quantify the carboxylation rates. A linear regression model was used to plot reaction rates as a function of CABP concentration. The kcat was calculated by dividing the y intercept (reaction rates) by the x intercept (concentration of active sites). Protein was purified in triplicate for kcat determination.

Spectrophotometric measurements of K C

Purified rubisco mutants were activated (40 mM bicarbonate and 20 mM MgCl2) and added to a 96-well plate along with assay mix (Extended Data Table 1, in this case the same concentration of HEPES pH 8 buffer was used but EPPS can be substituted). Bicarbonate was added for a range of concentrations (1.5, 2.5, 4.2, 7, 11.6, 19.4, 32.4, 54, 90 and 150 mM). Plates and RuBP were pre-equilibrated at 0.3% O2 and 0% CO2 at room temperature. RuBP was added to a final concentration of 1.25 mM with water serving as a control for each replicate. NADH oxidation was measured by A340 as in the kcat assay. Absorbance curves were analysed using a custom script to perform a hyper-parameter search to choose a square in which to take the slope as carboxylation rate that best represented most of the monotonic decrease in A340. KC was derived by fitting the Michaelis–Menten curve using a non-linear least squares method. Error bars were determined depending on replicates: (1) multi day replicates: Michaelis–Menten fits were made for each replicate, s.e. and median were calculated on the basis of these fits. (2) Triplicates: absorbance data were fit to extract initial rates using different hyper-parameters and the median of these fits was used subsequently. Three different sets of initial rates were calculated on the basis of the technical replicates: one based on the median absorbance values, one based on the median minus the s.d., and one based on the median plus the s.d. Michaelis–Menten fits to these three sets of calculated rates were made and error bars show the difference between the low boundary, median and high boundary set.

Spectrophotometric measurements of K M,RuBP

KM,RuBP was determined in a similar manner to kcat and KC. A titration of RuBP concentrations was used to generate rate-saturation curves under an atmosphere of 5% CO2 and 0.5% O2. Simple linear regression was used to fit the absorbance decays. KM,RuBP was derived by fitting the Michaelis–Menten curve using a non-linear least squares method. Error was determined from the square root of the diagonals of the covariance matrix during fitting. The values from spectrophotometric assays are reported in Fig. 3f and Extended Data Figs. 6b,d and 9b.

Radiometric measurements of K C and k cat

14CO2 fixation assays were conducted as in ref. 6 with minor modifications. Assay buffer (100 mM EPPS-NaOH pH 8, 20 mM MgCl2, 1 mM EDTA) was sparged with N2 gas. Rubisco, purified as described above, was diluted to around 10 μM (quantified using ultraviolet absorbance) in the assay buffer. It was then diluted with one volume of assay buffer containing 40 mM NaH14CO3 to activate. Reactions (0.5 ml) were conducted at 25 °C in 7.7-ml septum-capped glass scintillation vials (Perkin-Elmer) with 100 μg ml−1 carbonic anhydrase, 1 mM RuBP and NaH14CO3 concentrations ranging from 0.4 to 17 mM (which corresponds to 15–215 μM CO2). The assay was initiated by the addition of a 20-μl aliquot of activated rubisco and stopped after 2 min by the addition of 200 μl 50% (v/v) formic acid.

The specific activity of 14CO2 was measured by performing a 1-h assay at the highest 14CO2 concentration containing 10 nmol of RuBP. Reactions were dried on a heat block, resuspended in 1 ml water and mixed with 3 ml Ultima Gold XR scintillant for quantification with a Hidex scintillation counter.

The rubisco active site concentration used in each assay was quantified in duplicate by a [14C]-2-CABP binding assay. A 10-μl sample of the roughly 10 μM rubisco solution was activated in assay buffer containing 40 mM cold NaHCO3 (final volume 100 μl) for at least 10 min. Then, 1.5 μl of 1.8 mM 14C-carboxypentitol bisphosphate was added and incubated for at least 1 h at 25 °C. [14C]-2-CABP bound rubisco was separated from free [14C]-2-CPBP by size exclusion chromatography (Sephadex G-50 Fine, GE Healthcare) and quantified by scintillation counting.

The data were fit to the Michaelis–Menten equation using the concatenated data of three to four experiments performed on different days. This assay was used to determine the values in Fig. 4c and Extended Data Fig. 9c.

Membrane inlet mass spectrometry determination of rubisco specificity

The method described in ref. 37 was adapted for a membrane inlet mass spectrometry (MIMS) instrument (Bay Instruments). The O2 ion signal was calibrated by measuring the 32 m/z ion at atmospheric O2 and at ‘zero’ O2. An atmospheric O2 calibrant was achieved by equilibrating the MIMS buffer (200 mM Hepes pH 8, 100 mM NaCl, 20 mM MgCl2) with air for 1 h at 25 °C. The ‘zero’ O2 ion signal was determined by then adding approximately 5 mg Na2S2O6 to the cuvette. CO2 was calibrated by adding various amounts of NaHCO3 to a solution of 100 mM HCl and recording the 44 m/z ion signal. In both cases, linear fits of ion counts to gas concentrations provided a simple conversion to determine gas concentrations and consumption rates. These calibrations had to be performed on every day in which the assay was used.

Rubisco enzymes were activated in 20 mM Hepes pH 8, 100 mM NaCl, 20 mM MgCl2 and 20 mM NaHCO3. Activated enzyme was added to 630 μl of MIMS buffer equilibrated with air at a concentration of 1.2 μM. Bovine carbonic anhydrase (Sigma Aldrich) was added at a final concentration of 0.3 mg ml−1 and NaHCO3 was added to a final concentration of 4 mM. The reaction was stirred in the sealed MIMS reaction chamber for approximately 2 min to collect a pre-reaction signal. The reaction was initiated by the addition of 2 mM RuBP. O2 and CO2 consumption rates were background corrected and converted to reaction velocities through conversion using the coefficients determined during calibration. Specificities were determined in triplicate by the following equation: SC/O = νC[O2]/νO[DIC], where DIC is the dissolved inorganic carbon pool.

Quantification of soluble enzyme concentration by immunoblotting

The Δrpi strain with wild-type rubisco was grown under selective conditions (overnight at 37 °C in M9 medium with 0.4% glycerol and 20 nM aTc) with varying IPTG concentrations at 5% CO2 for 24 h. Afterwards, turbid cultures were centrifuged (10 min; 4,000g; 4 °C) culminating in roughly 20 mg pellet per sample. Pellets were lysed with 200 μl of BPER II and supernatant was transferred into a fresh tube and mixed with SDS loading dye. A Bio-Rad RTA Transfer Kit for Trans-Blot Turbo Low Fluorescence PVDF was used in combination with the Trans-Blot Turbo Transfer System. The PVDF membrane was carefully cut between 50 and 70 kDa post-blocking using a razor blade. Primary anti-RbcL II Rubisco large subunit Form II Antibody from Agrisera (1:10,000) and DnaK Antibody from Abcam (1:5,000) were incubated separately. Secondary horseradish-peroxidase-conjugated antibodies Donkey anti-mouse for DnaK (Santa Cruz Biotechnology) and Goat pAB to RB IgG horseradish peroxidase (Abcam were both used at 1:10,000). Subsequently, Bio-Rad Clarity Max Western ECL substrates were applied and the final results were imaged using a GelDoc (Bio-Rad).

Mutant fitness outlier

I190T (Fig. 1e) was the only outlier in our comparison of in vitro kcat measurements from the literature and our fitness data. Because the value was reported without error estimates38, we re-measured the kcat of this mutant and found it to be 4.24 s−1, which is 52% of the wild-type value, down from 80% previously reported. Still, the value seems to be anomalous compared with the rest of the trend (Extended Data Fig. 6b). One potential explanation is that the mutation at that position has a strong negative effect on protein expression. Another possibility, given that I190T is adjacent to the key active site lysine, K191, is that I190T causes a negative effect on lysine carbamylation that is, for some reason, more pronounced in vivo than in vitro.

Derivation of Michaelis–Menten fit

Following Stiffler et al.20 we assume that the differences in bacterial growth rate are proportional to the differences in growth-limiting enzymatic activity.

$${\mu }_{{\rm{mut}}}-{\mu }_{{\rm{WT}}}\propto {v}_{{\rm{mut}}}^{ru}-{v}_{{\rm{WT}}}^{ru}$$
(S1)

Under the presumption of log-phase growth, the expected log ratio of reads after elapsed time t and normalized to the wild-type reads is given by

$${e}_{{\rm{mut}}}={\log }_{10}\left(\frac{{N}_{{\rm{mut}},f}}{{N}_{{\rm{mut}},0}}\right)-{\log }_{10}\left(\frac{{N}_{{\rm{WT}},f}}{{N}_{{\rm{WT,}}0}}\right)$$
(S2)

(Note that equation (S2) would also contain a normalization factor to account for the total number of reads obtained for the pre- and post-selection conditions. It is, however, a common factor for both the mutant and wild-type counts and therefore cancels out. Furthermore, the real analysis also includes pseudo-counts, which are omitted here in the derivation of the fit equation for simplicity. Substituting in the condition of exponential growth, that is, \({N}_{i,f}={N}_{i,0}\,{e}^{{\mu }_{i}{t}}\), and simplifying yields,

$${e}_{{\rm{mut}}}=\frac{t}{{\rm{ln}}10}({\mu }_{{\rm{mut}}}-{\mu }_{{\rm{WT}}})$$
(S3)

To normalize the enrichments, we divide by the log enrichment of the wild-type counts relative to the median enrichment of variants with mutated catalytic residues (and thus catalytically dead rubisco). We then add one for the convention that dead variants be centred at an enrichment of 0 and that wild-type be at an enrichment of 1. Thus, the normalized mutant enrichment is,

$${e}_{{\rm{mut}},{\rm{norm}}}=\frac{{\log }_{10}\left(\frac{{N}_{{\rm{mut}},f}}{{N}_{{\rm{mut}},0}}\right)-{\log }_{10}\left(\frac{{N}_{{\rm{WT}},f}}{{N}_{{\rm{WT}},0}}\right)}{{\log }_{10}\left(\frac{{N}_{{\rm{WT}},f}}{{N}_{{\rm{WT}},0}}\right)-\left\langle {\log }_{10}\left(\frac{{N}_{{\rm{dead}},f}}{{N}_{{\rm{dead}},0}}\right)\right\rangle }+1$$
(S4)

Then substituting equation (S3) we obtain,

$${e}_{{\rm{mut,norm}}}=\frac{{\mu }_{{\rm{mut}}}-{\mu }_{{\rm{WT}}}}{{\mu }_{{\rm{WT}}}-\underline{{\mu }_{{\rm{dead}}}}}+1$$
(S5)

Using the assumption in equation (S1) and the fact that the enzyme velocity of dead mutants is 0, we obtain the expected normalized enrichment as a function of the rubisco velocities,

$${e}_{{\rm{mut}},{\rm{norm}}}=\frac{{v}_{{\rm{mut}}}}{{v}_{{\rm{WT}}}}$$
(S6)

Finally, using the Michaelis–Menten equation we obtain the predicted enrichments as a function of CO2 concentration and the enzyme kinetic parameters.

$${e}_{{\rm{mut}},{\rm{norm}}}([{{\rm{CO}}}_{2}])=\frac{{V}_{\max ,{\rm{mut}}}({K}_{{\rm{M}},{\rm{WT}}}+[{{\rm{CO}}}_{2}])}{{V}_{\max ,{\rm{WT}}}({K}_{{\rm{M}},{\rm{mut}}}+[{{\rm{CO}}}_{2}])}$$
(S7)

Thus, in practice, we use equation (S7) as the fit equation to the normalized enrichment values for each variant across a range of CO2 concentrations. For each we have, as fit parameters, the ratio of maximum velocities between the mutant and wild type, Vmax,mut/Vmax,wt, and the mutant KC with the wild-type KC set to the literature value of 149 μM.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.