Main

Genome editing technologies, including those derived from prokaryotic CRISPR–Cas systems, have revolutionized life science research and are poised to transform medicine and agriculture. Single-protein CRISPR–Cas effectors, including the widely adopted Cas9 nuclease from Streptococcus pyogenes (SpCas9), have been used in biotechnology owing to their simplicity, robustness and compact form. To diversify the CRISPR toolbox and expand editing capabilities, new systems have been mined across diverse microbial and viral genomes. Although these systems have been sought for specific properties, such as small size or extended protein stability in biofluids3,4, they typically exhibit tradeoffs in critical attributes such as basal activity in target cells, protospacer-adjacent motif (PAM) selectivity, thermal optima or in vitro biochemical properties, ultimately limiting their reach4,5,6,7.

Repurposed CRISPR systems have been optimized for biotechnology using a range of protein engineering approaches, including directed evolution and structure-guided mutagenesis. Directed evolution of CRISPR–Cas proteins has proven extremely powerful yet can be limited by the rugged and non-convex nature of the fitness landscapes8,9,10,11, along with the difficulty of implementing selection-based screening in human cells. Structure-guided rational mutagenesis offers an alternative or synergistic approach that has proven successful for improving Cas9 basal activity and specificity in human cells12,13,14,15. Similar results may be achievable with structure-conditioned protein sequence design models16,17. However, both of these approaches depend on explicit structural hypotheses, either in the form of mechanistic understanding for rational mutagenesis or structures representing key functional states for computational design, which are difficult to obtain for functions more complex than simple binding interactions.

Protein language models eschew explicit structural hypotheses and instead learn the co-evolutionary blueprint underlying protein function18. When pretrained on large sets of diverse protein sequences, language models (LMs) learn to represent structure and function without supervision2,19. Subsequent fine-tuning of these models yields family-specific specialists that generate new proteins adhering to the functional constraints of their family yet diverging substantially in sequence space. This approach has been validated through the design of functional lysozymes20 and demonstrated in silico across several families21. Related work has also shown that co-evolutionary models22,23,24 of individual protein families can be used to design new, highly active sequences23. Compared to protein LMs, these family-specific models are less computationally expensive to train but are typically restricted to pairwise coupling terms (limiting their expressiveness) and do not leverage sequences from other protein families. Despite the considerable utility of such sequence-based approaches, it remains to be seen how well either strategy performs for protein families with several complex functions, such as CRISPR–Cas effectors.

In this work, we demonstrate that LMs can effectively generate diverse CRISPR–Cas proteins spanning a broad set of families. Moreover, we demonstrate that generated type II effector proteins can assemble as functional gene editors in human cells, despite being hundreds of mutations away from any known natural protein. We perform extensive characterization of one exemplar editor, which we denote OpenCRISPR-1, and show that it is highly functional and specific.

LMs generate diverse CRISPR–Cas proteins

Generative protein LMs are typically pretrained on large datasets of natural protein sequences spanning diverse phylogenies and functions2. These models are capable of generating realistic protein sequences that reflect the distribution and properties of natural proteins25. However, for specific applications, such as the generation of gene editors, it is necessary to steer generation towards particular subsets of protein families of interest (Fig. 1a).

Fig. 1: Generation of diverse Cas protein families.

a, Overview of the language-modelling approach to design CRISPR–Cas systems. LMs learn the general constraints of protein evolution through pretraining on diverse proteins spanning the evolutionary tree and then are specialized for design by fine-tuning on Cas protein and nucleic acid data. b, Expansion of the sequence diversity for 45 Cas protein families, measured by the number of clusters (at 70% sequence identity (70%ID)) for natural proteins and clusters from generated sequences. Stacked bars are coloured by the source of the sequences making up their clusters (CRISPR–Cas Atlas, recovered from CRISPR–Cas mining; generated Cas, 4 million generated proteins from this study). Heatmap indicates the natural distribution of each protein family across different types of CRISPR–Cas systems. c, AlphaFold2 was used to predict structures for 2,000 randomly selected generated proteins. The scatterplot shows the distribution of mean pLDDT and the %ID to natural proteins from the CRISPR–Cas Atlas.

To this end, we performed exhaustive data mining to construct a dataset of curated CRISPR operons, including Cas proteins, CRISPR arrays, trans-activating CRISPR RNAs (tracrRNAs) and PAMs (Extended Data Fig. 1). We refer to this resource as the CRISPR–Cas Atlas. Using a custom pipeline, we searched 26.2 terabases of assembled microbial genomes and metagenomes, spanning diverse phyla and biomes, to uncover 1,246,088 CRISPR–Cas operons, including more than 389,000 single-effector systems classified as type II, type V or type VI. Our resource displayed expanded natural diversity compared to curated databases such as CRISPRCasDB and CasPDB (Extended Data Fig. 1e), as well as UniProt, the world’s largest protein resource. Across all Cas families, the CRISPR–Cas Atlas has on average 2.7× more protein clusters than UniProt using a 70% sequence identity (%ID) clustering threshold and even greater expansions for families such as Cas9 (4.1×), Cas12a (6.7×) and Cas13 (7.1×).

To generate new CRISPR–Cas proteins, we fine-tuned the ProGen2-base LM2 on the CRISPR–Cas Atlas, balancing for protein family representation and sequence cluster size (Fig. 1a). From this model, we generated 4 million sequences; half were generated directly from the model (unconditional), whereas the other half were prompted with up to 50 residues from the N or C terminus of a natural protein to guide generation towards a particular family (conditional). After strict filtering and sequence clustering (Supplementary Fig. 1), we found that the generated sequences represented a 4.8-fold expansion of diversity compared to natural proteins from the CRISPR–Cas Atlas (Fig. 1b). For families with few natural proteins, such as Cas13 and Cas12a, generated sequences represent an 8.4- and 6.2-fold increase in diversity, respectively. Among the sequences guided towards a particular family, we typically observed near-perfect adherence to the family of interest with 50 or fewer residues provided, indicating that generation is steerable with minimal context. Given the conservative nature of the filters applied to the generated sequences, we expect that the reported increases in diversity represent a lower bound; however, as the generated sequences diverge further from natural examples, detecting realistic proteins becomes more difficult.

We next evaluated the rate of new cluster generation by the model with respect to the number of sequences sampled. For each family, we counted the number of distinct clusters in subsets ranging from one sequence to the full set (Supplementary Fig. 2). In general, the rate of new cluster generation for generated sequences significantly outpaced that of natural sequences. Among the generated sequences for each family that survived filtering, the identity to the nearest natural protein was typically between 40% and 60% (Fig. 1c). To investigate the novelty of these generated sequences, we calculated the cumulative sequence identity explained by increasing numbers of natural proteins (Extended Data Fig. 2). Compared to natural CRISPR–Cas proteins, the generated sequences displayed similar levels of chimerism, indicating that LMs produce sequences with novelty akin to that of evolution. Despite considerable deviation in sequence space, the generated proteins were confidently predicted by AlphaFold2, with 81.65% of structures having a mean predicted local-distance difference test (pLDDT) above 80, although AlphaFold2 is known to fold non-functional proteins with high confidence as well26. For a small number of sequences, we observed high similarity to natural proteins in the CRISPR–Cas Atlas but low structure prediction confidence, owing to limited homology in the ColabFold sequence database used for predictions (Fig. 1c). Finally, we investigated the structural composition of the generated sequences and found that many were predicted to adopt folds highly similar to natural proteins from the same family (Extended Data Fig. 3), indicating they may be functional.
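
The cluster-accumulation analysis described above (counting distinct clusters in subsets of increasing size) can be illustrated with a minimal sketch. It assumes that cluster labels have already been assigned to each sequence, for example at 70%ID; the function name, the random subsampling scheme and the toy labels are hypothetical and are not taken from the pipeline used in this study.

```python
import random

def cluster_accumulation(cluster_ids, subset_sizes, seed=0):
    """Count distinct clusters in nested random subsets of increasing size.

    cluster_ids: one precomputed cluster label per sequence (assumption:
    clustering, e.g. at 70% sequence identity, was done beforehand).
    """
    rng = random.Random(seed)
    shuffled = list(cluster_ids)
    rng.shuffle(shuffled)
    # Each larger subset extends the smaller one, so counts never decrease.
    return [(n, len(set(shuffled[:n]))) for n in subset_sizes]

# Toy labels (hypothetical): 8 sequences falling into 4 clusters.
labels = ["c1", "c1", "c2", "c3", "c3", "c3", "c4", "c1"]
curve = cluster_accumulation(labels, subset_sizes=[1, 4, 8])
```

Plotting the resulting (subset size, cluster count) pairs for natural and generated sequences separately would give curves of the kind compared in the text; a curve that is still rising at the full set size suggests that sampling has not saturated.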

LMs generate diverse type II effectors

Although many CRISPR–Cas proteins have been leveraged for genome editing27,28, Cas9 remains the most widely used. To generate new Cas9-like sequences, we prompted the CRISPR–Cas model with 50 residues from the N or C terminus of Cas9s sampled from the CRISPR–Cas Atlas. However, only 27.6% of these prompted generations survived our strict sequence viability filters. To more efficiently and accurately generate viable Cas9-like sequences, we fine-tuned another LM using only the 238,913 Cas9 sequences from the CRISPR–Cas Atlas (Fig. 1a and Extended Data Fig. 1). This model produced viable Cas9-like sequences at twice the rate of the CRISPR–Cas model (54.2%; Supplementary Fig. 3a) and did not require any prompting.

To explore the latent sequence distribution of type II effectors, we used the Cas9 model to generate 1 million Cas9 proteins. The resulting viable generations (n = 542,042) were clustered together with natural Cas9s at 40%ID and used as input to construct a maximum-likelihood phylogenetic tree (Fig. 2a). The resulting landscape was dominated by generated proteins, which made up 94.1% of the total phylogenetic diversity (as measured by cumulative branch length) and resulted in a 10.3-fold increase in diversity relative to the entire CRISPR–Cas Atlas (Fig. 2b). New phylogenetic groups were distributed across the tree, indicating that the model has captured the known natural diversity of Cas9 and is not overfitting to any particular lineage. Generated sequences diverged from the CRISPR–Cas Atlas, with an average identity of only 56.8% to any natural sequence (Fig. 2c). Further, we found that the generated proteins displayed cumulative identity trends similar to natural Cas9s (Extended Data Fig. 4), indicating that the chimeric novelty produced by the LM was similar in form to what would be expected from discovery of new natural Cas9s. Although the number of generated proteins is large, the number of new clusters does not seem to have saturated (Supplementary Fig. 3b), indicating that many more Cas9-like proteins could be generated. Next, we analysed the phylogenetic distribution of Cas9 orthologues that have been biochemically characterized7 or used as genome editors29. We observed considerable diversity of generated proteins in the vicinity of these characterized orthologues, indicating that the model is capable of generating proteins with a variety of functional properties.
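
As a rough illustration of the branch-length accounting behind these figures, the sketch below splits the total branch length of a rooted tree into branches that lead to at least one natural protein and branches whose descendants are all generated. The dictionary-based tree representation, the function name and the toy tree are assumptions made for this example; the exact procedure used to compute phylogenetic diversity in the study may differ.

```python
def diversity_by_source(parent, length, leaf_source):
    """Split total branch length by whether a branch has any natural descendant.

    parent: {node: parent_node}, with the root absent from the keys.
    length: {node: length of the branch above that node}.
    leaf_source: {leaf: "natural" or "generated"}.
    """
    has_natural = set()
    for leaf, source in leaf_source.items():
        if source != "natural":
            continue
        node = leaf
        # Walk towards the root, stopping early on already-marked branches.
        while node in parent and node not in has_natural:
            has_natural.add(node)
            node = parent[node]
    natural = sum(l for n, l in length.items() if n in has_natural)
    generated = sum(l for n, l in length.items() if n not in has_natural)
    return natural, generated

# Toy tree (hypothetical): root -> A -> (n1, g1), root -> B -> (g2, g3).
parent = {"A": "root", "B": "root", "n1": "A", "g1": "A", "g2": "B", "g3": "B"}
length = {"A": 1.0, "B": 2.0, "n1": 0.5, "g1": 0.5, "g2": 1.0, "g3": 1.0}
source = {"n1": "natural", "g1": "generated", "g2": "generated", "g3": "generated"}
natural_pd, generated_pd = diversity_by_source(parent, length, source)
generated_fraction = generated_pd / (natural_pd + generated_pd)
```

In this toy tree the generated-only branches account for 4.5 of 6.0 units of branch length, that is, 75% of the total.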

Fig. 2: LMs generate complete type II effector systems.

a, Phylogenetic tree of natural and generated proteins clustered at 40%ID (n = 15,340 cluster representatives). Biochemically characterized Cas9s from ref. 7 are labelled, and Cas9 proteins used as genome editors are shown in bold29. Lineages are coloured black if they contain any natural protein or green if they are exclusively represented by generated proteins. b, Pie chart indicates the percent of phylogenetic diversity represented by natural or generated proteins. Phylogenetic diversity was calculated as the cumulative branch length of subtrees represented by a given set of sequences. c, Distribution of the identity of generated Cas9 to the nearest protein in the CRISPR–Cas Atlas. d, Comparison of protein length between natural and generated proteins in the same 50%ID clusters. e, Fraction of generated and natural Cas9 proteins containing key functional domains according to structural searches with Foldseek against SCOPe families. In total, 79.2% and 48.2% of natural and generated proteins were functionally complete, respectively. f, Predicted structure for new Cas9-like protein selected from a 30%ID cluster with 423 members composed entirely of generated sequences. Despite high sequence novelty (39.2%ID to CRISPR–Cas Atlas), the predicted structure bears structural resemblance to Nme1Cas9 (Protein Data Bank ID 6JE9, template modelling score (TM-score) = 0.72). g, Naturally occurring and generated crRNAs and tracrRNAs were obtained for a set of ten effector proteins. h, sgRNAs were formed from RNA components and embedded into a two-dimensional space by t-distributed stochastic neighbour embedding32 according to the pairwise edit distances. Each point represents an sgRNA sequence, with colours corresponding to source protein. Tree scale bar, 1.0.

To further assess the viability of the generated proteins, we compared the sequence lengths of generated and natural sequences (Fig. 2d and Supplementary Fig. 3c). Overall, generated sequences closely matched the length of natural proteins from the same protein cluster, with a Pearson correlation of 0.97 (Fig. 2d). To assess the structural viability of generated Cas9-like proteins, we used AlphaFold2 (ref. 30) to predict the structures of 5,000 generated and 5,000 natural sequences, with one sequence each being selected randomly from the largest 70%ID protein clusters among each group. Most structures were predicted confidently (with 99.4% having mean pLDDT above 80), including for Cas9-like proteins with as low as 60%ID to any natural protein (Extended Data Fig. 5a), and showed significant overlap with experimentally determined structures from the Protein Data Bank (Extended Data Fig. 5b). By aligning the structures against curated families from the SCOPe database31, we confirmed the presence of core Cas9 domains in most generated proteins and at a similar rate as naturals (Fig. 2e). This included the HNH and RuvC nuclease domains (100% and 52.1%, respectively), which are responsible for DNA cleavage, as well as the PAM-interacting domain (PID; 92.9%) and target recognition (REC) lobe (99.9%) (Fig. 2e). Although RuvC is a conserved and essential Cas9 domain, its detection may be underestimated because of difficulty in identifying its short, split subdomains. This structural and functional completeness extended to even the most divergent proteins, including a subset that belonged to 30%ID clusters composed entirely of generated proteins. One such protein had a predicted structure resembling Nme1Cas9 (template modelling score 0.71) despite sharing only 25.8%ID to Nme1Cas9 and a maximum of 39.2%ID to any protein in the CRISPR–Cas Atlas (Fig. 2f).

Type II systems are also dependent on a guide RNA (gRNA) that is required for target recognition and cleavage. The gRNA is composed of a targeting RNA sequence (spacer), CRISPR RNA (crRNA) repeat and tracrRNA. The tracrRNA and crRNA components are typically derived from natural systems, and the spacer sequence is programmed to match the target DNA site for gene editing applications. From the CRISPR–Cas Atlas, we collected 112,212 type II effector proteins for which we could confidently identify, orient and align the corresponding crRNA and tracrRNA sequences. These data were used to train a sequence-to-sequence gRNA model that conditionally generates crRNA and tracrRNA sequences for a given protein (Fig. 1a). As an initial validation of the gRNA model, we designed ten gRNAs for a set of effector proteins previously used as genome editors (Fig. 2g). Each of the designed gRNAs, as well as the natural gRNAs from metagenome mining, were formatted into single-guide RNAs (sgRNAs) and embedded according to their pairwise edit distances with t-distributed stochastic neighbour embedding32. We observed that the model-designed sgRNAs were most similar to the naturally derived sgRNAs for each protein (Fig. 2h). As further validation, we found crRNA:tracrRNA pairs often formed the canonical duplex (Supplementary Fig. 4) and that the model could accurately predict the compatibility of sgRNAs between diverse Cas9 orthologues (Extended Data Fig. 6). Together, these results indicate that the model can be used to generate functional sgRNAs for generated Cas9-like proteins.
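
The pairwise edit distances used for the sgRNA embedding can be computed with the standard dynamic-programming algorithm for Levenshtein distance. The following is a minimal sketch; the toy RNA strings are hypothetical, and the resulting matrix is only the input that a t-SNE implementation accepting precomputed distances would require, not a reproduction of the analysis in this study.

```python
def levenshtein(a, b):
    """Minimal number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def pairwise_distances(seqs):
    """Symmetric matrix of edit distances between all pairs of sequences."""
    return [[levenshtein(s, t) for t in seqs] for s in seqs]

# Toy sequences (hypothetical), not real sgRNA scaffolds.
toy = ["GUUUUAGAGCUA", "GUUUUAGAGCUG", "GUUGUAGCUCCC"]
matrix = pairwise_distances(toy)
```

The row-by-row formulation keeps memory proportional to the length of the shorter dimension, which matters little for sgRNAs but helps when the same function is applied to full-length proteins.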

Designed editors function in human cells

To generate Cas9-like proteins for experimental characterization, we used a constrained generation strategy wherein we prompted the LM fine-tuned on Cas9 proteins using either the N-terminal segment or the C-terminal PID of SpCas9 (Supplementary Fig. 5). In doing so, we reasoned that the generated sequences would maintain compatibility with both the PAM and sgRNA preferences of SpCas9, facilitating direct comparison of protein activity across the same genomic targets with the same sgRNA. In total, we generated 200,000 and 150,000 Cas9-like proteins prompted by the N-terminal segments and C-terminal PIDs, respectively. From this set, we selected 82 SpCas9 PID-conditioned sequences and 127 fully generated sequences, created by combining our generated N-terminal segments and PID domains, on the basis of an average of pretrained and Cas9-fine-tuned LM log likelihoods and auxiliary predictors of compatibility with SpCas9’s PAM and tracrRNA. In total, we selected 209 Cas9-like proteins for subsequent functional analysis in human cells (Fig. 3a).
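
The likelihood-based part of this selection can be sketched as a simple ranking by the mean of two scores. The example assumes that a log likelihood from the pretrained model and one from the Cas9-fine-tuned model are already available for each candidate; the function name and toy values are hypothetical, and the auxiliary PAM and tracrRNA compatibility predictors mentioned above are deliberately left out.

```python
def rank_by_mean_log_likelihood(scores, top_k):
    """Rank candidates by the average of two log likelihoods (higher is better).

    scores: {name: (pretrained_ll, finetuned_ll)} with precomputed values.
    """
    averaged = {name: (a + b) / 2 for name, (a, b) in scores.items()}
    ranked = sorted(averaged, key=averaged.get, reverse=True)
    return ranked[:top_k]

# Toy scores (hypothetical), not values from the study.
toy_scores = {
    "seq_a": (-1.2, -0.8),
    "seq_b": (-0.9, -0.7),
    "seq_c": (-2.0, -1.5),
}
selected = rank_by_mean_log_likelihood(toy_scores, top_k=2)
```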

Fig. 3: Generated nucleases function as gene editors in human cells.

a, Phylogenetic tree of natural Cas9 proteins, ancestral reconstructions and generated effector proteins near SpCas9. Annotations surrounding the tree indicate selection criteria used to identify 48 generated proteins for further characterization. b, Editing efficiency (indel rate relative to SpCas9) of 209 generated proteins across three target sites: HEK3 (i), HEK2 (ii) and CD3G_1 (iii). Sequences are ordered according to relative indel rates, with the number of sequences showing activity and surpassing SpCas9 indicated on the x axis. c, Mutational Levenshtein distances from the nearest natural protein in the CRISPR–Cas Atlas and SpCas9 for 131 generated proteins with observed editing activity. The Levenshtein distance is the minimal number of edits between two sequences, including substitutions, insertions and deletions. d, On- and off-target editing efficiency for SpCas9 and 48 generated proteins. Points correspond to on- or off-target editing at five sites (AAVs1, FANCF, HEK2, HEK3, VEGFA; with three off targets per site). Bars reflect the median of all on- and off-target editing. e, On- and off-target editing efficiency for natural Cas9s, high-fidelity variants, chimeric sequences, consensus designs, ancestral reconstructions (rec.), HMM emissions, arDCA designs, LigandMPNN designs and generated proteins from this work. Each point represents the average on- or off-target editing at five sites (with three off targets per site) for a single protein. f, Genome-wide off-target analysis using SITE-Seq, measured at four enzyme concentrations. Points represent the percentage of total cleavage events for each guide that occurred at on-target sites. Bars represent the median across sites. Tree scale bar, 1.0.

We next set out to explore whether our Cas9-like proteins were able to mediate genome editing in human cells. We tested 209 sequences by cotransfecting HEK293T cells with nuclease plasmids and SpCas9 sgRNA plasmids, targeting one of three previously characterized target sites and inferred DNA repair outcomes after three days using a Sanger-sequencing-based method33. Across all three sites, we observed a wide range of editing efficiencies, with a subset of Cas9-like proteins showing activity on par with or higher than SpCas9 (Fig. 3b). In contrast with the edit distance to SpCas9, LM scores were highly predictive of enzyme activity, separating active and inactive enzymes at the HEK3 target site with an area under the receiver operating characteristic curve value of 0.83 (Supplementary Fig. 6). Among the set of active nucleases, we observed significant sequence deviation from SpCas9 and the nearest natural proteins in the CRISPR–Cas Atlas (Fig. 3c).
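
For readers unfamiliar with the metric, the area under the receiver operating characteristic curve equals the probability that a randomly chosen active enzyme receives a higher score than a randomly chosen inactive one. The sketch below computes it directly from that definition; the scores and activity labels are hypothetical toy values, not data from this study.

```python
def auroc(scores, labels):
    """Probability that a random active example outscores a random inactive one.

    Ties between an active and an inactive score count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy LM scores and activity calls (hypothetical).
lm_scores = [-0.5, -0.7, -0.9, -1.1, -1.4]
active = [True, True, False, True, False]
value = auroc(lm_scores, active)
```

A value of 0.5 corresponds to a score that carries no information about activity, and 1.0 to perfect separation of active and inactive enzymes.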

For further characterization, we next selected 48 Cas9-like proteins that were fully generated (both N- and C-terminal domains), displayed moderate or high insertion and deletion (indel) rates across one or more sites and were substantially different from any natural or engineered enzyme in patent databases (lens.org) (Fig. 3a). This set comprised five distinct clusters at 90%ID, with each protein sharing 77.5–87.1%ID to the nearest natural Cas9. In each cluster, the generated proteins were between one and 40 mutations from any other (median seven mutations). We assayed the editing efficiency and specificity of these 48 proteins across a panel of previously characterized SpCas9 on- and off-target sites (n = 5 and n = 15, respectively)34,35 alongside SpCas9. Nuclease- and sgRNA-expressing plasmids were cotransfected in HEK293T cells, and DNA repair outcomes were measured after three days using next-generation sequencing of amplicons (NGS). We observed both high editing efficiency and specificity with many of our generated nucleases (Fig. 3d), with some even outperforming SpCas9.

To contextualize the performance of our language-modelling strategy for designing Cas9-like proteins, we also tested a set of natural and designed proteins with similar levels of sequence novelty with respect to SpCas9 (Supplementary Fig. 7). Natural sequences (ten total) with between 57%ID and 71%ID to SpCas9 were identified from the CRISPR–Cas Atlas and UniRef100. For alternative design approaches, we considered sequences from evolutionary methods (14 total from consensus sequence design, hidden Markov model (HMM) emission36 and arDCA24) and structure-based methods (five from LigandMPNN17; Supplementary Fig. 8). Finally, we included nine sequences from the literature, including four high-fidelity SpCas9 variants37 and five SpCas9 ancestral reconstructions38. To facilitate testing at common NGG PAM target sites, we replaced the PID of all comparison sequences with that of SpCas9 except literature-reported variants, which already target NGG PAMs. These 43 sequences were tested alongside five generated proteins at the same set of on- and off-target sites as the broader panel of generated proteins (Fig. 3e and Extended Data Fig. 7). For natural proteins, we observed a range of activity levels, with most being considerably less active than SpCas9. One natural protein, Streptococcus uberis Cas9, displayed high on-target activity and reduced off targets at these sites, reflecting the practical utility of mining natural sequence diversity to uncover highly functional nucleases. Among the evolutionary-based design strategies, we observed a general trend towards lower success rates as methods became more expressive. The most highly active sequences came from consensus design, followed by ancestral reconstruction and chimeric design, whereas sequences from HMM emission and arDCA were largely inactive. Our language-modelling-based strategy stands in contrast to this trend, yielding many highly active proteins despite being less conservative than methods like consensus design. Finally, for the structure-based designs from LigandMPNN, we observed no activity, probably owing to the method’s dependence on static structures and lack of evolutionary constraints on the complex requisite functions of Cas9 proteins.

Our top hit, PF-CAS-182, displayed levels of activity comparable to SpCas9 at on-target sites (median indel rates of 56.4% versus 47.1%) while having a 95% reduction in editing at known SpCas9 off-target sites (median indel rates of 0.32% versus 6.1%). Based on its compelling performance, we nominated PF-CAS-182 as the OpenCRISPR-1 protein. To obtain an unbiased estimate of genome-wide off-target activity, we purified OpenCRISPR-1 and SpCas9 proteins and used these as input to SITE-Seq using the same five gRNAs used in Fig. 3e (AAVs1, HEK2, HEK3, FANCF and VEGFA) at four different ribonucleoprotein concentrations. Consistent with our initial cell-based analysis, we observed that a substantially higher proportion of cleavage events occurred at on-target sites for OpenCRISPR-1 compared to SpCas9 across all ribonucleoprotein concentrations and gRNAs tested (Fig. 3f). Importantly, the OpenCRISPR-1 off targets were a subset of the SpCas9 off targets (Extended Data Fig. 8), strongly indicating that OpenCRISPR-1 does not generate new cleavage patterns. OpenCRISPR-1 did not share any mutations with eight previously engineered high-fidelity Cas9 variants37, indicating that this enzyme achieves a low off-target profile by means of a distinct set of molecular interactions.
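
The reported reduction in off-target editing follows directly from the two median indel rates quoted above. As a quick worked check, using only those two values (the function name is hypothetical):

```python
def percent_reduction(reference, observed):
    """Relative reduction of observed versus reference, in percent."""
    return 100.0 * (1.0 - observed / reference)

# Median off-target indel rates quoted in the text: 6.1% (SpCas9) and 0.32%.
off_target = percent_reduction(reference=6.1, observed=0.32)
```

The result is approximately 94.8%, which rounds to the 95% reduction stated in the text.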

OpenCRISPR-1 lacked previously identified immunodominant and subdominant SpCas9 T cell epitopes for HLA-A*02:01, indicating that it may be less immunogenic than SpCas9 (ref. 39) (Supplementary Fig. 9). To test this hypothesis, we performed an iELISA to measure the amount of human antibody bound to OpenCRISPR-1 and two other generated proteins (PF-CAS-151 and PF-CAS-189) (Extended Data Fig. 9). Plates were coated with 1 µg ml−1 protein concentrations, and serum samples from 40 healthy donors were diluted from 100-fold to 1,600-fold. iELISA quantification showed lower immunogenicity for all generated Cas9-like proteins compared to SpCas9 at one or more dilution levels. These results indicate that proteins designed with machine learning have the potential to be less immunogenic than pathogen-derived genome editors such as SpCas9.

OpenCRISPR-1 was 1,380 residues in length and considerably diverged from both SpCas9 (403 mutations) and any natural protein in the CRISPR–Cas Atlas (182 mutations). Alignment to NCBI-nr showed top hits to proteins from several Streptococcus spp. (Streptococcus cristatus, S. pyogenes and Streptococcus sanguinis), but none exceeding a sequence identity of 86.3%. Taken together, these three natural Cas9s yielded a cumulative identity of 98.3% for OpenCRISPR-1, in line with what would be expected for a protein with 80–90%ID to the nearest natural (Supplementary Fig. 10). Template-based AlphaFold2 (ref. 30) predictions of the catalytic state40 of OpenCRISPR-1 illustrated that most of the mutations were concentrated at the solvent-exposed surface of the protein, with only a fraction located at the protein–nucleic acid interface (Extended Data Fig. 10a,b). Most critical nucleic-acid-coordinating residues and nuclease-site components were preserved, demonstrating the capability of the model to accurately constrain all necessary catalytic and interaction sites (Supplementary Table 1). In addition to point mutations, OpenCRISPR-1 contained two loop insertions in the REC1 and HNH domains. The function of these inserts remains unknown; however, the nine-residue positively charged insertion in the REC1 domain may interact with the phosphate backbone of both the repeat:anti-repeat segment of the gRNA and the PAM-proximal region of the target DNA (Extended Data Fig. 10c). This insertion is analogous to sequence graftings of positively charged loops between natural Cas9 orthologues to boost activity41. The four-residue insertion in the HNH domain is compatible with all the experimentally elucidated catalytic states40 and may have a role in stabilizing the cleavage checkpoint state (Extended Data Fig. 10d).

To more thoroughly characterize OpenCRISPR-1, we used a cell-based assay with NGS measurement of indel rates to screen 98 previously characterized SpCas9 target sites harbouring either NGG or non-NGG PAMs15,42 (Supplementary Table 3). After quality control, we were able to characterize nuclease performance at 92 of these sites (NGG PAMs, n = 49; non-NGG PAMs n = 43). Our measurements of activity for SpCas9 at these targets were moderately below reported levels15, probably owing to the transfection efficiency observed from lipofection (Supplementary Fig. 11). In agreement with our previous experiment, OpenCRISPR-1 displayed comparable levels of on-target activity across sites bearing the NGG PAM (Fig. 4a,b) and resulted in a similar distribution of DNA repair outcomes (Supplementary Fig. 12). Interestingly, OpenCRISPR-1 exhibited several-fold reduction in activity at genomic sites bearing a mismatch in the PAM (two-sided Wilcoxon rank sum P value = 0.0005) (Fig. 4c). These results indicate that OpenCRISPR-1 has activity comparable to SpCas9 at on-target sites while avoiding double-strand breaks at sites with mismatches in either the PAM or target regions.

Fig. 4: Characterization of OpenCRISPR-1 across PAMs, guides and base editing.

a,b, On-target editing efficiency (indel formation) of OpenCRISPR-1 (OC-1) protein at NGG (n = 49) and non-NGG PAMs (n = 43) (a). OpenCRISPR-1 exhibits comparable activity at targets with an NGG PAM but lower editing at sites lacking an NGG PAM (b). c, Relative activity of SpCas9 to OpenCRISPR-1 across sites with different PAMs (NGG, n = 49; NGC, n = 11; NGT, n = 10; NGA, n = 10; NAG, n = 9; NTG, n = 2; NCG, n = 1). d, Adenine base editors were created by attaching deaminase domains to the N terminus of OpenCRISPR-1 and SpCas9 nickase variants (D10A mutation for both proteins). e, Adenine base editing efficiency (A-to-G) at three target sites: HEK2 (i), T39 (ii), CD3G_1 (iii). ABE8.20 is a highly active deaminase from directed evolution, whereas PF-DEAM-1 and PF-DEAM-2 were generated from LMs. Across all target sites and with distinct deaminases, OpenCRISPR-1 nickase shows compatibility with base editing. f, Editing efficiency at HEK3 target site with designed sgRNAs (green) and SpCas9’s sgRNA (grey). Four of five generated proteins displayed increased editing efficiency with designed sgRNAs. g, Change in editing efficiency compared to SpCas9’s sgRNA. The majority of designed sgRNAs yield performance that is not significantly different from SpCas9’s guide, whereas a subset either significantly improves or worsens editing efficiency (t-test P value < 0.05).

Next, we investigated whether OpenCRISPR-1 could be used in a base editing system. Base editors have emerged as powerful systems for modifying single nucleotides in the genome without the complications of generating double-strand breaks. We converted OpenCRISPR-1 to a putative target-strand nickase (containing D10A mutation) and fused it to a previously engineered adenosine deaminase (ABE8.20) commonly used in base editing43 (Fig. 4d). We tested for editing in HEK293T cells using plasmid delivery, with sgRNAs targeting three genomic loci containing adenines in the editing window (that is, position 3–9 in the spacer). We observed robust A-to-G conversion with the OpenCRISPR-1 base editor on all three target sites (35–60% editing rate), which was comparable to an ABE8.20 base editor system using SpCas9 nickase (Fig. 4e) and without resulting in indel formation (Supplementary Fig. 13).
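
Whether a candidate target contains adenines in the editing window can be checked with a few lines of code. The sketch below uses the window stated above (positions 3 to 9 of the spacer) and assumes that positions are counted from 1 starting at the PAM-distal (5′) end of the spacer; that numbering convention, the function name and the toy spacer are assumptions made for illustration.

```python
def adenines_in_window(spacer, start=3, end=9):
    """Return 1-based spacer positions of adenines inside the editing window.

    Positions are counted from the 5' (PAM-distal) end (assumed convention);
    the window bounds are inclusive.
    """
    return [
        pos
        for pos, base in enumerate(spacer.upper(), start=1)
        if base == "A" and start <= pos <= end
    ]

# Toy 20-nt spacer (hypothetical), not one of the loci tested in this study.
toy_spacer = "GCATACGATTCCGGTTAAGC"
positions = adenines_in_window(toy_spacer)
```

For this toy spacer the adenines at positions 3, 5 and 8 fall inside the window, whereas those at positions 17 and 18 do not.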

We next set out to engineer a fully synthetic base editor system, including the deaminase domain. Towards this goal, we trained models based on TadA-like proteins from UniProtKB44 and BFD30 and generated a series of synthetic adenine deaminases with 55–80%ID to any known natural deaminase, including both engineered variants and the Escherichia coli TadA from which these variants were evolved43,45. Initial screening experiments with SpCas9 nickase showed that a subset of generated deaminases were active at several targets in the human genome (Supplementary Fig. 14). We then tested two of our most active deaminases (PF-DEAM-1 and PF-DEAM-2) fused to the N terminus of either SpCas9 or OpenCRISPR-1 nickases. Our generated deaminases showed A-to-G editing levels comparable to ABE8.20 with both nickase scaffolds while producing minimal bystander edits (Supplementary Fig. 15). Similarly narrow editing windows were observed in early adenine base editors evolved from ecTadA through directed evolution45 but were eventually traded for overall activity in later rounds of selection43.

Although all of the experiments up to this point had used the SpCas9 sgRNA scaffold, we reasoned that this might not be optimal for OpenCRISPR-1 and other generated Cas9-like proteins, which contain hundreds of mutations that could potentially disrupt RNA interactions. Using the gRNA model described previously, we designed 14 sgRNAs for each of five generated Cas9-like proteins (including OpenCRISPR-1) and tested for editing in HEK293T cells at one target site (HEK3). Generated sgRNA sequences exhibited high sequence conservation compared to SpCas9’s sgRNA (Supplementary Fig. 16), with most mutations occurring in flexible regions of the secondary structure (for example, loops or linker regions). Overall, we observed enhanced editing with 31 designed sgRNAs relative to SpCas9’s sgRNA (Fig. 4f), including significant improvements for four sgRNAs for two of five variants (Fig. 4g). To confirm these findings, we conducted a similar editing experiment using two generated proteins at two more sites, again finding comparable editing with most designed sgRNAs and a small number yielding statistically significant improvements to editing (Supplementary Fig. 17g,h). Designed sgRNAs were generally compatible with SpCas9, but only one showed significant improvements in editing efficiency (Supplementary Fig. 17d–f). We found that OpenCRISPR-1, which had shown consistently high editing efficiency in previous experiments, performed similarly with a designed sgRNA or SpCas9’s sgRNA. These results indicate that this generated protein may be applicable either as part of a fully generated gene editor or as a drop-in replacement for SpCas9 in existing editing systems.

Discussion

Gene editing technologies adapted from natural prokaryotic antiviral systems have enabled precise, programmable manipulation of genetic material across research, therapeutic and industrial applications. Although evolution has created a massive diversity of CRISPR–Cas proteins, identifying the best natural protein for a given application (if it exists) remains a principal bottleneck in the design of more advanced gene editing systems. Generative LMs for DNA46 or proteins2 offer an alternative paradigm wherein models learn from natural diversity and can be steered towards the most promising regions of sequence space. This approach allows us to diversify existing lineages of interest or explore regions of sequence space that were not visited by evolution. In this work, we focused on generating type II effectors in the phylogenetic neighbourhood of SpCas9, ultimately yielding the OpenCRISPR-1 editing system. Our results indicate that OpenCRISPR-1 may provide a viable alternative to SpCas9 for use in gene editing technologies, with similar editing behaviour and compatibility with systems like base editing. In the future, it will be important to examine OpenCRISPR-1 activity across a range of experimental conditions, cell types and delivery methods to more thoroughly characterize robustness11,37,47.

As part of this work, we curated the CRISPR–Cas Atlas—a large resource of CRISPR systems. Datasets like the CRISPR–Cas Atlas are critical for refining the general learnings of pretrained protein LMs into a functional blueprint for design. Although we focused primarily on type II effector proteins, our exploratory results indicate that effectors from other Class 2 systems (for example, Cas12a, Cas12f and Cas13) may be amenable to the same approach. In some cases, these alternative systems have unique properties that would benefit gene editing applications (for example, reduced size of Cas12f or RNA interference of Cas13). Aside from fine-tuning protein LMs for generation, we envision that the CRISPR–Cas Atlas could be used to model specific properties of gene editors, such as nuclease size, PAM preference, tracrRNA compatibility, thermostability or temperature-dependent activity. For instance, a model to predict PAM preference could enable efficient engineering of target- or allele-specific editors. The capability of generative LMs to produce diverse, highly functional nuclease proteins, as demonstrated in this work, provides a foundation from which to pursue these fit-for-purpose editors.

Computational protein design has advanced considerably in recent years with the development of increasingly sophisticated deep learning algorithms. These improvements have been achieved through integration of more powerful tools into design pipelines that have remained largely unchanged48. Specifically, the design of protein function typically begins with an explicit structural hypothesis that is translated into a set of constraints to guide a search for satisfying sequences49. This approach has largely reduced some design problems, such as de novo design of protein binders50, to practice. However, for the design of complex functions as embodied by the gene editors in this work, structure-based approaches do not offer a straightforward solution. By contrast, LMs provide an implicit means of modelling protein function (and thus structure) through sequence alone18.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.