Fig. 2: Synthetic mega-active PiggyBac generation using protein language model fine-tuning. | Nature Biotechnology

From: Discovery and protein language model-guided design of hyperactive transposases

a, Overview of the fine-tuning and sequence generation pipeline. The Progen2-base model was fine-tuned on a set of over 10,000 PiggyBac orthologs identified through the bioprospecting pipeline. Over 100,000 sequences were generated with a sequence identity to HyPB between 35% and 99%. Sequences were then filtered using basic (gray) and PiggyBac-specific (green) amino acid sequence metrics and scored using structural (orange) and deep learning (blue) metrics to select a final subset of 22 sequences for experimental validation. b, Distribution of four key metrics (sequence identity, pLDDT, ProteinMPNN score and ESM1v score) for natural sequences from the HyPB cluster at 60% identity (orange) and sequences generated from our progen-ft model (blue) after filtering. The violin plots represent the entire distribution of scores for the two sets of sequences and the internal box plots represent the quartiles for each score, with the center being the median, the bottom and top being the first and third quartiles, respectively, and the whiskers extending 1.5× the interquartile range from the top and bottom. ft, fine-tuned. c, Relative excision for progen-ft-generated variants normalized to HyPB activity (highlighted in green), measured by a transposon excision fluorescence assay. Bars reflect the mean relative excision over the four trials and points represent the mean relative excision of replicates in each trial. Data are presented as the mean values, with n = 5. d, Correlations of calculated and measured features with the relative excision of the progen-ft-generated variants. Significant correlations are highlighted in dark blue. Correlation was measured using Pearson’s correlation coefficient. e, Targeted integration with top pLLM-generated mutants, measured by a GFP reconstitution assay for targeted transposon integration, which detects integration of a 1/2 GFP reporter cargo upstream of a stably integrated 2/2 GFP in a HEK293T reporter cell line. Triple-mutant (×3) versions of the transposases were made by mutating the residues corresponding to R372A;K375A;D450N in HyPB. Data are presented as the mean values ± 95% CI, with n = 3. f, Targeted transposon integration measured by digital PCR assay in C2C12 mouse myoblast cell lines at the TTR and PCSK9 loci for top AI-designed transposases. The sum of integration in both orientations is shown. Data are presented as the mean values ± 95% CI, with n = 2. g, Nontargeted transposon integration measured by fluorescence assay in primary T cells for the top bioprospected ortholog Poetur, 7 days after electroporation. Data are presented as the mean values ± 95% CI, with n = 2. h, Nontargeted integration of a GFP cargo in primary T cells with HyPB and top synthetic transposases, 7 days after cell electroporation. Data are presented as the mean values ± 95% CI, with n = 3.
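
As a rough illustration of two statistics named in the caption, the sketch below computes the box-plot summary described for panel b (median, first and third quartiles, and limits at 1.5× the interquartile range beyond the quartiles) and a Pearson correlation coefficient as used in panel d. It is not the authors' analysis code. The function names, the choice of the "inclusive" quantile method, and the example numbers are all assumptions made for illustration, and plotting libraries often clip whiskers to the most extreme data point inside these limits.

```python
from math import sqrt
from statistics import mean, quantiles


def box_plot_summary(values):
    """Quartiles and 1.5 x IQR whisker limits for one metric (hypothetical helper)."""
    # Assumption: linear-interpolation quartiles ("inclusive" method);
    # the caption does not say which quantile definition was used.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": q1 - 1.5 * iqr,   # lower limit, before any clipping to data
        "whisker_high": q3 + 1.5 * iqr,  # upper limit, before any clipping to data
    }


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Made-up numbers, not data from the figure.
    scores = [1, 2, 3, 4, 5]
    print(box_plot_summary(scores))
    print(pearson_r([1, 2, 3], [2, 4, 6]))
```

Both helpers use only the standard library, so the definitions stay visible rather than hidden behind a plotting or statistics package.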
