  • Article

Ultra-high-throughput mapping of genetic design space

Abstract

Massively parallel genetic screens have been used to map sequence-to-function relationships for a variety of genetic elements (refs. 1–5). However, as these approaches interrogate only short sequences, it remains challenging to perform high-throughput assays on constructs containing combinations of multiple sequence elements arranged across multi-kb length scales. Overcoming this barrier could accelerate synthetic biology; by screening diverse gene circuit designs and learning ‘composition to function’ mappings, genetic part composability rules could be revealed, enabling rapid identification of behaviour-optimized design variants (refs. 6,7). Here we introduce CLASSIC (combining long- and short-range sequencing to investigate genetic complexity), a genetic screening platform that combines long- and short-read next-generation sequencing (NGS) modalities to quantitatively assess pools of constructs of arbitrary length containing diverse genetic part compositions. We show that CLASSIC can measure expression profiles of over 10⁵ gene circuit designs (from 5–20 kb) in a single experiment in human cells. The resulting datasets can be used to train machine-learning models that accurately predict circuit behaviour across expansive circuit design landscapes, revealing part composability rules that govern circuit performance. Our study shows that, by expanding the throughput of each design–build–test–learn cycle, CLASSIC enhances the pace and scale of synthetic biology and establishes an experimental basis for data-driven design of complex genetic systems.

Fig. 1: Using CLASSIC to systematically map the design space of complex genetic programs.
Fig. 2: Using CLASSIC to quantitatively profile a synthetic gene circuit design landscape.
Fig. 3: ML-aided mapping of single-input circuit design space reveals gene circuit design rules.
Fig. 4: ML-guided exploration of multi-input gene circuit behaviour.
Fig. 5: Analysis of digital logic gene circuit design rules in >10⁹-member design space.

Data availability

All Nanopore and Illumina sequencing datasets generated in this study are available from the Sequence Read Archive (BioProject: PRJNA1347054).

Code availability

All custom scripts used for Nanopore sequencing data analysis are available at GitHub (https://github.com/cbashorlab/WIMPY). Code associated with Illumina data analysis and model training is available at GitHub (https://github.com/cbashorlab/CLASSIC). All other scripts used to generate analyses beyond those provided above are available on request.

References

  1. de Boer, C. G. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat. Biotechnol. 38, 56–65 (2020).

  2. Castillo-Hair, S. et al. Optimizing 5′UTRs for mRNA-delivered gene editing using deep learning. Nat. Commun. 15, 5284 (2024).

  3. Angenent-Mari, N. M., Garruss, A. S., Soenksen, L. R., Church, G. & Collins, J. J. A deep learning approach to programmable RNA switches. Nat. Commun. 11, 5057 (2020).

  4. Sahu, B. et al. Sequence determinants of human gene regulatory elements. Nat. Genet. 54, 283–294 (2022).

  5. Jones, E. M. et al. Structural and functional characterization of G protein-coupled receptors with deep mutational scanning. eLife 9, e54895 (2020).

  6. Zhang, C., Tsoi, R. & You, L. Addressing biological uncertainties in engineering gene circuits. Integr. Biol. 8, 456–464 (2016).

  7. Kitano, S., Lin, C., Foo, J. L. & Chang, M. W. Synthetic biology: learning the way toward high-precision biological design. PLoS Biol. 21, e3002116 (2023).

  8. English, M. A., Gayet, R. V. & Collins, J. J. Designing biological circuits: synthetic biology within the operon model and beyond. Annu. Rev. Biochem. 90, 221–244 (2021).

  9. Mahata, B. et al. Compact engineered human mechanosensitive transactivation modules enable potent and versatile synthetic transcriptional control. Nat. Methods 20, 1716–1728 (2023).

  10. Slusarczyk, A. L., Lin, A. & Weiss, R. Foundations for the design and implementation of synthetic genetic circuits. Nat. Rev. Genet. 13, 406–420 (2012).

  11. Bashor, C. J. & Collins, J. J. Understanding biological regulation through synthetic biology. Annu. Rev. Biophys. 47, 399–423 (2018).

  12. Bashor, C. J., Hilton, I. B., Bandukwala, H., Smith, D. M. & Veiseh, O. Engineering the next generation of cell-based therapeutics. Nat. Rev. Drug Discov. 21, 655–675 (2022).

  13. Beitz, A. M., Oakes, C. G. & Galloway, K. E. Synthetic gene circuits as tools for drug discovery. Trends Biotechnol. 40, 210–225 (2022).

  14. Kitada, T., DiAndreth, B., Teague, B. & Weiss, R. Programming gene and engineered-cell therapies with synthetic biology. Science 359, eaad1067 (2018).

  15. Cameron, D. E., Bashor, C. J. & Collins, J. J. A brief history of synthetic biology. Nat. Rev. Microbiol. 12, 381–390 (2014).

  16. Yeung, E. et al. Biophysical constraints arising from compositional context in synthetic gene networks. Cell Syst. 5, 11–24 (2017).

  17. Lou, C., Stanton, B., Chen, Y. J., Munsky, B. & Voigt, C. A. Ribozyme-based insulator parts buffer synthetic circuits from genetic context. Nat. Biotechnol. 30, 1137–1142 (2012).

  18. Muller, I. E. et al. Gene networks that compensate for crosstalk with crosstalk. Nat. Commun. 10, 4028 (2019).

  19. Kinney, J. B., Murugan, A., Callan, C. G. Jr. & Cox, E. C. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl Acad. Sci. USA 107, 9158–9163 (2010).

  20. Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).

  21. Kosuri, S. et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl Acad. Sci. USA 110, 14024–14029 (2013).

  22. Taskiran, I. I. et al. Cell-type-directed design of synthetic enhancers. Nature 626, 212–220 (2024).

  23. Gosai, S. J. et al. Machine-guided design of cell-type-targeting cis-regulatory elements. Nature 634, 1211–1220 (2024).

  24. Agarwal, V. et al. Massively parallel characterization of transcriptional regulatory elements. Nature 639, 411–420 (2025).

  25. Bogard, N., Linder, J., Rosenberg, A. B. & Seelig, G. A deep neural network for predicting and engineering alternative polyadenylation. Cell 178, 91–106 (2019).

  26. Khoroshkin, M. et al. A generative framework for enhanced cell-type specificity in rationally designed mRNAs. Preprint at bioRxiv https://doi.org/10.1101/2024.12.31.630783 (2024).

  27. Gera, T., Jonas, F., More, R. & Barkai, N. Evolution of binding preferences among whole-genome duplicated transcription factors. eLife 11, e73225 (2022).

  28. DelRosso, N. et al. Large-scale mapping and mutagenesis of human transcriptional effector domains. Nature 616, 365–372 (2023).

  29. Zhou, Y. et al. Encoding genetic circuits with DNA barcodes paves the way for machine learning-assisted metabolite biosensor response curve profiling in yeast. ACS Synth. Biol. 11, 977–989 (2022).

  30. Wong, A. S., Choi, G. C., Cheng, A. A., Purcell, O. & Lu, T. K. Massively parallel high-order combinatorial genetics in human cells. Nat. Biotechnol. 33, 952–961 (2015).

  31. Matreyek, K. A. et al. Multiplex assessment of protein variant abundance by massively parallel sequencing. Nat. Genet. 50, 874–882 (2018).

  32. Liu, H. et al. Magic pools: parallel assessment of transposon delivery vectors in bacteria. mSystems 3, e00143-17 (2018).

  33. Weber, E., Engler, C., Gruetzner, R., Werner, S. & Marillonnet, S. A modular cloning system for standardized assembly of multigene constructs. PLoS ONE 6, e16765 (2011).

  34. Duportet, X. et al. A platform for rapid prototyping of synthetic gene networks in mammalian cells. Nucleic Acids Res. 42, 13440–13451 (2014).

  35. Petitclerc, D. et al. The effect of various introns and transcription terminators on the efficiency of expression vectors in various cultured cell lines and in the mammary gland of transgenic mice. J. Biotechnol. 40, 169–178 (1995).

  36. Khalil, A. S. et al. A synthetic biology framework for programming eukaryotic transcription functions. Cell 150, 647–658 (2012).

  37. Maeder, M. L., Thibodeau-Beganny, S., Sander, J. D., Voytas, D. F. & Joung, J. K. Oligomerized pool engineering (OPEN): an ‘open-source’ protocol for making customized zinc-finger arrays. Nat. Protoc. 4, 1471–1501 (2009).

  38. Li, H. S. et al. Multidimensional control of therapeutic human cell function with synthetic gene circuits. Science 378, 1227–1234 (2022).

  39. Feil, R., Wagner, J., Metzger, D. & Chambon, P. Regulation of Cre recombinase activity by mutated estrogen receptor ligand-binding domains. Biochem. Biophys. Res. Commun. 237, 752–757 (1997).

  40. Bashor, C. J. et al. Complex signal processing in synthetic gene circuits using cooperative regulatory assemblies. Science 364, 593–597 (2019).

  41. Donahue, P. S. et al. The COMET toolkit for composing customizable genetic programs in mammalian cells. Nat. Commun. 11, 779 (2020).

  42. Muldoon, J. J. et al. Model-guided design of mammalian genetic programs. Sci. Adv. 7, eabe9375 (2021).

  43. Kabadi, A. M. & Gersbach, C. A. Engineering synthetic TALE and CRISPR/Cas9 transcription factors for regulating gene expression. Methods 69, 188–197 (2014).

  44. La Russa, M. F. & Qi, L. S. The new state of the art: Cas9 for gene activation and repression. Mol. Cell. Biol. 35, 3800–3809 (2015).

  45. Sadowski, I., Ma, J., Triezenberg, S. & Ptashne, M. GAL4-VP16 is an unusually potent transcriptional activator. Nature 335, 563–564 (1988).

  46. Shin, Y. et al. Spatiotemporal control of intracellular phase transitions using light-activated optoDroplets. Cell 168, 159–171 (2017).

  47. Schneider, N. et al. Liquid-liquid phase separation of light-inducible transcription factors increases transcription activation in mammalian cells and mice. Sci. Adv. 7, eabd3568 (2021).

  48. Gossen, M. & Bujard, H. Tight control of gene expression in mammalian cells by tetracycline-responsive promoters. Proc. Natl Acad. Sci. USA 89, 5547–5551 (1992).

  49. Shin, Y. & Brangwynne, C. P. Liquid phase condensation in cell physiology and disease. Science 357, eaaf4382 (2017).

  50. Tycko, J. et al. Development of compact transcriptional effectors using high-throughput measurements in diverse contexts. Nat. Biotechnol. https://doi.org/10.1038/s41587-024-02442-6 (2024).

  51. Tague, E. P., Dotson, H. L., Tunney, S. N., Sloas, D. C. & Ngo, J. T. Chemogenetic control of gene expression and cell signaling with antiviral drugs. Nat. Methods 15, 519–522 (2018).

  52. Jiang, K. et al. Rapid in silico directed evolution by a protein language model with EVOLVEpro. Science 387, eadr6006 (2025).

  53. Lin, J., Luo, R. & Pinello, L. EPInformer: a scalable deep learning framework for gene expression prediction by integrating promoter-enhancer sequences with multimodal epigenomic data. Preprint at bioRxiv https://doi.org/10.1101/2024.08.01.606099 (2024).

  54. Wimmer, E., Mueller, S., Tumpey, T. M. & Taubenberger, J. K. Synthetic viruses: a new opportunity to understand and prevent viral disease. Nat. Biotechnol. 27, 1163–1172 (2009).

  55. Brophy, J. A. & Voigt, C. A. Principles of genetic circuit design. Nat. Methods 11, 508–520 (2014).

  56. Pinglay, S. et al. Synthetic regulatory reconstitution reveals principles of mammalian Hox cluster regulation. Science 377, eabk2820 (2022).

  57. Voigt, C. A. Synthetic biology 2020-2030: six commercially-available products that are changing our world. Nat. Commun. 11, 6379 (2020).

  58. Valeri, J. A. et al. Sequence-to-function deep learning frameworks for engineered riboregulators. Nat. Commun. 11, 5058 (2020).

  59. Hollerer, S. et al. Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping. Nat. Commun. 11, 3551 (2020).

  60. Rai, K., Wang, Y., O’Connell, R. W., Patel, A. B. & Bashor, C. J. Using machine learning to enhance and accelerate synthetic biology. Curr. Opin. Biomed. Eng. 31, 100553 (2024).

  61. Karst, S. M. et al. High-accuracy long-read amplicon sequences using unique molecular identifiers with Nanopore or PacBio sequencing. Nat. Methods 18, 165–169 (2021).

  62. Chung, C. T., Niemela, S. L. & Miller, R. H. One-step preparation of competent Escherichia coli: transformation and storage of bacterial cells in the same solution. Proc. Natl Acad. Sci. USA 86, 2172–2175 (1989).

  63. Parrish, J. R. et al. High-throughput cloning of Campylobacter jejuni ORFs by in vivo recombination in Escherichia coli. J. Proteome Res. 3, 582–586 (2004).

  64. Currin, A. et al. Highly multiplexed, fast and accurate nanopore sequencing for verification of synthetic DNA constructs and sequence libraries. Synth. Biol. 4, ysz025 (2019).

  65. De Coster, W., Weissensteiner, M. H. & Sedlazeck, F. J. Towards population-scale long-read sequencing. Nat. Rev. Genet. 22, 572–587 (2021).

  66. Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).

  67. Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, 819–823 (2013).

  68. Hermann, M. et al. Binary recombinase systems for high-resolution conditional mutagenesis. Nucleic Acids Res. 42, 3894–3907 (2014).

Acknowledgements

We thank O. Igoshin, A. Patel, Y. Lagisetty, S. Singh and the members of the Bashor laboratory for discussions. This work was supported by grants from NIH R01 EB029483 (C.J.B.), NIH R01 EB032272 (C.J.B.), ONR N00014-21-1-4006 (C.J.B.) and funding from the Robert J. Kleberg Jr and Helen C. Kleberg Foundation (C.J.B.). This work was also supported by the Genetic Design and Engineering Center (GDEC) at Rice University, which is funded by CPRIT RP210116. R.W.O. was supported by a graduate fellowship from the American Heart Association (917746). B.K. was supported by a NLM Training Program in Biomedical Informatics and Data Science fellowship (T15LM007093-31) and by NIH grant P01-AI15299901. K.D.C. was supported by NSF EF-2126387 and the Ken Kennedy Institute Computational Science & Engineering Recruiting Fellowship. T.J.T. was supported by NSF grants IIS-2239114 and EF-2126387, NIH grant P01-AI152999 and AI2Health cluster funding from Ken Kennedy Institute, Rice University. P.M. and J.W.R. were supported by NIH R35GM119461 (P.M.).

Author information

Contributions

R.W.O., K.R. and C.J.B. conceived the study. R.W.O. and K.R. carried out the experiments and developed the analysis software, with assistance from T.C.P., Y.W., L.B.C.B., K.D.S., J.A.W., S.L., T.H.Z., E.M.R. and A.S.; R.W.O., K.R. and T.C.P. developed the modular cloning scheme and LP cell line, with assistance from S.L.; B.K., K.D.C. and T.J.T. helped to develop the barcoding scheme and analysis software. R.W.O., K.R., T.C.P., Y.W., J.W.R., P.M. and C.J.B. analysed the data. C.J.B. supervised the study. R.W.O., K.R. and C.J.B. wrote the manuscript, with input from all of the authors.

Corresponding author

Correspondence to Caleb J. Bashor.

Ethics declarations

Competing interests

A provisional patent application that covers technologies described in this Article has been filed by Rice University.

Peer review

Peer review information

Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Overview of 166k-member gene circuit library construction.

a, Timeline of library construction process from level 0 part fragments to level 3 libraries. b, Schematic of the construction strategy for each level of the assembly process (grey boxes), with number of cells indicated for assembly transformation and plating (brown circles, 10 cm plates; brown rectangles, autoclave trays), colony scraping, and HEK-LP transfection (pink circles, 10 cm plates). The inputs and products for each assembly level are represented to the right. c, Nanopore sequencing read-length distribution of level 3 single-input inducible circuit library, with the relative proportion of each identifiable DNA product denoted: library members (pink), re-circularized level 3 destination vector (orange), empty level 3 destination vector (dark grey), contamination from level 0, 1, or 2 genetic parts (light grey), or unidentified DNA species (blue).

Extended Data Fig. 2 166k-member library balance and barcoding analysis metrics.

a, Confusion matrices showing the percentage of reads unambiguously assigned to each genetic part (on-diagonal) and reads ambiguously linked to two parts (off-diagonal) following level 3 library assembly. Individual values representing <0.5% of the total reads are not shown. The percentage of reads in which a part identity could not be determined is shown at the bottom of the respective part confusion matrix. b, Number of barcodes determined to be uniquely mapped to a single composition (“unique”) or multiple compositions (“non-unique”), as determined from Nanopore sequencing analysis of the level 3 library.
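The unique versus non-unique barcode classification in panel b can be illustrated with a minimal sketch. It assumes each long read has already been reduced to a (barcode, composition) pair; the function name `classify_barcodes`, the barcode strings and the part names are hypothetical and are not taken from the authors' WIMPY scripts.

```python
from collections import defaultdict


def classify_barcodes(read_assignments):
    """Group reads by barcode and separate barcodes that map to exactly one
    composition ("unique") from those that map to several ("non-unique").

    read_assignments: iterable of (barcode, composition) pairs, one per read.
    This input format is an assumption made for illustration.
    """
    compositions_by_barcode = defaultdict(set)
    for barcode, composition in read_assignments:
        compositions_by_barcode[barcode].add(composition)

    unique = {}
    non_unique = {}
    for barcode, compositions in compositions_by_barcode.items():
        if len(compositions) == 1:
            unique[barcode] = next(iter(compositions))
        else:
            non_unique[barcode] = sorted(compositions)
    return unique, non_unique


# Hypothetical reads: barcode "TTAG" is linked to two different compositions.
reads = [
    ("ACGT", ("promoterA", "synTF1")),
    ("ACGT", ("promoterA", "synTF1")),
    ("TTAG", ("promoterB", "synTF2")),
    ("TTAG", ("promoterA", "synTF2")),
    ("GGCA", ("promoterB", "synTF1")),
]
unique, non_unique = classify_barcodes(reads)
```

In this sketch a barcode counts as non-unique as soon as one read disagrees; a real pipeline would probably apply a read-count or consensus threshold first, which is not modelled here.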

Extended Data Fig. 3 Comparison and optimization of ML models.

a, Top: schematic outlining the process of generating training (light grey), validation (blue), test (purple), and isolate (navy) sets. CLASSIC data sets are first divided into high- (>12 barcodes, dark grey) or general-quality (<12 barcodes, light grey) sets before generating the training, validation, and test splits. Bottom: comparison of the performance of 5 different model classes (linear regression, quadratic regression, random forest (RF), convolutional neural network (CNN), and multi-layer perceptron (MLP)) for predicting circuit behaviour using varying amounts of training data (x-axis), as monitored using test (purple line), validation (light blue line), and isolate (navy line) set r² values (y-axis) (see supplementary text section S3.5). Grey shaded region on each plot represents a regime in which the training set is dominated by general-quality reads. For the selected model class (MLP), training curves of the root mean squared error (RMSE) and loss for the validation set are provided. b, Basal (purple line) and induced (navy line) r² values for predicted vs observed CLASSIC expression from the trained MLP. Insets represent CLASSIC vs predicted measurements for >1 (bottom left, r² = 0.43) and >12 (top middle, r² = 0.80) barcodes per composition, and the number of compositions for increasing number of barcodes per composition (bottom right). c, Hyperparameter optimization (HPO) of the MLP, monitored using the validation set: learning rate (LR, y-axis) for different numbers of layers (x-axis) with 4 layers and a learning rate of 5 × 10⁻² providing highest r² (red outline) (top); solver choice, with SGDM leading to the highest r² (red bar) (bottom left); momentum, with a momentum of 0.9 providing the highest r² (red circle) (bottom right). d, Comparing ground-truth basal expression (left), induced expression (middle), and fold change measurements (right) (x-axis) with HPO MLP predictions (y-axis) for the isolate set (r² = 0.96, r² = 0.91, MAE = 0.22, respectively, n = 40).
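The quality split in panel a and the r² metric used throughout the figure can be sketched as follows. This is an illustration under stated assumptions: the record layout (`n_barcodes` key) is hypothetical, compositions with exactly 12 barcodes are placed in the general-quality set because the caption leaves that case open, and r² is taken to be the squared Pearson correlation.

```python
def split_by_barcode_count(records, threshold=12):
    """Split compositions into high-quality (> threshold barcodes) and
    general-quality sets. Records with exactly `threshold` barcodes go to
    the general-quality set (an assumption; the caption does not say)."""
    high_quality, general_quality = [], []
    for record in records:
        if record["n_barcodes"] > threshold:
            high_quality.append(record)
        else:
            general_quality.append(record)
    return high_quality, general_quality


def pearson_r2(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with >= 2 values")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("r2 is undefined when either input is constant")
    return (cov * cov) / (var_o * var_p)


# Hypothetical compositions with different barcode coverage.
records = [
    {"composition": "c1", "n_barcodes": 3},
    {"composition": "c2", "n_barcodes": 12},
    {"composition": "c3", "n_barcodes": 25},
]
high_quality, general_quality = split_by_barcode_count(records)
r2_linear = pearson_r2([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Holding out the high-coverage compositions for testing means the evaluation targets are averaged over more barcodes, which is one plausible reading of why r² rises with barcodes per composition in panel b.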

Extended Data Fig. 4 Summary of cell lines constructed from the 166k-library design space to validate predictions from the MLP model.

a, Basal and induced eGFP expression levels for each constructed cell line circuit composition overlaid onto single-input behaviour space (clonally isolated, green; constructed out-of-sample, red; constructed in-sample, teal; contour for 97.5% of compositions in the MLP-predicted behaviour space, grey). Dotted lines separate behavioural regions of interest: low basal (<500 AU, purple arrow); high induction (>70k AU, blue); high fold-change (HFC; >25×, green). b, Comparing fold-change values for MLP model predictions (red, left, n = 136) or CLASSIC measurements (green, right, n = 121) with ground truth cell line measurements. c, Residual between model and CLASSIC measurements. Heatmap corresponds to the Manhattan distance between residuals of basal and induced expression values calculated across a 20 × 20 grid in the behaviour space. Variants were assigned to grid cells based on CLASSIC measured values. d, Basal and induced expression behaviour for the top 10 highest error compositions from the individual variants shown in (a). CLASSIC measurements (grey) and corresponding ground truth values measured from constructed cell lines (green) are shown. Lines (blue dashed) link each pair of CLASSIC and ground truth expression.
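One way to read the heatmap in panel c is sketched below: each variant is binned by its CLASSIC-measured basal and induced values, and the Manhattan distance |Δbasal| + |Δinduced| between prediction and measurement is averaged per grid cell. Averaging within a cell, the axis ranges and the dictionary keys are assumptions made for illustration; the caption does not specify them.

```python
def residual_grid(variants, basal_range, induced_range, n_bins=20):
    """Return an n_bins x n_bins grid of mean Manhattan residuals.

    variants: list of dicts with measured_* and predicted_* values
              (hypothetical layout).
    basal_range, induced_range: (low, high) axis limits for binning.
    Cells that contain no variants are reported as None.
    """
    def bin_index(value, low, high):
        fraction = (value - low) / (high - low)
        # Clamp so values on or beyond the upper edge fall in the last bin.
        return min(n_bins - 1, max(0, int(fraction * n_bins)))

    sums = [[0.0] * n_bins for _ in range(n_bins)]
    counts = [[0] * n_bins for _ in range(n_bins)]
    for v in variants:
        i = bin_index(v["measured_basal"], *basal_range)
        j = bin_index(v["measured_induced"], *induced_range)
        distance = (abs(v["predicted_basal"] - v["measured_basal"])
                    + abs(v["predicted_induced"] - v["measured_induced"]))
        sums[i][j] += distance
        counts[i][j] += 1

    return [[sums[i][j] / counts[i][j] if counts[i][j] else None
             for j in range(n_bins)]
            for i in range(n_bins)]


# Two hypothetical variants that fall into the same grid cell.
variants = [
    {"measured_basal": 1.0, "measured_induced": 2.0,
     "predicted_basal": 2.0, "predicted_induced": 4.0},
    {"measured_basal": 1.25, "measured_induced": 2.25,
     "predicted_basal": 2.25, "predicted_induced": 6.25},
]
grid = residual_grid(variants, basal_range=(0.0, 10.0), induced_range=(0.0, 10.0))
```

Binning on the measured rather than the predicted values follows the caption's statement that variants were assigned based on CLASSIC measurements.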

Extended Data Fig. 5 Clustering analysis of HFC compositions.

a, Gap test for cluster number (left) and subsequent UMAP projection of HFC variants grouped into 3 clusters (cluster A, blue; cluster B, red; cluster C, purple) (middle left). Cluster similarity scores from 100 independent k-means clustering outcomes for all compositions, and adjusted Rand index for all pairs of clustering results (right). Means represent the population cluster similarity and adjusted Rand index, respectively. b, Part usage frequency for variants in each cluster. c, Mapping of basal and induced eGFP expression of compositions from each of the three clusters, overlaid on a contour constructed from 97.5% of the data from the MLP-predicted behaviour space (see Fig. 4a) (grey fill). d, Distribution of basal (dotted line) and induced eGFP expression (solid line) (bottom axis), as well as fold change values (grey line, top axis) for each cluster. Circles represent the median values, boxes span the 25th to 75th percentiles, and the upper and lower whiskers represent the median ± 1.5× IQR (line ends). Median values are shown to the left of the plot. Sample sizes (n): Cluster A = 5,018, Cluster B = 452, and Cluster C = 62.
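The pairwise stability measure in panel a can be illustrated with a small, dependency-free implementation of the adjusted Rand index (the standard Hubert and Arabie form), averaged over all pairs of clustering runs. The function names and the toy label vectors are hypothetical; the authors' own implementation may differ.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: labelings trivially agree

    # Pairs of items placed together in both, in A, and in B respectively.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs
    maximum = 0.5 * (together_a + together_b)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (together_both - expected) / (maximum - expected)


def mean_pairwise_ari(clusterings):
    """Mean adjusted Rand index over every pair of clustering results."""
    scores = [adjusted_rand_index(a, b) for a, b in combinations(clusterings, 2)]
    if not scores:
        raise ValueError("need at least two clusterings")
    return sum(scores) / len(scores)


# Three hypothetical clustering runs over four compositions.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 2]]
stability = mean_pairwise_ari(runs)
```

Because the index depends only on which items are grouped together, the first two runs score 1.0 against each other even though their cluster labels are swapped, which is what makes the measure suitable for comparing independent k-means outcomes.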

Extended Data Fig. 6 Fine-tuning model.

a, Schematic depicting a proposed method for expanding the model-predicted design space (red) to include 2 new parts (NFZ and no IDP) by fine-tuning the model using small libraries of new parts (green) (left). Representation of the position of the new parts in the synTF architecture (right). b, Schematic outlining the assembly strategy to retroactively add the TA NFZ to the design space. Transparent green boxes signify individual plasmids or plasmid pools that contain the new part. c, Schematic outlining the assembly strategy to retroactively add IDP-less variants to the design space. Transparent green boxes signify individual plasmids or plasmid pools that contain the new part. d, eGFP distributions for the new libraries to explore this sub-space. e, Table of the number of cells sorted into each bin for both inducer conditions during flowSeq. f, 8 individually constructed variants from the sub-space to validate CLASSIC measurements. Grey region, ERCH; Green square, HFC region. g & h, Comparison of basal eGFP expression predictions and CLASSIC measurements for a high-quality test set of compositions lacking an IDP (panel g) or containing an NFZ TA (panel h), using either a base model (white dots with black outline, r² = 0.90 or r² = 0.81, respectively) or a fine-tuned model (purple dots, r² = 0.94 or r² = 0.89, respectively) (left). Breakdown of the basal (purple) and induced (teal) expression prediction accuracy with increasing amounts of fine-tuning data from the IDP-lacking (g) or NFZ-containing (h) libraries, as assessed by monitoring the test set r² (middle). A 2D map outlining the amounts of base library and no IDP library (g) or NFZ-containing (h) data required for optimal fine-tuning of the base model, as determined by the test set r² (right). i, 11 individually constructed variants from the sampled (teal) and un-sampled (red) expanded design space to validate fine-tuned model predictions. Grey region, ERCH; Green square, HFC region.

Extended Data Fig. 7 Hyperparameter tuning and validation of base MLP model.

a, Hyperparameter optimization for the multi-layer fully connected neural network. Validation r² values for (left) 2D combinations of learning rate (y-axis) and number of layers (x-axis) for varied amounts of training data used, (top right) momentum parameter for stochastic gradient descent with momentum (SGDM), and (bottom right) different solvers. The optimal parameter from each scan is shown in red. b, Training curves showing RMSE (top) and loss (bottom) as a function of training iteration for the validation set. Training was stopped with a validation patience parameter of 300 iterations. c, Comparison of model predictions to randomly isolated clones from the library. MAE: mean absolute error. d, Scatter plots of the test set from the base model for each of the four input conditions (basal, light grey; OHT only, navy; GZV only, orange; both inducers, green). r², Pearson’s r².
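The stopping rule mentioned in panel b can be sketched as patience-based early stopping. The sketch assumes the common convention that training halts once the validation loss has gone `patience` consecutive checks without improving on its best value; whether the authors' training framework counts patience in exactly this way is not stated in the caption, and the function name is hypothetical.

```python
def early_stopping_iteration(validation_losses, patience=300):
    """Return (stop_iteration, best_iteration) for a validation-loss trace.

    Stops at the first iteration where `patience` checks have passed since
    the best (lowest) validation loss. If that never happens, the last
    iteration is returned as the stop point.
    """
    best_loss = float("inf")
    best_iteration = -1
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            return iteration, best_iteration
    return len(validation_losses) - 1, best_iteration


# Toy trace with a small patience so the behaviour is easy to follow:
# the best loss occurs at iteration 2 and three checks pass without improvement.
trace = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 3.0, 2.9]
stop_at, best_at = early_stopping_iteration(trace, patience=3)
```

In practice the weights from `best_at` would usually be the ones kept, since later iterations did not improve validation performance.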

Extended Data Fig. 8 Validation of model predictions with individual measurements.

a, Flow cytometry was used to measure basal, OHT-induced, GZV-induced, and dual-induced eGFP expression levels for 36 individually constructed cell lines harbouring integrated circuit compositions sampled from across the multi-input library behaviour space. Green bar plots (flow cytometry measurement) and dotted red outlines (model predictions) are shown for each circuit for all four conditions (far left, basal; middle left, 4-OHT-induction; middle right, GZV-induction; far right, dual induction). Bars represent the mean of the expression distribution for a single measurement. KL-divergence (D_KL) from AND (top) or OR (bottom) gate shown below each plot (prediction, red; measurement, green). Numbers in grey circles represent an index for that circuit. A legend explaining the layout of each plot is shown in a grey rectangle. b, The AND-OR coordinates of each cell line (green dots), superimposed on the contour of the design space (grey).
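A minimal sketch of how a KL divergence from an ideal logic gate could be computed from the four-condition expression profile is given below. Several choices here are assumptions rather than the authors' stated method: expression values are normalized to sum to 1, the ideal AND and OR profiles are idealized truth tables, a small smoothing constant avoids zero probabilities, and the divergence is taken as D_KL(observed ‖ ideal).

```python
import math

# Condition order: basal, OHT only, GZV only, both inducers.
IDEAL_AND = (0.0, 0.0, 0.0, 1.0)  # idealized truth table (assumption)
IDEAL_OR = (0.0, 1.0, 1.0, 1.0)   # idealized truth table (assumption)


def to_distribution(values, epsilon=1e-3):
    """Normalize non-negative values to a smoothed probability distribution."""
    total = sum(values)
    if total <= 0 or any(v < 0 for v in values):
        raise ValueError("values must be non-negative with a positive sum")
    smoothed = [v / total + epsilon for v in values]
    norm = sum(smoothed)
    return [s / norm for s in smoothed]


def kl_divergence(p, q):
    """D_KL(p || q) in nats for two distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def gate_divergences(expression):
    """Divergence of an expression profile from the ideal AND and OR gates."""
    observed = to_distribution(expression)
    return {
        "AND": kl_divergence(observed, to_distribution(IDEAL_AND)),
        "OR": kl_divergence(observed, to_distribution(IDEAL_OR)),
    }


# Hypothetical eGFP means (AU) for an AND-like and an OR-like circuit.
and_like = gate_divergences([100.0, 150.0, 120.0, 9000.0])
or_like = gate_divergences([100.0, 8000.0, 8500.0, 9000.0])
```

Under these assumptions a circuit that responds only to both inducers scores closer to AND than to OR, and the pair of divergences could serve as the AND-OR coordinates plotted in panel b.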

Extended Data Fig. 9 Extended AND-gate clustering analysis of the multi-input library.

a, Clustering of the multi-input AND-like behaviour space (top left) and part usage across the clusters. b, AND cluster expression distributions across the 4 input conditions (basal, black; 4-OHT, navy; GZV, orange; Both, green), represented by boxplots outlining the interquartile range (IQR) (box), the median (white band), and the median ± 1.5× IQR (line ends). Median values are shown to the left of the plot. Sample sizes (n): Cluster A = 11,167, Cluster B = 9,854, and Cluster C = 3,627. c, Cluster stability analysis of compositions in the AND-like behaviour space by computing the cluster similarity index (top) across 100 UMAP projections and cluster calculations, and adjusted Rand index (bottom) across every pairwise combination of clustering results across the 100 UMAP projections and cluster calculations.

Extended Data Fig. 10 Extended OR-gate clustering analysis of the multi-input library.

a, Clustering of the multi-input OR-like behaviour space (top left) and part usage across the clusters. b, OR cluster expression distributions across the 4 input conditions (basal, black; 4-OHT, navy; GZV, orange; Both, green), represented by boxplots outlining the interquartile range (IQR) (box), the median (white band), and the median ± 1.5× IQR (line ends). Median values are shown to the left of the plot. Sample sizes (n): Cluster A = 9,240, Cluster B = 7,138, Cluster C = 2,908, and Cluster D = 1,545. c, Cluster stability analysis of compositions in the OR-like behaviour space by computing the cluster similarity index (top) across 100 UMAP projections and cluster calculations, and adjusted Rand index (bottom) across every pairwise combination of clustering results across the 100 UMAP projections and cluster calculations.

Supplementary information

Supplementary Information

Supplementary Notes and Supplementary Figures supporting the Article and its Extended Data Figures.

Reporting Summary

Peer Review file

Supplementary Table 1

DNA sequences used in this study. This includes genetic parts and primers.

Supplementary Table 2

Flow cytometry measurements, model predictions and CLASSIC measurements (where applicable) for individually constructed variants and isolated cell lines.

Supplementary Table 3

Part use, MI and clustering information from the single-input and multi-input libraries.

Supplementary Table 4

A breakeven table for calculating the cost of CLASSIC experiments of varying sizes and complexity.

Supplementary Table 5

A list of published single- and dual-input inducible circuits, as well as information such as cell type, integration method, enrichment strategy, inducer molecule, output and FC (where applicable).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rai, K., O’Connell, R.W., Piepergerdes, T.C. et al. Ultra-high-throughput mapping of genetic design space. Nature (2026). https://doi.org/10.1038/s41586-025-09933-9

  • DOI: https://doi.org/10.1038/s41586-025-09933-9
