Ultra-high-throughput mapping of genetic design space

Rai, Kshitij; O’Connell, Ronan W.; Piepergerdes, Trenton C.; Wang, Yiduo; Brown, Lucas B. C.; Samra, Kian D.; Wilson, Jack A.; Lin, Shujian; Zhang, Thomas H.; Ramos, Eduardo M.; Sun, Andrew; Kille, Bryce; Curry, Kristen D.; Rocks, Jason W.; Treangen, Todd J.; Mehta, Pankaj; Bashor, Caleb J.

doi:10.1038/s41586-025-09933-9

Article
Published: 14 January 2026

Nature volume 650, pages 1035–1044 (2026)

Abstract

Massively parallel genetic screens have been used to map sequence-to-function relationships for a variety of genetic elements^1,2,3,4,5. However, as these approaches interrogate only short sequences, it remains challenging to perform high-throughput assays on constructs containing combinations of multiple sequence elements arranged across multi-kb length scales. Overcoming this barrier could accelerate synthetic biology; by screening diverse gene circuit designs and learning ‘composition to function’ mappings, genetic part composability rules could be revealed, enabling rapid identification of behaviour-optimized design variants^6,7. Here we introduce CLASSIC (combining long- and short-range sequencing to investigate genetic complexity), a genetic screening platform that combines long- and short-read next-generation sequencing (NGS) modalities to quantitatively assess pools of constructs of arbitrary length containing diverse genetic part compositions. We show that CLASSIC can measure expression profiles of over 10⁵ gene circuit designs (from 5–20 kb) in a single experiment in human cells. The resulting datasets can be used to train machine-learning models that accurately predict circuit behaviour across expansive circuit design landscapes, revealing part composability rules that govern circuit performance. Our study shows that, by expanding the throughput of each design–build–test–learn cycle, CLASSIC enhances the pace and scale of synthetic biology and establishes an experimental basis for data-driven design of complex genetic systems.

**Fig. 1: Using CLASSIC to systematically map the design space of complex genetic programs.**

**Fig. 2: Using CLASSIC to quantitatively profile a synthetic gene circuit design landscape.**

**Fig. 3: ML-aided mapping of single-input circuit design space reveals gene circuit design rules.**

**Fig. 4: ML-guided exploration of multi-input gene circuit behaviour.**

**Fig. 5: Analysis of digital logic gene circuit design rules in >10⁹-member design space.**

Data availability

All Nanopore and Illumina sequencing datasets generated in this study are available from the Sequence Read Archive (BioProject: PRJNA1347054).

Code availability

All custom scripts used for Nanopore sequencing data analysis are available at GitHub (https://github.com/cbashorlab/WIMPY). Code associated with Illumina data analysis and model training is available at GitHub (https://github.com/cbashorlab/CLASSIC). All other scripts used to generate analyses beyond those provided above are available on request.

References

de Boer, C. G. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat. Biotechnol. 38, 56–65 (2020).
Castillo-Hair, S. et al. Optimizing 5′UTRs for mRNA-delivered gene editing using deep learning. Nat. Commun. 15, 5284 (2024).
Angenent-Mari, N. M., Garruss, A. S., Soenksen, L. R., Church, G. & Collins, J. J. A deep learning approach to programmable RNA switches. Nat. Commun. 11, 5057 (2020).
Sahu, B. et al. Sequence determinants of human gene regulatory elements. Nat. Genet. 54, 283–294 (2022).
Jones, E. M. et al. Structural and functional characterization of G protein-coupled receptors with deep mutational scanning. eLife 9, e54895 (2020).
Zhang, C., Tsoi, R. & You, L. Addressing biological uncertainties in engineering gene circuits. Integr. Biol. 8, 456–464 (2016).
Kitano, S., Lin, C., Foo, J. L. & Chang, M. W. Synthetic biology: learning the way toward high-precision biological design. PLoS Biol. 21, e3002116 (2023).
English, M. A., Gayet, R. V. & Collins, J. J. Designing biological circuits: synthetic biology within the operon model and beyond. Annu. Rev. Biochem. 90, 221–244 (2021).
Mahata, B. et al. Compact engineered human mechanosensitive transactivation modules enable potent and versatile synthetic transcriptional control. Nat. Methods 20, 1716–1728 (2023).
Slusarczyk, A. L., Lin, A. & Weiss, R. Foundations for the design and implementation of synthetic genetic circuits. Nat. Rev. Genet. 13, 406–420 (2012).
Bashor, C. J. & Collins, J. J. Understanding biological regulation through synthetic biology. Annu. Rev. Biophys. 47, 399–423 (2018).
Bashor, C. J., Hilton, I. B., Bandukwala, H., Smith, D. M. & Veiseh, O. Engineering the next generation of cell-based therapeutics. Nat. Rev. Drug Discov. 21, 655–675 (2022).
Beitz, A. M., Oakes, C. G. & Galloway, K. E. Synthetic gene circuits as tools for drug discovery. Trends Biotechnol. 40, 210–225 (2022).
Kitada, T., DiAndreth, B., Teague, B. & Weiss, R. Programming gene and engineered-cell therapies with synthetic biology. Science 359, eaad1067 (2018).
Cameron, D. E., Bashor, C. J. & Collins, J. J. A brief history of synthetic biology. Nat. Rev. Microbiol. 12, 381–390 (2014).
Yeung, E. et al. Biophysical constraints arising from compositional context in synthetic gene networks. Cell Syst. 5, 11–24 (2017).
Lou, C., Stanton, B., Chen, Y. J., Munsky, B. & Voigt, C. A. Ribozyme-based insulator parts buffer synthetic circuits from genetic context. Nat. Biotechnol. 30, 1137–1142 (2012).
Muller, I. E. et al. Gene networks that compensate for crosstalk with crosstalk. Nat. Commun. 10, 4028 (2019).
Kinney, J. B., Murugan, A., Callan, C. G. Jr. & Cox, E. C. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl Acad. Sci. USA 107, 9158–9163 (2010).
Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).
Kosuri, S. et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl Acad. Sci. USA 110, 14024–14029 (2013).
Taskiran, I. I. et al. Cell-type-directed design of synthetic enhancers. Nature 626, 212–220 (2024).
Gosai, S. J. et al. Machine-guided design of cell-type-targeting cis-regulatory elements. Nature 634, 1211–1220 (2024).
Agarwal, V. et al. Massively parallel characterization of transcriptional regulatory elements. Nature 639, 411–420 (2025).
Bogard, N., Linder, J., Rosenberg, A. B. & Seelig, G. A deep neural network for predicting and engineering alternative polyadenylation. Cell 178, 91–106 (2019).
Khoroshkin, M. et al. A generative framework for enhanced cell-type specificity in rationally designed mRNAs. Preprint at bioRxiv https://doi.org/10.1101/2024.12.31.630783 (2024).
Gera, T., Jonas, F., More, R. & Barkai, N. Evolution of binding preferences among whole-genome duplicated transcription factors. eLife 11, e73225 (2022).
DelRosso, N. et al. Large-scale mapping and mutagenesis of human transcriptional effector domains. Nature 616, 365–372 (2023).
Zhou, Y. et al. Encoding genetic circuits with DNA barcodes paves the way for machine learning-assisted metabolite biosensor response curve profiling in yeast. ACS Synth. Biol. 11, 977–989 (2022).
Wong, A. S., Choi, G. C., Cheng, A. A., Purcell, O. & Lu, T. K. Massively parallel high-order combinatorial genetics in human cells. Nat. Biotechnol. 33, 952–961 (2015).
Matreyek, K. A. et al. Multiplex assessment of protein variant abundance by massively parallel sequencing. Nat. Genet. 50, 874–882 (2018).
Liu, H. et al. Magic pools: parallel assessment of transposon delivery vectors in bacteria. mSystems 3, e00143-17 (2018).
Weber, E., Engler, C., Gruetzner, R., Werner, S. & Marillonnet, S. A modular cloning system for standardized assembly of multigene constructs. PLoS ONE 6, e16765 (2011).
Duportet, X. et al. A platform for rapid prototyping of synthetic gene networks in mammalian cells. Nucleic Acids Res. 42, 13440–13451 (2014).
Petitclerc, D. et al. The effect of various introns and transcription terminators on the efficiency of expression vectors in various cultured cell lines and in the mammary gland of transgenic mice. J. Biotechnol. 40, 169–178 (1995).
Khalil, A. S. et al. A synthetic biology framework for programming eukaryotic transcription functions. Cell 150, 647–658 (2012).
Maeder, M. L., Thibodeau-Beganny, S., Sander, J. D., Voytas, D. F. & Joung, J. K. Oligomerized pool engineering (OPEN): an ‘open-source’ protocol for making customized zinc-finger arrays. Nat. Protoc. 4, 1471–1501 (2009).
Li, H. S. et al. Multidimensional control of therapeutic human cell function with synthetic gene circuits. Science 378, 1227–1234 (2022).
Feil, R., Wagner, J., Metzger, D. & Chambon, P. Regulation of Cre recombinase activity by mutated estrogen receptor ligand-binding domains. Biochem. Biophys. Res. Commun. 237, 752–757 (1997).
Bashor, C. J. et al. Complex signal processing in synthetic gene circuits using cooperative regulatory assemblies. Science 364, 593–597 (2019).
Donahue, P. S. et al. The COMET toolkit for composing customizable genetic programs in mammalian cells. Nat. Commun. 11, 779 (2020).
Muldoon, J. J. et al. Model-guided design of mammalian genetic programs. Sci. Adv. 7, eabe9375 (2021).
Kabadi, A. M. & Gersbach, C. A. Engineering synthetic TALE and CRISPR/Cas9 transcription factors for regulating gene expression. Methods 69, 188–197 (2014).
La Russa, M. F. & Qi, L. S. The new state of the art: Cas9 for gene activation and repression. Mol. Cell. Biol. 35, 3800–3809 (2015).
Sadowski, I., Ma, J., Triezenberg, S. & Ptashne, M. GAL4-VP16 is an unusually potent transcriptional activator. Nature 335, 563–564 (1988).
Shin, Y. et al. Spatiotemporal control of intracellular phase transitions using light-activated optoDroplets. Cell 168, 159–171 (2017).
Schneider, N. et al. Liquid-liquid phase separation of light-inducible transcription factors increases transcription activation in mammalian cells and mice. Sci. Adv. 7, eabd3568 (2021).
Gossen, M. & Bujard, H. Tight control of gene expression in mammalian cells by tetracycline-responsive promoters. Proc. Natl Acad. Sci. USA 89, 5547–5551 (1992).
Shin, Y. & Brangwynne, C. P. Liquid phase condensation in cell physiology and disease. Science 357, eaaf4382 (2017).
Tycko, J. et al. Development of compact transcriptional effectors using high-throughput measurements in diverse contexts. Nat. Biotechnol. https://doi.org/10.1038/s41587-024-02442-6 (2024).
Tague, E. P., Dotson, H. L., Tunney, S. N., Sloas, D. C. & Ngo, J. T. Chemogenetic control of gene expression and cell signaling with antiviral drugs. Nat. Methods 15, 519–522 (2018).
Jiang, K. et al. Rapid in silico directed evolution by a protein language model with EVOLVEpro. Science 387, eadr6006 (2025).
Lin, J., Luo, R. & Pinello, L. EPInformer: a scalable deep learning framework for gene expression prediction by integrating promoter-enhancer sequences with multimodal epigenomic data. Preprint at bioRxiv https://doi.org/10.1101/2024.08.01.606099 (2024).
Wimmer, E., Mueller, S., Tumpey, T. M. & Taubenberger, J. K. Synthetic viruses: a new opportunity to understand and prevent viral disease. Nat. Biotechnol. 27, 1163–1172 (2009).
Brophy, J. A. & Voigt, C. A. Principles of genetic circuit design. Nat. Methods 11, 508–520 (2014).
Pinglay, S. et al. Synthetic regulatory reconstitution reveals principles of mammalian Hox cluster regulation. Science 377, eabk2820 (2022).
Voigt, C. A. Synthetic biology 2020-2030: six commercially-available products that are changing our world. Nat. Commun. 11, 6379 (2020).
Valeri, J. A. et al. Sequence-to-function deep learning frameworks for engineered riboregulators. Nat. Commun. 11, 5058 (2020).
Hollerer, S. et al. Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping. Nat. Commun. 11, 3551 (2020).
Rai, K., Wang, Y., O’Connell, R. W., Patel, A. B. & Bashor, C. J. Using machine learning to enhance and accelerate synthetic biology. Curr. Opin. Biomed. Eng. 31, 100553 (2024).
Karst, S. M. et al. High-accuracy long-read amplicon sequences using unique molecular identifiers with Nanopore or PacBio sequencing. Nat. Methods 18, 165–169 (2021).
Chung, C. T., Niemela, S. L. & Miller, R. H. One-step preparation of competent Escherichia coli: transformation and storage of bacterial cells in the same solution. Proc. Natl Acad. Sci. USA 86, 2172–2175 (1989).
Parrish, J. R. et al. High-throughput cloning of Campylobacter jejuni ORFs by in vivo recombination in Escherichia coli. J. Proteome Res. 3, 582–586 (2004).
Currin, A. et al. Highly multiplexed, fast and accurate nanopore sequencing for verification of synthetic DNA constructs and sequence libraries. Synth. Biol. 4, ysz025 (2019).
De Coster, W., Weissensteiner, M. H. & Sedlazeck, F. J. Towards population-scale long-read sequencing. Nat. Rev. Genet. 22, 572–587 (2021).
Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).
Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, 819–823 (2013).
Hermann, M. et al. Binary recombinase systems for high-resolution conditional mutagenesis. Nucleic Acids Res. 42, 3894–3907 (2014).

Acknowledgements

We thank O. Igoshin, A. Patel, Y. Lagisetty, S. Singh and the members of the Bashor laboratory for discussions. This work was supported by grants from NIH R01 EB029483 (C.J.B.), NIH R01 EB032272 (C.J.B.), ONR N00014-21-1-4006 (C.J.B.) and funding from the Robert J. Kleberg Jr and Helen C. Kleberg Foundation (C.J.B.). This work was also supported by the Genetic Design and Engineering Center (GDEC) at Rice University, which is funded by CPRIT RP210116. R.W.O. was supported by a graduate fellowship from the American Heart Association (917746). B.K. was supported by an NLM Training Program in Biomedical Informatics and Data Science fellowship (T15LM007093-31) and by NIH grant P01-AI15299901. K.D.C. was supported by NSF EF-2126387 and the Ken Kennedy Institute Computational Science & Engineering Recruiting Fellowship. T.J.T. was supported by NSF grants IIS-2239114 and EF-2126387, NIH grant P01-AI152999 and AI2Health cluster funding from Ken Kennedy Institute, Rice University. P.M. and J.W.R. were supported by NIH R35GM119461 (P.M.).

Author information

These authors contributed equally: Kshitij Rai, Ronan W. O’Connell

Authors and Affiliations

Department of Bioengineering, Rice University, Houston, TX, USA
Kshitij Rai, Ronan W. O’Connell, Trenton C. Piepergerdes, Yiduo Wang, Lucas B. C. Brown, Kian D. Samra, Jack A. Wilson, Shujian Lin, Thomas H. Zhang, Eduardo M. Ramos, Andrew Sun, Todd J. Treangen & Caleb J. Bashor
Graduate Program in Systems, Synthetic and Physical Biology, Rice University, Houston, TX, USA
Kshitij Rai & Lucas B. C. Brown
Graduate Program in Bioengineering, Rice University, Houston, TX, USA
Ronan W. O’Connell, Trenton C. Piepergerdes & Yiduo Wang
Department of Computer Science, Rice University, Houston, TX, USA
Bryce Kille, Kristen D. Curry & Todd J. Treangen
Ken Kennedy Institute, Rice University, Houston, TX, USA
Jason W. Rocks, Todd J. Treangen & Caleb J. Bashor
Rice Synthetic Biology Institute, Houston, TX, USA
Todd J. Treangen & Caleb J. Bashor
Department of Physics, Boston University, Boston, MA, USA
Pankaj Mehta
Biological Design Center, Boston University, Boston, MA, USA
Pankaj Mehta
Faculty of Computing and Data Science, Boston University, Boston, MA, USA
Pankaj Mehta
Department of Biosciences, Rice University, Houston, TX, USA
Caleb J. Bashor

Authors

Kshitij Rai
Ronan W. O’Connell
Trenton C. Piepergerdes
Yiduo Wang
Lucas B. C. Brown
Kian D. Samra
Jack A. Wilson
Shujian Lin
Thomas H. Zhang
Eduardo M. Ramos
Andrew Sun
Bryce Kille
Kristen D. Curry
Jason W. Rocks
Todd J. Treangen
Pankaj Mehta
Caleb J. Bashor

Contributions

R.W.O., K.R. and C.J.B. conceived the study. R.W.O. and K.R. carried out the experiments and developed the analysis software, with assistance from T.C.P., Y.W., L.B.C.B., K.D.S., J.A.W., S.L., T.H.Z., E.M.R. and A.S.; R.W.O., K.R. and T.C.P. developed the modular cloning scheme and LP cell line, with assistance from S.L.; B.K., K.D.C. and T.J.T. helped to develop the barcoding scheme and analysis software. R.W.O., K.R., T.C.P., Y.W., J.W.R., P.M. and C.J.B. analysed the data. C.J.B. supervised the study. R.W.O., K.R. and C.J.B. wrote the manuscript, with input from all of the authors.

Corresponding author

Correspondence to Caleb J. Bashor.

Ethics declarations

Competing interests

A provisional patent application that covers technologies described in this Article has been filed by Rice University.

Peer review

Peer review information

Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Overview of 166k-member gene circuit library construction.

a, Timeline of library construction process from level 0 part fragments to level 3 libraries. b, Schematic of the construction strategy for each level of the assembly process (grey boxes), with number of cells indicated for assembly transformation and plating (brown circles, 10 cm plates; brown rectangles, autoclave trays), colony scraping, and HEK-LP transfection (pink circles, 10 cm plates). The inputs and products for each assembly level are represented to the right. c, Nanopore sequencing read-length distribution of level 3 single-input inducible circuit library, with the relative proportion of each identifiable DNA product denoted: library members (pink), re-circularized level 3 destination vector (orange), empty level 3 destination vector (dark grey), contamination from level 0, 1, or 2 genetic parts (light grey), or unidentified DNA species (blue).

Extended Data Fig. 2 166k-member library balance and barcoding analysis metrics.

a, Confusion matrices showing the percentage of reads unambiguously assigned to each genetic part (on-diagonal) and reads ambiguously linked to two parts (off-diagonal) following level 3 library assembly. Individual values representing <0.5% of the total reads are not shown. The percentage of reads in which a part identity could not be determined are shown at the bottom of the respective part confusion matrices. b, Number of barcodes determined to be uniquely mapped to a single composition (“unique”) or multiple compositions (“non-unique”), as determined from Nanopore sequencing analysis of the level 3 library.
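The "unique" versus "non-unique" barcode distinction in panel b can be illustrated with a minimal sketch. It assumes each long read has already been reduced to a (barcode, composition) pair; the function name `classify_barcodes` and the toy part labels are hypothetical and are not taken from the authors' WIMPY code.

```python
from collections import defaultdict


def classify_barcodes(read_assignments):
    """Group reads by barcode and separate barcodes that map to exactly one
    composition ("unique") from those that map to several ("non-unique").

    read_assignments: iterable of (barcode, composition) pairs, one per read.
    This input format is an assumption made for illustration.
    """
    compositions_by_barcode = defaultdict(set)
    for barcode, composition in read_assignments:
        compositions_by_barcode[barcode].add(composition)

    unique = {}
    non_unique = {}
    for barcode, compositions in compositions_by_barcode.items():
        if len(compositions) == 1:
            unique[barcode] = next(iter(compositions))
        else:
            non_unique[barcode] = compositions
    return unique, non_unique


# Toy reads: two barcodes point at the same composition, one is ambiguous.
reads = [
    ("AAAC", ("p1", "k2")),
    ("AAAC", ("p1", "k2")),
    ("GGTT", ("p1", "k2")),
    ("CCGA", ("p3", "k1")),
    ("CCGA", ("p2", "k1")),
]
unique, non_unique = classify_barcodes(reads)
```

Several barcodes may legitimately map to one composition; only a barcode linked to more than one composition is treated as ambiguous here.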

Extended Data Fig. 3 Comparison and optimization of ML models.

a, Top: schematic outlining the process of generating training (light grey), validation (blue), test (purple), and isolate (navy) sets. CLASSIC data sets are first divided into high- (>12 barcodes, dark grey) or general-quality (<12 barcodes, light grey) sets before generating the training, validation, and test splits. Bottom: comparison of the performance of 5 different model classes (linear regression, quadratic regression, random forest (RF), convolutional neural network (CNN), and multi-layer perceptron (MLP)) for predicting circuit behaviour using varying amounts of training data (x-axis), as monitored using test (purple line), validation (light blue line), and isolate (navy line) set r² values (y-axis) (see supplementary text section S3.5). Grey shaded region on each plot represents a regime in which the training set is dominated by general-quality reads. For the selected model class (MLP), training curves of the root mean squared error (RMSE) and loss for the validation set are provided. b, Basal (purple line) and induced (navy line) r² values for predicted vs observed CLASSIC expression from the trained MLP. Insets represent CLASSIC vs predicted measurements for >1 (bottom left, r² = 0.43) and >12 (top middle, r² = 0.80) barcodes per composition, and the number of compositions for increasing number of barcodes per composition (bottom right). c, Hyperparameter optimization (HPO) of the MLP, monitored using the validation set: learning rate (LR, y-axis) for different numbers of layers (x-axis) with 4 layers and a learning rate of 5 × 10⁻² providing highest r² (red outline) (top); solver choice, with SGDM leading to the highest r² (red bar) (bottom left); momentum, with a momentum of 0.9 providing the highest r² (red circle) (bottom right). d, Comparing ground-truth basal expression (left), induced expression (middle), and fold change measurements (right) (x-axis) with HPO MLP predictions (y-axis) for the isolate set (r² = 0.96, r² = 0.91, MAE = 0.22, respectively, n = 40).
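The quality split and the agreement metrics quoted in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the caption gives only ">12" and "<12" barcodes, so placing compositions with exactly 12 barcodes in the general-quality set is an assumption, as is treating r² as the squared Pearson correlation. All function names are hypothetical.

```python
def split_by_quality(barcode_counts, threshold=12):
    """Split compositions into high-quality (> threshold barcodes) and
    general-quality (everything else) sets.

    barcode_counts: dict mapping composition -> number of barcodes observed.
    """
    high = {c for c, n in barcode_counts.items() if n > threshold}
    general = set(barcode_counts) - high
    return high, general


def pearson_r2(observed, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


def mean_absolute_error(observed, predicted):
    """Mean absolute difference between observations and predictions."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Toy example with made-up barcode counts.
high, general = split_by_quality({"A": 15, "B": 12, "C": 3})
```

Both metrics assume non-constant inputs of equal length; `pearson_r2` divides by zero if either sequence has no variance.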

Extended Data Fig. 4 Summary of cell lines constructed from the 166k-library design space to validate predictions from the MLP model.

a, Basal and induced eGFP expression levels for each constructed cell line circuit composition overlaid onto single-input behaviour space (clonally isolated, green; constructed out-of-sample, red; constructed in-sample, teal; contour for 97.5% of compositions in the MLP-predicted behaviour space, grey). Dotted lines separate behavioural regions of interest: low basal (<500 AU), purple arrow; high induction (>70k AU), blue; high fold-change (HFC) (>25x), green. b, Comparing fold-change values for MLP model predictions (red, left, n = 136) or CLASSIC measurements (green, right, n = 121) with ground truth cell line measurements. c, Residual between model and CLASSIC measurements. Heatmap corresponds to the Manhattan distance between residuals of basal and induced expression values calculated across a 20 × 20 grid in the behaviour space. Variants were assigned to grids based on CLASSIC measured values. d, Basal and induced expression behaviour for the top 10 highest error compositions from the individual variants shown in (a). CLASSIC measurements (grey) and corresponding ground truth values measured from constructed cell lines (green) are shown. Lines (blue dashed) link each pair of CLASSIC and ground truth expression.
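The three behavioural regions named in panel a can be expressed as simple threshold checks. The sketch below uses the thresholds stated in the caption (<500 AU, >70k AU, >25x) and assumes that fold change is the ratio of induced to basal expression, which the caption does not define explicitly. The function name is hypothetical.

```python
def behaviour_regions(basal, induced,
                      low_basal=500.0, high_induced=70_000.0, high_fold=25.0):
    """Flag which regions of interest a circuit falls into.

    basal, induced: eGFP expression in arbitrary units (basal must be > 0).
    Fold change is taken as induced / basal (an assumption).
    """
    fold_change = induced / basal
    return {
        "low_basal": basal < low_basal,
        "high_induction": induced > high_induced,
        "high_fold_change": fold_change > high_fold,
    }


# A tight, strongly inducible circuit versus a leaky, weakly inducible one.
tight = behaviour_regions(basal=400.0, induced=80_000.0)
leaky = behaviour_regions(basal=5_000.0, induced=60_000.0)
```

A composition can satisfy more than one criterion at once, so the result is a set of flags rather than a single label.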

Extended Data Fig. 5 Clustering analysis of HFC compositions.

a, Gap test for cluster number (left) and subsequent UMAP projection of HFC variants grouped into 3 clusters (cluster A, blue; cluster B, red; cluster C, purple) (middle left). Cluster similarity scores from 100 independent k-means clustering outcomes for all compositions, and adjusted Rand score for all pairs of clustering results (right). Means represent the population cluster similarity and adjusted Rand index, respectively. b, Part usage frequency for variants in each cluster. c, Mapping of basal and induced eGFP expression of compositions from each of the three clusters, overlaid on a contour constructed from 97.5% of the data from the MLP-predicted behaviour space (see Fig. 4a) (grey fill). d, Distribution of basal (dotted line) and induced eGFP expression (solid line) (bottom axis), as well as fold change values (grey line, top axis) for each cluster. Circles represent the median values, boxes span the 25th to 75th percentiles, and the upper and lower whiskers represent the median +/− 1.5x IQR (line ends). Median values are shown to the left of the plot. Sample sizes (n): Cluster A = 5,018, Cluster B = 452, and Cluster C = 62.

Extended Data Fig. 6 Fine-tuning model.

a, Schematic depicting a proposed method for expanding the model-predicted design space (red) to include 2 new parts (NFZ and no IDP) by fine-tuning the model using small libraries of new parts (green) (left). Representation of the position of the new parts in the synTF architecture (right). b, Schematic outlining the assembly strategy to retroactively add the TA NFZ to the design space. Transparent green boxes signify individual plasmids or plasmid pools that contain the new part. c, Schematic outlining the assembly strategy to retroactively add IDP-less variants to the design space. Transparent green boxes signify individual plasmids or plasmid pools that contain the new part. d, eGFP distributions for the new libraries to explore this sub-space. e, Table of the number of cells sorted into each bin for both inducer conditions during flowSeq. f, 8 individually constructed variants from the sub-space to validate CLASSIC measurements. Grey region, ERCH; Green square, HFC region. g & h, Comparison of basal eGFP expression predictions and CLASSIC measurements for a high-quality test set of compositions lacking an IDP (panel g) or containing an NFZ TA (panel h), using either a base model (white dots with black outline, r² = 0.90 or r² = 0.81, respectively) or a fine-tuned model (purple dots, r² = 0.94 or r² = 0.89, respectively) (left). Breakdown of the basal (purple) and induced (teal) expression prediction accuracy with increasing amounts of fine-tuning data from the IDP-lacking (g) or NFZ-containing (h) libraries, as assessed by monitoring the test set r² (middle). A 2D map outlining the amounts of base library and no-IDP library (g) or NFZ-containing (h) data required for optimal fine-tuning of the base model, as determined by the test set r² (right). i, 11 individually constructed variants from the sampled (teal) and un-sampled (red) expanded design space to validate fine-tuned model predictions. Grey region, ERCH; Green square, HFC region.

Extended Data Fig. 7 Hyperparameter tuning and validation of base MLP model.

a, Hyperparameter optimization for the multi-layer fully connected neural network. Validation r² values for (left) 2D combinations of learning rate (y-axis) and number of layers (x-axis) for varied amounts of training data used, (top right) momentum parameter for stochastic gradient descent with momentum (SGDM), and (bottom right) different solvers. The optimal parameter from each scan is shown in red. b, Training curves showing RMSE (top) and loss (bottom) as a function of training iteration for the validation set. Training was stopped with a validation patience parameter of 300 iterations. c, Comparison of model predictions to randomly isolated clones from the library. MAE: mean absolute error. d, Scatter plots of the test set from the base model for each of the four input conditions (basal, light grey; OHT only, navy; GZV only, orange; both inducers, green). r², Pearson’s r².

Extended Data Fig. 8 Validation of model predictions with individual measurements.

a, Flow cytometry was used to measure basal, OHT-induced, GZV-induced, and dual-induced eGFP expression levels for 36 individually constructed cell lines harbouring integrated circuit compositions sampled from across the multi-input library behaviour space. Green bar plots (flow cytometry measurement) and dotted red outlines (model predictions) are shown for each circuit for all four conditions (far left, basal; middle left, 4-OHT-induction; middle right, GZV-induction; far right, dual induction). Bars represent the mean of the expression distribution for a single measurement. KL-divergence (D_KL) from AND (top) or OR (bottom) gate shown below each plot (prediction, red; measurement, green). Numbers in grey circles represent an index for that circuit. A legend explaining the layout of each plot is shown in a grey rectangle. b, The AND-OR coordinates of each cell line (green dots), superimposed on the contour of the design space (grey).
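The KL-divergence score (D_KL) in panel a compares a circuit's four-condition expression profile with an ideal logic gate. A minimal sketch of one way to compute such a score follows. The caption does not give the normalisation, the direction of the divergence or the handling of zeros, so the choices here (profiles normalised to sum to 1, a small pseudocount, divergence taken from the ideal gate to the measured profile) are assumptions for illustration only, and all names are hypothetical.

```python
import math

# Ideal truth tables over (basal, 4-OHT only, GZV only, both inducers).
AND_GATE = (0.0, 0.0, 0.0, 1.0)
OR_GATE = (0.0, 1.0, 1.0, 1.0)


def normalise(values, pseudocount=1e-3):
    """Turn non-negative values into a probability vector.

    The pseudocount (an assumption) keeps every entry strictly positive so
    that the logarithm below is always defined.
    """
    shifted = [v + pseudocount for v in values]
    total = sum(shifted)
    return [v / total for v in shifted]


def kl_divergence(p, q):
    """D_KL(p || q) in nats for two probability vectors with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def gate_divergence(expression, gate):
    """Divergence of an expression profile from an ideal gate (lower = closer)."""
    return kl_divergence(normalise(gate), normalise(expression))


# Toy profiles in arbitrary units: one AND-like, one OR-like.
and_like = (1.0, 1.0, 1.0, 100.0)
or_like = (1.0, 100.0, 100.0, 100.0)
```

With these toy profiles, `gate_divergence(and_like, AND_GATE)` is smaller than `gate_divergence(and_like, OR_GATE)`, and the reverse holds for `or_like`, which is the ordering an AND-OR coordinate system of this kind relies on.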

Extended Data Fig. 9 Extended AND-gate clustering analysis of the multi-input library.

a, Clustering of the multi-input AND-like behaviour space (top left) and part usage across the clusters. b, AND cluster expression distributions across the 4 input conditions (basal, black; 4-OHT, navy; GZV, orange; Both, green), represented by boxplots outlining the interquartile range (IQR) (box), the median (white band), and the median ± 1.5× IQR (line ends). Median values are shown to the left of the plot. Sample sizes (n): Cluster A = 11,167, Cluster B = 9,854, and Cluster C = 3,627. c, Cluster stability analysis of compositions in the AND-like behaviour space by computing the cluster similarity index (top) across 100 UMAP projections and cluster calculations, and adjusted Rand index (bottom) across every pairwise combination of clustering results across the 100 UMAP projections and cluster calculations.
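The pairwise adjusted Rand index (ARI) comparison in panel c can be sketched as follows. This is a minimal standard-library implementation of the usual ARI formula, applied to every pair of label assignments; it does not reproduce the UMAP projection or clustering steps, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two clusterings of the same items (1.0 = identical partitions)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    # Pair counts from the contingency table and from each clustering's marginals.
    index = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    expected = sum_a * sum_b / total_pairs if total_pairs else 0.0
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both clusterings are a single cluster
    return (index - expected) / (maximum - expected)


def pairwise_ari(clusterings):
    """ARI for every pairwise combination of clustering results."""
    return [adjusted_rand_index(a, b) for a, b in combinations(clusterings, 2)]


# Three toy clustering runs over four items (labels are arbitrary integers).
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
print(pairwise_ari(runs))
```

Because ARI ignores the label names themselves, runs that produce the same partition under different cluster numbering score 1, which is what makes it suitable for comparing repeated projections and clusterings.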

Extended Data Fig. 10 Extended OR-gate clustering analysis of the multi-input library.

a, Clustering of the multi-input OR-like behaviour space (top left) and part usage across the clusters. b, OR cluster expression distributions across the 4 input conditions (basal, black; 4-OHT, navy; GZV, orange; Both, green), represented by boxplots outlining the interquartile range (IQR) (box), the median (white band), and the median ± 1.5× IQR (line ends). Median values are shown to the left of the plot. Sample sizes (n): Cluster A = 9,240, Cluster B = 7,138, Cluster C = 2,908, and Cluster D = 1,545. c, Cluster stability analysis of compositions in the OR-like behaviour space by computing the cluster similarity index (top) across 100 UMAP projections and cluster calculations, and adjusted Rand index (bottom) across every pairwise combination of clustering results across the 100 UMAP projections and cluster calculations.
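The boxplot elements described in panel b can be computed as in the sketch below. It follows the legend's wording literally (line ends at the median plus or minus 1.5 times the IQR); the quartile interpolation method is an assumption, since the legend does not specify one, and `boxplot_summary` and the toy values are hypothetical.

```python
import statistics


def boxplot_summary(values):
    """Median, IQR box and line ends as described in the legend.

    Assumption: quartiles use the 'inclusive' interpolation method.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "median": median,                # white band
        "box": (q1, q3),                 # interquartile range
        "line_ends": (median - 1.5 * iqr, median + 1.5 * iqr),
    }


# Toy expression values for one cluster under one input condition.
print(boxplot_summary([1, 2, 3, 4, 5]))
```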

Supplementary information

Supplementary Information (download PDF)

Supplementary Notes and Supplementary Figures supporting the Article and its Extended Data Figures.

Reporting Summary (download PDF)

Peer Review file (download PDF)

Supplementary Table 1 (download XLSX)

DNA sequences used in this study. This includes genetic parts and primers.

Supplementary Table 2 (download XLSX)

Flow cytometry measurements, model predictions and CLASSIC measurements (where applicable) for individually constructed variants and isolated cell lines.

Supplementary Table 3 (download XLSX)

Part use, MI and clustering information from the single-input and multi-input libraries.

Supplementary Table 4 (download XLSX)

A breakeven table for calculating the cost of CLASSIC experiments of varying sizes and complexity.

Supplementary Table 5 (download XLSX)

A list of published single- and dual-input inducible circuits, as well as information such as cell type, integration method, enrichment strategy, inducer molecule, output and FC (where applicable).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rai, K., O’Connell, R.W., Piepergerdes, T.C. et al. Ultra-high-throughput mapping of genetic design space. Nature 650, 1035–1044 (2026). https://doi.org/10.1038/s41586-025-09933-9

Received: 16 June 2023
Accepted: 19 November 2025
Published: 14 January 2026
Version of record: 14 January 2026
Issue date: 26 February 2026
DOI: https://doi.org/10.1038/s41586-025-09933-9