Enabling global-scale nucleic acid repositories through versatile, scalable biochemical selection from room-temperature archives

Berleant, Joseph D.; Banal, James L.; Rao, Dhriti K.; Bathe, Mark

doi:10.1038/s41467-026-69402-3

Article
Open access
Published: 14 February 2026

Nature Communications (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Abstract

Conventional storage and retrieval of nucleic acid specimens, particularly unstable RNA, rely on costly cold-chain infrastructure and inefficient robotic handling, inhibiting large-scale nucleic acid archives needed for global genomic biobanking. We introduce a scalable room-temperature storage system with minimal physical footprint that enables database-like queries on encapsulated, barcoded, and pooled nucleic acid samples. Queries incorporate numerical ranges, categorical filters, and combinations thereof, advancing beyond previous demonstrations of single-sample retrieval or Boolean classifiers. We evaluate this system on ninety-six mock SARS-CoV-2 genomic samples barcoded with theoretical patient data including age, location, and diagnostic state, demonstrating rapid, scalable retrieval. We further demonstrate storage and sequencing of human patient-derived nucleic acid samples, illustrating applicability to clinical genomic analysis. By avoiding freezer-based storage and retrieval, this approach scales to millions of samples without loss of fidelity or throughput, enabling large-scale pathogen and genomic repositories in under-resourced or isolated regions of the US and worldwide.
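The query model described above (numerical ranges, categorical filters, and combinations of the two) can be illustrated in software. The following is a minimal in-silico sketch, not the authors' biochemical selection procedure and not the API of the BiosampleSQL repository. The `Sample` fields, the helper names (`age_between`, `location_in`, `all_of`, `any_of`, `select`), and the mock records are all hypothetical and chosen only to mirror the age, location, and diagnostic-state metadata mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical metadata record for one pooled sample (illustrative fields only).
@dataclass(frozen=True)
class Sample:
    sample_id: int
    age: int
    location: str
    positive: bool

Predicate = Callable[[Sample], bool]

def age_between(low: int, high: int) -> Predicate:
    """Numerical range filter (inclusive on both ends)."""
    return lambda s: low <= s.age <= high

def location_in(*names: str) -> Predicate:
    """Categorical filter: keep samples whose location is one of `names`."""
    allowed = frozenset(names)
    return lambda s: s.location in allowed

def is_positive(s: Sample) -> bool:
    """Categorical filter on diagnostic state."""
    return s.positive

def all_of(*preds: Predicate) -> Predicate:
    """Conjunction of filters (logical AND)."""
    return lambda s: all(p(s) for p in preds)

def any_of(*preds: Predicate) -> Predicate:
    """Disjunction of filters (logical OR)."""
    return lambda s: any(p(s) for p in preds)

def select(pool: Iterable[Sample], query: Predicate) -> List[Sample]:
    """Return every sample in the pool that satisfies the query."""
    return [s for s in pool if query(s)]

# Mock pool; values are invented for illustration.
pool = [
    Sample(1, 23, "MA", True),
    Sample(2, 67, "CA", True),
    Sample(3, 45, "MA", False),
    Sample(4, 71, "WA", True),
    Sample(5, 34, "CA", False),
]

# Positive samples from patients aged 60 or older, located in CA or WA.
query = all_of(age_between(60, 120), location_in("CA", "WA"), is_positive)
hits = select(pool, query)
print([s.sample_id for s in hits])  # prints [2, 4]
```

Composing small predicates keeps range and categorical conditions independent, so a combined query is just a conjunction or disjunction of simpler ones.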

Data availability

Raw sequencing data from human-derived samples have been deposited in the NCBI BioProject database under accession number PRJNA1344794: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1344794. Processed match counts to each internal barcode for each experiment are available on Zenodo⁶³ at https://doi.org/10.5281/zenodo.10501347. Raw datasets are available on Zenodo⁶⁴ at https://doi.org/10.5281/zenodo.17516191. Figure source data are provided with this paper.

Code availability

Data analysis scripts with processed outputs are archived on Zenodo⁶³ at https://doi.org/10.5281/zenodo.10501347 and are available in the GitHub repository https://github.com/lcbb/BiosampleSQL under the MIT license. The version of this repository associated with this publication is archived on Zenodo⁶⁵ at https://doi.org/10.5281/zenodo.17402438.

References

Kreier, F. The myriad ways sewage surveillance is helping fight COVID around the world. Nature https://doi.org/10.1038/d41586-021-01234-1 (2021).
Collins, F. S. & Varmus, H. A New Initiative on Precision Medicine. N. Engl. J. Med. 372, 793–795 (2015).
Vargas, A. J. & Harris, C. C. Biomarker development in the precision medicine era: lung cancer as a case study. Nat. Rev. Cancer 16, 525–537 (2016).
Tarazona, S., Arzalluz-Luque, A. & Conesa, A. Undisclosed, unmet and neglected challenges in multi-omics studies. Nat. Comput. Sci. 1, 395–402 (2021).
Lee, S. B. et al. Assessing a novel room temperature DNA storage medium for forensic biological samples. Forensic Sci. Int. Genet. 6, 31–40 (2012).
Ryder, O. A., McLaren, A., Brenner, S., Zhang, Y.-P. & Benirschke, K. DNA Banks for endangered animal species. Science 288, 275–277 (2000).
Brandies, P., Peel, E., Hogg, C. J. & Belov, K. The value of reference genomes in the conservation of threatened species. Genes 10, 846 (2019).
Kieffer, C., Genot, A. J., Rondelez, Y. & Gines, G. Molecular computation for molecular classification. Adv. Biol. 7, 2200203 (2023).
Zhang, D. Y. & Seelig, G. Dynamic DNA nanotechnology using strand-displacement reactions. Nat. Chem. 3, 103–113 (2011).
Lopez, R., Wang, R. & Seelig, G. A molecular multi-gene classifier for disease diagnostics. Nat. Chem. 10, 746–754 (2018).
Zhang, C. et al. Cancer diagnosis with DNA molecular computation. Nat. Nanotechnol. 15, 709–715 (2020).
Yin, F. et al. DNA-framework-based multidimensional molecular classifiers for cancer diagnosis. Nat. Nanotechnol. 18, 677–686 (2023).
Roundtree, I. A. & He, C. RNA epigenetics—chemical messages for posttranscriptional gene regulation. Curr. Opin. Chem. Biol. 30, 46–51 (2016).
Kan, R. L., Chen, J. & Sallam, T. Crosstalk between epitranscriptomic and epigenetic mechanisms in gene regulation. Trends Genet. 38, 182–193 (2022).
Helm, M. & Motorin, Y. Detecting RNA modifications in the epitranscriptome: predict and validate. Nat. Rev. Genet. 18, 275–291 (2017).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Elliott, P., Peakman, T. C. & UK Biobank. The UK Biobank sample handling and storage protocol for the collection, processing and archiving of human blood and urine. Int. J. Epidemiol. 37, 234–244 (2008).
Bull, R. A. et al. Analytical validity of nanopore sequencing for rapid SARS-CoV-2 genome analysis. Nat. Commun. 11, 6272 (2020).
Minogue, T. D., Koehler, J. W., Stefan, C. P. & Conrad, T. A. Next-generation sequencing for biodefense: biothreat detection, forensics, and the clinic. Clin. Chem. 65, 383–392 (2019).
Whitmore, L. et al. Inadvertent human genomic bycatch and intentional capture raise beneficial applications and ethical concerns with environmental DNA. Nat. Ecol. Evol. 7, 873–888 (2023).
Opitz, L. et al. Impact of RNA degradation on gene expression profiling. BMC Med. Genomics 3, 36 (2010).
Gallego Romero, I., Pai, A. A., Tung, J. & Gilad, Y. RNA-seq: impact of RNA degradation on transcript quantification. BMC Biol. 12, 42 (2014).
Mendy, M. et al. Biospecimens and Biobanking in Global Health. Glob. Health Pathol. 38, 183–207 (2018).
Ziyatdinov, A. et al. Genotyping, sequencing and analysis of 140,000 adults from Mexico City. Nature 622, 784–793 (2023).
Wall, J. D. et al. The GenomeAsia 100K project enables genetic discoveries across Asia. Nature 576, 106–111 (2019).
Naslavsky, M. S. et al. Whole-genome sequencing of 1,171 elderly admixed individuals from Brazil. Nat. Commun. 13, 1004 (2022).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
Bick, A. G. et al. Genomic data in the All of Us Research Program. Nature 627, 340–346 (2024).
Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).
Tomek, K. J. et al. Driving the scalability of DNA-based information storage systems. ACS Synth. Biol. 8, 1241–1248 (2019).
Banal, J. L. & Bathe, M. Scalable nucleic acid storage and retrieval using barcoded microcapsules. ACS Appl. Mater. Interfaces 13, 49729–49736 (2021).
Banal, J. L. et al. Random access DNA memory using Boolean search in an archival file storage system. Nat. Mater. 20, 1272–1280 (2021).
Organick, L. et al. Probing the physical limits of reliable DNA data retrieval. Nat. Commun. 11, 616 (2020).
Xu, Q., Schlabach, M. R., Hannon, G. J. & Elledge, S. J. Design of 240,000 orthogonal 25mer DNA barcode probes. Proc. Natl. Acad. Sci. USA 106, 2289–2294 (2009).
Porichis, F. et al. High-throughput detection of miRNAs and gene-specific mRNA at the single-cell level by flow cytometry. Nat. Commun. 5, 5641 (2014).
Goldstein, E., Lipsitch, M. & Cevik, M. On the effect of age on the transmission of SARS-CoV-2 in households, schools, and the community. J. Infect. Dis. 223, 362–369 (2021).
Fauver, J. R. et al. Coast-to-coast spread of SARS-CoV-2 during the Early Epidemic in the United States. Cell 181, 990–996 (2020).
Kishi, J. Y. et al. SABER amplifies FISH: enhanced multiplexed imaging of RNA and DNA in cells and tissues. Nat. Methods 16, 533–544 (2019).
Player, A. N., Shen, L.-P., Kenny, D., Antao, V. P. & Kolberg, J. A. Single-copy gene detection using branched DNA (bDNA) in situ hybridization. J. Histochem. Cytochem. 49, 603–611 (2001).
Tao, K. et al. The biological and clinical significance of emerging SARS-CoV-2 variants. Nat. Rev. Genet. 22, 757–773 (2021).
Bei, Y. et al. Overcoming variant mutation-related impacts on viral sequencing and detection methodologies. Front. Med. 9, 989913 (2022).
Karthikeyan, S. et al. Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission. Nature 609, 101–108 (2022).
Lagerborg, K. A. et al. Synthetic DNA spike-ins (SDSIs) enable sample tracking and detection of inter-sample contamination in SARS-CoV-2 sequencing workflows. Nat. Microbiol. 7, 108–119 (2022).
Kubik, S. et al. Recommendations for accurate genotyping of SARS-CoV-2 using amplicon-based sequencing of clinical samples. Clin. Microbiol. Infect. 27, 1036.e1–1036.e8 (2021).
Rosenthal, S. H. et al. Development and validation of a high throughput SARS-CoV-2 whole genome sequencing workflow in a clinical laboratory. Sci. Rep. 12, 2054 (2022).
BigQuery public datasets. Google Cloud https://cloud.google.com/bigquery/public-data.
Open Datasets Documentation - Tutorials, API reference - Azure - Azure Open Datasets. https://learn.microsoft.com/en-us/azure/open-datasets/.
Open Data on AWS. https://aws.amazon.com/opendata/.
The Nucleic Acid Observatory Consortium. A global nucleic acid observatory for biodefense and planetary health. Preprint at arXiv:2108.02678 (2021).
Azenta Life Sciences. Cryogenic Storage Solutions in Life Sciences. https://www.azenta.com/learning-center/resources/cryogenic-storage-solutions-life-sciences-comprehensive-guide-decision-making (2024).
Bee, C. et al. Molecular-level similarity search brings computing to DNA data storage. Nat. Commun. 12, 4764 (2021).
Eldjarn, G. H. et al. Large-scale plasma proteomics comparisons through genetics and disease associations. Nature 622, 348–358 (2023).
Zhao, T. et al. Spatial genomics enables multi-modal study of clonal heterogeneity in tissues. Nature 601, 85–91 (2022).
Hunter, J. D. Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 9, 90–95 (2007).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Knuth, D. E. The Art of Computer Programming, Volume 4, Fascicle 2: Generating All Tuples and Permutations. (Addison-Wesley, 2005).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Wilm, A. et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 40, 11189–11201 (2012).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
McKenna, A. et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Aksamentov, I., Roemer, C., Hodcroft, E. & Neher, R. Nextclade: clade assignment, mutation calling and quality control for viral genomes. J. Open Source Softw. 6, 3773 (2021).
Berleant, J. D., Banal, J. L., Rao, D. K. & Bathe, M. Enabling global-scale nucleic acid repositories through versatile, scalable biochemical selection from room-temperature archives. Zenodo https://doi.org/10.5281/ZENODO.10501347 (2025).
Berleant, J. D., Banal, J. L., Rao, D. K. & Bathe, M. Full datasets from: Enabling global-scale nucleic acid repositories through versatile, scalable biochemical selection from room-temperature archives. Zenodo https://doi.org/10.5281/ZENODO.17516191 (2025).
Berleant, J. D., Banal, J. L., Rao, D. K. & Bathe, M. lcbb/BiosampleSQL: Publication release. Zenodo https://doi.org/10.5281/ZENODO.17402438 (2025).
NIAID Visual & Medical Arts. Eppendorf Tube. NIAID NIH BIOART Source. bioart.niaid.nih.gov/bioart/143 (2024).
NIAID Visual & Medical Arts. 96 Well Plate. NIAID NIH BIOART source. bioart.niaid.nih.gov/bioart/7 (2024).
NIAID Visual & Medical Arts. Next gen sequencer. NIAID NIH BIOART source. bioart.niaid.nih.gov/bioart/386 (2024).

Acknowledgements

M.B. and J.D.B. were supported by the Office of Naval Research (N00014-21-1-4013), the Army Research Office (ICB Subaward KK1954), and the National Science Foundation (CBET-1729397, OAC-1940231, and CCF-1956054). Additional funding to M.B. was provided through the National Science Foundation (CCF-2403100) and to J.D.B. through a National Science Foundation Graduate Research Fellowship (Grant No. 1122374). J.L.B. acknowledges support in part by the National Science Foundation SBIR Phase I 2136447, UCSF Parnassus Flow CoLab RRID:SCR_018206, DRC Center Grant NIH P30 DK063720, UCSF Center for Advanced Technology at Mission Bay, and Illumina. This research was also supported by a core center grant from the National Institute of Environmental Health Sciences, National Institutes of Health (P30-ES002109). We are grateful to T.B. Schardl and C.E. Leiserson (MIT CSAIL) for useful discussions on DNA barcoding. We thank G. Paradis, M. Jennings, and M. Griffin of the Flow Cytometry Core at the Koch Institute at the Massachusetts Institute of Technology (MIT) for flow sorting assistance. We are grateful to Delaware Diagnostics Labs for providing us with de-identified clinical SARS-CoV-2 samples. We thank G. Tikhorimov for providing access to a Beckman Coulter Labcyte Echo 550. We are grateful to Ella Maru Studio, Inc. for assistance in creating the airport schematic in Fig. 1a.

Author information

James L. Banal
Present address: Cache DNA, Inc., San Carlos, CA, USA
These authors contributed equally: Joseph D. Berleant, James L. Banal.

Authors and Affiliations

Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
Joseph D. Berleant, James L. Banal & Mark Bathe
University of Cambridge, Cambridge, UK
Dhriti K. Rao
Broad Institute of MIT and Harvard, Cambridge, MA, USA
Mark Bathe

Contributions

M.B., J.D.B., and J.L.B. conceived the sample storage system. J.D.B. designed the sample barcoding scheme and query language architecture. J.L.B. designed and implemented sample synthesis, FAS selection, and post-processing after selection of mock patient samples, and encapsulation and barcoding of clinical SARS-CoV-2 samples. D.K.R. prepared samples for sequencing and analyzed the data. J.L.B. and J.D.B. performed data analysis after querying and calculation of summary statistics. M.B. supervised the entire project. All authors contributed equally to the writing of the manuscript.

Corresponding author

Correspondence to Mark Bathe.

Ethics declarations

Competing interests

The Massachusetts Institute of Technology has filed a patent related to this work on behalf of J.L.B., M.B., J.D.B., and additional inventors (US Patent App. 17/836,726). J.L.B. and M.B. are co-founders and equity shareholders of Cache DNA, Inc. (Cache). J.L.B. is an employee of Cache and an independent contractor of OpenAI. D.K.R. was an intern at Cache for the period of this work.

Peer review

Peer review information

Nature Communications thanks Fajia Sun and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary Files

Supplementary Dataset 1

Supplementary Dataset 2

Supplementary Dataset 3

Supplementary Dataset 4

Supplementary Dataset 5

Supplementary Dataset 6

Reporting Summary

Transparent Peer Review file

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Berleant, J.D., Banal, J.L., Rao, D.K. et al. Enabling global-scale nucleic acid repositories through versatile, scalable biochemical selection from room-temperature archives. Nat Commun (2026). https://doi.org/10.1038/s41467-026-69402-3

Received: 02 April 2025
Accepted: 28 January 2026
Published: 14 February 2026
DOI: https://doi.org/10.1038/s41467-026-69402-3