Abstract
Conventional storage and retrieval of nucleic acid specimens, particularly unstable RNA, rely on costly cold-chain infrastructure and inefficient robotic handling, inhibiting large-scale nucleic acid archives needed for global genomic biobanking. We introduce a scalable room-temperature storage system with minimal physical footprint that enables database-like queries on encapsulated, barcoded, and pooled nucleic acid samples. Queries incorporate numerical ranges, categorical filters, and combinations thereof, advancing beyond previous demonstrations of single-sample retrieval or Boolean classifiers. We evaluate this system on ninety-six mock SARS-CoV-2 genomic samples barcoded with theoretical patient data including age, location, and diagnostic state, demonstrating rapid, scalable retrieval. We further demonstrate storage and sequencing of human patient-derived nucleic acid samples, illustrating applicability to clinical genomic analysis. By avoiding freezer-based storage and retrieval, this approach scales to millions of samples without loss of fidelity or throughput, enabling large-scale pathogen and genomic repositories in under-resourced or isolated regions of the US and worldwide.
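As an illustration only, the query semantics described above (numerical ranges, categorical filters, and combinations thereof) can be sketched in software. The following Python sketch is a purely in-silico analogue and does not reproduce the biochemical selection performed on the encapsulated samples. The field names (`age`, `location`, `diagnosis`), the example values, and all helper functions are hypothetical and are not taken from the BiosampleSQL repository.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Sample:
    """Hypothetical metadata record standing in for one barcoded sample."""
    sample_id: int
    age: int
    location: str
    diagnosis: str

# A predicate maps a sample's metadata to True (select) or False (reject).
Predicate = Callable[[Sample], bool]

def in_range(field: str, lo: int, hi: int) -> Predicate:
    """Numerical range filter, inclusive on both ends."""
    return lambda s: lo <= getattr(s, field) <= hi

def equals(field: str, value: str) -> Predicate:
    """Categorical filter on a single field."""
    return lambda s: getattr(s, field) == value

def all_of(*preds: Predicate) -> Predicate:
    """Conjunction (AND) of several filters."""
    return lambda s: all(p(s) for p in preds)

def any_of(*preds: Predicate) -> Predicate:
    """Disjunction (OR) of several filters."""
    return lambda s: any(p(s) for p in preds)

def select(archive: List[Sample], pred: Predicate) -> List[Sample]:
    """Return the samples in the pooled archive that satisfy the query."""
    return [s for s in archive if pred(s)]

def make_mock_archive(n: int = 96) -> List[Sample]:
    """Build n mock records with deterministic, made-up metadata."""
    locations = ["MA", "CA", "WA", "NY"]      # assumed example categories
    diagnoses = ["positive", "negative"]      # assumed example states
    return [
        Sample(i, age=(i * 7) % 90,
               location=locations[i % len(locations)],
               diagnosis=diagnoses[i % len(diagnoses)])
        for i in range(n)
    ]

if __name__ == "__main__":
    archive = make_mock_archive()
    # Combined query: a numerical range AND two categorical filters.
    query = all_of(in_range("age", 20, 40),
                   equals("location", "MA"),
                   equals("diagnosis", "positive"))
    for sample in select(archive, query):
        print(sample)
```

Composing filters as predicates keeps range and categorical conditions interchangeable, so a combined query is simply a conjunction or disjunction of simpler ones.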
Data availability
Raw sequencing data from human-derived samples have been deposited in the NCBI BioProject database under accession number PRJNA1344794: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1344794. Processed match counts to each internal barcode for each experiment are available on Zenodo at https://doi.org/10.5281/zenodo.10501347. Raw datasets are available on Zenodo at https://doi.org/10.5281/zenodo.17516191. Figure source data are provided in this paper.
Code availability
Data analysis scripts with processed outputs are archived on Zenodo and are available at https://doi.org/10.5281/zenodo.10501347 and on the GitHub repository https://github.com/lcbb/BiosampleSQL under the MIT license. The version of this repository associated with this publication is archived on Zenodo and is accessible at https://doi.org/10.5281/zenodo.17402438.
References
Kreier, F. The myriad ways sewage surveillance is helping fight COVID around the world. Nature https://doi.org/10.1038/d41586-021-01234-1 (2021).
Collins, F. S. & Varmus, H. A new initiative on precision medicine. N. Engl. J. Med. 372, 793–795 (2015).
Vargas, A. J. & Harris, C. C. Biomarker development in the precision medicine era: lung cancer as a case study. Nat. Rev. Cancer 16, 525–537 (2016).
Tarazona, S., Arzalluz-Luque, A. & Conesa, A. Undisclosed, unmet and neglected challenges in multi-omics studies. Nat. Comput. Sci. 1, 395–402 (2021).
Lee, S. B. et al. Assessing a novel room temperature DNA storage medium for forensic biological samples. Forensic Sci. Int. Genet. 6, 31–40 (2012).
Ryder, O. A., McLaren, A., Brenner, S., Zhang, Y.-P. & Benirschke, K. DNA banks for endangered animal species. Science 288, 275–277 (2000).
Brandies, P., Peel, E., Hogg, C. J. & Belov, K. The value of reference genomes in the conservation of threatened species. Genes 10, 846 (2019).
Kieffer, C., Genot, A. J., Rondelez, Y. & Gines, G. Molecular computation for molecular classification. Adv. Biol. 7, 2200203 (2023).
Zhang, D. Y. & Seelig, G. Dynamic DNA nanotechnology using strand-displacement reactions. Nat. Chem. 3, 103–113 (2011).
Lopez, R., Wang, R. & Seelig, G. A molecular multi-gene classifier for disease diagnostics. Nat. Chem. 10, 746–754 (2018).
Zhang, C. et al. Cancer diagnosis with DNA molecular computation. Nat. Nanotechnol. 15, 709–715 (2020).
Yin, F. et al. DNA-framework-based multidimensional molecular classifiers for cancer diagnosis. Nat. Nanotechnol. 18, 677–686 (2023).
Roundtree, I. A. & He, C. RNA epigenetics—chemical messages for posttranscriptional gene regulation. Curr. Opin. Chem. Biol. 30, 46–51 (2016).
Kan, R. L., Chen, J. & Sallam, T. Crosstalk between epitranscriptomic and epigenetic mechanisms in gene regulation. Trends Genet. 38, 182–193 (2022).
Helm, M. & Motorin, Y. Detecting RNA modifications in the epitranscriptome: predict and validate. Nat. Rev. Genet. 18, 275–291 (2017).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Elliott, P., Peakman, T. C. & UK Biobank. The UK Biobank sample handling and storage protocol for the collection, processing and archiving of human blood and urine. Int. J. Epidemiol. 37, 234–244 (2008).
Bull, R. A. et al. Analytical validity of nanopore sequencing for rapid SARS-CoV-2 genome analysis. Nat. Commun. 11, 6272 (2020).
Minogue, T. D., Koehler, J. W., Stefan, C. P. & Conrad, T. A. Next-generation sequencing for biodefense: biothreat detection, forensics, and the clinic. Clin. Chem. 65, 383–392 (2019).
Whitmore, L. et al. Inadvertent human genomic bycatch and intentional capture raise beneficial applications and ethical concerns with environmental DNA. Nat. Ecol. Evol. 7, 873–888 (2023).
Opitz, L. et al. Impact of RNA degradation on gene expression profiling. BMC Med. Genomics 3, 36 (2010).
Gallego Romero, I., Pai, A. A., Tung, J. & Gilad, Y. RNA-seq: impact of RNA degradation on transcript quantification. BMC Biol. 12, 42 (2014).
Mendy, M. et al. Biospecimens and biobanking in global health. Glob. Health Pathol. 38, 183–207 (2018).
Ziyatdinov, A. et al. Genotyping, sequencing and analysis of 140,000 adults from Mexico City. Nature 622, 784–793 (2023).
Wall, J. D. et al. The GenomeAsia 100K project enables genetic discoveries across Asia. Nature 576, 106–111 (2019).
Naslavsky, M. S. et al. Whole-genome sequencing of 1,171 elderly admixed individuals from Brazil. Nat. Commun. 13, 1004 (2022).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
Bick, A. G. et al. Genomic data in the All of Us Research Program. Nature 627, 340–346 (2024).
Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).
Tomek, K. J. et al. Driving the scalability of DNA-based information storage systems. ACS Synth. Biol. 8, 1241–1248 (2019).
Banal, J. L. & Bathe, M. Scalable nucleic acid storage and retrieval using barcoded microcapsules. ACS Appl. Mater. Interfaces 13, 49729–49736 (2021).
Banal, J. L. et al. Random access DNA memory using Boolean search in an archival file storage system. Nat. Mater. 20, 1272–1280 (2021).
Organick, L. et al. Probing the physical limits of reliable DNA data retrieval. Nat. Commun. 11, 616 (2020).
Xu, Q., Schlabach, M. R., Hannon, G. J. & Elledge, S. J. Design of 240,000 orthogonal 25mer DNA barcode probes. Proc. Natl. Acad. Sci. USA 106, 2289–2294 (2009).
Porichis, F. et al. High-throughput detection of miRNAs and gene-specific mRNA at the single-cell level by flow cytometry. Nat. Commun. 5, 5641 (2014).
Goldstein, E., Lipsitch, M. & Cevik, M. On the effect of age on the transmission of SARS-CoV-2 in households, schools, and the community. J. Infect. Dis. 223, 362–369 (2021).
Fauver, J. R. et al. Coast-to-coast spread of SARS-CoV-2 during the early epidemic in the United States. Cell 181, 990–996 (2020).
Kishi, J. Y. et al. SABER amplifies FISH: enhanced multiplexed imaging of RNA and DNA in cells and tissues. Nat. Methods 16, 533–544 (2019).
Player, A. N., Shen, L.-P., Kenny, D., Antao, V. P. & Kolberg, J. A. Single-copy gene detection using branched DNA (bDNA) in situ hybridization. J. Histochem. Cytochem. 49, 603–611 (2001).
Tao, K. et al. The biological and clinical significance of emerging SARS-CoV-2 variants. Nat. Rev. Genet. 22, 757–773 (2021).
Bei, Y. et al. Overcoming variant mutation-related impacts on viral sequencing and detection methodologies. Front. Med. 9, 989913 (2022).
Karthikeyan, S. et al. Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission. Nature 609, 101–108 (2022).
Lagerborg, K. A. et al. Synthetic DNA spike-ins (SDSIs) enable sample tracking and detection of inter-sample contamination in SARS-CoV-2 sequencing workflows. Nat. Microbiol. 7, 108–119 (2022).
Kubik, S. et al. Recommendations for accurate genotyping of SARS-CoV-2 using amplicon-based sequencing of clinical samples. Clin. Microbiol. Infect. 27, 1036.e1–1036.e8 (2021).
Rosenthal, S. H. et al. Development and validation of a high throughput SARS-CoV-2 whole genome sequencing workflow in a clinical laboratory. Sci. Rep. 12, 2054 (2022).
BigQuery public datasets. Google Cloud https://cloud.google.com/bigquery/public-data.
Open Datasets Documentation - Tutorials, API reference - Azure - Azure Open Datasets. https://learn.microsoft.com/en-us/azure/open-datasets/.
Open Data on AWS. https://aws.amazon.com/opendata/.
The Nucleic Acid Observatory Consortium. A global nucleic acid observatory for biodefense and planetary health. Preprint at arXiv:2108.02678 (2021).
Azenta Life Sciences. Cryogenic Storage Solutions in Life Sciences. https://www.azenta.com/learning-center/resources/cryogenic-storage-solutions-life-sciences-comprehensive-guide-decision-making (2024).
Bee, C. et al. Molecular-level similarity search brings computing to DNA data storage. Nat. Commun. 12, 4764 (2021).
Eldjarn, G. H. et al. Large-scale plasma proteomics comparisons through genetics and disease associations. Nature 622, 348–358 (2023).
Zhao, T. et al. Spatial genomics enables multi-modal study of clonal heterogeneity in tissues. Nature 601, 85–91 (2022).
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Knuth, D. E. The Art of Computer Programming, Volume 4, Fascicle 2: Generating All Tuples and Permutations. (Addison-Wesley, 2005).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Wilm, A. et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 40, 11189–11201 (2012).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Aksamentov, I., Roemer, C., Hodcroft, E. & Neher, R. Nextclade: clade assignment, mutation calling and quality control for viral genomes. J. Open Source Softw. 6, 3773 (2021).
Berleant, J. D., Banal, J. L., Rao, D. K. & Bathe, M. Enabling global-scale nucleic acid repositories through versatile, scalable biochemical selection from room-temperature archives. Zenodo https://doi.org/10.5281/ZENODO.10501347 (2025).
Berleant, J. D., Banal, J. L., Rao, D. K. & Bathe, M. Full datasets from: Enabling global-scale nucleic acid repositories through versatile, scalable biochemical selection from room-temperature archives. Zenodo https://doi.org/10.5281/ZENODO.17516191 (2025).
Berleant, J. D., Banal, J. L., Rao, D. K. & Bathe, M. lcbb/BiosampleSQL: Publication release. Zenodo https://doi.org/10.5281/ZENODO.17402438 (2025).
NIAID Visual & Medical Arts. Eppendorf Tube. NIAID NIH BIOART Source. bioart.niaid.nih.gov/bioart/143 (2024).
NIAID Visual & Medical Arts. 96 Well Plate. NIAID NIH BIOART Source. bioart.niaid.nih.gov/bioart/7 (2024).
NIAID Visual & Medical Arts. Next gen sequencer. NIAID NIH BIOART Source. bioart.niaid.nih.gov/bioart/386 (2024).
Acknowledgements
M.B. and J.D.B. were supported by the Office of Naval Research (N00014-21-1-4013), the Army Research Office (ICB Subaward KK1954), and the National Science Foundation (CBET-1729397, OAC-1940231, and CCF-1956054). Additional funding to M.B. was provided through the National Science Foundation (CCF-2403100) and to J.D.B. through a National Science Foundation Graduate Research Fellowship (Grant No. 1122374). J.L.B. acknowledges support in part by the National Science Foundation SBIR Phase I 2136447, UCSF Parnassus Flow CoLab RRID:SCR_018206, DRC Center Grant NIH P30 DK063720, UCSF Center for Advanced Technology at Mission Bay, and Illumina. This research was also supported by a core center grant from the National Institute of Environmental Health Sciences, National Institutes of Health (P30-ES002109). We are grateful to T.B. Schardl and C.E. Leiserson (MIT CSAIL) for useful discussions on DNA barcoding. We thank G. Paradis, M. Jennings, and M. Griffin of the Flow Cytometry Core at the Koch Institute at the Massachusetts Institute of Technology (MIT) for flow sorting assistance. We are grateful to Delaware Diagnostics Labs for providing us de-identified clinical SARS-CoV-2 samples. We thank G. Tikhorimov for providing access to a Beckman Coulter Labcyte Echo 550. We are grateful to Ella Maru Studio, Inc. for assistance in creating the airport schematic in Fig. 1a.
Author information
Contributions
M.B., J.D.B., and J.L.B. conceived the sample storage system. J.D.B. designed the sample barcoding scheme and query language architecture. J.L.B. designed and implemented sample synthesis, FAS selection, and post-processing after selection of mock patient samples, and encapsulation and barcoding of clinical SARS-CoV-2 samples. D.K.R. prepared samples for sequencing and analyzed the data. J.L.B. and J.D.B. performed data analysis after querying and calculation of summary statistics. M.B. supervised the entire project. All authors contributed equally to the writing of the manuscript.
Ethics declarations
Competing interests
The Massachusetts Institute of Technology has filed a patent related to this work on behalf of J.L.B., M.B., J.D.B., and additional inventors (US Patent App. 17/836,726). J.L.B. and M.B. are co-founders and equity shareholders of Cache DNA, Inc. (Cache). J.L.B. is an employee of Cache and an independent contractor of OpenAI. D.K.R. was an intern at Cache for the period of this work.
Peer review
Peer review information
Nature Communications thanks Fajia Sun and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Berleant, J.D., Banal, J.L., Rao, D.K. et al. Enabling global-scale nucleic acid repositories through versatile, scalable biochemical selection from room-temperature archives. Nat Commun (2026). https://doi.org/10.1038/s41467-026-69402-3
DOI: https://doi.org/10.1038/s41467-026-69402-3