Abstract
Oligonucleotide multiplicity is an inherent property of current DNA synthesis technology. Composite letter DNA storage exploits this property to improve logical density and reduce costs. However, letter indistinguishability and high molecular diversity pose challenges for reliable recovery. Here, we formulate a composite letter constellation model, named DNA diamond, consisting of 15 decomposable points. Inspired by set partitioning in telecommunications, we propose a two-stage letter detection framework that partitions these letters into four distinguishable subsets based on their discrete entropy. Furthermore, we incorporate encoded double-end indices to eliminate crosstalk between synthesis sites and simultaneously apply length filtering to suppress error propagation during readout. We validate eight-letter and 15-letter composite letter DNA storage systems under the DNA diamond model, each with 10,000 composite strands. The eight-letter system achieves a payload density of 2.5 bits per letter and enables error-free recovery at 14× coverage, surpassing the storage density of prior six-letter systems while requiring lower coverage. The full 15-letter constellation enables 3.125 bits per letter for payload with error-free recovery at 33× coverage, corresponding to a density of 2.23 bits per letter for payload plus indices. The proposed decomposable DNA diamond model advances a practical and scalable framework for high-density composite DNA data storage.
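To make the entropy-based partitioning concrete, the following is a minimal sketch and not the authors' implementation. It assumes, since the abstract does not spell this out, that the 15 constellation points are the non-empty subsets of {A, C, G, T} mixed in equal ratios, so that the ideal entropy of a k-base letter is log2(k) and four subsets arise for k = 1 to 4. The names `constellation`, `partition` and `detect` are hypothetical, and the nearest-entropy rule in stage 1 is an illustrative stand-in for the published detector.

```python
from itertools import combinations
from math import log2

BASES = "ACGT"

def entropy(probs):
    """Discrete (Shannon) entropy in bits of a base-frequency vector."""
    return -sum(p * log2(p) for p in probs if p > 0)

def constellation():
    """ASSUMPTION: the 15 points are all non-empty base subsets, equal ratios."""
    return ["".join(c) for k in range(1, 5) for c in combinations(BASES, k)]

def partition(letters):
    """Group letters by number of mixed bases k; ideal entropy is log2(k)."""
    subsets = {}
    for letter in letters:
        subsets.setdefault(len(letter), []).append(letter)
    return subsets

def detect(counts):
    """Toy two-stage detector for one position.

    counts maps each observed base to its read count at that position.
    """
    total = sum(counts.values())
    probs = {b: counts.get(b, 0) / total for b in BASES}
    h = entropy(probs.values())
    # Stage 1: pick the subset whose ideal entropy log2(k) is nearest to h.
    k = min(range(1, 5), key=lambda n: abs(log2(n) - h))
    # Stage 2: within that subset, keep the k most frequent bases.
    top = sorted(BASES, key=lambda b: -probs[b])[:k]
    return "".join(sorted(top))

if __name__ == "__main__":
    subsets = partition(constellation())
    for k, letters in subsets.items():
        print(k, round(log2(k), 3), letters)
    # A noisy two-base mixture with one stray G read.
    print(detect({"A": 48, "C": 52, "G": 1}))
```

Under this assumption the four subsets contain 4, 6, 4 and 1 letters. The raw ceiling of a 15-letter alphabet is log2(15) ≈ 3.91 bits per letter and that of an eight-letter alphabet is 3 bits per letter, so the reported payload densities of 3.125 and 2.5 bits per letter both sit below those bounds.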
Data availability
The encoded composite letter sequences are available via Zenodo at https://doi.org/10.5281/zenodo.17350307 (ref. 47). The sequencing data (FASTQ format) from the array-based synthesis pools have been deposited in the Sequence Read Archive under accession number PRJNA1345374, and are also available via Zenodo at https://doi.org/10.5281/zenodo.17350307 (ref. 47). The sequencing data from the column-based synthesis experiments have been deposited in the Sequence Read Archive under accession number PRJNA1258704, and are also available via Zenodo at https://doi.org/10.5281/zenodo.15337157 (ref. 48). Source data are provided with this paper.
Code availability
The source code for composite letter detection and data readout is publicly available on GitHub at https://github.com/TJU-QiGe/Two-stage-composite-letter-detection-method-using-set-partitioning under the MIT license. The specific version of the code associated with this publication is archived in Zenodo and is accessible via https://doi.org/10.5281/zenodo.17905993 (ref. 49). This implementation makes use of several third-party software packages under their respective licenses, including RS codes by Morelos-Zaragoza, R. (https://www.eccpage.com), seqtk by Li, H. (https://github.com/lh3/seqtk), and edlib by Šošić, M. (https://github.com/Martinsos/edlib).
References
Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).
Bar-Lev, D., Sabary, O. & Yaakobi, E. The zettabyte era is in our DNA. Nat. Comput. Sci. 4, 813–817 (2024).
Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012).
Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).
Chen, W. et al. An artificial chromosome for data storage. Natl. Sci. Rev. 8, nwab028 (2021).
Ren, Y. et al. DNA-based concatenated encoding system for high-reliability and high-density data storage. Small Methods 6, e2101335 (2022).
Ge, Q. et al. Pragmatic soft-decision data readout of encoded large DNA. Brief. Bioinform. 26, bbaf102 (2025).
Xiang, L. et al. A tutorial on coding methods for DNA-based molecular communications and storage. IEEE Internet Things J. 11, 11825–11847 (2024).
Erlich, Y. & Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).
Ping, Z. et al. Towards practical and robust DNA-based data archiving using the yin–yang codec system. Nat. Comput. Sci. 2, 234–242 (2022).
Kosuri, S. & Church, G. M. Large-scale de novo DNA synthesis: technologies and applications. Nat. Methods 11, 499–507 (2014).
Xu, C., Zhao, C., Ma, B. & Liu, H. Uncertainties in synthetic DNA-based data storage. Nucleic Acids Res. 49, 5451–5469 (2021).
Hoose, A., Vellacott, R., Storch, M., Freemont, P. S. & Ryadnov, M. G. DNA synthesis technologies to close the gene writing gap. Nat. Rev. Chem. 7, 144–161 (2023).
Fan, C., Deng, Q. & Zhu, T. F. Bioorthogonal information storage in l-DNA with a high-fidelity mirror-image Pfu DNA polymerase. Nat. Biotechnol. 39, 1548–1555 (2021).
Tabatabaei, S. K. et al. Expanding the molecular alphabet of DNA based data storage systems with neural network nanopore readout processing. Nano Lett. 22, 1905–1914 (2022).
Kawabe, H. et al. Enzymatic synthesis and nanopore sequencing of 12-letter supernumerary DNA. Nat. Commun. 14, 6820 (2023).
Hamashima, K., Soong, Y. T., Matsunaga, K., Kimoto, M. & Hirao, I. DNA sequencing method including unnatural bases for DNA aptamer generation by genetic alphabet expansion. ACS Synth. Biol. 8, 1401–1410 (2019).
Choi, Y. et al. High information capacity DNA-based data storage with augmented encoding characters using degenerate bases. Sci. Rep. 9, 6582 (2019).
Anavy, L., Vaknin, I., Atar, O., Amit, R. & Yakhini, Z. Data storage in DNA with fewer synthesis cycles using composite DNA letters. Nat. Biotechnol. 37, 1229–1236 (2019).
Lu, X. et al. Enzymatic DNA synthesis by engineering terminal deoxynucleotidyl transferase. ACS Catal. 12, 2988–2997 (2022).
Hou, Z. et al. “Cell Disk” DNA storage system capable of random reading and rewriting. Adv. Sci. 11, 2305921 (2024).
Sun, F. et al. Mobile and self-sustained data storage in an extremophile genomic DNA. Adv. Sci. 10, 2206201 (2023).
Zhang, C. et al. Parallel molecular data storage by printing epigenetic bits on DNA. Nature 634, 824–832 (2024).
Chen, Y. J. et al. Quantifying molecular bias in DNA data storage. Nat. Commun. 11, 3264 (2020).
Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).
Jeong, J. et al. Cooperative sequence clustering and decoding for DNA storage system with fountain codes. Bioinformatics 37, 3136–3143 (2021).
He, X. & Cai, K. Basis-finding algorithm for decoding fountain codes for DNA-based data storage. IEEE Trans. Inf. Theory 69, 3691–3707 (2023).
Ding, L. et al. Improving error-correcting capability in DNA digital storage via soft-decision decoding. Natl. Sci. Rev. 11, nwad229 (2023).
Xu, Y., Ding, L., Wu, S. & Ruan, J. Overcoming the high error rate of composite DNA letters-based digital storage through soft-decision decoding. Adv. Sci. 11, 2402951 (2024).
Preuss, I., Rosenberg, M., Yakhini, Z. & Anavy, L. Efficient DNA-based data storage using shortmer combinatorial encoding. Sci. Rep. 14, 7731 (2024).
Zhang, W., Chen, Z. & Wang, Z. Limited-magnitude error correction for probability vectors in DNA storage. In Proc. 2022 IEEE International Conference on Communications (ICC) 3460–3465. https://doi.org/10.1109/ICC45855.2022.9838471 (IEEE, 2022).
Cohen, T. & Yaakobi, E. Optimizing the decoding probability and coverage ratio of composite DNA. In Proc. 2024 IEEE International Symposium on Information Theory (ISIT) 1949–1954. https://doi.org/10.1109/ISIT57864.2024.10619348 (IEEE, 2024).
Liu, Z. et al. Family of mutually uncorrelated codes for DNA storage address design. IEEE Trans. Nanobiosci. 24, 295–304 (2025).
Ungerboeck, G. Channel coding with multilevel/phase signals. IEEE Trans. Inf. Theory 28, 55–67 (1982).
Wachsmann, U., Fischer, R. F. H. & Huber, J. B. Multilevel codes: theoretical concepts and practical design rules. IEEE Trans. Inf. Theory 45, 1361–1391 (1999).
Rougemont, J. et al. Probabilistic base calling of Solexa sequencing data. BMC Bioinform. 9, 431 (2008).
Xie, R. et al. Study of the error correction capability of multiple sequence alignment algorithm (MAFFT) in DNA storage. BMC Bioinform. 24, 111 (2023).
Song, L. et al. Robust data storage in DNA by de Bruijn graph-based de novo strand assembly. Nat. Commun. 13, 5361 (2022).
Chen, W., Liang, C., Guo, T. & Ding, Y. Encoder implementation with FPGA for non-binary LDPC codes. In Proc. 2012 18th Asia-Pacific Conference on Communications (APCC) 980–984. https://doi.org/10.1109/APCC.2012.6388230 (IEEE, 2012).
Weindel, F., Gimpel, A. L., Grass, R. N. & Heckel, R. Embracing errors is more effective than avoiding them through constrained coding for DNA data storage. In Proc. 2023 59th Annual Allerton Conference on Communication, Control, and Computing (Allerton) 1–8. https://doi.org/10.1109/Allerton58177.2023.10313494 (IEEE, 2023).
Wetterstrand, K. A. DNA sequencing costs: data from the NHGRI genome sequencing program (GSP). Available at: www.genome.gov/sequencingcostsdata (2019).
Walter, F., Sabary, O., Wachter-Zeh, A. & Yaakobi, E. Coding for composite DNA to correct substitutions, strand losses, and deletions. In Proc. 2024 IEEE International Symposium on Information Theory (ISIT) 97–102. https://doi.org/10.1109/ISIT57864.2024.10619202 (IEEE, 2024).
Zhao, X. et al. Composite hedges nanopores codec system for rapid and portable DNA data readout with high INDEL-correction. Nat. Commun. 15, 9395 (2024).
Sabary, O. et al. Error-correcting codes for combinatorial composite DNA. In Proc. 2024 IEEE International Symposium on Information Theory (ISIT) 109–114. https://doi.org/10.1109/ISIT57864.2024.10619334 (IEEE, 2024).
Walter, F. & Yehezkeally, Y. Coding for strand breaks in composite DNA. In Proc. 2025 IEEE International Symposium on Information Theory (ISIT) 1–6. https://doi.org/10.1109/ISIT63088.2025.11195278 (IEEE, 2025).
Zhang, J., Kobert, K., Flouri, T. & Stamatakis, A. PEAR: a fast and accurate Illumina paired-end read merger. Bioinformatics 30, 614–620 (2014).
Chen, W. Sequencing data of large-scale pools for composite letter DNA storage. Zenodo, https://doi.org/10.5281/zenodo.17350307 (2025).
Chen, W. Illumina sequencing data for composite letter DNA storage. Zenodo, https://doi.org/10.5281/zenodo.15337157 (2025).
Chen, W. Software of two stage composite-letter detection and recovery for DNA data storage. Zenodo, https://doi.org/10.5281/zenodo.17905993 (2025).
Acknowledgements
This work was supported by grants from the National Key R&D Program of China (2023YFA0913800 and 2021YFF1200200 to W.C.; 2024YFF1500500 to Y.Y.). The authors thank Dashun Huang (Hippobio Co., Ltd., Huzhou, China) for assistance with the column-based synthesis of composite DNA strands, and Dynegene Technologies (Shanghai, China) for their support in the array-based synthesis of large-scale composite pools. The authors also thank Lulu Li (LC-Bio Technology Co., Ltd., Hangzhou, China) for support in library preparation and sequencing, and Rui Qin for performing the amplification and accelerated aging experiments.
Author information
Contributions
W.C. and Y.Y. conceived the project and reviewed the results. Q.G., C.H., and W.C. designed the composite DNA storage system and wrote the encoding and decoding programs. Q.G. developed the two-stage composite letter detection program and performed the simulations and experimental validation. Q.G., M.R., and T.Q. analyzed sequencing data. Q.G. and W.C. validated the results and wrote the manuscript. All authors supervised the results, revised the manuscript, and approved the final manuscript.
Ethics declarations
Competing interests
Q.G., C.H., and W.C. have been granted a Chinese patent related to the encoding and decoding approach for composite letter DNA data storage (patent number CN119649874B). The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Jue Ruan and Jiongyu Zhang for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ge, Q., Ren, M., Qi, T. et al. DNA diamond formulates a decomposable composite letter constellation model for DNA data storage. Nat Commun (2026). https://doi.org/10.1038/s41467-026-68861-y
DOI: https://doi.org/10.1038/s41467-026-68861-y