Nature Communications
DNA diamond formulates a decomposable composite letter constellation model for DNA data storage
  • Article
  • Open access
  • Published: 31 January 2026

  • Qi Ge (ORCID: orcid.org/0000-0002-0489-350X; affiliation 1),
  • Menghui Ren (ORCID: orcid.org/0009-0007-2404-4613; affiliation 1),
  • Tingting Qi (ORCID: orcid.org/0009-0009-6834-4436; affiliation 1),
  • Changcai Han (ORCID: orcid.org/0000-0003-4879-1748; affiliation 1),
  • Yingjin Yuan (ORCID: orcid.org/0000-0003-0553-0089; affiliations 2, 3) &
  • Weigang Chen (ORCID: orcid.org/0000-0002-4880-8186; affiliations 1, 2, 3)

Nature Communications (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Computer science
  • DNA computing and cryptography
  • Information theory
  • Next-generation sequencing

Abstract

Oligonucleotide multiplicity is an inherent property of current DNA synthesis technology. Composite letter DNA storage exploits this property to improve logical density and reduce costs. However, letter indistinguishability and high molecular diversity pose challenges for reliable recovery. Here, we formulate a composite letter constellation model, named DNA diamond, consisting of 15 decomposable points. Inspired by set partitioning in telecommunications, we propose a two-stage letter detection framework that partitions these letters into four distinguishable subsets based on their discrete entropy. Furthermore, we incorporate encoded double-end indices to eliminate crosstalk between synthesis sites and simultaneously apply length filtering to suppress error propagation during readout. We validate eight-letter and 15-letter composite letter DNA storage systems under the DNA diamond model, each with 10,000 composite strands. The eight-letter system achieves a payload density of 2.5 bits per letter and enables error-free recovery at 14× coverage, surpassing the storage density of prior six-letter systems while requiring lower coverage. The full 15-letter constellation enables 3.125 bits per letter for payload with error-free recovery at 33× coverage, corresponding to a density of 2.23 bits per letter for payload plus indices. The proposed decomposable DNA diamond model advances a practical and scalable framework for high-density composite DNA data storage.
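
The abstract does not list the 15 constellation points or their mixing ratios. The sketch below is one possible reading, not the authors' implementation: it assumes the points are the 15 non-empty subsets of {A, C, G, T} mixed in equal ratios, so that the discrete entropy of a letter is log2 of the number of bases it contains. Under that assumption the letters fall into four entropy classes, which matches the four subsets mentioned above. The function names are hypothetical. The last loop compares the reported payload densities with the log2(q) upper bound for an alphabet of q letters.

```python
from collections import defaultdict
from itertools import combinations
from math import log2

BASES = "ACGT"


def composite_letters():
    """Enumerate every non-empty subset of the four bases.

    Assumption (not stated in the abstract): each subset is one
    composite letter whose bases are mixed in equal ratios.
    """
    return ["".join(c) for k in range(1, 5) for c in combinations(BASES, k)]


def entropy_bits(letter):
    """Discrete entropy of a uniform mixture over the bases in `letter`."""
    return log2(len(letter))


def partition_by_entropy(letters):
    """Group letters by entropy value (rounded to 3 decimals)."""
    groups = defaultdict(list)
    for letter in letters:
        groups[round(entropy_bits(letter), 3)].append(letter)
    return dict(sorted(groups.items()))


letters = composite_letters()
subsets = partition_by_entropy(letters)
for h, members in subsets.items():
    print(f"H = {h:.3f} bits, {len(members)} letters: {members}")

# Payload densities reported in the abstract versus the log2(q) ceiling.
for q, payload in [(8, 2.5), (15, 3.125)]:
    ceiling = log2(q)
    print(f"q = {q}: ceiling {ceiling:.3f} bits/letter, "
          f"payload uses {payload / ceiling:.1%} of it")
```

In a two-stage scheme of this kind, a detector would first decide which entropy class an observed base distribution belongs to and then pick the letter within that class. How the authors make each decision is described in the full paper, not here.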

Data availability

The encoded composite letter sequences are available via Zenodo at https://doi.org/10.5281/zenodo.17350307 (ref. 47). The sequencing data (FASTQ format) from the array-based synthesis pools have been deposited in the Sequence Read Archive under accession number PRJNA1345374, and are also available via Zenodo at https://doi.org/10.5281/zenodo.17350307 (ref. 47). The sequencing data from the column-based synthesis experiments have been deposited in the Sequence Read Archive under accession number PRJNA1258704, and are also available via Zenodo at https://doi.org/10.5281/zenodo.15337157 (ref. 48). Source data are provided with this paper.

Code availability

The source code for composite letter detection and data readout is publicly available on GitHub at https://github.com/TJU-QiGe/Two-stage-composite-letter-detection-method-using-set-partitioning, under the MIT license. The specific version of the code associated with this publication is archived in Zenodo and is accessible via https://doi.org/10.5281/zenodo.17905993 (ref. 49). This implementation makes use of several third-party software packages under their respective licenses, including RS codes by Morelos-Zaragoza, R. (https://www.eccpage.com), seqtk by Li, H. (https://github.com/lh3/seqtk), and edlib by Šošić, M. (https://github.com/Martinsos/edlib).

References

  1. Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).

  2. Bar-Lev, D., Sabary, O. & Yaakobi, E. The zettabyte era is in our DNA. Nat. Comput. Sci. 4, 813–817 (2024).

  3. Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012).

  4. Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).

  5. Chen, W. et al. An artificial chromosome for data storage. Natl. Sci. Rev. 8, nwab028 (2021).

  6. Ren, Y. et al. DNA-based concatenated encoding system for high-reliability and high-density data storage. Small Methods 6, e2101335 (2022).

  7. Ge, Q. et al. Pragmatic soft-decision data readout of encoded large DNA. Brief. Bioinform. 26, bbaf102 (2025).

  8. Xiang, L. et al. A tutorial on coding methods for DNA-based molecular communications and storage. IEEE Internet Things J. 11, 11825–11847 (2024).

  9. Erlich, Y. & Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).

  10. Ping, Z. et al. Towards practical and robust DNA-based data archiving using the yin–yang codec system. Nat. Comput. Sci. 2, 234–242 (2022).

  11. Kosuri, S. & Church, G. M. Large-scale de novo DNA synthesis: technologies and applications. Nat. Methods 11, 499–507 (2014).

  12. Xu, C., Zhao, C., Ma, B. & Liu, H. Uncertainties in synthetic DNA-based data storage. Nucleic Acids Res. 49, 5451–5469 (2021).

  13. Hoose, A., Vellacott, R., Storch, M., Freemont, P. S. & Ryadnov, M. G. DNA synthesis technologies to close the gene writing gap. Nat. Rev. Chem. 7, 144–161 (2023).

  14. Fan, C., Deng, Q. & Zhu, T. F. Bioorthogonal information storage in l-DNA with a high-fidelity mirror-image Pfu DNA polymerase. Nat. Biotechnol. 39, 1548–1555 (2021).

  15. Tabatabaei, S. K. et al. Expanding the molecular alphabet of DNA based data storage systems with neural network nanopore readout processing. Nano Lett. 22, 1905–1914 (2022).

  16. Kawabe, H. et al. Enzymatic synthesis and nanopore sequencing of 12-letter supernumerary DNA. Nat. Commun. 14, 6820 (2023).

  17. Hamashima, K., Soong, Y. T., Matsunaga, K., Kimoto, M. & Hirao, I. DNA Sequencing method including unnatural bases for DNA aptamer generation by genetic alphabet expansion. ACS Synth. Biol. 8, 1401–1410 (2019).

  18. Choi, Y. et al. High information capacity DNA-based data storage with augmented encoding characters using degenerate bases. Sci. Rep. 9, 6582 (2019).

  19. Anavy, L., Vaknin, I., Atar, O., Amit, R. & Yakhini, Z. Data storage in DNA with fewer synthesis cycles using composite DNA letters. Nat. Biotechnol. 37, 1229–1236 (2019).

  20. Lu, X. et al. Enzymatic DNA synthesis by engineering terminal deoxynucleotidyl transferase. ACS Catal. 12, 2988–2997 (2022).

  21. Hou, Z. et al. “Cell Disk” DNA storage system capable of random reading and rewriting. Adv. Sci. 11, 2305921 (2024).

  22. Sun, F. et al. Mobile and self-sustained data storage in an extremophile genomic DNA. Adv. Sci. 10, 2206201 (2023).

  23. Zhang, C. et al. Parallel molecular data storage by printing epigenetic bits on DNA. Nature 634, 824–832 (2024).

  24. Chen, Y. J. et al. Quantifying molecular bias in DNA data storage. Nat. Commun. 11, 3264 (2020).

  25. Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).

  26. Jeong, J. et al. Cooperative sequence clustering and decoding for DNA storage system with fountain codes. Bioinformatics 37, 3136–3143 (2021).

  27. He, X. & Cai, K. Basis-finding algorithm for decoding fountain codes for DNA-based data storage. IEEE Trans. Inf. Theory 69, 3691–3707 (2023).

  28. Ding, L. et al. Improving error-correcting capability in DNA digital storage via soft-decision decoding. Natl. Sci. Rev. 11, nwad229 (2023).

  29. Xu, Y., Ding, L., Wu, S. & Ruan, J. Overcoming the high error rate of composite DNA letters-based digital storage through soft-decision decoding. Adv. Sci. 11, 2402951 (2024).

  30. Preuss, I., Rosenberg, M., Yakhini, Z. & Anavy, L. Efficient DNA-based data storage using shortmer combinatorial encoding. Sci. Rep. 14, 7731 (2024).

  31. Zhang, W., Chen, Z. & Wang, Z. Limited-magnitude error correction for probability vectors in DNA storage. In Proc. 2022 IEEE International Conference on Communications (ICC) 3460–3465. https://doi.org/10.1109/ICC45855.2022.9838471 (IEEE, 2022).

  32. Cohen, T. & Yaakobi, E. Optimizing the decoding probability and coverage ratio of composite DNA. In Proc. 2024 IEEE International Symposium on Information Theory (ISIT) 1949–1954. https://doi.org/10.1109/ISIT57864.2024.10619348 (IEEE, 2024).

  33. Liu, Z. et al. Family of mutually uncorrelated codes for DNA storage address design. IEEE Trans. Nanobiosci. 24, 295–304 (2025).

  34. Ungerboeck, G. Channel coding with multilevel/phase signals. IEEE Trans. Inf. Theory 28, 55–67 (1982).

  35. Wachsmann, U., Fischer, R. F. H. & Huber, J. B. Multilevel codes: theoretical concepts and practical design rules. IEEE Trans. Inf. Theory 45, 1361–1391 (1999).

  36. Rougemont, J. et al. Probabilistic base calling of Solexa sequencing data. BMC Bioinform. 9, 431 (2008).

  37. Xie, R. et al. Study of the error correction capability of multiple sequence alignment algorithm (MAFFT) in DNA storage. BMC Bioinform. 24, 111 (2023).

  38. Song, L. et al. Robust data storage in DNA by de Bruijn graph-based de novo strand assembly. Nat. Commun. 13, 5361 (2022).

  39. Chen, W., Liang, C., Guo, T. & Ding, Y. Encoder implementation with FPGA for non-binary LDPC codes. In Proc. 2012 18th Asia-Pacific Conference on Communications (APCC) 980–984. https://doi.org/10.1109/APCC.2012.6388230 (IEEE, 2012).

  40. Weindel, F., Gimpel, A. L., Grass, R. N. & Heckel, R. Embracing errors is more effective than avoiding them through constrained coding for DNA data storage. In Proc. 2023 59th Annual Allerton Conference on Communication, Control, and Computing (Allerton) 1–8. https://doi.org/10.1109/Allerton58177.2023.10313494 (IEEE, 2023).

  41. Wetterstrand, K. A. DNA sequencing costs: data from the NHGRI genome sequencing program (GSP). Available at: www.genome.gov/sequencingcostsdata (2019).

  42. Walter, F., Sabary, O., Wachter-Zeh, A. & Yaakobi, E. Coding for composite DNA to correct substitutions, strand losses, and deletions. In Proc. 2024 IEEE International Symposium on Information Theory (ISIT) 97–102. https://doi.org/10.1109/ISIT57864.2024.10619202 (IEEE, 2024).

  43. Zhao, X. et al. Composite hedges nanopores codec system for rapid and portable DNA data readout with high INDEL-correction. Nat. Commun. 15, 9395 (2024).

  44. Sabary, O. et al. Error-correcting codes for combinatorial composite DNA. In Proc. 2024 IEEE International Symposium on Information Theory (ISIT) 109–114. https://doi.org/10.1109/ISIT57864.2024.10619334 (IEEE, 2024).

  45. Walter, F. & Yehezkeally, Y. Coding for strand breaks in composite DNA. In Proc. 2025 IEEE International Symposium on Information Theory (ISIT) 1–6. https://doi.org/10.1109/ISIT63088.2025.11195278 (IEEE, 2025).

  46. Zhang, J., Kobert, K., Flouri, T. & Stamatakis, A. PEAR: a fast and accurate Illumina paired-end read merger. Bioinformatics 30, 614–620 (2014).

  47. Chen, W. Sequencing data of large-scale pools for composite letter DNA storage. Zenodo, https://doi.org/10.5281/zenodo.17350307 (2025).

  48. Chen, W. Illumina sequencing data for composite letter DNA storage. Zenodo, https://doi.org/10.5281/zenodo.15337157 (2025).

  49. Chen, W. Software of two stage composite-letter detection and recovery for DNA data storage. Zenodo, https://doi.org/10.5281/zenodo.17905993 (2025).

Acknowledgements

This work was supported by grants from the National Key R&D Program of China (2023YFA0913800 and 2021YFF1200200 to W.C.; 2024YFF1500500 to Y.Y.). The authors thank Dashun Huang (Hippobio Co., Ltd., Huzhou, China) for assistance with the column-based synthesis of composite DNA strands, and Dynegene Technologies (Shanghai, China) for their support in the array-based synthesis of large-scale composite pools. The authors also thank Lulu Li (LC-Bio Technology Co., Ltd., Hangzhou, China) for support in library preparation and sequencing, and Rui Qin for performing the amplification and accelerated aging experiments.

Author information

Authors and Affiliations

  1. School of Microelectronics, Tianjin University, Tianjin, China

    Qi Ge, Menghui Ren, Tingting Qi, Changcai Han & Weigang Chen

  2. State Key Laboratory of Synthetic Biology, Tianjin University, Tianjin, China

    Yingjin Yuan & Weigang Chen

  3. Frontiers Science Center for Synthetic Biology (Ministry of Education), School of Synthetic Biology and Biomanufacturing, Tianjin University, Tianjin, China

    Yingjin Yuan & Weigang Chen

Contributions

W.C. and Y.Y. conceived the project and reviewed the results. Q.G., C.H., and W.C. designed the composite DNA storage system and wrote the encoding and decoding programs. Q.G. developed the two-stage composite letter detection program and performed the simulations and experimental validation. Q.G., M.R., and T.Q. analyzed sequencing data. Q.G. and W.C. validated the results and wrote the manuscript. All authors supervised the results, revised the manuscript, and approved the final manuscript.

Corresponding authors

Correspondence to Yingjin Yuan or Weigang Chen.

Ethics declarations

Competing interests

Q.G., C.H., and W.C. have been granted a Chinese patent related to the encoding and decoding approach for composite letter DNA data storage (patent number CN119649874B). The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Jue Ruan and Jiongyu Zhang for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary Files

Supplementary Dataset 1

Reporting Summary

Transparent Peer Review file

Source data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Ge, Q., Ren, M., Qi, T. et al. DNA diamond formulates a decomposable composite letter constellation model for DNA data storage. Nat Commun (2026). https://doi.org/10.1038/s41467-026-68861-y

  • Received: 04 June 2025

  • Accepted: 16 January 2026

  • Published: 31 January 2026

  • DOI: https://doi.org/10.1038/s41467-026-68861-y
