
  • Article
  • Published:

Lossless data compression by large models

This article has been updated

A preprint version of the article is available at arXiv.

Abstract

Data compression is a fundamental technology that enables efficient storage and transmission of information. However, traditional compression methods are approaching their theoretical limits after 80 years of research and development. At the same time, large artificial intelligence models have emerged, which, trained on vast amounts of data, are able to ‘understand’ various semantics. Intuitively, semantics conveys the meaning of data concisely, so large models hold the potential to revolutionize compression technology. Here we present LMCompress, a new method that leverages large models to compress data. LMCompress shatters all previous lossless compression records on four media types: text, images, video and audio. It halves the compression rates of JPEG-XL for images, FLAC for audio and H.264 for video, and it achieves nearly one-third of the compression rates of zpaq for text. Our results demonstrate that the better a model understands the data, the more effectively it can compress it, suggesting a deep connection between understanding and compression.
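
The mechanism underlying this result is the classical link between prediction and coding: a model that assigns probability p to the next symbol lets an arithmetic coder spend about -log2 p bits on it, so better prediction directly yields smaller files. The sketch below is not the authors' implementation; it is a minimal illustration in which a tiny adaptive byte-frequency model stands in for a large generative model, and it reports the ideal code length that an arithmetic coder driven by those predictions would achieve. The function name ideal_code_length_bits and the toy model are illustrative assumptions only.

```python
# Minimal sketch (not the paper's code): prediction-driven lossless compression.
# An LM-based compressor in this spirit changes only how p(next byte | context)
# is computed; the coding step stays the same.

import math
from collections import Counter


def ideal_code_length_bits(data: bytes) -> float:
    """Ideal compressed size, in bits, under a simple adaptive byte model.

    An arithmetic coder driven by the same predictions would come within a
    few bits of this total; a stronger predictor (e.g. a large model's
    next-token distribution) would only lower it.
    """
    counts = Counter()   # running byte counts (Laplace-smoothed model)
    n = 0                # number of bytes seen so far
    total_bits = 0.0
    for symbol in data:
        p = (counts[symbol] + 1) / (n + 256)   # predict before observing
        total_bits += -math.log2(p)            # bits this byte would cost
        counts[symbol] += 1                    # update the model
        n += 1
    return total_bits


if __name__ == "__main__":
    text = ("the better the model predicts the data, "
            "the fewer bits the coder needs ") * 50
    raw = text.encode()
    bits = ideal_code_length_bits(raw)
    print(f"raw size:         {8 * len(raw)} bits")
    print(f"ideal coded size: {bits:.0f} bits "
          f"({bits / (8 * len(raw)):.2%} of raw)")
```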


Fig. 1: The architecture of our LMCompress.
Fig. 2: Key insight of this paper.
Fig. 3: Image compression rates.
Fig. 4: Video compression rates.
Fig. 5: Audio compression rates.
Fig. 6: Text compression rates.


Data availability

ILSVRC is available at https://www.image-net.org/challenges/LSVRC/2012/index.php. CLIC is available at https://clic.compression.cc/2019/. LibriSpeech is available at www.openslr.org/12. LJSpeech is available at https://keithito.com/LJ-Speech-Dataset. Mozilla Common Voice 11 is available at https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0. VoxPopuli is available at https://huggingface.co/datasets/facebook/voxpopuli. MeDAL is available at https://github.com/McGill-NLP/medal. Eurlex is available at https://huggingface.co/datasets/pile-of-law/pile-of-law. CIPR SIF is available at https://media.xiph.org/video/derf/.

Code availability

Our code is available via Code Ocean at https://doi.org/10.24433/CO.9735997.v1 (ref. 28).

Change history

  • 09 May 2025

    In the version of the article initially published, Xingwu Liu was listed with two affiliations. This has now been corrected to a single affiliation (School of Mathematical Sciences, Dalian University of Technology, Dalian, China) in the HTML and PDF versions of the article.

References

  1. Pavlov, I. 7-Zip. www.7-zip.org/a/lzma-specification.7z (2024).

  2. Xiph.Org Foundation. FLAC: free lossless audio codec. Xiph.org https://xiph.org/flac/features.html (2023).

  3. Boutell, T. RFC 2083: PNG (Portable Network Graphics) specification version 1.0. W3C https://www.w3.org/TR/REC-png-961001 (1997).

  4. Richardson, I. E. The H.264 Advanced Video Compression Standard 2nd edn (Wiley, 2010).

  5. High Efficiency Video Coding (HEVC) - ITU-T Recommendation H.265. ITU https://www.itu.int/rec/T-REC-H.265 (2013).

  6. Shannon, C. E. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 (1948).


  7. Solomonoff, R. A formal theory of inductive inference. Inform. control 7, 1–22 (1964).


  8. Grau-Moya, J. et al. Learning universal predictors. In Proc. 41st International Conference on Machine Learning (eds Salakhutdinov, R. et al.) 16178–16205 (PMLR, 2024).

  9. Huang, C., Xie, Y., Jiang, Z., Lin, J. & Li, M. Approximating human-like few-shot learning with GPT-based compression. Preprint at https://arxiv.org/abs/2308.06942 (2023).

  10. Deletang, G. et al. Language modeling is compression. In Twelfth International Conference on Learning Representations (eds Chaudhuri, S. et al.) (ICLR, 2024).

  11. Bellard, F. NNCP v2: lossless data compression with transformer. Preprint at Fabrice Bellard https://bellard.org/nncp/nncp_v2.pdf (2021).

  12. Chen, M. et al. Generative pretraining from pixels. In Proc. International Conference on Machine Learning (eds Daumé III, H. et al.) 1691–1703 (PMLR, 2020).

  13. Wu, S. et al. Beyond language models: byte models are digital world simulators. Preprint at https://arxiv.org/abs/2402.19155 (2024).

  14. Jiang, Z., Wang, R., Bu, D. & Li, M. A theory of human-like few-shot learning. Preprint at https://arxiv.org/abs/2301.01047 (2023).

  15. Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).


  16. Challenge on Learned Image Compression (CLIC). https://archive.compression.cc/2019/challenge/ (2019).

  17. Xiph.org video test media. Xiph.org https://media.xiph.org/video/derf/ (accessed 2024).

  18. Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2015).

  19. Ito, K. & Johnson, L. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/ (2017).

  20. Ardila, R. et al. Common voice: a massively-multilingual speech corpus. In Proc. of the 12th Conference on Language Resources and Evaluation (eds Calzolari, N. et al.) 4211–4215 (LREC, 2020).

  21. Wang, C. et al. VoxPopuli: a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (eds Zong, C., Xia, F., Li, W. & Navigli, R.) 993–1003 (Association for Computational Linguistics, 2021).

  22. Wen, Z., Lu, X. H. & Reddy, S. MeDAL: medical abbreviation disambiguation dataset for natural language understanding pretraining. In Proc. 3rd Clinical Natural Language Processing Workshop (eds Rumshisky, A. et al.) 130–135 (Association for Computational Linguistics, 2020).

  23. Henderson, P. et al. Pile of Law: learning responsible data filtering from the law and a 256GB open-source legal dataset. Adv. Neural Inf. Process. Syst. 35, 29217–29234 (2022).


  24. Satellite Communications and their Role in Enabling 6G. Technical Report (GSOA, 2014); https://gsoasatellite.com/wp-content/uploads/6G-Paper-GSOA.pdf

  25. Li, M. & Vitányi, P. An Introduction to Kolmogorov Complexity and Its Applications (Springer, 2019).

  26. Niedermayer, M., Rice, D. & Martinez, J. FFV1 Video Coding Format Version 4. Internet-Draft draft-ietf-cellar-ffv1-v4-22. Internet Engineering Task Force https://datatracker.ietf.org/doc/draft-ietf-cellar-ffv1-v4/22/ (2024).

  27. Information Technology - JPEG 2000 Image Coding System: Motion JPEG 2000 - Part 3 (ISO, 2007); https://www.iso.org/standard/41570.html

  28. Li, Z. & Wang, X. Understanding is compression: v.0.1.0 Code Ocean https://doi.org/10.24433/CO.9735997.v1 (2024).


Acknowledgements

This work is partially supported by the National Key R&D Program of China grant no. 2022YFA1304603 (to M.L.); Proteomic Navigator of the Human Body Project (to M.L.); Canada’s NSERC OGP0046506 (to M.L. and C.W.); Canada Research Chair Program (to C.W.); National Natural Science Foundation of China grant no. 62072433 (to X.L.), grant no. 62088102 (to W.G. and M.L.) and grant no. 62025101 (to W.G. and M.L.); and Kechuang Yongjiang 2035 key technology breakthrough plan of Zhejiang Ningbo grant no. 2024Z119 (to C.H.). We thank N. Zhang and P. Vitanyi for discussions on Solomonoff induction. We thank C. Huang, Y. Xie, Z. Jiang, R. Wang and P. Guo for their discussions and related work in ref. 14 and ref. 9.

Author information


Contributions

M.L. conceived the presented idea. M.L., X.L., C.H., Q.Y. and W.G. developed the theory and supervised the findings of this work. Z.L., C.H., X.W., H.H. and C.W. performed the computations and carried out the experiments. Z.L., C.H., X.W., H.H., C.W., D.B., X.L. and M.L. wrote the paper. All authors discussed the results and contributed to the final paper.

Corresponding authors

Correspondence to Xingwu Liu or Ming Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Ziv Goldfeld, Jan Voges and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary information about experiments and Supplementary Tables 1–10.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, Z., Huang, C., Wang, X. et al. Lossless data compression by large models. Nat Mach Intell 7, 794–799 (2025). https://doi.org/10.1038/s42256-025-01033-7


  • Received:

  • Accepted:

  • Published:

  • Issue date:

  • DOI: https://doi.org/10.1038/s42256-025-01033-7
