
  • Article
  • Published:

Lossless data compression by large models

This article has been updated

A preprint version of the article is available at arXiv.

Abstract

Data compression is a fundamental technology that enables efficient storage and transmission of information. However, traditional compression methods are approaching their theoretical limits after 80 years of research and development. At the same time, large artificial intelligence models have emerged, which, trained on vast amounts of data, are able to ‘understand’ various semantics. Intuitively, semantics conveys the meaning of data concisely, so large models hold the potential to revolutionize compression technology. Here we present LMCompress, a new method that leverages large models to compress data. LMCompress shatters all previous lossless compression records on four media types: text, images, video and audio. It halves the compression rates of JPEG-XL for images, FLAC for audio and H.264 for video, and it achieves nearly one-third of the compression rates of zpaq for text. Our results demonstrate that the better a model understands the data, the more effectively it can compress it, suggesting a deep connection between understanding and compression.
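
The mechanism underlying this result is the classical link between prediction and coding: a model that assigns probability p to the next symbol lets an arithmetic coder spend about -log2 p bits on it, so better prediction directly yields smaller files. The sketch below is not the authors' implementation; it is a minimal illustration in which a tiny adaptive byte-frequency model stands in for a large generative model, and it reports the ideal code length that an arithmetic coder driven by those predictions would achieve. The function name ideal_code_length_bits and the toy model are illustrative assumptions only.

```python
# Minimal sketch (not the paper's code): prediction-driven lossless compression.
# An LM-based compressor in this spirit changes only how p(next byte | context)
# is computed; the coding step stays the same.

import math
from collections import Counter


def ideal_code_length_bits(data: bytes) -> float:
    """Ideal compressed size, in bits, under a simple adaptive byte model.

    An arithmetic coder driven by the same predictions would come within a
    few bits of this total; a stronger predictor (e.g. a large model's
    next-token distribution) would only lower it.
    """
    counts = Counter()   # running byte counts (Laplace-smoothed model)
    n = 0                # number of bytes seen so far
    total_bits = 0.0
    for symbol in data:
        p = (counts[symbol] + 1) / (n + 256)   # predict before observing
        total_bits += -math.log2(p)            # bits this byte would cost
        counts[symbol] += 1                    # update the model
        n += 1
    return total_bits


if __name__ == "__main__":
    text = ("the better the model predicts the data, "
            "the fewer bits the coder needs ") * 50
    raw = text.encode()
    bits = ideal_code_length_bits(raw)
    print(f"raw size:         {8 * len(raw)} bits")
    print(f"ideal coded size: {bits:.0f} bits "
          f"({bits / (8 * len(raw)):.2%} of raw)")
```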


Fig. 1: The architecture of our LMCompress.
Fig. 2: Key insight of this paper.
Fig. 3: Image compression rates.
Fig. 4: Video compression rates.
Fig. 5: Audio compression rates.
Fig. 6: Text compression rates.


Data availability

ILSVRC is available at https://www.image-net.org/challenges/LSVRC/2012/index.php. CLIC is available at https://clic.compression.cc/2019/. LibriSpeech is available at www.openslr.org/12. LJSpeech is available at https://keithito.com/LJ-Speech-Dataset. Mozilla Common Voice 11 is available at https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0. VoxPopuli is available at https://huggingface.co/datasets/facebook/voxpopuli. MeDAL is available at https://github.com/McGill-NLP/medal. Eurlex is available at https://huggingface.co/datasets/pile-of-law/pile-of-law. CIPR SIF is available at https://media.xiph.org/video/derf/.

Code availability

Our code is available via Code Ocean at https://doi.org/10.24433/CO.9735997.v1 (ref. 28).

Change history

  • 09 May 2025

    In the version of the article initially published, Xingwu Liu was listed with two affiliations. This has now been corrected to a single affiliation (School of Mathematical Sciences, Dalian University of Technology, Dalian, China) in the HTML and PDF versions of the article.

References

  1. Pavlov, I. 7-Zip. www.7-zip.org/a/lzma-specification.7z (2024).

  2. Xiph.Org Foundation. FLAC: free lossless audio codec. Xiph.org https://xiph.org/flac/features.html (2023).

  3. Boutell, T. RFC 2083: PNG (Portable Network Graphics) specification version 1.0. W3C https://www.w3.org/TR/REC-png-961001 (1997).

  4. Richardson, I. E. The H.264 Advanced Video Compression Standard 2nd edn (Wiley, 2010).

  5. High Efficiency Video Coding (HEVC) - ITU-T Recommendation H.265. ITU https://www.itu.int/rec/T-REC-H.265 (2013).

  6. Shannon, C. E. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 (1948).


  7. Solomonoff, R. A formal theory of inductive inference. Inform. control 7, 1–22 (1964).


  8. Grau-Moya, J. et al. Learning universal predictors. In Proc. 41st International Conference on Machine Learning (eds Salakhutdinov, R. et al.) 16178–16205 (PMLR, 2024).

  9. Huang, C., Xie, Y., Jiang, Z., Lin, J. & Li, M. Approximating human-like few-shot learning with GPT-based compression. Preprint at https://arxiv.org/abs/2308.06942 (2023).

  10. Deletang, G. et al. Language modeling is compression. In Twelfth International Conference on Learning Representations (eds Chaudhuri, S. et al.) (ICLR, 2024).

  11. Bellard, F. NNCP v2: lossless data compression with transformer. Preprint at Fabrice Bellard https://bellard.org/nncp/nncp_v2.pdf (2021).

  12. Chen, M. et al. Generative pretraining from pixels. In Proc. International Conference on Machine Learning (eds Daumé III, H. et al.) 1691–1703 (PMLR, 2020).

  13. Wu, S. et al. Beyond language models: byte models are digital world simulators. Preprint at https://arxiv.org/abs/2402.19155 (2024).

  14. Jiang, Z., Wang, R., Bu, D. & Li, M. A theory of human-like few-shot learning. Preprint at https://arxiv.org/abs/2301.01047 (2023).

  15. Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).


  16. Challenge on Learned Image Compression (CLIC). https://archive.compression.cc/2019/challenge/ (2019).

  17. Xiph.org video test media. Xiph.org https://media.xiph.org/video/derf/ (accessed 2024).

  18. Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2015).

  19. Ito, K. & Johnson, L. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/ (2017).

  20. Ardila, R. et al. Common voice: a massively-multilingual speech corpus. In Proc. of the 12th Conference on Language Resources and Evaluation (eds Calzolari, N. et al.) 4211–4215 (LREC, 2020).

  21. Wang, C. et al. VoxPopuli: a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (eds Zong, C., Xia, F., Li, W. & Navigli, R.) 993–1003 (Association for Computational Linguistics, 2021).

  22. Wen, Z., Lu, X. H. & Reddy, S. MeDAL: medical abbreviation disambiguation dataset for natural language understanding pretraining. In Proc. 3rd Clinical Natural Language Processing Workshop (eds Rumshisky, A. et al.) 130–135 (Association for Computational Linguistics, 2020).

  23. Henderson, P. et al. Pile of Law: learning responsible data filtering from the law and a 256GB open-source legal dataset. Adv. Neural Inf. Process. Syst. 35, 29217–29234 (2022).


  24. Satellite Communications and their Role in Enabling 6G. Technical Report (GSOA, 2014); https://gsoasatellite.com/wp-content/uploads/6G-Paper-GSOA.pdf

  25. Li, M. & Vitányi, P. An Introduction to Kolmogorov Complexity and Its Applications (Springer, 2019).

  26. Niedermayer, M., Rice, D. & Martinez, J. FFV1 Video Coding Format Version 4. Internet-Draft draft-ietf-cellar-ffv1-v4-22. Internet Engineering Task Force https://datatracker.ietf.org/doc/draft-ietf-cellar-ffv1-v4/22/ (2024).

  27. Information Technology - JPEG 2000 Image Coding System: Motion JPEG 2000 - Part 3 (ISO, 2007); https://www.iso.org/standard/41570.html

  28. Li, Z. & Wang, X. Understanding is compression: v.0.1.0 Code Ocean https://doi.org/10.24433/CO.9735997.v1 (2024).


Acknowledgements

This work is partially supported by the National Key R&D Program of China grant no. 2022YFA1304603 (to M.L.); Proteomic Navigator of the Human Body Project (to M.L.); Canada’s NSERC OGP0046506 (to M.L. and C.W.); Canada Research Chair Program (to C.W.); National Natural Science Foundation of China grant no. 62072433 (to X.L.), grant no. 62088102 (to W.G. and M.L.) and grant no. 62025101 (to W.G. and M.L.); and Kechuang Yongjiang 2035 key technology breakthrough plan of Zhejiang Ningbo grant no. 2024Z119 (to C.H.). We thank N. Zhang and P. Vitanyi for discussions on Solomonoff induction. We thank C. Huang, Y. Xie, Z. Jiang, R. Wang and P. Guo for their discussions and related work in ref. 14 and ref. 9.

Author information


Contributions

M.L. conceived the presented idea. M.L., X.L., C.H., Q.Y. and W.G. developed the theory and supervised the findings of this work. Z.L., C.H., X.W., H.H. and C.W. performed the computations and carried out the experiments. Z.L., C.H., X.W., H.H., C.W., D.B., X.L. and M.L. wrote the paper. All authors discussed the results and contributed to the final paper.

Corresponding authors

Correspondence to Xingwu Liu or Ming Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Ziv Goldfeld, Jan Voges and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary information about experiments and Supplementary Tables 1–10.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, Z., Huang, C., Wang, X. et al. Lossless data compression by large models. Nat Mach Intell 7, 794–799 (2025). https://doi.org/10.1038/s42256-025-01033-7


  • Received:

  • Accepted:

  • Published:

  • Issue date:

  • DOI: https://doi.org/10.1038/s42256-025-01033-7
