Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing

Abstract

Large language models (LLMs), with their remarkable generative capacities, have greatly impacted a range of fields, but they face scalability challenges due to their large parameter counts, which result in high costs for training and inference. The trend of increasing model sizes is exacerbating these challenges, particularly in terms of memory footprint, latency and energy consumption. Here we explore the deployment of ‘mixture of experts’ (MoEs) networks—networks that use conditional computing to keep computational demands low despite having many parameters—on three-dimensional (3D) non-volatile memory (NVM)-based analog in-memory computing (AIMC) hardware. When combined with the MoE architecture, this hardware, utilizing stacked NVM devices arranged in a crossbar array, offers a solution to the parameter-fetching bottleneck typical in traditional models deployed on conventional von-Neumann-based architectures. By simulating the deployment of MoEs on an abstract 3D AIMC system, we demonstrate that, due to their conditional compute mechanism, MoEs are inherently better suited to this hardware than conventional, dense model architectures. Our findings suggest that MoEs, in conjunction with emerging 3D NVM-based AIMC, can substantially reduce the inference costs of state-of-the-art LLMs, making them more accessible and energy-efficient.
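
To make the 'conditional computing' mechanism concrete, the sketch below shows a generic top-k routed mixture-of-experts feed-forward block in PyTorch. It is an illustrative example only, not the authors' implementation: the dimensions, number of experts and the value of k are arbitrary placeholders. The point is simply that each token activates only k of the expert weight matrices, so the number of parameters that must be fetched (or, on AIMC hardware, the number of crossbar tiles that must be driven) per token stays small even as the total parameter count grows.

```python
# Illustrative sketch of a top-k routed mixture-of-experts (MoE) feed-forward
# block. Not the authors' code; sizes and routing details are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)   # routing probabilities
        weight, idx = gate.topk(self.k, dim=-1)    # keep only k experts per token
        weight = weight / weight.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                      # where expert e was selected
            token_mask = mask.any(dim=-1)          # tokens routed to expert e
            if token_mask.any():
                w = weight[token_mask][mask[token_mask]].unsqueeze(-1)
                out[token_mask] += w * expert(x[token_mask])
        return out

tokens = torch.randn(16, 512)
print(TopKMoE()(tokens).shape)                     # torch.Size([16, 512])
```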

Fig. 1: Three-dimensional AIMC-based implementation of MoEs.
Fig. 2: Overview of the simulation framework.
Fig. 3: Inference performance of MoEs deployed on 3D AIMC.
Fig. 4: Comparison of throughput, area efficiency and energy efficiency between simulated pipelined execution of LLMs on 3D AIMC against the NVIDIA A100.
Fig. 5: MoEs form the Pareto front in terms of model accuracy against system performance when compared to dense models.
Fig. 6: Impact of AIMC noise on iso-performance.

Data availability

Data were generated by the presented simulator and by evaluating the models trained on the WikiText-103 dataset. The WikiText-103 dataset is publicly available at https://huggingface.co/datasets/Salesforce/wikitext. Source data are provided with this paper (ref. 57).
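
As a hedged illustration (not part of the released code), the WikiText-103 corpus referenced above can be fetched with the Hugging Face datasets library; the configuration name 'wikitext-103-raw-v1' is an assumption based on the public dataset card, not something specified in the paper.

```python
# Illustrative only: load the publicly hosted WikiText-103 corpus.
# The configuration name "wikitext-103-raw-v1" is assumed from the dataset card.
from datasets import load_dataset

wikitext = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")
print(wikitext)                      # train / validation / test splits
print(wikitext["test"][0]["text"])   # first raw text entry of the test split
```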

Code availability

The code used to generate the results of this study is publicly available from https://github.com/IBM/analog-moe (ref. 58) and https://github.com/IBM/3D-CiM-LLM-Inference-Simulator (ref. 59).

References

  1. Jiang, A. Q. et al. Mixtral of experts. Preprint at https://arxiv.org/abs/2401.04088 (2024).

  2. Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2024).

  3. Gemini Team Google et al. Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).

  4. Brown, T. B. et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems Vol. 33 (eds Larochelle, H. et al.) 1877–1901 (Curran Associates, 2020).

  5. Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://arxiv.org/abs/2001.08361 (2020).

  6. Hoffmann, J. et al. An empirical analysis of compute-optimal large language model training. In Proc. Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) Vol. 35 (Curran Associates, 2022).

  7. Chowdhery, A. et al. PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 24, 11324–11436 (2023).

  8. Jordan, M. & Jacobs, R. Hierarchical mixtures of experts and the EM algorithm. In Proc. 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan) Vol. 2, 1339–1344 (IEEE, 1993); https://doi.org/10.1109/IJCNN.1993.716791

  9. Jacobs, R. A., Jordan, M. I., Nowlan, S. J. & Hinton, G. E. Adaptive mixtures of local experts. Neural Comput. 3, 79–87 (1991).

  10. Shazeer, N. et al. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In Proc. International Conference on Learning Representations (ICLR, 2017); https://openreview.net/forum?id=B1ckMDqlg

  11. Fedus, W., Zoph, B. & Shazeer, N. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 23, 5232–5270 (2022).

  12. Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 5485–5551 (2020).

  13. Du, N. et al. GLaM: efficient scaling of language models with mixture-of-experts. In Proc. 39th International Conference on Machine Learning, Proceedings of Machine Learning Research Vol. 162, 5547–5569 (PMLR, 2022).

  14. Clark, A. et al. Unified scaling laws for routed language models. In Proc. 39th International Conference on Machine Learning Vol. 162 (eds Chaudhuri, K. et al.) 4057–4086 (PMLR, 2022).

  15. Ludziejewski, J. et al. Scaling laws for fine-grained mixture of experts. In Proc. ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models (PMLR, 2024); https://openreview.net/forum?id=Iizr8qwH7J

  16. Csordás, R., Irie, K. & Schmidhuber, J. Approximating two-layer feedforward networks for efficient transformers. In Proc. Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H. et al.) 674–692 (ACL, 2023); https://doi.org/10.18653/v1/2023.findings-emnlp.49

  17. Reuther, A. et al. AI and ML accelerator survey and trends. In Proc. 2022 IEEE High Performance Extreme Computing Conference (HPEC) 1–10 (IEEE, 2022); https://doi.org/10.1109/HPEC55821.2022.9926331

  18. Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R. & Eleftheriou, E. Memory devices and applications for in-memory computing. Nat. Nanotechnol. 15, 529–544 (2020).

  19. Lanza, M. et al. Memristive technologies for data storage, computation, encryption and radio-frequency communication. Science 376, eabj9979 (2022).

  20. Mannocci, P. et al. In-memory computing with emerging memory devices: status and outlook. APL Mach. Learn 1, 010902 (2023).

  21. Huang, Y. et al. Memristor-based hardware accelerators for artificial intelligence. Nat. Rev. Electr. Eng. 1, 286–299 (2024).

  22. Le Gallo, M. et al. A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference. Nat. Electron. 6, 680–693 (2023).

  23. Ambrogio, S. et al. An analog-AI chip for energy-efficient speech recognition and transcription. Nature 620, 768–775 (2023).

  24. Wan, W. et al. A compute-in-memory chip based on resistive random-access memory. Nature 608, 504–512 (2022).

  25. Zhang, W. et al. Edge learning using a fully integrated neuro-inspired memristor chip. Science 381, 1205–1211 (2023).

  26. Wen, T.-H. et al. Fusion of memristor and digital compute-in-memory processing for energy-efficient edge computing. Science 384, 325–332 (2024).

  27. Fick, L., Skrzyniarz, S., Parikh, M., Henry, M. B. & Fick, D. Analog matrix processor for edge AI real-time video analytics. In Proc. 2022 IEEE International Solid-State Circuits Conference (ISSCC) Vol. 65, 260–262 (IEEE, 2022); https://doi.org/10.1109/ISSCC42614.2022.9731773

  28. Arnaud, F. et al. High density embedded PCM cell in 28 nm FDSOI technology for automotive micro-controller applications. In Proc. 2020 IEEE International Electron Devices Meeting (IEDM) 24.2.1–24.2.4 (IEEE, 2020); https://doi.org/10.1109/IEDM13553.2020.9371934

  29. Lee, S. et al. A 1 Tb 4b/cell 64-stacked-WL 3D NAND flash memory with 12 MB/s program throughput. In Proc. 2018 IEEE International Solid-State Circuits Conference (ISSCC) 340–342 (IEEE, 2018); https://doi.org/10.1109/ISSCC.2018.8310323

  30. Park, J.-W. et al. A 176-stacked 512 Gb 3b/cell 3D-NAND flash with 10.8 Gb/mm2 density with a peripheral circuit under cell array architecture. In Proc. 2021 IEEE International Solid-State Circuits Conference (ISSCC) Vol. 64, 422–423 (IEEE, 2021); https://doi.org/10.1109/ISSCC42613.2021.9365809

  31. Lee, S.-T. & Lee, J.-H. Neuromorphic computing using NAND flash memory architecture with pulse width modulation scheme. Front. Neurosci. 14, 571292 (2020).

  32. Bavandpour, M., Sahay, S., Mahmoodi, M. R. & Strukov, D. B. 3D-aCortex: an ultra-compact energy-efficient neurocomputing platform based on commercial 3D-NAND flash memories. Neuromorphic Comput. Eng. 1, 014001 (2021).

  33. Shim, W. & Yu, S. Technological design of 3D NAND-based compute-in-memory architecture for GB-scale deep neural network. IEEE Electron Device Lett. 42, 160–163 (2020).

  34. Hsieh, C.-C. et al. Chip demonstration of a high-density (43 Gb) and high-search-bandwidth (300 Gb/s) 3D NAND based in-memory search accelerator for Ternary Content Addressable Memory (TCAM) and proximity search of Hamming distance. In Proc. 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits) 1–2 (IEEE, 2023); https://doi.org/10.23919/VLSITechnologyandCir57934.2023.10185361

  35. Huo, Q. et al. A computing-in-memory macro based on three-dimensional resistive random-access memory. Nat. Electron. 5, 469–477 (2022).

  36. Jain, S. et al. A heterogeneous and programmable compute-in-memory accelerator architecture for analog-AI using dense 2-D mesh. IEEE Trans. Very Large Scale Integr. VLSI Syst. 31, 114–127 (2023).

  37. Cui, C. et al. A survey on multimodal large language models for autonomous driving. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops 958–979 (IEEE, 2024).

  38. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1, 4171–4186 (ACL, 2019); https://doi.org/10.18653/v1/N19-1423

  39. Kim, W., Son, B. & Kim, I. ViLT: vision-and-language transformer without convolution or region supervision. In Proc. 38th International Conference on Machine Learning Vol. 139 (eds Meila, M. & Zhang, T.) 5583–5594 (PMLR, 2021); https://proceedings.mlr.press/v139/kim21k.html

  40. Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. In Proc. Advances in Neural Information Processing Systems Vol. 35 (eds Koyejo, S. et al.) 23716–23736 (Curran Associates, 2022); https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf

  41. Pope, R. et al. Efficiently scaling transformer inference. In Proc. Machine Learning and Systems Vol. 5 (eds Song, D. et al.) 606–624 (Curran Associates, 2023); https://proceedings.mlsys.org/paper_files/paper/2023/file/c4be71ab8d24cdfb45e3d06dbfca2780-Paper-mlsys2023.pdf

  42. Choquette, J., Gandhi, W., Giroux, O., Stam, N. & Krashinsky, R. NVIDIA A100 Tensor Core GPU: performance and innovation. IEEE Micro 41, 29–35 (2021).

  43. Radford, A. et al. Language models are unsupervised multitask learners. Semantic Scholar https://api.semanticscholar.org/CorpusID:160025533 (2019).

  44. Merity, S., Xiong, C., Bradbury, J. & Socher, R. Pointer sentinel mixture models. In Proc. International Conference on Learning Representations (ICLR, 2017); https://openreview.net/forum?id=Byj72udxe

  45. Vasilopoulos, A. et al. Exploiting the state dependency of conductance variations in memristive devices for accurate in-memory computing. IEEE Trans. Electron Devices 70, 6279–6285 (2023).

  46. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. Advances in Neural Information Processing Systems Vol. 32, 8024–8035 (Curran Associates, 2019).

  47. Reed, J. K., DeVito, Z., He, H., Ussery, A. & Ansel, J. Torch.fx: practical program capture and transformation for deep learning in Python. Preprint at https://arxiv.org/abs/2112.08429 (2021).

  48. Vaswani, A. et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) (Curran Associates, 2017); https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  49. Fisher, J. A. Trace scheduling: a technique for global microcode compaction. IEEE Trans. Comput. C-30, 478–490 (1981).

  50. Bernstein, D. & Rodeh, M. Global instruction scheduling for superscalar machines. In Proc. ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation PLDI ’91 241–255 (ACM, 1991); https://doi.org/10.1145/113445.113466

  51. Joshi, V. et al. Accurate deep neural network inference using computational phase-change memory. Nat. Commun. 11, 2473 (2020).

  52. Kudo, T. & Richardson, J. SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds Blanco, E. & Lu, W.) 66–71 (ACL, 2018).

  53. Tillet, P., Kung, H. T. & Cox, D. Triton: an intermediate language and compiler for tiled neural network computations. In Proc. 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages 10–19 (ACM, 2019); https://doi.org/10.1145/3315508.3329973

  54. Le Gallo, M. et al. Using the IBM analog in-memory hardware acceleration kit for neural network training and inference. APL Mach. Learn. 1, 041102 (2023).

  55. Büchel, J. et al. AIHWKIT-lightning: a scalable HW-aware training toolkit for analog in-memory computing. In Proc. Advances in Neural Information Processing Systems 2024 Workshop, Machine Learning with new Compute Paradigms (Curran Associates, 2024); https://openreview.net/forum?id=QNdxOgGmhR

  56. Büchel, J. et al. Gradient descent-based programming of analog in-memory computing cores. In Proc. 2022 International Electron Devices Meeting (IEDM) 33.1.1–33.1.4 (IEEE, 2022); https://doi.org/10.1109/IEDM45625.2022.10019486

  57. Büchel, J. Source data for figures in ‘Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing’. Zenodo https://doi.org/10.5281/zenodo.14146703 (2024).

  58. Büchel, J. IBM/analog-moe: code release. Zenodo https://doi.org/10.5281/zenodo.14025079 (2024).

  59. Büchel, J. & Vasilopoulos, A. IBM/3D-CiM-LLM-Inference-Simulator: code release. Zenodo https://doi.org/10.5281/zenodo.14025077 (2024).

  60. Goda, A. 3D NAND technology achievements and future scaling perspectives. IEEE Trans. Electron Devices 67, 1373–1381 (2020).

  61. Lacaita, A. L., Spinelli, A. S. & Compagnoni, C. M. High-density solid-state storage: a long path to success. In Proc. 2021 IEEE Latin America Electron Devices Conference (LAEDC) 1–4 (IEEE, 2021); https://doi.org/10.1109/LAEDC51812.2021.9437865

  62. Shoeybi, M. et al. Megatron-LM: training multi-billion parameter language models using model parallelism. Preprint at https://arxiv.org/abs/1909.08053 (2020).

Acknowledgements

We thank G. Atwood, A. Goda and D. Mills from Micron for fruitful discussions and technical insights. We also thank R. Csordás from Stanford for valuable help in the reproduction of the Sigma-MoE results. We thank P. Diener for helping with illustrations. We also thank J. Burns from IBM and M. Helm from Micron for managerial support. We received no specific funding for this work.

Author information

Contributions

J.B., A.V., I.B., A.R., M.L.G. and A.S. initiated the research effort. J.B. and A.V. implemented the high-level simulator. J.B. implemented the GPU kernels for training MoEs and conducted the experiments. W.A.S., J.B. and G.W.B. compared the high-level simulator to a more detailed simulator developed for 2D AIMC. B.F. and H.C. provided insights on 3D in-memory computing. H.T., V.N. and A.S. provided managerial support. J.B. and A.V. wrote the manuscript with input from all authors.

Corresponding authors

Correspondence to Julian Büchel or Abu Sebastian.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Erika Covi, Anand Subramoney and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Jie Pan, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Table 1 Hyperparameters underlying the results presented in Fig. 1
Extended Data Table 2 Assumed latency and energy consumption of different operations for varying model dimensions
Extended Data Table 3 TOPs and target processing unit per operation
Extended Data Table 4 Hyperparameters used for comparison against NVIDIA A100 GPU
Extended Data Table 5 Parameters used for the scaling laws
Extended Data Table 6 Hyperparameters used for training FP-32 dense and MoE models

Supplementary information

Supplementary Information

Supplementary Notes 1–8.

Reporting Summary

Supplementary Data 1

Source data of Supplementary Fig. 1.

Supplementary Data 2

Source data of Supplementary Fig. 2.

Supplementary Data 3

Source data of Supplementary Fig. 3.

Supplementary Data 4

Source data of Supplementary Fig. 4.

Supplementary Data 5

Source data of Supplementary Fig. 5.

Supplementary Data 6

Source data of Supplementary Fig. 6.

Supplementary Data 7

Source data of Supplementary Fig. 7.

Supplementary Data 8

Source data of Supplementary Fig. 8.

Supplementary Data 9

Source data of Supplementary Fig. 9.

Supplementary Data 10

Source data of Supplementary Fig. 10.

Source data

Source Data Fig. 1

Source data of Fig. 1.

Source Data Fig. 3

Source data of Fig. 3.

Source Data Fig. 4

Source data of Fig. 4.

Source Data Fig. 5

Source data of Fig. 5.

Source Data Fig. 6

Source data of Fig. 6.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Büchel, J., Vasilopoulos, A., Simon, W.A. et al. Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing. Nat Comput Sci 5, 13–26 (2025). https://doi.org/10.1038/s43588-024-00753-x

