Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing

Abstract

Large language models (LLMs), with their remarkable generative capacities, have greatly impacted a range of fields, but they face scalability challenges due to their large parameter counts, which result in high costs for training and inference. The trend of increasing model sizes is exacerbating these challenges, particularly in terms of memory footprint, latency and energy consumption. Here we explore the deployment of ‘mixture of experts’ (MoEs) networks—networks that use conditional computing to keep computational demands low despite having many parameters—on three-dimensional (3D) non-volatile memory (NVM)-based analog in-memory computing (AIMC) hardware. When combined with the MoE architecture, this hardware, utilizing stacked NVM devices arranged in a crossbar array, offers a solution to the parameter-fetching bottleneck typical in traditional models deployed on conventional von-Neumann-based architectures. By simulating the deployment of MoEs on an abstract 3D AIMC system, we demonstrate that, due to their conditional compute mechanism, MoEs are inherently better suited to this hardware than conventional, dense model architectures. Our findings suggest that MoEs, in conjunction with emerging 3D NVM-based AIMC, can substantially reduce the inference costs of state-of-the-art LLMs, making them more accessible and energy-efficient.
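
To make the 'conditional computing' mechanism concrete, the sketch below shows a generic top-k routed mixture-of-experts feed-forward block in PyTorch. It is an illustrative example only, not the authors' implementation: the dimensions, number of experts and the value of k are arbitrary placeholders. The point is simply that each token activates only k of the expert weight matrices, so the number of parameters that must be fetched (or, on AIMC hardware, the number of crossbar tiles that must be driven) per token stays small even as the total parameter count grows.

```python
# Illustrative sketch of a top-k routed mixture-of-experts (MoE) feed-forward
# block. Not the authors' code; sizes and routing details are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)   # routing probabilities
        weight, idx = gate.topk(self.k, dim=-1)    # keep only k experts per token
        weight = weight / weight.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                      # where expert e was selected
            token_mask = mask.any(dim=-1)          # tokens routed to expert e
            if token_mask.any():
                w = weight[token_mask][mask[token_mask]].unsqueeze(-1)
                out[token_mask] += w * expert(x[token_mask])
        return out

tokens = torch.randn(16, 512)
print(TopKMoE()(tokens).shape)                     # torch.Size([16, 512])
```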

Fig. 1: Three-dimensional AIMC-based implementation of MoEs.
Fig. 2: Overview of the simulation framework.
Fig. 3: Inference performance of MoEs deployed on 3D AIMC.
Fig. 4: Comparison of throughput, area efficiency and energy efficiency between simulated pipelined execution of LLMs on 3D AIMC against the NVIDIA A100.
Fig. 5: MoEs form the Pareto front in terms of model accuracy against system performance when compared to dense models.
Fig. 6: Impact of AIMC noise on iso-performance.

Data availability

Data were generated by the presented simulator and by evaluating the models trained on the WikiText-103 dataset. The WikiText-103 dataset is publicly available at https://huggingface.co/datasets/Salesforce/wikitext. Source data are provided with this paper (ref. 57).
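
As a hedged illustration (not part of the released code), the WikiText-103 corpus referenced above can be fetched with the Hugging Face datasets library; the configuration name 'wikitext-103-raw-v1' is an assumption based on the public dataset card, not something specified in the paper.

```python
# Illustrative only: load the publicly hosted WikiText-103 corpus.
# The configuration name "wikitext-103-raw-v1" is assumed from the dataset card.
from datasets import load_dataset

wikitext = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1")
print(wikitext)                      # train / validation / test splits
print(wikitext["test"][0]["text"])   # first raw text entry of the test split
```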

Code availability

The code used to generate the results of this study is publicly available from https://github.com/IBM/analog-moe (ref. 58) and https://github.com/IBM/3D-CiM-LLM-Inference-Simulator (ref. 59).

References

  1. Jiang, A. Q. et al. Mixtral of experts. Preprint at https://arxiv.org/abs/2401.04088 (2024).

  2. Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2024).

  3. Gemini Team Google et al. Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).

  4. Brown, T. B. et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems Vol. 33 (eds Larochelle, H. et al.) 1877–1901 (Curran Associates, 2020).

  5. Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://arxiv.org/abs/2001.08361 (2020).

  6. Hoffmann, J. et al. An empirical analysis of compute-optimal large language model training. In Proc. Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) Vol. 35 (Curran Associates, 2022).

  7. Chowdhery, A. et al. PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 24, 11324–11436 (2023).

  8. Jordan, M. & Jacobs, R. Hierarchical mixtures of experts and the EM algorithm. In Proc. 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan) Vol. 2, 1339–1344 (IEEE, 1993); https://doi.org/10.1109/IJCNN.1993.716791

  9. Jacobs, R. A., Jordan, M. I., Nowlan, S. J. & Hinton, G. E. Adaptive mixtures of local experts. Neural Comput. 3, 79–87 (1991).

  10. Shazeer, N. et al. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In Proc. International Conference on Learning Representations (ICLR, 2017); https://openreview.net/forum?id=B1ckMDqlg

  11. Fedus, W., Zoph, B. & Shazeer, N. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 23, 5232–5270 (2022).

  12. Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 5485–5551 (2020).

  13. Du, N. et al. GLaM: efficient scaling of language models with mixture-of-experts. In Proc. 39th International Conference on Machine Learning, Proceedings of Machine Learning Research Vol. 162, 5547–5569 (PMLR, 2022).

  14. Clark, A. et al. Unified scaling laws for routed language models. In Proc. 39th International Conference on Machine Learning Vol. 162 (eds Chaudhuri, K. et al.) 4057–4086 (PMLR, 2022).

  15. Ludziejewski, J. et al. Scaling laws for fine-grained mixture of experts. In Proc. ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models (PMLR, 2024); https://openreview.net/forum?id=Iizr8qwH7J

  16. Csordás, R., Irie, K. & Schmidhuber, J. Approximating two-layer feedforward networks for efficient transformers. In Proc. Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H. et al.) 674–692 (ACL, 2023); https://doi.org/10.18653/v1/2023.findings-emnlp.49

  17. Reuther, A. et al. AI and ML accelerator survey and trends. In Proc. 2022 IEEE High Performance Extreme Computing Conference (HPEC) 1–10 (IEEE, 2022); https://doi.org/10.1109/HPEC55821.2022.9926331

  18. Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R. & Eleftheriou, E. Memory devices and applications for in-memory computing. Nat. Nanotechnol. 15, 529–544 (2020).

  19. Lanza, M. et al. Memristive technologies for data storage, computation, encryption and radio-frequency communication. Science 376, eabj9979 (2022).

  20. Mannocci, P. et al. In-memory computing with emerging memory devices: status and outlook. APL Mach. Learn 1, 010902 (2023).

  21. Huang, Y. et al. Memristor-based hardware accelerators for artificial intelligence. Nat. Rev. Electr. Eng. 1, 286–299 (2024).

  22. Le Gallo, M. et al. A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference. Nat. Electron. 6, 680–693 (2023).

  23. Ambrogio, S. et al. An analog-AI chip for energy-efficient speech recognition and transcription. Nature 620, 768–775 (2023).

  24. Wan, W. et al. A compute-in-memory chip based on resistive random-access memory. Nature 608, 504–512 (2022).

  25. Zhang, W. et al. Edge learning using a fully integrated neuro-inspired memristor chip. Science 381, 1205–1211 (2023).

  26. Wen, T.-H. et al. Fusion of memristor and digital compute-in-memory processing for energy-efficient edge computing. Science 384, 325–332 (2024).

  27. Fick, L., Skrzyniarz, S., Parikh, M., Henry, M. B. & Fick, D. Analog matrix processor for edge AI real-time video analytics. In Proc. 2022 IEEE International Solid-State Circuits Conference (ISSCC) Vol. 65, 260–262 (IEEE, 2022); https://doi.org/10.1109/ISSCC42614.2022.9731773

  28. Arnaud, F. et al. High density embedded PCM cell in 28 nm FDSOI technology for automotive micro-controller applications. In Proc. 2020 IEEE International Electron Devices Meeting (IEDM) 24.2.1–24.2.4 (IEEE, 2020); https://doi.org/10.1109/IEDM13553.2020.9371934

  29. Lee, S. et al. A 1 Tb 4b/cell 64-stacked-WL 3D NAND flash memory with 12 MB/s program throughput. In Proc. 2018 IEEE International Solid-State Circuits Conference (ISSCC) 340–342 (IEEE, 2018); https://doi.org/10.1109/ISSCC.2018.8310323

  30. Park, J.-W. et al. A 176-stacked 512 Gb 3b/cell 3D-NAND flash with 10.8 Gb/mm2 density with a peripheral circuit under cell array architecture. In Proc. 2021 IEEE International Solid-State Circuits Conference (ISSCC) Vol. 64, 422–423 (IEEE, 2021); https://doi.org/10.1109/ISSCC42613.2021.9365809

  31. Lee, S.-T. & Lee, J.-H. Neuromorphic computing using NAND flash memory architecture with pulse width modulation scheme. Front. Neurosci. 14, 571292 (2020).

  32. Bavandpour, M., Sahay, S., Mahmoodi, M. R. & Strukov, D. B. 3D-aCortex: an ultra-compact energy-efficient neurocomputing platform based on commercial 3D-NAND flash memories. Neuromorphic Comput. Eng. 1, 014001 (2021).

  33. Shim, W. & Yu, S. Technological design of 3D NAND-based compute-in-memory architecture for GB-scale deep neural network. IEEE Electron Device Lett. 42, 160–163 (2020).

  34. Hsieh, C.-C. et al. Chip demonstration of a high-density (43 Gb) and high-search-bandwidth (300 Gb/s) 3D NAND based in-memory search accelerator for Ternary Content Addressable Memory (TCAM) and proximity search of Hamming distance. In Proc. 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits) 1–2 (IEEE, 2023); https://doi.org/10.23919/VLSITechnologyandCir57934.2023.10185361

  35. Huo, Q. et al. A computing-in-memory macro based on three-dimensional resistive random-access memory. Nat. Electron. 5, 469–477 (2022).

  36. Jain, S. et al. A heterogeneous and programmable compute-in-memory accelerator architecture for analog-AI using dense 2-D mesh. IEEE Trans. Very Large Scale Integr. VLSI Syst. 31, 114–127 (2023).

  37. Cui, C. et al. A survey on multimodal large language models for autonomous driving. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops 958–979 (IEEE, 2024).

  38. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1, 4171–4186 (ACL, 2019); https://doi.org/10.18653/v1/N19-1423

  39. Kim, W., Son, B. & Kim, I. ViLT: vision-and-language transformer without convolution or region supervision. In Proc. 38th International Conference on Machine Learning Vol. 139 (eds Meila, M. & Zhang, T.) 5583–5594 (PMLR, 2021); https://proceedings.mlr.press/v139/kim21k.html

  40. Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. In Proc. Advances in Neural Information Processing Systems Vol. 35 (eds Koyejo, S. et al.) 23716–23736 (Curran Associates, 2022); https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf

  41. Pope, R. et al. Efficiently scaling transformer inference. In Proc. Machine Learning and Systems Vol. 5 (eds Song, D. et al.) 606–624 (Curran Associates, 2023); https://proceedings.mlsys.org/paper_files/paper/2023/file/c4be71ab8d24cdfb45e3d06dbfca2780-Paper-mlsys2023.pdf

  42. Choquette, J., Gandhi, W., Giroux, O., Stam, N. & Krashinsky, R. NVIDIA A100 Tensor Core GPU: performance and innovation. IEEE Micro 41, 29–35 (2021).

  43. Radford, A. et al. Language models are unsupervised multitask learners. Semantic Scholar https://api.semanticscholar.org/CorpusID:160025533 (2019).

  44. Merity, S., Xiong, C., Bradbury, J. & Socher, R. Pointer sentinel mixture models. In Proc. International Conference on Learning Representations (ICLR, 2017); https://openreview.net/forum?id=Byj72udxe

  45. Vasilopoulos, A. et al. Exploiting the state dependency of conductance variations in memristive devices for accurate in-memory computing. IEEE Trans. Electron Devices 70, 6279–6285 (2023).

  46. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. Advances in Neural Information Processing Systems Vol. 32, 8024–8035 (Curran Associates, 2019).

  47. Reed, J. K., DeVito, Z., He, H., Ussery, A. & Ansel, J. Torch.fx: practical program capture and transformation for deep learning in Python. Preprint at https://arxiv.org/abs/2112.08429 (2021).

  48. Vaswani, A. et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) (Curran Associates, 2017); https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  49. Fisher, J. A. Trace scheduling: a technique for global microcode compaction. IEEE Trans. Comput. C-30, 478–490 (1981).

  50. Bernstein, D. & Rodeh, M. Global instruction scheduling for superscalar machines. In Proc. ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation PLDI ’91 241–255 (ACM, 1991); https://doi.org/10.1145/113445.113466

  51. Joshi, V. et al. Accurate deep neural network inference using computational phase-change memory. Nat. Commun. 11, 2473 (2020).

  52. Kudo, T. & Richardson, J. SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds Blanco, E. & Lu, W.) 66–71 (ACL, 2018).

  53. Tillet, P., Kung, H. T. & Cox, D. Triton: an intermediate language and compiler for tiled neural network computations. In Proc. 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages 10–19 (ACM, 2019); https://doi.org/10.1145/3315508.3329973

  54. Le Gallo, M. et al. Using the IBM analog in-memory hardware acceleration kit for neural network training and inference. APL Mach. Learn. 1, 041102 (2023).

  55. Büchel, J. et al. AIHWKIT-lightning: a scalable HW-aware training toolkit for analog in-memory computing. In Proc. Advances in Neural Information Processing Systems 2024 Workshop, Machine Learning with new Compute Paradigms (Curran Associates, 2024); https://openreview.net/forum?id=QNdxOgGmhR

  56. Büchel, J. et al. Gradient descent-based programming of analog in-memory computing cores. In Proc. 2022 International Electron Devices Meeting (IEDM) 33.1.1–33.1.4 (IEEE, 2022); https://doi.org/10.1109/IEDM45625.2022.10019486

  57. Büchel, J. Source data for figures in ‘Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing’. Zenodo https://doi.org/10.5281/zenodo.14146703 (2024).

  58. Büchel, J. IBM/analog-moe: code release. Zenodo https://doi.org/10.5281/zenodo.14025079 (2024).

  59. Büchel, J. & Vasilopoulos, A. IBM/3D-CiM-LLM-Inference-Simulator: code release. Zenodo https://doi.org/10.5281/zenodo.14025077 (2024).

  60. Goda, A. 3D NAND technology achievements and future scaling perspectives. IEEE Trans. Electron Devices 67, 1373–1381 (2020).

  61. Lacaita, A. L., Spinelli, A. S. & Compagnoni, C. M. High-density solid-state storage: a long path to success. In Proc. 2021 IEEE Latin America Electron Devices Conference (LAEDC) 1–4 (IEEE, 2021); https://doi.org/10.1109/LAEDC51812.2021.9437865

  62. Shoeybi, M. et al. Megatron-LM: training multi-billion parameter language models using model parallelism. Preprint at https://arxiv.org/abs/1909.08053 (2020).

Acknowledgements

We thank G. Atwood, A. Goda and D. Mills from Micron for fruitful discussions and technical insights. We also thank R. Csordás from Stanford for valuable help in the reproduction of the Sigma-MoE results. We thank P. Diener for helping with illustrations. We also thank J. Burns from IBM and M. Helm from Micron for managerial support. We received no specific funding for this work.

Author information

Contributions

J.B., A.V., I.B., A.R., M.L.G. and A.S. initiated the research effort. J.B. and A.V. implemented the high-level simulator. J.B. implemented the GPU kernels for training MoEs and conducted the experiments. W.A.S., J.B. and G.W.B. compared the high-level simulator to a more detailed simulator developed for 2D AIMC. B.F. and H.C. provided insights on 3D in-memory computing. H.T., V.N. and A.S. provided managerial support. J.B. and A.V. wrote the manuscript with input from all authors.

Corresponding authors

Correspondence to Julian Büchel or Abu Sebastian.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Erika Covi, Anand Subramoney and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Jie Pan, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Table 1 Hyperparameters underlying the results presented in Fig. 1
Extended Data Table 2 Assumed latency and energy consumption of different operations for varying model dimensions
Extended Data Table 3 TOPs and target processing unit per operation
Extended Data Table 4 Hyperparameters used for comparison against NVIDIA A100 GPU
Extended Data Table 5 Parameters used for the scaling laws
Extended Data Table 6 Hyperparameters used for training FP-32 dense and MoE models

Supplementary information

Supplementary Information

Supplementary Notes 1–8.

Reporting Summary

Supplementary Data 1

Source data of Supplementary Fig. 1.

Supplementary Data 2

Source data of Supplementary Fig. 2.

Supplementary Data 3

Source data of Supplementary Fig. 3.

Supplementary Data 4

Source data of Supplementary Fig. 4.

Supplementary Data 5

Source data of Supplementary Fig. 5.

Supplementary Data 6

Source data of Supplementary Fig. 6.

Supplementary Data 7

Source data of Supplementary Fig. 7.

Supplementary Data 8

Source data of Supplementary Fig. 8.

Supplementary Data 9

Source data of Supplementary Fig. 9.

Supplementary Data 10

Source data of Supplementary Fig. 10.

Source data

Source Data Fig. 1

Source data of Fig. 1.

Source Data Fig. 3

Source data of Fig. 3.

Source Data Fig. 4

Source data of Fig. 4.

Source Data Fig. 5

Source data of Fig. 5.

Source Data Fig. 6

Source data of Fig. 6.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Büchel, J., Vasilopoulos, A., Simon, W.A. et al. Efficient scaling of large language models with mixture of experts and 3D analog in-memory computing. Nat Comput Sci 5, 13–26 (2025). https://doi.org/10.1038/s43588-024-00753-x

