
A dual-domain compute-in-memory system for general neural network inference

Abstract

Analogue compute-in-memory systems can offer higher energy efficiency and parallelism than conventional digital systems. However, complex regression tasks that require precise floating-point (FP) computing remain challenging on such hardware, and previous approaches have therefore typically focused on classification tasks, which require only low data precision and a limited dynamic range. Here we describe an analogue–digital unified compute-in-memory architecture for general neural network inference. The approach is based on a low-cost dual-domain FP processor and merges analogue compute-in-memory arrays with digital cores. It exhibits a 39.2-times higher energy efficiency than common FP32 multipliers during FP neural network inference. We use this architecture to develop a memristor-based computing system and illustrate its capabilities with a fully hardware-implemented complex regression task using YOLO. The system exhibits a 2.7-times higher mean average precision (increasing from 0.27 to 0.724, mAP-50) than pure analogue compute-in-memory systems.
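The dual-domain idea in the abstract can be sketched as quantize → integer MAC → dequantize: a low-precision integer multiply-accumulate stands in for the analogue compute-in-memory (ACIM) array, while the floating-point scales stay in the digital domain. The sketch below is an illustrative emulation, not the paper's actual data path; the function names and the 8-bit symmetric quantization scheme are assumptions.

```python
import numpy as np

def quantize(x, bits=8):
    # Per-tensor symmetric quantization: FP values become low-precision
    # integers suitable for an analogue array, plus an FP scale that is
    # tracked in the digital domain.
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    q = np.round(x / scale).astype(np.int32)
    return q, scale

def dual_domain_mac(w, x, bits=8):
    # "Analogue" domain (emulated): integer matrix-vector multiply,
    # the operation an ACIM array performs.
    qw, sw = quantize(w, bits)
    qx, sx = quantize(x, bits)
    acc = qw @ qx
    # Digital domain: dequantize the integer accumulator back to FP.
    return acc * (sw * sx)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
x = rng.standard_normal(16).astype(np.float32)
y = dual_domain_mac(w, x)  # close to the exact FP result w @ x
```

With 8-bit quantization the emulated result tracks the exact FP product closely; shrinking `bits` shows the precision loss that purely analogue systems suffer on regression tasks.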


Fig. 1: Current ACIM challenges and proposed AnDi computing architecture.
Fig. 2: Illustration of the analogue–digital unified architecture.
Fig. 3: Computation enhancement strategies.
Fig. 4: Performance analysis through two NN demonstrations.


Data availability

The data for the hardware demonstration, including pretrained NNs, training dataset and test dataset, are available at https://github.com/wangze22/AnDi/tree/master. Further data supporting the findings of this study can be obtained from the corresponding authors upon reasonable request.

Code availability

The code that supports the findings of this study requires specific hardware platforms for execution and is available from the corresponding authors upon reasonable request.


Acknowledgements

We thank S. Ding, W. Shi and W. Wu for supporting the development of the AnDi hardware system and Q. Qin, J. Li and T. Guo for their valuable discussions. This work is supported in part by STI 2030 – Major Projects (Grant No. 2021ZD0201205 to H.W.), the National Natural Science Foundation of China (Grant Nos. 92064001 to B.G., 624B2083 to Z.W., 62495103 to H.W. and 62025111 to H.W.), the Shanghai Municipal Science and Technology Major Project, the XPLORER Prize, the Beijing Advanced Innovation Center for Integrated Circuits and the IoT Intelligent Microsystem Center of Tsinghua University-China Mobile Communications Group Co., Ltd Joint Institute.

Author information

Authors and Affiliations

Contributions

Z.W. designed the AnDi architecture and its key features, including the DDFP data flow, quantization and dequantization units, as well as the hybrid mapping, dynamic scheduling, feature-enhancing and hybrid online training methods. Z.J. implemented the quantization and dequantization units. Z.W. and R.Y. designed and implemented the end-to-end ACIM tool chain. Z.W. conceived and conducted the hardware inference experiments and analysed the data. T.Y. and Z.W. trained the NNs. Z.W., R.Y. and Z.J. analysed the energy efficiency and set up the data pipeline. Z.H. and R.Y. implemented the softcore CPU. R.Y. implemented the compiler for the CIM instruction set. Q.L. designed the ACIM chip. Y. Liu and J.L. supported the software development kit for the ACIM chip. Z.W., R.Y. and B.G. wrote the manuscript. P.Y., J.T., Y. Li, Z. Hu and Z. Hao reviewed and improved the quality of the manuscript. H.W., J.T. and H.Q. supervised the project.

Corresponding authors

Correspondence to Bin Gao or Huaqiang Wu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Electronics thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Table 1 Details of the hardware parameters used in the performance evaluation
Extended Data Table 2 An example of a MAC tile with AnDi architecture
Extended Data Table 3 Comparison with prior works
Extended Data Table 4 Details of the YOLO neural networks used in the demonstration

Extended Data Fig. 1 Comparison of energy efficiency improvement and weight splitting between the hybrid mapper in AnDi and traditional mapper for a pure ACIM system.

a, Traditional mapper for a pure ACIM system. b, Hybrid mapper for the AnDi system. Both systems require weight splitting when the weight size exceeds the ACIM array's capacity, but the hybrid mapper does not split the weights into more blocks, nor does it introduce more complex memory-tracking issues, than the pure ACIM system. c, Energy efficiency of the traditional mapper and the hybrid mapper under different degrees of row parallelism.
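The row-splitting behaviour described above can be sketched as follows. This is a hypothetical reconstruction of one plausible policy, assuming a 256-row ACIM array: slices that fill a whole array go to analogue blocks, and the leftover slice, which a traditional mapper would place in an additional, underutilized analogue block, goes to the digital MAC (DMAC) cores instead. The function name and return format are illustrative.

```python
def hybrid_map(n_rows, acim_rows=256):
    """Split a weight matrix's input rows between ACIM blocks and the DMAC.

    Hypothetical policy: full array-sized slices map to analogue blocks;
    the remainder goes to the DMAC, so the hybrid mapper never creates
    more blocks than a pure-ACIM mapper would.
    """
    blocks, r = [], 0
    while n_rows - r > acim_rows:
        blocks.append(("ACIM", r, r + acim_rows))  # full analogue block
        r += acim_rows
    # Tail slice: analogue only if it fills an entire array.
    tail_core = "ACIM" if n_rows - r == acim_rows else "DMAC"
    blocks.append((tail_core, r, n_rows))
    return blocks
```

For a 600-row weight matrix this yields two full ACIM blocks plus one 88-row DMAC slice, the same block count a pure-ACIM mapper would need.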

Extended Data Fig. 2 An example for the analysis of delay and pipeline in the AnDi architecture.

a, Example of a MAC tile based on the AnDi architecture, accounting for pipeline delay. The example tile contains a 256 × 256 ACIM array with a delay of 50 ns and four DMAC cores, each capable of performing 64 MAC operations in parallel per computation. This design ensures that, regardless of the weights that the hybrid mapper assigns to the DMAC, the DMAC's computation delay stays within 50 ns, matching the ACIM computation delay and avoiding pipeline disruption. b, The computational units from panel a are encapsulated into a MAC tile; multiple tiles form the on-chip network. c, Each tile can operate in three computation modes: dual-core mode, where the ACIM and DMAC cores work simultaneously; ACIM-only mode, where only the ACIM core is active; and DMAC-only mode, where only the DMAC core is active. This design saves energy: when the hybrid mapper assigns only ACIM or only DMAC computations to a tile, the unused core can be powered off completely. d, Energy consumption trends of the DMAC and ACIM units as the computational load increases. The blue dashed line represents the power consumption of the AnDi system during MAC computations. e, When different weights are assigned to the DMAC, any of the four DMAC cores can be switched on or off independently to save energy while still meeting the 50 ns delay requirement. The maximum computational load that each DMAC in a tile needs to handle (the largest rectangle within the green area) depends on the peak energy efficiency ratio between the ACIM and DMAC, which is 10:1.
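The per-core gating in panel e amounts to powering on the fewest DMAC cores that still finish their assigned MACs within the 50 ns ACIM latency. In the sketch below, the array latency and the 64 parallel MACs per core come from the caption, but the 12.5 ns DMAC cycle time (four cycles per ACIM access) and the function itself are illustrative assumptions.

```python
import math

ACIM_DELAY_NS = 50      # ACIM array latency (from the caption)
MACS_PER_CORE = 64      # parallel MACs per DMAC core (from the caption)
DMAC_CYCLE_NS = 12.5    # hypothetical cycle time: 4 DMAC cycles per ACIM access

def active_dmac_cores(dmac_macs, n_cores=4):
    """Fewest DMAC cores that complete `dmac_macs` multiply-accumulates
    within the ACIM latency, so the dual-core pipeline never stalls.
    Cores left over can be power-gated, as in panel e."""
    # Each core fits MACS_PER_CORE MACs per cycle into the latency budget.
    per_core_budget = MACS_PER_CORE * (ACIM_DELAY_NS / DMAC_CYCLE_NS)
    needed = math.ceil(dmac_macs / per_core_budget)
    if needed > n_cores:
        raise ValueError("load exceeds the tile's DMAC capacity")
    return needed
```

Under these assumptions a 256-MAC assignment needs one active core, while a full 1,024-MAC load needs all four.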

Extended Data Fig. 3 AnDi memristor-based hardware system.

a, A photo of the AnDi computation hardware system running the adaptive self-driving car. b, Block diagram of the softcore RISC-V CPU and ACIM arrays.

Extended Data Fig. 4 Maps used for turn pass rate test for the self-driving car, and mapping locations of YOLO neural network weights onto ACIM arrays.

a, Maps used for turn pass rate test for the self-driving car. b, Mapping locations of YOLO neural network weights onto ACIM arrays.

Extended Data Fig. 5 YOLO inference results with and without the enhancing layers.

All data are computed on the AnDi hardware system.

Supplementary information

Supplementary Information

Supplementary Figs. 1–7 and discussion.

Supplementary Video 1

Video demonstration of the adaptive self-driving car tested on ten different maps.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Wang, Z., Yu, R., Jia, Z. et al. A dual-domain compute-in-memory system for general neural network inference. Nat Electron 8, 276–287 (2025). https://doi.org/10.1038/s41928-024-01315-9

