
A dual-domain compute-in-memory system for general neural network inference

Abstract

Analogue compute-in-memory systems can offer higher energy efficiency and parallelism than conventional digital systems. However, complex regression tasks that require precise floating-point (FP) computing remain challenging on such hardware, and previous approaches have therefore typically focused on classification tasks, which require only low data precision and a limited dynamic range. Here we describe an analogue–digital unified compute-in-memory architecture for general neural network inference. The approach is based on a low-cost dual-domain FP processor and merges analogue compute-in-memory arrays with digital cores. It exhibits a 39.2-times higher energy efficiency than common FP32 multipliers during FP neural network inference. We use this architecture to develop a memristor-based computing system and illustrate its capabilities with a fully hardware-implemented complex regression task using YOLO. The system exhibits a 2.7-times higher mean average precision (increasing from 0.27 to 0.724, mAP-50) than pure analogue compute-in-memory systems.
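The dual-domain idea in the abstract can be sketched as quantize → integer MAC → dequantize: a low-precision integer multiply-accumulate stands in for the analogue compute-in-memory (ACIM) array, while the floating-point scales stay in the digital domain. The sketch below is an illustrative emulation, not the paper's actual data path; the function names and the 8-bit symmetric quantization scheme are assumptions.

```python
import numpy as np

def quantize(x, bits=8):
    # Per-tensor symmetric quantization: FP values become low-precision
    # integers suitable for an analogue array, plus an FP scale that is
    # tracked in the digital domain.
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    q = np.round(x / scale).astype(np.int32)
    return q, scale

def dual_domain_mac(w, x, bits=8):
    # "Analogue" domain (emulated): integer matrix-vector multiply,
    # the operation an ACIM array performs.
    qw, sw = quantize(w, bits)
    qx, sx = quantize(x, bits)
    acc = qw @ qx
    # Digital domain: dequantize the integer accumulator back to FP.
    return acc * (sw * sx)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
x = rng.standard_normal(16).astype(np.float32)
y = dual_domain_mac(w, x)  # close to the exact FP result w @ x
```

With 8-bit quantization the emulated result tracks the exact FP product closely; shrinking `bits` shows the precision loss that purely analogue systems suffer on regression tasks.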


Fig. 1: Current ACIM challenges and proposed AnDi computing architecture.
Fig. 2: Illustration of the analogue–digital unified architecture.
Fig. 3: Computation enhancement strategies.
Fig. 4: Performance analysis through two NN demonstrations.


Data availability

The data for the hardware demonstration, including pretrained NNs, training dataset and test dataset, are available at https://github.com/wangze22/AnDi/tree/master. Further data supporting the findings of this study can be obtained from the corresponding authors upon reasonable request.

Code availability

The code that supports the findings of this study requires specific hardware platforms for execution and is available from the corresponding authors upon reasonable request.


Acknowledgements

We thank S. Ding, W. Shi and W. Wu for supporting the development of the AnDi hardware system and Q. Qin, J. Li and T. Guo for their valuable discussions. This work is supported in part by STI 2030 – Major Projects (Grant No. 2021ZD0201205 to H.W.), the National Natural Science Foundation of China (Grant Nos. 92064001 to B.G., 624B2083 to Z.W., 62495103 to H.W. and 62025111 to H.W.), the Shanghai Municipal Science and Technology Major Project, the XPLORER Prize, the Beijing Advanced Innovation Center for Integrated Circuits and the IoT Intelligent Microsystem Center of Tsinghua University-China Mobile Communications Group Co., Ltd Joint Institute.

Author information

Authors and Affiliations

Contributions

Z.W. designed the AnDi architecture and its key features, including the DDFP data flow, quantization and dequantization units, as well as the hybrid mapping, dynamic scheduling, feature-enhancing and hybrid online training methods. Z.J. implemented the quantization and dequantization units. Z.W. and R.Y. designed and implemented the end-to-end ACIM tool chain. Z.W. conceived and conducted the hardware inference experiments and analysed the data. T.Y. and Z.W. trained the NNs. Z.W., R.Y. and Z.J. analysed the energy efficiency and set up the data pipeline. Z.H. and R.Y. implemented the softcore CPU. R.Y. implemented the compiler for the CIM instruction set. Q.L. designed the ACIM chip. Y. Liu and J.L. supported the software development kit for the ACIM chip. Z.W., R.Y. and B.G. wrote the manuscript. P.Y., J.T., Y. Li, Z. Hu and Z. Hao reviewed and improved the quality of the manuscript. H.W., J.T. and H.Q. supervised the project.

Corresponding authors

Correspondence to Bin Gao or Huaqiang Wu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Electronics thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Table 1 Details of the hardware parameters used in the performance evaluation
Extended Data Table 2 An example of a MAC tile with AnDi architecture
Extended Data Table 3 Comparison with prior works
Extended Data Table 4 Details of the YOLO neural networks used in the demonstration

Extended Data Fig. 1 Comparison of energy efficiency improvement and weight splitting between the hybrid mapper in AnDi and traditional mapper for a pure ACIM system.

a, Traditional mapper for a pure ACIM system. b, Hybrid mapper for the AnDi system. Both systems require weight splitting when the weight size exceeds the ACIM array's capacity, but the hybrid mapper does not split the weights into more blocks, nor does it introduce more complex memory-tracking issues, than the pure ACIM system. c, Energy efficiency of the traditional mapper and the hybrid mapper under different degrees of row parallelism.
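The row-splitting behaviour described above can be sketched as follows. This is a hypothetical reconstruction of one plausible policy, assuming a 256-row ACIM array: slices that fill a whole array go to analogue blocks, and the leftover slice, which a traditional mapper would place in an additional, underutilized analogue block, goes to the digital MAC (DMAC) cores instead. The function name and return format are illustrative.

```python
def hybrid_map(n_rows, acim_rows=256):
    """Split a weight matrix's input rows between ACIM blocks and the DMAC.

    Hypothetical policy: full array-sized slices map to analogue blocks;
    the remainder goes to the DMAC, so the hybrid mapper never creates
    more blocks than a pure-ACIM mapper would.
    """
    blocks, r = [], 0
    while n_rows - r > acim_rows:
        blocks.append(("ACIM", r, r + acim_rows))  # full analogue block
        r += acim_rows
    # Tail slice: analogue only if it fills an entire array.
    tail_core = "ACIM" if n_rows - r == acim_rows else "DMAC"
    blocks.append((tail_core, r, n_rows))
    return blocks
```

For a 600-row weight matrix this yields two full ACIM blocks plus one 88-row DMAC slice, the same block count a pure-ACIM mapper would need.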

Extended Data Fig. 2 An example for the analysis of delay and pipeline in the AnDi architecture.

a, Example of a MAC tile based on the AnDi architecture, accounting for pipeline delay. The example tile contains a 256 × 256 ACIM array with a delay of 50 ns and four DMAC cores, each capable of performing 64 MAC operations in parallel per computation. This design ensures that, regardless of the weights that the hybrid mapper assigns to the DMAC, the DMAC's computation delay stays within 50 ns, matching the ACIM computation delay and avoiding pipeline disruption. b, The computational units from panel a are encapsulated into a MAC tile; multiple tiles form the on-chip network. c, Each tile can operate in three computation modes: dual-core mode, where the ACIM and DMAC cores work simultaneously; ACIM-only mode, where only the ACIM core is active; and DMAC-only mode, where only the DMAC core is active. This design saves energy: when the hybrid mapper assigns only ACIM or only DMAC computations to a tile, the unused core can be powered off completely. d, Energy consumption trends of the DMAC and ACIM units as the computational load increases. The blue dashed line represents the power consumption of the AnDi system during MAC computations. e, When different weights are assigned to the DMAC, any of the four DMAC cores can be switched on or off independently to save energy while still meeting the 50 ns delay requirement. The maximum computational load that each DMAC in a tile needs to handle (the largest rectangle within the green area) depends on the peak energy efficiency ratio between the ACIM and DMAC, which is 10:1.
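The per-core gating in panel e amounts to powering on the fewest DMAC cores that still finish their assigned MACs within the 50 ns ACIM latency. In the sketch below, the array latency and the 64 parallel MACs per core come from the caption, but the 12.5 ns DMAC cycle time (four cycles per ACIM access) and the function itself are illustrative assumptions.

```python
import math

ACIM_DELAY_NS = 50      # ACIM array latency (from the caption)
MACS_PER_CORE = 64      # parallel MACs per DMAC core (from the caption)
DMAC_CYCLE_NS = 12.5    # hypothetical cycle time: 4 DMAC cycles per ACIM access

def active_dmac_cores(dmac_macs, n_cores=4):
    """Fewest DMAC cores that complete `dmac_macs` multiply-accumulates
    within the ACIM latency, so the dual-core pipeline never stalls.
    Cores left over can be power-gated, as in panel e."""
    # Each core fits MACS_PER_CORE MACs per cycle into the latency budget.
    per_core_budget = MACS_PER_CORE * (ACIM_DELAY_NS / DMAC_CYCLE_NS)
    needed = math.ceil(dmac_macs / per_core_budget)
    if needed > n_cores:
        raise ValueError("load exceeds the tile's DMAC capacity")
    return needed
```

Under these assumptions a 256-MAC assignment needs one active core, while a full 1,024-MAC load needs all four.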

Extended Data Fig. 3 AnDi memristor-based hardware system.

a, A photo of the AnDi computation hardware system running the adaptive self-driving car. b, Block diagram of the softcore RISC-V CPU and ACIM arrays.

Extended Data Fig. 4 Maps used for turn pass rate test for the self-driving car, and mapping locations of YOLO neural network weights onto ACIM arrays.

a, Maps used for turn pass rate test for the self-driving car. b, Mapping locations of YOLO neural network weights onto ACIM arrays.

Extended Data Fig. 5 YOLO inference results with and without the enhancing layers.

All data are computed on the AnDi hardware system.

Supplementary information

Supplementary Information

Supplementary Figs. 1–7 and discussion.

Supplementary Video 1

Video demonstration of the adaptive self-driving car tested on ten different maps.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Wang, Z., Yu, R., Jia, Z. et al. A dual-domain compute-in-memory system for general neural network inference. Nat Electron 8, 276–287 (2025). https://doi.org/10.1038/s41928-024-01315-9

