Abstract
Non-volatile compute-in-memory macros can reduce data transfer between processing and memory units, providing fast and energy-efficient artificial intelligence computations. However, the non-volatile compute-in-memory architecture typically relies on analogue computing, which is limited in terms of accuracy, scalability and robustness. Here we report a 64-kb non-volatile digital compute-in-memory macro based on 40-nm spin-transfer torque magnetic random-access memory technology. Our macro features in situ multiplication and digitization at the bitcell level, precision-reconfigurable digital addition and accumulation at the macro level and a toggle-rate-aware training scheme at the algorithm level. The macro supports lossless matrix–vector multiplications with flexible input and weight precisions (4, 8, 12 and 16 bits), and can achieve a software-equivalent inference accuracy for a residual network at 8-bit precision and physics-informed neural networks at 16-bit precision. Our non-volatile compute-in-memory macro has computation latencies of 7.4–29.6 ns and energy efficiencies of 7.02–112.3 tera-operations per second per watt for fully parallel matrix–vector multiplications across precision configurations ranging from 4 to 16 bits.
Data availability
The data that support the plots presented in this Article, as well as other findings derived from this study, are available from the corresponding authors upon reasonable request.
Code availability
Computer codes are available from the corresponding authors upon reasonable request.
References
Sze, V., Chen, Y.-H., Yang, T.-J. & Emer, J. S. Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105, 2295–2329 (2017).
Xu, X. et al. Scaling for edge inference of deep neural networks. Nat. Electron. 1, 216–222 (2018).
Di Ventra, M. & Pershin, Y. V. The parallel approach. Nat. Phys. 9, 200–202 (2013).
Horowitz, M. 1.1 Computing’s energy problem (and what we can do about it). In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 10–14 (IEEE, 2014).
Yu, E., K, G. K., Saxena, U. & Roy, K. Ferroelectric capacitors and field-effect transistors as in-memory computing elements for machine learning workloads. Sci. Rep. 14, 9426 (2024).
Luo, Y.-C. et al. Experimental demonstration of non-volatile capacitive crossbar array for in-memory computing. In Proc. IEEE International Electron Devices Meeting (IEDM) 21.4.1–21.4.4 (IEEE, 2021).
Slesazeck, S. et al. A 2TnC ferroelectric memory gain cell suitable for compute-in-memory and neuromorphic application. In Proc. IEEE International Electron Devices Meeting (IEDM) 38.6.1–38.6.4 (IEEE, 2019).
Chen, W.-H. et al. CMOS-integrated memristive non-volatile computing-in-memory for AI edge processors. Nat. Electron. 2, 420–428 (2019).
Cai, F. et al. A fully integrated reprogrammable memristor–CMOS system for efficient multiply–accumulate operations. Nat. Electron. 2, 290–299 (2019).
Yao, P. et al. Fully hardware-implemented memristor convolutional neural network. Nature 577, 641–646 (2020).
Lin, P. et al. Three-dimensional memristor circuits as complex neural networks. Nat. Electron. 3, 225–232 (2020).
Hung, J.-M. et al. A four-megabit compute-in-memory macro with eight-bit precision based on CMOS and resistive random-access memory for AI edge devices. Nat. Electron. 4, 921–930 (2021).
Xue, C.-X. et al. A CMOS-integrated compute-in-memory macro based on resistive random-access memory for AI edge devices. Nat. Electron. 4, 81–90 (2021).
Huo, Q. et al. A computing-in-memory macro based on three-dimensional resistive random-access memory. Nat. Electron. 5, 469–477 (2022).
Wan, W. et al. A compute-in-memory chip based on resistive random-access memory. Nature 608, 504–512 (2022).
Wen, T.-H. et al. Fusion of memristor and digital compute-in-memory processing for energy-efficient edge computing. Science 384, 325–332 (2024).
Cai, H. et al. A 28 nm 2 Mb STT-MRAM computing-in-memory macro with a refined bit-cell and 22.4–41.5 TOPS/W for AI inference. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 500–502 (IEEE, 2023).
Jung, S. et al. A crossbar array of magnetoresistive memory devices for in-memory computing. Nature 601, 211–216 (2022).
Xie, W. et al. A 709.3 TOPS/W event-driven smart vision SoC with high-linearity and reconfigurable MRAM PIM. In Proc. IEEE Symposium on VLSI Technology 1–2 (IEEE, 2023).
Deaville, P., Zhang, B. & Verma, N. A 22 nm 128-kb MRAM row/column-parallel in-memory computing macro with memory-resistance boosting and multi-column ADC readout. In Proc. IEEE Symposium on VLSI Technology 268–269 (IEEE, 2022).
Joshi, V. et al. Accurate deep neural network inference using computational phase-change memory. Nat. Commun. 11, 2473 (2020).
Le Gallo, M. et al. A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference. Nat. Electron. 6, 680–693 (2023).
Khaddam-Aljameh, R. et al. HERMES-core—a 1.59-TOPS/mm2 PCM on 14-nm CMOS in-memory compute core using 300-ps/LSB linearized CCO-based ADCs. IEEE J. Solid-State Circuits 57, 1027–1038 (2022).
Narayanan, P. et al. Fully on-chip MAC at 14 nm enabled by accurate row-wise programming of PCM-based weights and parallel vector-transport in duration-format. IEEE Trans. Electron Devices 68, 6629–6636 (2021).
Khwa, W.-S. et al. A 40-nm, 2M-cell, 8b-precision, hybrid SLC-MLC PCM computing-in-memory macro with 20.5–65.0 TOPS/W for tiny-AI edge devices. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 1–3 (IEEE, 2022).
Sun, Z. et al. A full spectrum of computing-in-memory technologies. Nat. Electron. 6, 823–835 (2023).
Kim, H., Yoo, T., Kim, T. T.-H. & Kim, B. Colonnade: a reconfigurable SRAM-based digital bit-serial compute-in-memory macro for processing neural networks. IEEE J. Solid-State Circuits 56, 2221–2233 (2021).
Murmann, B. Mixed-signal computing for deep neural network inference. IEEE Trans. Very Large Scale Integr. VLSI Syst. 29, 3–13 (2021).
Murmann, B., Verhelst, M. & Manoli, Y. Analog-to-information conversion. In NANO-CHIPS 2030: On-Chip AI for an Efficient Data-Driven World 275–292 (Springer International Publishing, 2020).
Murmann, B. A/D converter trends: power dissipation, scaling and digitally assisted architectures. In Proc. IEEE Custom Integrated Circuits Conference (CICC) 105–112 (IEEE, 2008).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Rudy, S. H., Brunton, S. L., Proctor, J. L. & Kutz, J. N. Data-driven discovery of partial differential equations. Sci. Adv. 3, e1602614 (2017).
Chih, Y.-D. et al. An 89 TOPS/W and 16.3 TOPS/mm2 all-digital SRAM-based full-precision compute-in-memory macro in 22 nm for machine-learning edge applications. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 252–254 (IEEE, 2021).
Mori, H. et al. A 4 nm 6163 TOPS/W/b 4790 TOPS/mm2/b SRAM-based digital-computing-in-memory macro supporting bit-width flexibility and simultaneous MAC and weight update. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 132–134 (IEEE, 2023).
Fujiwara, H. et al. A 3 nm, 32.5 TOPS/W, 55.0 TOPS/mm2 and 3.78 Mb/mm2 fully-digital compute-in-memory macro supporting INT12 × INT12 with a parallel-MAC architecture and foundry 6T-SRAM bit cell. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 572–574 (IEEE, 2024).
Shih, M.-E. et al. NVE: a 3 nm 23.2 TOPS/W 12b-digital-CIM-based neural engine for high-resolution visual-quality enhancement on smart devices. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 360–362 (IEEE, 2024).
Wang, J. et al. A 22 nm 29.3 TOPS/W end-to-end CIM-utilization-aware accelerator with reconfigurable 4D-CIM mapping and adaptive feature reuse for diverse CNNs and transformers. In Proc. IEEE Custom Integrated Circuits Conference (CICC) 1–3 (IEEE, 2025).
Lou, M. et al. Area-efficient and low-power 8T compute-SRAM bitcell design for digital compute-in-memory macros in 22 nm CMOS. IEEE Trans. Circuits Syst. II Express Briefs 72, 1459–1463 (2025).
Lu, A. et al. High-speed emerging memories for AI hardware accelerators. Nat. Rev. Electr. Eng. 1, 24–34 (2024).
Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R. & Eleftheriou, E. Memory devices and applications for in-memory computing. Nat. Nanotechnol. 15, 529–544 (2020).
Chiu, Y.-C. et al. A CMOS-integrated spintronic compute-in-memory macro for secure AI edge devices. Nat. Electron. 6, 534–543 (2023).
Yang, Z. et al. A novel computing-in-memory platform based on hybrid spintronic/CMOS memory. IEEE Trans. Electron Devices 69, 1698–1705 (2022).
Sayadi, L., Amirany, A., Moaiyeri, M. H. & Timarchi, S. Balancing precision and efficiency: an approximate multiplier with built-in error compensation for error-resilient applications. J. Supercomput. 81, 109 (2025).
Rezaei, M., Amirany, A., Moaiyeri, M. H. & Jafari, K. A reliable non-volatile in-memory computing associative memory based on spintronic neurons and synapses. Eng. Rep. 6, e12902 (2024).
Angizi, S., He, Z., Chen, A. & Fan, D. Hybrid spin-CMOS polymorphic logic gate with application in in-memory computing. IEEE Trans. Magn. 56, 3400215 (2020).
Tong, Z. et al. BSTCIM: a balanced symmetry ternary fully digital in-MRAM computing macro for energy efficiency neural network. IEEE Trans. Circuits Syst. Regul. Pap. 71, 6114–6127 (2024).
Mazaheri, M. M., Amirany, A. & Moaiyeri, M. H. TPCSA-MRAM: ternary precharge sense amplifier-based MRAM. IEEE Access 12, 132817–132824 (2024).
Rajaei, R. & Amirany, A. Nonvolatile low-cost approximate spintronic full adders for computing in memory architectures. IEEE Trans. Magn. 56, 3400308 (2020).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).
Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images (University of Toronto, 2009).
Xu, S. et al. A practical approach to flow field reconstruction with sparse or incomplete data through physics informed neural network. Acta Mech. Sin. 39, 322302 (2023).
Chandrakasan, A. P. & Brodersen, R. W. Minimizing power consumption in digital CMOS circuits. Proc. IEEE 83, 498–523 (1995).
Natarajarathinam, A., Zhu, R., Visscher, P. B. & Gupta, S. Perpendicular magnetic tunnel junctions based on thin CoFeB free layer and Co-based multilayer synthetic antiferromagnet pinned layers. J. Appl. Phys. 111, 07C918 (2012).
Song, J., Dixit, H., Behin-Aein, B., Kim, C. H. & Taylor, W. Impact of process variability on write error rate and read disturbance in STT-MRAM devices. IEEE Trans. Magn. 56, 3400411 (2020).
Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
Deng, L. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process Mag. 29, 141–142 (2012).
Yang, J. et al. TIMAQ: a time-domain computing-in-memory-based processor using predictable decomposed convolution for arbitrary quantized DNNs. IEEE J. Solid-State Circuits 56, 3021–3038 (2021).
Jain, S., Lin, L. & Alioto, M. ±CIM SRAM for signed in-memory broad-purpose computing from DSP to neural processing. IEEE J. Solid-State Circuits 56, 2981–2992 (2021).
Yoshioka, K. A 818–4,094 TOPS/W capacitor-reconfigured CIM macro for unified acceleration of CNNs and transformers. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 574–576 (IEEE, 2024).
Acknowledgements
This work was supported by the National Key R&D Program of China (grant number 2022YFB4400200 to T.M.), the National Natural Science Foundation of China (grant numbers 62274081 to L.L. and 12327806 to Z.C.), Zhujiang Young Talent Program (grant number 2023QN10X177 to L.L.) and Shenzhen Stable Support Plan Program for Higher Education Institutions Research Program (grant number 20231121110457002 to L.L.). We acknowledge the SUSTech SME-Pixelcore Neuromorphic In-sensor Computing Joint Laboratory and the SUSTech SME-CIMCube Joint Laboratory for experimental support in this work.
Author information
Authors and Affiliations
Contributions
L.L. and T.M. conceived and supervised the project. H. Li, R.P. and L.L. designed the circuits for the nvDCIM macro and test chip. H. Li, W.D. and J.H. performed the training, quantization and inference of the NNs and implemented the toggle-rate-aware training algorithm. H. Li, Z.C., S.L., Z.K., X.Y., X.W., Z.Y., H. Lyu, H.Y. and X.Z. performed the experiments, including device characterization and building the chip-testing platform, and chip testing. H. Li, Z.C., J.L., F.Z., Y.L., Z.X., T.M. and L.L. analysed the data. H. Li, Z.C., T.M. and L.L. wrote the paper. All authors reviewed and approved the paper.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Electronics thanks Abdolah Amirany, Esteban Garzón and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 STT-MRAM device.
(a) Vertical structure of the MTJ. RL: reference layer, which has a fixed magnetization. FL: free layer, whose magnetization can switch between parallel (P) and antiparallel (AP) orientations relative to the reference layer. HL: hard layer, which possesses strong perpendicular magnetic anisotropy (PMA). (b) Measured I–V curve of the MTJ's magnetic resistive switching. The SEM image shows the MTJ's critical diameter of 78 nm. (c) Measured R–V curve of the MTJ's magnetic resistive switching, showing a clear resistance change between the high-resistance (antiparallel, AP) and low-resistance (parallel, P) states. The TMR ratio is approximately 170% at 0.1 V. TMR: tunnel magnetoresistance, TMR = (R_AP − R_P) / R_P × 100%. (d) Measured distribution of state switching voltages (V_AP→P, V_P→AP), with a mean V_AP→P of 0.461 V (standard deviation σ = 0.029 V, CV = 6.3%) and a mean V_P→AP of −0.299 V (σ = 0.020 V, CV = 6.7%). (e) Resistance distributions of R_P and R_AP, where the mean R_AP is 9199.2 Ω (σ = 480.7 Ω, CV = 5.2%) and the mean R_P is 3363.7 Ω (σ = 171.8 Ω, CV = 5.1%). (f) Measured TMR distribution, with a mean of 173.5% (σ = 3.9%, CV = 2.2%). CV: coefficient of variation, CV = σ / μ × 100%.
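As a quick sanity check on these device figures, the TMR ratio and coefficient of variation (CV) defined in the legend can be computed directly; this is a small illustrative sketch of those two formulas, not code from the study:

```python
# Sketch of the legend's definitions:
#   TMR = (R_AP - R_P) / R_P * 100%
#   CV  = standard deviation / mean * 100%
import statistics

def tmr_percent(r_ap: float, r_p: float) -> float:
    """Tunnel magnetoresistance ratio in percent."""
    return (r_ap - r_p) / r_p * 100.0

def cv_percent(samples: list) -> float:
    """Coefficient of variation of a set of measurements, in percent."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100.0

# Applying the mean resistances reported in panel (e) reproduces the
# mean TMR of about 173.5% reported in panel (f):
print(round(tmr_percent(9199.2, 3363.7), 1))
```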
Extended Data Fig. 2 MVM timing diagram, test chip architecture, and test flow chart.
(a) The nvDCIM test chip architecture featuring the nvDCIM macro, integrated on-chip buffers, a clock generator, and SPI interfaces for data transfer. (b) nvDCIM chip test flow chart. (c) MVM timing diagram illustrating bit-serial processing for two examples: 4-bit unsigned weight with 4-bit signed input and 8-bit signed weight with 4-bit unsigned input.
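The bit-serial processing shown in the timing diagram can be sketched in software. The following is our own illustrative model, not the macro's datapath: the input vector is streamed one bit plane at a time, each 1-bit plane is multiplied with the weight matrix, and the partial products are shift-accumulated; for signed (two's-complement) inputs the most significant bit plane carries a negative weight.

```python
# Illustrative bit-serial MVM (assumed model, not the nvDCIM RTL).
# weights: list of rows of integer weights; inputs: integer input vector
# in two's-complement with `in_bits` bits when signed=True.
def bit_serial_mvm(weights, inputs, in_bits, signed=True):
    acc = [0] * len(weights)
    for b in range(in_bits):
        # MSB plane of a two's-complement input has negative weight.
        sign = -1 if (signed and b == in_bits - 1) else 1
        plane = [(x >> b) & 1 for x in inputs]  # 1-bit input plane
        for row, w_row in enumerate(weights):
            pp = sum(w * xb for w, xb in zip(w_row, plane))  # 1-bit MVM
            acc[row] += sign * (pp << b)  # shift-and-accumulate
    return acc

# 4-bit signed inputs, as in the first example of the timing diagram:
print(bit_serial_mvm([[1, 2], [3, -1]], [3, -2], in_bits=4))
```

Because the partial products are accumulated digitally at full width, the result equals the exact integer matrix–vector product, which is the sense in which the macro's MVM is lossless.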
Extended Data Fig. 3 Experimental and measurement platform for evaluating the nvDCIM chips.
(a) The experimental platform consists of an nvDCIM test chip, a PCB test board, and a National Instruments (NI) PXIe system, including the PXIe-6570 and PXIe-8881 modules, which handle chip control, intermediate data processing, and result visualization. Additionally, two source/measurement units (SMUs) are included in the platform for power measurements. (b) A flowchart illustrating the inference process conducted on the experimental platform. During the execution, the 64-kb nvDCIM macro performs parallel and lossless MVM operations across 4-, 8-, 12-, and 16-bit precisions for convolutional and fully connected layers. Input vectors and matrix data are supplied to the nvDCIM macro via the NI PXIe-6570 controlled through LabVIEW, which also retrieves the MVM results. Beyond this, the PXIe-6570 implements ReLU activations, pooling, Tanh activations, and batch normalization. The PXIe-8881 processes and displays the final inference results. The system supports a range of computational tasks, such as low computational precision tasks (for example, image classification with CNNs) and high computational precision tasks (for example, flow field reconstruction using PINNs).
Extended Data Fig. 4 Power and area breakdown.
(a) Area breakdown of the main macro components. (b) Power consumption breakdown of the nvDCIM macro components measured during the inference of the ResNet-20 model on the CIFAR-10 dataset.
Extended Data Fig. 5 Measured shmoo plot, energy efficiency, and test results across 24 nvDCIM chips.
(a) Measured shmoo plot of the nvDCIM macro showing the relationship between supply voltage (VDD) and maximum clock frequency (fCLK) while operating in 4-bit-input, 4-bit-weight, and 16-bit-output mode. F, Fail; P, Pass. (b) Measured energy efficiency of the nvDCIM macro versus VDD in 4-bit-input, 4-bit-weight, and 16-bit-output mode, when the weight sparsity is 50% and the input toggle rate ranges from 50% to 6.25%. (c) Wafer map showing the 12 selected shots (highlighted in blue), sampled in a Z-pattern. (d) Photograph of the fabricated 12-inch wafer, showing the positions of the selected shots, corresponding to the Z-pattern used in (c). (e) Measured throughput (TOPS) distribution at VDD = 1.20 V, with a mean of 4.44 TOPS (standard deviation σ = 0.10 TOPS, CV = 2.3%). (f) Measured throughput (TOPS) distribution at VDD = 0.65 V, with a mean of 0.64 TOPS (σ = 0.03 TOPS, CV = 4.7%). (g) Measured energy-efficiency (TOPS/W) distribution at VDD = 1.20 V, with a mean of 40.1 TOPS/W (σ = 1.97 TOPS/W, CV = 4.9%). (h) Measured energy-efficiency (TOPS/W) distribution at VDD = 0.65 V, with a mean of 86.0 TOPS/W (σ = 8.82 TOPS/W, CV = 10.3%). CV: coefficient of variation, CV = σ / μ × 100%.
Extended Data Fig. 6 PINN model quantization and performance.
Flow field reconstruction with the PINN: a comparison of predictions for u (streamwise velocity), v (spanwise velocity) and p (pressure) at varying computational precision levels, against computational fluid dynamics (CFD) benchmark data (ref. 32). (a) CFD benchmark data. (b) Predictions from the FP32 PINN model. (c) Predictions from the INT16 PINN model. (d) Predictions from the INT12 PINN model. (e) Predictions from the INT8 PINN model. (f) Predictions from the INT4 PINN model. (g) Relative L2 norm (RL2): temporal RL2 for u with different bit precisions and the overall RL2 for u. (h) Temporal RL2 for v with different bit precisions and the overall RL2 for v. (i) Temporal RL2 for p with different bit precisions and the overall RL2 for p.
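The relative L2 norm used to score the reconstructions is the standard normalized error, ||prediction − reference||₂ / ||reference||₂; a minimal sketch of this metric (our own, assuming flattened field samples) is:

```python
# Relative L2 norm (RL2) between a predicted field and a reference field,
# both given as flat sequences of samples. Assumed form of the metric.
import math

def relative_l2(pred, ref):
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    return num / den

# A perfect reconstruction has RL2 = 0; predicting all zeros gives RL2 = 1.
print(relative_l2([1.0, 2.0], [1.0, 2.0]), relative_l2([0.0, 0.0], [3.0, 4.0]))
```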
Extended Data Fig. 7 Energy model construction.
In neural network workloads, convolutional layers are mapped onto the nvDCIM hardware to perform efficient MVM operations, with feature maps and kernels assigned to input drivers and memory banks. Dynamic energy consumption is calculated by aggregating the energy costs of IBMD-bitcells, input drivers and adders based on precomputed energy look-up tables. This model enables an evaluation of both model accuracy and energy consumption across convolutional and fully connected layers.
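The look-up-table aggregation described above can be sketched as follows; the per-event energy values and function names here are hypothetical placeholders, not the authors' calibrated numbers:

```python
# Sketch of a LUT-based dynamic-energy model: per-component event energies
# are precomputed, then aggregated over the operations a layer maps onto
# the macro. All values below are illustrative, in femtojoules.
ENERGY_LUT_FJ = {
    "bitcell_mul": 0.8,   # one in-bitcell multiplication/digitization event
    "input_drive": 1.5,   # one input-driver toggle
    "adder_op":    2.1,   # one addition in the adder tree
}

def layer_energy_fj(n_macs, n_input_toggles, n_adds, lut=ENERGY_LUT_FJ):
    """Aggregate dynamic energy of one layer from its event counts."""
    return (n_macs * lut["bitcell_mul"]
            + n_input_toggles * lut["input_drive"]
            + n_adds * lut["adder_op"])
```

Because the input-driver term scales with the number of toggles rather than the number of inputs, a model like this makes the energy benefit of a lower toggle rate directly visible.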
Extended Data Fig. 8 Toggle-rate-aware training results.
(a) Relationship between input toggle rate and accuracy for the LeNet-5 INT4 model (dataset: MNIST). Increasing the regularization factor λ reduces the toggle rate while slightly impacting accuracy, demonstrating a tradeoff. (b) Energy efficiency improvement for LeNet-5. Reducing the toggle rate leads to notable energy-efficiency gains, as illustrated by energy-model estimations and chip-level measurements. (c) Relationship between input toggle rate and accuracy for the ResNet-20 INT8 model (dataset: CIFAR-100). As with LeNet-5, increasing λ reduces the toggle rate with minimal accuracy degradation. (d) Energy efficiency improvement for ResNet-20 (dataset: CIFAR-100). Larger λ values yield greater energy-efficiency improvements.
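One way the λ-weighted tradeoff in these panels can be realized is by adding a toggle-rate penalty to the task loss during training; this is a minimal sketch of that idea under our own simplifying assumptions (a scalar bitstream and a simple flip count), not the paper's training code:

```python
# Toggle-rate-aware loss sketch: the penalty counts bit flips between
# adjacent bits of a serialized input stream, and lambda (lam) trades
# accuracy against toggle rate, as in the figure.
def toggle_rate(bits):
    """Fraction of adjacent bit pairs that differ in a serialized bitstream."""
    flips = sum(b1 != b0 for b0, b1 in zip(bits, bits[1:]))
    return flips / max(len(bits) - 1, 1)

def regularized_loss(task_loss, input_bits, lam):
    """Task loss plus a lambda-weighted toggle-rate penalty."""
    return task_loss + lam * toggle_rate(input_bits)

# An alternating bitstream toggles on every cycle (rate 1.0), a constant
# stream never toggles (rate 0.0).
print(toggle_rate([0, 1, 0, 1]), toggle_rate([0, 0, 0]))
```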
Supplementary information
Supplementary Information
Supplementary Figs. 1–4, Notes 1–5, Table 1 and References.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, H., Chai, Z., Dong, W. et al. A lossless and fully parallel spintronic compute-in-memory macro for artificial intelligence chips. Nat Electron (2025). https://doi.org/10.1038/s41928-025-01479-y