A lossless and fully parallel spintronic compute-in-memory macro for artificial intelligence chips

Abstract

Non-volatile compute-in-memory macros can reduce data transfer between processing and memory units, providing fast and energy-efficient artificial intelligence computations. However, the non-volatile compute-in-memory architecture typically relies on analogue computing, which is limited in terms of accuracy, scalability and robustness. Here we report a 64-kb non-volatile digital compute-in-memory macro based on 40-nm spin-transfer torque magnetic random-access memory technology. Our macro features in situ multiplication and digitization at the bitcell level, precision-reconfigurable digital addition and accumulation at the macro level and a toggle-rate-aware training scheme at the algorithm level. The macro supports lossless matrix–vector multiplications with flexible input and weight precisions (4, 8, 12 and 16 bits), and can achieve a software-equivalent inference accuracy for a residual network at 8-bit precision and physics-informed neural networks at 16-bit precision. Our non-volatile compute-in-memory macro has computation latencies of 7.4–29.6 ns and energy efficiencies of 7.02–112.3 tera-operations per second per watt for fully parallel matrix–vector multiplications across precision configurations ranging from 4 to 16 bits.

Fig. 1: Motivation and overview of the nvDCIM macro.
Fig. 2: Overview of the IBMD bitcell.
Fig. 3: System architecture of the nvDCIM macro.
Fig. 4: Toggle-rate-aware training scheme.

Data availability

The data that support the plots presented in this Article, as well as other findings derived from this study, are available from the corresponding authors upon reasonable request.

Code availability

Computer codes are available from the corresponding authors upon reasonable request.

References

  1. Sze, V., Chen, Y.-H., Yang, T.-J. & Emer, J. S. Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105, 2295–2329 (2017).

  2. Xu, X. et al. Scaling for edge inference of deep neural networks. Nat. Electron. 1, 216–222 (2018).

  3. Di Ventra, M. & Pershin, Y. V. The parallel approach. Nat. Phys. 9, 200–202 (2013).

  4. Horowitz, M. 1.1 Computing’s energy problem (and what we can do about it). In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 10–14 (IEEE, 2014).

  5. Yu, E., K, G. K., Saxena, U. & Roy, K. Ferroelectric capacitors and field-effect transistors as in-memory computing elements for machine learning workloads. Sci. Rep. 14, 9426 (2024).

  6. Luo, Y.-C. et al. Experimental demonstration of non-volatile capacitive crossbar array for in-memory computing. In Proc. IEEE International Electron Devices Meeting (IEDM) 21.4.1–21.4.4 (IEEE, 2021).

  7. Slesazeck, S. et al. A 2TnC ferroelectric memory gain cell suitable for compute-in-memory and neuromorphic application. In Proc. IEEE International Electron Devices Meeting (IEDM) 38.6.1–38.6.4 (IEEE, 2019).

  8. Chen, W.-H. et al. CMOS-integrated memristive non-volatile computing-in-memory for AI edge processors. Nat. Electron. 2, 420–428 (2019).

  9. Cai, F. et al. A fully integrated reprogrammable memristor–CMOS system for efficient multiply–accumulate operations. Nat. Electron. 2, 290–299 (2019).

  10. Yao, P. et al. Fully hardware-implemented memristor convolutional neural network. Nature 577, 641–646 (2020).

  11. Lin, P. et al. Three-dimensional memristor circuits as complex neural networks. Nat. Electron. 3, 225–232 (2020).

  12. Hung, J.-M. et al. A four-megabit compute-in-memory macro with eight-bit precision based on CMOS and resistive random-access memory for AI edge devices. Nat. Electron. 4, 921–930 (2021).

  13. Xue, C.-X. et al. A CMOS-integrated compute-in-memory macro based on resistive random-access memory for AI edge devices. Nat. Electron. 4, 81–90 (2021).

  14. Huo, Q. et al. A computing-in-memory macro based on three-dimensional resistive random-access memory. Nat. Electron. 5, 469–477 (2022).

  15. Wan, W. et al. A compute-in-memory chip based on resistive random-access memory. Nature 608, 504–512 (2022).

  16. Wen, T.-H. et al. Fusion of memristor and digital compute-in-memory processing for energy-efficient edge computing. Science 384, 325–332 (2024).

  17. Cai, H. et al. A 28 nm 2 Mb STT-MRAM computing-in-memory macro with a refined bit-cell and 22.4–41.5 TOPS/W for AI inference. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 500–502 (IEEE, 2023).

  18. Jung, S. et al. A crossbar array of magnetoresistive memory devices for in-memory computing. Nature 601, 211–216 (2022).

  19. Xie, W. et al. A 709.3 TOPS/W event-driven smart vision SoC with high-linearity and reconfigurable MRAM PIM. In Proc. IEEE Symposium on VLSI Technology 1–2 (IEEE, 2023).

  20. Deaville, P., Zhang, B. & Verma, N. A 22 nm 128-kb MRAM row/column-parallel in-memory computing macro with memory-resistance boosting and multi-column ADC readout. In Proc. IEEE Symposium on VLSI Technology 268–269 (IEEE, 2022).

  21. Joshi, V. et al. Accurate deep neural network inference using computational phase-change memory. Nat. Commun. 11, 2473 (2020).

  22. Le Gallo, M. et al. A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference. Nat. Electron. 6, 680–693 (2023).

  23. Khaddam-Aljameh, R. et al. HERMES-core—a 1.59-TOPS/mm2 PCM on 14-nm CMOS in-memory compute core using 300-ps/LSB linearized CCO-based ADCs. IEEE J. Solid-State Circuits 57, 1027–1038 (2022).

  24. Narayanan, P. et al. Fully on-chip MAC at 14 nm enabled by accurate row-wise programming of PCM-based weights and parallel vector-transport in duration-format. IEEE Trans. Electron Devices 68, 6629–6636 (2021).

  25. Khwa, W.-S. et al. A 40-nm, 2M-cell, 8b-precision, hybrid SLC-MLC PCM computing-in-memory macro with 20.5–65.0 TOPS/W for tiny-AI edge devices. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 1–3 (IEEE, 2022).

  26. Sun, Z. et al. A full spectrum of computing-in-memory technologies. Nat. Electron. 6, 823–835 (2023).

  27. Kim, H., Yoo, T., Kim, T. T.-H. & Kim, B. Colonnade: a reconfigurable SRAM-based digital bit-serial compute-in-memory macro for processing neural networks. IEEE J. Solid-State Circuits 56, 2221–2233 (2021).

  28. Murmann, B. Mixed-signal computing for deep neural network inference. IEEE Trans. Very Large Scale Integr. VLSI Syst. 29, 3–13 (2021).

  29. Murmann, B., Verhelst, M. & Manoli, Y. Analog-to-information conversion. In NANO-CHIPS 2030: On-Chip AI for an Efficient Data-Driven World 275–292 (Springer International Publishing, 2020).

  30. Murmann, B. A/D converter trends: power dissipation, scaling and digitally assisted architectures. In Proc. IEEE Custom Integrated Circuits Conference (CICC) 105–112 (IEEE, 2008).

  31. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  32. Rudy, S. H., Brunton, S. L., Proctor, J. L. & Kutz, J. N. Data-driven discovery of partial differential equations. Sci. Adv. 3, e1602614 (2017).

  33. Chih, Y.-D. et al. An 89 TOPS/W and 16.3 TOPS/mm2 all-digital SRAM-based full-precision compute-in-memory macro in 22 nm for machine-learning edge applications. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 252–254 (IEEE, 2021).

  34. Mori, H. et al. A 4 nm 6163 TOPS/W/b 4790 TOPS/mm2/b SRAM-based digital-computing-in-memory macro supporting bit-width flexibility and simultaneous MAC and weight update. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 132–134 (IEEE, 2023).

  35. Fujiwara, H. et al. A 3 nm, 32.5 TOPS/W, 55.0 TOPS/mm2 and 3.78 Mb/mm2 fully-digital compute-in-memory macro supporting INT12 × INT12 with a parallel-MAC architecture and foundry 6T-SRAM bit cell. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 572–574 (IEEE, 2024).

  36. Shih, M.-E. et al. NVE: a 3 nm 23.2 TOPS/W 12b-digital-CIM-based neural engine for high-resolution visual-quality enhancement on smart devices. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 360–362 (IEEE, 2024).

  37. Wang, J. et al. A 22 nm 29.3 TOPS/W end-to-end CIM-utilization-aware accelerator with reconfigurable 4D-CIM mapping and adaptive feature reuse for diverse CNNs and transformers. In Proc. IEEE Custom Integrated Circuits Conference (CICC) 1–3 (IEEE, 2025).

  38. Lou, M. et al. Area-efficient and low-power 8T compute-SRAM bitcell design for digital compute-in-memory macros in 22 nm CMOS. IEEE Trans. Circuits Syst. II Express Briefs 72, 1459–1463 (2025).

  39. Lu, A. et al. High-speed emerging memories for AI hardware accelerators. Nat. Rev. Electr. Eng. 1, 24–34 (2024).

  40. Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R. & Eleftheriou, E. Memory devices and applications for in-memory computing. Nat. Nanotechnol. 15, 529–544 (2020).

  41. Chiu, Y.-C. et al. A CMOS-integrated spintronic compute-in-memory macro for secure AI edge devices. Nat. Electron. 6, 534–543 (2023).

  42. Yang, Z. et al. A novel computing-in-memory platform based on hybrid spintronic/CMOS memory. IEEE Trans. Electron Devices 69, 1698–1705 (2022).

  43. Sayadi, L., Amirany, A., Moaiyeri, M. H. & Timarchi, S. Balancing precision and efficiency: an approximate multiplier with built-in error compensation for error-resilient applications. J. Supercomput. 81, 109 (2025).

  44. Rezaei, M., Amirany, A., Moaiyeri, M. H. & Jafari, K. A reliable non-volatile in-memory computing associative memory based on spintronic neurons and synapses. Eng. Rep. 6, e12902 (2024).

  45. Angizi, S., He, Z., Chen, A. & Fan, D. Hybrid spin-CMOS polymorphic logic gate with application in in-memory computing. IEEE Trans. Magn. 56, 3400215 (2020).

  46. Tong, Z. et al. BSTCIM: a balanced symmetry ternary fully digital in-mram computing macro for energy efficiency neural network. IEEE Trans. Circuits Syst. Regul. Pap. 71, 6114–6127 (2024).

  47. Mazaheri, M. M., Amirany, A. & Moaiyeri, M. H. TPCSA-MRAM: ternary precharge sense amplifier-based MRAM. IEEE Access 12, 132817–132824 (2024).

  48. Rajaei, R. & Amirany, A. Nonvolatile low-cost approximate spintronic full adders for computing in memory architectures. IEEE Trans. Magn. 56, 3400308 (2020).

  49. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).

  50. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images (University of Toronto, 2009).

  51. Xu, S. et al. A practical approach to flow field reconstruction with sparse or incomplete data through physics informed neural network. Acta Mech. Sin. 39, 322302 (2023).

  52. Chandrakasan, A. P. & Brodersen, R. W. Minimizing power consumption in digital CMOS circuits. Proc. IEEE 83, 498–523 (1995).

  53. Natarajarathinam, A., Zhu, R., Visscher, P. B. & Gupta, S. Perpendicular magnetic tunnel junctions based on thin CoFeB free layer and Co-based multilayer synthetic antiferromagnet pinned layers. J. Appl. Phys. 111, 07C918 (2012).

  54. Song, J., Dixit, H., Behin-Aein, B., Kim, C. H. & Taylor, W. Impact of process variability on write error rate and read disturbance in STT-MRAM devices. IEEE Trans. Magn. 56, 3400411 (2020).

  55. Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).

  56. Deng, L. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process Mag. 29, 141–142 (2012).

  57. Yang, J. et al. TIMAQ: a time-domain computing-in-memory-based processor using predictable decomposed convolution for arbitrary quantized DNNs. IEEE J. Solid-State Circuits 56, 3021–3038 (2021).

  58. Jain, S., Lin, L. & Alioto, M. ±CIM SRAM for signed in-memory broad-purpose computing from DSP to neural processing. IEEE J. Solid-State Circuits 56, 2981–2992 (2021).

  59. Yoshioka, K. A 818–4,094 TOPS/W capacitor-reconfigured CIM macro for unified acceleration of CNNs and transformers. In Proc. IEEE International Solid-State Circuits Conference (ISSCC) 574–576 (IEEE, 2024).

Acknowledgements

This work was supported by the National Key R&D Program of China (grant number 2022YFB4400200 to T.M.), the National Natural Science Foundation of China (grant numbers 62274081 to L.L. and 12327806 to Z.C.), Zhujiang Young Talent Program (grant number 2023QN10X177 to L.L.) and Shenzhen Stable Support Plan Program for Higher Education Institutions Research Program (grant number 20231121110457002 to L.L.). We acknowledge the SUSTech SME-Pixelcore Neuromorphic In-sensor Computing Joint Laboratory and the SUSTech SME-CIMCube Joint Laboratory for experimental support in this work.

Author information

Authors and Affiliations

Contributions

L.L. and T.M. conceived and supervised the project. H. Li, R.P. and L.L. designed the circuits for the nvDCIM macro and test chip. H. Li, W.D. and J.H. performed the training, quantization and inference of the NNs and implemented the toggle-rate-aware training algorithm. H. Li, Z.C., S.L., Z.K., X.Y., X.W., Z.Y., H. Lyu, H.Y. and X.Z. performed the experiments, including device characterization, building the chip-testing platform and chip testing. H. Li, Z.C., J.L., F.Z., Y.L., Z.X., T.M. and L.L. analysed the data. H. Li, Z.C., T.M. and L.L. wrote the paper. All authors reviewed and approved the paper.

Corresponding authors

Correspondence to Tai Min or Longyang Lin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Electronics thanks Abdolah Amirany, Esteban Garzón and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 STT-MRAM device.

(a) Vertical structure of MTJ. RL: reference layer, which has a fixed magnetization. FL: free layer, whose magnetization can switch between parallel (P) and antiparallel (AP) orientations relative to the reference layer. HL: hard layer, which possesses strong perpendicular magnetic anisotropy (PMA). (b) Measured I-V curve of the MTJ’s magnetic resistive switching. The SEM image shows the MTJ’s critical diameter of 78 nm. (c) Measured R-V curve of the MTJ’s magnetic resistive switching, showing a clear resistance change between high-resistance (anti-parallel, AP) and low-resistance (parallel, P) states. The TMR ratio is approximately 170% at 0.1 V. TMR: Tunnel Magneto-Resistance, TMR = (RAP – RP) / RP * 100%. (d) Measured distribution of state switching voltages (VAP→P, VP→AP), with a mean VAP→P of 0.461 V (standard deviation σ = 0.029 V, CV = 6.3%) and a mean VP→AP of –0.299 V (σ = 0.020 V, CV = 6.7%). (e) Resistance distributions of the RP and RAP, where the mean RAP is 9199.2 Ω (σ = 480.7 Ω, CV = 5.2%) and the mean RP is 3363.7 Ω (σ = 171.8 Ω, CV = 5.1%). (f) Measured TMR distribution, with a mean of 173.5% (σ = 3.9%, CV = 2.2%). CV: Coefficient of Variation, CV = standard deviation (σ) / mean (μ) * 100%.
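
The TMR and CV figures quoted in this caption can be cross-checked directly from the reported means and standard deviations. A minimal sketch, using only the values stated above:

```python
# Cross-check of the TMR and CV values reported for the MTJ above.
R_AP = 9199.2        # mean antiparallel-state resistance (ohm)
R_P = 3363.7         # mean parallel-state resistance (ohm)
SIGMA_R_AP = 480.7   # standard deviation of R_AP (ohm)

def tmr_percent(r_ap, r_p):
    """Tunnel magnetoresistance ratio: TMR = (R_AP - R_P) / R_P * 100%."""
    return (r_ap - r_p) / r_p * 100.0

def cv_percent(sigma, mean):
    """Coefficient of variation: CV = sigma / mean * 100%."""
    return sigma / mean * 100.0

print(round(tmr_percent(R_AP, R_P), 1))        # 173.5, matching the measured mean TMR
print(round(cv_percent(SIGMA_R_AP, R_AP), 1))  # 5.2, the reported CV of R_AP
```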

Extended Data Fig. 2 MVM timing diagram, test chip architecture, and test flow chart.

(a) The nvDCIM test chip architecture featuring the nvDCIM macro, integrated on-chip buffers, a clock generator, and SPI interfaces for data transfer. (b) nvDCIM chip test flow chart. (c) MVM timing diagram illustrating bit-serial processing for two examples: 4-bit unsigned weight with 4-bit signed input and 8-bit signed weight with 4-bit unsigned input.
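
The bit-serial processing illustrated in (c) can be sketched in software. This is an illustrative decomposition, not the macro's circuit: only the input is serialized into bit planes (weights are applied at full precision for brevity), with the most significant plane carrying a negative weight so that two's-complement inputs are handled losslessly:

```python
import numpy as np

def bitserial_mvm(W, x_enc, x_bits=4):
    """MVM with a signed bit-serial input: one input bit plane per cycle;
    the MSB plane is weighted by -2^(x_bits - 1) (two's complement)."""
    acc = np.zeros(W.shape[0], dtype=np.int64)
    for b in range(x_bits):
        plane = (x_enc >> b) & 1                 # current input bit plane
        sign = -1 if b == x_bits - 1 else 1      # MSB carries negative weight
        acc += sign * (1 << b) * (W @ plane)
    return acc

rng = np.random.default_rng(0)
W = rng.integers(-128, 128, size=(4, 16))        # signed 8-bit weights
x = rng.integers(-8, 8, size=16)                 # signed 4-bit inputs
x_enc = x & 0xF                                  # 4-bit two's-complement encoding
assert np.array_equal(bitserial_mvm(W, x_enc), W @ x)  # lossless vs. direct MVM
```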

Extended Data Fig. 3 Experimental and measurement platform for evaluating the nvDCIM chips.

(a) The experimental platform consists of an nvDCIM test chip, a PCB test board, and a National Instruments (NI) PXIe system, including the PXIe-6570 and PXIe-8881 modules, which handle chip control, intermediate data processing, and result visualization. Additionally, two source/measurement units (SMUs) are included in the platform for power measurements. (b) A flowchart illustrating the inference process conducted on the experimental platform. During the execution, the 64-kb nvDCIM macro performs parallel and lossless MVM operations across 4-, 8-, 12-, and 16-bit precisions for convolutional and fully connected layers. Input vectors and matrix data are supplied to the nvDCIM macro via the NI PXIe-6570 controlled through LabVIEW, which also retrieves the MVM results. Beyond this, the PXIe-6570 implements ReLU activations, pooling, Tanh activations, and batch normalization. The PXIe-8881 processes and displays the final inference results. The system supports a range of computational tasks, such as low computational precision tasks (for example, image classification with CNNs) and high computational precision tasks (for example, flow field reconstruction using PINNs).

Extended Data Fig. 4 Power and area breakdown.

(a) Area breakdown of the main macro components. (b) Power consumption breakdown of the nvDCIM macro components measured during the inference of the ResNet-20 model on the CIFAR-10 dataset.

Extended Data Fig. 5 Measured shmoo plot, energy efficiency, and test results across 24 nvDCIM chips.

(a) Measured shmoo plot of the nvDCIM macro showing the relationship between supply voltage (VDD) and maximum clock frequency (fCLK) while operating in 4-bit-input, 4-bit-weight, and 16-bit-output mode (F, fail; P, pass). (b) Measured energy efficiency of the nvDCIM macro versus VDD in 4-bit-input, 4-bit-weight, and 16-bit-output mode, when the weight sparsity is 50% and the input toggle rate ranges from 50% to 6.25%. (c) Wafer map showing 12 selected shots (highlighted in blue), with the Z-pattern sampling. (d) Photograph of the fabricated 12-inch wafer, showing the positions of the selected shots, corresponding to the Z-pattern used in (c). (e) Measured throughput (TOPS) distribution at VDD = 1.20 V, with a mean of 4.44 TOPS (standard deviation σ = 0.10 TOPS, CV = 2.3%). (f) Measured throughput (TOPS) distribution at VDD = 0.65 V, with a mean of 0.64 TOPS (σ = 0.03 TOPS, CV = 4.7%). (g) Measured energy efficiency (TOPS/W) distribution at VDD = 1.20 V, with a mean of 40.1 TOPS/W (σ = 1.97 TOPS/W, CV = 4.9%). (h) Measured energy efficiency (TOPS/W) distribution at VDD = 0.65 V, with a mean of 86.0 TOPS/W (σ = 8.82 TOPS/W, CV = 10.3%). CV: Coefficient of Variation, CV = standard deviation (σ) / mean (μ) * 100%.

Extended Data Fig. 6 PINN model quantization and performance.

Flow field reconstruction with PINN: a comparison of predictions for u (streamwise velocity), v (spanwise velocity) and p (pressure) at varying computational precision levels, benchmarked against computational fluid dynamics (CFD) data (ref. 32). (a) CFD benchmark data. (b) Predictions from the FP32 PINN model. (c) Predictions from the INT16 PINN model. (d) Predictions from the INT12 PINN model. (e) Predictions from the INT8 PINN model. (f) Predictions from the INT4 PINN model. (g) Relative L2 norm (RL2): temporal RL2 for u with different bit precisions and the overall RL2 for u. (h) Temporal RL2 for v with different bit precisions and the overall RL2 for v. (i) Temporal RL2 for p with different bit precisions and the overall RL2 for p.

Extended Data Fig. 7 Energy model construction.

In neural network workloads, convolution layers are mapped onto the nvDCIM hardware to perform efficient MVM operation, with feature maps and kernels assigned to input drivers and memory banks. Dynamic energy consumption is calculated by aggregating the energy costs of IBMD-bitcells, input drivers, and adders based on precomputed energy look-up tables. This model enables an evaluation of both model accuracy and energy consumption across convolutional and fully connected layers.
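
The look-up-table aggregation described above can be sketched as follows. The component names mirror the text, but the per-operation energies are placeholder values, not the paper's calibrated tables:

```python
# Illustrative LUT-based dynamic-energy model. Per-operation energies (fJ)
# are assumed placeholder values, not measured figures from the paper.
ENERGY_LUT_FJ = {
    "ibmd_bitcell": 0.5,   # one bitcell multiply-and-digitize
    "input_driver": 1.2,   # one input-line toggle
    "adder": 2.0,          # one addition in the adder tree
}

def layer_energy_fj(n_bitcell_ops, n_driver_toggles, n_adder_ops,
                    lut=ENERGY_LUT_FJ):
    """Dynamic energy of one mapped layer: sum of count x per-op energy."""
    return (n_bitcell_ops * lut["ibmd_bitcell"]
            + n_driver_toggles * lut["input_driver"]
            + n_adder_ops * lut["adder"])

# Counts would be derived from a conv layer's MVM workload after mapping.
print(layer_energy_fj(64_000, 8_000, 16_000))  # 73600.0
```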

Extended Data Fig. 8 Toggle-rate-aware training results.

(a) Relationship between input toggle rate and accuracy for LeNet-5 INT4 model (dataset: MNIST). Increasing the regularization factor λ reduces the toggle rate while slightly impacting accuracy, demonstrating a tradeoff. (b) Energy efficiency improvement for LeNet-5. Reduction in toggle rate leads to notable energy efficiency gains, as illustrated by energy model estimations and chip-level measurements. (c) Relationship between input toggle rate and accuracy for ResNet-20 INT8 model (dataset: CIFAR-100). Similar to LeNet-5, increasing λ reduces the toggle rate with minimal accuracy degradation. (d) Energy efficiency improvement for ResNet-20 (dataset: CIFAR-100). Larger λ values result in more energy efficiency improvements.
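
The λ tradeoff shown in (a) and (c) corresponds to adding a toggle-rate penalty to the task loss during training. A minimal sketch of such a regularizer (my own formulation for illustration; the paper's exact definition may differ):

```python
import numpy as np

def toggle_rate(bit_planes):
    """Fraction of input bits that flip between consecutive bit planes."""
    flips = np.abs(np.diff(bit_planes, axis=0))
    return float(flips.mean())

def total_loss(task_loss, bit_planes, lam):
    """Task loss plus a lambda-weighted toggle-rate penalty; larger lam
    pushes training towards low-toggle input encodings."""
    return task_loss + lam * toggle_rate(bit_planes)

planes = np.array([[0, 1, 1, 0],
                   [1, 1, 0, 0],
                   [1, 0, 0, 0]])  # three consecutive bit planes on 4 input lines
print(toggle_rate(planes))                 # 3 flips / 8 transitions = 0.375
print(total_loss(0.30, planes, lam=0.1))   # ~0.3375
```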

Extended Data Table 1 Chip summary
Extended Data Table 2 Comparison of nvDCIM with other nvCIM chips

Supplementary information

Supplementary Information

Supplementary Figs. 1–4, Notes 1–5, Table 1 and References.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, H., Chai, Z., Dong, W. et al. A lossless and fully parallel spintronic compute-in-memory macro for artificial intelligence chips. Nat Electron (2025). https://doi.org/10.1038/s41928-025-01479-y

