Introduction

Edge hardware for artificial intelligence (AI) has experienced unprecedented growth, driven by the urgent need to reduce communication latency, improve energy-efficiency (EF), strengthen security and safeguard privacy in order to deliver high-quality AI services1,2,3,4. Deploying AI at the edge not only requires high EF to satisfy stringent power constraints but also calls for high parallelism to achieve real-time performance. Memristive computing-in-memory (mCIM)5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46 offers a competent platform for enabling edge intelligence. In particular, its high-density storage allows all weights to reside on the chip9,10,11,12,13, eliminating the off-chip data traffic between on-chip cache memory and off-chip dynamic random-access memory14,15,16,17. It also avoids the large stand-by power required by conventional volatile cache memory.

Previous works13,21,22,23 have made significant progress in designing and prototyping mCIM chips. However, enhancing computing parallelism remains an open research question due to challenges in multiple aspects. First, there is a dilemma between input parallelism and computing accuracy in the analog multiply-and-accumulate (MAC) array. For example, the input parallelism for analog accumulation (NIN) in reported mCIM macros is relatively limited, usually no larger than 1612,20,21,22,27,33. This is because the MAC error increases quickly with NIN due to various device non-idealities21,27,36, such as the IR drop caused by the large summation current, the non-negligible high-resistive-state (HRS) leakage caused by the limited resistance ratio (R-ratio) between the HRS and the low-resistive-state (LRS), and the accumulation of device-to-device variations in the computing results. Some previous works that overcome the NIN limitation, either by co-optimizing memristive devices for large R-ratios41 and high resistances19 or by co-designing with noise-tolerant algorithms56, are summarized in Supplementary Table 1. Secondly, increasing the output parallelism (NOUT) is hindered by the large hardware cost of the analog readout circuitry. As the readout precision increases, the energy and area costs of high-precision analog readout circuits increase exponentially47. In addition, extra analog processing circuits21,22,31,32,48,49 are required to weight-and-combine (WAC) the analog signals for multi-bit MAC computation. The energy and area consumption of these circuits limits the maximum number of output ports that can be placed within a single macro. Lastly, integrating multiple macros can increase the parallelism at the system level, making it important to develop energy-saving techniques for scalable CIM systems under a limited power budget.
Previous works have investigated model-to-hardware mapping strategies and dataflow control for multi-macro systems23,41,57,58. In particular, prior architectural studies have demonstrated that mixed-precision approaches can play a vital role in enabling highly efficient mCIM chips57,58. However, hardware-centric studies of mixed-precision multi-macro mCIM with low hardware cost and experimental validation remain challenging. On the other hand, near-threshold (NVT) computing is another computing paradigm that can potentially play a key role in edge devices. It effectively reduces power consumption by lowering the supply voltage (VDD), and the resulting performance loss can be regained by improving the system parallelism through additional computing components50,51,52. Despite these advantages, a long-standing challenge for NVT computing is its high sensitivity to process variations, which can lead to unpredictable performance loss and reliability issues. The possibility of combining NVT computing with mCIM remains unexplored, partly because the transistors used as selectors in the conventional one-transistor-one-resistor (1T1R) memristive array need to be fully activated to read out the memristive resistance properly53.

In this article, we combine the mCIM and NVT computing paradigms for highly energy-efficient and parallel computing. Specifically, we propose an NVT-operated two-transistor-one-resistor (2T1R) cell array. The NVT-operated transistors modulate the cell current more strongly and thereby increase the effective R-ratio of the memristors. However, NVT operation also increases the variability of the cell current. We therefore propose a mismatch-canceling scheme that exploits the memristor's variations to compensate for the transistor's threshold-voltage (VTH) mismatch. We also propose an analog-to-digital converter (ADC) that performs data conversion and analog WAC by reusing a single set of capacitors, reducing the cost of analog readout and thereby leaving more room for greater parallelism. Last but not least, we propose an inter-macro control scheme that uses different WAC and quantizing approaches in different macros to lower the task-level inference power of a multi-macro system, extending its scalability under stringent energy constraints.

On this basis, we have verified a 1-Mb 16-macro NVT mCIM engine on silicon. The experiments demonstrate highly efficient and parallel analog computing over a large number of input channels (256) with a small average relative standard deviation of 2.4%. The chip delivers a peak normalized throughput (TP) of up to 10.49 tera-operations per second (TOPS) and achieves an optimal EF of 55.21 to 88.51 TOPS per watt (TOPS/W). It also demonstrates a higher level of computing parallelism than earlier works. This work demonstrates a Mb-level NVT mCIM engine, highlighting the promise of combining mCIM and NVT computing for next-generation edge AI hardware with low power and high parallelism.

Results

Overall structure of the computing-in-memory engine

Fig. 1a shows the die photo of the 1-Mb NVT 2T1R mCIM engine, which is fabricated using a standard 180 nm logic process with back-end-of-line (BEOL) integrated memristor cells. The typical cell properties and array-level characterization results can be found in Supplementary Fig. 1. The engine contains 16 NVT mCIM macros, a 96-kb SRAM buffer, an I/O interface, and top control. Each mCIM macro has 64 kb of 2T1R cells, 16 8-bit (b) built-in sample-and-stack ADCs (BSS-ADCs) with pipelined developing and quantizing (PDQ) clampers, memory periphery circuits, and digital circuits for local control and computing. In the computing mode, the entire 2T1R cell array operates in the NVT region, which leads to reduced operational voltage, lowered summation current, and an amplified signal ratio between the HRS and LRS. More details about the circuit implementation of the NVT mCIM engine can be found in Supplementary Fig. 2.

Fig. 1: Proposed 1-Mb 16-Macro NVT mCIM engine and illustrations of the key enabling concepts.

a Die micrograph, macro structure and the NVT operational scheme (scale bar: 1 mm). b VTH-MC scheme that reduces cell current fluctuations by establishing a correlation between transistor VTH and memristor resistance. c BSS-ADC that supports WAC, sampling and quantizing with reduced energy/area overhead and latency compared to conventional designs. d IM-HC that sets different macros independently to the HP or HE mode to process precision- or energy-oriented workloads.

The proposed NVT mCIM engine has the following features. Firstly, Fig. 1b conceptually shows the threshold-voltage mismatch canceling (VTH-MC) programming scheme. While the superposition of memristor variation and transistor VTH mismatch tends to exacerbate the cell current fluctuations, the VTH-MC scheme inversely correlates the two variation sources to reduce the overall variability of the cell currents. Secondly, Fig. 1c illustrates the proposed BSS-ADCs. While conventional macros perform WAC, sampling, and quantizing with separate components, the proposed BSS-ADC carries out all three functions using its internal capacitive array, reducing the energy and area costs. Moreover, its energy and area overheads for WAC operations increase only linearly with the WAC precision (NWAC), rather than exponentially as in conventional approaches. Besides, by pipelining the developing and quantizing phases, the proposed PDQ clamper reduces the overall latency. Lastly, as shown in Fig. 1d, different from conventional multi-macro CIM systems that apply the same accumulation and quantizing approach for all macros, the proposed inter-macro hybrid control (IM-HC) scheme lowers the parallel processing power without compromising inference accuracy by mixing the high-precision (HP) and high-efficiency (HE) modes for precision- and efficiency-oriented workloads, respectively. These key concepts at different levels, from array and macro to system, collaboratively enable the large-scale NVT mCIM engine.

2T1R cell array with mismatch canceling

We propose the NVT 2T1R cell array to combine NVT computing with the memristive array. Unlike the conventional 1T1R cell with one IO transistor (T1) and one memristor, the proposed 2T1R cell incorporates one additional core transistor (T2), as shown in Fig. 2a. In the memory operation, T1 is used as the selector for read and write through the source-line and bit-line (BL), like in the conventional 1T1R cell, and T2 is disabled by grounding its source and drain. In the computing operation, the gate voltage of T2 is set by the voltage division between T1 and the memristor, and T2 amplifies this signal. By this means, the resistive switching of the memristor is converted into a threshold-voltage shift of the 2T1R cell, which provides strong current modulation and signal amplification.

Fig. 2: 2T1R NVT computing cell properties and the mismatch-canceling scheme.

a Bit-cell structure and its principle. b Measured BL and TBL current versus WL voltage at HRS and LRS. c Measured signal ratio of BL and TBL current. The signal ratio on TBL is improved by more than 120 times compared to that on BL. d Operational flow of the proposed VTH-MC scheme, including the coarse- and fine-tuning steps. e Measurements of TBL current versus BL voltage on 30 2T1R cells in the cell array before and after performing the VTH-MC scheme. f Comparison of the mean and standard deviation of TBL current with different programming schemes, including one-shot, coarse-tuning, fine-tuning and VTH-MC. RT1, resistance of T1; Rcell, resistance of memristor; ITARGET, target cell current.

By operating T2 in the NVT region, the 2T1R cell can exponentially modulate the transpose bit-line (TBL) current (ITBL) as a function of the memristor resistance (RCELL). Fig. 2b compares the BL current (IBL) and ITBL as a function of the word-line (WL) voltage (VWL). It shows that the HRS and LRS can be differentiated without fully activating T1 in the 2T1R cell, because T1 is used as a voltage divider. Compared with the 1T1R counterpart, the 2T1R cell achieves a more than 98% reduction in on-state cell current and a 47% reduction in VWL. Furthermore, by optimizing the WL and BL biasing, the HRS-to-LRS current ratio of ITBL can be improved by more than 120 times compared to that of IBL, as shown in Fig. 2c. Consequently, the 2T1R cell can effectively suppress the analog MAC errors caused by a small signal ratio. More details about the NVT computing method of the 2T1R cell array and its effect on reducing MAC errors can be found in Supplementary Fig. 3. Additionally, since T2 operates in the saturation region in the activated 2T1R cell, the cell current is insensitive to clamping-voltage fluctuations, enabling the use of a low-cost static clamping scheme. A detailed analysis of the impact of different clamping schemes can be found in Supplementary Fig. 4.
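To build intuition for why the voltage-divider topology amplifies the HRS/LRS signal ratio, the following sketch models a 2T1R-style cell as a resistive divider feeding a subthreshold transistor. All parameter values (RT1, VTH, subthreshold slope factor, I0), the divider topology (memristor toward the BL, T1 toward ground) and the resulting currents are illustrative assumptions, not measured values from the chip.

```python
import math

def tbl_current(r_cell, r_t1=100e3, v_bl=0.9, v_th=0.45, n=1.5, i0=1e-7):
    """Idealized 2T1R cell: T1 and the memristor form a voltage divider, so
    the gate voltage of T2 tracks RCELL; with T2 in the near-threshold
    (subthreshold) region, ITBL depends exponentially on that gate voltage.
    Parameter values are illustrative, not physical chip data."""
    v_t = 0.026  # thermal voltage at room temperature (V)
    v_gate = v_bl * r_t1 / (r_t1 + r_cell)  # division between T1 and memristor
    return i0 * math.exp((v_gate - v_th) / (n * v_t))

i_lrs = tbl_current(r_cell=10e3)  # low-resistive state -> high gate voltage
i_hrs = tbl_current(r_cell=1e6)   # high-resistive state -> low gate voltage
# The exponential transfer turns a 100x resistance ratio into a much larger
# current ratio, mirroring the signal-ratio amplification described above.
ratio_tbl = i_lrs / i_hrs
```

The linear (BL-side) readout of the same two resistances would be bounded by the 100x R-ratio; the exponential subthreshold transfer is what pushes the TBL-side ratio far beyond it.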

Another critical challenge in realizing a large-scale NVT analog mCIM chip is that cell current variations tend to increase further due to the presence of memristor variations and transistor mismatch. To address this issue, we propose the VTH-MC programming scheme along with the 2T1R cell to cancel out the influences of cell-resistance variations and VTH mismatch. Fig. 2d shows the VTH-MC flow, including the coarse- and fine-tuning steps. The coarse-tuning step searches for the optimal set condition for the target cell current (ITARGET). It performs multiple reset-and-set (R/S) cycles with an incremental compliance current (IC) in the set process until the cell current deviation (ΔI), defined as the difference between ITARGET and the measured ITBL, is less than 30% of ITARGET. The fine-tuning step compensates for the VTH mismatch using the intrinsic cycle-to-cycle (C2C) variations of the memristor. It keeps carrying out R/S cycles using identical voltage pulses under the optimal set condition determined by the previous step, until ΔI is less than 10% of ITARGET. More details about the operational flow of the VTH-MC scheme, the modulation of the cell resistances, and the retention characteristics in the resistance-tuning process can be found in Supplementary Fig. 5. Unlike previous iterative programming schemes, the proposed VTH-MC scheme leverages intrinsic C2C variations in memristors to compensate for transistor mismatch. Consequently, the presence of C2C variation does not necessarily increase the number of programming iterations in the proposed scheme. A comparison with several previous iterative programming methods15,41,43,59 can be found in Supplementary Table 2. The dependency of the proposed 2T1R cell current on temperature is shown in Supplementary Fig. 6, which provides a guideline for mitigating the influence of temperature by adjusting the operational voltages or the readout circuits of the mCIM chip.
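As a rough behavioral illustration of the coarse/fine flow described above, the sketch below programs a toy cell model. The function names, the compliance-current start and step values, and the fixed C2C factor sequence are all hypothetical; only the 30% and 10% ΔI thresholds come from the text.

```python
import itertools

def vth_mc_program(read_cell, apply_reset_set, i_target,
                   ic_start=10e-6, ic_step=10e-6, max_cycles=100):
    """Sketch of the VTH-MC flow: coarse tuning repeats reset-and-set (R/S)
    cycles with an incremented compliance current (IC) until the deviation
    from I_TARGET drops below 30%; fine tuning then repeats identical pulses
    at the found IC, relying on the memristor's intrinsic cycle-to-cycle
    (C2C) variation to cancel the residual mismatch, until below 10%."""
    ic = ic_start
    for _ in range(max_cycles):                 # coarse tuning
        apply_reset_set(ic)
        if abs(i_target - read_cell()) < 0.30 * i_target:
            break
        ic += ic_step
    for _ in range(max_cycles):                 # fine tuning
        if abs(i_target - read_cell()) < 0.10 * i_target:
            return True
        apply_reset_set(ic)                     # identical pulses; rely on C2C
    return False

# Toy cell: each R/S cycle lands at IC times a C2C factor drawn from a fixed
# repeating sequence (a deterministic stand-in for random C2C variation).
c2c = itertools.cycle([1.1, 0.95, 1.05, 0.9, 1.15, 0.85, 1.0, 1.08])
cell = {"i": 0.0}
def apply_rs(ic):
    cell["i"] = ic * next(c2c)
def read():
    return cell["i"]

converged = vth_mc_program(read, apply_rs, i_target=50e-6)
```

Note how the fine loop never changes the set condition: a favorable C2C draw does the final trimming, which is why, as stated above, C2C variation does not necessarily increase the iteration count.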

Fig. 2e shows ITBL versus VBL of 30 2T1R cells at LRS and HRS using different programming schemes; a clear memory window can be observed between HRS and LRS in both cases. The relative standard deviation (σ/μ) of the on-state ITBL is lowered from 41.5 to 5.43% after applying the VTH-MC scheme. Fig. 2f further compares the influences of the different programming steps in terms of the mean and standard deviation (σ) of the ITBL distribution at LRS. The coarse-tuning step lowers ΔI, and the fine-tuning step further suppresses σ, although the trimming range of the fine-tuning step is limited. The combination of coarse- and fine-tuning steps leads to a 98% reduction in ΔI and an 88% reduction in σ compared to conventional one-shot programming with fixed biasing conditions.

Efficient analog processing and readout circuits

We propose the BSS-ADC with a PDQ clamper to reduce the costs of analog processing and readout, as shown in Fig. 3a. The PDQ clamper consists of a cascode current mirror and three n-type transistors (N1, N2, and N3): N1 for voltage clamping (VCLP), N2 for current sampling (SAM), and N3 for voltage reset (RST). It is worth noticing that incorporating N2 allows the BSS-ADC to perform quantization without being influenced by current changes in the array after sampling. The BSS-ADC consists of a binary-weighted digital-to-analog converter (DAC) array controlled by the common-mode voltage (VCM) and the high and low reference voltages (VREF_H and VREF_L), a comparator, and modified successive approximation register (SAR) logic. Unlike a conventional SAR ADC, two additional switches (SW1 and SW2) are introduced at the bottom plate and top plate of the MSB capacitor (CMSB) in the DAC array. With these two switches, the BSS-ADC can perform the WAC operation on partial charges at a small additional area overhead.

Fig. 3: Structure and operations for proposed BSS-ADC.

a Schematic of the BSS-ADC with a PDQ clamper. b Operational waveforms of the different readout phases. The time proportion of each phase is also labeled in the figure. c Illustration of the BSS-ADC configurations in the sampling and quantizing phases. The analog WAC is achieved by capacitor charge stacking. d Impact of the PDQ clamper on the overall ADC readout latency for different numbers of cells on the TBL. e Comparison of the power, latency and EDP for the WAC operation between the conventional approaches (ratioed CMs or capacitors) and the proposed capacitor charge stacking.

The readout process of the BSS-ADC, with and without WAC, is illustrated by its waveforms (Fig. 3b) and configurations (Fig. 3c). A typical readout process can be divided into the current developing phase (P0), the sampling phase (P1) and the quantizing phase (P2). In the standard readout process without WAC, the BSS-ADCs operate independently. In P0, the SAM and RST switches of the PDQ clamper are kept off while ITBL develops, which allows P0 and P2 to be pipelined in different clock (CLK) cycles to reduce the overall latency. In P1, SW1 is turned off and SW2 is turned on; RST and SAM are sequentially activated to reset the voltage on the top plates and to directly sample ITBL onto the DAC array, respectively. In P2, the comparator is triggered to quantize the sampled voltage (VSAM) through the asynchronous SAR logic into the 8-bit CIM outputs (COUT[7:0]). In the readout process with WAC, four BSS-ADCs operate as a group. The operation is similar to the standard one except for the following differences. In P1, both SW1 and SW2 are turned on after sampling, which combines the four sampled voltages (VSAM0 ~ VSAM3) of the different ADCs (ADC0 ~ ADC3) by charge stacking; the resulting voltage on ADC3 becomes VSAM3/2 + VSAM2/4 + VSAM1/8 + VSAM0/16. Notice that, in order to scale this combined voltage into the same ADC input range, the sampling time of the BSS-ADC with WAC is reduced to half of that without WAC. In P2, SW2 is turned off to hold the combined voltage, and only the comparator in ADC3 is triggered to generate the 7-bit CIM outputs (COUT[7:1]).
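The charge-stacking arithmetic can be checked with a few lines of Python. The helper name is ours; the binary weighting follows the combined-voltage expression stated above, with VSAM3 sampling the most-significant weight row.

```python
def wac_charge_stack(v_sams):
    """Combined voltage after the BSS-ADC group's charge stacking.
    v_sams = [VSAM0, VSAM1, VSAM2, VSAM3]; per the readout description the
    result is VSAM3/2 + VSAM2/4 + VSAM1/8 + VSAM0/16, i.e. a binary-weighted
    combination of the per-ADC samples."""
    n = len(v_sams)
    return sum(v / 2 ** (n - k) for k, v in enumerate(v_sams))

# Scaling the stacked voltage by 2**4 recovers the multi-bit partial MAC
# value 8*VSAM3 + 4*VSAM2 + 2*VSAM1 + VSAM0, which is why a single
# quantization of the stacked node suffices for a 4-bit weight.
```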

Fig. 3d shows the performance evaluation of the PDQ clamper. By pipelining the P0 and P2 phases, the overall readout latency can be reduced by 11.2 to 30.8% depending on the array size. The latency reduction is maximized when the durations of P0 and P2 are similar. Fig. 3e further compares different WAC schemes in terms of WAC power, latency and energy-delay product (EDP). More details about the comparison between different WAC schemes31,48,60 can be found in Supplementary Fig. 7. The conventional ratioed-current-mirror (CM) approach tends to incur large WAC power consumption due to the additional current-duplication branches. The conventional charge-sharing approach shows increased WAC power and latency compared to the proposed charge-stacking approach because it usually requires redundant compensation capacitors to form the sampling capacitors. Consequently, the proposed charge-stacking approach achieves 97% and 44% reductions in WAC EDP compared to the previous approaches based on ratioed CMs and capacitor charge sharing, respectively. Supplementary Note 1 provides additional details on the latency and power data of Fig. 3d, e.

Scalable multi-macro-based computing-in-memory system

The inter-macro collaborative strategy to lower system power is important for enabling a highly scalable mCIM system. Leveraging the features of the BSS-ADC, we propose the IM-HC scheme, which hybridizes different working modes, namely the HP and HE modes, for different workloads in AI models. The dataflows for performing the 4-bit integer (INT4) MAC operation are shown in Fig. 4a. In both HP and HE modes, the INT4 input (IN) data are sent bit-wise to the BLs of the cell array over 4 computing cycles, while the INT4 weight (W) data are mapped onto four different rows. In the HP mode, the WAC function is disabled and each BSS-ADC works independently. Two levels of digital shift-and-adders (DSAs) perform the shift-and-add operations: the first level combines the partial sums output by the different rows for the INT4 W, while the second combines the partial sums generated over the 4 computing cycles for the INT4 IN. In the HE mode, the WAC function of the BSS-ADC is enabled, allowing the partial sum signals on different rows to be combined in the analog domain for the INT4 W; only one level of DSA is then needed to combine the partial sums for the INT4 IN.
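A minimal, idealized model of the two dataflows (ignoring ADC quantization error, with unsigned INT4 operands and illustrative function names) shows that both modes compute the same INT4 MAC; they differ only in where the weight-bit combination happens:

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def int4_mac(inputs, weights, mode="HP"):
    """Idealized HP/HE dataflow sketch for an unsigned INT4 x INT4 MAC.
    Inputs stream bit-serially over 4 cycles; each 4-bit weight is bit-sliced
    across 4 rows of the array. No ADC quantization error is modeled."""
    acc = 0
    for ib in range(4):                           # 4 bit-serial input cycles
        xin = [(x >> ib) & 1 for x in inputs]
        # one binary partial sum per weight-bit row (what the array computes)
        rows = [dot(xin, [(w >> wb) & 1 for w in weights]) for wb in range(4)]
        if mode == "HP":
            # HP: each row digitized separately; DSA level 1 combines the
            # weight bits digitally
            cycle = sum(ps << wb for wb, ps in enumerate(rows))
        else:
            # HE: rows weighted-and-combined in the analog domain by the
            # BSS-ADC charge stacking, then digitized once per group
            cycle = round(sum(ps * 2 ** wb for wb, ps in enumerate(rows)))
        acc += cycle << ib                        # DSA combines input cycles
    return acc

rng = random.Random(1)
x = [rng.randrange(16) for _ in range(256)]       # 256 accumulation channels
w = [rng.randrange(16) for _ in range(256)]
```

In this idealized setting both modes return exactly dot(x, w); on the chip, the HE mode trades a small quantization error (a single 7-bit output per group) for one conversion instead of four.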

Fig. 4: Illustration of the proposed IM-HC scheme.

a Dataflow of the NVT mCIM macro in the HP and HE modes for the INT4 MAC operation over 256 accumulation channels. The multi-bit weights are processed by the DSAs in the HP mode and by the WAC in the HE mode. b Processing flow of the IM-HC scheme, including the mode selection, range selection, ADC trimming and ADC result scaling steps. The flow is illustrated using a ResNet-8 model trained on the CIFAR-10 dataset as an example. The mode selection step is performed by analyzing the influence of different quantizing methods on the model accuracy layer by layer. The range selection step is carried out by predicting the pMACV distribution in the HE mode across different neural network layers using the training dataset. In the HE mode, the pMACV corresponds to the MAC result of 1-bit inputs and 4-bit weights with an accumulation length of 256.

The operational flow of the IM-HC scheme is shown in Fig. 4b using a modified ResNet-8 model trained on the CIFAR-10 dataset54 as an example. First, mode selection is carried out to identify the different types of workloads. Here, we evaluate the inference accuracy of the model by changing the quantization strategy layer by layer, from layer-1 (L1) to layer-8 (L8), according to the dataflows of the HP and HE modes. The overall accuracy is more sensitive to the quantization precision of L1, L3, L6, and L8 than to that of the other layers. Consequently, the macros for L1, L3, L6, and L8 are configured in the HP mode, whereas those for L2, L4, L5, and L7 are set to the HE mode. Secondly, in order to suppress the accuracy loss, we analyze the typical partial MAC value (pMACV) distribution of the HE-processed workloads. For instance, the pMACV distribution range of L5 is twice that of L2 and hence needs a wider quantization range to preserve the inference accuracy. Thirdly, the quantization ranges of the ADCs in the HE macros are trimmed by modulating their VREF_H and VCM according to the pMACV distribution of the processed layers. Lastly, depending on the quantization range, the ADC outputs are re-scaled by digital shifters to generate the corresponding pMACVs. More details about the ADC range selection and trimming can be found in Supplementary Fig. 8. A detailed explanation of the pMACV distribution evaluation is presented in Supplementary Note 2. The pMACV distributions across multiple layers in ResNet-20, along with the effects of different quantization strategies on model inference accuracy, are presented in Supplementary Fig. 9. An illustration of the mode selection step for the IM-HC scheme can be found in Supplementary Note 3.
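The range-selection and re-scaling steps can be sketched numerically as follows. This is a behavioral model with assumed parameters (a 7-bit HE-mode output and power-of-two range widening), not a description of the ADC circuit itself.

```python
def quantize_pmacv(pmacv, full_scale, n_bits=7, range_shift=0):
    """Behavioral sketch of HE-mode ADC range selection, trimming and result
    scaling: the quantization span is widened by 2**range_shift for layers
    with wider pMACV distributions (trimmed on-chip via VREF_H/VCM), and the
    ADC code is re-scaled by a digital shift so that all layers report
    pMACVs on a common scale. Parameter names are illustrative."""
    levels = 2 ** n_bits - 1
    span = full_scale * (2 ** range_shift)     # range selection / trimming
    code = min(max(round(pmacv / span * levels), 0), levels)
    return code << range_shift                 # digital re-scaling
```

A layer like L5, whose pMACV range is twice that of L2, would use range_shift=1: in-range values quantize to the same re-scaled result as at the base range, while values that would clip at the base range remain representable.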

Performance of the fabricated test-chip

The test-chip is demonstrated on silicon using a 180 nm logic process with BEOL-integrated memristor cells (see Methods for further details). The statistical transfer curves of the test-chip with and without the VTH-MC scheme are measured over 256 rows by sweeping the input codes, as shown in Fig. 5a. Highly linear analog current summation is achieved in both cases. Additionally, the VTH-MC scheme achieves a relative standard deviation of 2.4% on average, 85% lower than without the scheme. The shmoo plot (Fig. 5b) shows that the chip can operate at frequencies from 8 to 80 MHz by changing its VDD. The computing engine delivers a peak normalized TP of 10.49 TOPS at 80 MHz. Fig. 5c further shows the measured normalized EF and TP of the whole chip as a function of frequency (Methods). It achieves an optimal normalized system-level EF of 88.51 TOPS/W in the HE mode, a 60.3% improvement over the HP mode. It is worth mentioning that, for a fair comparison of the efficiency between the HP and HE modes, the EF and TP values reported here are normalized to 1-bit precision operations. Additionally, a detailed power breakdown of the test-chip at peak EF can be found in Supplementary Fig. 10. We have developed a neural network inference system based on the test-chip (see Supplementary Fig. 11 for further details). The inference accuracy and power of the test-chip when processing the ResNet-8 and ResNet-2055 models for classifying the CIFAR-10 dataset are experimentally measured. The inference testing results are shown in Fig. 5d, e. Using the IM-HC scheme instead of the HP-mode-only scheme leads to a marginal decrease in accuracy (1.36 and 1.7% for ResNet-8 and ResNet-20, respectively) but considerably reduces power consumption, by 27.2 and 30%, respectively.
Also, we speculate that more power can be saved as the neural network model scales up, because the proportion of HE-processed workloads tends to increase. More details about neural network processing on the proposed chip are given in the Methods section and Supplementary Fig. 12. The details of model-to-hardware weight mapping and the computing dataflow are shown in Supplementary Fig. 13. In short, the IM-HC scheme can effectively reduce the power consumption of highly parallel multi-macro CIM systems with a negligible accuracy loss. A benchmark of this work can be found in Supplementary Table 3. This work experimentally demonstrates highly linear NVT analog MAC operation over a large number of accumulation channels (NIN ≥ 256) with small variations (σ/μ = 2.4%), as well as a higher level of computing parallelism than earlier works.

Fig. 5: Performance of the fabricated NVT mCIM engine.

a Measured statistical transfer curves over 256 rows w/o and w/ the VTH-MC scheme. b Measured shmoo plot. c Measured system-level EF and TP at different frequencies. d and e Measured inference accuracy (d) and power consumption (e) for processing the ResNet-8 and ResNet-20 models (both trained on the CIFAR-10 dataset) with different inter-macro controlling schemes, including the HP mode only, the HE mode only and the IM-HC scheme. Using the IM-HC scheme instead of the HP-mode-only scheme leads to a marginal decrease in accuracy (1.36% and 1.7% for ResNet-8 and ResNet-20, respectively) but considerably reduces power consumption, by 27.2% and 30%, respectively.

Discussion

We have reported a highly energy-efficient and parallel 1-Mb 16-macro NVT 2T1R mCIM engine enabled by cross-layer co-design. At the array level, the combination of the 2T1R cell and the VTH-MC scheme leads to an NVT memristive array with small cell current, large signal ratio, and suppressed variations. At the macro level, the BSS-ADC with a PDQ clamper effectively reduces the energy and area costs of analog processing and readout while also reducing the overall readout latency. At the system level, the IM-HC scheme reduces the power consumption of multi-macro mCIM systems with a negligible accuracy loss by configuring the macros in different modes. The highly parallel and linear analog MAC operations in the NVT region, with a small relative standard deviation of 2.4%, are statistically characterized. The whole chip delivers a peak normalized TP of up to 10.49 TOPS and achieves an optimal energy efficiency of 55.21 to 88.51 TOPS/W. This work demonstrates a path to overcome the long-standing variation challenges faced by the mCIM and NVT computing paradigms, opening up the design space for next-generation edge AI hardware.

Methods

Test-chip fabrication

In this work, the memristive chip has a one-polysilicon-layer, six-metal-layer (1P6M) structure. It is fabricated using a standard 180 nm logic process in a commercial foundry and an in-house-developed memristor technology in a pilot process line. The layers up to and including the fifth via-layer (V5) are fabricated by the foundry process. The layers above V5, including the memristor cells, the sixth metal layer (M6), and the passivation layer (PA), are fabricated in the pilot line. The memristor cell is integrated between the bottom via connected to the fifth metal layer (M5) and the top via connected to M6. Each memristor cell consists of a titanium nitride (TiN) bottom-electrode layer, a transition-metal-oxide-based resistive switching layer, and a TiN top-electrode layer.

Current-voltage characterization of the memristive cells

The test-chip is packaged using chip-on-board technology and mounted on a customized 256-pin universal printed circuit board (PCB) with a socket (Yamaichi NP89-44111). A Keysight N6705B DC power analyzer is used to provide the supply voltages (VDD) for the customized PCB. The working modes and the memory address can be selected by connecting the corresponding pins either to ground or to VDD. For I-V measurement, the test-chip is set to a test mode so that selected memristive cells in the cell array can be directly accessed. Then, to precisely measure the sub-threshold characteristics, the I-V curves of the memristor cells inside the cell array are measured by connecting the corresponding pins to the supply/measurement units of a standard semiconductor analyzer (Keysight B1500A).

Chip measurements with the automated test equipment

To probe the chip functions statistically, we built a customized load board for the test-chip on an ATE system (YTEC S100). The checkerboard memory testing, transfer-curve measurements and shmoo testing in this work are performed by the ATE. The shmoo test is carried out by measuring the functional CLK frequency while varying VDD. The resolutions of VDD and the CLK frequency are set to 0.02 V and 8 MHz, respectively.

Demonstration for deep neural network inference

We develop an inference system based on the fabricated test-chip to demonstrate deep neural network inference, as shown in Supplementary Fig. 11. The system consists of a personal computer (PC) as the host, a Xilinx ZCU106 Field Programmable Gate Array (FPGA) main board for local control, and an expansion board to accommodate the test-chip. On the host PC, a user interface with different operation modules has been developed using LabVIEW. These modules support the test-chip in different tasks, such as weight mapping, serial peripheral interface configuration and parallel MAC operations. The main board communicates with the PC through a PCIe module, allowing the exchange of control signals and input/output data between the host PC and the test-chip. The expansion board includes the power modules and the DACs that generate the proper power supply and biasing for the test-chip.

Implementations of ResNet models

For the CIFAR-10 image classification tasks, two ResNet models are employed on chip for demonstration, including the standard ResNet-20 model and a modified ResNet-8 model. The CIFAR-10 dataset consists of 50,000 training images and 10,000 testing images, each belonging to one of ten object classes. The modified ResNet-8 includes seven convolutional layers and one fully connected layer, with batch normalizations and ReLU activations between the layers. The ResNet-20 includes 19 convolutional layers and one fully connected layer, also with batch normalizations and ReLU activations between the layers. These models are trained using the PyTorch framework. The inputs and weights of all convolutional and fully connected layers are quantized to a 4-bit fixed-point format. We merge the batch normalization parameters into convolutional weights and biases after training. All the parameters of the ResNet-8 are mapped on a single NVT mCIM engine, including those of the convolutional layers and the fully connected layers, while all the parameters of the ResNet-20 are mapped on two NVT mCIM engines. Other operations, including shift-and-add, accumulation, dummy output subtraction, shortcut, activation, and pooling are implemented on a Xilinx ZCU106 FPGA integrated on the same board as the NVT mCIM engine. These operations contribute only a small fraction of the total computation, and integrating their implementation in digital CMOS would incur negligible overhead. The choice of FPGA implementation provides flexibility during testing and development.
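The batch-normalization merging mentioned above follows the standard folding identity. The sketch below shows it per output channel with scalar parameters; it is a generic formulation, not chip-specific code, and the epsilon value is an assumption.

```python
import math

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Standard batch-norm folding (generic sketch of the step described in
    the text: merging BN parameters into conv weights and biases after
    training). Per output channel:
        scale = gamma / sqrt(var + eps)
        w' = w * scale
        b' = (b - mean) * scale + beta
    so that w'*x + b' == BN(w*x + b) for any input x."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta
```

After folding, only the scaled weights and biases need to be mapped onto the memristive arrays; no separate BN stage is required at inference time.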

To implement the weights of a four-dimensional convolutional layer with dimensions of height (H), width (Wd), input channels (CIN) and output channels (COUT) on two-dimensional memristive arrays, the first three dimensions are flattened into a one-dimensional vector and the bias term (B) of each output channel is appended to the vector. Under the IM-HC scheme, the parameters of a convolutional layer are converted into a conductance matrix of size (HWdCIN + B, COUT). Several model-to-hardware mapping schemes are used to improve TP and facilitate the pipelined processing of multiple inference tasks, as shown in Supplementary Fig. 13, including (1) duplicating weights among different macros for layers (L1-L2 in ResNet-8, L1-L7 in ResNet-20) with small numbers of parameters (NPA) but large input feature maps, to improve the input parallelism; (2) splitting weights between different macros for layers (all layers in ResNet-8 and ResNet-20) whose COUT exceeds the number of output channels of a macro (NOUT), to improve the output parallelism; and (3) dividing weights into different pipeline stages for layers (L5-L7 in ResNet-8, L14-L19 in ResNet-20) whose COUT exceeds NM×NOUT/NP, where NM is the number of macros in the system and NP is the number of pipeline stages.
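The flattening step can be sketched as follows. The function name and the nested-list layout are illustrative; the bias of each output channel is appended as one extra row, per the mapping described above.

```python
def conv_to_matrix(weights, biases):
    """Flatten a conv layer's 4-D weight tensor, indexed [H][Wd][CIN][COUT]
    (nested lists here for self-containment), into the 2-D matrix mapped
    onto the memristive array: the first three dimensions collapse into one
    column per output channel, and the per-channel bias is appended as one
    extra row, giving H*Wd*CIN + 1 rows and COUT columns."""
    h, wd = len(weights), len(weights[0])
    cin, cout = len(weights[0][0]), len(weights[0][0][0])
    mat = [[weights[i][j][k][c] for c in range(cout)]
           for i in range(h) for j in range(wd) for k in range(cin)]
    mat.append(list(biases))
    return mat

# Hypothetical 2x2x3x4 layer: row index of the flat matrix is
# i*(Wd*CIN) + j*CIN + k, matching the flattening order above.
w4d = [[[[i * 1000 + j * 100 + k * 10 + c for c in range(4)]
         for k in range(3)] for j in range(2)] for i in range(2)]
mat = conv_to_matrix(w4d, [9, 9, 9, 9])
```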