Introduction

Machine learning and artificial intelligence based on neural networks (NNs) have shown remarkable capabilities across a wide range of applications, including autonomous driving, weather prediction, speech recognition, and image understanding1,2,3,4. These workloads place substantial demand on accelerators such as graphics processing units, which are well suited to large-scale, parallel multiply-and-accumulate operations. However, the back-and-forth data movement between the physically separated memory and logic units of the conventional von Neumann architecture and the digital data-processing paradigm impose significant limitations on system efficiency5,6. Consequently, there is growing interest in high-efficiency neuromorphic computing hardware (NCH), particularly for intelligent edge devices that can process and store data locally and in an analog manner, akin to the human brain7,8,9.

At the algorithm level, NNs handle weights with effectively unlimited precision, a luxury that NCH cannot afford. To implement NNs at the edge, it is necessary to train and/or infer within device-level nodes that have limited numerical precision. Theoretical simulations have shown that many deep NNs with 8- to 24-bit precision suffer almost no accuracy degradation compared with much higher precision, owing to stochastic rounding schemes and the large number of parameters they usually contain10,11,12,13. On the other hand, excessively low precision (such as <8-bit) may instead lead to performance degradation, particularly in small-sized NNs deployed on edge devices that require high energy efficiency, because each parameter has a greater impact on the overall performance14. Whether a fixed-precision NN is trained directly or obtained by downloading and quantizing a pre-trained NN, devices capable of supporting many distinguishable conductance levels are crucial.

Non-volatile memories, such as floating-gate memories (FGMs)15,16,17,18,19,20, resistive switching memories8,21,22,23,24, phase change memories25,26, and ferroelectric memories27,28,29, have emerged as candidates for NCH. Among them, FGMs are especially promising owing to their non-volatile, charge-based analog storage mode. When utilized as artificial synapses, FGMs exhibit learning rates that align well with those of visual and auditory signals15. Additionally, FGMs offer a large dynamic range and are compatible with standard complementary metal-oxide-semiconductor (CMOS) technology. Furthermore, combining FGMs with emerging two-dimensional (2D) materials to create 2D FGMs holds great promise for highly integrated NCH30,31,32,33, because the atomic thickness of 2D materials affords exceptional gate control and large storage windows, and their van der Waals surfaces facilitate hetero-integration and compatibility with CMOS processes. Nevertheless, the high sensitivity of 2D materials to interfacial states and the defect-related instabilities of dielectrics often result in poor long-term stability, limited endurance, and fewer than one hundred memory states for 2D FGMs31,34,35,36,37,38. This poses a significant challenge for NCH based on 2D FGMs.

Here, we report gate-injection-mode (GIM) 2D FGMs with 8-bit states as candidates for large-scale NCH. Through a coplanar device structure design, the control gate (CG), floating gate, and channel are decoupled, and the stored charges are programmed and erased from the CG through the shared tunneling layer. By adopting a bi-pulse state programming strategy, highly distinguishable (with intervals larger than three times the standard deviation) and stable (with retention times longer than 10,000 s) 8-bit conductance states are achieved at a 3 V programming voltage. This state number, together with the small operation voltage, surpasses other types of nonvolatile memories based on field-effect transistors (FETs), including conventional 2D FGMs, Si-Flash cells, and ferroelectric field-effect transistors (FeFETs). The devices also show a symmetric state programming tendency and good endurance over 10^5 cycles. In addition, 256 fabricated devices exhibit a 94.9% yield, good uniformity, and repeatability. Leveraging the above findings, we then carry out experimental image convolutions and project 38,592 convolutional kernel parameters onto a 9 × 2 device array, with results matching those of simulations well. Finally, we show that fixed-point NNs with 8-bit precision have inference accuracies approaching the ideal values. Our work demonstrates the potential of GIM 2D FGMs for high-performance neuromorphic computing accelerators.

Results

8-bit-precision programming

GIM 2D FGMs with the device structure shown in Fig. 1a were designed to realize numerous distinguishable conductance levels. Here, monolayer/few-layer MoS2, 5-nm Pt, and 8-nm Al2O3 were used as the channel, floating gate (FG), and tunneling/blocking layer, respectively. An individual CG, coplanar with the source and drain terminals, serves as both the charge programming and erasing electrode. Although approximately 22% more area may be required compared with a conventional vertical structure, the coplanar design enables the device to support vertical integration with fewer layers of materials (as analyzed in Supplementary Fig. 1). The detailed fabrication processes can be found in the Methods section. This design has several advantages. First, unlike the vertically overlapped structure of a traditional FGM, the channel, FG, and CG are decoupled into two sections: the channel-Al2O3-FG stack and the CG-Al2O3-FG stack. Hence, the gate programming voltage can be easily regulated by changing the capacitive coupling ratio, which is proportional to the area ratio between the CG and channel (denoted as A4/A0 in the inset of Fig. 1a). Second, a state programming strategy combining two sequential gate voltage pulses with opposite signs can be adopted to de-trap the unstable charges captured in the dielectrics, so that highly stable memory states can be achieved without affecting the channel. Third, state programming is symmetric because of the shared charge tunneling and blocking layer and the identical charge injection and erasing mechanism. These advantages are discussed in detail in the following sections.

Fig. 1: Programming of the GIM 2D FGM.
figure 1

a Device structure of the GIM 2D FGM. The inset shows the top view of the structure; the areas of the channel and gate are denoted as A0 and A4, respectively. MoS2, Pt, and Al2O3 are used as the channel, FG, and tunneling/blocking layer. b Dual-sweep transfer curve showing a large counterclockwise hysteresis loop. It was measured on the device with gate area A4 = 2.31 μm2 and channel width/length of 10.37/1.47 μm (as indicated in the OM image of Fig. 2b). c Two conductance states after programming with −Vtune (deep colors) and without −Vtune (light colors). d, e Schematics and band diagrams of the programming and tuning process. Detailed energy values can be seen in the band alignment diagram in Supplementary Fig. 2. D drain, S source. f 256 states, each sampled for 100 s. The states were programmed using the bi-pulse programming method. g, h The distinguishable neighboring states at different current levels. g Enlarged current-time sampling plots at the corresponding sites denoted in (f) (in blue, red, and purple, respectively). The corresponding histogram plots of the sampled currents are shown in (h). σ is the standard deviation and the fitted curves were obtained by fitting with a normal distribution function. i Benchmarks of the GIM 2D FGM in state number and operation voltage. FeFET ferroelectric field-effect transistor, FGM floating-gate memory. The data are collected from refs. 19,29,30,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73.

The gate-injection mode is evidenced by the counterclockwise hysteresis in the double-sweep transfer curve shown in Fig. 1b. Electrons can be injected into or erased from the FG by applying negative or positive voltages with sufficiently high amplitudes to the gate terminal. The stored charges non-volatilely change the threshold voltage and the conductance of the MoS2 FET. The large memory window (about 78% of the sweep range) results from the high-k Al2O3 dielectric layer, the ultra-thin MoS2 channel, and, most importantly, the high tunneling efficiency enabled by the optimized gate size, which is discussed further below. Theoretically, the memory states, i.e., the conductance states of the FGM, should be stable because of the high energy barrier at the Pt/Al2O3 interface (~4.7 eV, Supplementary Fig. 2)39,40,41,42. Nevertheless, the source-drain current (IDS) decreases immediately after voltage programming, as seen from the light-colored lines in Fig. 1c. This phenomenon is widely observed in 2D FGMs30,43,44 and mainly originates from unstable charges trapped inside the dielectrics during the charge injection/erasing process, which spontaneously de-trap after programming. The same mechanism underlies the well-known bias temperature instability found in many Si-based transistors45, especially those with high-k dielectrics such as Al2O3 that have widely distributed trap states near the conduction band46.

To resolve the above problem, a programming method combining two sequential gate voltage pulses with opposite signs was adopted. Let us use the low-resistance-state programming process as an example to illustrate this strategy (Fig. 1c–e). When a positive programming voltage (Vprog) is applied, the energy band of the Al2O3 tunneling layer is strongly tilted so that a triangle-shaped potential barrier appears (see the first panel of Fig. 1e). Hence, electrons stored in the FG can be erased through Fowler-Nordheim tunneling (FNT, see the first panel of Fig. 1d). The detailed analysis is shown in Supplementary Note 1 and Supplementary Fig. 3. However, some electrons are captured by trap sites inside the tunneling layer during this process (see the second panels of Fig. 1d, e). After Vprog is withdrawn, the trapped electrons de-trap into the FG by thermal activation in a slow relaxation process, which causes IDS to decrease gradually. Note that the subthreshold slope (SS) was nearly unchanged during the relaxation process, implying that the trap states were introduced during device fabrication rather than generated by voltage programming (Supplementary Fig. 4)38. By applying a negative tuning pulse (−Vtune) soon after Vprog (see the third panels of Fig. 1d, e), the relaxation process can be greatly accelerated by de-trapping the trapped electrons into the FG. Nearly all the trapped electrons can be eliminated after an optimized Vtune (see the fourth panels of Fig. 1d, e). As a result, more stable programmed states are attained (stable states in Fig. 1c). The effect of this strategy is obvious and applicable over the whole conductance range, as evidenced by the comparison of time-dependent transfer curves with and without bi-pulse optimization (Supplementary Fig. 6). Temperature-dependent state retention properties were further studied using the Arrhenius equation (Supplementary Figs. 7–9). The greatly decreased activation energy for stored-charge leakage after applying Vtune verifies the de-trapping effect of the bipolar programming strategy47,48.
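As a brief illustration of the Arrhenius analysis referenced above, the sketch below fits temperature-dependent state-decay rates to extract an activation energy; the rate values are hypothetical placeholders, not measured data.

```python
import numpy as np

# Hypothetical decay rates of a programmed state measured at several temperatures;
# the Arrhenius equation gives rate = A * exp(-Ea / (kB * T)).
kB = 8.617e-5                                        # Boltzmann constant (eV/K)
T = np.array([300.0, 325.0, 350.0, 375.0])           # temperatures (K), assumed
rate = np.array([1.2e-4, 4.0e-4, 1.1e-3, 2.7e-3])    # state decay rates (1/s), assumed

# ln(rate) is linear in 1/T with slope -Ea/kB, so a linear fit extracts Ea.
slope, intercept = np.polyfit(1.0 / T, np.log(rate), 1)
Ea = -slope * kB
print(f"activation energy ~ {Ea:.2f} eV")
```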

Using the above programming method, the GIM 2D FGM can have up to 256 distinguishable states (Fig. 1f; the output curves are shown in Supplementary Fig. 10, and the detailed closed-loop programming method and corresponding parameters are given in Supplementary Figs. 11 and 12), which is equivalent to 8-bit precision. The densely distributed states can be distinguished from each other with an over-3σ separation (σ, the standard deviation of a state) between neighboring states (Fig. 1g, h). This state number is comparable to that of advanced commercial Si-Flash cells and unprecedented among previously reported nonvolatile multibit memories based on FETs, including conventional 2D FGMs and FeFETs (Fig. 1i and Supplementary Table 1). Note that most of the state numbers in the compared literature come from continuous voltage programming measurements or current-voltage curves rather than the current-time curves used here, which means that the state stabilities were not well studied. By relaxing the separation criterion to 1σ, a doubled state number of 512 (9-bit precision) can even be achieved (Supplementary Fig. 14). Moreover, the programming voltage can be decreased to 3 V by optimizing the gate size, which is among the lowest reported in the literature (Fig. 1i).
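The 3σ criterion used here can be checked directly from the current-time samples of neighboring states; a minimal sketch with synthetic data standing in for the 100-s traces is given below.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for two neighboring programmed states, each sampled over 100 s.
state_a = rng.normal(loc=10.0e-9, scale=0.05e-9, size=1000)    # read currents (A)
state_b = rng.normal(loc=10.4e-9, scale=0.05e-9, size=1000)

# Neighboring states are counted as distinguishable when their mean separation
# exceeds three times the larger of the two standard deviations.
separation = abs(state_b.mean() - state_a.mean())
sigma = max(state_a.std(), state_b.std())
print("distinguishable (>3σ):", separation > 3 * sigma)
```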

Programmability and reliability

To investigate the programmability of the GIM 2D FGMs, we adopted the device-circuit configuration shown in Fig. 2a. Here, Vprog was applied to a selected gate, a small source-drain bias (VDS) of 0.1 V was applied to the drain terminal while the source was kept grounded, and the two equivalent capacitors of the channel/FG and FG/gate 1 stacks (C0 and C1) were connected in series. In this configuration, the voltage drop on the tunneling layer is Vtunnel = VprogC0/(C0 + C1). As a result, the programming efficiency of a single programming operation is strongly related to the capacitive coupling ratio between the laterally configured equivalent capacitors, that is, the ratio Ci/C0 (i = 1, 2, 3, …) in Fig. 2a. To systematically investigate the gate-area-dependent programmability, we fabricated GIM 2D FGMs with multiple gates of varying area (Fig. 2b, see Supplementary Fig. 15 for the fabrication process and Supplementary Fig. 16 for detailed geometric parameters).
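A short worked example of the capacitive voltage division above, under the assumption that each capacitance scales simply with its overlap area (so that Ci/C0 = Ai/A0):

```python
# Series-capacitor voltage division: V_tunnel = V_prog * C0 / (C0 + C1).
# Assuming each capacitance is proportional to its overlap area (shared dielectric
# and thickness), this reduces to V_tunnel = V_prog / (1 + A1/A0).
def tunnel_voltage(v_prog, area_ratio):
    """area_ratio is A1/A0, the gate area divided by the channel area."""
    return v_prog / (1.0 + area_ratio)

# Example with an area ratio of 0.084 (as used for the devices in Fig. 3):
print(tunnel_voltage(3.0, 0.084))   # ~2.77 V, i.e. ~92% of V_prog across the gate-side oxide
```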

Fig. 2: Performances of the GIM 2D FGMs.
figure 2

a Schematic diagram of the multi-gate GIM 2D FGM and the equivalent circuit when VDS and Vprog are applied to the corresponding terminals. The capacitance (Ci) between a specific gate and the FG is strongly correlated with the gate area. b OM image of the multi-gate device. The channel MoS2 thickness was determined to be about 2.6 nm by atomic force microscopy (see Supplementary Fig. 23), equivalent to four layers of MoS2 (ref. 74). The white-dashed and red-dashed areas denote the floating gate and channel/gate overlapping regions, respectively. The channel’s width/length is 10.37/1.47 μm (see Supplementary Fig. 16 for detailed geometric parameters). Scale bar: 5 μm. c Dual-sweep transfer curves of three selected gates G1, G2, and G3, corresponding to areas A1, A2, and A3 in (b). d Memory window as a function of the area ratio Ai/A0 with a linear fitting curve. Inset: the linear-scale plot. e Vprog during carrier erasing/injection operation as a function of area ratio. The erasing and injection operations were conducted to alternately switch IDS between ~1 μA and ~1 pA. f Retention of 4 exponentially separated states for 10,000 s. The states were read at a source-drain voltage of Vread = 0.1 V. g Endurance performance during 10^5 program/erase cycles. A 5-μs programming pulse and a tuning pulse were applied with a gap of 200 μs in a single programming/erasing operation. Cycle period: 100 ms. The programming schematic is shown as the inset. Vtune = 2 V.

It is worth noting that, because the gates share the same oxide layer and FG, and the capacitance is given by C = εA/(4πkdox), where ε, A, k, and dox are the dielectric constant, effective area, electrostatic force constant, and oxide-layer thickness, respectively, the capacitance ratio Ci/C0 can be directly obtained from the area ratio Ai/A0 (the area ratio between gate i and the channel). As demonstrated in Fig. 2c, d, the dual-sweep transfer curves show a clear dependence of the memory window on the area ratio, with the largest memory window being 10.3 V and the smallest 0.46 V. This difference is a direct result of the area-controlled voltage division on the gate-Al2O3-FG stack. Simulated potential distributions given in Supplementary Fig. 17 show similar results, validating the above analysis. The device behaves more like a transistor, with a steep switch and a negligible memory window, when the area ratio is very large, such as the case with an area ratio of 0.457 in Fig. 2c. Such devices can be implemented as node selectors or activation-function hardware in NNs.

The programming voltage can be decreased while maintaining a large memory window by using a smaller gate area, as shown in Fig. 2e. This dependence agrees well with the simulation results (Supplementary Note 2 and Supplementary Fig. 18). The programming voltage can be as low as 3 V, showing potential for low-power applications. In addition, toward implementing this device as the basic unit of NCH, the ability to update the device’s weights (conductance states) in small increments under the guidance of a backpropagation algorithm is important for on-chip training. This ability was verified by the nearly symmetric state updating in the positive and negative directions, which results from the identical charge injection and erasing mechanism of the coplanar GIM design (Supplementary Fig. 19).

The device also showed stable programmed states for over 10,000 s while maintaining a maximum on-off ratio of over 1 × 10^8 (Fig. 2f). Given the uniform oxide thickness in the channel and gate regions, the device’s retention properties exhibit a clear dependence on the overlap areas between the floating gate and the drain, source, and gate electrodes (Supplementary Figs. 20–22). To further enhance the retention property, an additional blocking layer could be introduced below the source and drain regions to suppress this charge leakage pathway. Good endurance over 10^5 cycles was also observed, demonstrating the reliability required for high-frequency weight-update operations in on-chip training (Fig. 2g).

Repeatability of the 8-bit programming ability

We fabricated 256 devices using a large-scale MoS2 film grown by chemical vapor deposition (CVD) to study the repeatability of the 8-bit programming ability (Fig. 3, see Methods for the detailed fabrication process). The optical microscope (OM) image of the devices is shown in Fig. 3a, in which a typical area ratio is calculated to be 0.084 (see Supplementary Fig. 24 for geometric parameters). Of the 256 devices, 13 were broken, possibly because of discontinuous sites on the large-scale MoS2 film introduced during the material transfer process, resulting in a total yield of 94.9% (243 out of 256 devices). Apart from that, large hysteresis windows and nine evenly distributed programmed states can be observed in the electrical tests (Fig. 3b, c).

Fig. 3: Uniformity and repeatability of 8-bit programming.
figure 3

a OM image of the fabricated 256 devices. Scale bar: 0.2 mm. Inset: OM image of a typical device among the 256 devices (scale bar: 30 μm). The channel’s width/length is 11.22/3.11 μm (see Supplementary Fig. 24 for detailed geometric parameters). b Dual-sweep transfer curves of the devices. About 92.9% of devices have an on-state current exceeding 100 nA (Supplementary Fig. 25). The uniformity of transfer curves is comparable with previous works (Supplementary Fig. 26)31,75. c Current maps of the programmed 9 separate states. Sites in blue denote the 13 broken devices. d 256 programmed states for 120 devices. e Device-to-device variation as a function of state number. f Current distributions of selected adjacent states extracted from (d). Read voltage: 1 V. The programming method and parameters are the same as those used in Supplementary Figs. 11 and 12.

Moreover, we programmed 120 out of 137 devices (a yield of 87.6%) into 256 (8-bit) distinct states, ranging from a current level of 1 pA to 100 nA (the original data are shown in Supplementary Figs. 27–31). The statistics of the state current as a function of device number and state number are presented in Fig. 3d. These 120 devices exhibit an overall low device-to-device variation of below 4% for the programmed states (Fig. 3e, f), which can be largely attributed to the accurate programming method employed and the wide memory windows of the devices.

The 8-bit states, low programming voltage, good stability and endurance, and good repeatability and scalability shown above demonstrate the potential of GIM 2D FGMs for NCH.

Hardware convolutions based on device arrays

Vector-matrix multiplication is the most important operation in NNs, underlying the representation transformations between neighboring layers and the kernel filtering in convolution layers for feature extraction. In this section, we fabricated a 9 × 2 array, which is comparable to other arrays configured for analog computing (see Supplementary Table 3)29,30,31,49,50,51,52,53, and carried out hardware convolutions to demonstrate the potential of GIM 2D FGMs for NCH. The optical image of an array bonded on a chip carrier is shown in Fig. 4a (see Methods and Supplementary Fig. 32 for the array fabrication process, and Supplementary Fig. 33 for geometric parameters). A homemade test system was used to run the convolution process experimentally, as shown in Supplementary Fig. 34. The gate lines were wired out for the programming operation, while the rows and columns were wired out and connected to every device’s drain and source terminals, respectively. To implement a 3 × 3 convolution kernel, the first column stores the positive kernel weights and the second column stores the negative kernel weights. This kernel configuration can eliminate possible parasitic currents (as analyzed in Supplementary Fig. 35). The device structure also shows small parasitic capacitances and device-to-device interference, as thoroughly analyzed in Supplementary Note 3 and Supplementary Figs. 36–39. We adopted a parallel programming method for weight (conductance state) updating (Fig. 4b), i.e., devices in a selected row were programmed simultaneously by gate voltages with the common drain terminals grounded. A row-by-row validation scheme was used to verify the programmed kernel (Supplementary Fig. 40). Additional discussions on the limitations of operating the device array can be found in Supplementary Note 4.
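A minimal sketch of the two-column kernel mapping described above: signed 3 × 3 kernel weights are split into a positive and a negative conductance column, and the differential column current gives the signed multiply-and-accumulate result. The conductance scaling and helper names are illustrative, not the exact mapping used experimentally.

```python
import numpy as np

def map_kernel(kernel, g_max=100e-9):
    """Split a signed 3x3 kernel into positive/negative conductance columns (in siemens)."""
    w = kernel.flatten()
    scale = g_max / np.abs(w).max()
    g_pos = np.where(w > 0, w, 0.0) * scale     # column 1: positive weights
    g_neg = np.where(w < 0, -w, 0.0) * scale    # column 2: negative weights
    return g_pos, g_neg

def kernel_output(patch_voltages, g_pos, g_neg):
    """Differential column currents give the signed dot product via Kirchhoff summation."""
    v = np.asarray(patch_voltages, dtype=float).flatten()
    return v @ g_pos - v @ g_neg

kernel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)   # e.g. a Sobel edge kernel
g_pos, g_neg = map_kernel(kernel)
patch = np.tile([0.0, 0.05, 0.1], (3, 1))       # a 3x3 patch of drain voltages (V)
print(kernel_output(patch, g_pos, g_neg))       # signed output current (A)
```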

Fig. 4: Hardware convolutions using GIM 2D FGMs array.
figure 4

a Photo (left, scale bar: 1 mm) and OM image (right, scale bar: 40 μm) of the wired 9 × 2 array. The electrode lines are also shown. The channel’s width/length is 3.95/2.47 μm (see Supplementary Fig. 33 for detailed geometric parameters). b Parallel programming method for a selected row of the array. c Illustration of the vector-matrix multiplication operation for image convolution. The kernel weights were mapped as conductance states of the device array before each convolution process. During the convolution process, 3 × 3 patches of pixels were converted to drain voltage inputs patch-by-patch, with the patches sliding through the whole image row-by-row. d Conductance maps of three kinds of kernels that were mapped to the array. e The corresponding convolution results mapped into the source-drain current. f Comparisons between output current distributions in (e) (hardware) and software-based convolution results. The results have been normalized. g Illustration of the VGG16 convolutional base structure. There are 5 convolution blocks with each containing several convolution layers and a pooling layer. h, i Hardware-based conductance maps of the two convolution layers in block 1 (h) and the corresponding software-based weight maps (i). j, k The corresponding histogram plots of hardware-based conductances (j) and software-based weights (k).

Figure 4c uses the convolution of the image ‘0’ from the MNIST dataset as an example to illustrate the inference process. The image pixels were converted into voltages according to their greyscale values and grouped into 3 × 3 patches. The pixels of each patch were then applied as drain inputs to the array, and the output currents at the source terminals were collected as the convolution results. With different kinds of kernels programmed separately onto the device array (Fig. 4d and Supplementary Fig. 41), the output images after convolution show different features (Fig. 4e). The convolution results for another image from the Fashion MNIST dataset and the convolution results with large current outputs are also shown in Supplementary Figs. 42 and 43. The experimental output images show almost the same distributions as those of software-based convolutions (Fig. 4f), demonstrating that the array works well as a physical kernel for feature extraction.
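The patch-by-patch inference flow can be summarized by the following sketch (greyscale pixels converted to drain voltages and each 3 × 3 patch read out as one differential output current); the voltage scaling and the example kernel are assumptions for illustration.

```python
import numpy as np

def hardware_convolution(image, g_pos, g_neg, v_max=0.1):
    """Slide 3x3 patches over a greyscale image; each patch of drain voltages yields
    one differential output current, i.e. one pixel of the convolved feature map."""
    voltages = image / image.max() * v_max          # greyscale -> drain voltages (V)
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = voltages[i:i + 3, j:j + 3].flatten()
            out[i, j] = patch @ g_pos - patch @ g_neg
    return out

# Illustrative usage with a stand-in image and a Laplacian kernel split into two columns.
rng = np.random.default_rng(1)
image = rng.integers(0, 256, size=(28, 28)).astype(float)
kernel = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], float).flatten()
g_scale = 100e-9 / np.abs(kernel).max()
g_pos = np.clip(kernel, 0, None) * g_scale
g_neg = np.clip(-kernel, 0, None) * g_scale
feature_map = hardware_convolution(image, g_pos, g_neg)   # shape (26, 26)
```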

Considering the 8-bit states realized on GIM 2D FGMs, more complex kernels can be mapped onto the 9 × 2 array for high-level feature extraction. Take the convolutional base of the large-scale convolutional neural network (CNN) VGG16 as an example; it contains five convolution blocks, each comprising several convolution layers and a pooling layer (Fig. 4g). All 38,592 kernel parameters in the first block were mapped onto the 9 × 2 array kernel-by-kernel, as shown in Fig. 4h. The hardware-based kernel weights show almost the same landscape as the software-based values (Fig. 4i). A more direct comparison can be seen from the distributions of conductance and weight values (Fig. 4j, k). The above result demonstrates the hardware’s capability for vector-matrix multiplication and motivates incorporating GIM 2D FGMs into the whole body of large-scale NNs to validate their potential for constructing advanced NCH.
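For reference, the 38,592 parameters quoted above are consistent with the weight counts of the two 3 × 3 convolution layers in VGG16's first block (biases excluded, assuming only kernel weights are mapped to conductances):

```python
# Weight counts of the two 3x3 convolution layers in VGG16 block 1 (biases excluded):
conv1 = 3 * 3 * 3 * 64      # RGB input, 64 filters        -> 1,728 weights
conv2 = 3 * 3 * 64 * 64     # 64-channel input, 64 filters -> 36,864 weights
print(conv1 + conv2)        # 38,592
```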

Convolutional neural networks with 8-bit precision

The accuracy of NNs with limited numerical states (fixed-point NNs) is an important issue for the practical application of NCH. We note that downloading a pre-trained NN to a local NCH and quantizing the weights with limited numerical states (quantization after training) is generally a more energy-efficient approach. Therefore, to demonstrate the potential of GIM 2D FGM arrays for NCH (Fig. 5a), pre-trained large-scale convolutional neural networks (CNNs) were used for ImageNet dataset recognition (Fig. 5b). Here, the large numbers of parameters in these CNNs were quantized to the 8-bit states of the GIM 2D FGM using a nearest-rounding method. According to the simulation results with different bit precisions (the 4-bit, 5-bit, 6-bit, and 7-bit states adopted are shown in Supplementary Fig. 44), 8-bit precision is sufficient for the CNNs to achieve high recognition accuracy compared with their unlimited-precision versions (Fig. 5c and Supplementary Fig. 45). It is important to note that, while 8-bit precision achieves higher recognition accuracy (89.43%) than lower precisions (such as 88.96% for 7-bit) for the smallest MobileNet model, 7-bit precision is sufficient for the larger Xception model. This suggests that larger CNNs can operate effectively with lower bit precision. However, from a practical perspective, deploying small-sized NNs on edge devices is typically more energy-efficient. Therefore, the higher 8-bit precision storage for these small-sized CNNs is crucial for enhancing their performance.
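A minimal sketch of quantization after training as described above: each pre-trained weight tensor is mapped by nearest rounding onto 2^n uniformly spaced levels. The uniform level grid and per-tensor scaling are simplifying assumptions; the experimentally measured conductance states could be substituted for the grid.

```python
import numpy as np

def quantize_nearest(weights, n_bits=8):
    """Map a weight tensor onto 2**n_bits uniformly spaced levels by nearest rounding."""
    w_min, w_max = weights.min(), weights.max()
    step = (w_max - w_min) / (2 ** n_bits - 1)
    return weights if step == 0 else w_min + np.round((weights - w_min) / step) * step

w = np.random.randn(3, 3, 64, 64).astype(np.float32)     # stand-in for a pre-trained kernel
w_q = quantize_nearest(w, n_bits=8)
print(np.unique(w_q).size)                                # at most 256 distinct values
```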

Fig. 5: Image recognition using CNNs with different precisions.
figure 5

a Schematic of a large-scale GIM 2D FGM array for vector-matrix multiplication in neural networks. b Schematic of CNNs for ImageNet image recognition. c, d Comparison of top-5 accuracy for CNNs quantized with different precisions after (c) and during (d) training. The quantization process used the nearest-rounding scheme. The numbers of parameters are given in brackets following the models’ names (4.3 million for MobileNet, 22.9 million for Xception).

An alternative approach involves directly training fixed-point NNs on the NCH with limited states (quantization during training). Even though this approach consumes much more energy and time than quantization after training, mainly because of the large-scale weight updating, it offers greater flexibility by adapting to specific tasks through weight fine-tuning. Through simulation of quantization during training (Fig. 5d), we observed that the advantage of 8-bit precision over lower precisions is still very obvious for both the MobileNet and Xception models. This is further supported by results from a simpler model for MNIST recognition (Supplementary Fig. 46). However, an overall accuracy decrease is observed across all fixed precisions compared with quantization after training, likely because of the reduced efficiency of the training process caused by inaccurate weight updates at lower precisions.
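One hedged way to picture why low-precision weight updates reduce training efficiency is to confine the weights to a fixed level grid and re-round after every update, so that gradient steps smaller than about half a level spacing are lost; this is an illustrative sketch, not the exact simulation procedure used here.

```python
import numpy as np

def apply_update(weights_q, gradient, lr, levels):
    """Weights are confined to a fixed grid of levels; every update is re-rounded to
    the grid, so gradient steps below about half a level spacing are lost."""
    proposed = weights_q - lr * gradient
    idx = np.argmin(np.abs(proposed[..., None] - levels), axis=-1)
    return levels[idx]

levels = np.linspace(-1.0, 1.0, 2 ** 4)          # e.g. a 4-bit grid of allowed weights
rng = np.random.default_rng(0)
w = rng.choice(levels, size=(8, 8))
grad = rng.normal(scale=0.01, size=(8, 8))
w_new = apply_update(w, grad, lr=0.1, levels=levels)
print(np.count_nonzero(w_new != w))              # many small updates are rounded away
```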

Another important point is the choice of rounding scheme. In the above simulations, a nearest-rounding scheme was adopted. However, according to previous reports10,11,12,13, a stochastic rounding scheme can enhance NN performance. To validate this, we reran the simulations using a stochastic rounding scheme (Supplementary Fig. 47), and the results showed a clear accuracy increase for all the fixed precisions, especially lower precisions such as 5-bit and 6-bit, confirming the benefits of stochastic rounding. Combined with the demonstrated vector-matrix multiplication capability and the high repeatability of 8-bit programming, GIM 2D FGMs show great promise for system-level-integrated vector-matrix multiplication arrays in NN accelerators.
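For comparison with nearest rounding, a stochastic-rounding sketch is given below: each value is rounded up or down with probability proportional to its fractional remainder, so the rounding error is unbiased in expectation.

```python
import numpy as np

def quantize_stochastic(weights, step, rng=None):
    """Round each weight to a multiple of `step`, up or down with probability given by
    the fractional remainder, so the quantization is unbiased in expectation."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(weights, dtype=float) / step
    low = np.floor(scaled)
    frac = scaled - low
    return (low + (rng.random(scaled.shape) < frac)) * step

vals = np.full(100_000, 0.3)
print(quantize_stochastic(vals, step=1.0).mean())   # ~0.3, whereas nearest rounding gives 0.0
```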

Discussion

To sum up, we have designed 2D floating-gate memories working in a gate-injection mode as potential device units for large-scale NCH. The CG, floating gate, and channel are decoupled through this design, so that a bi-pulse state programming strategy could be adopted to realize 8-bit conductance states. This is because the subsequent tuning voltage promotes the de-trapping of unstable charges captured by dielectric defects that have a lower potential barrier. The states are highly distinguishable, with intervals larger than three times the standard deviation, and very stable, with retention times longer than 10,000 s. The devices also show good endurance over 10^5 cycles. In addition, because charges are injected and erased through the CG via the shared Al2O3 layer by FNT, the state programming is almost symmetric. By changing the capacitance ratio through the area of the CG, a 3 V programming voltage can be achieved. Moreover, the 256 fabricated devices exhibit a 94.9% yield, good uniformity, and repeatability. A 9 × 2 device array was then fabricated, and experimental image convolutions were carried out with results matching those of software simulations well. Leveraging the device’s multi-state programming capability, we successfully transferred 38,592 convolutional kernel parameters from a pre-trained VGG16 network to the array. Finally, we studied the image recognition accuracies of fixed-point NNs with different levels of precision. Notably, whether the NNs were obtained by downloading pre-trained networks or by directly training networks locally, the inference accuracies at 8-bit precision approached the ideal values. Our work validates the potential of GIM 2D FGMs for high-performance neuromorphic computing accelerators.

Methods

Device fabrication

A p-doped silicon substrate with 300-nm thermally oxidized SiO2 was first coated with poly(methyl methacrylate) (PMMA) and baked for 2 min at 150 °C. After that, the Pt floating gate was patterned and deposited by electron beam lithography (EBL) and electron beam evaporation, respectively. After a standard lift-off process, an 8-nm-thick layer of Al2O3 was deposited on the floating gate by atomic layer deposition (ALD). The ALD was performed at 150 °C, using water and trimethylaluminum as precursors. Then, MoS2 (purchased from Shanghai Onway Technology Co., Ltd.) mechanically exfoliated with Scotch tape was transferred onto the top surface of the Al2O3/Pt stack by a standard wet-transfer method, using polypropylene carbonate and polydimethylsiloxane as holders. Finally, source, drain, and gate electrodes of Cr/Au (8/80 nm) were patterned and deposited using EBL and thermal evaporation. To fabricate the 256 devices, a large-scale few-layer MoS2 film was grown by CVD on a 1 × 0.5 cm sapphire substrate. The CVD-grown material was transferred with PMMA, patterned by EBL, and etched with CF4 and O2 by reactive ion etching.

Array fabrication

Before the fabrication of the wired 9 × 2 array, the CVD-grown MoS2 was transferred onto a substrate on which the wiring metal patterns were pre-deposited, following the transfer process illustrated in Supplementary Fig. 48. During the fabrication process, a 25-nm-thick layer of ALD-deposited HfO2 was used as the insulating layer to isolate the overlapping drain and gate lines in the array. The other array fabrication steps were the same as the device fabrication process described above.

Electronic measurements

Except for the 9 × 2 array, the electronic performance of the as-fabricated devices was tested on a probe station (Lakeshore, TTP4) under high vacuum (<10^−6 Torr), equipped with a Keysight B1500A semiconductor analyzer system. All tests on the 9 × 2 array were conducted on a homemade probe station equipped with an electrical testing system (National Instruments, cDAQ-9189) under ambient conditions.

Simulation of large-scale CNNs

The adopted large-scale CNNs are pre-trained models loaded from the Keras platform and were handled with Python scripts for convenient access to the internal weights. The ImageNet samples used for evaluation were all collected from the ILSVRC2012 validation data set, which contains 50,000 images, each labelled with its class. Before evaluation, all the pre-trained weights of the CNNs were replaced by the normalized conductance states with the corresponding bit precisions. During the evaluation of each CNN, the 50,000 images were cropped to 224 × 224 pixels and fed to the model one by one. The output scores were translated into the recognized class for each image, and the correctly recognized images were counted to calculate the final recognition accuracy on this data set. The three-layer FCNN was also constructed on the Keras platform layer-by-layer. The ReLU function was used as the activation function, cross-entropy was used as the loss function, and a learning rate of 10^−3 was adopted for model training.
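A minimal sketch of this evaluation flow, assuming the TensorFlow/Keras API and a simple uniform quantizer standing in for the measured conductance states; only a single input is shown in place of the full 50,000-image loop.

```python
import numpy as np
import tensorflow as tf

# Uniform nearest-rounding quantizer (see the earlier sketch); the measured conductance
# states could be substituted for this uniform level grid.
def quantize_nearest(w, n_bits=8):
    lo, hi = float(w.min()), float(w.max())
    step = (hi - lo) / (2 ** n_bits - 1)
    return w if step == 0 else lo + np.round((w - lo) / step) * step

# Load a pre-trained model and replace every weight tensor with a quantized copy.
model = tf.keras.applications.MobileNet(weights="imagenet")
model.set_weights([quantize_nearest(w) for w in model.get_weights()])

# Single hypothetical 224 x 224 input; in the actual evaluation the 50,000 ILSVRC2012
# validation images are preprocessed and fed one by one, and the top-5 predictions
# are compared against the ground-truth labels.
x = np.random.rand(1, 224, 224, 3).astype("float32") * 255.0
x = tf.keras.applications.mobilenet.preprocess_input(x)
top5 = np.argsort(model.predict(x)[0])[-5:]     # indices of the five highest scores
```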