Introduction

The current technological revolution, characterized by the astonishing achievements of artificial intelligence (AI), is ushering in a groundbreaking epoch1,2,3. State-of-the-art AI hardware, rooted in the classic von Neumann architecture, delivers unparalleled performance but at a considerable energy cost4,5,6. For instance, cutting-edge H100 GPUs are projected to consume over 13 terawatt-hours of energy annually7,8, underscoring a critical concern for energy efficiency, particularly in the data-driven era.

The human visual system, renowned for its remarkable efficiency in perceiving, transducing, and interpreting information at approximately 1.0 pJ/bit9,10,11,12, provides a promising avenue for tackling this concern. This efficiency stems from the intricate mechanisms of visual attention, encompassing both passive autonomic attention (PAA) and active volitional attention (AVA)13,14. As illustrated in Fig. 1a, PAA, which is naturally attracted by external stimuli, preferentially dominates the formation of perception directly from attention to salient signals. It involves the fundamental reception, conversion and pre-processing of light signals, realizing primary motion detection. AVA then intervenes when a subjective selection command, guided by prior knowledge and objectives, is sent to the brain’s central thinking region. Accordingly, visual attention is directed through priority maps to focus efficiently on key information while ignoring irrelevant stimuli, which facilitates selective feature extraction and thus extreme energy efficiency. Leveraging insights from the human visual system is therefore of great significance for the development of intelligent devices15,16,17,18.

Fig. 1: The principle of implementation path and characterization based on the volitional neuromorphic devices.

Schematic representation of a active volitional attention (AVA) modulation in the human visual system and b the proposed volitional neuromorphic devices. Three key operations are included: incident light receiving for photoelectric conversion, neural network training for spectral reconstruction, and a feedback mechanism for error calibration. c TEM image and EDX mapping of the vertical cross-section of the MoSe2/h-BN/MoS2 heterostructure. d Raman spectra of MoS2, h-BN and the MoSe2/h-BN/MoS2 heterostructure. The shadings mark the characteristic Raman peaks for clarity. e Resolved XPS spectra of the Mo, Se, S, B and N core levels.

Encouragingly, the integration of visual attention into retinomorphic vision devices based on two-dimensional materials has produced milestone breakthroughs across versatile and complicated scenarios, such as intelligent imaging, in-sensor computing and all-in-one hardware19,20,21,22,23,24,25,26. Nevertheless, these devices predominantly rely on PAA, which constrains their capacity for efficient feature extraction. This results in redundant sensory data and heightened power consumption, especially in multi-sport scenarios such as road tracking, biology follows and adaptive cruise control27, which highlights the critical necessity of developing volitional neuromorphic devices with exceptional energy efficiency28.

Here, we demonstrate volitional neuromorphic devices with an extreme energy efficiency of sub-picojoules per bit by emulating the hierarchical functions of the human visual system. In addition to PAA for dynamic feature extraction, AVA empowers our devices to precisely target specific objects and track their trajectories based on spectral features with an average accuracy of over 93%. Such high precision originates from the active feedback and correction attained by optimizing the gate-tunable differential spectral photoresponse, inspired by biological transsaccadic memory. As a result, the volitional neuromorphic devices exhibit a data compression ratio of 1.17%, minimizing redundant data while reaching an information-energy efficiency (IEE) of 0.625 pJ/bit, which approaches that of the human visual system. This advancement is poised to redefine the landscape of AI hardware development, emphasizing brain-like energy efficiency in non-von Neumann architecture-based systems.

Results

Design of volitional neuromorphic devices

Given that the participation of AVA enables high operating efficiency and low power consumption, we propose volitional neuromorphic devices with extreme energy efficiency that simulate the response of the fovea to regions of interest and are built on a van der Waals (vdWs) heterostructure of MoSe2/h-BN/MoS2 (Fig. S1). The structural, interfacial, morphological and elemental characteristics are evaluated by transmission electron microscopy (TEM), Raman spectroscopy, X-ray photoelectron spectroscopy (XPS), atomic force microscopy (AFM) and energy-dispersive X-ray spectroscopy (EDS). As shown in Fig. 1c, clear interfaces and a layered element distribution are observed in the vdWs heterostructure. The characteristic E2g (383 cm−1, 233 cm−1 and 1350 cm−1) and A1g (408 cm−1) peaks are identified for the few-layer MoS2 flakes (floating gate), MoSe2 flakes (conduction channel) and h-BN flakes (potential barrier), respectively (Fig. 1d)29,30,31. To further evaluate the interfacial characteristics, the resolved XPS spectra of the Mo 3d, Se 3d, S 2p, B 1s and N 1s core levels are analyzed in Fig. 1e29,30,31, and the symmetric peak profiles of h-BN suggest that vdWs interactions predominantly govern the interfacial characteristics32. In addition, the morphology profile and element distribution are recorded in Figs. S2, S3, further confirming the successful fabrication of the proposed heterostructure with a floating gate.

As in the aforementioned AVA of the human visual system (Fig. 1a), the fovea reflects the polarity peak change of the cone photoreceptors, ensuring fundamental information pre-processing. When attention is focused on a region of interest, the corresponding neuronal potential activity is immediately enhanced and fed back to the cone photoreceptors, promoting selective task perception and cognition. Similarly, as shown in Fig. 1b, the device-level AVA is realized by introducing an active spectral feedback loop, which progressively adjusts the voltage feedback until the output reaches the optimal spectral reconstruction accuracy. Specifically, the MoSe2 photoactive channel receives optical signals and emulates the synaptic weight, and the MoS2 conduction channel maintains the persistent current to realize the memory function. Subsequently, programmable gate voltage (Vg) pulses, serving as electrical stimuli, are applied to modulate the photoconductance, generating both non-volatile positive and negative photoconductivity (PPC and NPC) over the visible spectrum to achieve polarity regulation (similar to cone cells). The differential operation between the NPC and PPC currents enables polarity-regulated spectral reconstruction by synergistically suppressing interference and amplifying target signals. For the same reason, to observe specific objects within the full spectral information, the input voltage is set to the designated optimal Vg for each wavelength, that is, the Vg that yields the maximum reconstruction accuracy at that wavelength. Accordingly, only the spectral signal of interest is emphasized, rather than irrelevant spectral components. This aligns with the natural way humans process and prioritize visual information, ultimately enabling the targeting and localization of objects (e.g., a runner wearing a blue shirt). The device can maintain 93% spectral recognition accuracy even under 1.17% data compression by dynamically balancing noise suppression and feature retention. Such operation guarantees intelligent detection with subjective feature selection, which significantly reduces irrelevant redundant sensory data while effectively allocating processing resources to reduce energy consumption. As a result, the AVA-involved volitional neuromorphic device adheres to the intelligent philosophy of the human brain, prioritizing designated spectral sensory inputs over others. This leads to extreme energy efficiency and provides an innovative way to establish autonomous selection for targeting and tracking.
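
The interference suppression attributed to the differential operation can be illustrated with a generic two-channel toy model. The sketch below is not the device model of this work: it assumes, purely for illustration, that the target signal enters the PPC and NPC channels with opposite signs while the interference enters both channels equally, and all signal shapes and amplitudes are hypothetical.

```python
import numpy as np

def differential_readout(i_ppc, i_npc):
    """Return the difference of two channel currents (arbitrary units)."""
    return np.asarray(i_ppc, dtype=float) - np.asarray(i_npc, dtype=float)

# Hypothetical toy signals (assumption, not measured data): the target raises
# the PPC channel and lowers the NPC channel, while the interference adds
# equally to both channels.
t = np.linspace(0.0, 1.0, 200)
target = np.where((t > 0.4) & (t < 0.6), 1.0, 0.0)   # signal of interest
interference = 0.5 * np.sin(2 * np.pi * 10 * t)       # common-mode term

i_ppc = +target + interference
i_npc = -target + interference

# The common-mode term cancels and the target contribution doubles.
i_diff = differential_readout(i_ppc, i_npc)
```

Under these assumptions the differential output equals twice the target signal, independent of the interference amplitude, which is the sense in which interference is suppressed and the target is amplified.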

Versatile characteristics of volitional neuromorphic devices

The AVA-involved sensing, memory, and computing capabilities are essential prerequisites for volitional neuromorphic devices, ensuring accurate targeting and tracking with extreme energy efficiency33,34. The intrinsic response characteristics are first examined under illumination with the primary colors, as shown in Fig. 2a. The as-fabricated device demonstrates an obvious photoresponse that is over two orders of magnitude larger than the dark current, accompanied by exceptional responsivity, detectivity and noise-equivalent power at a low incident light power of 0.1 mW (Figs. S4–S6). The photoconductive behaviors are subsequently evaluated by applying programmable Vg pulses (Fig. 2b, c). A single Vg pulse is initially applied for 1 s, and the light is switched on as soon as the gate voltage is disconnected. Both PPC and NPC clearly show progressive output, good uniformity and reproducibility. Notably, their response is much faster (rise time <160 μs, Fig. S7) than that of the human visual system (HVS, ~50 ms), which effectively prevents the capture of ghost images, especially in high-speed motion recognition35,36. Good photoresponsive characteristics and the floating-gate effect are also validated as functions of pulse number and bias voltage (Figs. S8–S10). Meanwhile, the device possesses a linear memory window of ~80 V with a large stored charge density of 5.65 × 10^12 cm−2 and maintains good non-volatile stability (Fig. S11), which is beneficial for successive differential operations. These attractive data-perceiving and storing capabilities originate from the appropriate arrangement of the vdWs heterojunction, which allows efficient Fowler–Nordheim tunneling. The corresponding band alignment and carrier dynamics are elucidated in Fig. S12. Briefly, when a negative Vg is applied, electrons initially accumulate in the MoSe2 conduction channel and holes are trapped in the MoS2 modulation channel. Upon light illumination, the electrons subsequently tunnel to MoS2 and recombine with the trapped holes. Accordingly, the decreased number of electrons weakens the photoconductivity, resulting in a reduced photocurrent (i.e., NPC), and vice versa for PPC37.

Fig. 2: Photoconductivity properties of volitional neuromorphic devices.

a Logarithmic I–V curves measured in dark and under illumination of primary colors with the same intensity of 0.1 mW/cm2. Cumulative b positive and c negative photoconductivity measurements with progressive multilevel states. The pulse width and interval of incident light are programmed as 200 ms and 10 s, respectively. d Photocurrent mapping as a function of gate voltage within the spectrum ranging from 450 to 800 nm. The gate voltage ranges from −40 to 40 V with a step size of 2 V and a bias voltage of 1 V. e The histogram of the optimal gate voltage corresponding to the minimum differential photoresponse. Each color represents its corresponding wavelength.

The corresponding spectral response database is established by mapping the photoresponse as a function of wavelength (450–800 nm) and gate voltage (−40 to 40 V), laying the foundation for the proposed device-level AVA feedback. As shown in Fig. 2d, PPC clearly transforms into NPC when the direction of the vertical electric field is reversed by switching Vg from positive to negative, which is consistent with the aforementioned device physics. Notably, both PPC and NPC respond gradually to both wavelength and gate voltage, indicating effective photoconductive modulation that enables the subsequent differential operation and thus spectral reconstruction. As mentioned earlier, the differential current (Idiff), which obeys Kirchhoff’s law, is essential for realizing visual attention: the smaller Idiff, the clearer the object contour23,38, as evidenced in Fig. S13. Meanwhile, differential signals can suppress common-mode noise and lead to a higher information compression ratio39,40. In addition, spatial-temporal operations (edge enhancement and extraction) on static and dynamic objects are validated by manipulating the PPC/NPC characteristics of the device (Figs. S14–S17). In this case, the distribution of optimal gate voltages (\(V_{\mathrm{g}}^{I_{\mathrm{diff\text{-}min}}}\)) for each wavelength, sampled at an interval of 10 nm, is recorded where the differential current reaches its minimum (Idiff-min), as shown in Fig. 2e. It is worth noting that each color has its own optimal set of gate voltages, e.g., blue (4 and −14 V), green (8 and −32 V) and red (10 and −34 V). Based on the statistics of the \(V_{\mathrm{g}}^{I_{\mathrm{diff\text{-}min}}}\) distribution, the operating conditions derived for the primary colors can reasonably be expected to facilitate the subsequent implementation of AVA with adjustable feedback.
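
The search for the optimal gate voltages described above can be sketched as an exhaustive minimization over positive/negative voltage pairs. This is a minimal illustration under stated assumptions rather than the procedure used in this work: the function name `optimal_gate_pairs`, the definition of Idiff as the sum of the two branch currents, and the synthetic photocurrent map are all hypothetical.

```python
import numpy as np

def optimal_gate_pairs(photocurrent, gate_voltages):
    """Return, per wavelength, the (V_pos, V_neg) pair minimising |I_diff|.

    photocurrent has shape (n_wavelengths, n_gate_voltages).
    Assumption: I_diff = I(V_pos) + I(V_neg), i.e. the two branch currents
    are summed at a common node (Kirchhoff's current law).
    """
    photocurrent = np.asarray(photocurrent, dtype=float)
    gate_voltages = np.asarray(gate_voltages, dtype=float)
    pos = np.flatnonzero(gate_voltages > 0)
    neg = np.flatnonzero(gate_voltages < 0)
    # Every positive/negative combination: shape (n_wavelengths, n_pos, n_neg).
    i_diff = photocurrent[:, pos, None] + photocurrent[:, None, neg]
    flat = np.abs(i_diff).reshape(photocurrent.shape[0], -1)
    best = flat.argmin(axis=1)
    p_idx, n_idx = np.unravel_index(best, (pos.size, neg.size))
    return gate_voltages[pos][p_idx], gate_voltages[neg][n_idx], flat.min(axis=1)

# Synthetic stand-in for the measured map (not data from this work).
wavelengths = np.arange(450, 801, 10)      # nm, 10 nm interval
gate_voltages = np.arange(-40, 42, 2)      # V, step of 2 V
shift = 0.02 * (wavelengths - 450)[:, None]
photocurrent = np.tanh((gate_voltages[None, :] - shift) / 15.0)

v_pos, v_neg, i_min = optimal_gate_pairs(photocurrent, gate_voltages)
```

With a measured map in place of the synthetic one, `v_pos` and `v_neg` would give one optimal voltage pair per wavelength, analogous to the histogram in Fig. 2e.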

Spectral active feedback mechanism

With the differential current response to different wavelengths established for the volitional neuromorphic devices, the device-level AVA with spectral active feedback can be conceptualized. This operation leverages the dynamic modulation capabilities of neuromorphic devices to enhance the accuracy of motion spectra and thus focus on objects. Therefore, the crucial aspect of the implementation lies in reconstructing the object spectrum. The reconstruction process is briefly outlined in Fig. 3a–c. A single differential photocurrent matrix (Idiff-pc) measured at different Vg and wavelengths is used as the input, and each unknown reflection image input yields a reconstructed spectrum after neural network training via gradient descent, as described by the formula below:

$$I_{\mathrm{diff\text{-}pc}}=\varGamma (V_{\mathrm{positive}},V_{\mathrm{negative}})\,S_{\mathrm{vector}}(\lambda )+\nu \tag{1}$$

where Γ(Vpositive, Vnegative) is the spectral response matrix, Svector(λ) is a vector representing the spectrum, whose length depends on the wavelength resolution, and ν denotes noise. By comparing the reconstructed spectra \(\hat{S}_{\mathrm{vector}}(\lambda)\) with the corresponding reference spectra Svector(λ) in the training dataset, the neural network is optimized by solving Eq. 2:

$$\hat{S}_{\mathrm{vector}}(\lambda )=\underset{S_{\mathrm{vector}}}{\arg\min }\,{\left\Vert \varGamma (V_{\mathrm{positive}},V_{\mathrm{negative}})\,S_{\mathrm{vector}}(\lambda )-I_{\mathrm{diff\text{-}pc}}\right\Vert }_{2}^{2} \tag{2}$$

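
The least-squares problem in Eq. 2 can be illustrated with a small closed-form solve. Note that this work reconstructs the spectrum with a trained neural network; the sketch below swaps in a ridge-regularized normal-equation solution only to make the objective concrete. The matrix sizes, the random stand-in for the response matrix, the toy spectrum and the `ridge` parameter are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_meas, n_bins = 60, 20                              # hypothetical sizes
gamma = rng.uniform(0.0, 1.0, (n_meas, n_bins))      # stand-in response matrix
s_true = np.exp(-0.5 * ((np.arange(n_bins) - 8) / 2.5) ** 2)  # toy spectrum
i_meas = gamma @ s_true + 1e-3 * rng.standard_normal(n_meas)  # Eq. 1 with noise

def reconstruct(gamma, i_meas, ridge=1e-6):
    """Solve argmin_S ||gamma @ S - i_meas||^2 + ridge * ||S||^2 in closed form."""
    n = gamma.shape[1]
    return np.linalg.solve(gamma.T @ gamma + ridge * np.eye(n), gamma.T @ i_meas)

s_hat = reconstruct(gamma, i_meas)
rel_err = np.linalg.norm(s_hat - s_true) / np.linalg.norm(s_true)
```

The small ridge term is a design choice that keeps the solve well-posed when the response matrix is poorly conditioned; it is not taken from this work.
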
Fig. 3: Schematic of the working principle of spectral reconstruction.

a Parameter input. By employing nonlinear transformation expansion, the dimensionality of the input space is increased to 16 dimensions. b Residual neural networks (ResNet) training. c Spectral reconstruction of white light. d Average reconstruction accuracy for all voltage combinations under 450–800 nm. e The corresponding voltage combination for 733 nm light signals. The optimal voltage combinations marked as red squares are assigned to the minimum value of Idiff-pc. f The accuracy change curve for the three primary colors (460 nm, 554 nm, and 697 nm) and the selected 733 nm. The shadings represent the corresponding error bars for each color, and the “star” symbols suggest the optimal gate voltage for achieving maximum spectral reconstruction accuracy.

During training, a combination of mean squared error (MSE) and the L1 norm was used as the loss function. The model stabilized after only 79 rounds of training, minimizing convergence issues and restricting the solution space (Figs. S18, S19)41,42. Using the preferred light source (a xenon lamp), simulated reconstructions of different transmission spectra are obtained, as shown in Fig. 3c and Fig. S20. The spectrum reconstructed by the optimized neural network agrees well with the measured reference spectrum. To further evaluate the model performance, the reconstruction error at various wavelengths is recorded, as presented in Fig. 3d. Across the entire dataset, the model achieves an accuracy of 92.2% with a spectral resolution of approximately 0.24 nm.
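
A combined MSE and L1 loss of the kind mentioned above can be written in a few lines. The text does not state how the two terms are weighted or whether the L1 norm acts on the residual or on the network weights, so the sketch below assumes an L1 term on the residual with a hypothetical weight `l1_weight`.

```python
import numpy as np

def mse_l1_loss(pred, target, l1_weight=0.1):
    """Mean squared error plus a weighted mean absolute error of the residual.

    Assumption: the L1 term is applied to the prediction error; l1_weight is
    a hypothetical hyperparameter, not a value from this work.
    """
    err = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(err ** 2) + l1_weight * np.mean(np.abs(err)))

# Toy example: only the third element is off, by 2.
loss_example = mse_l1_loss([1.0, 2.0, 4.0], [1.0, 2.0, 2.0], l1_weight=0.5)
```

For this toy input the MSE term is 4/3 and the weighted L1 term is 1/3, giving a loss of 5/3.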

The accuracy of spectral reconstruction hinges on the gate voltage and the resulting Idiff-pc, and a lower Idiff-pc enhances the effectiveness of motion recognition. By optimizing the voltage corresponding to Idiff-pc, the accuracy can be improved for specific wavelengths, thereby selecting signals from the desired motion targets. This approach effectively suppresses non-target spectral signals and highlights signals from the spectral motion target. The corresponding process is illustrated in Fig. 3d–f. To begin, we establish the average reconstruction accuracy across all voltage combinations as the baseline for each wavelength (Fig. 3d). Taking the spectral information at 733 nm as an example, the Vg mapping for this wavelength can be derived (Fig. 3e). By sorting the differential currents, the minimum differential photocurrent and the corresponding optimal set of gate voltages are obtained. Multiple minima of Idiff-pc are identified, and spectral reconstruction is then performed for each of the corresponding voltage combinations. By subtracting the established average accuracy curve from the accuracy spectrum obtained under a voltage combination corresponding to a minimum of Idiff-pc, we reveal the accuracy variation across the spectrum. This comparison helps assess whether the target wavelength achieves the greatest improvement in spectral reconstruction accuracy. As depicted in Fig. 3e, the spectral reconstruction accuracy peaks at 733 nm when the specific voltage combination of (−20, 38 V) is used. This optimal voltage configuration is determined through iterative optimization, reflecting the device’s learning process and serving as a crucial condition for the subsequent formation of the device’s operational “consciousness”. Furthermore, the same procedure yields maximum reconstruction accuracy for red, green, and blue light (Fig. 3f), further validating the effectiveness of this method across different wavelengths.
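
The baseline-subtraction step described above can be sketched as follows. This is an illustration with synthetic accuracy curves, not the evaluation code of this work; the function names, the three toy voltage combinations and the shape of the accuracy bump are assumptions.

```python
import numpy as np

def accuracy_gain(acc_by_combo, combo_index):
    """Accuracy of one voltage combination minus the average over all combinations.

    acc_by_combo has shape (n_combinations, n_wavelengths).
    """
    baseline = acc_by_combo.mean(axis=0)
    return acc_by_combo[combo_index] - baseline

def gain_peaks_at(acc_by_combo, combo_index, wavelengths, target_nm):
    """Check whether the accuracy gain is largest at the target wavelength."""
    gain = accuracy_gain(acc_by_combo, combo_index)
    target_idx = int(np.argmin(np.abs(wavelengths - target_nm)))
    return bool(np.argmax(gain) == target_idx)

# Synthetic curves (assumption): three combinations with a flat 90% accuracy,
# and combination 1 additionally boosted around 733 nm.
wavelengths = np.arange(450, 801, 1)
acc_by_combo = np.full((3, wavelengths.size), 0.90)
acc_by_combo[1] += 0.05 * np.exp(-0.5 * ((wavelengths - 733) / 15.0) ** 2)

tuned_ok = gain_peaks_at(acc_by_combo, 1, wavelengths, 733)
untuned_ok = gain_peaks_at(acc_by_combo, 0, wavelengths, 733)
```

In this toy case only the boosted combination passes the check, mirroring the selection of a voltage pair whose accuracy improvement peaks at the wavelength of interest.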

Active target demonstration

The device-level AVA is expected to show its advantages in numerous scenarios that require actively identifying a desired target among multiple moving objects. Building upon feature extraction, it can fuse and superpose features through accumulation and averaging. This operation is beneficial for obtaining the trajectories of multispectral moving objects captured in the images. In the proof-of-concept demonstration in Fig. 4a–c, sustained attention is illustrated for individuals moving in red, green, and blue colors. The 3D plots display each person’s trajectory in a distinct color, with a tiled floor as the backdrop for spatial reference. The figures clearly delineate the paths of the individuals, demonstrating the system’s ability to track them accurately amidst potential distractions. The information captured in these images is clear and precise, in stark contrast to the data obtained through PAA, which appear cluttered and chaotic (Fig. S21). In addition, the extraction of moving objects and color tracking are demonstrated with a single-device whiskbroom scanning system in Figs. S22, S23. Figure 4d–f illustrates the recognition accuracy of the device over time at different wavelengths. For all frame rates, the recognition accuracy often exceeds 93%, with small fluctuations and relatively few outliers, indicating reliable recognition performance. The AVA significantly enhances image compression by prioritizing areas with high informational content, thereby optimizing storage and improving the efficiency of data transmission without sacrificing key visual details. For the statistics, the pixels in the grayscale range of 0–256 are divided into 20 equally spaced statistical units. The spectral reconstruction mechanism achieves high-efficiency data compression by selectively preserving the target object’s characteristic spectral channels. Key principles include: (1) a threshold comparator filters pixels with effective edges (Fig. 4g, lower part); (2) the original image spectrum is compressed to only the color channel of the target object (Fig. 4g, upper part). Static backgrounds and irrelevant moving objects are ignored, which significantly reduces the data volume and ultimately yields a compression ratio of 1.17%, outperforming other neuromorphic devices as summarized in Table S4 (detailed calculations in Supplementary Information).

In addition, the AVA mechanism contributes to low energy consumption by selectively processing only relevant information. The information-energy efficiency (IEE) is evaluated from the power consumption of human-like visual information processing, quantified in joules per bit (J/bit)10,43,44. Using an evaluation method based on the resting state and action potential of human brain neurons (Fig. S24), the IEE of our volitional neuromorphic device can be as low as 0.625 pJ/bit. This efficiency is comparable to the energy consumption of neurons in the human retina, which is approximately 0.714 pJ/bit, and is slightly higher than the energy consumption observed in the neurons of mouse and Drosophila10,45,46, as shown in Fig. 4h. Meanwhile, the IEE shows obvious advantages over other neuromorphic devices (Table S5). This comparison highlights the potential of the AVA mechanism to approach near-biological levels of energy efficiency in AI systems. Notably, good device-to-device repeatability has been substantiated by additional validations of the vdWs heterostructure, device performance and accuracy simulation (Figs. S25–S29). By employing AVA, a more comprehensive and refined understanding of the scene is achieved, facilitating effective tracking and analysis of moving objects with multispectral features.
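
The two compression steps described above (edge thresholding and restriction to a single color channel) can be sketched on a toy frame. The detailed calculation behind the reported ratio is given in the Supplementary Information and is not reproduced here; the ratio definition below (retained values divided by original values), the frame size, the threshold and the function name are assumptions, and the toy result is unrelated to the reported 1.17%.

```python
import numpy as np

def compression_ratio(edge_strength, threshold, n_channels_in=3, n_channels_kept=1):
    """Fraction of the original data retained after the two AVA-style steps.

    Assumption: retained data = pixels above the edge threshold, stored in
    n_channels_kept colour channels; original data = all pixels in all channels.
    """
    kept_pixels = np.count_nonzero(np.asarray(edge_strength) > threshold)
    total_values = np.asarray(edge_strength).size * n_channels_in
    return kept_pixels * n_channels_kept / total_values

# Toy 64 x 64 frame (assumption) with an 8 x 8 patch of strong edges.
edge_strength = np.zeros((64, 64))
edge_strength[10:18, 10:18] = 1.0
ratio = compression_ratio(edge_strength, threshold=0.5)

# Normalised brightness statistics in 20 equally spaced grayscale bins.
gray = (edge_strength * 255).astype(np.uint8)
counts, bin_edges = np.histogram(gray, bins=20, range=(0, 256))
hist_norm = counts / counts.sum()
```

Here 64 of the 4096 pixels survive the threshold and one of three channels is kept, so the toy frame retains 64/12288 of its original values.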

Fig. 4: Simulation of AVA mechanism to recognize and select target moving objects.

a Red, b green and c blue moving objects and their motion trajectories. d–f Box plots showing the distributions of recognition accuracy of the device over time for moving objects in d red, e green and f blue. All box plots include the median line, mean values, outliers and interquartile range (25–75%). g The principle of data compression with the AVA mechanism. The color channel selects only the target objects and discards background data, and the pixel brightness distributions of the original image and the AVA mechanism are normalized. h Comparison of information-energy efficiency for different systems: 0.625 pJ/bit for the volitional neuromorphic devices (VND), 0.714 pJ/bit for the human brain, 0.574 pJ/bit for mouse and 0.329 pJ/bit for Drosophila.

We have successfully demonstrated a sub-picojoule-per-bit volitional neuromorphic device by introducing AVA. The AVA operation, with active feedback and real-time correction that leverage gate-tunable differential spectral characteristics, enables our device to precisely identify and track specific objects with an accuracy of over 93%. As a result, the volitional neuromorphic device achieves a data compression ratio of 1.17%, which significantly reduces redundant data, and an extreme information-energy efficiency of 0.625 pJ/bit. These advancements mark a major breakthrough in neuromorphic engineering, highlighting the potential to redefine the future of AI hardware with brain-like efficiency.

Methods

Materials: Heavily p-doped Si substrates coated with a SiO2 layer (90 nm) were purchased from Corning Inc. The MoSe2, h-BN and MoS2 were supplied by MaiTa Corp. (Nanjing, China). Acetone (anhydrous, 99.5%), isopropanol (IPA, anhydrous, 99.5%) and ethanol (anhydrous, 99.9%) were purchased from Aladdin.

Device fabrication: The SiO2/Si substrate was thoroughly cleaned in acetone, IPA, and ethanol by ultrasonic treatment for 10 min. MoS2, h-BN, and MoSe2 flakes were mechanically exfoliated with Scotch tape and sequentially transferred onto the SiO2/Si substrate with PDMS assistance, with heating to 75 °C, 65 °C, and 75 °C, respectively, during each transfer. The electrodes were patterned on the MoSe2/h-BN/MoS2 heterostructure by ultraviolet lithography. Then, Cr/Au (10/60 nm) contact pads were deposited by electron beam evaporation, followed by a standard lift-off process in acetone. To improve the ohmic contacts between the Au electrodes and MoSe2, the as-fabricated devices were annealed at 300 °C in an argon atmosphere. Furthermore, we fabricated a 3 × 3 device array (Figs. S30, S31). Despite slight device-to-device variations (Fig. S32), the essential functional switching mechanism remains consistently robust. Future work will therefore focus on refining the heterostructure fabrication process to enhance device-to-device reproducibility and reduce fabrication variability, which represents one of the critical steps for scaling towards intelligent systems.

Material and device characterizations: Raman, PL, and atomic force microscopy (AFM) measurements of MoSe2/h-BN/MoS2 were performed with a Raman-atomic force system (Alpha300RA, WITec) under a 532 nm excitation laser diode (2 mW). TEM and energy-dispersive X-ray spectroscopy (EDX) of MoSe2/h-BN/MoS2 were performed using an electron microscope (Talos F200S) and Spectrum SUPER X, respectively. The morphology and elemental mapping were measured by scanning electron microscopy (SEM, ZEISS EV0MA15) and an energy-dispersive spectrometer (EDS, SDD type 80T), respectively. XPS spectra of the heterojunctions were measured using a Thermo Scientific K-Alpha with an Al Kα (hν = 1486.8 eV) emission source. The optoelectronic properties of the photoelectric memory were measured with a SemiProbe probe station, a semiconductor parameter analyzer (Keithley 4200), and Platform Design Automation (PDA, FS-Pro). The 450, 520 and 635 nm lasers were driven and controlled by a programmable DC power supply (Itech Electronic, IT6100B), a function generator and an irradiatometer. A supercontinuum light source (SuperK Compact) combined with a tunable fiber filter (WLTF-NM-P-1550) was used to emit laser light of different wavelengths and with varying linewidths. The noise spectral density was measured using a semiconductor parameter analyzer equipped with a noise testing module (FS-Pro, Primarius). All device measurements were performed under ambient conditions.

Implementation of ResFCNet: In the AVA simulation framework, the neural network consists of an input layer accepting 16-dimensional feature vectors, followed by a sequence of residual blocks with progressively increasing dimensionality (256, 512, and 1024 units, respectively). Each residual block is composed of multiple dense layers with regularization and ReLU activation functions, with skip connections to preserve gradient flow in deep network configurations. The output layer is designed to produce 1650-dimensional vectors for spectral reconstruction. For a comprehensive guide on how to reproduce the specific results and figures presented in this work, including the model training (Fig. S19) and spectral reconstructions (Fig. 3c–f and Fig. S20), please refer to the Appendix: Implementation and Utilization of the ResFCNet Architecture in the Supplementary Information.
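
The layer dimensions described above can be made concrete with a forward-pass sketch. This is not the ResFCNet implementation of this work (see the Supplementary Information for that); it is a NumPy-only illustration that assumes two dense layers per residual block, a linear projection on the skip path to match dimensions, random untrained weights, and no regularization. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """He-initialised weight matrix and zero bias for one dense layer."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

def make_block(n_in, n_out):
    # Assumption: two dense layers plus a linear projection on the skip path.
    return {"fc1": dense(n_in, n_out), "fc2": dense(n_out, n_out), "skip": dense(n_in, n_out)}

def block_forward(x, block):
    w1, b1 = block["fc1"]
    w2, b2 = block["fc2"]
    ws, bs = block["skip"]
    h = relu(x @ w1 + b1)
    h = h @ w2 + b2
    return relu(h + x @ ws + bs)   # skip connection added before the activation

widths = [16, 256, 512, 1024]      # input size and block widths from the text
blocks = [make_block(a, b) for a, b in zip(widths[:-1], widths[1:])]
head = dense(1024, 1650)           # 1650-dimensional spectral output

def forward(x):
    for block in blocks:
        x = block_forward(x, block)
    w, b = head
    return x @ w + b

# Batch of four hypothetical 16-dimensional feature vectors.
out = forward(rng.standard_normal((4, 16)))
```

The sketch only checks that the stated dimensions compose into a 16-to-1650 mapping; training, regularization and the exact block depth follow the Supplementary Information rather than this example.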