Introduction

The surge in neural-network scale precipitates an exponential rise in energy expenditure and training cost, underscoring the need for more efficient alternatives1. Optical computing stands out as an effective substitute for electronic computation devices, particularly in artificial intelligence applications, in which optical neural networks demonstrate marked superiority in speed and energy efficiency over their electronic counterparts2,3,4,5,6,7,8,9,10,11,12,13. Passive optical diffractive neural networks (DNNs) have faced challenges in programmability, whereas programmable metasurfaces provide real-time programmability14,15,16, which is crucial for applications like computational imaging17,18 and adaptive intelligent devices19. Recent advancements in DNNs in the microwave band include programmable diffractive deep neural networks20, dynamic holography21, planar diffractive deep neural networks22, and surface plasmon polariton neural networks23,24. These studies highlight the microwave DNN’s inheritance of the key strengths of optical computing, including high-speed processing, parallelism, and energy efficiency.

Considering that the normal diffraction process is linear, achieving efficient and high-speed nonlinear activation is a key challenge in DNN research. Hence, significant research efforts have been directed towards developing DNNs that are integrated with nonlinear activation functions25,26,27,28,29,30,31,32. However, these nonlinearities face substantial practical limitations. For instance, the nonlinearity derived from semiconductor quantum dots (QDs)33 is inherently constrained by the conversion efficiency of QDs, which is characterized by low transmission and significant energy attenuation, impeding the cascading of multiple nonlinear layers. Optical nonlinear activation mechanisms, such as photonic saturable absorbers29,34 and saturation of radiofrequency amplifiers35, exhibit high thresholds and necessitate high-energy inputs to elicit nonlinear responses. Conventional CMOS sensor-based photoelectric nonlinearities36,37, constrained by exposure duration, exhibit millisecond (ms)- or microsecond (μs)-level latency, which impedes high-speed computational performance. Furthermore, nonlinearities that rely on analog-to-digital converters (ADCs), digital-to-analog converters, and microcontroller units (MCUs)23 demand complex peripheral devices, which significantly limit the system’s computation rate due to the processing constraints of digital circuits. These issues further impede the capability of DNNs to demonstrate systematic advantages over traditional digital computation in practical tasks.

To address these challenges, here we propose a multilayer nonlinear diffractive neural network (MN-DNN) with a programmable and fast Rectified Linear Unit (ReLU) activation function operating at 5.8 GHz. The network features multiple nonlinear layers composed of nonlinear metasurfaces integrated with RF detectors, amplifiers, and voltage adders, which exhibit intensity-dependent nonlinear transmission coefficients and hence can effectively act as the ReLU activation function. By selecting an optimized high-speed device, the engineered single-layer nonlinear metasurface achieves nanosecond-scale delays (17.7 ns), representing a substantial reduction compared to the millisecond (ms) and microsecond (µs) latency typically associated with conventional nonlinear optoelectronic conversion processes26,36,37. The ReLU thresholds can be adjusted through diverse bias voltages, liberating the nonlinear activation from depending on high-intensity illumination and enhancing adaptability across varied intensity conditions. The gain of the nonlinear metasurfaces is controllable via bias voltage, mitigating energy loss and enabling the stacking of multiple nonlinear layers. We show that the MN-DNN can enhance image classification on the Modified National Institute of Standards and Technology (MNIST) and Fashion-MNIST datasets, with accuracies of 92.81% on MNIST and 78.8% on Fashion-MNIST, surpassing the linear networks by over 4% in both cases. It can also directly classify human postures from electromagnetic waves in real time with an accuracy of 93.06%, surpassing the linear network’s accuracy of 81.85%. The advantages of this network in low latency and high accuracy make it ideal for applications in real-time sensing, motion detection, wireless communication, radar signal processing, and beyond.

Results

Principle of MN-DNN

To address the challenges of nonlinear activation, we propose an MN-DNN featuring low latency and adjustable nonlinearity for fast and accurate computations. Traditional DNNs require digital encoding of the input data, which increases complexity and latency. Our architecture transcends the conventional encoding and enables the processing of both encoded input data and direct electromagnetic wave information, such as for the real-time human posture classification shown in Fig. 1a. The direct processing significantly enhances the computational speed and substantially simplifies the hardware. The MN-DNN architecture comprises linear and nonlinear layers, and the latter employs metasurfaces that incorporate RF amplifiers, detectors, and voltage adders to emulate the ReLU nonlinear activation function based on the relation between the input and output power. Additionally, the implementation of diverse activation functions such as Hyperbolic Tangent (Tanh) and Leaky Rectified Linear Unit (Leaky-ReLU) is straightforward with alternative active circuit configurations, as elaborated in the “Discussion” section.

Fig. 1: Multi-task processing with MN-DNN.

a Schematic of the MN-DNN operation. The network is able to perform both dataset-based image recognition and real-time human posture classification. Incident EM waves, encoded with input information, are processed by the MN-DNN comprising three linear and three nonlinear layers. Each nonlinear unit integrates an RF detector, an amplifier, and a voltage adder. The classification result corresponds to the location of the maximum energy on the output plane. b Input-output response of the nonlinear metasurface unit, exhibiting a ReLU-like characteristic with a dynamically adjustable threshold and slope. The output is negligible for weak inputs and increases linearly beyond the threshold. c Experimentally measured time delay of a single nonlinear metasurface layer (17.7 ns). Experimental demonstration of classification capabilities on the MNIST (d), Fashion-MNIST (e), and human posture (f) tasks.

In addition to real-time processing advantages, the MN-DNN architecture also exhibits strengths in its programmability and low latency. The nonlinearity is tunable via external bias voltage, enabling adjustments to the activation threshold and slope, as shown in Fig. 1b. This flexibility eliminates dependence on high power, broadening its applicability across various scenarios. Low latency is essential for swift information processing, but the nonlinearities in current neural networks often introduce millisecond to microsecond delays owing to their reliance on optoelectronic conversion23,26,36,38, hindering the potential for computation at the speed of light. However, our architecture achieves a latency of 17.7 ns (Fig. 1c), marking a substantial reduction in delay and underscoring its importance for real-time processing. Hence, the MN-DNN boosts the information extraction capabilities, outperforming the linear networks with over 4.2% higher accuracy in the classification of both the MNIST and Fashion-MNIST datasets, as depicted in Fig. 1d, e. The MN-DNN also enables real-time extraction of information from electromagnetic (EM) waves. As shown in Fig. 1f, it accurately classifies eight postures with 93.06% accuracy, enhancing the computational efficiency and reducing the hardware complexity over conventional methods. Hence, it can be used in human-computer interaction, intelligent surveillance, and industrial inspection.

Architecture of nonlinear metasurface unit

In MN-DNN, the nonlinear layer is implemented with a nonlinear metasurface. The nonlinear unit structure comprises four metal layers and three dielectric layers. The three dielectric layers, from top to bottom, are as follows: 0.5 mm-thick F4B (relative permittivity εr = 2.65), 0.2 mm-thick FR4 (relative permittivity εr = 4.4), and another 0.5 mm-thick F4B (relative permittivity εr = 2.65). The RF energy is coupled by the receiving antenna, passes through the active circuit, and is radiated by the transmitting patch antenna into the air, as illustrated in Fig. 2a.

Fig. 2: Architecture and operational characteristics of the nonlinear metasurface unit.

a Unit structure and functional equivalence. b Front and back views of the unit. c Operational schematic. d RF detector’s output DC voltage increases with incident power density. e RF amplifier gain is controlled by DC voltages VCC and VCTRL. Gain response versus VCC (f) and VCTRL (g), both showing a cutoff region at low voltages, followed by an increase and eventual saturation. h Linearity of the output energy versus input under high VBIAS (VBIAS > 1.2 V), and gain increases with VCC. i, j Transmission coefficient rises from cutoff to saturation with incident energy; VBIAS lowers energy threshold (i), and VCC raises transmission saturation (j). k With VBIAS = 1 V and VCC = 5 V, transmission is zero below 1 mW and increases linearly above it, resembling a ReLU response. l Activation threshold decreases with VBIAS. m Gain increases with VCC.

The front-side passive structure comprises an octagonal receiving patch antenna and a T-junction unequal power divider, as shown in Fig. 2b. Optimized for operations at 5.8 GHz, the patch antenna exhibits a side length dimension of D = 7.57 mm to achieve proper resonance characteristics. For impedance matching, the octagonal patch antenna is slotted with a length L1 = 4.91 mm and a width W1 = 1 mm. The microstrip line width W2 is 1 mm. The detection arm of the T-junction features a two-stage impedance transformation, refined by the corner chamfering to ensure a smoother impedance transition. The first branch has a width W3 = 0.5 mm with length L3 = 1.746 mm, while the second branch has a width W4 = 0.254 mm with length L4 = 1 mm. The unit’s back-side features an octagonal transmitting patch antenna identical in size to the receiving antenna, as illustrated in Fig. 2b. The operational diagram of the nonlinear metasurface unit is shown in Fig. 2c. The spatial RF energy, post-coupling by the receiving antenna and microstrip line transmission, is split by the T-junction power divider. The bulk of energy is amplified by the RF amplifier and proceeds to the transmitting antenna, while a fraction of the energy is diverted to a high-speed RF detector and is converted into direct current (DC) voltage VOUT. The combined voltage of VOUT and external bias voltage VBIAS is summed by a voltage adder: VCTRL = VOUT + VBIAS. This control voltage is used to modulate the amplifier. Details of the nonlinear metasurface unit structure are presented in Supplementary Note 1.
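The signal chain described above (detector, voltage adder, gain-switched amplifier) can be sketched behaviorally. The constants below are illustrative placeholders, not the measured device parameters; the 1.2 V control-voltage threshold follows the amplifier characterization reported later.

```python
# Behavioral sketch of one nonlinear metasurface unit.
# All constants are illustrative assumptions, not the measured device values.
K_DET = 0.5   # detector slope: VOUT volts per mW of incident power (assumed)
V_TH = 1.2    # amplifier control-voltage threshold, volts (from Fig. 2g)
GAIN = 3.0    # stable power gain once the amplifier is active (assumed)

def unit_output(p_in_mw, v_bias):
    """Output power (mW) of the unit for incident power p_in_mw (mW)."""
    v_out = K_DET * p_in_mw    # RF detector: DC voltage tracks input power
    v_ctrl = v_out + v_bias    # voltage adder: VCTRL = VOUT + VBIAS
    if v_ctrl < V_TH:          # below threshold: amplifier cut off
        return 0.0
    return GAIN * p_in_mw      # above threshold: linear amplification

# Raising VBIAS lowers the incident power needed to activate the unit.
assert unit_output(1.0, 0.5) == 0.0   # 0.5*1 + 0.5 = 1.0 V < 1.2 V: cut off
assert unit_output(2.0, 0.5) == 6.0   # 0.5*2 + 0.5 = 1.5 V: active
assert unit_output(0.5, 1.5) == 1.5   # high bias: active even for weak input
```

This hard-switch model reproduces the qualitative behavior only; the measured transition between cutoff and amplification is smooth.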

The RF detector’s output voltage VOUT is positively correlated with the input RF energy, as illustrated in Fig. 2d. The gain of the RF amplifier is modulated jointly by the DC voltages VCC and VCTRL, as shown in Fig. 2e. The effect of the DC voltage VCC on the RF amplifier’s gain is presented in Fig. 2f. We note that the gain falls below −15 dB when the voltage is less than 0.5 V, effectively muting the amplifier with negligible energy transfer. Above the threshold, increasing VCC shifts the gain into the amplification zone until it reaches a stable value. Figure 2g indicates that low VCTRL values result in a transmission coefficient below −10 dB, halting the transmission. Above 1.2 V, VCTRL initiates the signal amplification, maintaining a stable gain despite further increases. With a high external bias voltage VBIAS (e.g., VBIAS = 1.5 V), the amplifier’s VCTRL surpasses the amplification threshold, independent of the detector’s output voltage VOUT. As shown in Fig. 2h, in the linear amplification regime, the output power of the metasurface unit is linearly proportional to its input, acting as a linear amplitude modulator with VCC governing the amplitude of the transmission coefficient.

With low VBIAS, the unit operates nonlinearly, and its transmission coefficient changes with the incident power. As the incident power increases, the detector’s VOUT and the amplifier’s control voltage VCTRL rise linearly. The amplifier gain rises from below −10 dB to a stable level, and the metasurface unit’s transmission coefficient increases from a low value to a stable state. Higher VBIAS leads to an earlier transition into linear amplification, while lower VBIAS requires more input power to reach the turning point of the transmission coefficient, as shown in Fig. 2i. The transmission coefficient S21 escalates with the VCC magnitude under constant input power and VBIAS, as illustrated in Fig. 2j. We note that the detector’s VOUT and the amplifier’s VCTRL are low when the incident power on the metasurface is below 1 mW, leading to amplifier cutoff and zero output from the metasurface unit. Beyond 1 mW, the voltages increase, activating the amplifier’s linear amplification and causing the metasurface unit’s output power to rise linearly with the input power, which emulates the ReLU function in neural networks, as demonstrated in Fig. 2k. In this scenario, VBIAS is 0.5 V and VCC is 5 V. Figure 2l indicates that higher VBIAS reduces the nonlinear activation function’s threshold, enabling variable nonlinear activation thresholds by modulating the bias voltage VBIAS. Figure 2m shows that increasing VCC steepens the slope of the nonlinear activation function, resulting in elevated amplifier gain and transmission coefficients for the metasurface units. Based on the measured data (details of the nonlinear metasurface tests are provided in Supplementary Note 2), we construct scatter plots and employ curve fitting to derive the mathematical expression of the activation function, which exhibits a ReLU-like form:

$$y=b\times \max \left(0,\,x-a\right)$$
(1)

where x represents the input power, y denotes the output power, a is the threshold parameter, and b is the slope parameter. This nonlinear activation function offers 17.7 ns latency in the measurements (see Supplementary Note 3 for details). Each nonlinear unit exhibits a total power dissipation of approximately 429 mW, which consists of 150 mW for the RF amplifier, 53 mW for the RF detector, and 224 mW for the voltage adder. Supplementary Note 4 and Supplementary Table S2 compare the performance of diverse nonlinear implementation methods in DNNs, focusing on metrics such as time delay, threshold, multi-layer stacking, and programmability.
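Equation (1) is straightforward to check numerically. A minimal sketch with illustrative threshold and slope values (a = 1 mW, b = 2, chosen for the example, not fitted from the measurements):

```python
import numpy as np

def relu_activation(x, a, b):
    """Eq. (1): y = b * max(0, x - a).
    x, y: input/output power (mW); a: threshold; b: slope."""
    return b * np.maximum(0.0, x - a)

x = np.array([0.5, 1.0, 2.0, 4.0])
y = relu_activation(x, a=1.0, b=2.0)
# Below the 1 mW threshold the output is zero; above it, linear with slope b.
assert np.allclose(y, [0.0, 0.0, 2.0, 6.0])
```

Modulating VBIAS and VCC corresponds to moving the parameters a and b, respectively, as Fig. 2l, m show.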

Handwritten digits classification by MN-DNN

The primary objective is the classification of the MNIST dataset, which involves identifying ten classes of handwritten digits ranging from 0 to 9, as shown in Fig. 3a. The original 28 × 28 pixel images are downsampled to 14 × 14 pixels via bilinear interpolation to match the dimension of the input-layer metasurface, and subsequently binarized to values of 0 or 1. Further details on image processing for recognition are outlined in Supplementary Note 5, with an analysis of the effect of image preprocessing on recognition accuracy provided in Supplementary Note 6. Intensity modulation is achieved by mapping the input image pixels onto the first-layer metasurface. The background is opaque to the EM waves, with the digits being transparent (transmission coefficients of 0 dB for pixel 0 and 10 dB for pixel 1), as illustrated in Fig. 3b. Consequently, the EM wave conveys the information of the input image as it passes through the first layer.
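The preprocessing step can be sketched as follows. For an exact 2× downscale, simple 2×2 block averaging is used here as a close stand-in for the paper's bilinear interpolation; the 0.5 binarization threshold is an assumption.

```python
import numpy as np

def preprocess(img28, thresh=0.5):
    """Downsample a 28x28 image to 14x14 and binarize to {0, 1}.
    2x2 block averaging approximates bilinear interpolation
    for an exact 2x downscale; thresh=0.5 is an assumed cutoff."""
    small = img28.reshape(14, 2, 14, 2).mean(axis=(1, 3))
    return (small > thresh).astype(np.uint8)

img = np.zeros((28, 28))
img[8:20, 12:16] = 1.0        # a crude vertical stroke, like a "1"
binary = preprocess(img)
assert binary.shape == (14, 14)
assert set(np.unique(binary)) <= {0, 1}
```

The resulting 14 × 14 binary array maps one-to-one onto the transmission coefficients of the input-layer metasurface units.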

Fig. 3: Handwritten digits classification.

a Example images from the MNIST dataset, which comprises 10 digit classes (0–9). b Encoding binarized images into the transmission coefficients of the input metasurface. c Experimental test setup. d Schematic of the two network architectures: a nonlinear network with interleaved linear and nonlinear layers, and a linear network solely with linear layers. e t-SNE visualization of test set outputs for both networks, showing clearer class separation and tighter clustering in the nonlinear network. f Confusion matrices show 92.6% (nonlinear) and 88.5% (linear) accuracy on 5000 test images. g Sampled output field distributions generated by the linear and nonlinear networks for a simple image, demonstrating that both networks focus on the target region. h Sampled output field distributions for two complex images from both networks, in which the nonlinear network focuses on the correct regions while the linear network focuses on incorrect regions.

As shown in Fig. 3d, we independently train linear and nonlinear neural networks using the full MNIST dataset (60,000 for training and 10,000 for testing) to evaluate their image classification performance. The linear network comprises three linear phase-modulation layers spaced 20 cm apart, each composed of a 22 × 22 array of 19 × 19 mm2 units. The phase of each unit serves as a trainable parameter, adjustable in the range from 0 to 2π (see details in Supplementary Note 7). The nonlinear network features an alternating linear-nonlinear structure with three linear layers interspersed with three nonlinear metasurface layers, spaced at 10 cm intervals. Each nonlinear layer contains two independent trainable parameters: the threshold and slope of the ReLU activation function, with layer-specific values (see Supplementary Note 8 for the mathematical modeling of the RF components). The nonlinear network consists of 1458 trainable parameters, with 484 parameters per linear layer and 2 parameters (threshold and slope) per nonlinear layer, yielding a total of 484 × 3 + 2 × 3 = 1458, and all parameters are trained together. In comparison, standard fully connected networks for MNIST classification typically require tens to hundreds of thousands of parameters, whereas our model achieves an order-of-magnitude reduction with only ~1500 parameters. The linear and nonlinear networks remain fully independent throughout their design and training phases, without shared parameters or cross-influence. The loss function is a weighted sum of the Mean Squared Error and SoftMax Cross-Entropy, with coefficients 0.4 and 0.6, respectively. The gradient descent optimization is performed using the Adam algorithm. The mathematical model of DNNs is detailed in Supplementary Note 9.
The optimization procedure explicitly accounts for the field distribution of 5.8 GHz horn antenna on the input plane under the test conditions (Supplementary Note 10), to reduce the discrepancies between the simulations and measurements. The training outcomes indicate that the nonlinear network achieves a recognition accuracy of 92.81%, which is 4.18% higher than the 88.63% accuracy of the linear network. It is important to note that this improvement does not result from the increase of trainable parameters, as each nonlinear layer only adds two trainable parameters (threshold and slope), and three nonlinear layers contribute a total of six additional parameters. For comparison, a linear network with the same number of parameters achieves at most 89% accuracy, further confirming the performance gain from the nonlinear activation functions (see Supplementary Note 11). The output fields of both networks are reduced to two dimensions via t-SNE, as shown in Fig. 3e. The clustering of the linear network shows a looser intra-cluster structure with overlapping data points across categories, whereas the output of the nonlinear network displays higher local density within clusters and clearer separation between categories.
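The training objective above can be sketched numerically. Treating the ten output-region energies as softmax logits, normalizing them for the MSE term, and using a one-hot MSE target are all assumptions here; only the 0.4/0.6 weighting is from the paper.

```python
import numpy as np

def hybrid_loss(region_energy, label, w_mse=0.4, w_ce=0.6):
    """Weighted sum of MSE and softmax cross-entropy over the ten
    output-region energies (weights 0.4/0.6 as in the paper; the
    normalization and one-hot target below are assumptions)."""
    target = np.zeros(10)
    target[label] = 1.0
    energy = region_energy / region_energy.sum()   # normalized (assumed)
    mse = np.mean((energy - target) ** 2)
    logits = region_energy - region_energy.max()   # numerically stable softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    ce = -log_probs[label]
    return w_mse * mse + w_ce * ce

e = np.ones(10)
e[3] = 10.0                                 # energy concentrated in region 3
assert hybrid_loss(e, 3) < hybrid_loss(e, 7)  # correct class gives lower loss
```

In the actual optimization, this scalar loss is backpropagated through the diffraction model with Adam, updating the 484 phases per linear layer and the per-layer ReLU threshold and slope jointly.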

Based on the simulation results, nonlinear and linear network samples are fabricated, in which the metasurfaces are mounted on acrylic plates with plastic screws and aligned into the grooves of aluminum alloy brackets. Absorptive materials are applied to the plates and brackets to reduce the edge diffraction of EM waves and to minimize their influence on the EM fields, see Supplementary Note 12 for details. The nonlinear prototype includes an input layer, three linear layers and three nonlinear layers of metasurfaces. The input-layer metasurface functions as a configurable linear amplitude modulator in a 0–10 dB range, which is implemented by a nonlinear metasurface in linear amplification mode, controlled by a bias voltage. It consists of 14 × 14 units, where the transmission coefficient of each unit matches the input image pixel values, modulated by an MCU-regulated bias voltage. The activation function parameters of nonlinear metasurfaces are modulated by the external DC voltage. The linear prototype comprises an input-layer metasurface and three linear phase-modulating metasurfaces, configured analogously to the nonlinear network. Both linear and nonlinear prototypes are excited by a 5.8 GHz horn antenna.

The input-layer encoding was performed through intensity modulation. As depicted in Fig. 3c, a two-dimensional near-field scanning platform recorded the output field distributions. To comprehensively evaluate the network performance, we systematically assessed the recognition accuracy of both networks using 5000 randomly selected images. In the experimental setup, a rapid detection device was used to replace the conventional two-dimensional near-field scanning stage (see Supplementary Note 13), enabling efficient energy measurements across ten specific regions of the output field for each image. This approach significantly enhanced the detection throughput, allowing rapid acquisition of classification results for large-scale image datasets. The MNIST rapid classification experiments conducted with this device are presented in Supplementary Video 1. The experimental results demonstrate that the nonlinear network achieves a classification accuracy of 92.6% on the 5000 MNIST images (compared to a simulated accuracy of 92.81%), exhibiting a clear advantage over the linear network (88.5% accuracy), as shown in Fig. 3f. To further analyze the networks’ behavior, we performed output field distribution scans on a subset of images. For simple images with distinct features such as the handwritten digit “7” (Fig. 3g), both linear and nonlinear networks achieve accurate recognition, with their output energies precisely concentrated on the target category’s region. However, when processing complex images with easily confusable features, the nonlinear network shows a pronounced advantage. As seen in Fig. 3h, for a “4” sample prone to confusion with “9”, the nonlinear network’s output energy is correctly localized to the “4” region, whereas the linear network misclassifies it, with its peak energy erroneously appearing in the “9” region, although the “4” region has the second-highest energy.
This phenomenon remains consistent across other complex image classifications, demonstrating the nonlinear network’s superior performance in handling complex images. Further comparisons are presented in Supplementary Note 14. A detailed comparison between simulated and experimental results is given in Supplementary Note 15. The total power of MN-DNN is about 252.352 W (see Supplementary Note 16 for details).
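The readout itself is a simple argmax over the ten region energies. A minimal sketch, with the energy vectors below invented purely to mirror the "4" vs. "9" confusion case described above:

```python
import numpy as np

def classify(region_energies):
    """Readout: the predicted digit is the index of the output region
    with the maximum measured energy (ten regions, one per class)."""
    return int(np.argmax(region_energies))

# Illustrative (not measured) energies for the confusable "4" sample:
nonlinear_out = np.array([0.1, 0.2, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.6])
linear_out    = np.array([0.1, 0.2, 0.1, 0.1, 0.6, 0.1, 0.1, 0.1, 0.1, 0.9])
assert classify(nonlinear_out) == 4   # peak in the correct "4" region
assert classify(linear_out) == 9      # peak misplaced in the "9" region
```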

Fashion-MNIST classification by MN-DNN

Next, we compare the classification performance of linear and nonlinear networks on the complete Fashion-MNIST dataset, which consists of 60,000 training samples and 10,000 test samples. This dataset comprises ten apparel categories: T-shirts, trousers, pullovers, dresses, coats, sandals, shirts, sneakers, bags, and ankle boots, as illustrated in Fig. 4a. The 28 × 28 pixel images are resized to 14 × 14 pixels via bilinear interpolation to match the input-layer metasurface and then binarized. Intensity modulation is applied by mapping the pixel values onto the first-layer metasurface. To assess the image classification capabilities of linear and nonlinear neural networks, we optimize both structures to map the output field distributions of ten classes to ten distinct regions, as shown in Fig. 4b. The network architectures for Fashion-MNIST recognition mirror those employed in MNIST, and both linear and nonlinear networks undergo independent optimization processes without any parameter sharing. The nonlinear diffraction network achieves 78.8% accuracy on Fashion-MNIST, surpassing the linear network (74.43%) by 4.37 percentage points. Similar to the previous case, this improvement is not due to the increased parameter count, as each nonlinear layer only introduces two trainable parameters (the threshold and slope), contributing a total of six additional parameters, which are negligible compared to the model’s 1458 parameters. Under the same parameter conditions, the linear network achieves 75.08% accuracy (Supplementary Note 11), further demonstrating the performance gain from the nonlinear architecture. The accuracy comparison with other networks is given in Supplementary Note 17. Based on the simulations, we fabricate the nonlinear and linear prototypes. During experimental evaluation, a set of 5000 randomly sampled Fashion-MNIST test images is intensity-encoded and processed through both nonlinear and linear network configurations.
The recorded recognition accuracy reaches 77.4% and 72.5% for the nonlinear and linear networks, respectively. The resulting output field distributions of linear and nonlinear samples from the horn antenna irradiation are shown in Fig. 4c, d. For simple and distinct images like the ankle boot in Fig. 4c, both linear and nonlinear networks achieve accurate classification, with the output energy correctly focused on the target region. However, when dealing with complex images characterized by similar and confusable features, the nonlinear network demonstrates a clear advantage. As illustrated in Fig. 4d, the nonlinear network correctly classifies the sneaker image (often confused with ankle boots), with the output energy concentrated specifically in the sneaker region. In contrast, the linear network misclassifies this input, displaying the peak energy in the wrong ankle boot region despite the secondary activation in the correct sneaker category. This consistent pattern across challenging images (see Supplementary Note 18 for more results) confirms the nonlinear networks’ superior performance in recognizing the complex images.

Fig. 4: Fashion-MNIST classification.

a The 10-class fashion item dataset for recognition. b Comparison of two network architectures: the nonlinear network with alternating linear and nonlinear layers, and the linear network with only linear layers. c Sampled output field distributions generated by the linear and nonlinear networks for two simple images. Both networks successfully focus on the target regions. d Sampled output-field distributions for three complex images from both networks. The nonlinear network focuses on correct regions and the linear network on incorrect regions. e Recognition accuracy increases with the number of nonlinear layers. f Effect of nonlinear layer positioning on the accuracy, with the front and back placements outperforming the central insertions. g Impact of activation function type. The ReLU function provides the most significant performance improvement.

In addition to the experiments, we conduct a series of simulation analyses to determine the impact of the number, placement, and type of activation functions in nonlinear layers on the recognition accuracy. We assess different network configurations with varying numbers of nonlinear layers: a three-layer purely linear network; a network with an additional nonlinear layer between the first and the second linear layers; a network with two nonlinear layers interspersed among the linear layers; and a network with three nonlinear layers evenly distributed among the linear layers. The results show that incorporating nonlinear layers can significantly enhance the recognition accuracy over the purely linear network, with the accuracy increasing as the number of nonlinear layers grows, as illustrated in Fig. 4e. In the three-linear-layer framework, we introduce nonlinear layers at three different locations: between the first and the second linear layers (front), between the second and the third linear layers (middle), and between the third linear layer and the output plane (rear). The results illustrate that the recognition accuracy increases with the addition of nonlinear layers at all positions, with the largest improvements at the front and rear, as depicted in Fig. 4f. In a three-linear-layer structure, we introduce a nonlinear layer preceding the output plane to test the ReLU, SoftMax, and Tanh activation functions. Each function boosts the accuracy significantly, with ReLU showing the greatest enhancement, as depicted in Fig. 4g. For more details, refer to Supplementary Note 19.

Static posture recognition by MN-DNN

We develop a posture recognition system using the MN-DNN, as shown in Fig. 5a. A test person stands in front of the MN-DNN and performs various postures, which affect the EM-wave focusing on the output plane. Our classification targets include eight distinct human body postures: stand, arms down, cheer, hands up, greet, scratch head, left arm, and sideways. Firstly, we collect a dataset of the postures. The system employs a standard horn antenna operating at 5.8 GHz to illuminate the test person. Posture information is extracted using an eight-element patch antenna array in a 3-2-3 arrangement (5.8 GHz resonant frequency) that samples the scattered field distributions from the human body. Signals from these antennas are sequentially captured by a Vector Network Analyzer (VNA), managed by a computer that oversees an MCU to control a single-pole eight-throw RF switch (refer to Supplementary Note 20). Four volunteers are designated as trainers and perform extensive data acquisition for the eight postures under various conditions, as depicted in Supplementary Fig. S42. These include different rotation angles of the human body relative to the antenna array (ranging from −30° to 30° in 10° increments) and varied distances between the human body and the antenna array (from 20 cm to 50 cm in 10 cm increments), as presented in Supplementary Note 21. The input layer of our system features 8 active units with a transmission coefficient of 10 dB, arranged in a 3-2-3 array configuration, aligning with the antenna array’s positioning during the dataset acquisition. These units are tasked with spatially sampling the scattered EM waves emitted by the human body. The remaining input-layer units are inactive, blocking the transmission of the corresponding EM waves, as illustrated in Fig. 5a. The strategic selection of 8 sampling points in the input layer is crucial to attain high recognition accuracy in the limited dataset acquisition timeframe (see Supplementary Note 22 for more details).
A total of 850 data samples are collected for each posture, resulting in 6800 samples across eight postures, with corresponding labels ranging from 0 to 7. Among these, 70.5% are allocated to the training set, 23.5% to the test set, and 6% to the validation set (see Supplementary Table S9 for more details).

Fig. 5: Posture recognition.

a Schematic of the MN-DNN for posture recognition, where the scattered EM fields from the test person are processed by the MN-DNN. b Equivalent diagram of the MN-DNN. c Experimental setup for posture recognition. d Schematic of the detection setup on the output plane, consisting of eight zones each equipped with an antenna and detector that convert EM energy into DC voltages (V1–V8). These voltages are processed by the MCU’s ADC, which controls the LED colors based on the maximum-voltage channel, and are transmitted to a computer for real-time display. e Experimental results from 50 repetitions of eight postures, showing accurate classification as confirmed by voltage readings across the eight regions.

We further demonstrate a low-latency and high-fidelity posture recognition task based on the proposed MN-DNN architecture. In traditional methods, the input data streams are processed in the digital domain, which limits the processing speed and requires complex front-end analog-to-digital conversion modules. In the proposed MN-DNN, the scattered field undergoes neural network computation during its propagation through the network, thereby overcoming these limitations. The MN-DNN comprises three linear layers and three nonlinear layers, as previously described and depicted in Fig. 5b. The output field distributions for the various postures are mapped to eight distinct regions on the output plane. Simulation results show that the nonlinear network achieves a recognition accuracy of 93.06%, substantially surpassing the linear network's 81.85%, as detailed in Supplementary Note 23.
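A minimal sketch of this alternating linear/nonlinear structure, under stated assumptions: toy dimensions, random stand-in transmission matrices, and a phase-preserving amplitude-threshold activation as an idealization of the ReLU-like nonlinearity. None of these are the trained parameters of the real system.

```python
import cmath
import random

random.seed(1)

N = 8  # toy layer width; the real metasurface has far more units

def rand_transmission_matrix(n):
    # Stand-in for a diffraction/transmission matrix (assumption: the real
    # linear layers are set by wave propagation and the trained unit codings).
    return [[cmath.exp(1j * random.uniform(0, 2 * cmath.pi))
             for _ in range(n)] for _ in range(n)]

def linear_layer(W, x):
    # Diffractive propagation acts as a complex matrix-vector product.
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

def relu_like(x, threshold=0.5):
    # Idealized ReLU-like amplitude nonlinearity: field components whose
    # amplitude falls below the threshold are suppressed; the phase is
    # preserved (a simplifying assumption of this sketch).
    out = []
    for v in x:
        a = abs(v)
        out.append(v * max(0.0, a - threshold) / a if a > 0 else 0j)
    return out

# Three linear layers interleaved with three nonlinear layers, as in MN-DNN.
field = [complex(random.uniform(0, 1), 0) for _ in range(N)]
for _ in range(3):
    field = relu_like(linear_layer(rand_transmission_matrix(N), field))
```

In the physical system the "matrix multiply" is performed by free-space diffraction at the speed of light, which is the source of the latency advantage discussed below.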

A trainer is invited to perform the test, as depicted in Fig. 5c. The EM intensity across eight specific regions on the output plane corresponds to the classification of the eight distinct postures. In each region, a patch antenna coupled with an RF detector rapidly measures the energy of the area, converting it linearly into DC voltages (V1–V8). The outputs from these detectors are connected to the ADC ports of the MCU for voltage detection. The MCU compares the voltage magnitudes and controls the color of the LEDs based on the channel with the maximum voltage. Each posture classification is associated with a unique LED color: "stand" (white), "arms down" (red), "cheer" (yellow), "hands up" (green), "greet" (blue), "scratch head" (light blue), "left arm" (purple), and "sideways" (orange). Additionally, the MCU transmits the voltage values of the eight channels to a personal computer for real-time display (see Fig. 5d). The underlying principles of the energy detection apparatus on the output plane are elaborated in Supplementary Note 24.
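The maximum-voltage decision rule and LED mapping described above amount to an argmax over the eight detector channels; a minimal sketch (the voltage values below are illustrative):

```python
POSTURES = ["stand", "arms down", "cheer", "hands up",
            "greet", "scratch head", "left arm", "sideways"]
LED_COLORS = ["white", "red", "yellow", "green",
              "blue", "light blue", "purple", "orange"]

def classify(voltages):
    """Pick the posture whose output-plane detector reads the highest voltage."""
    k = max(range(len(voltages)), key=lambda i: voltages[i])
    return POSTURES[k], LED_COLORS[k]

# Illustrative reading: Detector 6 near 1.2 V, all others below 0.5 V.
volts = [0.30, 0.20, 0.40, 0.10, 0.45, 1.20, 0.25, 0.30]
posture, color = classify(volts)  # -> ("scratch head", "light blue")
```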

The tester performs each of the 8 postures 50 times, and the output voltages from the eight RF detectors are logged to measure the intensity across the eight regions. Classification is determined by the posture corresponding to the detector with the highest voltage, as shown in Fig. 5e, where the mean output voltage of the detectors is represented by bar heights and the variances by error bar lengths. Consistent with expectations, the regions with the maximum energy on the output plane align with the executed postures. For example, during the "scratch head" posture, Detector 6 records an output voltage nearing 1.2 V while the others remain below 0.5 V, classifying the posture as "scratch head" in accordance with Fig. 5d. More experimental pictures are presented in Supplementary Note 25. To evaluate the system's robustness, we conduct experiments with four additional participants with diverse body shapes and clothing, in complex backgrounds, none of whom participated in the initial data collection. The results show an average recognition accuracy of 87%. Future improvements could focus on expanding the data collection scope by incorporating more sampling points and recruiting additional participants with varied body shapes, clothing styles, locations, and environmental conditions, which would enhance the system's adaptability and robustness across diverse scenarios (see Supplementary Note 26 for detailed analysis).

Dynamic posture recognition by MN-DNN

MN-DNN can effectively recognize both static and dynamic postures. The tester performs a sequence of postures, holding each for approximately 4 s before transitioning to the next. The sequence encompasses stand, arms down, cheer, hands up, greet, scratch head, left arm, and sideways, as shown in Fig. 6a. The simulated field distributions corresponding to these postures on the output plane are presented in Fig. 6b, with the field's focus moving sequentially through the eight regions in chronological order. The output voltages from the detectors are fed to the MCU and transmitted to a computer in real time via serial communication. Figure 6c presents the temporal voltage waveforms. During the interval from 0 to 4.3 s, the posture is "stand", with Detector 1 exhibiting an output voltage of about 1.2 V while the other detectors show less than 0.55 V, resulting in its classification as "stand". From 4.3 to 8.2 s, as the posture shifts to arms down, the voltage in Detector 1 drops significantly and the voltage in Detector 2 increases to approximately 0.9 V, becoming the maximum and classifying the posture as "arms down". This pattern holds for the subsequent postures, ending at the 32-s mark. Throughout each posture's duration, the relevant detector's output voltage dominates, allowing accurate classification of dynamic movements despite minor outputs from the other detectors. The complete tests are given in Supplementary Video 2. The total system latency is 2.48 μs, comprising both the propagation delay through MN-DNN and the processing delay incurred by the output detection. This represents a latency reduction of at least three orders of magnitude compared with conventional camera systems coupled with digital neural-network post-processing, which typically operate with millisecond-level latency. Since real-time posture recognition typically requires delays on the order of milliseconds39,40,41, the proposed system enables efficient real-time motion recognition with a large margin. By implementing a higher-speed output detection scheme with high-speed ADC modules and a high-performance FPGA, the total system latency could be further reduced to 66.5–68.5 ns. See Supplementary Note 27 for details.

Fig. 6: Dynamic posture recognition with MN-DNN.

a A tester performs eight distinct postures in front of the MN-DNN, switching approximately every 4 s. b Simulated output field distributions for the corresponding postures, with the focal points progressively moving across eight designated regions. c Real-time detected voltage waveforms from the eight regions on the output plane. During each posture, the voltage in the corresponding region is higher than that in the others, indicating the correct classification.

Discussion

We presented a novel MN-DNN by integrating an RF amplifier, an RF detector, and a voltage adder into the metasurface unit, enabling a ReLU-like nonlinearity with nanosecond-level delay that surpasses conventional optoelectronic methods. The system latency can be further reduced through multiple optimization approaches (see Supplementary Note 28 for details). By employing high-performance active components in the nonlinear layer, the time delay can be optimized from the nanosecond down to the picosecond range. For example, using operational amplifiers with high slew rates can reduce the computational delay to below 200 ps, while using fast-response diodes (e.g., fast Schottky barrier diodes) as RF detectors can shorten the delay to 220 ps. Further improvements can be achieved by replacing the conventional RF amplifiers with high-speed RF Schottky diode switches (~6 ps), enabling superior response speeds. Theoretical calculations indicate that the single-layer delay can be reduced to within 426 ps, which is comparable to the fastest reported nonlinear layers; with a moderate trade-off in reconfigurability, it can be further reduced to 226 ps. In addition, the propagation delay can be effectively minimized by optimizing the nonlinear network architecture (e.g., using an increased operational frequency and reduced interlayer spacing), thereby reducing the radial distance between the input and output planes. Regarding activation functions, alternatives such as Tanh and Leaky-ReLU can be implemented by modifying the active circuits of the nonlinear metasurface, as presented in Supplementary Note 29.
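For reference, the idealized transfer characteristics of the activation functions discussed above can be written out as follows; the bias and slope parameters are illustrative assumptions, not measured circuit values:

```python
import math

def relu(v, bias=0.0):
    # ReLU-like response: the output rises linearly once the detected
    # voltage exceeds the (tunable) bias, and is zero below it.
    return max(0.0, v - bias)

def leaky_relu(v, bias=0.0, slope=0.1):
    # Leaky-ReLU alternative: keeps a small below-threshold slope.
    return v - bias if v > bias else slope * (v - bias)

def tanh(v):
    # Saturating Tanh alternative.
    return math.tanh(v)
```

In the physical device, the bias corresponds to a voltage that can be adjusted at run time, which is also what makes the in-situ training discussed below conceivable.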

Furthermore, MN-DNN has the potential for in-situ training, as it can adjust the bias voltages of the nonlinear layer units via FPGA and extract the output voltages from the integrated detectors to represent node intensities. Future implementations will detect the output-plane energy distributions with an RF detector array, followed by gradient computation and weight updates using gradient descent through FPGA for cyclic training. The system can be extended to millimeter-wave and terahertz frequencies by integrating high-frequency semiconductor devices (e.g., InP/SiGe amplifiers42,43,44,45 and Schottky diode detectors46,47) with the metasurface designs. This configuration maintains integration density while supporting higher-frequency operation, resulting in highly integrated systems suitable for miniaturization. Improvements in energy efficiency can be achieved by adopting faster and lower-power devices while scaling up the network. Owing to its strong scalability, our network achieves higher power efficiency when scaled to larger single-layer configurations. For instance, a system with a 56 × 56-unit single nonlinear layer reaches 4 TOPS/W, positioning it among state-of-the-art energy-efficient designs. For image classification tasks, future architectural improvements could target two key aspects to enhance recognition accuracy and adapt to a broader range of tasks. First, integrating a digital-to-analog converter into the input-layer metasurface can enable continuous grayscale inputs, thereby improving information fidelity and subsequent feature extraction accuracy (see Supplementary Note 30). Second, systematically scaling the network dimensions by increasing the metasurface unit density and the number of layers can strengthen the nonlinear processing capacity, ultimately boosting pattern discrimination performance.
In this work, we demonstrated that MN-DNN significantly outperforms linear networks in image classification accuracy on the MNIST and Fashion-MNIST datasets. We also showed the capability of MN-DNN to process spatial EM waves in real time and to accurately identify static and dynamic postures. Owing to its low latency, good adaptability, and wide applicability, the proposed MN-DNN holds significant promise for real-time perception, motion recognition, and information processing.

Methods

Time delay measurement of a nonlinear metasurface

The time-delay measurement system employs a 5.8 GHz signal generator to produce a continuous radio-frequency (RF) signal. This signal undergoes amplitude-shift keying modulation by a square wave (50% duty cycle) received from a waveform generator via an external trigger interface, generating a modulated RF signal with a square-wave envelope. The modulated signal is then radiated by a transmitting antenna to excite the metasurface. To minimize edge diffraction effects, we implement a coding strategy that activates only the central unit while deactivating the surrounding edge units, which are also covered with microwave-absorbing material. For real-time monitoring of the incident signal on the metasurface, a receiving antenna (Antenna 1) is positioned equidistant from the transmitting antenna and the central unit of the metasurface. The RF signal captured by Antenna 1 is transmitted via a coaxial cable to Channel 1 of an oscilloscope. Simultaneously, a second receiving antenna (Antenna 2), placed on the opposite side of the metasurface, detects the transmitted wave and relays the signal through a coaxial cable of identical length to Channel 2 of the oscilloscope. By analyzing the temporal difference between the signal envelopes of the two channels, the transmission delay introduced by the single-layer nonlinear metasurface is precisely quantified. For more details, see Supplementary Note 3.
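The delay extraction from the two envelope traces can be sketched as a discrete cross-correlation; the sampled square-wave envelopes below are synthetic, and the actual analysis performed on the oscilloscope traces may differ:

```python
def cross_correlation_delay(ref, sig):
    """Return the circular lag (in samples) at which sig best aligns with ref."""
    n = len(ref)
    best_lag, best_score = 0, float("-inf")
    for lag in range(n):
        score = sum(ref[i] * sig[(i + lag) % n] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Synthetic 50%-duty-cycle square-wave envelope and a copy delayed by 5 samples,
# standing in for the Channel 1 and Channel 2 traces.
n = 100
ch1 = [1.0 if (i % 40) < 20 else 0.0 for i in range(n)]
ch2 = [ch1[(i - 5) % n] for i in range(n)]

lag_samples = cross_correlation_delay(ch1, ch2)
# With sample interval dt, the transmission delay is lag_samples * dt.
```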

Data collection for posture recognition

Trainers 1 to 4 conduct comprehensive data collection across eight human postures: stand, arms down, cheer, hands up, greet, scratch head, left arm, and sideways. Positioned between a horn antenna and the receiving antenna array, they perform these postures facing the array. The array captures the scattered EM waves, which are then transmitted to a VNA for computer recording. To enhance dataset robustness, data collection encompasses a spectrum of scenarios, including body rotations (from −30° to 30° in 10° increments) relative to the antenna array's normal direction and distances (from 20 cm to 50 cm in 10 cm increments) to the array. A total of 850 samples are collected for each posture, resulting in 6800 samples across all postures, which are labeled from 0 to 7. These samples are randomized and allocated into training (70.5%), testing (23.5%), and validation (6%) sets. Additionally, to further improve dataset robustness, we increase the number of sampling points from 8 to 21. Eight participants with diverse body types and clothing materials perform the eight postures repeatedly under both controlled and interference-rich conditions (5.8 GHz WiFi with multipath reflections). As shown in Supplementary Note 26, this enhanced dataset demonstrates significantly improved robustness in simulations.
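The acquisition grid described above spans 7 rotation angles × 4 distances = 28 conditions per posture, which can be enumerated directly (how the 850 samples per posture are distributed over these conditions is not detailed here):

```python
# Rotation angles and distances swept during data collection.
angles = list(range(-30, 31, 10))      # -30° to 30° in 10° steps -> 7 angles
distances = list(range(20, 51, 10))    # 20 cm to 50 cm in 10 cm steps -> 4 distances

# Every (angle, distance) acquisition condition for one posture.
conditions = [(a, d) for a in angles for d in distances]  # 7 x 4 = 28 conditions
```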

Analysis of the count and position of nonlinear layers

(1) Number of nonlinear layers. We analyze the influence of the nonlinear layer count on recognition accuracy using the MNIST and Fashion-MNIST datasets. The primary focus is on MNIST, where four distinct neural network configurations are simulated: a purely linear network with three layers, and three nonlinear networks with one, two, and three additional nonlinear layers, respectively. The architectures and simulation outcomes are detailed in Supplementary Fig. S35a–d, with scatter plots illustrating the output fields after t-SNE dimensionality reduction. These plots reveal that increased nonlinearity enhances output field clustering, with intra-class samples clustering more tightly and inter-class boundaries becoming more distinct. Supplementary Fig. S35e confirms that recognition accuracy increases with the number of nonlinear layers. Supplementary Fig. S36 presents the Fashion-MNIST recognition outcomes across the four network categories, with accuracy increasing from 74.43% to 78.8%, underscoring a positive correlation between the nonlinear layer count and recognition accuracy. Collectively, the findings from both datasets suggest that adding nonlinear layers improves image recognition accuracy.

(2) Position of nonlinear layers. We investigate the effect of nonlinear layer positioning on recognition accuracy using the MNIST and Fashion-MNIST datasets. The analysis focuses on four neural networks, each with three linear layers and a single nonlinear layer inserted at a different position, including between the first and second layers, between the second and third layers, and after the third linear layer. The detailed architectures and corresponding simulation results on the MNIST dataset are provided in Supplementary Fig. S37a–d. The results demonstrate that introducing a nonlinear layer at any position improves recognition accuracy, with the highest gains observed when the nonlinear layer is near the output layer. This indicates that performance is sensitive to the placement of the nonlinear components. Supplementary Fig. S38 shows the classification performance of these network configurations on the Fashion-MNIST dataset, confirming the trend observed on MNIST, with the most significant accuracy improvement when the nonlinear layer is positioned near the output layer and the least improvement when it is placed in the middle.