Analog in-memory computing attention mechanism for fast and energy-efficient large language models

Leroux, Nathan; Manea, Paul-Philipp; Sudarshan, Chirag; Finkbeiner, Jan; Siegel, Sebastian; Strachan, John Paul; Neftci, Emre

doi:10.1038/s43588-025-00854-1


Article
Open access
Published: 08 September 2025

Nature Computational Science volume 5, pages 813–824 (2025)

A preprint version of the article is available at arXiv.

Abstract

Transformer networks, driven by self-attention, are central to large language models. In generative transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, graphics processing unit (GPU)-stored projections must be loaded into static random-access memory for each new generation step, causing latency and energy bottlenecks. Here we present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable parallel analog dot-product computation required for self-attention. However, the analog gain-cell circuits introduce non-idealities and constraints preventing the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm achieving text-processing performance comparable to GPT-2 without training from scratch. Our architecture reduces attention latency and energy consumption by up to two and four orders of magnitude, respectively, compared with GPUs, marking a substantial step toward ultrafast, low-power generative transformers.

Main

Transformers¹ are central to modern artificial intelligence (AI), powering advances in language models, image processing and beyond. However, their high computational demands lead to substantial energy consumption. Enhancing their efficiency is essential to reduce environmental impact and to keep pace with the exponentially growing size of AI models. The success of transformers as state of the art in sequence processing and generation is enabled by their attention mechanism². To capture dependencies across sequences, the attention mechanism performs dot products between different projections of multiple sequence elements, known as tokens. For generative tasks, the best performance is achieved by autoregressive, decoder-only transformers³. At each inference step, the decoder generates a token, which is then appended to the input sequence, forming the input for the subsequent step. To avoid recomputing the key and value (KV) projections of previously generated tokens, the KV-caching method stores these projections in memory (the KV cache) and updates the cache with the projections of each new token⁴.

In a graphics processing unit (GPU), for each token, the entire KV cache must be transferred from main high-bandwidth memory to cache memory (static random-access memory (SRAM)). In addition, the KV cache is often much larger than the available SRAM memory owing to the dimensions of the stored projections and the sequence length⁵. For instance, the entire KV cache of the model Mistral 7B⁶ requires 8 Gb for a batch size of 1, as necessary for inference workloads. In recent technologies, the energy for data access exceeds the energy required for computations⁷. Loading the KV cache for the attention mechanism is thus a major bottleneck, causing increased energy consumption and latency in large language models (LLMs)⁸. To mitigate this bottleneck, a wide body of literature explores resource-efficient algorithms⁹. Alternative architectures to transformers with linear time complexity are developed to improve long-sequence processing efficiency^10,11. However, transformers continue to exhibit more stable training at scale than alternatives such as Mamba¹¹, which contributes to their ongoing dominance despite the efficiency of state-space models. Alternatively, different methods have been developed to reduce the memory requirements of KV caching through token pruning¹², latent KV-cache compression¹³ or low-rank approximations¹⁴, or by reusing the same KV-cache pairs across multiple heads (grouped-query attention)¹⁵.

While these algorithmic strategies reduce computational and memory overhead, achieving further energy efficiency increasingly depends on hardware innovation. Hardware systems dedicated to specific neural architectures can substantially outperform conventional central processing units and GPUs in terms of energy efficiency¹⁶. In particular, to mitigate the data-transfer overhead of weight loading, several approaches leverage either near-memory or in-memory computing (IMC)^{17,18,19,20,21}. IMC is particularly beneficial when using non-volatile memories to store stationary weights in linear layers²². However, a full optimization of transformers’ inference also requires addressing the attention mechanism, which contributes substantially to the overall computational cost^9,18. Current IMC solutions do not yet meet all the requirements for efficient hardware implementation of attention. Specifically, the KV cache demands fast and energy-efficient memory writing as it is input dependent and must be updated at every generation step. In addition, high parallelism is crucial for low-latency inference, while high memory density is needed for scaling to large models. Finally, long retention time is essential to avoid frequent memory refresh operations. The KV cache has been implemented either with dynamic random-access memories (DRAMs)^21,23, which have limited parallelism requiring many digital sequential adders, or with SRAMs^19,24, which are limited by their volatility and relatively low density²⁵. Non-volatile memories can be used for linear layers of transformers¹⁷, but they are too slow, too energy-intensive and lack the endurance needed for dynamic KV-cache writing^18,22.

In this work, we propose an IMC hardware architecture based on emerging charge-based memory devices, known as gain cells^26,27, to store token projections and compute dot products for the attention mechanism. As a result, gain-cell crossbar arrays simultaneously serve to store the KV cache and to perform attention computation. Gain cells store information in a capacitor, with a dedicated read transistor generating current based on the capacitor’s voltage. Unlike DRAM, this enables non-destructive read operations, supporting highly parallel IMC computations. Gain cells have high endurance, fast write speeds and low write energy, and are multi-level. Oxide semiconductor field effect transistor (OSFET)-based gain cells (for example, indium gallium zinc oxide (IGZO) or indium tin oxide (ITO)) are capable of retaining their state for several seconds without a power supply^28,29,30, can be manufactured with very small feature sizes, achieving higher density than SRAM, and also support three-dimensional (3D) integration, which can further reduce effective area requirements for IMC applications^{28,29,30,31,32,33}.

The analog-to-digital conversion required for analog IMC often hinders the advantages this approach offers, as analog-to-digital converters (ADCs) are power and area intensive³⁴. To mitigate this issue, charge-based integration is an energy-efficient alternative^35,36. Here, we choose to perform the core of the attention mechanism (two dot products, scaling and activation function) fully in the analog domain, using charge-to-pulse circuits for activation and inter-module communication, combined with pulse counters for final readout.

Practical applications of LLMs often rely on pre-trained models to reduce training costs. However, our co-optimization approach introduces specific hardware constraints to enhance architectural performance, which leads to a divergence from standard pre-trained models. The multiplications operated with gain cells are non-ideal. In addition, the normalization in softmax requires summing across all input elements, requiring global connections with an increased hardware complexity scaling with the sequence length^37,38. In our system, the activation function is instead operated element-wise with charge-to-pulse circuits implementing HardSigmoid functions.

To overcome this discrepancy, we introduce an algorithm that adapts a pre-trained language model to our architecture by scaling each layer according to its statistics and hardware characteristics. With our adaptation algorithm, our model achieves accuracy similar to a pre-trained GPT-2 model without having to train the model from scratch. Overall, the contributions of this study are:

An in-memory, mixed analog–digital computing design to store token projections and compute attention dot products with gain-cell arrays at high energy efficiency.
An end-to-end attention mechanism based on analog signals leveraging charge-to-pulse circuits to avoid power- and area-intensive ADCs.
Quantitative performance analysis of a scalable architecture with area floorplan including analog circuits and digital peripheries.
A software-to-hardware methodology to map pre-trained (ideal) models to non-traditional hardware reaching an accuracy equivalent to GPT-2.
Our architecture achieves up to four and two orders of magnitude lower energy consumption and latency, respectively, compared with GPUs.

After detailing the attention mechanism algorithm, we demonstrate its implementation using gain cells and charge-to-pulse circuits. We then show how our approach maps a pre-trained model to our hardware while maintaining high accuracy on common natural language processing (NLP) benchmarks. Finally, we evaluate the architecture’s performance in terms of energy consumption, latency and area footprint.

Results

Attention mechanism

Figure 1a shows the attention mechanism algorithm. In autoregressive transformers, new token projections called queries (Q), keys (K) and values (V) are created for each inference step from the weights ${W}_{Q,K,V}\in {{\mathbb{R}}}^{D,d}$ and an input token ${x}_{i}\in {{\mathbb{R}}}^{1,D}$ as:

$${Q}_{i},{K}_{i},{V}_{i}={W}_{Q,K,V}{x}_{i},$$

(1)

where i is the token index, D is the token dimension and d is the embedding dimension. The keys and values ${K}_{i}\in {{\mathbb{R}}}^{1,d}$ and ${V}_{i}\in {{\mathbb{R}}}^{1,d}$ are stored as part of the full KV cache with $K\in {{\mathbb{R}}}^{T,d}$ and $V\in {{\mathbb{R}}}^{T,d}$, where T is the sequence length. The query ${Q}_{i}\in {{\mathbb{R}}}^{1,d}$ is not stored but used for inference as

$${S}_{i}={Q}_{i}\cdot {K}^{T};\quad {A}_{i}=\phi \left(\frac{{S}_{i}}{\sqrt{d}}\right)\cdot V.$$

(2)

The dot product between the queries and keys produces an attention score matrix ${S}_{i}\in {{\mathbb{R}}}^{1,T}$. In standard transformers, the activation function ϕ is typically a softmax function, but other nonlinear activation functions can yield similar accuracy^10,39,40. In particular, sigmoid-based attention has been shown to match softmax-based attention on models of up to 7 billion parameters⁴⁰. Recent studies show that in the case of sliding window attention⁴¹, the normalization of softmax leads to vanishing memory, while sigmoid-based attention can retain information better^42,43. The output of the attention mechanism A_i is then obtained by the dot product between the activation ϕ(S_i) and the values. In the transformer architecture, multiple attention ‘heads’ are computed in parallel, concatenated and provided to a subsequent linear layer to produce the final multi-head attention result.
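As an illustration of equations (1) and (2), the following minimal sketch performs one autoregressive step with a growing KV cache. It is not the authors' implementation: the function names are hypothetical, a plain sigmoid stands in for ϕ, and the projection is written as a row vector times a D × d matrix so that the shapes agree with the dimensions stated above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def project(x, W):
    """Row vector x (length D) times matrix W (D rows, d columns)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def attention_step(x_i, W_q, W_k, W_v, k_cache, v_cache, phi=sigmoid):
    """One inference step: project the token, extend the KV cache, attend."""
    q = project(x_i, W_q)                 # the query is used once, not stored
    k_cache.append(project(x_i, W_k))     # K gains one row per token
    v_cache.append(project(x_i, W_v))     # V gains one row per token
    d = len(q)
    # S_i = Q_i . K^T: one score per cached token
    scores = [sum(qc * kc for qc, kc in zip(q, k)) for k in k_cache]
    # Element-wise activation of the scaled scores (no softmax normalization)
    weights = [phi(s / math.sqrt(d)) for s in scores]
    # A_i = phi(S_i / sqrt(d)) . V
    return [sum(w * v[c] for w, v in zip(weights, v_cache)) for c in range(d)]
```

Because the caches are passed in and extended in place, calling `attention_step` once per generated token reuses all earlier key and value projections instead of recomputing them.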

**Fig. 1: Building blocks of the analog hardware attention mechanism.**

In decoder-based transformers, causal attention allows the score matrix S to compare the input token with all previous sequence elements. However, to prevent the physical memory size from scaling with the entire sequence length, we employ a type of attention that is both causal and local: sliding window attention⁴¹. In this approach, only a fixed number M of key and value projections are retained in memory and attention scores for elements older than the last M are masked (Fig. 2a). Although sliding window attention is local at each layer, it can still capture global information in deep networks because the receptive field grows with the number of layers⁶.
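The masking rule can be sketched as follows. This is an illustrative example with hypothetical function names; the receptive-field expression is a simple upper bound that assumes each layer can reach at most M − 1 tokens further back, which is an assumption of this sketch and not a figure from the article.

```python
def sliding_window_mask(T, M):
    """mask[t][s] is True when query position t may attend to key position s.

    Attention is causal (s <= t) and local (only the last M positions).
    """
    return [[(s <= t) and (t - s < M) for s in range(T)] for t in range(T)]

def receptive_field_bound(M, num_layers):
    """Upper bound on the number of tokens that can influence one output
    after stacking `num_layers` sliding-window layers of size M."""
    return num_layers * (M - 1) + 1
```

For example, with T = 5 and M = 3, the last query attends only to positions 2, 3 and 4, while two stacked layers can in principle propagate information from up to five positions.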

**Fig. 2: Analog hardware attention pipeline.**

End-to-end analog hardware attention

In this section, we first give an overview of how our architecture performs operations on analog signals to compute attention. Then, we detail how the different circuits operate. Keys K and values V are stored in two gain-cell arrays. The query Q_i is encoded as pulse-width modulation (PWM) pulses and is the input of the first array, performing the dot product Q_i ⋅ K^T. An intermediate charge-to-voltage pulse block integrates the output currents from the first array and outputs PWM voltage pulses for the second array, while applying a HardSigmoid activation function (Fig. 1c). The second array, computing ϕ(S) ⋅ V is read out using a signed charge-to-voltage pulse block, where the resulting pulse widths are measured by a digital counter.

The proposed gain cell, shown in Fig. 1d, contains a write stage for programming the capacitor C₁ and a multiplication stage approximating the product between the input and the capacitor voltage.

The storage capacitor is charged with a multi-level voltage pulse emitted by a digital-to-analog converter (DAC). The voltage pulse is gated to the designated capacitor by a write-enable transmission gate. Due to leakage in the storage capacitors, the voltages gradually decay over time. Figure 1f shows the simulated transient response of the storage capacitor voltage V_store, which corresponds to the cell weight, for the two extreme values 0 V and 0.9 V. An exponential decay fit of the gain-cell leakage reveals that the time constant (that is, retention time) of our silicon complementary metal–oxide–semiconductor (CMOS)-based gain cell is τ = 5 ms. Note that an OSFET-based gain cell can achieve multiple orders of magnitude longer retention times²⁹.

The multiplication stage generates an analog current via a push–pull transistor pair, with its amplitude set by the stored capacitor voltage (V_store), as shown in Fig. 1e. This current is enabled only during the input pulse, which gates it onto the shared bitline, where currents from multiple cells are summed according to Kirchhoff’s law.

In each inference step, both arrays are updated with one column from the key and value matrices, as we will show in more detail in the section ‘Analog hardware sliding window attention data-flow’. The M columns of each array represent the K and V of the previous M tokens, while the rows correspond to the d distinct embedding elements.

Due to temporal input encoding, gain-cell outputs also vary over time and must be integrated to compute the dot product. This is performed by charge-to-pulse circuits (Fig. 1c), which emit PWM voltage pulses. The pulse width increases linearly with accumulated charge, up to a saturation threshold S_sat, as shown in Fig. 1g. The circuit emits pulses only for positive charge, implementing a HardSigmoid activation. Further circuit details are provided in Supplementary Fig. 2.

The pulses representing $\phi \left(S\right)\in {{\mathbb{R}}}^{M}$ are fed as inputs to the second gain-cell array to perform the dot product ϕ(S) ⋅ V. A different type of charge-to-pulse circuit integrates the output currents of the second array. Unlike the first one, this signed charge-to-pulse circuit is capable of generating pulses for both positive and negative input charges, while a D flip-flop stores the result’s sign. The behavior of this circuit for different inputs is highlighted in Fig. 1h. A 16-level digital counter measures the generated pulse widths and multiplies the result by the retrieved sign bit, resulting in a total precision of 32 levels.

Analog hardware sliding window attention data-flow

Having described how inference is performed for one token, we now describe how the architecture processes multiple tokens sequentially. In sliding window attention, the input query is multiplied only with the M most recent keys and values, corresponding to the window size M (Fig. 2a). At each time step, the keys and values must be updated with the most recent token and the oldest one must be forgotten. All other projections remain stationary until they are updated after M cycles. In our implementation, we write the array that encodes the keys and values at inference time in a column-wise manner (Fig. 2b).
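One way to picture this update rule is a circular buffer in which a write pointer cycles over the M columns, so that each new token overwrites the oldest one. The sketch below is only a software analogy: the class and method names are hypothetical, and the round-robin pointer is an assumption consistent with each column being rewritten after M cycles.

```python
class SlidingKVCache:
    """Software analogy of two gain-cell arrays holding the last M tokens."""

    def __init__(self, window, dim):
        self.window = window
        # One column per token, each holding `dim` embedding elements
        self.keys = [[0.0] * dim for _ in range(window)]
        self.values = [[0.0] * dim for _ in range(window)]
        self.count = 0  # number of tokens written so far

    def write(self, k, v):
        """Overwrite the oldest column with the new token's projections."""
        col = self.count % self.window
        self.keys[col] = list(k)
        self.values[col] = list(v)
        self.count += 1
        return col

    def valid_columns(self):
        """Columns that hold real tokens (fewer than M early in a sequence)."""
        return min(self.count, self.window)
```

All columns other than the one being written stay untouched during a step, which mirrors the statement that the remaining projections are stationary until their turn comes after M cycles.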

Figure 2c illustrates the sequential execution of inference steps in the hardware performing sliding window attention. Read and write operations are interleaved for efficiency, as further detailed in ‘Analog sliding window attention timing and execution’ in Methods. To perform attention on sliding window sizes and embedding dimensions larger than a single array can support, sub-tiling is used to stack multiple arrays, as shown in Fig. 3, and detailed in ‘Sub-tiling to scale attention dimensions’ in Methods.

**Fig. 3: Multi-tile design and layout for multi-head attention.**

Pre-trained model hardware-aware mapping and fine-tuning

Using weights from pre-trained models is challenging because our attention mechanism differs from the conventional one (Fig. 4a). The main differences are:

HardSigmoid activation used instead of softmax (Fig. 1b).
Sliding window attention is implemented instead of causal attention (Fig. 2a).
Input, stored projections and output are quantized in four, three and five bits, respectively, by digital PWMs, DACs and pulse counters (Fig. 1b).
Gain-cell arrays are split into sub-tiles before final result summation (Fig. 3a).
The relation between gain-cell input and stored voltages is nonlinear (Fig. 1e).
Capacitor leakage causes stored value decay (Fig. 1f).

The implementation of these hardware constraints in our simulations is explained in ‘Hardware-based neural network simulations’ in Methods. As the nonlinear relation between input voltage and stored voltage in gain cells is described by a third-order polynomial function, this substantially increases the computational complexity and memory requirements to train our gain-cell-based model. Therefore, to adapt the pre-trained public GPT-2 model to our hardware constraints, we first fine-tune it using an intermediate model. The intermediate model employs ideal linear dot products, but integrates all the other mentioned hardware constraints. The model is trained on predicting the next words of the open-source text collection OpenWebText⁴⁴, and the metric used for evaluation is perplexity, which measures the uncertainty of the prediction. In Fig. 4d, we see that our linear intermediate model (blue curve) achieves results equivalent to a public GPT-2 model in less than 3,000 iterations, whereas it takes more than 13,000 iterations for the model trained from scratch (magenta curve). This result shows that performing weight transfer is efficient even though the two models are different (in particular, HardSigmoid activation instead of softmax).

**Fig. 4: Hardware model adaptation and training.**

After fine-tuning the intermediate linear model, we transfer the weights to the final hardware model including the gain cell’s nonlinearity. This mapping is non-trivial, as all the layers have different statistics, making it difficult to apply a single fit to capture the gain cells’ nonlinearity. To circumvent this issue, we introduce scaling operations and an adaptation algorithm described in ‘Nonlinear model adaptation algorithm’ in Methods. In Fig. 4c, we show how the perplexity of the nonlinear gain-cell model is reduced from 1,757 to 21 during this adaptation stage. In Supplementary Fig. 5, we show that this adaptation algorithm can generalize to other multiplication nonlinearities. After the adaptation algorithm, we can fine-tune the nonlinear model using backpropagation (Fig. 4d, green curve) to further improve the results. The entire process is described in Fig. 4a.

Downstream task benchmarks

To evaluate the proposed hardware attention mechanism, in Table 1, we benchmark two software baselines and three hardware models on standard language modeling tasks (see details in ‘Downstream tasks set-up’ in Methods). Our nonlinear hardware model, adapted from a linear baseline and fine-tuned, achieves accuracy comparable to the public GPT-2 model, and equal or better performance than a software model trained from scratch under the same conditions. We further observe that omitting nonlinearity-specific fine-tuning yields near-identical results on most tasks, except LAMBADA and WikiText-2. To test scalability, we apply the same training set-up to GPT-2-XL (1.5 billion parameters). While the hardware version falls slightly short of the public checkpoint, it clearly outperforms the smaller GPT-2 baseline and matches the from-scratch software GPT-2-XL. This suggests that the remaining performance gaps are due to differences in the number of training iterations (the number of iterations for the public model is undisclosed), not hardware limitations.

Table 1 Downstream task results

Circuit computing accuracy

The accuracy of our circuits for attention computation is highlighted in Fig. 5a,b. For each of the two dot products, we simulate one 64 × 64 array and the corresponding 64 charge-to-pulse circuits. The results of the first dot product, which are shown in Fig. 5a, are fed as input to the second dot product and are shown in Fig. 5b. For each plot, we compare the simulations performed with SPICE (a circuit simulation software) with the model used for neural network simulations.

**Fig. 5: Analog hardware attention mechanism accuracy and performances.**

Energy consumption and latency

The circuit’s operational speed and timing, on which the energy assumptions are based, are shown in Fig. 2d. The total latency of attention is estimated at 65 ns.

The gain-cell arrays and charge-to-pulse circuits consume 1,120 pJ per token computation for the first dot product, and 700 pJ for the second dot product. The lower energy consumption in the second dot-product arrays is attributed to the sparser activation of its input ϕ(S), leading to less current in the second gain-cell array. The digital control and routing block consumes a total power of 113.7 mW, or 4 nJ per token, while the DACs require 330 pJ. Overall, we estimate the energy consumption of processing 1 token for 1 attention head at 6.1 nJ. A pie chart of the energy contribution of each unit is shown in Fig. 5e.
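The per-token budget can be tallied directly from the four contributions quoted above. The short sketch below only restates that arithmetic; the dictionary keys are descriptive labels chosen for this example.

```python
# Energy per token and per attention head, in picojoules, as quoted in the text
ENERGY_PJ = {
    "first dot product (arrays + charge-to-pulse)": 1120.0,
    "second dot product (arrays + charge-to-pulse)": 700.0,
    "digital control and routing": 4000.0,
    "DACs": 330.0,
}

total_pj = sum(ENERGY_PJ.values())
total_nj = total_pj / 1000.0

# Fraction of the total attributed to each unit
shares = {name: energy / total_pj for name, energy in ENERGY_PJ.items()}

for name, share in shares.items():
    print(f"{name}: {100 * share:.1f}%")
print(f"total: {total_nj:.2f} nJ")
```

The contributions sum to 6.15 nJ, in line with the estimate of about 6.1 nJ, and the digital control and routing block accounts for roughly two thirds of the total.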

The energy and latency of our architecture, compared with three different GPUs, are shown in Fig. 5c,d. Focusing on the attention mechanism alone, our architecture can lead to a speed-up of ×7,000 compared with Nvidia Jetson Nano, ×300 compared with Nvidia RTX 4090 and ×100 compared with Nvidia H100, as well as an energy reduction of ×40,000 compared with Jetson Nano, ×90,000 compared with RTX 4090 and ×70,000 compared with H100.

Area and floorplan

On the basis of our assumptions, described in ‘Area estimation’ in Methods, for the worst-case scenario, the area of the proposed gain cell is 1 μm². Figure 3c shows the floorplan of a single tile, which includes 64 shared DACs for writing the weights, 2-row address decoders and charge-to-pulse circuitry. The total area of 1 head, shown in the floorplan in Fig. 3b, is 500 × 10⁻³ mm² including digital control circuitry.

However, other studies have demonstrated substantially smaller gain-cell dimensions⁴⁵. On the basis of this, and following the methodology outlined in ‘Area estimation’ in Methods, we estimate that the area of the gain-cell crossbars required for the entire GPT-2 attention-head KV cache is approximately 15.7 × 10⁻³ mm², excluding digital control circuitry.

In Supplementary Fig. 7, we show that multiple attention heads can be executed using parallel tiles on-chip and stacked in 3D with multiple layers, sharing peripheral and digital logic. As discussed in ‘Area estimation’ in Methods, 3D stacking can further improve area efficiency. On the basis of ref. ⁴⁵, we estimate the total area required for a GPT attention-head KV cache, excluding digital control, to be $\frac{36.7}{N}\times 1{0}^{-3}\,{\text{mm}}^{2}$, where N denotes the number of vertical stacks. The resulting area is:

36.7 × 10⁻³ mm² for N = 1
9.2 × 10⁻³ mm² for N = 4
4.6 × 10⁻³ mm² for N = 8
3.1 × 10⁻³ mm² for N = 12

Discussion

In this work, we proposed an analog IMC architecture addressing the energy consumption and latency bottlenecks of the attention computations at the core of generative AI models.

Our design leverages capacitor-based gain cells, offering an efficient solution for both memory storage and computation, substantially improving energy efficiency and speed. To avoid power-intensive ADCs, we perform the attention computation in the analog domain, using charge-to-pulse circuits to transmit analog signals between computation stages. This approach introduces non-ideal operations compared with digital attention computations, but with substantial efficiency gains. Another contribution is a hardware-aware training methodology compensating for the circuit non-idealities. Nonetheless, future circuit optimizations could further reduce any discrepancies.

Our neural network simulations confirm that an LLM implemented with our hardware attention achieves results comparable to software-based networks, even on complex NLP tasks. Nonetheless, our larger network slightly underperforms the baseline, and therefore deeper neural network training will require further methods to mitigate the vanishing gradient issue due to clamping values. This slight performance gap should still be put in perspective with the reduced energy consumption. While our study uses device-level simulations to evaluate design performance, our adaptation algorithm demonstrates potential for measured device implementations, as it allows most of the training process to proceed without requiring precise device-specific models of nonlinear behavior, making the approach generically applicable and computationally efficient.

Our architecture can benefit from OSFETs that enable dense 3D integration^45,46. Moreover, the KV-cache size grows modestly compared with the overall model parameter count^14,15,47. Our system could therefore be applied to larger networks with a moderate area footprint. Latency is reduced by up to two orders of magnitude, and energy consumption by up to four orders for attention computations alone compared with GPUs. While we focus on the attention mechanism, a major bottleneck in generative transformers’ inference, substantial reductions in overall energy consumption require optimizing all components. In the future, our hardware attention mechanism can be integrated with other IMC techniques to implement low-power linear layers.

In conclusion, this work demonstrates hardware-algorithm co-optimization achieving low latency and energy consumption while maintaining high model accuracy. In addition, it highlights the promise of IMC with volatile, low-power memory for attention-based neural networks, marking an important step toward ultrafast, energy-efficient generative AI.

Methods

Hardware-based neural network simulations

We implement the sliding window attention by masking the elements of S outside the sliding window (blank spaces in the example in Fig. 2a). The HardSigmoid charge-to-pulse circuit is modeled by the equation

$$\phi (S)=\left\{\begin{array}{ll}{T}_{{\mathrm{max}}}\quad &{\rm{if}}\,S\ge {S}_{{\mathrm{sat}}}\\ \frac{{T}_{{\mathrm{max}}}}{{S}_{{\mathrm{sat}}}}S\quad &{\rm{if}}\,0 < S < {S}_{{\mathrm{sat}}}\\ 0\quad &{\rm{if}}\,S\le 0\end{array}\right.,$$

(3)

where T_max = 15 ns is the maximum pulse length for the input pulse generators. The input queries Q are quantized in 16 levels between 0 and 1, the stored K and V projections are quantized in 8 levels between 0 and 0.9, and the outputs of the second dot product are quantized in 32 levels between −1 and 1. The quantized models (linear intermediate hardware model and nonlinear hardware model) are trained with quantization aware training⁴⁸: quantization is done only in the forward pass and the backward pass is done in full precision.
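A minimal sketch of the forward-pass behavior described here is given below: the HardSigmoid of equation (3) and a clamp-and-round quantizer for the three signal types. The function names are hypothetical, and the assumption that levels are evenly spaced and include both endpoints is made for this example only; the backward pass, which the text keeps in full precision, is not shown.

```python
T_MAX_NS = 15.0  # maximum pulse length, from the text

def hard_sigmoid_pulse(s, s_sat, t_max=T_MAX_NS):
    """Equation (3): pulse width as a function of the accumulated score s."""
    if s >= s_sat:
        return t_max
    if s <= 0.0:
        return 0.0
    return t_max * s / s_sat

def quantize(x, levels, lo, hi):
    """Clamp x to [lo, hi], then round to the nearest of `levels` values.

    Assumption: the levels are evenly spaced and include lo and hi.
    """
    x = min(max(x, lo), hi)
    step = (hi - lo) / (levels - 1)
    return lo + round((x - lo) / step) * step

def quantize_query(q):
    return quantize(q, 16, 0.0, 1.0)    # 4-bit input pulses

def quantize_stored(kv):
    return quantize(kv, 8, 0.0, 0.9)    # 3-bit stored voltages

def quantize_output(a):
    return quantize(a, 32, -1.0, 1.0)   # 5-bit signed readout
```

In quantization-aware training, functions like these would be applied in the forward pass only, with gradients computed as if the quantizer were absent.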

For the nonlinear model of the gain cell, the third-order polynomials

$$\begin{array}{r}S=\mathop{\sum }\limits_{i}^{3}\mathop{\sum }\limits_{j}^{3-i}Q\cdot {\left({K}^{T}-{K}_{{\mathrm{offset}}}\right)}^{i}{V}_{{\mathrm{in}}}^{\,j}{C}_{i,\,j}\\ A=\mathop{\sum }\limits_{i}^{3}\mathop{\sum }\limits_{j}^{3-i}\phi \left(S\right)\cdot {\left(V-{V}_{{\mathrm{offset}}}\right)}^{i}{V}_{{\mathrm{in}}}^{\,j}{C}_{i,\,j}\end{array}$$

(4)

are used, with S and A as the outputs, Q and ϕ(S) as the input pulse widths, and K and V as the stored voltages. The constant V_in = 0.9 V is the input voltage of the cell applied at the word line read (WLR) ports, the constants K_offset and V_offset are both equal to 0.45 V, which corresponds to half the supply voltage (V_DD/2), and C_i,j are fit parameters obtained from the curve in Fig. 1e. To speed up computation during training, we compute all the tokens in parallel with $Q\in {{\mathbb{R}}}^{T,d}$, ${K}^{T}\in {{\mathbb{R}}}^{d,T}$, $V\in {{\mathbb{R}}}^{T,d}$ and $\phi \left(S\right)\in {{\mathbb{R}}}^{T,T}$ (the batch dimension and the head dimension are omitted for simplicity).
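The structure of equation (4) can be sketched for a single output as below. This is an illustration under stated assumptions: both sums are taken to start at 0 (the lower limits are not given in the equation), and the coefficient tables are placeholders, not the values fitted to Fig. 1e.

```python
def gain_cell_dot(inputs, stored, coeffs, offset=0.45, v_in=0.9):
    """Third-order polynomial model of one gain-cell column.

    inputs : input pulse widths (Q or phi(S)), one per cell
    stored : stored voltages (K or V), one per cell
    coeffs : coeffs[i][j] multiplies (stored - offset)**i * v_in**j, i + j <= 3
    """
    total = 0.0
    for x, w in zip(inputs, stored):
        cell = 0.0
        for i in range(4):
            for j in range(4 - i):
                cell += coeffs[i][j] * (w - offset) ** i * v_in ** j
        total += x * cell  # the input pulse gates the cell current onto the bitline
    return total

# Placeholder coefficients (hypothetical): an ideal linear cell and a cell
# with an added cubic term. Row i holds the terms j = 0 .. 3 - i.
LINEAR = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0], [0.0]]
CUBIC = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0], [-0.5]]
```

With the `LINEAR` table the function reduces to an ordinary dot product between the inputs and the offset-corrected stored voltages, which is the behavior assumed by the intermediate linear model.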

The capacitor leakage leads to an exponential decay in the stored value. After discretization, the exponential decay is formulated as

$${y}_{t}={y}_{t-1}{{\mathrm{e}}}^{-\frac{{\Delta }_{t}}{\tau }};\quad {\Delta }_{t}=L{\delta }_{t},$$

(5)

where τ is the time constant of the capacitors, Δ_t is the time elapsed between two inference steps, δ_t is the latency caused by each neural network layer, and L is the number of layers. To model the decay of all capacitors at all time steps in parallel, we introduce a decay mask $\alpha \in {{\mathbb{R}}}^{T,T}$ defined as

$$\alpha ={{\mathrm{e}}}^{-\frac{{\Delta }_{t}}{\tau }{m}_{t,{t}^{{\prime} }}};\quad {m}_{t,{t}^{{\prime} }}=\max \left(0,t-{t}^{{\prime} }\right),$$

(6)

where m is the relative position of the tokens. To optimize computation, the decay mask is directly integrated in the dot-product computation as

$$\begin{array}{l}S=\mathop{\sum }\limits_{i}^{3}\mathop{\sum }\limits_{j}^{3-i}\left(Q\cdot {\left({K}^{T}-{K}_{{\mathrm{offset}}}\right)}^{i}{V}_{{\mathrm{in}}}^{\,j}{C}_{i,\,j}\right){\alpha }^{i}\\ A=\mathop{\sum }\limits_{i}^{3}\mathop{\sum }\limits_{j}^{3-i}\left(\phi \left(S\right){\alpha }^{i}\right)\cdot {\left(V-{V}_{{\mathrm{offset}}}\right)}^{i}{V}_{{\mathrm{in}}}^{\,j}{C}_{i,\,j}\end{array}$$

(7)

In our simulation, we chose a time constant τ = 5 ms to be consistent with the data from Fig. 1f. We chose δ_t = 65 ns to be equal to the latency of our full hardware attention mechanism (Fig. 2c). Our decay factor is therefore $\frac{{\Delta }_{t}}{\tau }=\frac{12\times 65\times 1{0}^{-9}}{5\times 1{0}^{-3}}\simeq 1.6\times 1{0}^{-4}$. In a full transformer implementation, the latency per layer δ_t will be higher than 65 ns as it will also include latency from other modules, such as feedforward neural networks. However, time constants τ three orders of magnitude larger have been reported in OSFET-based gain-cell memories^26,29, and therefore we conclude that the choice of a decay factor of 1.6 × 10⁻⁴ is very conservative. In Supplementary Fig. 6, we study empirically the effect of the decay constant on language processing accuracy. It is noteworthy that the decay of stored keys and values may not necessarily hinder network performance: several approaches in deep learning leverage exponential decay masks to enhance memory structure^39,49. In Supplementary Information section ‘Effect of capacitor’s leakage’, we study the connection between the KV pairs decay and the relative positional embedding called ALiBi⁴⁹.
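The decay factor and the decay mask of equation (6) can be reproduced with a few lines. The sketch below uses the values quoted in the text (τ = 5 ms, δ_t = 65 ns, 12 layers); the function and constant names are chosen for this example.

```python
import math

TAU_S = 5e-3           # capacitor time constant
DELTA_LAYER_S = 65e-9  # latency per layer
NUM_LAYERS = 12        # number of layers used in the decay-factor estimate

# Delta_t / tau, with Delta_t = L * delta_t (equation (5))
decay_factor = NUM_LAYERS * DELTA_LAYER_S / TAU_S

def decay_mask(T, factor=decay_factor):
    """Equation (6): alpha[t][t'] = exp(-factor * max(0, t - t'))."""
    return [[math.exp(-factor * max(0, t - tp)) for tp in range(T)]
            for t in range(T)]

print(f"decay factor: {decay_factor:.2e}")
```

Entries above the diagonal equal 1 because of the max(0, ·) clamp; those positions correspond to future tokens and are removed by the attention mask in any case.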

To speed up our training process, we used the library Triton⁵⁰ to incorporate our simulations into an adapted version of the flash attention algorithm⁵¹, which optimizes the GPU resources. This method led to a factor of five latency reduction during training.

For the adaptation, the algorithm was repeated until the mean and standard deviation of the output of the scaling functions of the nonlinear model matched those of the linear model within a tolerance: $\left\vert {\sigma }_{{\mathrm{NL}}}-{\sigma }_{{\mathrm{L}}}\right\vert < 0.0001$ and $\left\vert {\mu }_{{\mathrm{NL}}}-{\mu }_{{\mathrm{L}}}\right\vert < 0.0001$.

Nonlinear model adaptation algorithm

The adaptation relies on scaling functions of the form

$$y=ax+b$$

(8)

with distinct scalars a and b for each of the Q, K and V projections, as well as for the output of the attention, with separate factors applied across different attention heads and layers.

To choose the scaling parameters a and b, we develop an algorithm inspired by ref. ⁵², detailed in Supplementary Algorithm 1. Given a set of input samples, we use an iterative loop that updates the scaling parameters so that the output of the scaling function of the nonlinear model matches the statistics of the linear model (as sketched in Fig. 4b). First, we measure the standard deviation σ_L and the mean μ_L of the output of every scaling stage (see equation (8)) of the linear model on a large set of samples. Then, at each iteration, we measure the standard deviation σ_NL and the mean μ_NL for the scaling stage of the nonlinear model. For each iteration, the scaling parameters are updated as

$$\begin{array}{l}a\leftarrow a\frac{{\sigma}_{{\mathrm{L}}}}{{\sigma}_{{\mathrm{NL}}}}\\ b\leftarrow b+\left(\;{\mu}_{{\mathrm{L}}}-{\mu}_{{\mathrm{NL}}}\right)\end{array}.$$

(9)
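The update rule in equation (9) can be sketched on a single scaling stage as follows. This is a toy illustration and not Supplementary Algorithm 1: the tanh-based `nonlinear_raw` samples are a hypothetical stand-in for the hardware nonlinearity, and the function name, sample count and iteration cap are assumptions. The stopping tolerance of 0.0001 is the one stated for the adaptation.

```python
import math
import random
import statistics

def adapt_scaling(linear_out, nonlinear_raw, tol=1e-4, max_iters=100):
    """Fit y = a*x + b so the scaled nonlinear output matches the linear statistics."""
    mu_l = statistics.fmean(linear_out)       # reference mean (linear model)
    sigma_l = statistics.pstdev(linear_out)   # reference standard deviation
    a, b = 1.0, 0.0
    for _ in range(max_iters):
        scaled = [a * x + b for x in nonlinear_raw]
        mu_nl = statistics.fmean(scaled)
        sigma_nl = statistics.pstdev(scaled)
        if abs(sigma_nl - sigma_l) < tol and abs(mu_nl - mu_l) < tol:
            return a, b
        a *= sigma_l / sigma_nl   # a <- a * sigma_L / sigma_NL
        b += mu_l - mu_nl         # b <- b + (mu_L - mu_NL)
    raise RuntimeError("adaptation did not converge")

random.seed(0)
linear_out = [random.gauss(0.5, 2.0) for _ in range(2000)]
# Hypothetical stand-in for the response of the nonlinear hardware model.
nonlinear_raw = [math.tanh(0.3 * v) + 0.2 for v in linear_out]

a, b = adapt_scaling(linear_out, nonlinear_raw)
scaled = [a * x + b for x in nonlinear_raw]
```

In this single-stage toy, the loop settles after a couple of updates because the input to the scaling function does not depend on a and b. In a multi-layer model the scaling stages feed into one another, which is presumably why the loop is repeated until the tolerance is met.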

Analog sliding window attention timing and execution

To support efficient sequential inference, our architecture implements sliding window attention using a pipelined read–write mechanism across analog gain-cell arrays. At each inference step, new (K, V) pairs are written into the arrays while the current query (Q) is applied, ensuring that memory access and computation overlap.

Each attention step begins with a 5 ns discharge phase to reset the storage capacitors of the gain cells. New K and V vectors are written to a column of the respective arrays using 10 ns multi-level voltage pulses generated by 3-bit DACs. In parallel, the input query Q is encoded as PWM voltage pulses with durations between 0 ns and T_max = 15 ns, generated by 4-bit (16 levels) voltage pulse generators operating at 1 GHz.

This parallelization is possible because the V array is not required during the Q ⋅ K^T computation phase and can therefore be updated while the first dot product is processed. Once the write is complete, the charge-to-pulse circuit for the V array is reset, and the resulting ϕ(S) pulses from the K array’s readout are applied to the V array to compute the second dot product ϕ(S) ⋅ V.

After M time steps, when all columns in the K and V arrays have been populated, the first column is overwritten, preserving a sliding attention window of fixed size M. The succession of write and read phases implements a sequential sliding window attention mechanism, with minimal idle time and continuous throughput. This pipelined execution scheme is visualized in Fig. 2c, and forms the basis for the latency and energy analysis presented in later sections.
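The overwrite policy described above behaves like a circular buffer. The following minimal sketch models only the write-index bookkeeping, not the analog timing or the parallel read; the class name `SlidingKVWindow` and its methods are hypothetical.

```python
class SlidingKVWindow:
    """Fixed-size KV store whose oldest column is overwritten once all M are full."""

    def __init__(self, window_size):
        self.window_size = window_size       # M columns per array
        self.keys = [None] * window_size
        self.values = [None] * window_size
        self.write_index = 0                 # next column to write
        self.num_written = 0                 # total tokens seen so far

    def write(self, key, value):
        """Store a new (K, V) pair, wrapping back to column 0 after M writes."""
        self.keys[self.write_index] = key
        self.values[self.write_index] = value
        self.write_index = (self.write_index + 1) % self.window_size
        self.num_written += 1

    def stored_tokens(self):
        """Return the keys currently held, oldest first."""
        if self.num_written < self.window_size:
            return self.keys[:self.num_written]
        return self.keys[self.write_index:] + self.keys[:self.write_index]

window = SlidingKVWindow(window_size=4)
for token in range(6):                 # tokens 0..5 arrive one per inference step
    window.write(key=token, value=token * 10)
print(window.stored_tokens())          # prints [2, 3, 4, 5]
```

After six writes into a window of four columns, the two oldest tokens have been overwritten and the write index points at column 2, the column that now holds the oldest remaining token.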

Sub-tiling to scale attention dimensions

IR drop, caused by resistive losses in interconnects, results in reduced accuracy in large-scale analog crossbar arrays⁵³. To mitigate IR drop issues, we limit the size of our gain-cell arrays to 64 × 64. However, most NLP applications require either a larger window dimension M (columns) or a larger embedding dimension d (rows). To accommodate larger dimensions, we perform inference across multiple sub-tiles, as shown in Fig. 3a.

In this paper, we implement a GPT-2 model with an embedding dimension d = 64 and a sliding window size M = 1,024. Therefore, the entire KV cache of window size M is divided into 16 sub-tiles, each with its own charge-to-pulse blocks and storing a fraction of K and V in two 64 × 64 arrays. A write address controller keeps track of the current write index. All tiles receive in parallel the same input Q generated by the digital block; their outputs are measured by pulse counters and summed by 64 digital adders, each with 16 inputs (Fig. 3b,c). In sliding window attention, the maximum attention span is equal to L(M − 1) + 1 (ref. ⁴³). Therefore, in the presented architecture, the maximum attention span can be increased by increasing the number of sub-tiles. However, this leads to additional area footprint, which scales linearly with the sliding window dimension, and to additional latency, as each digital adder requires one clock cycle.
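The sub-tile count, the attention span and the digital summation of per-tile results can be checked with a few lines of arithmetic. In this sketch the integer vectors are made-up test data, the helper names are hypothetical, and the nonlinearity ϕ and the pulse encoding are omitted. It only illustrates that, for one of the 64 output dimensions, adding the partial dot products of 16 sub-tiles reproduces the full-window result.

```python
import random

TILE_COLS = 64        # columns per gain-cell array
WINDOW = 1024         # sliding window size M
NUM_LAYERS = 12       # number of layers L

num_tiles = WINDOW // TILE_COLS             # sub-tiles per attention head
max_span = NUM_LAYERS * (WINDOW - 1) + 1    # L(M - 1) + 1

def weighted_sum(scores, values):
    """Dot product of the attention scores with one column of V."""
    return sum(s * v for s, v in zip(scores, values))

def tiled_weighted_sum(scores, values, tile_cols=TILE_COLS):
    """Compute the same dot product as per-tile partial sums, then add them."""
    partials = [
        weighted_sum(scores[i:i + tile_cols], values[i:i + tile_cols])
        for i in range(0, len(scores), tile_cols)
    ]
    return sum(partials), len(partials)

random.seed(0)
scores = [random.randint(0, 15) for _ in range(WINDOW)]   # illustrative integers
values = [random.randint(-4, 3) for _ in range(WINDOW)]
total, used_tiles = tiled_weighted_sum(scores, values)
print(num_tiles, max_span, total == weighted_sum(scores, values))
```

For the 12-layer model with M = 1,024, the span formula evaluates to 12,277 tokens.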

Hardware-based neural network training

To evaluate our training algorithm and the inference accuracy of our architecture, we implement the analog gain-cell-based attention mechanism on the GPT-2 architecture⁵⁴. GPT-2 is a transformer neural network with 124 million parameters, 12 layers, an attention mechanism input dimension of 768, 12 heads per attention block and a head dimension of 64. We used the open-source text collection OpenWebText⁴⁴ split between training and testing samples, and the pre-trained GPT-2 tokenizer to encode the plain text into tokens (vectors of size 50,304 each). Each training iteration had a batch size of 1,920, with sequences of length 1,024 per sample. We selected a sliding window size of 1,024, which matches the number of gain-cell rows in the memory. As the sequence length also equals 1,024, each gain cell is written only once per sequence, eliminating the need to overwrite cells during one sliding window iteration. For a larger sequence length, the gain cells would be overwritten, as described in the section ‘Analog hardware sliding window attention data-flow’. To train the network, the next token in the sequence is predicted for each input token. Thus, the target sequences are the input sequences shifted by one token. The cost function used was cross-entropy, calculated between the predicted sequence and the target sequence. We used backpropagation with the AdamW optimizer⁵⁵, with a learning rate of 6 × 10⁻⁴ and a weight decay of 0.1. The results of each evaluation are averaged over 4,000 samples.
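The next-token objective described above can be illustrated with a minimal pure-Python sketch. The token IDs are made up and the helper names are hypothetical; the experiments in the paper ran on full batches at a much larger scale.

```python
import math

VOCAB_SIZE = 50304    # token vector size, from the text

def shift_targets(tokens):
    """Split a token sequence into inputs and next-token targets."""
    return tokens[:-1], tokens[1:]

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)    # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]

tokens = [5, 2, 7, 1]                     # illustrative token IDs
inputs, targets = shift_targets(tokens)   # targets are inputs shifted by one token

# Batch size x sequence length, both from the text.
tokens_per_iteration = 1920 * 1024

# Loss of a predictor that assigns equal probability to every token.
uniform_loss = cross_entropy([0.0] * VOCAB_SIZE, target=7)
print(inputs, targets, tokens_per_iteration, round(uniform_loss, 2))
```

The uniform-predictor loss, ln(50,304) ≈ 10.8, is a useful reference point: a trained model should reach a cross-entropy well below it.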

Downstream tasks set-up

The datasets cover various types of problem. Our benchmarking set-up is inspired by refs. ^11,56 in terms of evaluated tasks and metrics. ARC-Easy and ARC-Challenge⁵⁷ focus on question answering, with ARC-Easy containing straightforward questions and ARC-Challenge featuring more difficult ones. WinoGrande⁵⁸ evaluates common-sense reasoning and co-reference resolution by presenting minimal pairs to resolve ambiguities. HellaSwag⁵⁹ tests common-sense inference, requiring models to predict the most plausible continuation of a given context. LAMBADA⁶⁰ evaluates models’ text understanding through a word prediction task that requires comprehension of broader discourse, not just local context. PIQA⁶¹ assesses physical common-sense reasoning, testing a model’s understanding of physical scenarios. WikiText-2⁶² is a general text corpus derived from Wikipedia articles to assess long-term dependencies processing, text prediction and generation capabilities. For WikiText-2, we report perplexity scores normalized by the word count in the original text. For fair comparisons, except for software public GPT-2, all the models were evaluated after the same number of training iterations. The linear hardware model was trained on 13,000 iterations, the nonlinear hardware model was mapped from the 13,000 iterations linear model using the adaptation algorithm but without fine-tuning, and the nonlinear hardware model with adaptation and fine-tuning was adapted from a linear model trained on 3,000 iterations, and then fine-tuned on 10,000 iterations.

Hardware SPICE simulations

To assess circuit performance accuracy, energy consumption and speed, we conducted SPICE array simulations using the TSMC 28 nm PDK within the Cadence Virtuoso environment. All simulations are based on a 64 × 64 array, corresponding to the tile size in our architecture (Fig. 3a). To extrapolate the energy and latency for a full attention head with a window size of 1,024, we multiply the per-sub-tile measurements by 16, reflecting the total number of sub-tiles comprising 1 attention head in our architecture. In these simulations, a parasitic wire capacitance of 0.8 fF and a series resistance of 2 Ω per array element are included. Both arrays, one performing ϕ(Q ⋅ K^T) and the other performing ϕ(S) ⋅ V, are simulated separately, but always in combination with their specific charge-to-pulse readout circuitry.

GPU attention latency and energy consumption measurements

To measure the latency and energy on Nvidia RTX 4090, Nvidia H100 and Nvidia Jetson Nano, which are a consumer GPU, a data-center GPU and an embedded application GPU, respectively, we perform 10 runs of 1,024 steps of autoregressive token generation with 12 attention heads using FlashAttention-2⁵¹, which optimizes attention computation on GPUs. The latency and energy measurements focus solely on attention computation. For a fair comparison, the linear projections are not implemented in this experiment, as they are also not implemented by our hardware architecture, and the static power measured before inference is subtracted from the power measured during inference. For each run, we measure the latency and the power using the Nvidia-SMI Python API, and average them.
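The static-power correction described above amounts to simple arithmetic on the sampled power and the measured latency. The sketch below assumes that power samples in watts have already been collected (for example, through an NVML-based Python binding). The function name and all numeric values are hypothetical placeholders, not measurements from the paper.

```python
from statistics import fmean

STEPS_PER_RUN = 1024    # generation steps per run, from the text

def attention_energy_joules(power_samples_w, static_power_w, latency_s):
    """Energy attributed to attention: (mean inference power - static power) x latency."""
    dynamic_power_w = fmean(power_samples_w) - static_power_w
    return dynamic_power_w * latency_s

# Placeholder values for one run (NOT measured data).
run_power_w = [100.0, 110.0, 105.0, 105.0]
energy_j = attention_energy_joules(run_power_w, static_power_w=25.0, latency_s=0.5)
energy_per_step_j = energy_j / STEPS_PER_RUN
print(energy_j, energy_per_step_j)
```

Averaging the per-run energy and latency over the 10 runs then gives the reported figures for each GPU.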

Area estimation

Our floorplan is based on ITO gain cells, an emerging OSFET technology that has enabled low-area gain-cell designs⁴⁵. A two-transistor ITO gain cell occupies an area of 0.14 μm² (approximately 370 nm × 370 nm)⁴⁵, allowing for denser memories than CMOS-based gain cells. On the basis of the area results presented in these studies^45,46, we estimate the worst-case area of the proposed 6-transistor cell to be 1 μm², leading to a 19× area reduction compared with gain cells based on CMOS write transistors (our CMOS-based gain-cell layout is presented in Supplementary Fig. 1). The total area of 1 attention head is derived from this single-cell area estimation, as well as the charge-to-pulse circuit layout and the total floorplan incorporating the 16 sub-tiles and digital circuits, providing a precise representation of the space requirements. This structure is designed to be repetitive (vertical dimension in Fig. 3c), allowing multiple attention heads to be efficiently integrated on a single chip. Each attention head receives inputs from the lower digital block, while its outputs are processed by the upper digital block. To facilitate the connection of the bitline outputs of one array (that is, vertical metal lines) to the wordline input of the next array (that is, horizontal metal line), we employ wire tapping, as highlighted in Fig. 3d.

When considering 3D-stacked gain cells, the effective cell area is reported in ref. ⁴⁵ as 0.14/N μm², where N denotes the number of parallel oxide layers. Consequently, a signed gain-cell implementation would occupy 0.28/N μm², consisting of 2 gain cells, 1 for the positive part and 1 for the negative part.
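The cell-level area figures above can be combined in a short calculation. This sketch covers the memory cells only: it ignores the charge-to-pulse circuits, digital blocks and routing, so it is not the total area reported in the paper. The function name and the choice of N values are illustrative, and applying the 1 μm² worst-case figure to every cell of both arrays is an assumption.

```python
CELL_AREA_UM2 = 0.14         # two-transistor ITO gain cell (ref. 45)
WORST_CASE_CELL_UM2 = 1.0    # worst-case estimate for the proposed six-transistor cell
ARRAY_DIM = 64               # cells per side of one array
SUB_TILES = 16               # sub-tiles per attention head
ARRAYS_PER_TILE = 2          # one K array and one V array

def signed_cell_area_um2(num_layers):
    """Effective area of a signed cell (two gain cells) with N stacked oxide layers."""
    return 2 * CELL_AREA_UM2 / num_layers

cells_per_head = SUB_TILES * ARRAYS_PER_TILE * ARRAY_DIM * ARRAY_DIM
cell_area_per_head_um2 = cells_per_head * WORST_CASE_CELL_UM2

for n in (1, 2, 4):
    print(n, round(signed_cell_area_um2(n), 3))
print(cells_per_head, cell_area_per_head_um2)
```

Under these assumptions, one attention head holds 131,072 cells, which corresponds to roughly 0.13 mm² of cell area in the worst case.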

Data availability

The data supporting the figures of this study are publicly available in a figshare repository⁶³. Source data for Figs. 1, 2, 4 and 5 are available with this paper. Data for Figs. 1, 2 and 5 were generated through simulations using SPICE. Data for Fig. 4 were produced using evaluations performed in the PyTorch framework. Data for Table 1 were obtained using the Language Model Evaluation Harness toolkit⁶⁴.

Code availability

The Python scripts used for the experiments are available without restriction at https://github.com/NathanLeroux-git/GainCellAttention/, and are archived with a DOI in the Zenodo repository⁶⁵.

References

Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems, NIPS’17 6000–6010 (Curran Associates, 2017).
Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. Preprint at http://arxiv.org/abs/1409.0473 (2016).
Lin, T., Wang, Y., Liu, X. & Qiu, X. A survey of transformers. AI Open 3, 111–132 (2022).
Pope, R. et al. Efficiently scaling transformer inference. Proc. Mach. Learn. Syst. 5, 606–624 (2023).
Liu, Z. et al. KIVI: a tuning-free asymmetric 2bit quantization for KV cache. In Proc. 41st International Conference on Machine Learning, ICML’24 Vol. 235, 32332–32344 (JMLR.org, 2024).
Jiang, A.Q. et al. Mistral 7B. Preprint at http://arxiv.org/abs/2310.06825 (2023).
Jouppi, N. P. et al. Ten lessons from three generations shaped Google’s TPUv4i: industrial product. In Proc. 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA) 1–14 (IEEE, 2021); https://doi.org/10.1109/ISCA52012.2021.00010
Fu, Y. Challenges in deploying long-context transformers: a theoretical peak performance analysis. Preprint at https://arxiv.org/abs/2405.08944 (2024).
Xu, M. et al. Resource-efficient algorithms and systems of foundation models: a survey. ACM Comput. Surv. 57, 110–111039 (2025).
Katharopoulos, A., Vyas, A., Pappas, N. & Fleuret, F. Transformers are RNNs: fast autoregressive transformers with linear attention. In Proc. 37th International Conference on Machine Learning, ICML’20 Vol. 119, 5156–5165 (JMLR.org, 2020); https://doi.org/10.5555/3524938.3525416
Gu, A. & Dao, T. Mamba: linear-time sequence modeling with selective state spaces. In Proc. Conference on Language Modeling (2024); https://openreview.net/forum?id=tEYskw1VY2
Adnan, M. et al. Keyformer: KV cache reduction through key tokens selection for efficient generative inference. Proc. Mach. Learn. Syst. 6, 114–127 (2024).
DeepSeek-AI et al. Deepseek-v3 technical report. Preprint at https://arxiv.org/abs/2412.19437 (2024).
Chang, C.-C. et al. Palu: KV-cache compression with low-rank projection. In Proc. 13th International Conference on Learning Representations (2025); https://openreview.net/forum?id=LWMS4pk2vK
Ainslie, J. et al. GQA: training generalized multi-query transformer models from multi-head checkpoints. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H. et al.) 4895–4901 (Association for Computational Linguistics, 2023); https://doi.org/10.18653/v1/2023.emnlp-main.298
Vogginger, B. et al. Neuromorphic hardware for sustainable AI data centers. Preprint at https://arxiv.org/abs/2402.02521 (2024).
Yang, X., Yan, B., Li, H. & Chen, Y. ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration. In Proc. 39th International Conference on Computer-Aided Design, ICCAD ’20 92 (Association for Computing Machinery, 2020); https://doi.org/10.1145/3400302.3415640
Laguna, A. F. Hardware–software co-design of an in-memory transformer network accelerator. Front. Electron. 3, 847069 (2022).
Sridharan, S., Stevens, J. R., Roy, K. & Raghunathan, A. X-former: in-memory acceleration of transformers. IEEE Trans. Very Large Scale Integr. VLSI Syst. 31, 1223–1233 (2023).
Bhattacharjee, A., Moitra, A. & Panda, P. Clipformer: key–value clipping of transformers on memristive crossbars for write noise mitigation. IEEE Trans. Comput. Aided Design Integr. Circuits Syst. 44, 592–601 (2025).
Wu, Y., Wang, Z. & Lu, W. D. PIM GPT a hybrid process in memory accelerator for autoregressive transformers. Npj Unconv. Comput. 1, 4 (2024).
Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R. & Eleftheriou, E. Memory devices and applications for in-memory computing. Nat. Nanotechnol. 15, 529–544 (2020).
Zhou, M., Xu, W., Kang, J. & Rosing, T. TransPIM: a memory-based acceleration via software–hardware co-design for transformer. In Proc. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA) 1071–1085 (IEEE, 2022); https://doi.org/10.1109/HPCA53966.2022.00082
Liu, S. et al. HARDSEA: hybrid analog-ReRAM clustering and digital-SRAM in-memory computing accelerator for dynamic sparse self-attention in transformer. IEEE Trans. Very Large Scale Integr. VLSI Syst. 32, 269–282 (2024).
Lepri, N. et al. In-memory computing for machine learning and deep learning. IEEE J. Electron Devices Soc. 11, 587–601 (2023).
Wang, Y. et al. An in-memory computing architecture based on two-dimensional semiconductors for multiply–accumulate operations. Nat. Commun. https://doi.org/10.1038/s41467-021-23719-3 (2021).
Gou, S. et al. 2T1C DRAM based on semiconducting MoS₂ and semimetallic graphene for in-memory computing. Natl Sci. Open 2, 20220071 (2023).
Shi, M. et al. Counteractive coupling IGZO/CNT hybrid 2T0C DRAM accelerating RRAM-based computing-in-memory via monolithic 3D integration for edge AI. In Proc. 2023 International Electron Devices Meeting (IEDM) 1–4 (IEEE, 2023); https://doi.org/10.1109/IEDM45741.2023.10413876
Belmonte, A. et al. Lowest IOFF <3×10⁻²¹ A/μm in capacitorless DRAM achieved by reactive ion etch of IGZO-TFT. In Proc. 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits) 1–2 (IEEE, 2023); https://doi.org/10.23919/VLSITechnologyandCir57934.2023.10185398
Ye, H. et al. Double-gate W-doped amorphous indium oxide transistors for monolithic 3D capacitorless gain cell eDRAM. In Proc. 2020 IEEE International Electron Devices Meeting (IEDM) 28.3.1–28.3.4 (IEEE, 2020); https://doi.org/10.1109/IEDM13553.2020.9371981
Raman, S. R. S., Xie, S. & Kulkarni, J. P. Compute-in-eDRAM with backend integrated indium gallium zinc oxide transistors. In Proc. 2021 IEEE International Symposium on Circuits and Systems (ISCAS) 1–5 (IEEE, 2021); https://doi.org/10.1109/ISCAS51556.2021.9401798
Tang, W. et al. Low-power and scalable BEOL-compatible IGZO TFT eDRAM-based charge-domain computing. IEEE Trans. Circuits Syst. I 70, 5166–5179 (2023).
Lu, A. et al. High-speed emerging memories for AI hardware accelerators. Nat. Rev. Electr. Eng. 1, 24–34 (2024).
Cai, F. et al. A fully integrated reprogrammable memristor–CMOS system for efficient multiply–accumulate operations. Nat. Electron. 2, 290–299 (2019).
Wan, W. et al. A compute-in-memory chip based on resistive random-access memory. Nature 608, 504–512 (2022).
Ambrogio, S. et al. An analog-AI chip for energy-efficient speech recognition and transcription. Nature 620, 768–775 (2023).
Vatalaro, M. et al. A low-voltage, low-power reconfigurable current-mode softmax circuit for analog neural networks. Electronics https://doi.org/10.3390/electronics10091004 (2021).
Dube, A., Manea, P., Gibertini, P., Covi, E. & Strachan, J. P. Analog softmax with wide input current range for in-memory computing. In Proc. IEEE International Symposium on Circuits and Systems (ISCAS), paper 2530 (2025).
Ma, X. et al. Mega: moving average equipped gated attention. In Proc. 11th International Conference on Learning Representations (2023); https://openreview.net/forum?id=qNLe3iq2El
Ramapuram, J. et al. Theory, analysis, and best practices for sigmoid self-attention. In Proc. 13th International Conference on Learning Representations (2025); https://openreview.net/forum?id=Zhdhg6n2OG
Beltagy, I., Peters, M. E. & Cohan, A. Longformer: the long-document transformer. Preprint at https://arxiv.org/abs/2004.05150 (2020).
Gu, X. et al. When attention sink emerges in language models: an empirical view. In Proc. 13th International Conference on Learning Representations (2025); https://openreview.net/forum?id=78Nn4QJTEN
Fu, Z. et al. Sliding window attention training for efficient large language models. Preprint at https://arxiv.org/abs/2502.18845 (2025).
Gokaslan, A. & Cohen, V. OpenWebText Corpus. GitHub http://Skylion007.github.io/OpenWebTextCorpus (2019).
Liu, S. et al. Design guidelines for oxide semiconductor gain cell memory on a logic platform. IEEE Trans. Electron Devices 71, 3329–3335 (2024).
Subhechha, S. et al. Demonstration of multilevel multiply accumulate operations for AiMC using engineered a-IGZO transistors-based 2T1C gain cell arrays. In Proc. 2023 IEEE International Memory Workshop (IMW) 1–4 (IEEE, 2023); https://doi.org/10.1109/IMW56887.2023.10145946
Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Jacob, B. et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2704–2713 (IEEE, 2018).
Press, O., Smith, N. A. & Lewis, M. Train short, test long: attention with linear biases enables input length extrapolation. In Proc. International Conference on Learning Representations (2022); https://openreview.net/forum?id=R8sQPpGCv0
Tillet, P., Kung, H. T. & Cox, D. Triton: an intermediate language and compiler for tiled neural network computations. In Proc. 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2019 10–19 (Association for Computing Machinery, 2019); https://doi.org/10.1145/3315508.3329973
Dao, T. FlashAttention-2: faster attention with better parallelism and work partitioning. In Proc. 12th International Conference on Learning Representations (2024); https://openreview.net/forum?id=mZn2Xyh9Ec
Mishkin, D. & Matas, J. All you need is a good init. Preprint at https://arxiv.org/abs/1511.06422 (2015).
Lepri, N., Glukhov, A., Mannocci, P., Porzani, M. & Ielmini, D. Compact modeling and mitigation of parasitics in crosspoint accelerators of neural networks. IEEE Trans. Electron Devices 71, 1900–1906 (2024).
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (2019).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In Proc. International Conference on Learning Representations (2019); https://openreview.net/forum?id=Bkg6RiCqY7
Beck, M. et al. xLSTM: extended long short-term memory. In Proc. 38th Annual Conference on Neural Information Processing Systems (2024); https://openreview.net/forum?id=ARAxPPIAhq
Clark, P. et al. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. Preprint at https://arxiv.org/abs/1803.05457 (2018).
Sakaguchi, K., Bras, R. L., Bhagavatula, C. & Choi, Y. WinoGrande: an adversarial winograd schema challenge at scale. Commun. ACM 64, 99–106 (2021).
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A. & Choi, Y. HellaSwag: can a machine really finish your sentence? In Proc. 57th Annual Meeting of the Association for Computational Linguistics, 4791–4800 (ACL, 2019).
Paperno, D. et al. The LAMBADA dataset: word prediction requiring a broad discourse context. In Proc. 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Erk, K. & Smith, N. A.) 1525–1534 (Association for Computational Linguistics, 2016); https://doi.org/10.18653/v1/P16-1144
Bisk, Y., Zellers, R., Bras, R. L., Gao, J. & Choi, Y. PIQA: reasoning about physical commonsense in natural language. In Proc. 34th AAAI Conference on Artificial Intelligence, 7432–7439 (AAAI, 2020).
Merity, S., Xiong, C., Bradbury, J. & Socher, R. Pointer sentinel mixture models. In Proc. International Conference on Learning Representations (2017); https://openreview.net/forum?id=Byj72udxe
Leroux, N. et al. Analog in-memory computing attention mechanism for fast and energy-efficient large language models source data. figshare https://doi.org/10.6084/m9.figshare.27763548 (2025).
Gao, L. et al. A framework for few-shot language model evaluation. Zenodo https://doi.org/10.5281/zenodo.5371628 (2025).
Leroux, N. et al. GainCellAttention. Zenodo https://doi.org/10.5281/zenodo.15856645 (2025).


Acknowledgements

This work was supported in part by the Federal Ministry of Education and Research (BMBF, Germany) in the project NEUROTEC II (project number 16ME0398K). We gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre (JSC).

Funding

Open access funding provided by Forschungszentrum Jülich GmbH.

Author information

These authors contributed equally: Nathan Leroux, Paul-Philipp Manea.

Authors and Affiliations

PGI-15, Forschungszentrum Jülich, Jülich, Germany
Nathan Leroux, Jan Finkbeiner & Emre Neftci
PGI-14, Forschungszentrum Jülich, Jülich, Germany
Paul-Philipp Manea, Chirag Sudarshan, Sebastian Siegel & John Paul Strachan
Faculty of Electrical Engineering, RWTH Aachen, Aachen, Germany
Paul-Philipp Manea, Jan Finkbeiner, John Paul Strachan & Emre Neftci

Authors

Nathan Leroux
Paul-Philipp Manea
Chirag Sudarshan
Jan Finkbeiner
Sebastian Siegel
John Paul Strachan
Emre Neftci

Contributions

The study was designed by N.L. and P.-P.M., and supervised by J.P.S. and E.N. The analog circuit system schematic design and electrical simulations were carried out by P.-P.M. C.S. was responsible for the design and layout of all digital blocks, as well as the overall chip floorplanning. S.S. completed the layout of the analog components. Hardware parameter extraction was performed by P.-P.M. Neural network training was conducted by N.L. and neural network evaluation was conducted by N.L. and J.F. All authors contributed to the analysis of the results and writing of the paper.

Corresponding authors

Correspondence to Nathan Leroux or Paul-Philipp Manea.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Jianshi Tang, Yonghong Tian and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Jie Pan, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information: Supplementary Text/Discussion, Figs. 1–7 and Algorithm 1.

Peer Review file

Supplementary Data 1: plot data for Supplementary Fig. 5.

Source data

Source Data Figs. 1, 2, 4 and 5: plot data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Leroux, N., Manea, PP., Sudarshan, C. et al. Analog in-memory computing attention mechanism for fast and energy-efficient large language models. Nat Comput Sci 5, 813–824 (2025). https://doi.org/10.1038/s43588-025-00854-1


Received: 15 November 2024
Accepted: 22 July 2025
Published: 08 September 2025
Version of record: 08 September 2025
Issue date: September 2025
DOI: https://doi.org/10.1038/s43588-025-00854-1

