Abstract
Biological and artificial systems encode information through complex nonlinear operations across multiple timescales. A clear understanding of the interplay between this multiscale structure and the nature of the nonlinearities at play is, however, missing. Here, we study a general model where the input signal is propagated to an output unit through a processing layer via nonlinear activation functions. We focus on two widely implemented paradigms: nonlinear summation, where signals are first nonlinearly transformed and then combined; and nonlinear integration, where they are combined first and then transformed. We find that fast-processing capabilities systematically enhance input-output mutual information, and nonlinear integration outperforms summation in large systems. Conversely, a nontrivial interplay between the two strategies emerges in lower dimensions as a function of interaction strength, heterogeneity, and sparsity of connections between the units. Finally, we reveal a tradeoff between input and processing sizes in strong-coupling regimes. Our results shed light on relevant features of nonlinear information processing with implications for both biological and artificial systems.
Introduction
The ability to encode and process information from the external world is essential to maintain robust functioning in biological systems1. These goals are usually achieved through complex internal machinery that involves nonlinear operations. For example, multi-molecular reactions drive sensing and adaptation in chemical networks2,3,4, gene regulatory dynamics is controlled by protein-mediated interactions leading to multi-stable phases corresponding to different cell fates5,6, phase coexistence phenomena sustain noise reduction and functional organization in cellular environments7, and complex interaction networks underlie the computational capabilities of neural populations8,9. As such, extracting information from a given input to generate a desired output is a fundamental problem that spans several fields, from signal processing in biochemical systems10,11 to designing and training artificial neural networks12. Many of these systems share the idea that inputs need to be processed via different types of nonlinear activation functions to enable non-trivial learning tasks. Despite remarkable results, understanding the key determinants of how the type of nonlinearity shapes information processing is an active area of theoretical research13,14,15. Recent works have investigated the performance of computation instantiated by biological media, making an effort to bridge artificial and biochemical processing16 and highlighting the pivotal role of nonlinear encoding17,18,19 and multiple timescales20,21,22.
Information theory provides us with tools to quantitatively study information-processing capabilities of various systems ranging from stochastic processes23,24 to biological scenarios25,26,27. While the impact of timescales on information propagation has been understood in general frameworks28,29, the role of internal nonlinear mechanisms remains unclear without focusing on specific models. One of the main difficulties resides in the lack of general analytical approaches, with results obtained for large systems and phenomenological models often relying on Gaussian approximations of various forms30,31.
In terms of nonlinear processing, two paradigmatic schemes have been extensively implemented in a variety of different contexts32,33,34,35,36: nonlinear summation, in which incoming signals are first nonlinearly transformed and then summed before affecting the target; and nonlinear integration, in which signals are integrated before the nonlinear processing step. Although they reflect different underlying physical processes, summation and integration have often been used interchangeably to describe neural systems37,38,39,40,41,42,43,44,45,46, especially when there is no clear biological reason to choose one over the other. A similar dichotomy is present in gene networks, where, given the absence of microscopic models substantiating either one of these two schemes, there is no consensus on the use of summation or integration, with various works reporting opposite approaches47,48,49,50. In these scenarios, despite their ability to describe the systems’ dynamics, summation and integration may present striking differences when considering processing performances. These differences may be especially relevant in the presence of multiple timescales28. Even in those cases in which the implementation of nonlinear summation or integration is biologically motivated, it would be crucial to understand their effect on information processing. For instance, dendrites are believed to perform integration in neural circuits35; in gene networks, gene-protein interactions appear in the form of a nonlinear summation when derived from first principles49; many adaptive dynamical systems are usually modeled via nonlinear integration schemes due to the presence of slow feedback dynamics32,34; and controlled stochastic chemical networks naturally implement summation along multiple reaction channels51,52.
In this work, we study the information-theoretic features associated with nonlinear summation and integration in a generic multiscale information-processing system. To this end, we consider a possibly high-dimensional signal generated by an input unit, which is then processed by a processing unit and finally encoded into an output unit. Interactions among the units form a general multilayer network structure that supports the propagation of the input information28,53,54. Crucially, each unit may operate on a different timescale and is composed of an arbitrary number of individual degrees of freedom (dofs), such as neurons in neural networks, chemical species in a signaling architecture, or genes in a genetic network. Operations between the units are implemented by an activation function that can perform nonlinear summation or integration of the incoming signals from the connected unit. We first find the exact expression for the probability density function (pdf) of the system in different timescale regimes. Then, we employ this result to characterize the mutual information between input and output. We find that, in the absence of a processing unit, there exists a crossover from a region in which nonlinear summation is favored to one in which integration leads to higher mutual information. Crucially, we also show that the presence of an intermediate processing unit enhances encoding performances when acting on a faster timescale than the output unit. Further, we study the effect of the system’s dimensionality, finding that fast nonlinear integration schemes are associated with higher input-output mutual information in large multiscale systems, emerging as the backbone of accurate processing in this scenario. On the other hand, nonlinear summation is beneficial in small systems with highly heterogeneous couplings. Finally, we show that nonlinear integration may lead to the spontaneous emergence of bimodality in the output layer even for Gaussian inputs, underpinning its role in implementing dynamical input discrimination that can be tuned by tinkering with internal parameters.
Results
Multiscale information-processing systems
We consider a general information-processing system with three different stochastic units: input I, processing P, and output O. Each unit is composed of Mμ degrees of freedom (dofs) with a shared timescale τμ and whose activity is denoted by \({{{\bf{x}}}}_{\mu }\in {{\mathbb{R}}}^{{M}_{\mu }}\), with μ = I, P, O. All dofs within the same unit are linearly coupled with an interaction matrix \({\hat{A}}_{\mu }\in {{\mathbb{R}}}^{{M}_{\mu }\times {M}_{\mu }}\); conversely, the coupling from unit ν to μ is implemented via a nonlinear activation function \({{{\boldsymbol{\phi }}}}_{\mu \nu }\in {{\mathbb{R}}}^{{M}_{\mu }}\) that depends, in principle, on all dofs within ν and an interaction matrix \({\hat{A}}_{\mu \nu }\in {{\mathbb{R}}}^{{M}_{\mu }\times {M}_{\nu }}\). The system’s dynamics is described by the following set of Langevin equations:
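Schematically, and up to prefactor conventions,
\[
\tau_{\mu}\,\dot{{\bf{x}}}_{\mu} = -\hat{A}_{\mu}{\bf{x}}_{\mu} + \sum_{\nu \neq \mu} g_{\mu\nu}\,{\boldsymbol{\phi}}_{\mu\nu}({\bf{x}}_{\nu}) + \sqrt{2\tau_{\mu}}\,\hat{\sigma}_{\mu}\,{\boldsymbol{\xi}}_{\mu}(t)\,, \qquad \mu = I, P, O\,, \qquad (1)
\]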
where gμν is the interaction strength between unit ν and μ, ξμ is a vector of Gaussian white noises, and \({\hat{D}}_{\mu }={\hat{\sigma }}_{\mu }{\hat{\sigma }}_{\mu }^{T}\in {{\mathbb{R}}}^{{M}_{\mu }\times {M}_{\mu }}\) defines a diagonal diffusion matrix. For simplicity, we take \({\hat{D}}_{\mu }\) to be the identity matrix. We also assume that the input evolves independently, as this is the case for a relevant class of biophysical scenarios31,55,56,57. Then, the input is passed to the processing unit through a directional coupling, i.e., gPI ≠ 0 and gIP = 0. After the processing step, the signal arrives at the output unit, again through a directional coupling, i.e., gOP ≠ 0 and gPO = 0.
To investigate how the mechanisms implementing internal nonlinear processing affect the information content of the system, we study the mutual information between input and output units,
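\[
I_{IO} = \int {\rm d}{\bf{x}}_{I}\,{\rm d}{\bf{x}}_{O}\; p_{IO}({\bf{x}}_{I},{\bf{x}}_{O})\,\log_{2}\frac{p_{IO}({\bf{x}}_{I},{\bf{x}}_{O})}{p_{I}({\bf{x}}_{I})\,p_{O}({\bf{x}}_{O})} = H_{O} - {\langle h_{O|I} \rangle}_{I}\,, \qquad (2)
\]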
where \({h}_{O| I}=-{\langle {\log }_{2}{p}_{O| I}\rangle }_{O}\) is the entropy of the output conditioned on a given input configuration. Here, pIO(xI, xO) is the joint pdf of input and output dofs, pI(xI) and pO(xO) are their respective marginal pdfs, and HO is the Shannon entropy58 of the output dofs distribution computed in bits. IIO quantifies the information shared between I and O, therefore acting as an unbiased proxy for processing accuracy in this paradigmatic setting58.
As demonstrated in ref. 28, if the dynamics of the input unit are the fastest at play (τI ≪ τP, τO), no mutual information can be generated between I and O. Conversely, a slow input is a necessary condition to have a non-zero IIO. We still have the freedom to set the timescales of the processing and output units, distinguishing two relevant cases: a fast-processing system (τP ≪ τO) and a slow-processing one (τP ≫ τO). However, a crucial role is also played by the specific type of nonlinearity at hand, encoded in the vector ϕμν. As discussed in the introduction, we distinguish two widely used but distinct cases, corresponding to different processing schemes: nonlinear summation (ns)36,39,40,45,46,59 and integration (int)32,34,35,37,38,60,61,62. By setting a hyperbolic tangent as the activation function, a customary modeling choice for recurrent neural networks36,45,59, we have the following forms for the i-th component of the interaction terms between units:
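\[
\phi_{\mu\nu}^{{\rm ns},\,i}({\bf{x}}_{\nu}) = \sum_{j=1}^{M_{\nu}} A_{\mu\nu}^{ij}\,\tanh\!\left(x_{\nu}^{j}\right), \qquad
\phi_{\mu\nu}^{{\rm int},\,i}({\bf{x}}_{\nu}) = \tanh\!\left(\sum_{j=1}^{M_{\nu}} A_{\mu\nu}^{ij}\,x_{\nu}^{j}\right), \qquad (3)
\]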
where all nodes in unit ν contribute to the dynamics of node i in unit μ through the nonlinear activation function and the set of weights \({A}_{\mu \nu }^{ij}\), with j = 1, …, Mν, mediating the coupling. These two cases represent different physical processes. For a nonlinear summation, the signals generated by each dof in unit ν are first nonlinearly transformed, and then linearly projected by means of the interaction matrix \({\hat{A}}_{\mu \nu }\). In contrast, for nonlinear integration, the signals from unit ν are first linearly combined via the weight matrix \({\hat{A}}_{\mu \nu }\), and then the resulting integrated signal is nonlinearly transformed by the activation function and passed to the i-th dof of unit μ.
Exact solution for fast and slow processing units
The first contribution of this work is to provide an analytical solution for the joint distribution of the whole system, pIPO, that can be exploited to evaluate the input-output mutual information IIO, and the output pdf pO. While IIO informs us on the processing performance of the system, pO contains information on the ability to perform input discrimination. pIPO satisfies the following Fokker-Planck equation63:
where \({{{\mathcal{L}}}}_{\mu }\) is the Fokker-Planck operator associated with the unit μ = I, P, O, as detailed in the Supplementary Notes 1 and 2. Although general exact expressions are out of reach without approximations, the limits of fast and slow processing can provide useful insights into system operations, provided the input unit is slow. From refs. 28,29, we know that in these two limiting regimes the joint pdf of input, processing, and output dofs is the product of conditional distributions. As we show in the “Methods” and the Supplementary Notes 3 and 4, at steady state (i.e., when ∂tpIPO = 0) we have the stationary distributions:
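\[
p_{IPO}^{{\rm fp}} = p_{I}^{{\rm st}}\; p_{P|I}^{{\rm st}}\; p_{O|I}^{{\rm eff},{\rm st}}\,, \qquad (5)
\]
\[
p_{IPO}^{{\rm sp}} = p_{I}^{{\rm st}}\; p_{P|I}^{{\rm st}}\; p_{O|P}^{{\rm st}}\,, \qquad (6)
\]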
where the superscript “st” (omitted on the l.h.s.) stands for stationary, and “eff” indicates a pdf that solves an effective operator obtained from the ensemble average over dofs faster than its corresponding unit. We use the superscripts “fp” and “sp” to indicate that these quantities are evaluated for fast and slow processing, respectively. Let us inspect all these terms one by one. \({p}_{I}^{{{\rm{st}}}}\) is the multivariate Gaussian distribution of the input dofs with mean mI and covariance matrix \({\hat{\Sigma }}_{I}\) that solves the Lyapunov equation \({\hat{A}}_{I}{\hat{\Sigma }}_{I}+{\hat{\Sigma }}_{I}{\hat{A}}_{I}^{T}=2{\hat{D}}_{I}\). By exploiting the fact that intra-unit interactions are linear, all the conditional distributions may be written as:
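\[
p_{\mu|\nu}^{{\rm st}}({\bf{x}}_{\mu}|{\bf{x}}_{\nu}) = {\mathcal{N}}_{\mu}\!\left({\bf{m}}_{\mu|\nu}({\bf{x}}_{\nu}),\,\hat{\Sigma}_{\mu}\right), \qquad (7)
\]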
with \({{{\mathcal{N}}}}_{\mu }\) a Gaussian distribution over xμ, \({\hat{\Sigma }}_{\mu }\) satisfying its corresponding Lyapunov equation, and the average containing the dependence on the conditional variable as follows:
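\[
{\bf{m}}_{\mu|\nu}({\bf{x}}_{\nu}) = g_{\mu\nu}\,\hat{A}_{\mu}^{-1}\,{\boldsymbol{\phi}}_{\mu\nu}({\bf{x}}_{\nu})\,, \qquad (8)
\]
i.e., the point at which the linear intra-unit relaxation balances the frozen nonlinear drive from the slower unit.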
Notice that the functional form of Eq. (8) depends on the nonlinear processing mechanism considered in Eq. (3). However, when an effective operator is involved, calculations become harder. By using a convergent expansion of the hyperbolic tangent, we show that:
with again \({\hat{A}}_{O}{\hat{\Sigma }}_{O}+{\hat{\Sigma }}_{O}{\hat{A}}_{O}^{T}=2{\hat{D}}_{O}\) and
where we employed the shorthand notation \({{{\mathcal{F}}}}^{i}({{\bf{x}}},{{\bf{y}}})={{\mathcal{F}}}({x}^{i},{y}^{i})\). In particular, \({{\mathcal{F}}}\) is a nontrivial nonlinear function defined in the “Methods” and in the Supplementary Note 4, and we introduced the following integrated quantities:
From Eq. (10), we notice that the dependence on xI enters solely through mO∣P, defined in Eq. (8). The main difference resides in the fact that, in the case of summation, the nonlinear function \({{\boldsymbol{{{\mathcal{F}}}}}}\) has to be averaged with processing weights \({\hat{A}}_{OP}\), while in the case of integration, \({{\boldsymbol{{{\mathcal{F}}}}}}\) must be directly evaluated on integrated quantities.
Putting all these results together, we obtain an analytical expression for the joint pdf of the whole system, pIPO. We stress that pIPO is a highly nonlinear distribution. However, our factorization into conditional Gaussian distributions incorporates the nonlinearities only into their means, allowing in particular for efficient sampling (see “Methods”). Furthermore, the structure of the resulting conditional dependencies is crucially different between fast and slow processing units, with fundamental implications for the mutual information between the input and the output. To obtain general results, we focus on the case of random interactions, an approach that has provided fundamental insights in several fields32,33,36,59,64,65,66. We take interactions within the same unit μ to be distributed as \({A}_{\mu }^{ij} \sim {{\mathcal{N}}}(0,{\sigma }_{\mu }/\sqrt{{M}_{\mu }})\) with diagonal elements \({A}_{\mu }^{ii}=1\) for all i = 1, …, Mμ, so that each unit remains linearly stable as its dimension increases, provided σμ < 1 (see ref. 67 and Supplementary Note 3). Interactions from unit ν to μ are instead distributed as \({A}_{\mu \nu }^{ij} \sim {{\mathcal{N}}}(0,{\sigma }_{\mu \nu })\), and all results are obtained by averaging over realizations of these random matrices. Intuitively, while gμν describes the overall interaction strength from ν to μ, σμν models the intrinsic coupling heterogeneity.
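As an illustration, this ensemble and the corresponding stationary covariances can be generated numerically. The following is a minimal sketch (variable names are ours, with SciPy's Lyapunov solver standing in for the derivations of Supplementary Note 3):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(0)
M_I, sigma_I = 50, 0.9

# intra-unit couplings: off-diagonal entries N(0, sigma_mu / sqrt(M_mu)), unit diagonal
A_I = rng.normal(0.0, sigma_I / np.sqrt(M_I), size=(M_I, M_I))
np.fill_diagonal(A_I, 1.0)

# linear stability of the drift -A_I: all eigenvalues of A_I need a positive real part
assert np.linalg.eigvals(A_I).real.min() > 0

# stationary covariance from the Lyapunov equation A_I S + S A_I^T = 2 D_I (D_I = identity)
Sigma_I = solve_continuous_lyapunov(A_I, 2.0 * np.eye(M_I))
```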
In Fig. 1, we show stochastic trajectories and probability distributions at the steady state of the output degree of freedom for slow and fast processing, both in the case of summation and integration. For simplicity of computation and visualization, we will consider a one-dimensional output unit throughout this manuscript. While there is no striking difference between slow and fast processing at the dynamical level, nonlinear summation and integration lead to two very different distributions in the output node. Integrating incoming signals from one unit to the other favors the spontaneous emergence of a pronounced switching behavior that is reflected in a bimodal distribution, a signature of input discrimination. The last part of this manuscript will be dedicated to quantitatively substantiating this observation.
a Scheme of the model for a slow processing unit. From left to right, units are ordered with decreasing timescales. I indicates the input, P the processing unit, and O the output. Links with empty dots denote interactions from a slow to a fast unit. b Output distribution for a slow processing unit in the presence of nonlinear summation, where the activity of the components of each unit is first integrated and then summed. The dashed line is obtained by exact sampling, while the shaded area represents the histograms obtained from Langevin trajectories with a timescale ratio Δτ = 10^{−2} between the units. c The stochastic trajectory of the one-dimensional output unit in this regime. d Same as (b) in the presence of nonlinear integration. e Stochastic trajectory of the one-dimensional output in this case. f Scheme of the model for a fast processing unit, with the same ordering as in (a). The link with a filled dot denotes interactions from a faster to a slower unit. g Output distribution for a fast processing unit in the presence of nonlinear summation, where the activity of the components of each unit is first summed and then integrated. h Stochastic trajectory for the one-dimensional output in this case. i Same as panel (g) in the presence of nonlinear integration. j Stochastic trajectory for the one-dimensional output in this case. In this figure, the unit dimensions are MI = 5, MP = 3, and MO = 1. Interactions between units are distributed as \({{\mathcal{N}}}(0,1)\), whereas intra-unit interactions follow \({{\mathcal{N}}}(0,0.9/\sqrt{{M}_{\mu }})\).
Enhanced information by fast processing units
We can now exploit the exact factorization of the joint pdf of the system to evaluate the accuracy of processing the stochastic input and encoding it into the one-dimensional output, by means of the mutual information IIO in Eq. (2). To establish a baseline for the full processing scheme described in the previous section, we first consider the simpler scenario of an input signal xI that is directly passed to a one-dimensional output unit xO. Once again, we focus on the limiting case of a slow input (τI ≫ τO) in which information can be transferred from the input to the output unit28,29. The joint steady-state distribution of input and output dofs reads \({p}_{IO}^{{{\rm{np}}}}={p}_{I}^{{{\rm{st}}}}{p}_{O| I}^{{{\rm{np,st}}}}\), where the superscript “np” stands for “no processing”. Here, \({p}_{O| I}^{{{\rm{np,st}}}}\) is a Gaussian distribution whose variance is independent of xI (see Eq. (7) and “Methods” for details). Thus, \({h}_{O| I}^{{{\rm{np}}}}\) does not depend on xI and is equal to
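\[
h_{O|I}^{{\rm np}} = \frac{1}{2}\log_{2}\left(2\pi e\,\Sigma_{O}\right), \qquad (12)
\]
with \(\Sigma_{O}\) the stationary variance of the one-dimensional output,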
so that the mutual information simply reads
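\[
I_{IO}^{{\rm np}} = H_{O} - h_{O|I}^{{\rm np}} = H_{O} - \frac{1}{2}\log_{2}\left(2\pi e\,\Sigma_{O}\right). \qquad (13)
\]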
Therefore, evaluating \({I}_{IO}^{{{\rm{np}}}}\) amounts to computing the Shannon entropy of the output distribution, which can be done using standard estimators (see “Methods”)68.
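Ref. 68 corresponds to a classical spacing-based estimator of differential entropy; a minimal sketch in Python (the window choice m ≈ √n is a common heuristic, not prescribed here):

```python
import numpy as np

def vasicek_entropy_bits(samples, m=None):
    """Spacing estimator of differential entropy (Vasicek, 1976), in bits."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    if m is None:
        m = max(1, int(np.sqrt(n)))  # window size: a common heuristic choice
    lo = np.clip(np.arange(n) - m, 0, n - 1)  # boundary order statistics are repeated
    hi = np.clip(np.arange(n) + m, 0, n - 1)
    return np.mean(np.log2(n * (x[hi] - x[lo]) / (2 * m)))
```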
In Fig. 2a–c, we compare the behavior of \({I}_{IO}^{{{\rm{np}}}}\) with nonlinear summation and nonlinear integration, as a function of the input-output coupling strength, gOI, and the standard deviation of their interactions, σOI, that accounts for coupling heterogeneity. As expected, in both cases information increases with gOI. Crucially, we also find that, while nonlinear integration performs better at small σOI, nonlinear summation becomes dominant at large σOI. This nontrivial switch signals the fact that, in the presence of large elements in the interaction matrix, \({I}_{IO}^{{{\rm{np}}}}\) is favored by nonlinear summation. Additionally, as shown in Fig. 2d, e, the output distribution with nonlinear integration becomes bimodal for large σOI due to the saturating effect of the hyperbolic tangent—a phenomenon much more pronounced when all the inputs are summed together in the argument of the activation function. We also find (Fig. 2f, g) that mutual information saturates to a finite value as the input gets closer to linear instability, a feature already observed in models of neural populations19. In particular, an input closer to linear instability appears to be always beneficial for nonlinear summation, further suggesting that large input values—either from xI itself or due to specific large couplings—are better represented in the output by summing separate activation functions. With nonlinear integration, instead, linear instability and large values of the input may decrease \({I}_{IO}^{{{\rm{np}}}}\), as they may push the activation function to saturation.
a–c Mutual information between input and output IIO in a system without a processing unit, as a function of the coupling strength gOI and the standard deviation σOI of the interaction matrix \({\hat{A}}_{OI}\) for nonlinear summation (superscript “ns”, teal), nonlinear integration (superscript “int”, orange), and their difference in (c). At small σOI, information is higher in the presence of nonlinear integration, whereas at large σOI, nonlinear summation dominates. The pentagon and triangle represent specific parameter values analyzed in (d, e). The sketches on top of the panels represent different computational schemes. d, e Probability distribution of the one-dimensional output, pO, for the specific values highlighted in (a–c). At large σOI, the output distribution may become bimodal with nonlinear integration, due to the saturation of the activation function for large arguments. f, g Mutual information IIO as a function of the input stability for two different choices of parameters. As the input gets closer to linear instability, σI → 1, the mutual information grows for nonlinear summation. For nonlinear integration, instead, it saturates at possibly lower values at large σOI. h–j Mutual information IIO with the addition of a one-dimensional processing unit evolving on a fast timescale for different values of interaction heterogeneity σOP. For sufficiently large processing-output couplings gOP, the information (solid lines) is larger than in the corresponding input-output system, where the processing is removed (dashed lines). Here, gPI = 2. k–m Same, with a processing unit slower than the output. Now, the mutual information is always smaller than in the input-output system. In this figure, unless otherwise specified, the input dimension is MI = 50 and its interaction heterogeneity is σI = 0.9.
Then, we add back a processing unit to understand its effect on the mutual information between input and output units. We note that the systems with and without processing are fundamentally different, as the processing unit evolves on its own timescale and therefore alters both the form of the Fokker-Planck operators and the structure of the associated joint probability distributions. In the case of fast processing, the joint steady-state distribution of input and output dofs is obtained from Eq. (5) by integrating over xP, i.e., \({p}_{IO}^{{{\rm{fp}}}}={p}_{I}^{{{\rm{st}}}}{p}_{O| I}^{{{\rm{eff,st}}}}\). Thus, as before, the variance of the Gaussian distribution \({p}_{O| I}^{{{\rm{eff,st}}}}\) is independent of xI (Eq. (9)), and the mutual information can be written following Eqs. (12) and (13) (using fp as a superscript). In the presence of a slow processing unit, instead, from Eq. (6) we have:
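\[
p_{O|I}^{{\rm sp},{\rm st}}({\bf{x}}_{O}|{\bf{x}}_{I}) = \int {\rm d}{\bf{x}}_{P}\; p_{P|I}^{{\rm st}}({\bf{x}}_{P}|{\bf{x}}_{I})\; p_{O|P}^{{\rm st}}({\bf{x}}_{O}|{\bf{x}}_{P})\,. \qquad (14)
\]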
Although an expression for \({h}_{O| I}^{{{\rm{sp}}}}\) cannot be easily obtained in this case, we can efficiently sample \({p}_{O| I}^{{{\rm{sp}}},{{\rm{st}}}}\) by using Eq. (14) to compute the mutual information \({I}_{IO}^{{{\rm{sp}}}}\) from Eq. (2), as detailed in the “Methods”. In order to compare the results with the ones in the absence of a processing unit, we replace the previous output with a one-dimensional processing unit, i.e., we take gOI → gPI and σOI → σPI. Then, we add an additional one-dimensional layer that now constitutes the output of the system. In this way, by removing the processing unit, we recover the original input-output system, allowing a direct comparison between the two. We study the behavior of the mutual information as we change the processing-output coupling, gOP, and interaction heterogeneity, σOP.
In Fig. 2h–j we show that, with a fast processing unit, for sufficiently large gOP the mutual information \({I}_{IO}^{{{\rm{fp}}}}\) is larger than that of the corresponding input-output system. The coupling value for which such crossing takes place decreases with σOP, hinting at the fact that a system with either large gOP or σOP can outperform its counterpart without processing. On the other hand, this is not possible with a slow processing unit, for which the presence of a one-dimensional processing layer seems to be detrimental, or immaterial at best (Fig. 2k–m). Indeed, we find that \({I}_{IO}^{{{\rm{sp}}}}\le {I}_{IO}^{{{\rm{np}}}}\), approaching this value with nonlinear summation at large gOP. Once more, whether nonlinear integration or summation leads to more accurate input encoding primarily depends on gOP and σOP.
Enhanced information by nonlinear integration
While a one-dimensional processing unit can be advantageous or detrimental depending on its timescale, increasing its dimensionality can modify this picture. We now explore this direction, starting with the case in which processing and input units have the same large dimension MI = MP = 50. In Fig. 3, we compare the mutual information between input and output for the case of nonlinear summation, \({I}_{IO}^{{{\rm{ns}}}}\), and integration, \({I}_{IO}^{{{\rm{int}}}}\), for both slow and fast processing. We first take σPI = σOP = 1 to study the effects of the couplings gPI and gOP. Figure 3a–c (slow processing) and d–f (fast processing) show that, independently of the internal timescale ordering, nonlinear integration leads to higher mutual information than summation. Notice that, as in the previous one-dimensional case, for both nonlinear schemes we find that a fast processing unit systematically outperforms a slow one (see Fig. 3g–h). Therefore, from now on, we will only consider the fast-processing scenario. Furthermore, in Fig. 3i–j, we show that \({I}_{IO}^{{{\rm{int}}}}\) displays a nontrivial peak as a function of gPI, whereas \({I}_{IO}^{{{\rm{ns}}}}\) saturates to lower values. This observation signals the existence of an optimal value of the input-processing coupling that helps maximize the encoding performance with (fast) high-dimensional processing units and nonlinear integration.
a–c Mutual information between input and output IIO in a system with a slow processing unit as a function of the coupling strengths gPI and gOP for nonlinear integration (superscript “int”, orange), nonlinear summation (superscript “ns”, teal), and their difference in (c). Nonlinear integration produces higher information than nonlinear summation. The sketch on the left of (a) indicates a system with a slow processing unit, as explained in Fig. 1a. The sketches on top of the panels represent different computational schemes. d–f Same, but for a fast processing unit, with the sketch on the left of (d) indicating a system with a fast processing unit (see Fig. 1b). g–h Difference of mutual information between the fast and slow processing scenarios for nonlinear summation and integration. IIO is systematically higher with a fast processing unit. This effect is particularly relevant for an activation function implementing nonlinear integration. The sketch on top graphically represents the difference between the two schemes shown here. i, j Mutual information for nonlinear summation and integration at fixed σOP = 1. Integration always outperforms nonlinear summation, and displays a nontrivial peak of IIO at intermediate values of gPI. k, l Same as (i–j) but for σOP = 10. The situation is reversed in this case, with nonlinear summation leading to a larger mutual information. In this figure, unless otherwise specified, the standard deviations of the interaction matrices are σPI = σOP = 1, σI = σP = 0.9, and the input and processing dimensions are MI = MP = 50. Results are obtained by averaging over 10^3 realizations of the random interaction matrices.
So far, we have focused on the case of small heterogeneity of processing-output interactions, where nonlinear integration displays a computational advantage. However, if we keep both MP and MI fixed, the situation is reversed at larger σOP and large gPI (Fig. 3k, l). In this regime, \({I}_{IO}^{{{\rm{ns}}}}\) saturates at values that are larger than the peak of \({I}_{IO}^{{{\rm{int}}}}\). This suggests once more the presence of a nontrivial interplay between the two schemes as a function of interaction heterogeneity and coupling strengths.
Crucially, even in these strong-coupling and large-heterogeneity regimes, the computational advantage of nonlinear integration is restored at sufficiently large processing sizes (Fig. 4a–d). Intuitively, for small MP, this may be due to large elements of the interaction matrices affecting only a few terms in the nonlinear summation scheme. On the other hand, the same elements may push nonlinear integration into the saturation regime if not balanced by any opposite signals, therefore masking the other interactions. Overall, we find that nonlinear summation provides higher mutual information at small MP, and with large couplings and heterogeneity, but this effect becomes less and less prominent and eventually disappears as MP increases (Fig. 4e). Importantly, this effect depends on the sparsity of the connections between the units. In this case, we rewrite Eq. (3) as
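\[
\phi_{\mu\nu}^{{\rm ns},\,i}({\bf{x}}_{\nu}) = \frac{1}{C_{\mu\nu}^{i}}\sum_{j=1}^{M_{\nu}} c_{\mu\nu}^{ij}\,A_{\mu\nu}^{ij}\,\tanh\!\left(x_{\nu}^{j}\right), \qquad
\phi_{\mu\nu}^{{\rm int},\,i}({\bf{x}}_{\nu}) = \tanh\!\left(\frac{1}{C_{\mu\nu}^{i}}\sum_{j=1}^{M_{\nu}} c_{\mu\nu}^{ij}\,A_{\mu\nu}^{ij}\,x_{\nu}^{j}\right),
\]
with \(c_{\mu\nu}^{ij}\in\{0,1\}\) indicating the presence of a link (the placement of the 1/C normalization is a convention adopted here),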
where \({C}_{\mu \nu }^{i}\) is the number of connections from unit ν to the i-th node of unit μ. In Fig. 4e, we show that increasing this sparsity by reducing the probability of connection between the units, punit, is qualitatively equivalent to effectively reducing the processing unit dimension, as expected. Hence, for fixed MP, a system with sparser inter-unit connections will favor nonlinear summation, in line with our previous considerations. Furthermore, our results are qualitatively robust with respect to changes in the processing unit topology. In particular, in Fig. 4f, we consider a Barabási–Albert topology with Gaussian weights for \({\hat{A}}_{P}\), to introduce a processing structure with hubs and a nontrivial degree distribution. Although this sparser intra-unit topology tends to favor nonlinear integration, we find in general the same interplay between the two nonlinear schemes as in the fully connected case. In the Supplementary Note 4, we show that these results remain robust in the presence of other topologies of the processing unit, as well as in the case where punit is not constant but varies across nodes. While our results remain limited to a set of paradigmatic topologies, the observed robustness seems to indicate the fundamental role of key topological parameters, such as the degree of the processing layer and the number of inter-layer connections, in determining the processing performance of complex nonlinear systems.
a, b Difference between the mutual information IIO for nonlinear integration and for nonlinear summation as a function of the input-processing coupling gPI for two different choices of parameters. In a fast-processing system, the advantage of nonlinear integration depends on the interplay between the dimensionality of the processing unit, MP, and the heterogeneity of \({\hat{A}}_{OP}\), quantified by σOP. For small values (σOP = 1), we always find \({I}_{IO}^{{{\rm{int}}}} > {I}_{IO}^{{{\rm{ns}}}}\). c, d At large heterogeneity (σOP = 10), instead, such an advantage is achieved only for large enough MP, and depends on both couplings gPI and gOP. The sketches next to (b) and (d) represent nonlinear integration (orange) and nonlinear summation (teal), and the intensity of the colorbar quantifies the advantage of one over the other. e Difference of mutual information between nonlinear integration and summation as a function of coupling strengths gPI and gOP, for different processing dimensions and sparsity of inter-unit couplings. Here, we consider a fully connected processing unit. In general, the processing dimension MP determines whether nonlinear integration or nonlinear summation leads to a higher mutual information between the input and the output. If MP is small, \({I}_{IO}^{{{\rm{ns}}}}\) is typically higher in a strong-coupling regime, whereas \({I}_{IO}^{{{\rm{int}}}} > {I}_{IO}^{{{\rm{ns}}}}\) at large MP. Increasing the sparsity of the connections between units effectively reduces the processing dimension (here, we set the probability of a connection between nodes of different units to punit = 0.5). f Same as (e), but changing the topology of the processing unit. We observe that this modification does not qualitatively affect the results, although sparser processing networks favor nonlinear integration. In this figure, the standard deviations of the interaction matrices are σPI = 1, σI = σP = 0.9, the input dimension is MI = 50, and the interaction heterogeneity is σOP = 10 in (e, f). Results are obtained by averaging over 10^3 realizations of the random interaction matrices.
Interplay between input and processing dimensionality
The results in Fig. 4 suggest that the size of the processing unit deeply affects the information between input and output dofs. To further explore this effect, we now focus on fast nonlinear integration and study the interplay between input and processing dimensionalities. In Fig. 5a, we show that, at a given MI, there exists an optimal value of \({M}_{P}={M}_{P}^{* }\) that maximizes \({I}_{IO}^{{{\rm{int}}}}\). \({M}_{P}^{* }\) decreases as the input size increases, so that smaller inputs are optimally processed by larger processing units. This effect emerges in sufficiently strong-coupling regimes, as we show in Fig. 5b and in the Supplementary Note 4. We also find that the optimal processing dimension \({M}_{P}^{* }\) increases with σOP (Fig. 5c).
a Mutual information between input and output IIO for nonlinear integration and a fast processing unit. For a given input dimension MI, there exists an optimal processing dimensionality \({M}_{P}^{* }\) that maximizes IIO (dashed lines). The coupling strength between processing and output is gOP = 10. b, c Optimal processing dimensionality as a function of the input dimension MI for two different values of interaction heterogeneity σOP. Error bars represent one standard deviation over realizations of the random interaction matrices. The emergence of an optimal processing dimension is more evident at sufficiently strong couplings and is characterized by a decrease of \({M}_{P}^{* }\) as MI increases. The optimal processing dimension \({M}_{P}^{* }\) also increases with σOP. d, e Same as (a) for smaller gOP and different interaction heterogeneity σOP. At intermediate values (gOP = 5) and low heterogeneity (σOP = 1), \({I}_{IO}^{{{\rm{int}}}}\) is higher for smaller input dimensions and \({M}_{P}^{* }\) stays almost unchanged independently of MI. Increasing σOP to 2, large processing units become favored at small input dimensions and vice versa. In this figure, the standard deviations of the interaction matrices are σPI = 1, σI = σP = 0.9, and the input-processing coupling is gPI = 10. Results are obtained by averaging over 2 × 10^4 realizations of the random interaction matrices.
In particular, for small gOP and σOP, information is typically higher for smaller input sizes (Fig. 5d), and the optimal processing dimension \({M}_{P}^{* }\) remains small at any MI. Slightly increasing the interaction heterogeneity σOP while keeping gOP fixed drastically alters \({I}_{IO}^{{{\rm{int}}}}\) (Fig. 5e), revealing the nontrivial interplay between MI and \({M}_{P}^{* }\). Heuristically, this provides evidence that a nonlinear embedding of a low-dimensional input in a higher-dimensional processing space favors information encoding. On the contrary, \({I}_{IO}^{{{\rm{int}}}}\) is maximal at small MP for large MI, so that information processing of high-dimensional inputs is favored by a nonlinear compression of the input into a lower-dimensional processing space. This behavior may reveal quantitative insights into optimal operation regimes and diverse strategies to encode information.
Emergent output bimodality
Nonlinear integration may also be advantageous from a dynamical perspective. In Fig. 6a, we consider the case of a fast processing unit and compare the bimodality of the output pdf obtained for nonlinear summation and integration by means of Sarle’s bimodality coefficient (see “Methods”), respectively denoted by \({b}_{O}^{{{\rm{ns}}}}\) and \({b}_{O}^{{{\rm{int}}}}\). We show that integration is associated with higher bimodality coefficients, i.e., more pronounced bimodality, thus enabling more accurate input discrimination in the output distribution. We note, however, that this effect is purely dynamic, as higher bimodality does not always imply larger input-output mutual information (see for instance Fig. 2d, e and Supplementary Note 5). The presence of a stochastic Gaussian input and random interaction matrices makes it particularly hard to pinpoint the input features that the system is discriminating. Crucially, however, this emergent bimodality may be tuned by introducing suitable processing biases in how the signal of certain nodes is encoded. We can add this ingredient in Eq. (3) by applying the substitution \({x}_{\nu }^{j}\to {x}_{\nu }^{j}-{\theta }_{\nu }^{j}\), where the bias is introduced as:
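\[
\phi_{\mu\nu}^{{\rm b},{\rm ns},\,i}({\bf{x}}_{\nu}) = \sum_{j=1}^{M_{\nu}} A_{\mu\nu}^{ij}\,\tanh\!\left(x_{\nu}^{j}-\theta_{\nu}^{j}\right), \qquad
\phi_{\mu\nu}^{{\rm b},{\rm int},\,i}({\bf{x}}_{\nu}) = \tanh\!\left(\sum_{j=1}^{M_{\nu}} A_{\mu\nu}^{ij}\left(x_{\nu}^{j}-\theta_{\nu}^{j}\right)\right),
\]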
where the superscript “b” indicates the presence of the bias. Note that θν can, in principle, be different for each unit. In Fig. 6b–e, we consider a system in the presence of a fast processing unit and include the presence of a random bias θν whose elements are drawn from \({{\mathcal{N}}}(0,1)\) for each ν. By comparing the same realization of random interaction matrices \({\hat{A}}_{\mu }\) and \({\hat{A}}_{\mu \nu }\) (as discussed for Fig. 3) without (Fig. 6b, c) and with (Fig. 6d, e) θν, we find that the presence of a bias triggers an imbalance in the output bimodality. Notice that this emergent imbalance can be different between summation and integration due to the intrinsic randomness of all the system’s components, as shown in Fig. 6. Thus, nonlinear integration enables a more pronounced, tunable output bimodality, allowing the system to statistically select one of the two emerging modes.
a Bimodality coefficient of the output probability distribution, bO, as a function of input and processing dimensions, respectively MI and MP, in a system with a fast processing unit, as represented by the sketch next to the panel. We indicate the case of nonlinear summation in teal and with the superscript “ns”, and the one of nonlinear integration in orange with the superscript “int”. Here, gPI = gOP = 10, σOP = σPI = 1, and σI = σP = 0.9. Results are averaged over 10^3 realizations of random matrices. Both activation functions, implementing nonlinear summation or nonlinear integration, may lead to a bimodal output distribution, particularly for smaller dimensions. Nevertheless, on average, nonlinear integration enhances bimodality in all parameter ranges explored. b, c One-dimensional output distributions for MI = 5 and MP = 10 in the case of nonlinear summation (teal) and nonlinear integration (orange). d, e Same random matrix realization as in (b, c), but after introducing a random bias in the processing nonlinearity. The sketch next to (b, c) represents the two computational schemes considered. The bias allows for tuning the emergent output bimodality, allowing the system to select one of the modes. Importantly, due to its randomness, the bias may have opposite effects depending on the type of activation function.
Discussion
In this work, we studied nonlinear processing through different activation functions in a paradigmatic information-processing system. By leveraging the presence of multiple timescales, we analytically obtained the joint distribution of the system and computed the input-output mutual information. We compared two nonlinear processing schemes, summation and integration, which have been employed in several contexts. In systems implementing nonlinear summation, inputs are first nonlinearly transformed and then averaged, while inputs are first averaged and then transformed in systems supporting nonlinear integration. We showed that fast processing units outperform slow-processing ones, leading to higher input-output mutual information. Furthermore, we found that coupling strengths, interaction heterogeneity, and processing dimensionality—modulated by the number of connections between the units—determine whether nonlinear integration or summation is beneficial to the information shared between input and output units. Finally, we highlighted a nontrivial tradeoff between input and processing dimensions emerging in strong-coupling regimes.
Overall, our paradigmatic approach allowed us to investigate the emergence of accurate encoding in information-processing architectures. In particular, we highlighted the advantages of nonlinear integration in large multiscale systems and nonlinear summation in smaller ones. The information-theoretic differences between these two schemes may help in understanding optimal coding strategies and why certain biological systems implement integration or summation. The fact that nonlinear summation performs better with fewer degrees of freedom and more heterogeneous couplings may be especially relevant for those biochemical systems where a limited number of chemical species support signal propagation. Conversely, large neuronal networks may perform better by integrating incoming signals, with the specific topology of the interaction network determining the paths along which signals are dynamically propagated8,9,69. In this direction, a deeper understanding of the interplay between the nonlinearities and specific topologies is needed and has not been explored in this work, which has solely focused on the case of random interactions. A fundamental extension would be to use real-world networks as backbone structures to build processing layers. On the one hand, these investigations might shed light on the ability of biological systems to process information; on the other hand, they might be informative for designing bio-inspired networks with optimal processing abilities. However, the design of information-processing systems faces the challenge of optimizing both nonlinear functions and network structure to implement given target functions. This goal will necessarily require the development of specific training algorithms that work for large stochastic systems. Similarly, studying scenarios with several output nodes is currently out of reach, since estimating the output entropy and mutual information in high dimensions is not feasible with the available numerical estimators.
Future works will need to consider other types of activation functions, evaluating their performances in terms of input-output information and the relative timescales between the units. Furthermore, it will be important to consider systems where different units implement different activation functions, allowing for more heterogeneity in terms of computational capabilities. In these scenarios, it will be interesting to systematically investigate how input, processing, and output dimensions shape information in specific real-world systems. Along this line, architectures with several processing units, possibly acting on a diverse range of timescales, may be necessary to deal with bio-inspired models and more structured inputs. This setting will also enable a natural implementation of multiple tasks whose presence might change the definition of processing performance, framing it in the context of decision-making and possibly connecting it with recent advances at the interface of information processing, decisions, and large language models across different scales70. Our work will stand as a fundamental step for these explorations, unraveling how different types of dynamical nonlinearities underlie information and computation in real-world systems.
Methods
Exact solution for fast processing units
For a system with a fast processing unit, the timescale separation τI ≫ τO ≫ τP leads to the steady-state or stationary joint probability distribution (i.e., the solution obeying \({\partial }_{t}{p}_{IPO}^{{{\rm{st}}}}=0\))
where we omitted the superscript “st” on the l.h.s. for brevity, and
as we show explicitly in the Supplementary Note 4. The covariance matrices obey their respective Lyapunov equations, e.g., \({\hat{A}}_{I}{\hat{\Sigma }}_{I}+{\hat{\Sigma }}_{I}{\hat{A}}_{I}^{T}=2{\hat{D}}_{I}\), with \({\hat{D}}_{I}={{\rm{diag}}}\left({D}_{I}^{1},\ldots ,{D}_{I}^{{M}_{I}}\right)\), and similarly for the processing and output unit. Importantly, provided that the deterministic system is stable—i.e., that the eigenvalues of \({\hat{A}}_{\mu }\) all have positive real parts—these solutions exist and are well-defined63. Since we take \({A}_{\mu }^{ii}=1\), when interactions are randomly distributed as \({{\mathcal{N}}}(0,{\sigma }_{\mu }/\sqrt{{M}_{\mu }})\), in the large Mμ limit, the μ-th unit is stable if σμ < 1 (ref. 67). The input-dependent mean of the processing, mP∣I(xI), is defined as in Eq. (8). Instead, the output distribution obeys an effective operator whose shape depends on whether nonlinear integration or nonlinear summation is employed. We find that, for nonlinear summation, the effective mean is given by
with
where we introduced the functions
For nonlinear integration, the calculations are more involved. As reported in the main text,
where
and
We present the detailed derivation of all these expressions in Supplementary Note 4.
Exact solution for slow processing units
For a system with a slow processing unit, the timescale separation τI ≫ τP ≫ τO leads to the steady-state joint probability distribution
again omitting the superscript “st” on the l.h.s, with
as we show explicitly in the Supplementary Note 4. With respect to the previous case, the crucial difference is that now \({p}_{O| P}^{{{\rm{st}}}}\) is conditioned on the processing rather than on the input, due to the timescale structure. Furthermore, the input-dependent mean of the processing, mP∣I(xI) (see Eq. (8)), has the same form as the processing-dependent mean of the output, mO∣P(xP). We present the detailed derivation of all these expressions in the Supplementary Note 4.
Direct input–output connections
We consider here the case of a system with only an input unit, xI, and an output unit, xO, with MO nodes. In the presence of a slow input, τI ≫ τO, the steady-state or stationary solution of the Fokker-Planck equation for the joint probability distribution, \({\partial }_{t}{p}_{IO}^{{{\rm{st}}}}=0\), reads
omitting “st” on the l.h.s., where
and
for all i = 1, …, MO, as we explicitly show in the Supplementary Note 3.
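Consistently with Eqs. (3), (7), and (8), these expressions presumably read
\[
p_{IO}^{{\rm st}} = p_{I}^{{\rm st}}({\bf{x}}_{I})\;{\mathcal{N}}_{O}\!\left({\bf{m}}_{O|I}({\bf{x}}_{I}),\,\hat{\Sigma}_{O}\right), \qquad {\bf{m}}_{O|I}({\bf{x}}_{I}) = g_{OI}\,\hat{A}_{O}^{-1}\,{\boldsymbol{\phi}}_{OI}({\bf{x}}_{I})\,,
\]
with components
\[
m_{O|I}^{{\rm ns},\,i} = g_{OI}\sum_{j,k}\left(\hat{A}_{O}^{-1}\right)^{ij} A_{OI}^{jk}\,\tanh\!\left(x_{I}^{k}\right), \qquad
m_{O|I}^{{\rm int},\,i} = g_{OI}\sum_{j}\left(\hat{A}_{O}^{-1}\right)^{ij}\tanh\!\left(\sum_{k} A_{OI}^{jk}\,x_{I}^{k}\right).
\]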
Exact sampling scheme for fast processing
In general, although our approach allows us to factorize the joint distribution \({p}_{IPO}^{{{\rm{st}}}}\) into a product of Gaussian distributions for all timescale orderings, the input-output distribution, i.e., the one obtained from the marginalization over the processing dofs, remains highly nonlinear due to the nontrivial dependencies in the means of each Gaussian factor. However, our factorization allows for its efficient sampling, as all the nonlinearities appear as conditional dependencies. For a fast processing unit, for instance, we have \({p}_{IO}^{{{\rm{fp}}}}={p}_{I}^{{{\rm{st}}}}{p}_{O| I}^{{{\rm{eff,st}}}}\), so that the conditional entropy
\[
h_{O|I} = \frac{1}{2}\log_{2}\left[(2\pi e)^{M_{O}}\det\hat{\Sigma}_{O}\right]
\]
is easy to compute, as it does not depend on \({{{\bf{x}}}}_{I}\). Thus, we only need to evaluate HO numerically to estimate IIO = HO − hO∣I. The procedure for Nsam samples is as follows (a code sketch is given after the list):
1. sample \({\{{{{\bf{x}}}}_{I}\}}_{i = 1}^{{N}_{{{\rm{sam}}}}}\) from the independent Gaussian distribution of the input;
2. compute the means \({{{\bf{m}}}}_{O| I}({\{{{{\bf{x}}}}_{I}\}}_{i})\) for each sample i;
3. for all i, sample xO from the multivariate Gaussian with covariance \({\hat{\Sigma }}_{O}\) and mean \({{{\bf{m}}}}_{O| I}({\{{{{\bf{x}}}}_{I}\}}_{i})\).
Then, the entropy HO of the output distribution can be estimated from the samples \({\{{{{\bf{x}}}}_{O}\}}_{i}\) (see also Supplementary Note 4).
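A minimal sketch of this procedure for a one-dimensional output (here mean_OI is a hypothetical stand-in for the effective conditional mean of Eq. (10), whose explicit form depends on the chosen nonlinearity, and SciPy's spacing-based entropy estimator replaces a hand-rolled one):

```python
import numpy as np
from scipy.stats import differential_entropy

def mutual_info_fast_processing(mean_OI, Sigma_I, Sigma_O, n_sam, rng):
    """I_IO = H_O - h_{O|I} for a fast processing unit and a 1D output."""
    # 1. sample inputs from their stationary Gaussian distribution
    x_I = rng.multivariate_normal(np.zeros(Sigma_I.shape[0]), Sigma_I, size=n_sam)
    # 2. effective conditional means of the output, one per input sample
    m = np.array([mean_OI(x) for x in x_I])
    # 3. sample the one-dimensional output around each conditional mean
    x_O = rng.normal(m, np.sqrt(Sigma_O))
    # h_{O|I} is the entropy of a Gaussian with input-independent variance Sigma_O
    h_cond = 0.5 * np.log2(2 * np.pi * np.e * Sigma_O)
    # H_O is estimated directly from the output samples, in bits
    return differential_entropy(x_O, base=2) - h_cond
```

For instance, in the no-processing case with nonlinear summation one may take mean_OI = lambda x: g_OI * A_OI @ np.tanh(x), in the spirit of Eqs. (3) and (8).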
Exact sampling scheme for slow processing
In the case of a slow processing unit, due to the timescale ordering, the conditional dependencies are crucially different. The input-output distribution,
\[
p_{IO}^{{\rm sp}} = p_{I}^{{\rm st}}\int {\rm d}{\bf{x}}_{P}\; p_{P|I}^{{\rm st}}\, p_{O|P}^{{\rm st}}\,,
\]
cannot be easily computed. Thus, the entropy of the conditional distribution hO∣I(xI) is not known analytically. To address this issue, we exploit the fact that we can efficiently sample \({p}_{O| I}^{{{\rm{st}}}}\), allowing us to estimate directly the conditional entropy \({H}_{O| I}={\langle {h}_{O| I}\rangle }_{I}\) with importance sampling. We detail below the sampling steps for a fixed number of input samples Nsam,I and Nsam output samples per input (a code sketch follows the list):
1. sample a fixed input \({{{\bf{x}}}}_{I}^{(i)} \sim {{{\mathcal{N}}}}_{I}(0,{\hat{\Sigma }}_{I})\) for i = 1, …, Nsam,I;
2. for each input sample \({{{\bf{x}}}}_{I}^{(i)}\), compute \({{{\bf{m}}}}_{P| I}\left({{{\bf{x}}}}_{I}^{(i)}\right)\), and extract the samples \({{{\bf{x}}}}_{P}^{(i,j)}\) from \({{{\mathcal{N}}}}_{P}\left({{{\bf{m}}}}_{P| I}\left({{{\bf{x}}}}_{I}^{(i)}\right),{\hat{\Sigma }}_{P}\right)\) for j = 1, …, Nsam;
3. for each processing sample \({{{\bf{x}}}}_{P}^{(i,j)}\), compute the mean \({{{\bf{m}}}}_{O| P}\left({{{\bf{x}}}}_{P}^{(i,j)}\right)\) and extract the corresponding output \({{{\bf{x}}}}_{O}^{(i,j)}\) from \({{{\mathcal{N}}}}_{O}\left({{{\bf{m}}}}_{O| P}\left({{{\bf{x}}}}_{P}^{(i,j)}\right),{\hat{\Sigma }}_{O}\right)\);
4. for each input sample \({{{\bf{x}}}}_{I}^{(i)}\), estimate the entropy \({h}_{O| I}\left({{{\bf{x}}}}_{I}^{(i)}\right)\) of the conditional distribution \({p}_{O| I}^{{{\rm{st}}}}\) from the output samples \({\{{{{\bf{x}}}}_{O}\}}_{i,j}\);
5. estimate the conditional entropy HO∣I via importance sampling.
Then, as in the case of fast processing, we can estimate the entropy HO from the output samples, and then the mutual information IIO = HO − HO∣I (see also the Supplementary Note 2).
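A minimal sketch of the nested loop (mean_PI and mean_OP are hypothetical stand-ins for the conditional means of Eq. (8), and the plain Monte Carlo average in the last step stands in for the importance-sampling estimate):

```python
import numpy as np
from scipy.stats import differential_entropy

def mutual_info_slow_processing(mean_PI, mean_OP, Sigma_I, Sigma_P, Sigma_O,
                                n_sam_I, n_sam, rng):
    """I_IO = H_O - H_{O|I} for a slow processing unit and a 1D output."""
    outputs, h_cond = [], []
    for _ in range(n_sam_I):
        # 1. a fixed input sample
        x_I = rng.multivariate_normal(np.zeros(Sigma_I.shape[0]), Sigma_I)
        # 2. processing samples conditioned on this input
        x_P = rng.multivariate_normal(mean_PI(x_I), Sigma_P, size=n_sam)
        # 3. one output sample per processing sample
        x_O = rng.normal(np.array([mean_OP(xp) for xp in x_P]), np.sqrt(Sigma_O))
        # 4. per-input conditional entropy h_{O|I}(x_I), in bits
        h_cond.append(differential_entropy(x_O, base=2))
        outputs.append(x_O)
    # 5. average over inputs, then I_IO = H_O - H_{O|I}
    H_O = differential_entropy(np.concatenate(outputs), base=2)
    return H_O - np.mean(h_cond)
```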
Measures of bimodality
We measure the bimodality of the output distribution \({p}_{O}^{{{\rm{st}}}}\) by computing Sarle’s bimodality coefficient, defined as:
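\[
b_{O} = \frac{s^{2}+1}{\kappa + q(n)}\,,
\]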
where n is the number of samples at hand, q(n) = 3(n − 1)²/[(n − 2)(n − 3)], s is the sample skewness, and κ is the excess kurtosis.
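A compact sketch using SciPy's moment estimators (the Fisher convention of kurtosis directly returns the excess kurtosis):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def bimodality_coefficient(samples):
    """Sarle's sample bimodality coefficient b_O = (s**2 + 1) / (kappa + q(n))."""
    x = np.asarray(samples)
    n = len(x)
    q = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))  # finite-sample correction q(n)
    s = skew(x)          # sample skewness
    kappa = kurtosis(x)  # excess kurtosis (Fisher convention)
    return (s ** 2 + 1) / (kappa + q)
```

Values above the uniform-distribution benchmark 5/9 are commonly taken as a hint of bimodality.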
Code availability
The code employed to perform simulations and compute the input-output mutual information is available on Zenodo (https://doi.org/10.5281/zenodo.15324149).
References
Tkačik, G. & Bialek, W. Information processing in living systems. Annu. Rev. Condens. Matter Phys. 7, 89–117 (2016).
Barkai, N. & Leibler, S. Robustness in simple biochemical networks. Nature 387, 913–917 (1997).
Aoki, S. K. et al. A universal biomolecular integral feedback controller for robust perfect adaptation. Nature 570, 533–537 (2019).
Flatt, S., Busiello, D. M., Zamuner, S. & De Los Rios, P. Abc transporters are billion-year-old Maxwell demons. Commun. Phys. 6, 205 (2023).
Mochizuki, A. An analytical study of the number of steady states in gene regulatory networks. J. Theor. Biol. 236, 291–310 (2005).
Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
Klosin, A. et al. Phase separation provides a mechanism to reduce noise in cells. Science 367, 464–468 (2020).
Vyas, S., Golub, M. D., Sussillo, D. & Shenoy, K. V. Computation through neural population dynamics. Annu. Rev. Neurosci. 43, 249–275 (2020).
Dubreuil, A., Valente, A., Beiran, M., Mastrogiuseppe, F. & Ostojic, S. The role of population structure in computations through neural dynamics. Nat. Neurosci. 25, 783–794 (2022).
Jordan, J. D., Landau, E. M. & Iyengar, R. Signaling networks: the origins of cellular multitasking. Cell 103, 193–200 (2000).
Cheong, R., Rhee, A., Wang, C. J., Nemenman, I. & Levchenko, A. Information transduction capacity of noisy biochemical signaling networks. Science 334, 354–358 (2011).
Szandała, T. Review and comparison of commonly used activation functions for deep neural networks. Bio-Inspired Neurocomputing 203–224 (Springer, 2021).
Karlik, B. & Olgac, A. V. Performance analysis of various activation functions in generalized MLP architectures of neural networks. Int. J. Artif. Intell. Expert Syst. 1, 111–122 (2011).
Apicella, A., Donnarumma, F., Isgrò, F. & Prevete, R. A survey on modern trainable activation functions. Neural Netw. 138, 14–32 (2021).
Nwankpa, C., Ijomah, W., Gachagan, A. & Marshall, S. Activation functions: comparison of trends in practice and research for deep learning. Preprint at https://doi.org/10.48550/arXiv.1811.03378 (2018).
Dack, A., Qureshi, B., Ouldridge, T. E. & Plesa, T. Recurrent neural chemical reaction networks that approximate arbitrary dynamics. Preprint at https://doi.org/10.48550/arXiv.2406.03456 (2024).
Hayou, S., Doucet, A. & Rousseau, J. On the impact of the activation function on deep neural networks training. In Chaudhuri, K. & Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, vol. 97 of Proceedings of Machine Learning Research, 2672–2680 (PMLR, 2019). https://proceedings.mlr.press/v97/hayou19a.html
Floyd, C., Dinner, A. R., Murugan, A. & Vaikuntanathan, S. Limits on the computational expressivity of non-equilibrium biophysical processes. Nat. Commun. 16, 7184 (2025).
Barzon, G., Busiello, D. M. & Nicoletti, G. Excitation-inhibition balance controls information encoding in neural populations. Phys. Rev. Lett. 134, 068403 (2025).
Cavanagh, S. E., Hunt, L. T. & Kennerley, S. W. A diversity of intrinsic timescales underlie neural computations. Front. Neural Circuits 14, 615626 (2020).
Golesorkhi, M. et al. The brain and its time: intrinsic neural timescales are key for input processing. Commun. Biol. 4, 970 (2021).
Mariani, B. et al. Disentangling the critical signatures of neural activity. Sci. Rep. 12, 10770 (2022).
Parrondo, J. M., Horowitz, J. M. & Sagawa, T. Thermodynamics of information. Nat. Phys. 11, 131–139 (2015).
Nicoletti, G. & Busiello, D. M. Mutual information disentangles interactions from changing environments. Phys. Rev. Lett. 127, 228301 (2021).
Graf, I. R. & Machta, B. B. A bifurcation integrates information from many noisy ion channels and allows for millikelvin thermal sensitivity in the snake pit organ. Proc. Natl Acad. Sci. 121, e2308215121 (2024).
Mattingly, H. H., Kamino, K., Machta, B. B. & Emonet, T. Escherichia coli chemotaxis is information limited. Nat. Phys. 17, 1426–1431 (2021).
Bauer, M. & Bialek, W. Information bottleneck in molecular sensing. PRX Life 1, 023005 (2023).
Nicoletti, G. & Busiello, D. M. Information propagation in multilayer systems with higher-order interactions across timescales. Phys. Rev. X 14, 021007 (2024).
Nicoletti, G. & Busiello, D. M. Information propagation in Gaussian processes on multilayer networks. J. Phys. Complex. 5, 045004 (2024).
Tostevin, F. & Ten Wolde, P. R. Mutual information between input and output trajectories of biochemical networks. Phys. Rev. Lett. 102, 218101 (2009).
Nicoletti, G. & Busiello, D. M. Tuning transduction from hidden observables to optimize information harvesting. Phys. Rev. Lett. 133, 158401 (2024).
Pham, T. M. & Kaneko, K. Dynamical theory for adaptive systems. J. Stat. Mech. Theory Exp. 2024, 113501 (2024).
Moran, J. & Tikhonov, M. Defining coarse-grainability in a model of structured microbial ecosystems. Phys. Rev. X 12, 021038 (2022).
Herron, L., Sartori, P. & Xue, B. Robust retrieval of dynamic sequences through interaction modulation. PRX Life 1, 023012 (2023).
Breuer, D., Timme, M. & Memmesheimer, R.-M. Statistical physics of neural systems with nonadditive dendritic coupling. Phys. Rev. X 4, 011053 (2014).
Clark, D. G. & Abbott, L. Theory of coupled neuronal-synaptic dynamics. Phys. Rev. X 14, 021001 (2024).
Maheswaranathan, N., Williams, A., Golub, M., Ganguli, S. & Sussillo, D. Universality and individuality in neural dynamics across large populations of recurrent networks. Adv. Neural Inf. Process. Syst. 32 (2019).
Driscoll, L. N., Shenoy, K. & Sussillo, D. Flexible multitask computation in recurrent networks utilizes shared dynamical motifs. Nat. Neurosci. 27, 1349–1363 (2024).
Kadmon, J. & Sompolinsky, H. Transition to chaos in random neuronal networks. Phys. Rev. X 5, 041030 (2015).
Engelken, R., Wolf, F. & Abbott, L. F. Lyapunov spectra of chaotic recurrent neural networks. Phys. Rev. Res. 5, 043044 (2023).
Sanzeni, A., Histed, M. H. & Brunel, N. Response nonlinearities in networks of spiking neurons. PLoS Comput. Biol. 16, e1008165 (2020).
Hennequin, G., Ahmadian, Y., Rubin, D. B., Lengyel, M. & Miller, K. D. The dynamical regime of sensory cortex: stable dynamics around a single stimulus-tuned attractor account for patterns of noise variability. Neuron 98, 846–860 (2018).
Beiran, M. & Ostojic, S. Contrasting the effects of adaptation and synaptic filtering on the timescales of dynamics in recurrent networks. PLoS Comput. Biol. 15, e1006893 (2019).
Muscinelli, S. P., Gerstner, W. & Schwalger, T. How single neuron properties shape chaotic dynamics and signal transmission in random neural networks. PLoS Comput. Biol. 15, e1007122 (2019).
Hadjiabadi, D. et al. Maximally selective single-cell target for circuit control in epilepsy models. Neuron 109, 2556–2572 (2021).
Rajan, K., Harvey, C. D. & Tank, D. W. Recurrent network models of sequence generation and memory. Neuron 90, 128–142 (2016).
Inoue, M. & Kaneko, K. Entangled gene regulatory networks with cooperative expression endow robust adaptive responses to unforeseen environmental changes. Phys. Rev. Res. 3, 033183 (2021).
Matsushita, Y. & Kaneko, K. Homeorhesis in Waddington’s landscape by epigenetic feedback regulation. Phys. Rev. Res. 2, 023083 (2020).
Coussement, L. et al. A transcriptional clock of the human pluripotency transition. Preprint at bioRxiv (2025).
Miyamoto, T., Furusawa, C. & Kaneko, K. Pluripotency, differentiation, and reprogramming: a gene expression dynamics model with epigenetic feedback regulation. PLoS Comput. Biol. 11, e1004476 (2015).
Yan, J. et al. Kinetic uncertainty relations for the control of stochastic reaction networks. Phys. Rev. Lett. 123, 108101 (2019).
De Los Rios, P. & Barducci, A. Hsp70 chaperones are non-equilibrium machines that achieve ultra-affinity by energy consumption. Elife 3, e02218 (2014).
De Domenico, M. et al. Mathematical formulation of multilayer networks. Phys. Rev. X 3, 041022 (2013).
Ghavasieh, A., Nicolini, C. & De Domenico, M. Statistical physics of complex information dynamics. Phys. Rev. E 102, 052304 (2020).
Ma, W., Trusina, A., El-Samad, H., Lim, W. A. & Tang, C. Defining network topologies that can achieve biochemical adaptation. Cell 138, 760–773 (2009).
Rahi, S. J. et al. Oscillatory stimuli differentiate adapting circuit topologies. Nat. methods 14, 1010–1016 (2017).
Yi, T.-M., Huang, Y., Simon, M. I. & Doyle, J. Robust perfect adaptation in bacterial chemotaxis through integral feedback control. Proc. Natl Acad. Sci. 97, 4649–4653 (2000).
Cover, T. M. Elements of Information Theory (John Wiley & Sons, 1999).
Sompolinsky, H., Crisanti, A. & Sommers, H.-J. Chaos in random neural networks. Phys. Rev. Lett. 61, 259 (1988).
Lukoševičius, M. & Jaeger, H. Reservoir computing approaches to recurrent neural network training. Comput. Sci. Rev. 3, 127–149 (2009).
Tanaka, G. et al. Exploiting heterogeneous units for reservoir computing with simple architecture. In Proc. 23rd International Conference on Neural Information Processing, ICONIP 2016, Kyoto, Japan, 16–21 October 2016, Proceedings, Part I 23, 187–194 (Springer, 2016).
Malik, Z. K., Hussain, A. & Wu, Q. J. Multilayered echo state machine: a novel architecture and algorithm. IEEE Trans. Cybern. 47, 946–959 (2016).
Risken, H. Fokker-Planck Equation (Springer, 1996).
Tubiana, J. & Monasson, R. Emergence of compositional representations in restricted Boltzmann machines. Phys. Rev. Lett. 118, 138301 (2017).
Barbier, J., Krzakala, F., Macris, N., Miolane, L. & Zdeborová, L. Optimal errors and phase transitions in high-dimensional generalized linear models. Proc. Natl Acad. Sci. 116, 5451–5460 (2019).
Mézard, M. Mean-field message-passing equations in the Hopfield model and its generalizations. Phys. Rev. E 95, 022117 (2017).
May, R. M. Will a large complex system be stable? Nature 238, 413–414 (1972).
Vasicek, O. A test for normality based on sample entropy. J. R. Stat. Soc. Ser. B Stat. Methodol. 38, 54–59 (1976).
Ji, P. et al. Signal propagation in complex networks. Phys. Rep. 1017, 1–96 (2023).
Capraro, V., Di Paolo, R., Perc, M. & Pizziol, V. Language-based game theory in the age of artificial intelligence. J. R. Soc. Interface 21, 20230720 (2024).
Acknowledgements
G.N. acknowledges funding provided by the Swiss National Science Foundation through its Grant CRSII5_186422. The authors acknowledge the support of the Munich Institute for Astro-, Particle and BioPhysics (MIAPbP), funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC-2094 - 390783311, where this work was first conceived during the MOLINFO workshop. D.M.B. is funded by the program STARS@UNIPD with the project “ActiveInfo.”
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
G.N. and D.M.B. designed the study, performed calculations and numerical simulations, interpreted the results, and wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Physics thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Nicoletti, G., Busiello, D.M. Fast nonlinear integration drives accurate encoding of input information in large multiscale systems. Commun Phys 8, 437 (2025). https://doi.org/10.1038/s42005-025-02339-z