Introduction

The relationship between structure and function is fundamental in complex systems1,2,3, and considerable effort has been invested in developing network models to better understand it. In particular, models of dynamics on networks4,5,6,7 have been proposed to assess the influence of network structure on the temporal evolution of the activity in the system. In turn, data-driven models8,9, dimension-reduction techniques10,11,12,13,14, and mean-field frameworks15,16,17,18,19 have deepened our predictive capabilities. Among other things, these theoretical approaches have shed light on the relationship between the criticality of the dynamics and many network properties such as the degree distribution15,17, the eigenvalue spectrum20,21,22, and the community structure23,24. Fundamentally, these contributions justify our inclination for measuring and using real-world networks as a proxy to predict the behavior of complex systems.

Models of dynamics on networks have also been used as reverse engineering tools for network reconstruction25, when the networks of interactions are unavailable, noisy26,27,28 or faulty29. The network reconstruction problem has stimulated many technical contributions30: Thresholding matrices built from correlation31 or other more sophisticated measures32,33 of time series, Bayesian inference of graphical models34,35,36,37,38,39, and models of dynamics on networks40, among others. These techniques are widely used (e.g., in neuroscience41,42,43, genetics44, epidemiology40,45, and finance46) to reconstruct interaction networks on which network science tools can then be applied.

Interestingly, dynamics prediction and network reconstruction are usually considered separately, even though they are related to one another. The emergent field of network neuroscience47,48 is perhaps the one that most actively uses both notions: Network reconstruction for building brain connectomes from functional time series, then dynamics prediction for inferring various brain disorders from these connectomes49,50. Recent theoretical works have also taken advantage of these notions to suggest that dynamics may hardly depend on the structure. In ref. 51, it was shown that time series generated by a deterministic dynamics evolving on a specific graph can be accurately predicted by a broad range of other graphs. These findings highlight how poor our intuition can be with regard to the relationship between predictability and reconstructability. Furthermore, recent breakthroughs in deep learning on graphs have benefited from proxy network substrates to enhance the predictive power of their models52,53, with applications in epidemiology9 and pharmaceutics54,55. However, the use of graph neural networks and those proxy network substrates is only supported by numerical evidence and lacks a rigorous theoretical justification. As a result, their enhanced predictability remains to be fully corroborated. There is therefore a need for a solid theoretical foundation of reconstructability, predictability, and their relationship in networked systems.

In this work, we establish a rigorous framework that lays such a foundation based on information theory. Information theory has been regularly applied to networks and dynamics in the past. In network science, it has been used to characterize random graph ensembles56,57,58—e.g. the configuration model59,60 and stochastic block models61,62—, to develop network null models63 and to perform community detection64,65. In stochastic dynamical systems, information-theoretical measures have been proposed to quantify their predictability66,67,68,69,70,71,72,73, complexity74,75 and causal emergence76. In statistical mechanics, information transmission has been shown to reach a maximum value near the critical point of spin systems in equilibrium77,78.

Our objective is to combine these ideas into a single framework, motivated by recent works involving spin dynamics on lattices79,80 and deterministic dynamics51. Our contributions are fourfold. First, we use mutual information between structure and dynamics as a foundation for our general framework to quantify the structure-function relationship in complex systems. Second, this codependence naturally leads to the definition of measures of predictability and reconstructability. Doing so allows us to conceptually unify prediction and reconstruction problems, i.e., two classes of problems that are usually treated separately. Third, we design efficient numerical techniques for evaluating these measures on large systems. Finally, we identify a new phenomenon—a duality—where our prediction and reconstruction capabilities can vary in opposite directions. These findings further our understanding of the complexity of modeling networked complex systems, such as the brain, where both prediction and reconstruction techniques play critical roles.

Results

Information theory of dynamics on random graphs

Let us consider a random graph G whose support, \({\mathcal{G}}\), is the set of all graphs of N vertices, each of which has a nonzero probability P(G = g) with \(g\in {\mathcal{G}}\). In our framework, P(G) can be any graph distribution and reflects, from a Bayesian perspective, our prior knowledge of the structure of the system. We also consider a general discrete-time stochastic process (also called dynamics hereafter) with T time steps evolving on a realization of G and representing the possible states of the system. More precisely, we denote by P(X | G) the probability of a random and discrete-state time series \({\bf{X}}={({X}_{i,t})}_{i,t}\) conditioned on G, where \({X}_{i,t}\) is the random state, with discrete support Ω, of vertex i ∈ {1,...,N} at time t ∈ {1,...,T}. We stress that X is at this point any stochastic process, be it Markovian or not. The initial condition of the process is \({{\boldsymbol{X}}}_{1}={({X}_{i,1})}_{i}\). While we have only presented our framework in terms of discrete-time and discrete-state processes, it can be used for continuous-state deterministic dynamics (see Supplementary Note III) and, in principle, it can also be generalized to continuous-state stochastic processes by considering a probability density function ρ(X | G).

The variables X and G form themselves a Bayesian network G → X, where the arrow indicates conditional dependence81. From this model, we are interested in the mutual information between X and G—denoted I(X; G)—which is a symmetric measure that quantifies the codependence between the dynamics X and the structure G82, where I(X; G) = 0 when they are independent. It is equivalently given by

$$I({{{{{{{\bf{X}}}}}}}};G)=H({{{{{{{\bf{X}}}}}}}})-H({{{{{{{\bf{X}}}}}}}}| G)$$
(1a)
$$=H(G)-H(G| {{{{{{{\bf{X}}}}}}}})\,,$$
(1b)

where \(H(G)=-\left\langle \log P(G)\right\rangle\) and \(H({\bf{X}})=-\left\langle \log P({\bf{X}})\right\rangle\) are respectively the marginal entropies of G and X, and \(H(G| {\bf{X}})=-\left\langle \log P(G| {\bf{X}})\right\rangle\) and \(H({\bf{X}}| G)=-\left\langle \log P({\bf{X}}| G)\right\rangle\) are their corresponding conditional entropies. In the previous equations, the marginal distribution for X, the evidence, is defined as \(P({\bf{X}})={\sum }_{g\in {\mathcal{G}}}P(G=g)P({\bf{X}}| G=g)\), and the posterior is obtained from Bayes’ theorem as P(G | X) = P(G)P(X | G)/P(X), using the given graph prior P(G) and the dynamics likelihood P(X | G). I(X; G) is a non-negative measure bounded by \(0\le I({\bf{X}};G)\le \min \left\{H(G),H({\bf{X}})\right\}\). Figure 1a provides an illustration of Eq. (1) in terms of information diagrams.
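For a system small enough to enumerate, the quantities in Eq. (1) can be evaluated directly from the prior and the likelihood. The following is a minimal sketch under that assumption; the function names `entropy` and `information_measures`, and the convention of storing P(X | G) as a matrix with one row per graph, are illustrative choices and not part of the original framework.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def information_measures(prior, likelihood):
    """Entropies and mutual information for an enumerable (G, X) pair.

    prior[g]         = P(G = g)
    likelihood[g, x] = P(X = x | G = g)   (assumed layout: one row per graph)
    """
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    joint = prior[:, None] * likelihood      # P(G = g, X = x)
    evidence = joint.sum(axis=0)             # P(X = x), the marginal over graphs
    h_g = entropy(prior)
    h_x = entropy(evidence)
    h_joint = entropy(joint)
    h_x_given_g = h_joint - h_g              # H(X | G)
    h_g_given_x = h_joint - h_x              # H(G | X)
    return {
        "H(G)": h_g,
        "H(X)": h_x,
        "H(X|G)": h_x_given_g,
        "H(G|X)": h_g_given_x,
        "I(X;G)": h_x - h_x_given_g,         # Eq. (1a)
    }
```

Both forms of Eq. (1) give the same value by construction, which provides a quick sanity check: `H(X) - H(X|G)` and `H(G) - H(G|X)` should agree up to floating-point error.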

Fig. 1: Information diagram of dynamics on random graphs.

a Areas represent amounts of information: The entropies related to G are shown on the left in blue and those related to X are on the right in orange. Mutual information (the red intersection of X and G) corresponds to the information shared by both G and X. b The highly predictable / weakly reconstructable scenario, where H(G) ≫ H(X), meaning that I(X; G) contains most of the information related to the dynamics, but only a small fraction of the information related to the graph. c The reverse scenario, i.e., highly reconstructable / weakly predictable, where H(X) ≫ H(G), meaning that I(X; G) contains most of the information related to the graph, but only a small fraction of the information related to the dynamics.

The measures presented in Eq. (1) and above can all be interpreted in the context of information theory. Information is generally measured in bits which in turn is interpreted as a minimal number of binary—i.e., yes/no—questions needed to convey it. While entropy measures the uncertainty of random variables like X and G, i.e., the minimal number of bits of information needed to determine their value, mutual information represents the reduction in uncertainty about one variable when the other is known. The fact that it is symmetric means that this reduction goes both ways: The reduction in the dynamics uncertainty when the structure is known is equal to that of the structure when the dynamics is known. Hence, mutual information measures the amount of information shared by both X and G.

As an illustration, let us consider the physical example of a spin system that depends on G through a coupling parameter J ≥ 0, where the spins are more (large J) or less (small J) likely to align with their first neighbors in G. At J = 0, the spins are completely uncorrelated and flip with probability \(\frac{1}{2}\). In this case, H(X | G) = NT bits, corresponding to the maximum entropy of X: We need precisely one binary question for each spin at each time for a given structure G (e.g., is the spin of vertex i at time t up?). When J > 0, correlation is introduced between connected spins. As a result, a single question about the spin of vertex i at time t can provide additional information about the spins of other vertices at other times, and thus, H(X | G) < NT. The interpretation of H(X) is analogous to that of H(X | G), as it measures the number of binary questions needed to determine X when the graph is unknown. From this perspective, the mutual information I(X; G), as expressed by the difference between H(X) and H(X | G), is the reduction in the number of questions needed to predict X ensuing from the knowledge of G. Hence, I(X; G) measures the extent to which the knowledge of the graph G improves our ability to forecast X, i.e., its temporal predictability.

Similar observations can be made from the structural perspective. Suppose that X is the spin dynamics mentioned previously and G is a random graph, where each edge exists independently with probability p. This yields \(H(G)=-\left(\begin{array}{l}N\\ 2\end{array}\right)[p\log p+(1-p)\log (1-p)]\), where \(\left(\begin{array}{l}N\\ 2\end{array}\right)\) is the total number of possible undirected edges. When \(p=\frac{1}{2}\), we have \(H(G)=\left(\begin{array}{l}N\\ 2\end{array}\right)\) bits, which is again the maximum entropy of G. We therefore need precisely one binary question for each of the \(\left(\begin{array}{l}N\\ 2\end{array}\right)\) edges in the graph (e.g., is there an edge between i and j?) to completely determine its state. When the dynamics X is known, H(G | X) is interpreted similarly to H(G), but also takes into account the observation of the spins X, which introduces correlation between the edges of G. As a result, each bit can provide information about more than one edge, even in the case \(p=\frac{1}{2}\) where we a priori need one bit per possible edge to fully reconstruct G. Consequently, the knowledge of X reduces uncertainty about G (i.e., H(G | X) ≤ H(G), see ref. 82, Theorem 2.6.5), and therefore allows for its reconstruction; I(X; G) thus measures the reconstructability of G, i.e., the extent to which information about G can be revealed from X.
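The entropy of the independent-edge random graph described above is simple to evaluate numerically. The sketch below assumes logarithms in base 2 (entropy in bits); the helper names `binary_entropy` and `gnp_entropy` are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def gnp_entropy(n, p):
    """H(G) in bits when each of the C(n, 2) possible edges exists
    independently with probability p."""
    return math.comb(n, 2) * binary_entropy(p)

# With p = 1/2, every possible edge costs exactly one bit.
print(gnp_entropy(100, 0.5))  # 4950.0
```

Any other value of p yields fewer bits than the number of possible edges, since the binary entropy is maximal at p = 1/2.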

In practice, I(X; G) can be used to explain the performance of both prediction and reconstruction algorithms (see “Performance of prediction and reconstruction algorithms” section for further detail). From the prediction perspective, it quantifies the sensitivity of the time series to the structure of interactions prescribed by G, i.e., the gain in predictability of including G for the extrapolation of X. This can be measured by comparing the true transition probabilities of the process X, as given by the conditional model P(X | G), with those predicted by models that do not include G in their predictions. This experiment was performed in ref. 51 for deterministic dynamics on graphs, to show that high prediction accuracy of time series can sometimes be achieved without the knowledge of the true graph. In Fig. 2a, we use the mean absolute error (the same measure as in ref. 51) to perform the comparison. We then associate a high predictive capability of the true conditional model with a high error of the graph-independent models. Likewise, I(X; G) provides strong insights into the reconstruction accuracy of algorithms such as the transfer entropy method33 (Fig. 2b). By interpreting the reconstruction problem as a binary classification, we can quantify the reconstruction accuracy with the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. In all cases, I(X; G) peaks in the same coupling interval as the different reconstruction methods, even if the two measures are a priori different.
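To make the binary-classification view of reconstruction concrete, the sketch below scores every vertex pair and computes the ROC AUC against the true adjacency matrix. It is an illustration only: the function names are hypothetical, the absolute Pearson correlation is used as one simple pairwise score in the spirit of correlation-based methods, and the AUC is computed through its rank interpretation (the probability that a true edge receives a higher score than a non-edge) rather than with any particular library routine.

```python
import numpy as np

def correlation_scores(series):
    """Pairwise scores from a time series of shape (N, T): absolute Pearson
    correlation between vertices (constant rows would produce NaN)."""
    return np.abs(np.corrcoef(np.asarray(series, dtype=float)))

def reconstruction_auc(scores, adjacency):
    """ROC AUC of edge scores against a symmetric 0/1 adjacency matrix.

    Only vertex pairs i < j are used. Ties between an edge and a non-edge
    count for one half.
    """
    scores = np.asarray(scores, dtype=float)
    adjacency = np.asarray(adjacency)
    upper = np.triu_indices(adjacency.shape[0], k=1)
    s = scores[upper]
    is_edge = adjacency[upper].astype(bool)
    pos, neg = s[is_edge], s[~is_edge]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one edge and one non-edge")
    diff = pos[:, None] - neg[None, :]       # every (edge, non-edge) pair
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))
```

An AUC of 1 means that every true edge outranks every non-edge, while 0.5 corresponds to scores that carry no information about the graph.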

Fig. 2: Comparison between the mutual information and algorithm performance measures.

a Prediction algorithms and b reconstruction algorithms. This comparison is performed with time series of length T = 100 generated with the Glauber dynamics evolving on Erdős-Rényi graphs with N = 100 nodes and M = 250 edges, for different coupling constants J. a The mean absolute error between the true transition probabilities used in P(X | G) and the ones predicted by different graph-independent models: a logistic regression (green diamonds) and a multilayer perceptron (MLP, purple triangles). b The average area under the curve (AUC) of the receiver operating characteristic (ROC) curve for different reconstruction algorithms: the correlation matrix method31 (light blue squares), the Granger causality method32 (green diamonds), and the transfer entropy method33 (purple triangles). In both panels, we use two axes to represent (left axis) I(X; G), denoted by the gray area bounded by the two biased estimators (red lines, see “Estimators of the mutual information” section), and (right axis) the performance measures; the maximum of I(X; G) is shown with the horizontal dashed line. See “Performance of prediction and reconstruction algorithms” section for further details.

The mutual information I(X; G) is, therefore, both a measure of predictability and reconstructability, thereby unifying these two concepts. We say that a system is perfectly predictable when the mutual information contains all the information about X, that is, when I(X; G) = H(X) (see Fig. 1b). Likewise, we say that it is perfectly reconstructable when I(X; G) = H(G) (see Fig. 1c). Consequently, whenever I(X; G) > 0, we expect the system to be predictable and reconstructable to a certain degree. Otherwise, when I(X; G) = 0, the system is said to be both unpredictable and unreconstructable. Yet, I(X; G) by itself is hardly comparable from one system to another. Indeed, a specific value of I(X; G) may correspond to opposing scenarios when it comes to predictability and reconstructability, as shown in Fig. 1b, c. Thus, it is more convenient to use normalized quantities such as the uncertainty coefficients

$$U({{{{{{{\bf{X}}}}}}}}\,| \,G)=\frac{I({{{{{{{\bf{X}}}}}}}};G)}{H({{{{{{{\bf{X}}}}}}}})}\,,$$
(2a)
$$U(G\,| \,{{{{{{{\bf{X}}}}}}}})=\frac{I({{{{{{{\bf{X}}}}}}}};G)}{H(G)}\,,$$
(2b)

which are bounded between 0 and 1. Contrary to I(X; G), U(X | G) and U(G | X) represent relative amounts of information. For instance, U(G | X) = 1 implies that I(X; G) = H(G), which in principle means that perfect reconstruction can be achieved as all the information of G is contained in X. Likewise, U(X | G) = 1 means that I(X; G) = H(X), which indicates that all the information in X is determined by G: a perfectly accurate prediction of X can be made with G alone. This maximum value is guaranteed when X is deterministic and there is only one initial condition (see Supplementary Note III). Having I(X; G) = 0 implies that U(X | G) = U(G | X) = 0, which again means that G and X are independent. Intermediate values of U(X | G) and U(G | X) represent different degrees of predictability and reconstructability, respectively.

The “Simple example” section will present simple concrete examples to provide a better intuition about these concepts. Before we get to these examples, we investigate the influence of the knowledge of the past of X over the relationship between its future and its structure, as measured through reconstructability and predictability.

Past-dependent mutual information

It is often the case that predictability measures the sensitivity to the initial conditions of a process X. For instance, refs. 66,68,83,84 used different versions of the mutual information between X1 and X as a direct measure of predictability. Then, a system is more predictable if the past allows it to better predict the future. In this spirit, we generalize our framework in such a way that the mutual information between the process X and its structure G includes some information about the past of X.

We define Xpast as the past of X and \({{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{future}}}}}}}}}\) as its future, such that \({{{{{{{\bf{X}}}}}}}}=({{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{past}}}}}}}}},{{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{future}}}}}}}}})\), see Fig. 3a. We define τ as the length of Xpast and T − τ as the length of the future \({{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{future}}}}}}}}}\). Our measure of interest in this case is \(I({{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{future}}}}}}}}};G| {{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{past}}}}}}}}})\), and it is equal to

$$I({{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{future}}}}}}}}};G| {{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{past}}}}}}}}})=I({{{{{{{\bf{X}}}}}}}};G)-I({{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{past}}}}}}}}};G)\,,$$
(3)

which is a conditional mutual information (the green intersection in Fig. 3a). In turn, a small τ attributes less information to the observed past, which leads to a scenario increasingly similar to that presented in the “Information theory of dynamics on random graphs” section, as shown in Fig. 3b. As τ gets larger, more information is attributed to Xpast, resulting in a smaller \(I({{\bf{X}}}_{{\rm{future}}};G| {{\bf{X}}}_{{\rm{past}}})\), even though the total mutual information I(X; G) (the union of the red, green, and gray sets) is large (see Fig. 3c). Similarly to the “Information theory of dynamics on random graphs” section, we then define the partial uncertainty coefficients, bounded between 0 and 1:

$$U({{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{future}}}}}}}}}| G;{{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{past}}}}}}}}})=\frac{I({{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{future}}}}}}}}};G| {{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{past}}}}}}}}})}{H({{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{future}}}}}}}}}| {{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{past}}}}}}}}})}\,,$$
(4a)
$$U(G| {{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{future}}}}}}}}};{{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{past}}}}}}}}})=\frac{I({{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{future}}}}}}}}};G| {{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{past}}}}}}}}})}{H(G| {{{{{{{{\bf{X}}}}}}}}}_{{{{{{{{\rm{past}}}}}}}}})}\,,$$
(4b)

measuring the partial predictability of \({{\bf{X}}}_{{\rm{future}}}\) from G and partial reconstructability of G given \({{\bf{X}}}_{{\rm{future}}}\), respectively. The above quantities can be expressed in terms of previously introduced ones. For instance, in Eq. (3), I(X; G) and I(Xpast; G) can be expressed using Eq. (1). Likewise, the normalizing factor \(H({{\bf{X}}}_{{\rm{future}}}| {{\bf{X}}}_{{\rm{past}}})\) is expressed in terms of state entropies using the joint entropy \(H({\bf{X}})=H({{\bf{X}}}_{{\rm{past}}},{{\bf{X}}}_{{\rm{future}}})\), i.e., \(H({{\bf{X}}}_{{\rm{future}}}| {{\bf{X}}}_{{\rm{past}}})=H({\bf{X}})-H({{\bf{X}}}_{{\rm{past}}})\). Finally, H(G | Xpast) is evaluated similarly to H(G | X).
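For completeness, Eq. (3) can be read as an instance of the chain rule for mutual information applied to the pair (Xpast, Xfuture). The short derivation below uses only the definitions given above and is included as a reading aid:

$$I({\bf{X}};G)=I({{\bf{X}}}_{{\rm{past}}},{{\bf{X}}}_{{\rm{future}}};G)=I({{\bf{X}}}_{{\rm{past}}};G)+I({{\bf{X}}}_{{\rm{future}}};G| {{\bf{X}}}_{{\rm{past}}})\,.$$

Subtracting I(Xpast; G) from both sides gives Eq. (3). Since I(Xpast; G) ≥ 0, it also follows that the past-dependent mutual information can never exceed the total mutual information:

$$0\le I({{\bf{X}}}_{{\rm{future}}};G| {{\bf{X}}}_{{\rm{past}}})\le I({\bf{X}};G)\,.$$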

Fig. 3: Information diagrams for the past-dependent information measures.

In a, we show the information diagram of the random variable triplet \(({{\bf{X}}}_{{\rm{past}}},{{\bf{X}}}_{{\rm{future}}},G)\), where Xpast represents the past states, \({{\bf{X}}}_{{\rm{future}}}\), the future states, and G, the structure of the system. The quantities of interest are \(I({{\bf{X}}}_{{\rm{future}}};G| {{\bf{X}}}_{{\rm{past}}})\) indicated by the green set, \(H({{\bf{X}}}_{{\rm{future}}}| {{\bf{X}}}_{{\rm{past}}})\) shown by the union of the pink and green sets, and H(G | Xpast), represented by the union of the blue and green sets. b, c Two extreme scenarios where the length of the past τ is small and large, which illustrates how the different information measures change with τ.

Whereas the interpretation of the partial uncertainty coefficients is analogous to those presented in the previous section, they nevertheless measure conceptually different quantities. Indeed, by using \(I({{\bf{X}}}_{{\rm{future}}};G| {{\bf{X}}}_{{\rm{past}}})\), it is implied that the information about the past has been removed from the total mutual information between X and G. As a result, the partial predictability \(U({{\bf{X}}}_{{\rm{future}}}| G;{{\bf{X}}}_{{\rm{past}}})\) measures the gain in predictability over \({{\bf{X}}}_{{\rm{future}}}\) when including G in the prediction, compared to a model which only uses Xpast. Additionally, the removed information likely includes some information about G, since I(Xpast; G) ≥ 0. Hence, the partial reconstructability, as defined by \(U(G| {{\bf{X}}}_{{\rm{future}}};{{\bf{X}}}_{{\rm{past}}})\), measures the reconstructability of the remaining information about G when observing \({{\bf{X}}}_{{\rm{future}}}\), i.e., information which has not been unveiled from the observation of Xpast.

In essence, for some ξ > 0, the case τ = 1 with T = ξ + 1 is similar to the case τ = T − ξ with T > ξ since \({{\bf{X}}}_{{\rm{future}}}\) has the same length ξ in both cases. From a reconstruction perspective, both quantify the reconstructability of G from a process with ξ time steps. However, the reconstructed information is quite different in the two cases: with τ = 1 and T = ξ + 1, no prior information is given (assuming that the initial conditions X1 are independent of G), while a lot of information has already been processed when τ = T − ξ. Furthermore, increasing τ draws our attention away from the actual relationship between X and G of interest, since this relationship should exclude all information about Xpast. For this reason, we will mostly focus on the case τ = 1 in the remainder of the paper.

Simple example

The interpretation of reconstructability and predictability in terms of U(G | X) and U(X | G) can be grasped more firmly through an elementary example. We consider a system where only two graphs are possible, namely g1 and g2, such that P(g1) = p and P(g2) = 1 − p (see Fig. 4a). The entropy of this graph is therefore \(H(G)={\mathcal{H}}(p)\), where \({\mathcal{H}}(p)=-p\log p-(1-p)\log (1-p)\) is the binary entropy. These two graphs can generate together three outcomes for X, i.e., x1, x2, or x3. The graph g1 generates x1 and x2 with probabilities r and 1 − r respectively. Likewise, g2 generates x2 and x3 with respective probabilities s and 1 − s. As we can see, x1 can only be generated by g1 and x3 can only be the outcome of g2, while x2 can be generated by both graphs.

Fig. 4: Example with two graphs and three time series.

a The stochastic system consists of two possible graphs, \(g_{1}\) and \(g_{2}\), and three possible time series, \({{\bf{x}}}_{1},\, {{\bf{x}}}_{2}\), and \({{\bf{x}}}_{3}\). b, c The reconstructability \(U(G|{{{\bf{X}}}})\) (solid blue line) and the predictability \(U({{{\bf{X}}}}|G)\) (dashed orange line) are shown for \(s=1\) (b) and \(s=\frac{1}{2}\) (c). The shaded area in (c) indicates the region where \(U({{\bf{X}}}|G) \, {{{\rm{and}}}} \, U(G|{{{\bf{X}}}})\) vary in opposite directions.

We now focus on the scenario with s = 0 (the general expressions for I(X; G) and the other entropies are obtained in Section II of the Supplementary Information). In this case, only g1 can generate x1 and x2, while g2 can only generate x3. Therefore, we have perfect reconstructability of either graph, meaning U(G | X) = 1 for any p and r, since the outcome of X tells us immediately which graph generated it. However, X is imperfectly predictable from G since \(U({\bf{X}}| G)=\frac{{\mathcal{H}}(p)}{{\mathcal{H}}(p)+p{\mathcal{H}}(r)} \, < \, 1\) when 0 < p, r < 1, even though x3 can be perfectly predicted from g2 as it is its only possible outcome. The remaining entropy, i.e., the second term of the denominator, \(p{\mathcal{H}}(r)\), corresponds to the uncertainty related to whether g1 generates x1 or x2.

When s = 1, the system is both partially predictable and partially reconstructable, with \(U({\bf{X}}\,| \,G)=1-\frac{p{\mathcal{H}}(r)}{{\mathcal{H}}(pr)} < 1\) and \(U(G\,| \,{\bf{X}})=\frac{{\mathcal{H}}(pr)-p{\mathcal{H}}(r)}{{\mathcal{H}}(p)} < 1\) for all 0 < p, r < 1. Both g1 and g2 can generate x2, but the probability that g1 generates x2 decreases with r. This results in a gradual increase of predictability and reconstructability as r approaches 1, where the system tends to a one-to-one mapping between the outcomes of G and X.

The intermediate cases when 0 < s < 1 are also interesting because they give rise to an interval in r where the system becomes less predictable but more reconstructable as r increases, as highlighted by the gray area in Fig. 4c. This happens because, as r increases for a fixed s, the growth of entropy of X dominates I(X; G), resulting in the dual behavior of U(X | G) and U(G | X).
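The two-graph example is small enough to evaluate exactly, which makes it a convenient check of the closed-form expressions quoted for s = 0 and s = 1. The sketch below builds the 2 × 3 likelihood matrix described in the text and returns both uncertainty coefficients; the function names are hypothetical and entropies are in bits.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def toy_coefficients(p, r, s):
    """Return (U(X|G), U(G|X)) for the two-graph, three-outcome example.

    g1 has probability p and generates x1, x2 with probabilities r, 1 - r;
    g2 has probability 1 - p and generates x2, x3 with probabilities s, 1 - s.
    """
    prior = np.array([p, 1.0 - p])
    likelihood = np.array([[r, 1.0 - r, 0.0],
                           [0.0, s, 1.0 - s]])
    joint = prior[:, None] * likelihood
    h_g = entropy(prior)
    h_x = entropy(joint.sum(axis=0))
    mi = h_g + h_x - entropy(joint)          # I(X; G)
    return mi / h_x, mi / h_g

# Scan r at fixed p and s to look for intervals where the two
# coefficients move in opposite directions (values of p, s are arbitrary).
for r in np.linspace(0.1, 0.9, 5):
    print(round(float(r), 2), toy_coefficients(0.5, r, 0.5))
```

The values of p and s in the scan are arbitrary illustrations and are not taken from Fig. 4.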

θ-duality between predictability and reconstructability

Predictability and reconstructability in dynamics on random graphs offer two perspectives of the same information shared by G and X: they are two sides of the same coin. However, as we have previously seen with simple examples, predictability and reconstructability do not necessarily go hand in hand even though they are related: An increasing U(G | X) according to some parameter θ of the system does not necessarily imply an increase of U(X | G) and vice versa. Furthermore, a high value of U(G | X) is not tied to a high value of U(X | G), and conversely, as illustrated in Fig. 1b, c. Indeed, U(G | X) and U(X | G) can take opposing values, depending on H(G) and H(X), for the same value of I(X; G). This phenomenon can also be observed in the performance of prediction and reconstruction (see Supplementary Note I). In the literature, a hint of the existence of such dual behavior was recently reported in ref. 51 for continuous-state deterministic dynamics. The authors showed that high prediction accuracy can be achieved with graphs reconstructed from the very time series they want to predict, even if they are different from the original graph that generated the time series. This phenomenon can be understood through our framework (see Section III of the Supplementary Information) and we now devote the rest of the section to precisely defining and characterizing the somewhat counterintuitive phenomenon of duality.

We identify a duality when U(X | G) and U(G | X) vary in opposite directions when a parameter, say θ, is changed. More specifically, we say that they are dual with respect to θ, or θ-dual, in an interval Θ if and only if the signs of their derivatives with respect to θ are different for every θ* ∈ Θ:

$${\left[\frac{\partial U(G| {{{{{{{\bf{X}}}}}}}})}{\partial \theta }\frac{\partial U({{{{{{{\bf{X}}}}}}}}| G)}{\partial \theta }\right]}_{\theta={\theta }^{*}} < \,0\,.$$
(5)

This criterion formally relies on the existence of regions Θ where the variations of U(G | X) and U(X | G) with respect to θ are opposite, regardless of their amplitude (see also “Formal definition of θ-duality” section). We use this criterion to relate the existence of extrema of U(G | X) and U(X | G) with that of regions of θ-duality (see Lemma 1 in “Formal definition of θ-duality” section).
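When the two coefficients are only available on a grid of parameter values, the criterion of Eq. (5) can be checked approximately with finite differences. The sketch below is one such numerical reading and not the formal definition: the function name `duality_mask` is hypothetical, derivatives are approximated with `numpy.gradient`, and noisy estimates would require smoothing before the signs can be trusted.

```python
import numpy as np

def duality_mask(theta, u_recon, u_pred):
    """Boolean mask of grid points where the finite-difference derivatives
    of U(G|X) and U(X|G) with respect to theta have opposite signs."""
    theta = np.asarray(theta, dtype=float)
    d_recon = np.gradient(np.asarray(u_recon, dtype=float), theta)
    d_pred = np.gradient(np.asarray(u_pred, dtype=float), theta)
    return d_recon * d_pred < 0

theta = np.linspace(0.0, 1.0, 11)
# Synthetic curves: one increasing, one decreasing, so every point is dual.
print(duality_mask(theta, theta, 1.0 - theta).all())  # True
```

Points where either derivative vanishes (the extrema mentioned above) are excluded by the strict inequality, consistent with Eq. (5).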

With our intuition being established from simple examples and our precise definition, we are finally ready to state one of the main results of the paper. Recalling that T is the length of process X, we prove that reconstructability and predictability are T-dual for a vast class of Markov chains.

Theorem 1

Let \({\bf{X}}=({X}_{1},{X}_{2},\ldots ,{X}_{T})\) be a Markov chain of length T whose transition probabilities are conditional to some discrete random variable G that is independent of T and such that \(H({X}_{t+1}| {X}_{t}) > 0\) for all t ∈ {1,…,T − 1} (i.e., X is non-deterministic). Moreover, suppose that the state spaces of X and G are finite, that X has a finite nonzero entropy rate, and that G has a nonzero entropy. Then there exists a positive constant ϕ such that the uncertainty coefficients U(G | X) and U(X | G) are T-dual for all T ≥ ϕ.

The proof of this theorem is in “Proof of the universality of the T-duality” section. It is a consequence of the fact that the mutual information is strictly increasing with T (and so is U(G | X), since H(G) is independent of T) whenever the entropy rate of X is positive. Because I(X; G) is bounded from above by H(G), U(G | X) and its numerator, I(X; G), saturate at some point in T, while U(X | G) keeps decreasing because its denominator increases in an asymptotically linear manner with T. We refer to this opposing behavior as a duality between U(G | X) and U(X | G) with respect to T, or a T-duality for short (not to be confused with target space duality in string theory85). When the entropy rate is not well-defined, as for non-stationary processes, the universality of the T-duality might not hold, although it remains possible to observe it in localized intervals of T.
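The mechanism behind the T-duality can be observed on a system far simpler than those studied in the paper. The sketch below is a hypothetical toy, not one of the dynamics of Table 1: a single two-state symmetric Markov chain whose probability of keeping its state is selected by a binary variable G (0.9 or 0.5, arbitrary values), with a uniform initial state independent of G. All 2^T sequences are enumerated, so the entropies are exact but T must stay small.

```python
import itertools
import math

def sequence_probability(seq, stay):
    """P(x_1, ..., x_T | g) for a symmetric two-state chain with uniform x_1
    and probability `stay` of keeping the current state at each step."""
    prob = 0.5
    for a, b in zip(seq, seq[1:]):
        prob *= stay if a == b else 1.0 - stay
    return prob

def coefficients(T, stay_probs=(0.9, 0.5), prior=(0.5, 0.5)):
    """Exact (U(G|X), U(X|G)) by enumerating every binary sequence of length T."""
    h_g = -sum(q * math.log2(q) for q in prior if q > 0)
    h_x = 0.0
    h_x_given_g = 0.0
    for seq in itertools.product((0, 1), repeat=T):
        likes = [sequence_probability(seq, s) for s in stay_probs]
        evidence = sum(q * l for q, l in zip(prior, likes))
        if evidence > 0:
            h_x -= evidence * math.log2(evidence)
        h_x_given_g -= sum(q * l * math.log2(l)
                           for q, l in zip(prior, likes) if l > 0)
    mi = h_x - h_x_given_g                   # I(X; G)
    return mi / h_g, mi / h_x

for T in (2, 4, 8):
    u_recon, u_pred = coefficients(T)
    print(T, round(u_recon, 3), round(u_pred, 3))
```

In this toy, U(G | X) can only grow with T and is capped at 1, whereas H(X | G) grows linearly with T, so U(X | G) must eventually decrease. Longer chains would need a cheaper evaluation than brute-force enumeration.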

Figure 5 illustrates the universality of the T-duality using the special case of binary Markov chains (i.e., \({{\Omega }}=\left\{0,1\right\}\), see “Binary Markov chains on graphs” section). These systems are parametrized by their activation (0 → 1) and deactivation (1 → 0) probability functions, denoted α(ni,t, mi,t) and β(ni,t, mi,t), respectively. In general, the activation and deactivation functions depend solely on ni,t and mi,t, i.e., the numbers of active and inactive neighbors of vertex i at time t, respectively. We present multiple examples of binary Markov processes with different origins in Table 1: The Glauber dynamics, the Susceptible-Infectious-Susceptible (SIS) dynamics, and the Cowan dynamics.

Fig. 5: T-duality in binary dynamics evolving on small Erdős-Rényi random graphs.

a, d Glauber dynamics, b, e SIS dynamics, and c, f Cowan dynamics. Each panel shows the reconstructability U(G | X) ∈ [0, 1] (blue) and the predictability coefficient U(X | G) ∈ [0, 1] (orange) as a function of the number of time steps T. We used graphs of N = 5 vertices and E = 5 edges, meaning an average degree of 〈k〉 = 2; we fixed τ = 1 in the top row, and τ = T/2 in the bottom row. Each symbol corresponds to the average value measured over 1000 samples. We also show different values of the coupling parameters using different symbols: a, d \(J \in \left\{\frac{1}{2},1,\,2\right\}\) for Glauber, b, e \(\lambda \in \left\{\frac12,1,\, 2\right\}\) for SIS, and c, f \(\nu \in \left\{\frac12,1,\,2\right\}\) for Cowan.

Table 1 Activation and deactivation probability functions α(n, m) and β(n, m)

The aforementioned Glauber dynamics86, which has been used to describe the time-reversible evolution of magnetic spins aligning in a crystal, has been extensively studied because of its critical behavior and its phase transition. Its stationary distribution is given by the Ising model, which has found many applications in condensed-matter physics87 and statistical machine learning81,88, to name a few. The SIS dynamics is a canonical model in network epidemiology5 often used for modeling influenza-like diseases89, where periods of immunity after recovery are short. In this model, susceptible (or inactive) vertices get infected by each of their infected (active) first neighbors, with a constant transmission probability, and recover from the disease with a constant recovery probability. The simplicity of the SIS model has allowed for deep mathematical analysis of its absorbing-state phase transition15,17,20. Finally, the Cowan dynamics90 has been proposed to model the neuronal activity in the brain. In this model, quiescent neurons fire if their input current, coming from their firing neighbors, is above a given threshold. Its mean-field approximation91 reduces to the Wilson-Cowan dynamics92, one of the most influential models in neuroscience93. For each model, we can identify an inactive state (down, susceptible, or quiescent) and an active one (up, infectious, or firing). The corresponding activation and deactivation probabilities are given in Table 1.

Figure 5 numerically supports Theorem 1 and clearly illustrates the T-duality for each dynamics, with different values of their parameters and different past lengths τ. We used the Erdős-Rényi model as the random graph on which these dynamics evolve. The support \({{{{{{{\mathcal{G}}}}}}}}\) is the set of all simple graphs of N vertices with E edges, and

$$P(G)={\left(\begin{array}{l}\left(\begin{array}{l}N\\ 2\end{array}\right)\\ E\end{array}\right)}^{-1}\,.$$
(6)

Note that, in this example, we consider the well-known Erdős-Rényi model for simplicity (Eq. (6)). Furthermore, we considered very small graphs of size N = 5, because the exact evaluation of I(X; G) becomes computationally intractable for larger graphs. For larger systems, biased estimators can be designed to bound I(X; G), as we show in “Estimators of the mutual information” section. We demonstrate the flexibility of our framework with regard to the random graph models by using more sophisticated and data-driven graph models in the following section.
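To make the size of this support concrete, the following minimal sketch enumerates all simple graphs with N = 5 vertices and E = 5 edges and evaluates the uniform prior of Eq. (6). The function name `enumerate_graphs` and the edge-tuple representation are illustrative assumptions, not part of the original analysis.

```python
from itertools import combinations
from math import comb, log

def enumerate_graphs(n_vertices, n_edges):
    """List every simple graph with n_vertices vertices and n_edges edges.

    Each graph is represented as a tuple of edges (i, j) with i < j.
    """
    vertex_pairs = list(combinations(range(n_vertices), 2))
    return list(combinations(vertex_pairs, n_edges))

N, E = 5, 5                                  # sizes used in Fig. 5
graphs = enumerate_graphs(N, E)
support_size = comb(comb(N, 2), E)           # binomial coefficient in Eq. (6)
prior = 1.0 / support_size                   # P(G), identical for every graph
graph_entropy = log(support_size)            # H(G) in nats for a uniform prior
```

With these values the support contains 252 graphs, which is small enough for the sum over graphs to be carried out explicitly.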

The T-duality persists for the past-dependent measures presented in “Past-dependent mutual information” section, as illustrated by the bottom row of Fig. 5, for τ = T/2. However, note that for sufficiently large τ, the duality seems to disappear. We refer to Section VII of the Supplementary Information for further detail. The T-duality raises the question of which other parameters θ can lead to θ-dualities. Some parameters may control the general behavior of the dynamics, while others may control aspects of the system structure which, in turn, also impact the dynamics. In the next section, we investigate those that are related to critical phenomena in complex systems.

Duality and criticality

Despite their different nature and range of applications, the three models presented in Table 1 share several properties of interest. For instance, each model has a coupling parameter that controls the influence of the state of the first neighbors on the transition probabilities. They also all feature a phase transition in the infinite size limit whose position is determined by the coupling parameter (see Section IX of the Supplementary Information). We now investigate the influence of criticality over the existence of θ-dualities, where θ is a coupling parameter.

For the Glauber dynamics, this parameter is the coupling constant J, which dictates the reduction (increase) in the total energy of a spin configuration when two neighboring spins are parallel (antiparallel). The Glauber dynamics features a continuous phase transition at a critical point Jc between a disordered and an ordered phase: for J < Jc the spins are disordered, resulting in a vanishing magnetization, whereas for J > Jc the magnetization is nonzero.

For the SIS dynamics, it is the transmission rate λ that acts as a coupling parameter. Like the Glauber dynamics, the SIS dynamics possesses a continuous phase transition: when λ < λc, the system reaches an absorbing (or inactive) state from which it cannot escape, whereas when λ > λc, it reaches an active state where a nonzero fraction of the vertices remain active over time. It should be emphasized that, in the version of the SIS dynamics considered here, the system does not strictly reach a true absorbing state, because the allowance for self-infection ϵ enables escape from the completely inactive state. Instead, the system approaches a metastable state with most vertices becoming asymptotically inactive. However, it can be shown that the two phase transitions are quite similar for small ϵ94.

The Cowan dynamics can feature either a continuous or a first-order phase transition between an inactive and an active phase, depending on the value of the slope a. Its coupling parameter is ν, i.e., the potential gain for each firing neighbor. The continuous and first-order phase transitions of the Cowan dynamics are quite different in that the latter is characterized by two thresholds, namely the backward and forward thresholds \(\nu_{\rm c}^{\rm b} < \nu_{\rm c}^{\rm f}\), respectively (see Section IX in the Supplementary Information). Hence, the first-order phase transition of the Cowan dynamics exhibits a bistable region \(\nu \in (\nu_{\rm c}^{\rm b},\, \nu_{\rm c}^{\rm f})\), where both the inactive and active phases are reachable depending on the initial conditions.

To account for the heterogeneous network structure observed in a wide range of complex systems1, we simulate the dynamics on the configuration model, a random graph whose—potentially heterogeneous—degree sequence k is fixed and whose support \({{{{{{{\mathcal{G}}}}}}}}\) corresponds to the set of all loopy multigraphs of degree sequence k. The probability of a multigraph g in this ensemble is

$$P(G=g)=\frac{(2E)!!}{(2E)!}\frac{{\prod }_{i}{k}_{i}!}{{\prod }_{i < j}{M}_{ij}!{\prod }_{i}{M}_{ii}!!}\,,$$
(7)

where Mij counts the number of edges connecting vertices i and j in the multigraph g and 2E = ∑iki is the number of half-edges in g. Like the Erdős-Rényi model, the configuration model fixes the number of edges, but also fixes the degree sequence k = (k1, …, kN).
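A loopy multigraph with a prescribed degree sequence can be drawn by uniformly matching half-edges (stubs), which is the standard construction for this ensemble. The sketch below is illustrative only: the function name and the convention that the diagonal entry stores the number of self-loops are assumptions, and it is not the sampler used to produce the figures.

```python
import numpy as np

def sample_configuration_model(degrees, rng):
    """Draw a loopy multigraph with the given degree sequence by stub matching.

    Returns an integer matrix M where M[i, j] (i != j) is the number of edges
    between i and j, and M[i, i] is the number of self-loops on i (a convention
    assumed here).
    """
    degrees = np.asarray(degrees)
    stubs = np.repeat(np.arange(degrees.size), degrees)   # one entry per half-edge
    if stubs.size % 2 != 0:
        raise ValueError("the degree sequence must sum to an even number")
    rng.shuffle(stubs)                                    # uniform random matching
    M = np.zeros((degrees.size, degrees.size), dtype=int)
    for i, j in stubs.reshape(-1, 2):                     # consecutive stubs form an edge
        M[i, j] += 1
        if i != j:
            M[j, i] += 1
    return M

degrees = [3, 2, 2, 2, 1]
M = sample_configuration_model(degrees, np.random.default_rng(42))
recovered = M.sum(axis=1) + np.diag(M)    # a self-loop contributes 2 to the degree
```

By construction, every sampled multigraph reproduces the input degree sequence exactly, including when self-loops or multiedges occur.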

Figure 6 shows the predictability and reconstructability, as estimated by the MF estimator, of the three dynamics evolving on instances drawn from the configuration model. The top row shows the results when using a synthetic degree sequence obtained from a geometric degree distribution, while for the bottom row, degree sequences from different real networks are used for each dynamics. These distributions are shown in Fig. 7. We used the Little Rock Lake food web95 (as in ref. 40) jointly with the Glauber dynamics to simulate a simplification of the interaction between species. In the case of the SIS dynamics, we considered the European airline network96 to mimic the spread of an epidemic. Finally, to simulate the neural activity of the Cowan dynamics, we used the C. elegans neural network97.

Fig. 6: Dynamics evolving on configuration model graphs.

a, d Glauber dynamics, b, e SIS dynamics, and c, f Cowan dynamics. We used the configuration model (see Eq. (7)) to generate multigraphs of varying sizes and degree distributions. In the top row, we generated multigraphs with geometric degree distribution of size N = 1000 and with M = 2500 edges (see Fig. 7a). In the bottom row, we used the degree distribution of real networks: d Little Rock Lake food web95, e European airline route network96, f C. elegans neural network97. The parameters used to generate the time series are the same in the top and bottom panels (see Table 1), except in f the time series length is T = 5000 while in the others T = 2000. Similar to Fig. 5, U(G ∣ X) is shown in blue (left axis) and U(X ∣ G) is shown in orange (right axis). We show, for each dynamics, the uncertainty coefficients as a function of the coupling parameter: J for Glauber, λ for SIS, and ν for Cowan. Each shaded area indicates a range of couplings over which duality was observed. The vertical dotted-dashed lines correspond to the phase transition thresholds of each dynamics, which are estimated from Monte Carlo simulations (see Section IX of the Supplementary Information). For the Cowan dynamics, the forward and backward branches are shown with their corresponding thresholds and dual regions (see main text).

Fig. 7: Degree distributions of the graphs used in Fig. 6.

a multigraphs with geometric degree distribution \(\rho(k) = (1 - p)p^{k}\) where p = 5/6, with N = 1000 nodes and M = 2500 edges, b Little Rock Lake food web95, c European airline route network96, d C. elegans neural network97. See “Real networks” section for further details about the graphs.

First, the results of Fig. 6 show a meaningful comparison between the dynamics for different types of structures. For example, on the one hand, the Glauber dynamics is globally less predictable than the other two, since its predictability coefficient is overall smaller. In other words, the knowledge of a graph g provides less information about X in the Glauber dynamics in comparison with the others, relative to the total amount of information needed to predict X. This is related to the time reversibility of the Glauber dynamics, which allows any vertex to transition from the inactive to the active state (and vice versa) with nonzero probability, at any time, effectively making the Glauber dynamics more random than the others—i.e. H(X) is greater for Glauber than the other processes. On the other hand, the SIS and Cowan dynamics are shown as practically unpredictable and unreconstructable when their coupling parameter is below their respective critical point. This precisely occurs in the inactive phase, where the system rapidly reaches the inactive state and no mutual information can be generated. By contrast, the Glauber dynamics does not reach an inactive state below its critical point, which explains the gradual increase in predictability and reconstructability in that region.

Several additional observations are worth making. All dynamics exhibit maxima for U(X ∣ G) and U(G ∣ X), which delineate a region of duality illustrated by the shaded areas (two for Cowan, that is, one for each branch). These regions are close to, but systematically above, their respective phase transition thresholds, regardless of the type of degree sequence. A similar phenomenon in spin dynamics on non-random lattices has been reported by previous works79,80, in which the information transmission rate between spins (a measure akin to I(X; G)) is maximized above the critical point. Our numerical results are consistent with theirs, and suggest that their findings regarding near-critical systems even apply beyond spin dynamics on fixed lattices, to other types of processes on more heterogeneous and random structures.

Discussion

In this work, we used information theory to characterize the structure-function relationship with mutual information. We showed how mutual information is a natural starting point to define both predictability and reconstructability in dynamics on networks, and even how it explains the performance accuracy of prediction and reconstruction algorithms. In turn, we demonstrated how prediction and reconstruction in complex systems are intrinsically related. Our approach is quite general, allowing the exploration of different configurations of dynamics on networks of the form G → X, thus varying the nature of the process itself as well as the random graph on which it evolves. Our framework could be extended to adaptive systems98,99,100,101 where both X and G influence each other (i.e., X ↔ G). The relationship between X and G could also go the other way around: A system in which X generates a graph G (i.e., X → G). Hyperbolic graphs102,103 fall into this category, where X represents a set of coordinates, and our framework could be extended to quantifying the feasibility of network geometry inference104,105,106.

We exposed various examples where our measures can be computed analytically and found efficient ways to estimate them numerically when needed, thus allowing thorough investigation of large systems. More work on this front is required, however, since the evaluation of these estimators remains quite computationally costly. It would be worth investigating dimension-reduction methods11,13,14 and approximate master equations18,107, among others, for obtaining more efficient and reliable approximations of I(X; G), U(X ∣ G) and U(G ∣ X).

Central to our findings is the peculiar discovery that predictability and reconstructability are not only related, but sometimes dual to one another. We found many examples of this duality in systems of increasing complexity, while we also emphasized that its universality is limited to certain circumstances. One of those circumstances occurs when we change the length of the processes, for which we mathematically proved the existence of duality. We also presented numerical evidence of duality near critical points in three different dynamics on real networks. These findings generalize and formalize previous works79,80, while being consistent with them, and suggest that the reconstructability-predictability duality with respect to order parameters is closely linked to the criticality in these systems.

From a practical perspective, the existence of such a θ-duality can be critical to network modeling applications, since it also suggests a predictability-reconstructability trade-off. On one hand, by choosing the parameter θ, we can minimize the uncertainty of the reconstructed structure, but this may result in a structure that is less informative regarding the dynamics. On the other hand, we can consider the reverse case, where the process is maximally influenced by the inferred structure, whose uncertainty is nevertheless not minimized. Analogous to the position-momentum duality in the Heisenberg uncertainty principle of quantum mechanics, the predictability-reconstructability duality must be accounted for in our network models if we are to disentangle complex systems.

Methods

Binary Markov chains on graphs

The models used throughout the paper are for the most part Markov chains X = (X1, X2, …, XT) that are governed by a conditional probability P(X ∣ G) that can be factored as follows:

$$P({{{{{{\bf{X}}}}}}}\mid G)=P({{{{{{\bf{X}}}}}}}_1)\prod\limits_{t=1}^{T-1} P\left({{{{{{\bf{X}}}}}}}_{t+1} \mid {{{{{{\bf{X}}}}}}}_{t},G\right)\,.$$
(8)

The probability P(Xt+1 ∣ Xt, G) is the global transition probability from state Xt to state Xt+1, and P(X1) represents the probability distribution of the initial conditions, which is independent of G in our case. More specifically, we assume that Xt is a random binary vector of size N, and that the global transition probability can be factored in terms of local transition probabilities as follows:

$$\begin{array}{ll} P\left({{{{{{\bf{X}}}}}}}_{t+1} \mid {{{{{{\bf{X}}}}}}}_{t},G\right)=\prod\limits_{i=1}^N &\left\{ \left[\alpha(n_{i,t},m_{i,t})\right]^{(1 - X_{i,t}) X_{i,t+1}} \left[1 - \alpha(n_{i,t},m_{i,t})\right]^{(1 - X_{i,t}) (1 - X_{i,t+1})}\right. \\ &\left. \left[\beta(n_{i,t},m_{i,t})\right]^{X_{i,t} (1 - X_{i,t+1})} \left[1 -\beta(n_{i,t},m_{i,t})\right]^{X_{i,t} X_{i,t+1}} \right\}\,. \end{array}$$
(9)

As mentioned in “θ-duality between predictability and reconstructability” section, the functions α and β correspond to the activation and deactivation probabilities. In the general case, they are dependent on the number of active neighbors mi, and inactive neighbors ni of a node i such that mi + ni = ki where ki is the degree of this node.
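The factorization in Eq. (9) means that a time series can be generated by updating every node independently given the current state and the graph. The following sketch illustrates this under stated assumptions: the helper names `step` and `simulate`, the ring graph, and the rules `toy_alpha` and `toy_beta` are hypothetical choices made for illustration and are not the functions listed in Table 1.

```python
import numpy as np

def step(x, adjacency, alpha, beta, rng):
    """Sample X_{t+1} from X_t = x according to the factorized form of Eq. (9)."""
    m = adjacency @ x                      # number of active neighbors of each node
    n = adjacency.sum(axis=1) - m          # number of inactive neighbors (m + n = k)
    # Probability of being active at t+1: alpha if inactive now, 1 - beta if active now.
    p_active = np.where(x == 1, 1.0 - beta(n, m), alpha(n, m))
    return (rng.random(x.size) < p_active).astype(int)

def simulate(adjacency, alpha, beta, n_steps, rng):
    """Return an array of shape (n_steps, N) holding X_1, ..., X_T."""
    x = rng.integers(0, 2, size=adjacency.shape[0])   # initial state independent of G
    states = [x]
    for _ in range(n_steps - 1):
        x = step(x, adjacency, alpha, beta, rng)
        states.append(x)
    return np.array(states)

# Hypothetical rules for illustration only (not the entries of Table 1).
def toy_alpha(n, m, rate=0.3):
    return 1.0 - (1.0 - rate) ** m

def toy_beta(n, m, recovery=0.5):
    return np.full(np.shape(m), recovery)

ring = np.array([[0, 1, 0, 0, 1],
                 [1, 0, 1, 0, 0],
                 [0, 1, 0, 1, 0],
                 [0, 0, 1, 0, 1],
                 [1, 0, 0, 1, 0]])
X = simulate(ring, toy_alpha, toy_beta, n_steps=20, rng=np.random.default_rng(0))
```

Any pair of functions returning probabilities in [0, 1] can be passed as `alpha` and `beta`, so the same loop covers all binary dynamics of this form.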

Performance of prediction and reconstruction algorithms

To substantiate our claim about the interpretation of I(X; G), we used different prediction and reconstruction algorithms and compared their performance with I(X; G) in Fig. 2. In this section, we elaborate on this analysis.

Prediction algorithms

The prediction algorithms used in Fig. 2 correspond to Markov models that predict a transition (activation and deactivation) probability matrix P, where Pi,t corresponds to the probability that node i at time t transitions to the active state in the next time step. To make the comparison with I(X; G), we compare the transition probability matrix P* of the true model (in the case of Fig. 2, the Glauber dynamics, where the entries of P* are given by the activation α and deactivation β probabilities; see Table 1) with those predicted by models learned from time series generated by the Glauber dynamics. These models are trained with 100 concatenated time series, each generated using a different graph sampled from the Erdős-Rényi model. The models are then trained to predict the time series without the knowledge of the structure. The input of these models is the complete state of the system at time t, i.e., Xt, and the output is a vector \({\hat{P}}_{t}=({\hat{P}}_{1,t},\cdots \,,{\hat{P}}_{N,t})\), where \({\hat{P}}_{i,t}\) is the predicted probability that node i transitions to the active state at time t. We use the mean absolute error (MAE) between P* and \(\hat{P}\) to compare them, i.e.,

$${{{{{\rm{MAE}}}}}}\left({P}^ {\ast},\, {\hat{P}}\right)=\frac{1}{NT}\sum\limits_{{i}={1}}^N\sum\limits_{{t}={1}}^T \left|{P}^ {\ast} _{{i},{t}} - {\hat{P}}_{{i},{t}}\right|\,.$$
(10)

In doing so, the MAE quantifies the difference between a graph-dependent model and a graph-independent one. It thus highlights the importance of G for the prediction of X, and serves as a proxy for I(X; G).

We consider two graph-independent prediction models: a logistic regression model and a multilayer perceptron (MLP). In both models, the predicted transition probabilities at time t are given by

$${\hat{P}}_{t}=\frac{1}{{e}^{-f({{{{{{{{\bf{X}}}}}}}}}_{t})}+1}\,,$$
(11)

where f(Xt) is a learnable function, that is linear for the logistic regression model, i.e.

$${f}_{{{\!\!{{{{{\rm{logistic}}}}}}}}}({{{{{{{{\bf{X}}}}}}}}}_{t})={{{{{{{\bf{A}}}}}}}}{{{{{{{{\bf{X}}}}}}}}}_{t}+{{{{{{{\bf{b}}}}}}}}\,,$$
(12)

and non-linear for the MLP:

$${f}_{{{\!\!{{{{{\rm{MLP}}}}}}}}}({{{{{{{{\bf{X}}}}}}}}}_{t})={{{{{{{\rm{ReLU}}}}}}}}\left[{{{{{{{{\bf{W}}}}}}}}}_{2}{{{{{{{\rm{ReLU}}}}}}}}({{{{{{{{\bf{W}}}}}}}}}_{1}{{{{{{{{\bf{X}}}}}}}}}_{t}+{{{{{{{{\bf{b}}}}}}}}}_{1})+{{{{{{{{\bf{b}}}}}}}}}_{2}\right]\,,$$
(13)

such that

$${{{{{{{\rm{ReLU}}}}}}}}(x)=\left\{\begin{array}{ll}x\quad &\,{{\mbox{if}}}\,x \, > \, 0\,{{\mbox{,}}}\,\\ 0\quad &\,{{\mbox{otherwise}}}\,\end{array}\right.\,.$$
(14)

The weight matrices A, W1, and W2, and bias vectors b, b1, and b2, are learned via stochastic gradient descent using a cross-entropy loss.
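The forward passes of Eqs. (11) to (13) and the error of Eq. (10) can be written in a few lines. The sketch below only evaluates the models with random, untrained weights; the hidden width `H`, the random initialization, and all function names are assumptions made for illustration, and the training loop (stochastic gradient descent on a cross-entropy loss) is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # Eq. (11)

def relu(z):
    return np.maximum(z, 0.0)                # Eq. (14)

def predict_logistic(x, A, b):
    return sigmoid(A @ x + b)                # Eqs. (11) and (12)

def predict_mlp(x, W1, b1, W2, b2):
    hidden = relu(W1 @ x + b1)
    # Outer ReLU as written in Eq. (13); it makes f non-negative.
    return sigmoid(relu(W2 @ hidden + b2))

def mean_absolute_error(p_true, p_hat):
    return float(np.mean(np.abs(p_true - p_hat)))   # Eq. (10)

rng = np.random.default_rng(0)
N, H = 5, 8                                  # H (hidden width) is an assumption
x_t = rng.integers(0, 2, size=N)             # state of the system at time t
A, b = rng.normal(size=(N, N)), rng.normal(size=N)
W1, b1 = rng.normal(size=(H, N)), rng.normal(size=H)
W2, b2 = rng.normal(size=(N, H)), rng.normal(size=N)
p_logistic = predict_logistic(x_t, A, b)
p_mlp = predict_mlp(x_t, W1, b1, W2, b2)
```

Both models map the full state Xt to one probability per node, so their outputs can be compared entry by entry with the true transition probabilities through `mean_absolute_error`.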

Reconstruction algorithms

In Fig. 2, we also illustrated the relationship between the performance of reconstruction algorithms and I(X; G). These algorithms are given the time series and they compute a score matrix S, such that Sij for each pair of nodes (i, j) correlates with a probability that an edge exists between them. For the correlation matrix method31, this score is simply the correlation coefficient:

$$S_{ij}=\frac{C_{ij}}{\sigma_i \sigma_j}\,,\,C_{ij}=\frac{1}{T}\sum\limits_{t=1}^T(X_{i,t} - {{\bar{X}}}_i) (X_{j,t} - {{\bar{X}}}_j)$$
(15)

where \({{\bar{X}}}_i=\frac{1}{T}{\sum}_{t=1}^T X_{i,t}\) and \(\sigma_i^2=\frac{1}{T}{\sum}_{t=1}^T (X_{i,t} - {{\bar{X}}}_i)^2\). In the Granger causality method32, we compare via an F-test the prediction of the time series of a single node i using a linear auto-regressive model, with another auto-regressive model that includes the time series of node j. Then, the test determines whether the errors of the two models are similar or different by computing the following F-statistic:

$${S}_{ij}=\frac{{{{\Sigma }}}_{ij}}{{{{\Sigma }}}_{i}}\,,$$
(16)

where Σi is the error variance of the auto-regressive model of i, and Σij is the error variance of the other model that also includes j. Finally, in the transfer entropy method33, the score is given by the transfer entropy from the time series of j to the time series of i:

$${S}_{ij}={T}_{{X}_{j}\to {X}_{i}}$$
(17)

where

$${T}_{{X}_{j}\to {X}_{i}}=H({X}_{i,t}| {X}_{i,t-1})-H({X}_{i,t}| {X}_{i,t-1},{X}_{j,t-1})$$
(18)

The entropies involved in the computation of \({T}_{{X}_{j}\to {X}_{i}}\) are evaluated using the maximum likelihood estimators of the probabilities P(Xi,t ∣ Xi,t−1) and P(Xi,t ∣ Xi,t−1, Xj,t−1), estimated from the time series itself.
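A plug-in (maximum likelihood) estimate of the transfer entropy in Eq. (18) can be obtained by counting the observed transitions. The following is a minimal sketch for discrete time series; the function name and the synthetic test signals (one series that copies the other with a one-step delay) are assumptions chosen for illustration.

```python
import numpy as np
from collections import Counter

def transfer_entropy(x_i, x_j):
    """Plug-in estimate (in nats) of T_{X_j -> X_i} from Eq. (18) for discrete series."""
    now, past_i, past_j = list(x_i[1:]), list(x_i[:-1]), list(x_j[:-1])
    n = len(now)
    c_full = Counter(zip(now, past_i, past_j))    # counts of (x_{i,t}, x_{i,t-1}, x_{j,t-1})
    c_self = Counter(zip(now, past_i))            # counts of (x_{i,t}, x_{i,t-1})
    c_past_i = Counter(past_i)                    # counts of x_{i,t-1}
    c_past_ij = Counter(zip(past_i, past_j))      # counts of (x_{i,t-1}, x_{j,t-1})
    te = 0.0
    for (a, b, c), count in c_full.items():
        p_given_both = count / c_past_ij[(b, c)]  # estimate of P(x_{i,t} | x_{i,t-1}, x_{j,t-1})
        p_given_self = c_self[(a, b)] / c_past_i[b]   # estimate of P(x_{i,t} | x_{i,t-1})
        te += (count / n) * np.log(p_given_both / p_given_self)
    return float(te)

rng = np.random.default_rng(0)
x_j = rng.integers(0, 2, size=5000)               # driver: independent fair coin flips
x_i = np.concatenate(([0], x_j[:-1]))             # follower: copies x_j with a one-step delay
te_forward = transfer_entropy(x_i, x_j)           # information flowing from j to i
te_backward = transfer_entropy(x_j, x_i)          # information flowing from i to j
```

In this synthetic example the follower is fully determined by the past of the driver, so the forward estimate is close to log 2 nats while the backward estimate is close to zero.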

We quantify the accuracy of the reconstruction using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. This curve is obtained by comparing the true positive rate with the false positive rate, for different thresholds \(\phi \in [\min \left\{S\right\},\max \left\{S\right\}]\). The AUC, being the integral of that curve, therefore represents the probability that the score matrix S assigns a higher score to a randomly chosen node pair connected by an edge than to a randomly chosen unconnected pair.
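The AUC can be computed without tracing the ROC curve explicitly, by using its equivalence with the fraction of (edge, non-edge) pairs that the scores rank correctly, with ties counted as one half. The sketch below relies on this equivalence; the function name and the small path graph are illustrative assumptions.

```python
import numpy as np

def auc_from_scores(scores, adjacency):
    """AUC of the ROC curve for a score matrix against a simple undirected graph."""
    upper = np.triu_indices(adjacency.shape[0], k=1)   # consider each node pair once
    s = scores[upper]
    is_edge = adjacency[upper].astype(bool)
    pos, neg = s[is_edge], s[~is_edge]                 # scores of edges and of non-edges
    wins = (pos[:, None] > neg[None, :]).sum()         # edge scored above non-edge
    ties = (pos[:, None] == neg[None, :]).sum()        # ties count as one half
    return float((wins + 0.5 * ties) / (pos.size * neg.size))

path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]])
auc_perfect = auc_from_scores(path.astype(float), path)       # scores equal to the truth
auc_inverted = auc_from_scores(1.0 - path, path)              # scores opposite to the truth
auc_constant = auc_from_scores(np.zeros((4, 4)), path)        # uninformative scores
```

The three calls correspond to a perfect, a fully inverted, and an uninformative score matrix, respectively.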

Formal definition of θ-duality

In what follows, we define the duality between predictability and reconstructability by taking a more general stance: Instead of considering a stochastic process X evolving on a random graph G, we let G be any discrete random variable conditioning the probability of X. First, we define the local duality of the uncertainty coefficients. The latter are considered as continuously differentiable functions with respect to a parameter θ whose domain is some non-empty interval of the real line.

Definition 1

(Local duality). The uncertainty coefficients U(X ∣ G) and U(G ∣ X) are locally dual with respect to θ at θ = θ* if and only if

$${\left[\frac{\partial U({{{{{{{\bf{X}}}}}}}}| G)}{\partial \theta }\frac{\partial U(G| {{{{{{{\bf{X}}}}}}}})}{\partial \theta }\right]}_{\theta={\theta }^{*}} < \, 0\,.$$
(19)

The definition of the θ-duality, a global property, follows that of the local duality.

Definition 2

(θ-Duality). The uncertainty coefficients U(X ∣ G) and U(G ∣ X) are dual with respect to θ, or θ-dual, in the interval Θ if and only if they are locally dual for all values of θ* in Θ.
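When the uncertainty coefficients are only available on a grid of θ values, the condition of Eq. (19) can be checked numerically with finite differences. The sketch below is a rough illustration: the function name `dual_mask` and the two bell-shaped curves are hypothetical and do not correspond to any of the dynamics studied here.

```python
import numpy as np

def dual_mask(theta, u_pred, u_recon):
    """Boolean mask of grid points where the two coefficients are locally dual (Eq. (19))."""
    d_pred = np.gradient(u_pred, theta)      # finite-difference estimate of dU(X|G)/dtheta
    d_recon = np.gradient(u_recon, theta)    # finite-difference estimate of dU(G|X)/dtheta
    return d_pred * d_recon < 0

# Hypothetical curves: predictability peaks at theta = 1, reconstructability at theta = 2.
theta = np.linspace(0.0, 3.0, 301)
u_pred = np.exp(-(theta - 1.0) ** 2)
u_recon = np.exp(-(theta - 2.0) ** 2)
mask = dual_mask(theta, u_pred, u_recon)
between = (theta > 1.05) & (theta < 1.95)    # strictly between the two maxima
```

In this toy example, the mask is true between the two maxima and false outside of them, so the maxima delineate the region of θ-duality.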

From these definitions, we relate the presence of extrema of U(X ∣ G) and U(G ∣ X) with the existence of a θ-duality.

Lemma 1

(θ-duality between extrema). Let Θ be a non-empty subinterval of the variable θ whose one endpoint is a local extremum of U(X ∣ G) and the other, a local extremum of U(G ∣ X). Moreover, suppose that U(X ∣ G) and U(G ∣ X) do not have critical points in Θ. Then the extrema points delineate a region of θ-duality if and only if they are both maxima (or both minima).

The proof of this lemma is available in Supplementary Note IV.

Proof of the universality of the T-duality

In what follows, we prove Theorem 1, which shows the universality of the T-duality, where T is the number of steps in the process X. We make use of the two following lemmas, which are proved in the Supplementary Information (Notes V and VI), regarding the monotonicity of I(X; G) with respect to T and the existence of continuous extensions of U(X ∣ G) and U(G ∣ X), which will allow us to apply Definition 1 involving derivatives.

Lemma 2

(Monotonicity of mutual information with T). Let X = (X1, X2, …, XT) be a Markov chain of length T whose transition probabilities are conditional to some discrete random variable G that is independent of T and such that H(Xt+1 ∣ Xt) > 0 for all t ∈ {1, …, T − 1}. Suppose moreover that the state spaces of X and G are finite. Then the mutual information I(X; G) is nonzero and monotonically increasing with \(T\in {{\mathbb{Z}}}_{+}\).

Lemma 3

(Continuous extension of uncertainty coefficients with T). Let X = (X1, X2, …, XT) and G respectively be a Markov chain and a discrete random variable as in Lemma 2. Then the uncertainty coefficients U(G ∣ X) and U(X ∣ G), interpreted as functions of \(T\in {{\mathbb{Z}}}_{+}\), can be uniquely generalized to functions, respectively f(T) and g(T), that are holomorphic for all \(T\in {\mathbb{C}}\), and thus real analytic for all \(T\in {{\mathbb{R}}}_{+}\). Moreover, H(X) can be extended to a function h(T) that is analytic for all \(T\in {{\mathbb{R}}}_{+}\) except where f(T) = 0.

Next, we prove Theorem 1.

Proof

According to Lemma 3, the quantities U(X ∣ G), U(G ∣ X), and H(X), which were originally defined as real functions of \(T\in {{\mathbb{Z}}}_{+}\), have unique analytic extensions on the positive real axis, i.e., \(T\in {{\mathbb{R}}}_{+}\). This allows us to treat U(X ∣ G), U(G ∣ X), and H(X) as continuously differentiable functions with respect to T, where \(U(G\,|\,{\bf X})=\frac{I({\bf X};G)}{H(G)}\) and H(X) are also monotone.

Now, by hypothesis, the entropy rate of the Markov chain X, \(R:={\lim }_{T\to \infty }\frac{H({\bf X})}{T}\), is well defined and nonzero. Hence, H(X) ~ RT, i.e., H(X) is positive and asymptotically linearly increasing with T. Moreover, since G is independent of T and I(X; G) > 0, it follows that I(X; G) is monotonically increasing with respect to T by Lemma 2. As a result, \(U(G\,|\,{\bf X})=\frac{I({\bf X};G)}{H(G)}\) is also monotonically increasing, since its denominator is independent of T, by assumption. This translates to the strict inequality \(\frac{\partial U(G\,|\,{\bf X})}{\partial T} > 0\). If there exists a T-duality, i.e., there is a domain of T where Eq. (19) is true, then U(X ∣ G) must be monotonically decreasing with T, i.e., \(\frac{\partial U({\bf X}\,|\,G)}{\partial T} < 0\), in that domain. To prove this, note that we can relate the two uncertainty coefficients using

$$H({{{{{{{\bf{X}}}}}}}})=\frac{H(G)\,U(G\,| \,{{{{{{{\bf{X}}}}}}}})}{U({{{{{{{\bf{X}}}}}}}}\,| \,G)}\,.$$
(20)

This leads to the following differential equation

$$\frac{\partial }{\partial T}\left[\log U({{{{{{{\bf{X}}}}}}}}\,| \,G)\right]=\frac{\partial }{\partial T}\left[\log U(G\,| \,{{{{{{{\bf{X}}}}}}}})\right]-\frac{\partial }{\partial T}[\log H({{{{{{{\bf{X}}}}}}}})]\,,$$
(21)

where we used the fact that \(\frac{\partial H(G)}{\partial T}=0\). Hence, to show that U(X ∣ G) is monotonically decreasing with T, the following inequality must hold

$$\frac{\partial }{\partial T}\left[\log U(G\,| \,{{{{{{{\bf{X}}}}}}}})\right] \, < \, \frac{\partial }{\partial T}\left[\log H({{{{{{{\bf{X}}}}}}}})\right]\,.$$
(22)
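For completeness, the passage from Eq. (20) to Eq. (22) can be spelled out as follows. This is only a restatement of the steps above, assuming that U(X ∣ G) > 0 so that the logarithms are defined. Taking the logarithm of Eq. (20) and rearranging gives

$$\log U({\bf X}\,|\,G)=\log H(G)+\log U(G\,|\,{\bf X})-\log H({\bf X})\,,$$

and differentiating with respect to T, with \(\frac{\partial H(G)}{\partial T}=0\), yields Eq. (21). Since

$$\frac{\partial }{\partial T}\left[\log U({\bf X}\,|\,G)\right]=\frac{1}{U({\bf X}\,|\,G)}\,\frac{\partial U({\bf X}\,|\,G)}{\partial T}$$

has the same sign as \(\frac{\partial U({\bf X}\,|\,G)}{\partial T}\), the coefficient U(X ∣ G) decreases with T exactly when the right-hand side of Eq. (21) is negative, which is the inequality of Eq. (22).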

Suppose for a moment that U(X ∣ G) is in fact increasing, such that Eq. (22) is false. This will eventually give rise to a contradiction. Let g(T) := U(G ∣ X) and h(T) := H(X) be continuous functions of T such that their derivatives with respect to T are respectively given by \({g}^{\prime}(\tau ):={\left.\frac{\partial g(T)}{\partial T}\right\vert }_{T=\tau }\) and \({h}^{\prime}(\tau ):={\left.\frac{\partial h(T)}{\partial T}\right\vert }_{T=\tau }\). Note that 0 < g(τ) ≤ 1 and h(τ) > 0 for all \(\tau \in {{\mathbb{R}}}_{+}\). If Eq. (22) is false, then

$${(\log g(T))}^{{\prime} }\ge {(\log h(T))}^{{\prime} }\,.$$
(23)

Using Grönwall’s inequality108, Theorem 1.2.1, we get

$$\frac{g(T)}{g(a)}\ge \frac{h(T)}{h(a)}\,,\,0 \, < \, a \, < \, T.$$
(24)

So far, we have established that h(T) = H(X) ~ RT and that U(G ∣ X) is monotonically increasing. We have also proved that if U(X ∣ G) is not monotonically decreasing with T, then inequality (24) is satisfied. However, the latter inequality and h(T) ~ RT readily imply that g(T) belongs to the class Ω(T), which is the set of all \(\tilde{g}(T)\) such that there exist positive constants, S and T*, for which \(\tilde{g}(T)\ge ST\) for all T ≥ T* (i.e., Knuth’s Big Omega109).

Two cases must be considered. First, if ST* > 1, then \(\tilde{g}(T)\ge S{T}^{*} > 1\), which is in direct contradiction with g(T) ≤ 1 whenever T ≥ T*. Second, if ST* ≤ 1, then choose \({T}^{**} > {S}^{-1}\ge {T}^{*}\), so that \(\tilde{g}(T)\ge S{T}^{**} > 1\) for all T ≥ T**. This again contradicts the inequality g(T) ≤ 1 whenever T ≥ T**. As a result, inequality (24) cannot be satisfied when T ≥ ϕ, with \(\phi=\max \{{T}^{*},{T}^{**}\}\). We thus conclude that U(X ∣ G) is monotonically decreasing for all T ≥ ϕ. Therefore, U(G ∣ X) and U(X ∣ G) are T-dual in the interval [ϕ, ∞). □

Estimators of the mutual information

The mutual information I(X; G) is generally intractable. Its intractability stems from the evaluation of the evidence probability, which is defined by the following equation:

$$P({{{{{{{\bf{X}}}}}}}}={{{{{{{\boldsymbol{x}}}}}}}})={\sum}_{g\in {{{{{{{\mathcal{G}}}}}}}}}P(G=g)P({{{{{{{\bf{X}}}}}}}}={{{{{{{\boldsymbol{x}}}}}}}}| G=g)\,.$$
(25)

Indeed, this sum potentially counts a number of terms which grows exponentially with the number of vertices N in the random graph. More specifically, the evidence probability appears in two entropy terms needed to compute the mutual information, namely the marginal entropy \(H({{{{{{{\bf{X}}}}}}}})=-\left\langle \log P({{{{{{{\bf{X}}}}}}}})\right\rangle\) and the reconstruction entropy \(H(G| {{{{{{{\bf{X}}}}}}}})=-\langle \log \frac{P(G)P({{{{{{{\bf{X}}}}}}}}| G)}{P({{{{{{{\bf{X}}}}}}}})}\rangle\), where \(\left\langle f(Y)\right\rangle\) denotes the expectation of f(Y). Fortunately, the evidence probability, and in turn the mutual information, can be estimated efficiently using Monte Carlo techniques, which we present in this section.

Graph enumeration approach

For sufficiently small random graphs (N ≤ 5), the evidence probability can be efficiently computed by enumerating all graphs of \({{{{{{{\mathcal{G}}}}}}}}\) and by adding explicitly each term of Eq. (25). Then, we can estimate the mutual information by sampling M graphs \({\left\{{g}^{(m)}\right\}}_{m=1..M}\), followed by M time series \({\left\{{{{{{{{{\bf{x}}}}}}}}}^{(m)}\right\}}_{m=1..M}\)—such that x(m) is generated with g(m)—, and by computing the following arithmetic average:

$$I({{{{{\bf{X}}}}}};G) \simeq \frac{1}{M} \sum\limits_{m=1}^M \left[\log P\left({{{{{\bf{X}}}}}}={{{{{{\bf{x}}}}}}}^{(m)} | G=g^{(m)}\right) - \log P\left({{{{{\bf{X}}}}}}={{{{{{\bf{x}}}}}}}^{(m)}\right)\right]\,.$$
(26)

The standard error of this estimator scales with the inverse of \(\sqrt{M}\). In Fig. 5, we used this estimator to compute the mutual information, where M = 1000.
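The enumeration-based estimator of Eq. (26) can be sketched as follows for a very small system. This is an illustration under stated assumptions rather than the implementation used for Fig. 5: the SIS-like rule in `activation_prob` and its parameters `lam`, `eps`, and `rec` are hypothetical, the prior is the uniform one of Eq. (6), and the sizes (N = 4, E = 3) are chosen so that the example runs quickly.

```python
import numpy as np
from itertools import combinations

def adjacency_matrices(n, e):
    """Adjacency matrices of all simple graphs with n vertices and e edges."""
    pairs = list(combinations(range(n), 2))
    mats = []
    for edges in combinations(pairs, e):
        a = np.zeros((n, n), dtype=int)
        for i, j in edges:
            a[i, j] = a[j, i] = 1
        mats.append(a)
    return mats

def activation_prob(x, a, lam=0.5, eps=0.05, rec=0.5):
    """P(X_{i,t+1} = 1 | X_t = x, G = a) for a hypothetical SIS-like rule."""
    m = a @ x                                            # number of active neighbors
    infect = 1.0 - (1.0 - eps) * (1.0 - lam) ** m        # includes self-infection eps
    return np.where(x == 1, 1.0 - rec, infect)

def simulate(a, T, rng):
    x = rng.integers(0, 2, size=a.shape[0])
    xs = [x]
    for _ in range(T - 1):
        x = (rng.random(x.size) < activation_prob(x, a)).astype(int)
        xs.append(x)
    return np.array(xs)

def log_likelihood(xs, a):
    """log P(X = xs | G = a), up to the graph-independent term log P(X_1)."""
    total = 0.0
    for x, x_next in zip(xs[:-1], xs[1:]):
        p = activation_prob(x, a)
        total += np.sum(np.log(np.where(x_next == 1, p, 1.0 - p)))
    return total

def estimate_mutual_information(n, e, T, n_samples, rng):
    graphs = adjacency_matrices(n, e)
    log_prior = -np.log(len(graphs))                     # uniform prior, Eq. (6)
    terms = []
    for _ in range(n_samples):
        g = graphs[rng.integers(len(graphs))]            # sample a graph from the prior
        xs = simulate(g, T, rng)                         # then a time series on that graph
        logliks = np.array([log_likelihood(xs, h) for h in graphs])
        log_evidence = np.logaddexp.reduce(logliks + log_prior)   # Eq. (25) in log space
        terms.append(log_likelihood(xs, g) - log_evidence)        # summand of Eq. (26)
    return float(np.mean(terms))

mi = estimate_mutual_information(n=4, e=3, T=50, n_samples=30, rng=np.random.default_rng(0))
```

The term log P(X1) is omitted because it appears in both the likelihood and the evidence and therefore cancels in each summand. Each summand equals log P(G ∣ X) − log P(G), so the estimate can never exceed H(G), which is log 20 nats for the sizes used here.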

Variational mean-field approximation

In this approach, we estimate the posterior probability instead of the evidence probability. According to Bayes’ theorem, the posterior probability is

$$P(G| {{{{{{{\bf{X}}}}}}}})=\frac{P(G)P({{{{{{{\bf{X}}}}}}}}| G)}{P({{{{{{{\bf{X}}}}}}}})}.$$
(27)

Behind this estimator is a variational mean-field (MF) approximation that assumes the conditional independence of the edges. For simple graphs, the MF posterior is

$$P_{{{{{{\mathrm{MF}}}}}}}(G \mid {{{{{\bf{X}}}}}})=\prod\limits_{i\leq j} {\left[\pi_{ij}({{{{{\bf{X}}}}}})\right]}^{A_{ij}}\, {\left[1 - \pi_{ij}({{{{{\bf{X}}}}}})\right]}^{1 - A_{ij}},$$
(28)

where πij(X) := P(Aij = 1 ∣ X) is the marginal conditional probability of the existence of the edge (i, j) given X. For multigraphs, a similar expression can be obtained, but instead involves a probability πij(m ∣ X) := P(Mij = m ∣ X) that there are m multiedges between i and j. In this case, the MF posterior becomes

$$P_{{{{{{\mathrm{MF}}}}}}}(G |{{{{{\bf{X}}}}}})=\prod\limits_{i < j}\prod\limits_{m=0}^{\infty} \left[\pi_{ij}(m|{{{{{\bf{X}}}}}})\right]^{\delta_{m,M_{ij}}},$$
(29)

where δx,y is the Kronecker delta. The MF approximation allows us to compute an upper bound of the true posterior entropy, such that

$$H(G| {{{{{{{\bf{X}}}}}}}})\le -\left\langle \log {P}_{{{{{{{{\rm{MF}}}}}}}}}(G| {{{{{{{\bf{X}}}}}}}})\right\rangle,$$
(30)

as a consequence of the conditional independence between the edges (ref. 82, Theorem 2.6.5). Using the MF approximation and a strategy similar to the exact estimator, we compute the MF estimator of the mutual information as follows:

$$I({{{{{\bf{X}}}}}};G) \geq \frac{1}{M}\sum\limits_{m=1}^M \left[\log P_{{{{{{\mathrm{MF}}}}}}}\left(G=g^{(m)} \mid {{{{{\bf{X}}}}}}={{{{{{\bf{x}}}}}}}^{(m)}\right)- \log P\left(G=g^{(m)}\right)\right].$$
(31)

To compute \({P}_{{{{{{{{\rm{MF}}}}}}}}}\left(G={g}^{(m)}| {{{{{{{\bf{X}}}}}}}}={{{{{{{{\bf{x}}}}}}}}}^{(m)}\right)\), we sample a set \({{{{{{{{\mathcal{Q}}}}}}}}}^{(m)}: = \left\{{g}_{1}^{(m)},\cdots \,,{g}_{Q}^{(m)}\right\}\) of Q graphs from the posterior distribution \(P(G| {{{{{{{\bf{X}}}}}}}}={{{{{{{{\bf{x}}}}}}}}}^{(m)})\). Then, we estimate the probabilities \({\pi }_{ij}({{{{{{{\bf{X}}}}}}}})\simeq \frac{{n}_{ij}^{(m)}}{Q}\) using their maximum likelihood estimate, where \({n}_{ij}^{(m)}\) is the number of times the edge (i, j) appears in \({{{{{{{{\mathcal{Q}}}}}}}}}^{(m)}\). An analogous maximum likelihood estimate is made in the multigraph case, where \({\pi }_{ij}(\omega | {{{{{{{\bf{X}}}}}}}})\simeq \frac{{n}_{ij;\omega }^{(m)}}{Q}\) and \({n}_{ij;\omega }^{(m)}\) counts the number of graphs in \({{{{{{{{\mathcal{Q}}}}}}}}}^{(m)}\) with ω multiedges between i and j. This estimator is a lower bound of the mutual information, as a consequence of Eq. (30). Hence, it is biased, and the extent of this bias depends on the quality of the conditional independence assumption with respect to the true random graph. Note that the MF estimator can yield negative estimates of the mutual information (see Section VIII of the Supplementary Information).
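The marginal estimation and the MF log-posterior for simple graphs can be sketched as follows. This is an illustrative sketch rather than the authors' code. Graphs are assumed to be sets of edges (i, j) with i < j, only pairs with i < j are enumerated because simple graphs have no self-loops, and the marginals are clipped by a small `eps` to avoid log(0) when an edge appears in none or all of the samples. The function names and the clipping are assumptions of the sketch.

```python
import math

def edge_marginals(sampled_graphs, n_vertices):
    """Maximum likelihood estimate pi_ij = n_ij / Q from Q sampled edge sets."""
    q = len(sampled_graphs)
    counts = {}
    for g in sampled_graphs:
        for edge in g:
            counts[edge] = counts.get(edge, 0) + 1
    return {(i, j): counts.get((i, j), 0) / q
            for i in range(n_vertices) for j in range(i + 1, n_vertices)}

def log_mf_posterior(graph, marginals, eps=1e-12):
    """log P_MF(G = graph | X) as a sum of independent Bernoulli log-probabilities."""
    total = 0.0
    for pair, pi in marginals.items():
        pi = min(max(pi, eps), 1.0 - eps)  # assumption: clip to avoid log(0)
        total += math.log(pi) if pair in graph else math.log(1.0 - pi)
    return total

# Hypothetical posterior samples for a graph of 3 vertices (Q = 4).
samples = [
    frozenset({(0, 1)}),
    frozenset({(0, 1), (1, 2)}),
    frozenset({(0, 1)}),
    frozenset({(1, 2)}),
]
marginals = edge_marginals(samples, n_vertices=3)
log_p_mf = log_mf_posterior(frozenset({(0, 1)}), marginals)
print(marginals, log_p_mf)
```

Averaging `log_p_mf - log_prior(g)` over the M sampled pairs then gives the right-hand side of Eq. (31).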

In Fig. 6, we fix the number of graphs sampled from the posterior distribution to Q = 1000, and propose 5N moves between each sample (see also “Markov chain Monte Carlo algorithm” section for more detail).

Markov chain Monte Carlo algorithm

To sample from the posterior distribution, we use a Markov chain Monte Carlo (MCMC) algorithm where, starting from a graph g, we propose a move to graph \({g}^{{\prime} }\), according to a proposition probability \(P({G}^{{\prime} }={g}^{{\prime} }| G=g)\), and accept it with the Metropolis-Hastings probability:

$$\min \left(1,{{\Delta }}\,\frac{P({G}^{{\prime} }=g| G={g}^{{\prime} })}{P({G}^{{\prime} }={g}^{{\prime} }| G=g)}\right)\,,$$
(32)

where \({{\Delta }}=\frac{P(G={g}^{{\prime} })P({{{{{{{\bf{X}}}}}}}}={{{{{{{\bf{x}}}}}}}}| G={g}^{{\prime} })}{P(G=g)P({{{{{{{\bf{X}}}}}}}}={{{{{{{\bf{x}}}}}}}}| G=g)}\) is the ratio between the joint probabilities of the two graphs with the time series X. This ratio can be computed efficiently in \({{{{{{{\mathcal{O}}}}}}}}(T)\) by keeping in memory \({n}_{i,t}\), the number of inactive neighbors, and \({m}_{i,t}\), the number of active neighbors, for each vertex i at each time t (see ref. 40). Equation (32) allows us to sample from the posterior distribution \(P(G| {{{{{{{\bf{X}}}}}}}})\) without computing the intractable normalization constant P(X). We collect a graph sample every Nδ moves, where we fix δ = 5 in all experiments.
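A standard Metropolis-Hastings acceptance step can be sketched in log space as follows. This is a generic sketch, not the authors' implementation. The function name and arguments are illustrative, and the caller is assumed to supply the log joint probabilities log P(G) + log P(X | G) of the current and proposed graphs together with the log proposal probabilities. Only ratios enter the computation, so the evidence P(X) is never needed.

```python
import math
import random

def mh_accept(log_joint_new, log_joint_old,
              log_q_reverse=0.0, log_q_forward=0.0, rng=random):
    """Return True if the proposed graph is accepted.

    log_joint_*  : log P(G) + log P(X | G) for the proposed / current graph.
    log_q_reverse: log probability of proposing the current graph from the new one.
    log_q_forward: log probability of proposing the new graph from the current one.
    """
    log_ratio = (log_joint_new - log_joint_old) + (log_q_reverse - log_q_forward)
    if log_ratio >= 0.0:
        return True  # acceptance probability is min(1, ratio) = 1
    return rng.random() < math.exp(log_ratio)

# A proposal that is four times less probable is accepted about 25% of the time.
rng = random.Random(0)
n_trials = 20000
accepted = sum(mh_accept(math.log(0.25), 0.0, rng=rng) for _ in range(n_trials))
print(accepted / n_trials)
```

Working with log probabilities avoids numerical underflow when the likelihood is a product over many time steps.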

We consider two types of random graphs with different constraints: the Erdős-Rényi model and the configuration model. Hence, we need two different sampling propositions to apply our MCMC algorithm, one for each model. We assume that the support of the Erdős-Rényi model is the set of all simple graphs of N vertices with E edges. In this case, we consider a hinge flip move, where an edge (i, j) is sampled uniformly from the edge set of the graph G and a vertex k is sampled uniformly from its vertex set. Then, we rewire edge (i, j) by selecting either i or j, each with probability \(\frac{1}{2}\), to connect with k. Note that, because we consider the support \({{{{{{{\mathcal{G}}}}}}}}\) of G to be a space of simple graphs, all moves resulting in the addition of a self-loop or a multiedge are rejected with probability 1. As a result, the proposition probability is the same for any move from g to \({g}^{{\prime} }\):

$$P({G}^{{\prime} }={g}^{{\prime} }| G=g)=\frac{1}{EN}\,\Rightarrow\frac{P({G}^{{\prime} }=g| G={g}^{{\prime} })}{P({G}^{{\prime} }={g}^{{\prime} }| G=g)}=1\,.$$
(33)
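The hinge flip proposal can be sketched as follows. It is one possible reading of the move described above, not the authors' code: one endpoint of the sampled edge is kept, chosen with probability 1/2, and the other endpoint is replaced by k. Graphs are assumed to be sets of edges (i, j) with i < j, and the function name is illustrative. Proposals that would create a self-loop or a multiedge return `None`, which stands for an immediate rejection.

```python
import random

def propose_hinge_flip(edges, n_vertices, rng):
    """Propose a hinge flip on a simple graph given as a set of edges (i, j), i < j.

    Returns the new edge set, or None when the proposal is rejected outright.
    """
    i, j = rng.choice(sorted(edges))        # edge sampled uniformly
    k = rng.randrange(n_vertices)           # vertex sampled uniformly
    keep = i if rng.random() < 0.5 else j   # endpoint that stays attached
    if k == keep:
        return None                         # would create a self-loop
    new_edge = (min(keep, k), max(keep, k))
    if new_edge in edges:
        return None                         # multiedge, or the graph is unchanged
    return (edges - {(i, j)}) | {new_edge}

# Example on a path graph 0 - 1 - 2 with an isolated vertex 3.
rng = random.Random(1)
graph = frozenset({(0, 1), (1, 2)})
proposal = propose_hinge_flip(graph, n_vertices=4, rng=rng)
print(proposal)
```

Every accepted proposal removes one edge and adds one edge, so the number of edges E is preserved, as required by the support of the model.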

For the configuration model, we assume that the support is the set of all loopy multigraphs of N vertices whose degree sequence is k. In this case, we propose double-edge swap moves according to the prescription of ref. 110, to which we refer the reader for further details.

Real networks

In this section, we present the real networks used in the bottom panels of Fig. 6. The networks have been downloaded from the Netzschleuder network catalog111.

Little Rock Lake food web

The Little Rock Lake food web95 is composed of N = 183 nodes and M = 2 494 edges, where nodes represent taxa (like species) found in Little Rock Lake in Wisconsin, and edges represent feeding patterns between two taxa. As presented in ref. 95, this network is directed, but for the purpose of our paper we reciprocated all edges. Also, note that the Glauber dynamics, which we used jointly with the Little Rock Lake food web in Fig. 6, was also used in ref. 40 to simulate a simplified interaction between the taxa.

European airline route network

The European airline route network96 is a multiplex network composed of N = 450 nodes and M = 3 588 edges, where nodes represent airports and edges are routes between them. These edges have different types, encoding the different airlines. For simplicity, we do not make any distinction between the edge types in our paper.

C. elegans neural network

The C. elegans neural network97 used in Fig. 6 is an undirected network of N = 514 nodes and M = 2 363 edges representing the neural network of male C. elegans worms. The nodes are neurons and the edges represent gap junctions between neurons.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.