Main

Human language is organized around a systematic, compositional correspondence between the structure of utterances and the structure of the meanings that they express1. For example, an English speaker will describe an image such as Fig. 1a with an utterance such as ‘a cat with a dog’, in which the parts of the image correspond regularly with parts of the utterance such as ‘cat’—what we call words. This way of relating form and meaning may seem natural, but it is not logically necessary. For example, Fig. 1b shows an utterance in a hypothetical counterfactual language where meaning is decomposed in a way that most people would find unnatural: here, we have a word ‘gol’, which refers to a cat head and a dog head together, and another word ‘nar’, which refers to a cat body and a dog body together. Similarly, Fig. 1c presents a hypothetical language that is systematic but with an unnatural way of decomposing the utterance: here, the utterance contains individually meaningful subsequences ‘a cat’, ‘with’ and ‘a dog’, but these are interleaved together, rather than concatenated as they are in English. We can even conceive of languages such as in Fig. 1d, where each meaning is expressed holistically as a single unanalysable form2,3—in fact, this lack of systematic structure is expected in optimal codes like Huffman codes4,5. Why is human language the way it is, and not like these counterfactuals?

Fig. 1: Example utterances describing an image in English and various hypothetical languages.

a, An English utterance exhibiting natural local systematicity. b, An unnatural systematic language in which ‘gol’ means a cat head paired with a dog head and ‘nar’ means a cat body paired with a dog body. c, A non-local but systematic language in which an utterance is formed by interleaving the words for ‘cat’ and ‘dog’. d, A holistic language in which the form ‘vek’ means ‘a cat with a dog’ with no correspondence between parts of form and parts of meaning.

We argue that the particular structure of human language can be derived from general constraints on sequential information processing. We start from three observations:

  1. Utterances consist, to a first approximation, of one-dimensional sequences of discrete symbols (for example, phonemes).

  2. The ease of production and comprehension of these utterances is influenced by the sequential predictability of these symbols down to the smallest timescales6,7,8,9,10,11.

  3. Humans have limited cognitive resources for use in sequential prediction12,13,14,15,16.

Thus, we posit that language is structured in a way that minimizes the complexity of sequential prediction, as measured using a quantity called predictive information: the amount of information about the past of a sequence that any predictor must use to predict its future17,18, also called excess entropy19,20. Below, we find that codes that are constrained to have low predictive information within signals have systematic structure resembling natural language, and we provide massively cross-linguistic empirical evidence based on large text corpora showing that natural language has lower predictive information than would be expected if it had different kinds of structure.

Results

Explananda

First, we clarify what we want to explain. Taking a maximally general stance, we think of a language as a function mapping meanings to forms, where meanings are any objects in a set \({\mathcal{M}}\), and forms are strings drawn from a finite alphabet of letters Σ, typically standing for phonemes. We say a language is systematic when it is a homomorphism21,22, as illustrated in Fig. 2. That is, if a meaning m can be decomposed into parts (say m = m1 × m2), then the string for that meaning decomposes in the same way:

$$L({m}_{1}\times {m}_{2})=L({m}_{1})\cdot L({m}_{2}),$$
(1)

where ‘⋅’ is some means of combining two strings, such as concatenation. For example, an object consisting of a blue square would be described in English as the form ‘blue square’. The meaning is decomposed into features for colour and shape, and these features are expressed systematically as the words ‘blue’ and ‘square’ concatenated together.
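
To make equation (1) concrete, the following Python sketch implements a toy systematic language in which meanings are tuples of feature labels. The feature labels, the lexicon and the choice of ‘⋅’ (here, concatenation with a space) are illustrative assumptions, not part of the formal definition.

```python
# Toy illustration of equation (1): a systematic language is a homomorphism
# from structured meanings to strings. The lexicon and the combination
# operator below are illustrative assumptions.

LEXICON = {"BLUE": "blue", "RED": "red", "SQUARE": "square", "CIRCLE": "circle"}

def combine(s1: str, s2: str) -> str:
    """The operator '.': here, concatenation with a space."""
    return s1 + " " + s2

def L(meaning: tuple) -> str:
    """Systematic language: L(m1 x m2) = L(m1) . L(m2)."""
    parts = [LEXICON[feature] for feature in meaning]
    form = parts[0]
    for part in parts[1:]:
        form = combine(form, part)
    return form

print(L(("BLUE", "SQUARE")))  # -> 'blue square'
```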

Fig. 2: Two examples of linguistic systematicity as a homomorphism.

L(⋅) stands for the English language, seen as a function from meanings to forms (strings). a, The meaning naturally decomposes into two features corresponding to the two animals. The form ‘a cat with a dog’ decomposes systematically into forms for the cat and the dog, concatenated together with the string ‘with’ between them. b, The meaning naturally decomposes into two features, corresponding to colour and shape. The form ‘blue square’ decomposes systematically into forms for the colour and the shape, concatenated together.

We wish to explain why human languages are systematic, why they decompose meanings in the way they do, and why they combine strings in the way they do. In particular, meanings are decomposed in a way that seems natural to humans (that is, like Fig. 1a and not Fig. 1b), a property we call ‘naturalness’. Also, strings are usually combined by concatenation (that is, like Fig. 1a and not like Fig. 1c), or more generally by some process that keeps relevant parts of the string relatively close together. We call this property ‘locality’.

Influential accounts have held that human language is systematic because language learners need to generalize to produce forms for never-before-seen meanings23,24,25,26. Such accounts successfully motivate systematicity in the abstract sense, but on their own they do not explain naturalness and locality. However, a theory of systematicity must have something to say about these properties, because if we are free to choose any arbitrary functions ‘×’ and ‘⋅’, then any function L can be considered systematic in the sense of equation (1), and the idea of systematicity becomes vacuous27.

In existing work, naturalness and locality are explained via (implicit or explicit) inductive biases built into language learners23,28,29,30,31,32,33,34,35 or stipulations about the mental representation or perception of meanings36,37,38,39,40. By contrast, we aim to explain natural local systematicity in language from maximally general principles, without any assumptions about the mental representation of meaning, and with extremely minimal assumptions about the structure of forms—only that they are ultimately expressed as one-dimensional sequences of discrete symbols.

Predictive information

We measure the complexity of sequential prediction using predictive information, which is the amount of information that any predictor must use about the past of a stochastic process to predict its future (below, we assume familiarity with information-theoretic quantities of entropy and mutual information41). Given a stationary stochastic process generating a stream of symbols …, Xt−1, Xt, Xt+1, …, we split it into ‘the past’ \({X}_{{\rm{past}}}\), representing all symbols up to time t, and ‘the future’ \({X}_{{\rm{future}}}\), representing all symbols at time t or after. The predictive information or excess entropy18,19 E is the mutual information between the past and the future:

$$E={{\rm{I}}}[{X}_{{\rm{past}}}:{X}_{{\rm{future}}}].$$
(2)

We calculate the predictive information of a language L as the predictive information of the stream of letters generated by repeatedly sampling meanings \(m\in {\mathcal{M}}\) from a source distribution, translating them to strings as s = L(m) and concatenating them with a delimiter in between.

Predictive information can be calculated in a simple way that gives intuition about its behaviour. Let hn represent the n-gram entropy of a process, that is, the average entropy of a symbol given a window of n − 1 previous symbols:

$${h}_{n}={{\rm{H}}}[{X}_{t}| {X}_{t-n+1},\ldots ,{X}_{t-1}].$$
(3)

As the window size increases, the n-gram entropy decreases to an asymptotic value called the entropy rate h. The predictive information represents the convergence to the entropy rate,

$$E=\mathop{\sum }\limits_{n=1}^{\infty }\left({h}_{n}-h\right),$$
(4)

as illustrated in Fig. 3. This calculation reveals that predictive information is low when symbols can be predicted accurately on the basis of local contexts, that is, when hn is close to h for small n.
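
As a concrete illustration of equation (4), the following Python sketch estimates hn from plug-in n-gram counts over a long sample sequence and approximates E by truncating the sum at a maximum window size. The truncation and the use of the last estimated hn as a stand-in for the entropy rate h are simplifying assumptions, not the estimators used to produce the figures.

```python
import math
from collections import Counter

def ngram_entropy(seq, n):
    """h_n = H[X_t | X_{t-n+1}, ..., X_{t-1}], estimated as the difference of
    block entropies H(n) - H(n-1) from plug-in counts over a long sample."""
    def block_entropy(k):
        if k == 0:
            return 0.0
        counts = Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())
    return block_entropy(n) - block_entropy(n - 1)

def predictive_information(seq, max_n=8):
    """Approximate E = sum_n (h_n - h), truncating at max_n and using
    h_{max_n} as a stand-in for the entropy rate h."""
    hs = [ngram_entropy(seq, n) for n in range(1, max_n + 1)]
    h_rate = hs[-1]
    return sum(hn - h_rate for hn in hs[:-1])

# A perfectly alternating sequence needs exactly one past symbol to predict
# its future, so its predictive information is close to 1 bit.
print(round(predictive_information(list("ab" * 2000)), 3))
```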

Fig. 3: Schematic calculation of predictive information as the sum of n-gram entropies hn minus the asymptotic entropy rate h.

Simulations

The following simulations show that, when languages minimize predictive information, they express approximately independent features systematically and locally in a way that corresponds to words and phrases in natural language.

Systematic expression of independent features

Consider a set of meanings consisting of the outcomes of three weighted coin flips. In a natural systematic language, we would expect each string to have contiguous ‘words’ corresponding to the outcome of each individual coin, whereas a holistic language would have no such structure, as shown in the examples in Fig. 4a. It turns out that, for these example languages, the natural systematic one has lower predictive information, as shown in Fig. 4b. In fact, among all possible unambiguous length-3 binary languages, predictive information is minimized exclusively in the systematic languages, as shown in Fig. 4c.

Fig. 4: Simulations of languages for coin-flip distributions.

a, Two unambiguous languages for meanings consisting of three weighted coin flips. In the systematic language, each letter corresponds to the outcome from one coin flip. In the holistic language, there is no natural systematic relationship between the form and the meaning. b, Calculation of predictive information for the source and two languages in a. The systematic language has lower predictive information. c, Predictive information of all bijective mappings from meanings to length-3 binary strings, for the meanings and source in a. Languages are ordered by predictive information and coloured by the number of coin flips expressed systematically: 3 for a fully systematic language and 0 for a fully holistic language. The inset box zooms in on the region of low predictive information. d, Languages used in e along with an example source, which has mutual information I[M2: M3] ≈ 0.18 bits. e, Predictive information of various languages for varying levels of mutual information between coin flips M2 and M3 (see text). Zero mutual information corresponds to b and c. The ‘natural’ language expresses M2 and M3 together holistically. The ‘unnatural’ language expresses M1 and M2 together holistically.

Intuitively, the reason systematic languages minimize predictive information here is that the features of meaning expressed in each individual letter are independent of each other, and so there is no statistical dependence among letters in the string. The general pattern is that an unambiguous language that minimizes predictive information will find features that have minimal mutual information and express them systematically. See Supplementary Section A for formal arguments to this effect.
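
The following Monte-Carlo sketch reproduces the flavour of Fig. 4a,b: three weighted coin flips are coded either systematically (one letter per coin) or with a randomly scrambled form–meaning mapping standing in for a holistic language, and estimated predictive information is compared. It assumes the predictive_information estimator sketched above is in scope; the particular scrambled mapping and the sample size are illustrative choices.

```python
import random
random.seed(0)

p = [2/3, 2/3 + 0.05, 2/3 + 0.10]   # coin biases, as in equation (5) of the Methods

def sample_meaning():
    """Draw three weighted coin flips."""
    return tuple(int(random.random() < pi) for pi in p)

meanings = [(i, j, k) for i in (0, 1) for j in (0, 1) for k in (0, 1)]
systematic = {m: "".join(str(b) for b in m) for m in meanings}   # one letter per coin

strings = list(systematic.values())
random.shuffle(strings)                     # scramble the form-meaning mapping
holistic = dict(zip(meanings, strings))     # stands in for a holistic code

def stream(language, n_utterances=100_000):
    """Concatenate coded meanings with a delimiter '#' between them."""
    return "#".join(language[sample_meaning()] for _ in range(n_utterances))

for name, lang in [("systematic", systematic), ("holistic", holistic)]:
    print(name, round(predictive_information(list(stream(lang)), max_n=5), 3))
```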

Holistic expression of correlated components

What happens to predictive information when the source distribution cannot be expressed in terms of fully independent features? In that case, it is better to express the more correlated features holistically, without systematic structure. This holistic mapping is what we find in natural language for individual words (or, more precisely, morphemes), according to the principle of arbitrariness of the sign42. For example, the word ‘cat’ has no identifiable parts that systematically correspond to features of its meaning. Furthermore, as we will discuss below, morphemes in language typically encode categories whose semantic features are highly correlated with each other43.

We demonstrate this effect in simulations by varying the coin-flip scenario above. Denote the three coin flips as M1, M2 and M3. Imagine the second and third coins are tied together, so that their outcomes M2 and M3 are correlated, as in the example in Fig. 4d. In the limit where M2 and M3 are fully correlated, these coin flips have effectively become one feature. Figure 4e shows predictive information for a number of possible languages in this setting, as a function of the mutual information between the tied coin flips M2 and M3. In the low-mutual-information regime—where M2 and M3 are nearly independent—the best language is still fully systematic. However, as mutual information increases, the best language is one that expresses the tied coin flips M2 and M3 together holistically, as a single ‘word’. An unnatural language that expresses the uncorrelated coin flips M1 and M2 holistically is much worse, as is a non-local systematic language that breaks up the ‘word’ corresponding to the correlated coin flips M2 and M3.

Locality

Next, we show that minimization of predictive information yields languages where features of meaning correspond to localized parts of strings, corresponding to words. We consider a Zipfian distribution over 100 meanings, and a language L in which forms consist of two length-4 ‘words’. We then consider scrambled languages formed by applying permutations to the string output of L. For example, if the original language expresses a meaning with two words such as L(m1 × m2) = aaaabbbb, a possible scrambled language would have L′(m1 × m2) = baaabbab. These scrambled languages instantiate possible string combination functions other than concatenation.

Calculating predictive information for all possible scrambled languages, we find that the languages in which the ‘words’ remain contiguous have the lowest predictive information, as shown in Fig. 5a. This happens because the coding procedure above creates correlations among letters within a word. When these correlated letters are separated from each other—such as when letters from another word intervene—then predictive information increases. Interestingly, not every concatenative language is better than every non-concatenative one. This corresponds to the reality of natural language, in which limited non-concatenative and non-local morphophonological processes do exist, for example, in Semitic non-concatenative morphology44.
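
A sketch of the scrambling manipulation: a concatenative two-word language for a Zipfian source, and a variant produced by applying one fixed permutation to every eight-letter form. It reuses the predictive_information estimator sketched above; the random word code, the particular permutation and the truncation level are illustrative choices rather than the exact settings used for Fig. 5a.

```python
import itertools
import random
random.seed(1)

all4 = ["".join(bits) for bits in itertools.product("01", repeat=4)]
word0 = dict(zip(range(10), random.sample(all4, 10)))   # word for the first digit feature
word1 = dict(zip(range(10), random.sample(all4, 10)))   # word for the second digit feature

def L(i):
    """Concatenative language: each digit feature of meaning i gets a contiguous length-4 word."""
    d1, d2 = divmod(i, 10)
    return word0[d1] + word1[d2]

perm = random.sample(range(8), 8)            # one fixed scrambling of letter positions

def L_scrambled(i):
    s = L(i)
    return "".join(s[j] for j in perm)

weights = [1.0 / (i + 1) for i in range(100)]            # Zipfian source over 100 meanings

def stream(code, n_utterances=100_000):
    ms = random.choices(range(100), weights=weights, k=n_utterances)
    return "#".join(code(m) for m in ms)

for name, code in [("concatenative", L), ("scrambled", L_scrambled)]:
    print(name, round(predictive_information(list(stream(code)), max_n=6), 3))
```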

Fig. 5: Simulations of codes with different orders of elements.

a, Predictive information of all string permutations of a systematic language for a Zipfian source. Languages that combine components by concatenation, marked in red, achieve the lowest predictive information. The inset zooms in on the 2,000 permutations with the lowest predictive information. b, A hierarchically structured source distribution (see text) and predictive information of all permutations of a systematic language for this source. A language is well nested when all groups of letters corresponding to groupings in the inset tree figure are contiguous. The well-nested languages achieve the lowest predictive information.

Hierarchical structure

Natural language sentences typically have well-nested hierarchical syntactic structures, of the kind generated by a context-free grammar45: for example, the sentence ‘[[the big dog] chased [a small cat]]’ has two noun phrases, indicated by brackets, which are contiguous and nested within the sentence. Minimization of predictive information creates these well-nested word orders, with phrases corresponding to groups of words that are more or less strongly correlated46. We demonstrate this effect using a source distribution defined over six random variables M1, …, M6 with a covariance structure shown in the inset of Fig. 5b: each of the variable pairs (M1, M2) and (M4, M5) is highly internally correlated; these pairs are weakly correlated with M3 and M6, respectively; and both groups of variables are very weakly correlated with each other. As above, we consider all possible permutations of a systematic code for these source variables. The codes that minimize predictive information are those that are well nested with respect to the correlation structure of the source, keeping the letters corresponding to all groups of correlated features contiguous. Further simulation results involving context-free languages are found in Supplementary Section G. For a mathematical analysis of predictive information in local and random orders for structured sources, see Supplementary Section A.

Cross-linguistic empirical results

Here, we present cross-linguistic empirical evidence that the systematic structure of language has the effect of reducing predictive information at the levels of phonotactics, morphology, syntax and semantics, compared against systems that lack natural local systematicity.

Phonotactics

Languages have restrictions on what sequences of sounds may occur within words: for example, ‘blick’ seems like a possible English word, whereas ‘bnick’ does not, even though it is pronounceable in other languages47. These systems of restrictions are called phonotactics. Here, we show that actual phonotactic systems of human languages, which involve primarily local constraints on what sounds may co-occur, result in lower predictive information compared with counterfactual phonotactic systems. We compare phonemically transcribed wordforms in vocabulary lists of 61 languages against counterfactual alternatives generated by deterministically scrambling phonemes within a word while preserving manner of articulation. This ensures that the resulting counterfactual forms are roughly possible to articulate. For example, an English word ‘fasted’ might be scrambled to form ‘sefdat’. Calculating predictive information, we find that the real vocabulary lists have lower predictive information than the counterfactual variants in all languages tested. Results for six languages with diverse sound systems are shown in Fig. 6a. Results for the remaining 55 languages are presented in Supplementary Section C.
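
The counterfactual forms can be produced with a within-word, manner-preserving scramble along the following lines. The manner classes listed here are a small illustrative inventory, not the feature system used in the study; the key property is that each phoneme is replaced only by another phoneme of the same manner class from the same word.

```python
import random

# Illustrative manner classes; the study uses full phonological transcriptions.
MANNER = {
    "p": "stop", "t": "stop", "k": "stop", "b": "stop", "d": "stop", "g": "stop",
    "f": "fricative", "s": "fricative", "v": "fricative", "z": "fricative",
    "m": "nasal", "n": "nasal",
    "a": "vowel", "e": "vowel", "i": "vowel", "o": "vowel", "u": "vowel",
}

def scramble_word(phonemes, seed=0):
    """Deterministically permute the phonemes of one word within manner classes,
    so the scrambled form has the same sequence of manner classes as the original."""
    rng = random.Random(seed)
    by_class = {}
    for ph in phonemes:
        by_class.setdefault(MANNER[ph], []).append(ph)
    for cls in by_class:
        rng.shuffle(by_class[cls])
    return [by_class[MANNER[ph]].pop(0) for ph in phonemes]

# A manner-preserving scramble of 'fasted' (the paper's example output is 'sefdat').
print("".join(scramble_word(list("fasted"))))
```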

Fig. 6: Evidence that natural languages are configured in a way that reduces predictive information, in phonotactics, morphology and syntax.

a, Predictive information calculation for phonological forms in selected languages, comparing the attested forms against forms that have been deterministically shuffled while preserving manner of articulation. b, Letter-level predictive information of noun morphology (vertical black line), compared against predictive information values for four random baselines (densities of 10,000 samples; see text). P values indicate the proportion of baseline samples with lower predictive information than the attested forms. c, Letter-level predictive information of adjective–noun pairs from 12 languages, compared with baselines. Non-local baselines always generate much higher predictive information than the attested forms and are not shown.

Morphology

Words change form to express grammatical features in a way that is often systematic. For example, the forms of the Hungarian noun shown in Fig. 7a are locally systematic with respect to case and number features. In Fig. 6b, we show that the local systematic structure of affixes for case, number, possession and definiteness in five languages has the effect of reducing predictive information when comparing against baselines that disrupt this structure. We estimate predictive information of these morphological affixes across five languages, with source distributions proportional to empirical corpus counts of the joint frequencies of grammatical features. We compare the predictive information of the attested forms against three alternatives: (1) a non-local baseline generated by applying a deterministic permutation function to each form, (2) an unnatural baseline generated by permuting the assignment of forms to meanings (features) and (3) a more controlled unnatural baseline that permutes the form–meaning mapping while preserving form length. The unnatural baselines preserve the phonotactics of the original forms; only the form–meaning relationship is changed. We generate 10,000 samples (permutations) for each of the three baselines per language.
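
The baselines can be sketched as follows. The toy paradigm uses the dummy-stem convention described in the Methods (for example, ‘Xoknak’ for the Hungarian dative plural); how the deterministic letter permutation is restricted to forms of different lengths, and the length-matched option, are assumptions made for illustration.

```python
import random

# Toy paradigm in the dummy-stem notation of the Methods ('X' stands for the stem).
forms = {
    ("Sing", "Nom"): "X",
    ("Sing", "Dat"): "Xnak",
    ("Plur", "Nom"): "Xok",
    ("Plur", "Dat"): "Xoknak",
}

def nonlocal_baseline(forms, seed=0):
    """Apply one deterministic letter permutation to every form (disrupts locality).
    Restricting the permutation to the indices present in shorter forms is an
    illustrative way of handling variable lengths."""
    rng = random.Random(seed)
    perm = list(range(max(len(f) for f in forms.values())))
    rng.shuffle(perm)
    def scramble(s):
        return "".join(s[i] for i in perm if i < len(s))
    return {features: scramble(f) for features, f in forms.items()}

def unnatural_baseline(forms, seed=0, match_length=False):
    """Permute the assignment of forms to feature bundles (disrupts naturalness).
    With match_length=True, forms are swapped only within groups of equal length."""
    rng = random.Random(seed)
    groups = {}
    for features, f in forms.items():
        key = len(f) if match_length else 0
        groups.setdefault(key, []).append(features)
    new = {}
    for group in groups.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        for old, donor in zip(group, shuffled):
            new[old] = forms[donor]
    return new

print(nonlocal_baseline(forms))
print(unnatural_baseline(forms))
```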

Fig. 7: Examples of systematic morphology and syntax, and baselines used in experiments.

a, Forms of the Hungarian noun ‘ember’ meaning ‘person’, along with examples of the unnatural and non-local baseline used in Fig. 6b. An additional 231 forms are not shown. The ‘Frequency’ column illustrates the joint frequency of grammatical features in the Hungarian Szeged UD corpus100,106. b, English forms for the given meanings, along with frequencies from the English Common Crawl web corpus107. Example unnatural and non-local baseline forms are shown.

Across the languages, we find that the attested forms have lower predictive information than the majority of samples of the baselines. The weakest effect is in Latin, which also has the most fusional and least systematic morphology48. Note that Arabic nouns often show non-concatenative morphology in the form of so-called broken plurals: for example, the plural of the loanword ‘film’ meaning ‘film’ is ’aflām. This pattern is represented in the forms used to generate Fig. 6b, and yet Arabic noun forms still have lower predictive information than the majority of baseline samples. This suggests that the limited form of non-concatenative morphology present in Arabic is still consistent with the idea that languages are configured in a way that keeps predictive information low.

Syntax

Phrases such as ‘blue square’ have natural local systematicity, as shown in Fig. 7b. We compare real adjective–noun combinations in corpora of 12 languages against unnatural and non-local baselines generated in the same way as in the morphology study: permuting the letters within a form to disrupt locality, or permuting the assignment of forms to meanings to disrupt naturalness. We estimate the probability of a meaning as proportional to the frequency of the corresponding adjective–noun pair. Results are shown in Fig. 6c. The real adjective–noun pairs have lower predictive information than a large majority of baselines across all languages tested.

Word order

In an English noun phrase such as ‘the three cute cats’, the elements Determiner (D, ‘the’), Numeral (N, ‘three’), Adjective (A, ‘cute’) and Noun (n, ‘cats’) are combined in the order D–N–A–n. This order varies across languages—for example, Spanish has D–N–n–A (‘los tres gatos lindos’)—but certain orders are more common than others49. We aim to explain the cross-linguistic distribution of these orders through reduction of predictive information, which drives words that are statistically predictive of each other to be close to each other, an intuition shared with existing models of adjective order40,46,50. To do so, we estimate source probabilities for noun phrases (consisting of single head lemmas for a noun along with an optional adjective, numeral and determiner) based on corpus frequencies. We then calculate predictive information at the word level (treating words as single atomic symbols) for all possible permutations of D–N–A–n. Predictive information is symmetric with respect to time reversal, so we cannot distinguish orders such as D–N–A–n from n–A–N–D and so on. As shown in Fig. 8a, the orders with lower predictive information are also the orders that are more frequent cross-linguistically. A number of alternative source distributions also yield this downward correlation, as shown in Supplementary Section D.
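
The word-level computation can be sketched as follows, using the four example noun phrases from the Methods table as a toy source (the study uses the full German GSD UD counts) and the predictive_information estimator sketched above. Words are treated as atomic symbols and absent slots are simply dropped.

```python
import itertools
import random
random.seed(2)

# Toy source: the example rows from the Methods table.
# Slots are (Determiner, Numeral, Adjective, Noun).
phrases = {
    ("die", None, None, "Hand"): 234,
    ("ein", None, "alt", "Kind"): 4,
    (None, "drei", None, "Buch"): 2,
    ("ein", None, "einzigartig", "Parfümeur"): 1,
}

def stream(order, n_phrases=50_000):
    """Sample phrases, reorder the four slots, drop absent slots and join the
    word sequences with a delimiter token '#'."""
    keys = list(phrases)
    weights = [phrases[k] for k in keys]
    tokens = []
    for k in random.choices(keys, weights=weights, k=n_phrases):
        tokens += [k[i] for i in order if k[i] is not None]
        tokens.append("#")
    return tokens

results = {order: predictive_information(stream(order), max_n=4)
           for order in itertools.permutations(range(4))}
for order, e in sorted(results.items(), key=lambda kv: kv[1])[:5]:
    print(order, round(e, 3))      # (0, 1, 2, 3) corresponds to D-N-A-n
```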

Fig. 8: Evidence that word order and lexical semantics are configured in ways that reduce predictive information.

a, Typological frequency of noun phrase orders (number of unrelated language genera showing the given order49) as a function of predictive information. More frequent orders have lower predictive information. The blue line shows a linear regression predicting log typological frequency from predictive information. Error bars indicate a 95% confidence interval of the slope of this regression. The negative correlation is significant with Pearson’s R = −0.69 and P = 0.013. b, Top: pairwise mutual information of semantic features from the Lancaster Sensorimotor Norms53 in addition to a number feature, as indicated by plural morphology. The number feature is expressed systematically; all others are holistic. Bottom: pairwise mutual information values for Lancaster Sensorimotor Norm features across and within words, for pairs of verbs and their objects.

Lexical semantics

Considering a word such as ‘cats’, all the semantic features of a cat (furriness, mammalianness and so on) are expressed holistically in the morpheme ‘cat’, while the feature of numerosity is separated into the plural marker ‘–s’. Plural marking like this is common across languages51,52. From reduction of predictive information, we expect relatively uncorrelated components of meaning to be expressed systematically, and relatively correlated components to be expressed together holistically. Thus, we hold that numerosity is selected to be expressed systematically in a separate morpheme because it is relatively independent of the other features of nouns, which are in turn highly correlated with each other. Our theory thus derives the intuition that natural categories arise from the correlational structure of experience43.

We validate this prediction in a study of semantic features in English, using the Lancaster Sensorimotor Norms53 to provide semantic features for English words and using the English Universal Dependencies (UD) corpus to provide a frequency distribution over words. The Lancaster Sensorimotor Norms provide human ratings for words based on sensorimotor dimensions, such as whether they involve the head or arms. As shown in Fig. 8b (top), we find that the semantic norm features are highly correlated with each other, and relatively uncorrelated with numerosity, as predicted by the theory.

For the same reason, the theory also predicts that semantic features should be more correlated within words than across words. In Fig. 8b (bottom), we show within-word and cross-word correlations of the semantic norm features for pairs of verbs and their objects taken from the English UD corpus. As predicted, the across-word correlations are weaker. Correlations based on features drawn from other semantic norms are presented in Supplementary Section E.
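
The pairwise mutual information behind Fig. 8b can be computed along the following lines, assuming each word has been reduced to binary features and assigned a corpus frequency as described in the Methods. The tiny word list and feature values here are invented for illustration.

```python
import math
from collections import Counter

# Invented toy data: word -> (corpus frequency, binarized features).
words = {
    "cat":   (50, {"animate": 1, "furry": 1, "plural": 0}),
    "cats":  (30, {"animate": 1, "furry": 1, "plural": 1}),
    "rock":  (15, {"animate": 0, "furry": 0, "plural": 0}),
    "rocks": (5,  {"animate": 0, "furry": 0, "plural": 1}),
}

def mutual_information(f1, f2):
    """I[f1; f2] in bits over the word distribution, weighting words by frequency."""
    total = sum(freq for freq, _ in words.values())
    joint = Counter()
    for freq, feats in words.values():
        joint[(feats[f1], feats[f2])] += freq / total
    p1, p2 = Counter(), Counter()
    for (a, b), pr in joint.items():
        p1[a] += pr
        p2[b] += pr
    return sum(pr * math.log2(pr / (p1[a] * p2[b]))
               for (a, b), pr in joint.items() if pr > 0)

print(round(mutual_information("animate", "furry"), 3))   # highly correlated features
print(round(mutual_information("animate", "plural"), 3))  # nearly independent of number
```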

Discussion

Our results underscore the fundamental roles of prediction and memory in human cognition and provide a link between the algebraic structure of human language and information-theoretic concepts used in machine learning and neuroscience. Our work joins the growing body of information-theoretic models of human language based on resource-rational efficiency54,55,56,57,58,59.

Language models

Large language models are based on neural networks trained to predict the next token of text given previous tokens. Our results suggest that language is structured in a way that makes this next-token prediction relatively easy, by minimizing the amount of information that needs to be extracted from the previous tokens to predict the following tokens. Although it has been claimed that large language models have little to tell us about the structure of human language—because their architectures do not reflect formal properties of grammars and because they can putatively learn unnatural languages as well as natural ones60,61,62—our results suggest that these models have succeeded so well precisely because natural language is structured in a way that makes their prediction task relatively simple. Indeed, neural sequence architectures struggle to learn languages that lack information locality63,64.

Machine learning

Our results establish a connection between the structure of human language and ideas from machine learning. In particular, minimization of mutual information (a technique known as independent components analysis, ICA65,66) is widely deployed to create representations that are ‘disentangled’ or compositional67, and to detect object boundaries in images, under the assumption that pixels belonging to the same object exhibit higher statistical dependence than pixels belonging to different objects68. (Although general nonlinear ICA with real-valued outputs does not yield unique solutions69, we have found above that minimization of predictive information does find useful structure in our setting, with discrete string-valued outputs and a deterministic function mapping meaning to form.) We propose that human language follows a similar principle: it reduces predictive information, which amounts to performing a generalized sequential ICA on the source distribution on meanings, factoring it into groups of relatively independent components that are expressed systematically as words and phrases, with more statistical dependence within these units than across them. This provides an explanation for why ICA-like objectives yield representations that are intuitively disentangled, compositional, or interpretable: they yield the same kinds of concepts that we find encoded in natural language.

Neuroscience

Similarly, neural codes have been characterized as maximizing information throughput subject to information-theoretic and physiological constraints70,71, including explicit constraints on predictive information72,73. These models predict that, in many cases, neural codes are decorrelated: distinct neural populations encode statistically independent components of sensory input74. Our results suggest that language operates on similar principles: it expresses meanings in a way that is temporally decorrelated. This view is compatible with neuroscientific evidence on language processing: minimization of predictive information (while holding overall predictability constant) equates to maximization of local predictability of the linguistic signal, a driver of the neural response to language10,75.

Information theory and language

Previous work76 derived locality in natural language from a related information-theoretic concept, the memory–surprisal trade-off or predictive information bottleneck curve, which describes the best achievable sequential predictability as a function of memory usage77. The current theory is a simplification that looks at only one part of the curve: predictive information is the minimal memory at which sequential predictability is maximized. A more complete information-theoretic view of language may have to consider the whole curve.

We join existing work attempting to explain linguistic structure on the basis of information-theoretic analysis of language as a stochastic process, for example, the study of lexical scaling laws as a function of redundancy and non-ergodicity in text78. Other work on predictive information in language has focused on the long-range scaling of the n-gram entropy in connected texts, with results seeming to imply that the predictive information diverges79,80. By contrast, we have focused on only single utterances, effectively considering only relatively short-range predictive information.

Cognitive status of predictive information

Predictive information is a fundamental measure of complexity, which may manifest explicitly or implicitly in various ways in the actual mechanisms of language production, comprehension and learning. For example, in a recent model of online language comprehension81, comprehenders predict upcoming words on the basis of memory representations that are constrained to store only a small number of words. The fundamental limits of predictive information apply implicitly in this model because comprehenders’ predictions cannot be more accurate than if they stored an equivalent amount of predictive information. As another example, a model of language production based on short stored chunks46 would effectively produce language with low predictive information, because these chunks would be relatively independent of each other, while predictive relationships inside the stored chunks would be preserved. Predictive information has also been linked to difficulty of learning: processes containing more predictive information require more parameters and data to be learned18, and any learner with limited ability to learn long-term dependencies will have an effective inductive bias towards languages with low predictive information. Predictive information is not meant as a complete model of the constraints on language, which would certainly involve factors beyond predictive information as well as separate, potentially competing pressures from comprehension and production82.

Relatedly, while we have shown that natural language is configured in a way that keeps predictive information low, we have not speculated on how languages come to be configured in this way, in terms of language evolution and change. We believe there are multiple pathways for this to happen. For example, efficiency pressures in individual interactions could give rise to overall efficient conventions83, or memory limits in learning84,85 could cause learners to form low-predictive-information generalizations from their input. Identifying the causal mechanisms that control predictive information in language is a critical topic for future work.

Linguistics

Our theory of linguistic systematicity is independent of theoretical assumptions about mental representations of grammars, linguistic forms or the meanings expressed in language. Predictive information is a function only of the probability distribution on forms, seen as one-dimensional sequences of symbols unfolding in time. This independence from representational assumptions is an advantage, because there is as yet no consensus about the basic nature of the mental representations underlying human language86,87.

Our results reflect and formalize a widespread intuition about human language, first formulated as Behaghel’s Law88: ‘that which is mentally closely related is also placed close together’. For example, words are contiguous units and the order of morphemes within them is determined by a principle of relevance89,90, and important aspects of word order across languages have been explained in terms of dependency locality, the principle that syntactically linked words are close91,92,93,94.

A constraint on predictive information predicts information locality: elements of a linguistic form should be close to each other when they predict each other50. We propose that information locality subsumes existing intuitive locality ideas. Thus, because words have a high level of statistical interpredictability among their parts95, they are mostly contiguous, and as a residual effect of this binding force, related words are also close together. Furthermore, we have found that the same formal principle predicts the existence of linguistic systematicity and the way that languages divide the world into natural kinds37,43.

Limitations

Much work is required to push our hypothesis to its limit. We have assumed throughout that languages are one-to-one mappings between form and meaning; the behaviour of ambiguous or non-deterministic codes, where ambiguity might trade off with predictive information, may yield additional insight. Furthermore, we have examined predictive information only within isolated utterances. It remains to be seen whether reduction of predictive information, applied at the level of many connected utterances, would be able to explain aspects of discourse structure such as the hierarchical organization of topics and topic–focus structure96.

One known limitation of our theory is that predictive information is symmetric with respect to time reversal, so (at least when applied at the utterance level) it cannot explain time-asymmetric properties of language such as the pattern of ‘accessible’ (frequent, animate, definite and given) words appearing earlier within utterances than inaccessible ones97,98. There is also the fact that non-local and non-concatenative structures do exist in language, for example, long-term coreference relationships among discourse entities, and long-distance filler–gap dependencies, which would seem to contravene the idea that predictive information is constrained. An important area for future research will be to determine what effect these structures really have on predictive information, and what other constraints on language might explain them.

Methods

Constructing a stochastic process from a language

We define a language as a mapping from a set of meanings to a set of strings, \(L:{\mathcal{M}}\to {\Sigma }^{* }\). To define predictive information of a language, we need a way to derive a stationary stochastic process generated by that language. We use the following mathematical construction that generates an infinite stream of symbols: (1) meanings m ~ pM are sampled i.i.d. from the source distribution pM, (2) each meaning is translated into a string as s = L(m), and (3) the strings s are concatenated end-to-end in both directions with a delimiter # ∉ Σ between them. Finally, a string is chosen with probability reweighted by its length, and a time index t (relative to the closest delimiter to the left) is selected uniformly at random within this form.

This construction has the effect of zeroing out any mutual information between symbols with the delimiter between them. Thus, when we compute n-gram statistics, we can treat each form as having infinite padding symbols to the left and right. This is the standard method for collecting n-gram statistics in natural language processing99.
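
For small languages this construction can be implemented exactly: each form contributes its letters plus one following delimiter as stream positions, weighted by the probability of its meaning, and windows are padded with delimiters so that no dependence crosses a form boundary. In the sketch below, truncating the sum at a maximum window size is a simplifying assumption.

```python
import math
from collections import Counter

DELIM = "#"

def block_distribution(language, p_meaning, n):
    """Exact distribution over length-n windows of the delimited stream.
    Each form contributes its letters plus the following delimiter as positions,
    weighted by the probability of its meaning; windows are padded with
    delimiters so that no dependence crosses a form boundary."""
    z = sum(p_meaning[m] * (len(language[m]) + 1) for m in p_meaning)
    blocks = Counter()
    for m, p in p_meaning.items():
        padded = DELIM * (n - 1) + language[m] + DELIM
        for t in range(len(language[m]) + 1):
            blocks[padded[t:t + n]] += p / z
    return blocks

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def excess_entropy(language, p_meaning, max_n=8):
    """E ~ sum_n (h_n - h), with h_n = H(n-blocks) - H((n-1)-blocks) and the
    entropy rate h approximated by h_{max_n}."""
    H = [0.0] + [entropy(block_distribution(language, p_meaning, n))
                 for n in range(1, max_n + 1)]
    h = [H[n] - H[n - 1] for n in range(1, max_n + 1)]
    return sum(hn - h[-1] for hn in h[:-1])

# Small example: a systematic two-letter code for four meanings.
language = {(0, 0): "aa", (0, 1): "ab", (1, 0): "ba", (1, 1): "bb"}
p_meaning = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}
print(round(excess_entropy(language, p_meaning, max_n=4), 3))
```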

Three-feature source simulation

For Fig. 4b,c, the source distribution is distributed as a product of three Bernoulli distributions:

$$M \sim {\rm{Bernoulli}}\left(\frac{2}{3}\right)\times {\rm{Bernoulli}}\left(\frac{2}{3}+\varepsilon \right)\times {\rm{Bernoulli}}\left(\frac{2}{3}+2\varepsilon \right),$$
(5)

with ε = 0.05.

For Fig. 4e, we need to generate distributions of the form p(M) = p(M1) × p(M2, M3) while varying the mutual information I[M2: M3]. We start with the source from equation (5) (whose components are here denoted pindep) and mix it with a source that creates a correlation between M2 and M3:

$$\begin{array}{l}{p}_{\alpha }(M=ijk)={p}_{{\rm{indep}}}({M}_{1}=i)\\\times \left[\left(1-\alpha \right)\left({p}_{{\rm{indep}}}({M}_{2}=j)\times {p}_{{\rm{indep}}}({M}_{3}=k)\right)+\frac{\alpha }{2}{\delta }_{jk}\right],\end{array}$$
(6)

with δjk = 1 if j = k and 0 otherwise. The mixture weight α controls the level of mutual information, ranging from 0 at α = 0 to at most 1 bit at α = 1. A more comprehensive study of the relationship between feature correlation, systematicity and predictive information is given in Supplementary Section B, which examines systematic and holistic codes for a comprehensive grid of possible distributions on the simplex over four outcomes.
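
A sketch of equation (6): the joint distribution over (M1, M2, M3) as a dictionary, with the M1 marginal held fixed and the mixture weight α interpolating between independent and fully tied M2 and M3.

```python
import itertools

EPS = 0.05
P_ONE = [2/3, 2/3 + EPS, 2/3 + 2 * EPS]      # P(M_i = 1), as in equation (5)

def p_indep(i, value):
    """Marginal probability of coin i taking the given value under equation (5)."""
    return P_ONE[i] if value == 1 else 1 - P_ONE[i]

def p_alpha(alpha):
    """Joint distribution over (M1, M2, M3) following equation (6)."""
    dist = {}
    for i, j, k in itertools.product((0, 1), repeat=3):
        tied = alpha / 2 if j == k else 0.0                      # (alpha/2) * delta_jk
        tail = (1 - alpha) * p_indep(1, j) * p_indep(2, k) + tied
        dist[(i, j, k)] = p_indep(0, i) * tail
    return dist

d = p_alpha(0.5)
print(round(sum(d.values()), 6))   # the mixture is properly normalized: 1.0
```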

Locality simulation

For the simulation shown in Fig. 5a, we consider a source over 100 objects labelled {m00, m01, …, m99}, following a Zipfian distribution \(p(M={m}^{i})\propto {\left(i+1\right)}^{-1}\). We consider a language based on a decomposition of the meanings based on the digits of their index, with for example m89 decomposing into features as \({m}_{1}^{8}\times {m}_{2}^{9}\). Each utterance decomposes into two ‘words’ as L(m1 × m2) = L(m1) L(m2), where the word for each feature mk is a random string in {0, 1}4, maintaining a one-to-one mapping between features mk and words.

Hierarchy simulation

For the simulation shown in Fig. 5b, we consider a source M over 56 = 15,625 meanings, which may be expressed in terms of six random variables \(\left\langle {M}_{1},{M}_{2},{M}_{3},{M}_{4},{M}_{5},{M}_{6}\right\rangle\) each over five outcomes, with a probability distribution as follows:

$$\begin{array}{l}p(M)=\alpha \,q({M}_{1},{M}_{2},{M}_{3},{M}_{4},{M}_{5},{M}_{6})+(1-\alpha )\\ \quad\times \left[\beta \,q({M}_{1},{M}_{2},{M}_{3})+(1-\beta )\left[\gamma \,q({M}_{1},{M}_{2})+(1-\gamma )\,q({M}_{1})\,q({M}_{2})\right]q({M}_{3})\right]\\ \quad\times \left[\beta \,q({M}_{4},{M}_{5},{M}_{6})+(1-\beta )\left[\gamma \,q({M}_{4},{M}_{5})+(1-\gamma )\,q({M}_{4})\,q({M}_{5})\right]q({M}_{6})\right],\end{array}$$
(7)

where α = 0.01, β = 0.20 and γ = 0.99 are coupling constants, and each q() is a Zipfian distribution as above. The coupling constants control the strengths of the correlations shown in Fig. 5b.
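
Equation (7) can be instantiated as follows. Treating each multi-argument q(⋅) as a single Zipfian distribution over the corresponding product space is an assumption made for illustration; any family of joint Zipfian couplings with the stated nesting would serve the same purpose.

```python
import itertools
from functools import lru_cache

ALPHA, BETA, GAMMA = 0.01, 0.20, 0.99        # coupling constants from the text
K = 5                                        # outcomes per variable

@lru_cache(maxsize=None)
def zipf(n):
    """Zipfian distribution over n outcomes."""
    w = [1.0 / (i + 1) for i in range(n)]
    z = sum(w)
    return [x / z for x in w]

def q(*values):
    """Joint Zipfian over the product space of its arguments (an illustrative
    reading of q; outcomes are indexed in mixed radix base K)."""
    index = 0
    for v in values:
        index = index * K + v
    return zipf(K ** len(values))[index]

def p(m):
    """Equation (7): a hierarchically coupled source over six variables."""
    m1, m2, m3, m4, m5, m6 = m
    left = BETA * q(m1, m2, m3) + (1 - BETA) * (
        GAMMA * q(m1, m2) + (1 - GAMMA) * q(m1) * q(m2)) * q(m3)
    right = BETA * q(m4, m5, m6) + (1 - BETA) * (
        GAMMA * q(m4, m5) + (1 - GAMMA) * q(m4) * q(m5)) * q(m6)
    return ALPHA * q(*m) + (1 - ALPHA) * left * right

total = sum(p(m) for m in itertools.product(range(K), repeat=6))
print(round(total, 6))   # sums to 1.0
```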

Phonotactics

We assume a uniform distribution over forms found in WOLEX. Supplementary Section F shows results for four languages using corpus-based word frequency estimates to form the source distribution, with similar results.

Morphology

We estimate the source distribution on grammatical features (number, case, possessor and definiteness) using the feature annotations from UD corpora, summing over all nouns, with add-1/2 smoothing. The dependency treebanks are drawn from UD v2.8100: for Arabic, NYUAD Arabic UD Treebank; for Finnish, Turku Dependency Treebank; for Turkish, Turkish Penn Treebank; for Latin, Index Thomisticus Treebank; for Hungarian, Szeged Dependency Treebank. Forms are represented with a dummy symbol ‘X’ standing for the stem, and then orthographic forms for suffixes, such as ‘Xoknak’ for the Hungarian dative plural. For Hungarian, Finnish and Turkish, we use the forms corresponding to back unrounded vowel harmony. For Latin, we use first-declension forms. For Arabic, we use regular masculine triptote forms with a broken plural; to do so, we represent the root using three dummy symbols, and the plural using a common ‘broken’ form101, with, for example, ‘XaYZun’ for the nominative indefinite singular and ‘’aXYāZun’ for the nominative indefinite plural. Results using an alternate broken plural form ‘XiYāZun’ are nearly identical.

Adjective–noun pairs

From UD corpora, we extract adjective–noun pairs, defined as a head wordform with part-of-speech ‘NOUN’ modified by an adjacent dependent wordform with relation ‘amod’ and part-of-speech ‘ADJ’. The forms over which predictive information is computed consist of the pair of adjective and noun from the corpus, in their original order, in original orthographic form with a whitespace between them. The source distribution is directly proportional to the frequencies of the forms.
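
The extraction can be sketched directly over the standard ten-column CoNLL-U format (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the file path is a placeholder and the helper names are illustrative.

```python
from collections import Counter

def sentences(path):
    """Yield sentences from a CoNLL-U file as lists of 10-column token rows."""
    sent = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                if sent:
                    yield sent
                sent = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if cols[0].isdigit():        # skip multiword-token and empty-node rows
                    sent.append(cols)
    if sent:
        yield sent

def adjective_noun_pairs(path):
    """Count adjacent ADJ and NOUN wordforms linked by an 'amod' relation,
    in their original order and orthography, separated by a space."""
    counts = Counter()
    for sent in sentences(path):
        by_id = {int(tok[0]): tok for tok in sent}
        for tok in sent:
            tid, form, upos, head, rel = int(tok[0]), tok[1], tok[3], int(tok[6]), tok[7]
            if upos == "ADJ" and rel == "amod" and head in by_id:
                noun = by_id[head]
                if noun[3] == "NOUN" and abs(head - tid) == 1:
                    pair = (form, noun[1]) if tid < head else (noun[1], form)
                    counts[" ".join(pair)] += 1
    return counts

# counts = adjective_noun_pairs("en_ewt-ud-train.conllu")   # placeholder file path
```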

Noun phrase order

The source distribution on noun phrases is estimated from the empirical frequency of noun phrases in the German GSD UD corpus, which has the largest number of such noun phrases among the UD corpora. To estimate this source, we define a noun phrase as a head lemma of part-of-speech ‘NOUN’ along with the head lemmas for all dependents of type ‘amod’ (with part-of-speech ‘ADJ’), ‘nummod’ (with part-of-speech ‘NUM’) and ‘det’ (with part-of-speech ‘DET’). We extract these noun phrase forms from the corpus. When a noun phrase has multiple adjectives, one of the adjectives is chosen randomly and the others are discarded. The result is counts of noun phrases of the form below:

Determiner   Numeral   Adjective     Noun        Count
die                                  Hand          234
ein                    alt           Kind            4
             drei                    Buch            2
ein                    einzigartig   Parfümeur       1

The source distribution is directly proportional to these counts. We then compute predictive information at the word level over the attested noun phrases for all possible permutations of determiner, numeral, adjective and noun. Typological frequencies are as given by ref. 49.

Semantic features

We binarize the Lancaster Sensorimotor Norms53 by recoding each norm as 1 if it exceeds the mean value for that feature across all words, and 0 otherwise. Word frequencies are calculated by maximum likelihood based on lemma frequencies in the concatenation of the English GUM102, GUMReddit103 and EWT104 corpora from UD 2.8. The ‘Number’ feature is calculated based on the value of the ‘Number’ feature in the UD annotations. Verb–object pairs are identified as a head wordform with part-of-speech ‘VERB’ with a dependent wordform of relation ‘obj’ and part-of-speech ‘NOUN’.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.