The neural basis for uncertainty processing in hierarchical decision making

Wang, Mien Brabeeba; Lynch, Nancy; Halassa, Michael M.

doi:10.1038/s41467-025-63994-y

Download PDF

Article
Open access
Published: 16 October 2025

The neural basis for uncertainty processing in hierarchical decision making

Nature Communications volume 16, Article number: 9096 (2025) Cite this article

15 Altmetric
Metrics details

Subjects

Abstract

Hierarchical decisions in natural environments require processing uncertainty across multiple levels, but existing models struggle to explain how animals perform flexible, goal-directed behaviors under such conditions. Here we introduce CogLinks, biologically grounded neural architectures that combine corticostriatal circuits for reinforcement learning and frontal thalamocortical networks for executive control. Through mathematical analysis and targeted lesion, we show that these systems specialize in different forms of uncertainty, and their interaction supports hierarchical decisions by regulating efficient exploration, and strategy switching. We apply CogLinks to a computational psychiatry problem, linking neural dysfunction in schizophrenia to atypical reasoning patterns in decision making. Overall, CogLink fills an important gap in the computational landscape, providing a bridge from neural substrates to higher cognition.

Introduction

Environments and behaviors are often hierarchically organized, requiring animals to make decisions that integrate information from multiple levels. In complex tasks, animals must evaluate both immediate and broader causes to adapt their actions, especially when an unexpected outcome arises. Such outcomes can be ambiguous, as the brain must determine whether they result from random variability, a suboptimal strategy, or a fundamental change in the environment-a critical process for selecting the most effective course of action. This challenge often involves hierarchical reasoning, where the brain not only processes uncertainties at different levels but also integrates them to form coherent strategies.

For example, if a conversation with a new colleague feels awkward, one might question whether the topic is poorly chosen or whether the colleague is simply having a bad day. Disambiguating these possibilities is crucial for selecting the appropriate response. If the discomfort stems from a poor topic choice, switching to a different subject might improve the interaction. However, if the colleague’s disengagement reflects an underlying mood, a better approach might be to pause and revisit the conversation another day. This decision-making process relies on hierarchical inference, where lower-level variables (such as topic preferences) are interpreted within the broader context of higher-level states (such as mood or personal circumstances). This distinction becomes easier with a familiar colleague, as prior experience reduces uncertainty about their preferences, making it more likely that disengagement is attributed to mood rather than topic choice.

Since both conversational preferences and emotional states are latent variables that cannot be directly observed, the brain must infer their values while also estimating the uncertainty associated with each. For instance, one must assess both the intrinsic appeal of a topic, such as the Super Bowl, and the likelihood that a friend has emotionally recovered from a breakup after four months.

A fundamental challenge in neuroscience is understanding how the brain processes and integrates uncertainty across multiple hierarchical levels to drive flexible decision-making. Animal studies have demonstrated that perceptual confidence can influence higher-level contextual inference and revealed neural substrates associated with both sensory and contextual uncertainty in hierarchical decision-making tasks^1,2. However, how contextual uncertainty interacts with other types of uncertainty, such as associative or outcome uncertainty-illustrated by the example of a conversation with a new colleague-remains unclear.

Machine learning approaches have contributed to progress in addressing this question. Traditionally, normative models based on Bayesian inference have been used to solve hierarchical tasks and model the strategies animals employ^3,4,5. These models estimate uncertainty at multiple levels and use it for effective credit assignment^6,7,8. However, they pose significant limitations as tools for neuroscientific discovery. First, their explanatory power is constrained when the generative model of the environment is misspecified⁹, leaving the challenge of specifying an accurate model unresolved. Second, their components are non-neural, making it unclear whether and how they correspond to neural circuits and computations.

To address these limitations, the field has increasingly turned to neural networks trained through deep learning, which have been proposed as models of neural computation and have demonstrated exceptional performance on a range of tasks, sometimes exceeding human capabilities^10,11. However, these architectures operate as frequentist prediction models that do not explicitly account for uncertainty. As a result, they cannot estimate confidence in different task components in the way humans and animals do^8,12,13,14. These shortcomings highlight the need for a new approach to understand how brains make decisions in hierarchical environments and how uncertainty processing enables this cognitive ability.

In this study, we introduce CogLink, a neural architecture designed to bridge this gap. Fundamentally, CogLink networks are dynamical systems composed of rate neurons and share structural similarities with artificial feedforward and recurrent neural networks (Fig. 1a). However, they differ from conventional machine learning neural networks in three important ways.

**Fig. 1: CogLink bridges neural mechanisms and algorithms to optimize parameters for achieving complex cognitive tasks.**

First, CogLink networks are optimized using a multi-step procedure. Instead of using backpropagation to minimize network error, we employ an approach that leverages scale separation principles to extract a structured computational algorithm from neural dynamics, followed by mathematical analysis to determine near-optimal network connectivity parameters (Fig. 1b, c).

Second, CogLink networks incorporate biological realism by modeling specific brain systems, including known connectivity patterns, cell types, and learning rules for dynamic control (Fig. 1d).

Third, CogLinks offer greater interpretability. Because they are explicitly structured to approximate an algorithm, they allow us to directly map neural mechanisms to their functional roles, unlike traditional deep neural networks, which are often considered black boxes (Fig. 1d).

Through iterative development, we construct progressively complex CogLinks, mirroring the increasing computational complexity observed in biological evolution. Specifically, the basic network models a premotor cortico-thalamic-basal ganglia (BG) loop, emphasizing BG circuitry’s role in reinforcement learning and efficient environmental exploration, which addresses lower-level uncertainty in hierarchical environments. The augmented network incorporates an associative cortico-thalamic-BG loop, highlighting the mediodorsal thalamus (MD) and its interactions with the prefrontal cortex (PFC) to process higher-level uncertainty related to contextual inference and strategy switching.

In addition to demonstrating how partitioning uncertainty types in hierarchical environments is critical for the networks to reproduce animal behavior, CogLinks provide insights into neural mechanisms underlying complex decision-making. Specifically, our model explains findings from an accompanying study¹⁵ on human behavior and fMRI readouts and offers insights into perturbed dynamics in a mouse model relevant to schizophrenia¹⁶. To our knowledge, few existing neural frameworks simultaneously solve complex cognitive tasks while providing computational insight into neural mechanisms. We propose that CogLinks constitute an important step toward bridging this gap.

Results

Building a basic CogLink network for handling lower-level uncertainty

To illustrate lower-level uncertainty, let us revisit the example of conversing with a new colleague. Suppose you know nothing about the person and naively attribute each sigh of boredom to a suboptimal choice of topic, disregarding higher-level factors such as mood or personal circumstances. This scenario highlights two key types of lower-level uncertainty: outcome uncertainty, which may arise from factors such as variability in the person’s focus (e.g., low focus might prevent them from following certain sentences), and associative uncertainty, which reflects our lack of knowledge about the person’s preferences (greater unfamiliarity corresponds to higher associative uncertainty). Successfully navigating this interaction requires balancing exploration and exploitation. Persisting with the Super Bowl (exploitation) tests its suitability as a topic but risks disengagement if it proves uninteresting. Conversely, switching to a new topic, such as a shared hobby or current news (exploration), sacrifices immediate feedback on the Super Bowl but creates an opportunity to reduce associative uncertainty by learning more about the colleague’s preferences.

To investigate how the brain handles uncertainty, we use an A-alternative forced choice task (A-AFC task) (Fig. 2a). In this task, the reward probabilities for each action at trial t are represented as a vector ${{{{\boldsymbol{\theta }}}}}_{t}\in {{\mathbb{R}}}^{A}$, where ${({{{{\boldsymbol{\theta }}}}}_{t})}_{a}$ denotes the probability of receiving a reward when choosing action a. We consider both stationary and dynamic environments. In the stationary environment, the reward probabilities remain constant across trials, such that θ_t = θ₁ for all t ∈ [T]. In contrast, the dynamic environment features reward probabilities θ_t that vary across trials to reflect changing conditions. We will first study the stationary environment as it isolates lower-level uncertainty, providing a foundational framework for the basic CogLink. Subsequently, we will augment our CogLink to the dynamic environment to explore how hierarchical uncertainties interact.

**Fig. 2: Mechanistic details of basic CogLink support associative uncertainty encoding and balance exploration and exploitation.**

The basal ganglia (BG) are a natural candidate for handling lower-level uncertainties. A substantial body of research implicates the BG in learning action-outcome associations by integrating sensory inputs, motor actions, and reward feedback^17,18. Dopaminergic signals encoding reward prediction errors (RPEs) facilitate synaptic plasticity within the BG, enabling the adaptive adjustment of action values over time. This iterative refinement process makes the BG well-suited for encoding associative uncertainty and guiding the trade-off between exploration and exploitation. Accordingly, our basic CogLink network incorporates BG-like circuits, along with dopamine-dependent plasticity mechanisms for online learning and premotor/motor cortical areas for action selection (Fig. 2b). Neuronal activity in these areas is modeled as rate neurons governed by:

$$\tau \frac{d{{{\bf{x}}}}}{dt}=-{{{\bf{x}}}}+f({{{\bf{x}}}},{{{\bf{I}}}};{{{\bf{W}}}}),$$

(2.1)

where x represents the neurons’ firing rates, τ is the membrane time constant, f is a nonlinearity function, I is the input, and W denotes synaptic weights. As a convention, we denote synaptic weights from area A to area B using the variables W^A/B (matrix form) and V^A/B (vector form). The variable x^A represents the neural activity in vector form at area A. In its most basic form, the CogLink network handles lower-level uncertainty through two core mechanisms: exploration and learning. The exploration mechanism represents uncertainty as a distribution in BG and uses premotor recurrent dynamics to implement a probability matching strategy, promoting exploration when uncertainty is high. The learning mechanism, inspired by distributional reinforcement learning and Bayesian inference, updates action-value beliefs based on trial outcomes via dopamine-dependent plasticity. We detail how these mechanisms are implemented in different neural areas below.

A defining feature of the basic CogLink network is its incorporation of a quantile population code in the BG-like area, which encodes associative uncertainty as a distribution over action-value beliefs (Fig. 2c). In this scheme, each neuron is associated with a fixed quantile of the probability distribution, meaning that selecting a subset of neurons corresponds to sampling specific probabilities from the encoded distribution. Random sparsification dynamics in the premotor cortex (anterolateral motor cortex (ALM))-like area leverage this property to extract uncertainty through sampling (Fig. 2c, d, “Methods” section). This use of a quantile code builds upon the broader concept of population encoding, where neuronal ensembles represent probability distributions. Established approaches to population encoding include probabilistic population codes¹⁹, sampling codes²⁰, explicit probability codes²¹, and quantile codes²². In our model, we adopt the quantile coding approach to represent action-value beliefs as a probability distribution.

Specifically, we implement this by organizing neurons into A choice-specific ensemble of M premotor cortex-BG neuron pairs. In this framework, each ensemble encodes the distribution of action-value beliefs associated with a specific choice. The premotor corticostriatal synapses represent the distribution of action-value beliefs using a quantile code:

$$\forall a\in [A],m\in [M],\,P({v}_{a}={{{{\bf{V}}}}}_{a,m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}})=\frac{1}{M},$$

(2.2)

where v_a is a random variable representing action-value beliefs, and ${{{{\bf{V}}}}}_{a,m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}$ denotes the synaptic weight of the m-th connection between the m-th premotor neuron and BG neuron pair. Here, A is the number of alternatives and M is the size of a neuronal ensemble. This representation allows the network to efficiently extract uncertainty, as random sparsification in the premotor cortex-like area samples directly from the quantile-coded distribution formed by corticostriatal-like synapses (Fig. 2d). These sampled values are then relayed to the motor cortex-like area, where they inform action selection and balance exploration and exploitation during decision-making.

The sampling mechanism thus provides a way to translate associative uncertainty into inputs for the motor-like area, supporting efficient exploration during decision-making. While biological basal ganglia (BG) circuits project to the motor cortex via the thalamus and involve intricate circuitry beyond the corticostriatal loops modeled here, we abstract these additional components as relay functions to simplify the model. In this abstraction, the sampled values are directly projected to the motor cortex-like area to focus on the computations critical for exploration.

To convert the sampled action values, which encode associative uncertainty, into motor signals for action selection, we employ a model of the action selection mechanism inspired by ramp-to-threshold circuits observed in motor-related decision-making cortical circuits²³. Specifically, the recurrent connections in the motor cortex-like area are configured to implement mutual inhibition, enabling a Winner-Take-All (WTA) procedure (Fig. 2e, “Methods” section). In this setup, when the activity of a motor neuron ramps up to the threshold, the corresponding action a_t will be chosen. Thus, the motor circuit effectively chooses the action corresponding to the highest action-value sample emitted from striatal circuits (Fig. 2e).

Importantly, because the probabilistic nature of the sampled values carries information about the uncertainty (e.g., a flat distribution with a low mean can still sample a high value to drive exploration), this allows for associative uncertainty-based exploration: when value belief distributions of different actions have large overlapping (high uncertainty on which action is optimal), the CogLink will explore more; on the other hand, when value belief distribution of different actions have small overlapping (low uncertainty on which action is optimal), the CogLink will exploit more (Fig. 2f).

Finally, after the model chooses action a_t and receives reward r_t at trial t, the dopamine (DA) activities form a distributional RPE, ${{{\boldsymbol{\delta }}}}\in {{\mathbb{R}}}^{M}$, given by:

$$\forall m\in [M],\,{{{{\boldsymbol{\delta }}}}}_{m}={r}_{t}-{{{{\bf{V}}}}}_{{a}_{t},m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}},$$

(2.3)

where ${{{{\bf{V}}}}}_{{a}_{t},m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}$ represents the predicted value of action a_t for the m-th quantile. The distributional RPE is then used to update the premotor-BG synapses according to the following rule:

$$\forall m\in [M],\,{{{{\bf{V}}}}}_{{a}_{t},m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}\leftarrow {{{{\bf{V}}}}}_{{a}_{t},m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}+{{{{\boldsymbol{\eta }}}}}_{{a}_{t}}{{{{\boldsymbol{\delta }}}}}_{m},$$

(2.4)

where ${{{{\boldsymbol{\eta }}}}}_{{a}_{t}}$ denotes the learning rate of the corticostriatal synapses for the selected action a_t.

Unlike prior work on distributional reinforcement learning²², which focuses on learning reward distributions, our approach learns action-value belief distributions. This distinction is critical for representing uncertainty in action values, which is essential for reasoning about lower-level uncertainty. We elaborate on this distinction and its implications in the “Discussion” section.

To assess the model’s performance, we use the standard regret metric from bandit literature²⁴, defined as:

$${R}_{T}={\sum}_{t=1}^{T}{({{{{\boldsymbol{\theta }}}}}_{t})}_{{a}_{t}^{*}}-{({{{{\boldsymbol{\theta }}}}}_{t})}_{{a}_{t}},$$

(2.5)

where ${a}_{t}^{*}$ is the retrospectively optimal action, and a_t is the action chosen by the model. Regret measures the cumulative difference in rewards between the model’s chosen actions and the optimal actions, providing a benchmark for evaluating the model’s ability to adapt and balance exploration and exploitation.

The CogLink network successfully minimized regret by balancing exploration and exploitation (Fig. 2g). To explore the neural underpinnings of our model’s performance, we examined the synaptic strength of corticostriatal connections. Intriguingly, these synaptic profiles exhibited distinct signatures indicative of efficient exploration. The ensemble of synapses tuned to the accurate choice (action 1) rapidly narrowed the distribution and converged to the correct value estimates, while the synapses tuned to less preferred choices still showed a gradient of smaller synaptic strengths distinct from those of the preferred choice. The distinct gradient of synaptic strengths tuned to the less preferred choice indicated that the less preferred choice has high associative uncertainty (i.e., the distributions remained wide and hence those ensembles tuned to exhibit a gradient of strength in Fig. 2h) but low enough to confidently exploit the correct choice (i.e., the distributions are separated from the distribution of the optimal action and hence exhibit a gradient of synaptic strengths smaller than the synaptic strengths tuned to the optimal action in Fig. 2e).

To test the necessity of specific mechanisms in balancing exploration and exploitation, we performed two lesion experiments: one with reduced sparsification in the premotor cortex (KO-sparseness) and another replacing the distributional RPE with a scalar RPE (KO-distributional RPE). Both lesion variants resulted in significantly higher regret (Fig. 2i), driven by premature exploitation that led to persistent suboptimal choices (Fig. 2j). These results provide mechanistic insights into both random sparsification and distributional RPEs in balancing exploration and exploitation.

Our basic CogLink model approximates an algorithm with nearly optimal regret

A mechanistic model often proves too complex to clearly illustrate the underlying computational mechanisms or to admit mathematical analysis. To address this challenge, we approximate the basic CogLink network with an algorithm by leveraging the separation of scales, assuming that neural dynamics occur instantaneously (see “Methods” section). This simplification enables the premotor corticostriatal-like ensemble tuned to action a to act as a sampling mechanism for the action-value distribution. Specifically, the K − WTA dynamics in the premotor cortex-like circuit serve to randomly select K neurons, enabling efficient sampling of the action-value distribution:

$${{{{\mathcal{V}}}}}_{a}=\,{{\mbox{unif}}}\,\left({\left\{\frac{1}{K}{\sum}_{j=1}^{K}{{{{\bf{V}}}}}_{a,{i}_{j}}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}\right\}}_{1\le {i}_{1} < \cdots < {i}_{K}\le M}\right),\,{\hat{v}}_{a} \sim {{{{\mathcal{V}}}}}_{a}.$$

(2.6)

where ${{{{\mathcal{V}}}}}_{a}$ represents the distribution of value belief of action a and ${\hat{v}}_{a}$ is the sampled value. The WTA mechanism of the motor cortex-like circuit selects the action with the highest sample value:

$${a}_{t}=\arg {\max }_{a}{\hat{v}}_{a}.$$

(2.7)

Finally, the dopamine-gated plasticity adjusts the corticostriatal synapses to refine action-value estimates over time:

$$\forall m\in [M],\,{{{{\boldsymbol{\delta }}}}}_{m}={r}_{t}-{{{{\bf{V}}}}}_{{a}_{t},m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}},\,{{{{\bf{V}}}}}_{{a}_{t},m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}\leftarrow {{{{\bf{V}}}}}_{{a}_{t},m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}+{{{{\boldsymbol{\eta }}}}}_{{a}_{t}}{{{{\boldsymbol{\delta }}}}}_{m}$$

(2.8)

where δ_m is the distributional RPE and ${{{{\boldsymbol{\eta }}}}}_{{a}_{t}}$ is the learning rate for the a_t-tuned synapses.

The algorithm provides an intuitive framework for understanding the functionality of our corticostriatal network model. In this framework, A posterior-like distributions, representing action-value beliefs, are sampled through random sparsification in the premotor cortex. The motor cortex then selects the action corresponding to the largest sampled value through the recurrent competitive dynamics. Following action selection, the model refines its action-value distributions based on distributional reward prediction error (RPE) signals from dopamine (DA) neurons. High associative uncertainty-indicating a lack of confidence in the value estimate-results in the significant overlap between posterior-like distributions, promoting exploration (Fig. 2e, left). Conversely, low associative uncertainty leads to well-separated posterior-like distributions, enabling exploitation as the model confidently selects the optimal action (Fig. 2e, right).

To assess our network’s performance against the theoretical regret limit, we conducted a mathematical analysis of the algorithm. Our analysis demonstrates that by appropriately configuring the parameters in the synaptic update rule, the regret of the algorithm will be on the order of $O(\sqrt{AT\log (AT)})$, where A represents the number of actions and T denotes the number of trials (see Theorem 2 in the “Methods” section for a formal theorem).

Theorem 1

(Informal). If we select the sparsity K, the learning rate ${\{{\eta }_{a,t}\}}_{a\in [A],t\in [T]}$ and the initial synaptic weight ${\{{\bar{V}}_{a,m}^{{{{\rm{alm/bg}}}}}\}}_{a\in [A],m\in [M]}$ appropriately, then the regret of the algorithm after T trials in a static A-AFC task is at most $C\sqrt{AT\log (AT)}$, where C is a constant.

It has been demonstrated that no algorithm can achieve regret smaller than ${{\Theta }}(\sqrt{AT})$²⁵. Our algorithm, which differs by only a logarithmic factor, is therefore close to optimal in terms of regret. This result provides a theoretical foundation for the model’s ability to perform efficient exploration under lower-level uncertainty, demonstrating its near-optimal balance of exploration and exploitation.

Relationship between basic CogLink and Bayesian inference with probability matching

We evaluated the performance of our basic CogLink model in static A-AFC tasks and compared it to Thompson Sampling (TS), a widely used algorithm that combines optimal Bayesian inference with probability matching and provides asymptotically optimal theoretical guarantees²⁶. TS was chosen as the baseline because it represents a principled approach to balancing exploration and exploitation under uncertainty. Task difficulty was manipulated by varying the expected reward difference between the most and least rewarding actions (Δ) and the number of alternatives (A) (see “Methods” section) (Fig. 1a). Across all tested environments, CogLink consistently outperformed TS, achieving a better balance between exploration and exploitation, as evidenced by faster convergence to optimal actions and improved regret performance (Fig. 3a–c, Fig. S1a–c).

**Fig. 3: Comparison between CogLink and Thompson sampling.**

To further evaluate CogLink’s versatility in handling more complex decision-making scenarios, we extended its application to two generalizations: a cued A-AFC task, which incorporates state (cue) information, and a binary tree maze task, which introduces state transitions (see “Methods” section). These tasks represent a progression from stateless bandit problems to scenarios where decisions depend on environmental states. In these tasks, we compared CogLink against TS and a neural network-based method, Deep Q-Network (DQN,²⁷). CogLink demonstrated robust performance across varying difficulty settings, maintaining competitive regret compared to both TS and DQN (Fig. S2a–d). These results underscore CogLink’s ability to adapt from simple, stateless environments to more complex tasks involving state information and transitions, demonstrating its versatility in managing lower-level uncertainty during decision-making.

The robust performance of CogLink in various tasks raises questions about the underlying principles that enable its effective decision-making. To better understand these mechanisms, we examined the algorithm resulting from CogLink’s approximation and observed that the resulting algorithm after approximation shares key similarities with Thompson Sampling (TS), particularly in its use of action-value distributions and probability matching-like action selection, but differs in the update rule. To investigate this relationship further, we analyzed the correspondence between the distributional RPE update in Equation (2.4) and Bayesian updates.

First, we initialized the corticostriatal weights to approximate a uniform prior, analogous to Bayesian inference using a uniform prior:

$$\forall a\in [A],m\in [M],\,{{{{\bf{V}}}}}_{a,m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}\leftarrow {\bar{{{{\bf{V}}}}}}_{a,m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}},\,{{\mbox{where}}}\,{\bar{{{{\bf{V}}}}}}_{a,m}^{\,{{\mbox{alm}}}/{{\mbox{bg}}}\,}=\frac{m}{M}.$$

(2.9)

Next, we examined how the expectation and variance of the action-value distribution evolved under the distributional RPE update. By selecting learning rates η_t ∝ 1/t (see “Methods” section), we found that our updates closely approximated the evolution of both the expectation and variance under optimal Bayesian inference (Fig. 3d, e). This choice of learning rate satisfies two critical conditions: ${\sum }_{t=0}^{\infty }{{{{\boldsymbol{\eta }}}}}_{t}=\infty$ and ${\lim }_{t\to \infty }{{{{\boldsymbol{\eta }}}}}_{t}=0$, ensuring that the variance diminishes over time while the expectations converge to the true action-value distribution.

For the action selection, compared to TS, CogLink provides flexibility in balancing exploration and exploitation by modulating the parameter K, the sparsity in the premotor cortex-like area. Larger K values result in narrower distributions after updates, favoring exploitation (Fig. 3f). Specifically, when K = 1, this is probability matching, and when K = M, this is deterministically sampling the expected value. When 1 < K < M, the model employs generalized probability matching^{28,29,30,31,32}, where higher K values increase the emphasis on exploitation while still allowing some degree of exploration. This framework provides a continuum of strategies for balancing exploration and exploitation.

We hypothesize that K could be dynamically modulated in biological systems through neural mechanisms, such as altering the excitability of premotor cortical neurons via neuromodulation. This hypothesis aligns with prior studies demonstrating that neuromodulatory systems, including dopamine and norepinephrine signaling, play a role in adjusting the exploration-exploitation trade-off ^33,34. This dynamic modulation offers a plausible pathway for organisms to adapt their decision-making strategies to changing environmental demands.

Building an augmented CogLink network for handling higher-level uncertainty

Returning to our example of conversing with a new colleague, the assumption that each sigh of boredom is solely due to a suboptimal topic choice or random outcome variability (e.g., low focus) is overly simplistic. Higher-level factors, such as the colleague’s mood or workload, can also play a role, and these conditions are often dynamic and unobservable. In naturalistic environments, animals must contend not only with lower-level uncertainties, such as outcome and associative uncertainty, but also higher-level uncertainties, including contextual uncertainty-ambiguity about the underlying context governing the environment. To address this challenge, we designed a probabilistic reversal task in a dynamic environment (Fig. 4a). While the basic CogLink network performed well in static environments, it struggled to adapt quickly to changing contexts in this dynamic setting (Fig. S3e, f).

**Fig. 4: Mechanistic details of augmented CogLink facilitate the encoding of contextual uncertainty and drive flexible switching behaviors.**

As an initial step, we introduced explicit external contextual cues to the model, activating separate instances of the basic CogLink network depending on the provided cues (Fig. S3b). This modification allowed the model to achieve instantaneous behavioral switching (Fig. S3g, h). However, animals in natural environments rarely have access to explicit contextual cues and instead must infer the underlying context from ambiguous and incomplete observations.

The prefrontal cortex (PFC)-mediodorsal thalamus (MD) circuit is a natural candidate for enabling such contextual inference. The PFC is well-established as a key region for flexible, context-dependent behavior^35,36, generating complex activity patterns to support such capacities. Recent studies suggest that these patterns are regulated by interactions with the MD^{37,38,39,40,41,42,43}, which encodes task context explicitly in a range of decision-making paradigms^1,44,45,46. Inspired by these findings, we augmented CogLink by incorporating a PFC-MD-like circuit to infer and provide contextual information to the basic CogLink networks (Fig. 4b). This augmentation enables the model to adapt to dynamic environments without relying on explicit external cues. One important assumption in our model is that disjoint basic CogLink networks are activated based on the inferred context. We discuss the biological plausibility of this mechanism further in the “Discussion” section.

A prominent feature of the augmented CogLink model is its low-dimensional representation of contextual likelihood in the MD-like area, consistent with previous literature^{1,44,45,46,47,48,49}. Specifically, we propose that the MD encodes the conditional likelihood p(c∣a_≤t, r_≤t) of a context c, given the history of action-outcome pairs {a_≤t, r_≤t}. To achieve this, we hypothesize that MD activity lies on a low-dimensional simplex attractor, enabling a stable representation of contextual likelihood. Since the thalamus lacks intrinsic excitatory recurrence⁴², we propose that PFC and MD form an excitatory loop, with the thalamic reticular nucleus^50,51,52,53 providing local inhibition to stabilize the attractor (Fig. 4, see “Methods” section). In this framework, contextual likelihood is represented as an explicit probability code, where higher neural activity corresponds to a higher likelihood of the associated context. The simplex attractor structure allows MD activity to dynamically integrate inputs, enabling it to traverse the manifold in response to changing environmental conditions (Fig. 4c, d). However, contextual inference requires that these inputs be appropriately encoded to reflect the conditional likelihood.

Bayes’ rule provides a framework for determining the required input encoding by defining how contextual likelihoods are computed:

$$p(c| {a}_{\le t},{r}_{\le t})\propto {\prod }_{i=1}^{t}p({a}_{i},{r}_{i}| c)p(c).$$

(2.10)

This formalism suggests that the inputs should correspond to the single-trial contextual generative model p(a_t, r_t∣c), which is accumulated across trials to compute the overall likelihood. We hypothesize that PFC-MD synapses learn this single-trial generative model, while the MD’s low-dimensional attractor dynamics perform the accumulation. To enable this process, we implemented a Hebbian learning rule for PFC-MD connections (Fig. 4e, see “Methods” section):

$$\Delta {{{{\bf{V}}}}}_{c,a,r}^{\,{\mbox{pfc/md}}}\propto {f}_{{{{\rm{hebb}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{md}}\,}){{{{\bf{x}}}}}_{a,r}^{\,{\mbox{pfc}}\,}.$$

(2.11)

Here, ${{{{\bf{x}}}}}_{c}^{\,{\mbox{md}}\,}$ and ${{{{\bf{x}}}}}_{a,r}^{\,{\mbox{pfc}}\,}$ denote the activities of MD and PFC neurons tuned to context c and action-outcome pair (a, r), respectively, ${{{{\bf{V}}}}}_{c,a,r}^{\,{\mbox{pfc/md}}\,}$ represents PFC-MD synaptic weights between these MD and PFC neurons, and f_hebb (Fig. S4b) is a sigmoidal gating function that modulates synaptic plasticity.

Naive Hebbian plasticity may incorrectly associate action-outcome pairs with the wrong context when contextual uncertainty is high, resulting in inaccurate estimates of the contextual generative model (Fig. 4f). To mitigate this issue, we incorporate a gating mechanism f_hebb that modulates plasticity based on MD activity. This gating enhances learning when the MD confidently infers the context (high MD activity) and suppresses plasticity when contextual uncertainty is high (low MD activity). By doing so, the mechanism achieves two key objectives: it accelerates the learning of contextual statistics when confidence is high and prevents the misattribution of associations under high contextual uncertainty (Fig. 4g, h).

To causally test the necessity of this mechanism, we evaluated a variant of the model with naive Hebbian plasticity (KO-nonlinear Hebb). The results show that the generative model learned by the full CogLink closely approximates the true generative model of the environment, whereas the KO-nonlinear Hebb variant deviates significantly (Fig. 4f). This supports the critical role of f_hebb in enabling accurate contextual learning under uncertainty.

Another key component of the augmented CogLink is an interneuron-mediated thalamocortical projection pathway that modulates cortical activity to drive exploration under high contextual uncertainty (Fig. 4i). When contextual uncertainty is high, animals need to explore more to gather information about the current context. To implement this pathway, we drew inspiration from experimental findings showing that the MD thalamus modulates PFC functional connectivity through distinct interneuron-mediated mechanisms. Specifically, Mukherjee et al. identified two thalamocortical pathways: one amplifies cortical connections via local disinhibition by vasoactive intestinal peptide (VIP) interneurons, and the other suppresses cortical activity through fast inhibition mediated by parvalbumin (PV) interneurons⁴⁵.

Building on these findings, we assumed that such modulation enables contextually relevant PFC populations to differentially influence downstream premotor circuits, thereby facilitating context-dependent behavior. To model this mechanism, we included both thalamic projection pathways: one amplifies effective cortical connectivity for the preferred context, while the other inhibits cortical activity related to the opposing context. This modulation adjusts the activity in the PFC-like area and influences downstream premotor corticostriatal connections (Fig. 4i). Specifically, the projections modulate the effective strength of corticostriatal connections according to the following dynamics:

$$\forall c\in [2],a\in [A],\,m\in [M],\,{\tau }^{{{{\rm{bg}}}}}\frac{d{{{{\bf{x}}}}}_{c,a,m}^{\,{\mbox{bg}}\,}}{dt}=\\ -{{{{\bf{x}}}}}_{c,a,m}^{\,{\mbox{bg}}}+{f}_{{{{\rm{in}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{vip}}\,}-{{{{\bf{x}}}}}_{c}^{\,{\mbox{pv}}\,}){{{{\bf{V}}}}}_{c,a,m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}{{{{\bf{x}}}}}_{c,a,m}^{\,{\mbox{alm}}\,}.$$

(2.12)

Here, τ^bg is the membrane time constant of striatal neurons. ${{{{\bf{x}}}}}_{c,a,m}^{\,{\mbox{bg}}\,}$ and ${{{{\bf{x}}}}}_{c,a,m}^{\,{\mbox{alm}}\,}$ represent the activities of the mth BG and premotor neurons tuned to context c and action a, respectively, while ${{{{\bf{V}}}}}_{c,a,m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}$ is the strength of the corticostriatal synapse connecting these neurons. f_in is a sigmoidal nonlinearity, and ${{{{\bf{x}}}}}_{c}^{\,{\mbox{vip}}\,}$ and ${{{{\bf{x}}}}}_{c}^{\,{\mbox{pv}}\,}$ denote the activities of VIP and PV interneurons receiving MD inputs tuned to the preferred and opposing contexts, respectively (see “Methods” section). Since these interneurons receive contextual inputs from MD, the term ${f}_{{{{\rm{in}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{vip}}\,}-{{{{\bf{x}}}}}_{c}^{\,{\mbox{pv}}\,})$ encodes contextual certainty. This mechanism ensures that corticostriatal connections are weakened under high contextual uncertainty, promoting exploratory behavior.

To validate this mechanism, we systematically varied MD activity to manipulate contextual uncertainty and measured its effect on exploratory behavior. Consistent with our predictions, higher contextual uncertainty corresponded to increased exploration, confirming the role of thalamocortical projections in dynamically regulating contextual uncertainty-based exploration (Fig. 4j).

In addition to modulating exploratory behaviors, contextual uncertainty should also regulate learning. Under high contextual uncertainty, naive dopamine-dependent plasticity risks misattributing associations to the wrong context, resulting in inaccurate action-value estimates (Fig. 4k). To address this, we implemented a mechanism in which interneuron-mediated inputs gate the plasticity of corticostriatal synapses (Fig. 4i). This design is inspired by experimental findings that interneuron-mediated pathways can modulate cortical plasticity^54,55. Specifically, our model incorporates the following update rule:

$$\forall c\in [2],m\in [M],\,{{{{\boldsymbol{\delta }}}}}_{c,m}={r}_{t}-{{{{\bf{V}}}}}_{c,{a}_{t},m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}},\,\Delta {{{{\bf{V}}}}}_{c,{a}_{t},m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}\propto {f}_{{{{\rm{in}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{vip}}\,}-{{{{\bf{x}}}}}_{c}^{\,{\mbox{pv}}\,}){{{{\boldsymbol{\delta }}}}}_{c,m}.$$

(2.13)

Here, ${{{{\boldsymbol{\delta }}}}}_{c}\in {{\mathbb{R}}}^{M}$ represents the distributional dopamine activities tuned to context c, and ${{{{\bf{V}}}}}_{c,{a}_{t},m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}$ denotes the corticostriatal synaptic weights. The gating term ${f}_{{{{\rm{in}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{vip}}\,}-{{{{\bf{x}}}}}_{c}^{\,{\mbox{pv}}\,})$, a sigmoidal nonlinearity, reflects the relative activities of VIP and PV interneurons, encoding contextual certainty. When contextual uncertainty is high (low f_in), the mechanism suppresses learning to avoid associating incorrect contexts with observed outcomes. Conversely, under low uncertainty, plasticity is enhanced, promoting accurate learning.

To test the necessity of this gating mechanism, we developed a variant of the model that bypasses interneuron-mediated gating and uses direct thalamocortical modulation (KO-interneuron gating). We compared the action-value estimates learned by the full CogLink model to those of the KO-interneuron gating variant. The full model closely approximated the true action values of the environment, whereas the KO-interneuron gating variant deviated significantly (Fig. 4k). These results underscore the critical role of interneuron-mediated gating in enabling accurate and continual learning across contextual switches.

The MD circuit approximates an algorithm that detects environmental changes optimally

To computationally understand how CogLink achieves flexible switching, we next describe the effective MD circuit and approximate the dynamics of the MD circuit with an algorithm. The MD circuit is structured to accumulate contextual likelihoods, enabling robust context inference. Mathematically, by letting the dynamics of the thalamic reticular nucleus (TRN) and frontal neurons happen instantaneously, the MD circuit can be effectively described by the following equations (see “Methods” section):

$$\left\{\begin{array}{l}{\tau }^{{{{\rm{eff}}}}}\dot{{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}\,}}=-\frac{2{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}}}{D}+{f}_{{{{\rm{md}}}}}(w{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}\,}-w{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}\,})+{{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}\quad \\ {\tau }^{{{{\rm{eff}}}}}\dot{{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}\,}}=-\frac{2{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}}}{D}+{f}_{{{{\rm{md}}}}}(w{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}\,}-w{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}\,})+{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}\quad \end{array}\right..$$

(2.14)

where τ^md = τ^effD/2 represents the membrane time constant of MD, τ ^eff represents the effective time constant for accumulation dynamics, $w=\frac{1}{D}$, and ${{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,},{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}$ represent the PFC inputs to MD. The nonlinearity function is defined as:

$${f}_{{{{\rm{md}}}}}(x)=\left\{\begin{array}{ll}x+1,\quad &\,{{\mbox{for}}}\,-1\le x\le 1\\ 2,\quad &\,{{\mbox{for}}}\,x > 1\\ 0,\quad &\,{{\mbox{for}}}\,x < -1\\ \quad \end{array}\right..$$

(2.15)

Defining $X={{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}\,}-{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}\,}$, the dynamics simplifies to:

$${\tau }^{{{{\rm{eff}}}}}\frac{dX}{dt}={{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}-{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}},{\mbox{when}}\,| X| < D$$

(2.16)

and

$${\tau }^{{{{\rm{eff}}}}}\frac{dX}{dt}=-\frac{2X}{D}+{{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}-{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}}+2,{\mbox{when}}\,| X| > D.$$

(2.17)

Equation (2.16) corresponds to a drift-diffusion process, while Equation (2.17) describes the dynamics above threshold ±D. When $| {{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}-{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}| \ll 1$, X stabilizes at approximately ±D, resulting in thresholded drift-diffusion behavior with inputs ${{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}-{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}$ (Fig. 5a).

**Fig. 5: The augmented CogLink achieves flexible decision-making and continual learning by managing hierarchical uncertainty.**

If the PFC inputs learn the accurate generative model from Equation (2.11), these dynamics align with the CUSUM algorithm, a theoretically optimal method for detecting distributional changes^56,57. Specifically, this occurs when:

$${{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}(t)=\log P({a}_{t},{r}_{t}| c=1)+\alpha,{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}(t)=\log P({a}_{t},{r}_{t}| c=2)+\alpha .$$

(2.18)

Here, α denotes the baseline excitation. To illustrate, if we set X₀ = −D and S_t = X_t + D, the evolution of X_t corresponds to

$${S}_{t}=\min (2D,\max (0,{S}_{t-1}+{{{{\bf{I}}}}_{1}^{{\,{\mbox{pfc/md}}\,}}}(t)-{{{{\bf{I}}}}_{2}^{{\,{\mbox{pfc/md}}\,}}}(t))).$$

(2.19)

When S_t < 2D, the CogLink model functions as a CUSUM algorithm with a threshold at D for detecting distributional changes (see “Methods” section). This alignment underscores the efficiency of the thalamocortical model in identifying environmental changes and facilitating transitions between different instances of the basic CogLink for decision-making. Consistent with our theoretical predictions, the model closely approximates the behavior of the CUSUM algorithm during the first contextual switch (Fig. 5b).

Recognizing that real-world environments often involve multiple sequential changes, the CogLink model incorporates a capping mechanism to address the limitations of the CUSUM algorithm, which is designed for single change point detection. By capping the accumulation of evidence for each context (Equation (2.17)), this mechanism prevents overcommitment to a single context and enables the model to reset quickly and prepare for subsequent environmental shifts (Fig. 5a). This explains the observed deviations from the CUSUM algorithm’s behavior after the first detected change and highlights the importance of this feature in maintaining adaptability.

Furthermore, the evidence-capping mechanism supports the model’s independence from prior knowledge of the generative model. As long as there is sufficient time between context changes for I ^pfc/md to accurately learn the contextual generative model, the CogLink model operates effectively without requiring specific environmental assumptions. This model-agnostic property not only distinguishes it from ideal observer models, which depend on precise access to the generative model, but also underscores its versatility and robustness across diverse and dynamic environments.

The augmented CogLink achieves flexible decision-making and continual learning by managing hierarchical uncertainty

To empirically evaluate CogLink’s performance in dynamic environments, we compared it to a Hidden Markov Model (HMM) that has prior knowledge of the hidden generative model of the environment and uses Thompson sampling for action selection (see “Methods” section). Despite the HMM’s advantage of full prior knowledge, CogLink achieves comparable levels of regret and accuracy while learning the generative model from scratch (Fig. 5c-f). This comparison underscores CogLink’s ability to perform effectively without relying on predefined assumptions about the environment.

Analyzing the models’ behaviors after a context switch reveals differences in their adaptation strategies. While both models transition rapidly to the new context, the HMM switches slightly faster but requires more trials to fully stabilize its decisions (Fig. 5e). To quantify this, we define trials to switch as the number of trials needed for a model to achieve 80% accuracy over the past 10 trials following a context change. As expected, the HMM exhibits faster switching times due to its prior knowledge, though CogLink’s switching performance remains competitive (Fig. 5g).

To understand the mechanisms underlying CogLink’s performance, we analyzed the evidence accumulation dynamics in MD, as predicted by the theoretical framework in the previous section. The model rapidly and accurately detects context switches after each block, leveraging these dynamics to adapt effectively (Fig. 5a). Furthermore, CogLink demonstrates robust continual learning by accurately updating action values and the contextual generative model, even as environmental statistics shift across blocks. These learned estimates remain stable across switches, retaining prior block information while enabling adaptation to new contexts (Fig. 4f, k).

To further explore this adaptability, we examined how CogLink leverages contextual uncertainty encoded in MD populations to support continual learning. Contextual uncertainty peaks immediately after context switches, reflecting the model’s need to gather information during transitions (Fig. 4g). This uncertainty modulation directly influences Hebbian learning rates of PFC-MD synapses (Equation (2.11)), which rapidly decrease for the previous context after a switch. This reduction prevents the model from incorrectly learning generative models in the wrong context (Fig. 4h). Similarly, VIP- and PV-mediated learning rates (Equation (2.13)) are modulated to ensure that action-outcome associations are appropriately attributed to the current context (Fig. S4c, d).

Interestingly, uncertainty modulation operates bidirectionally between hierarchical levels. High associative uncertainty, arising from insufficient knowledge of action-outcome associations, slows the model’s updates to contextual uncertainty, reflecting the difficulty in attributing evidence to the correct hierarchical process. This behavior manifests in longer switching times when CogLink encounters a novel block (Fig. 5h). Conversely, in dynamic environments with low outcome uncertainty (e.g., reward probabilities of 90%/10%), CogLink switches contexts much more rapidly (Fig. 5i). This suggests that reduced variability in outcomes enables the model to more readily attribute failure to context changes, thereby facilitating faster contextual updates. Together, these findings indicate that both associative and outcome uncertainty shape the dynamics of contextual uncertainty. By orchestrating these interactions across hierarchical uncertainty levels, CogLink achieves flexible decision-making and robust continual learning, even in complex and dynamic environments.

The model explains experimental findings showing MD causal engagement in decision-making involving changing but not stationary environments

A number of studies have shown that MD lesions or inactivation perturbs behavioral adjustment when the environment changes, but doesn’t necessarily impact behavior when conditions are stable^{44,45,58,59,60}. To test whether our model exhibits these features, we performed perturbation studies by suppressing model MD neural activity (see “Methods” section). In agreement with the corpus of experimental findings, we found that the MD-suppressed model took significantly longer to switch compared to the normal model (Fig. 6a–f)⁴⁴. Specifically, following a block switch, the MD-suppressed model exhibits a gradual increase in exploration of the alternate action until commitment (Fig. 6c). Moreover, the model provided a unique perspective on why this happens with an experimentally testable prediction: in the other model component, the M1-BG component, analysis of corticostriatal connection strength revealed fluctuating value estimates across blocks, indicating unlearning of value estimates from the previous context to adapt to the current one (Fig. 6g, h). This is consistent with the idea that without the MD, animals may default to lower-level or model-free strategies to solve tasks that they would otherwise be able to solve with frontal control.

**Fig. 6: Mediodorsal thalamus in CogLink is necessary for decision-making in a changing environment but not required in a stationary environment.**

A natural question then arises: Is MD necessary for efficient exploration in a stationary environment? To answer this question, we evaluated our models in various stationary 2-AFC tasks. In contrast to the results above, the MD inhibition model still has comparable behavioral performance with Thompson sampling across various environments and only slightly degrades its performance from the full model (Fig. 6i, Fig. S6). This further indicates that MD is not directly involved in simple associative learning, but rather serves as a central hub to orchestrate the learning of contextual models and modulation of downstream associative learning through learned contextual models (see “Discussion” section).

Hyperactivation of striatal D2 receptors induces Schizophrenia-like behaviors, and MD stimulation can rescue these deficits

There is increasing evidence that schizophrenia patients exhibit impaired belief updating processes^61,62,63,64, which may be related to susceptibility to delusional thinking^61,65,66. Separately, resting state functional connectivity between the MD thalamus and PFC is altered in schizophrenia patients^{67,68,69,70,71,72,73,74}. A recent study using mouse models carrying schizophrenia-relevant mutations showed both perturbed MD function and belief updating, and optogenetic MD stimulation led to a normalization of the belief updating process¹⁶. While striking, these findings leave open the question of what the mechanistic links are between MD perturbation and the belief updating decision-making process.

Inspired by the fact that most antipsychotics targeting D2 receptors (D2Rs) are dopamine antagonists^75,76,77 and most Schizophrenia patients show an elevated level of striatal D2Rs^78,79, we consider a model with hyperactivation of striatal D2Rs. Since a hyperactive striatum is expected to inhibit the MD thalamus⁸⁰, we model our impaired model with decreased MD excitability (see “Methods” section, Fig. 7a).

$$\left\{\begin{array}{l}{\tau }^{{{{\rm{eff}}}}}\dot{{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}\,}}=-\frac{2{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}}}{D}+{\beta }_{{{{\rm{d2}}}}}\,{f}_{{{{\rm{md}}}}}(w{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}\,}-w{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}})+{\beta }_{{{{\rm{d2}}}}}{{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}\quad \\ {\tau }^{{{{\rm{eff}}}}}\dot{{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}\,}}=-\frac{2{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}}}{D}+{\beta }_{{{{\rm{d2}}}}}\,{f}_{{{{\rm{md}}}}}(w{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}\,}-w{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}})+{\beta }_{{{{\rm{d2}}}}}{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}\quad \end{array}\right..$$

Here, β_d2 is the decreased excitability from D2R hyperactivation. It is observed that the impaired model has suboptimal regret and accuracy and never fully commits to accurate choices after a block switch (Fig. 7b–e). Moreover, the model also exhibits longer exploration after a switch (Fig. 7f, g), showing impaired cognitive flexibility^16,81. On the other hand, the impaired model also shows an elevated win-switch rate (Fig. 7h), suggesting a perception of environmental instability leading to this erratic behavior. These two seemingly contradictory behaviors of slow switching and high win-switch are consistent with experimental findings in both patients and animal models^16,62,81,82.

**Fig. 7: Excess dopamine on MD induced Schizophrenia-like signatures in CogLink and MD activation rescues the behaviors.**

To investigate the neural mechanisms behind these two behaviors, we first examine the drift process formed by the difference in activities of two contextual MD populations. Compared to the normal drift process, which is well-separated across contexts, the impaired drift process saturates its evidence at a much lower threshold, inducing a strong prior for the volatility of the environment (Fig. 7i). To understand its underlying dysfunctions, we leverage CogLink’s capacity to approximate an algorithm and show that the threshold of the accumulation dynamics becomes smaller. Moreover, the impaired normative model exhibits leaky evidence integration, further reinforcing its prior belief in the environment’s volatility (see “Methods” section). By examining the contextual uncertainty the impaired model decoded, we can also observe its strong belief in environmental volatility (Fig. S7b).

On the other hand, the corticostriatal strengths exhibit more homogeneous profiles (Fig. S7c), indicating low associative uncertainty. Moreover, its learning rate modulated by VIP/PV interneurons is much lower than the normal model (Fig. 7m, Fig. S4d). This suggests that although the impaired model has a strong prior on the volatility of the environment, it also updates its belief at a much slower rate within a single context, potentially contributing to slow switching.

Numerous studies have demonstrated alterations in PFC-MD coupling in schizophrenia patients^{67,68,69,70,71,72,73,74}. Given that our model suggests PFC-MD connections are involved in learning contextual generative models, we aim to investigate whether the impaired model also exhibits deficits in model learning. Compared to the normal model, the impaired model struggles to learn the correct contextual generative model of the environments (Fig. 7k). To probe the mechanism, we examine the learning rate of PFC-MD connections. Indeed, lower excitability results in neuronal activities insufficient to induce Hebbian plasticity (Fig. 7l).

To restore the model’s learning capacity, we introduce a small excitatory current into the MD neurons (Fig. 7a). This intervention reduces both regret and exploratory behaviors after a switch (Fig. 7b–g). Additionally, although the rescue model does not reduce the win-switch rate (Fig. 7h), the drift process of the rescue model exhibits a higher threshold for evidence accumulation, indicating a weaker prior on environmental volatility (Fig. 7i). Moreover, the rescue model learns a more accurate generative model of the world (Fig. 7k) and reinstates proper learning in PFC-MD connections (Fig. 7l). These findings are consistent with the recent MD activation experiments on Schizophrenia-relevant mouse models¹⁶.

Discussion

Biological plausibility of the CogLink

CogLink incorporates diverse, biologically inspired mechanisms to model associative and contextual uncertainty processing in frontal networks. To ensure computational tractability while maintaining biological plausibility, the model includes certain assumptions and simplifications, which we discuss below.

To address hierarchical uncertainty, CogLink models hierarchical cortico-thalamic-basal ganglia (BG) loops. In animals, these loops process different types of information-such as motor, limbic, and associative-through parallel streams^83,84. In CogLink, we specifically model the motor and associative components of the cortico-thalamic-BG loop to process low- and high-level uncertainty, respectively.

The basic CogLink focuses on the premotor cortico-thalamic-basal ganglia (BG) loop, emphasizing BG’s role in associative learning under uncertainty. Instead of modeling the full complexity of BG circuitry, including direct and indirect pathways⁸⁵, we adopt a simplified actor-critic structure, commonly used to capture BG’s associative learning capacity¹⁷. Importantly, we assume that striatal neurons encode a distribution of action-value beliefs, updated by dopamine neurons through distributional reward prediction errors (RPEs). This assumption is suggested by evidence that dopamine neurons exhibit inhomogeneous responses forming distributional RPEs²². To implement sampling of these distributions, we hypothesize random input sparsification in the premotor cortex, inspired by evidence of variable sparse firing in supragranular cortical layers (layers 2/3)^86,87,88, which are parts of the cortical circuit known to generate striatal inputs.

The augmented CogLink extends this framework to an associative cortico-thalamic-BG loop to address contextual uncertainty. Although thalamic-striatal connections also have been implicated in flexible behavior^89,90, we focus on the well-established role of the prefrontal cortex (PFC) and mediodorsal thalamus (MD) in cognitive flexibility^{1,16,43,44,45,52} and choose not to model those connections. Another key assumption in this model is the use of disjoint cortical representations that are contextually activated. This modular strategy representation is supported theoretically^91,92 and experimentally, as Kim et al. demonstrated context-dependent stable representations of the same action in mice⁹³.

Extensive studies have shown that the MD thalamus explicitly encodes task contexts across various decision-making paradigms^44,45,46, with recent findings highlighting its role in encoding contextual uncertainty¹. Based on this evidence, we propose that MD acts as a central hub for encoding contextual uncertainty. In CogLink, thalamocortical projections serve dual roles: driver-like projections maintain stable contextual representations via recurrent PFC-MD loops (Equation (2.14)), while modulatory projections influence cortical connectivity and plasticity through interneurons (Equation (2.12), Equation (2.13)). These dual roles align with the classical distinction between core (driver) and matrix (modulatory) thalamic projections^94,95, as well as recent findings on the diverse functions of the thalamus in modulating cortical dynamics^{43,44,52,96,97,98,99,100,101,102}.

Additionally, we hypothesize that PFC populations modulated by these interneuron-mediated thalamocortical projections activate downstream premotor circuits in a context-dependent manner. Supporting this hypothesis, Wang and Sun¹⁰³ demonstrated that PFC sends context-encoding inputs to the premotor cortex to initiate movement.

Together, these features make CogLink a biologically plausible framework for understanding how hierarchical uncertainty shapes decision-making and cognitive flexibility.

Neural representation, computation, and usage of uncertainty

Even though it is well-established that uncertainty profoundly impacts behavior^104,105,106, an overarching computational framework for how different forms of uncertainty are encoded, represented, and decoded to drive behavioral adjustment is lacking^107,108. On the encoding front, uncertainty may be represented at the single neuron level, which empirical studies have found in the basal ganglia¹⁰⁹ and the frontal cortex^110,111. Another encoding strategy is in the form of a distribution at the neural population level^19,112. A distributional code is more computationally demanding but may offer flexibility through differential decoding (e.g., different parts of the distribution can be selectively weighted based on other state variables). There are different frameworks to represent distributions in a neural population, such as probabilistic population codes^19,113,114, sampling-based codes^20,112,115, explicit probabilistic codes^21,116, and quantile codes²². Our model used two distinct ways to encode uncertainty as a distribution, which are motivated by empirical findings that we explain below.

For associative uncertainty, which is computed in the BG component of our model, it is encoded in a quantile code similar to distributional reinforcement learning (RL)²². This is consistent with the fact that the striatum is a major output for dopaminergic neurons, and indeed, our model shows that it is quite straightforward for dopamine-gated plasticity to update this form of BG distribution. In addition, our simulations show that it is easy to sample from a quantile distribution through recurrent competitive dynamics because sampling neurons corresponds to sampling the corresponding probabilistic quantity in such code. We should note, however, that while our model uses a quantile code similar to those in the distributional RL literature, our implementation differs in two important ways. First, rather than varying the optimism of each synapse, we vary the initial synaptic strength. This approach enables our model to learn the posterior over action-value beliefs rather than the reward distribution, allowing for representation of associative uncertainty. Second, we introduce a mechanism to couple this representation to behavioral adjustment through sampling. This conceptually motivated deviation from previous distributional RL models is designed to link the representation of uncertainty to decision-making, rather than merely using distribution learning for improved generalization. In our model, we posit that sampling, which can be efficiently done on its quantile code, is a mechanism to couple uncertainty to efficient exploration.

For contextual uncertainty, which is computed in the MD thalamus, it is encoded as an explicit probabilistic code inspired by past experimental works showing MD encodes context⁴⁴. This representation has the distinct advantage of contextually modulating local learning (Fig. 4g, h). However, the detailed mechanism of how this representation arises is poorly understood. To investigate how to compute such a representation, we include two mechanisms in PFC-MD circuits. First, PFC-MD connections learn the contextual model of the environment at a single-trial level via Hebbian learning. Second, recurrent dynamics in the PFC-MD circuit accumulate the single-trial likelihoods from corticothalmic inputs to calculate the current likelihood of contexts conditioned on previous experiences. Based on recent evidence showing that the thalamus modulates both activities and plasticity of downstream cortical networks^1,45,54,55, we include interneuron-mediated pathways to allow the contextual MD representation to accomplish these functions and explain how contextual uncertainty can impact exploration and learning through these mechanisms.

Thalamocortical interaction as a system-level solution for flexible behaviors and model-based learning

Both animals and humans rely on a delicate coordination between model-free and model-based learning processes to adapt flexibly to their environments^{117,118,119,120,121}. BG has traditionally been associated with model-free learning, while PFC has emerged as a locus for model-based learning and the mediation between the two systems^{17,122,123,124,125,126}. However, the intricate mechanisms underlying the coordination of these learning types remain poorly understood. In our study, we propose the thalamus as a potential communication hub orchestrating this coordination, hypothesizing a detailed circuit mechanism to achieve this integration.

The thalamus is well-known for its topographic and reciprocal connections with the neocortex, as well as its projections to the BG^95,127. While traditionally viewed as a relay station for sensory information, recent research has revealed its involvement in diverse functions across sensory^52,98,53,128, cognitive^{1,43,44,45,97} and motor domains^96,100. The convergence of inputs onto the thalamus and its diverse modulation of cortical and BG circuits position it ideally as a locus of plasticity for learning contextual states in model-based learning to coordinate between model-free and model-based systems.

In our model, PFC-MD circuits learn the contextual model of the environment and represent contexts in MD. This model-based learning component then modulates both plasticity and activities of the downstream model-free learning component, corticostriatal circuits, based on estimation and uncertainty of current contexts in MD. Lesioning MD disrupts this coordination, impairing the model’s ability to flexibly switch behaviors in dynamic environments. However, the lesioned model can still perform in a stationary environment, indicating MD is not involved directly in pure model-free learning. These observations underscore the pivotal role of PFC-MD circuits as the locus of model-based learning, utilizing the learned model of the world to modulate corticostriatal model-free learning and achieve flexible behaviors.

Brains provide different levels of specialized mechanisms for credit assignment

The role of dopamine innervation in the basal ganglia is well-established in carrying reward prediction error (RPE) signals to reinforce behaviors associated with unexpected rewards through synaptic plasticity mechanisms^{122,129,130,131}. However, decision-making in animals involves navigating through multiple cues, actions, and contexts, posing the challenge of appropriately assigning credits to the corresponding synaptic connections responsible for the unexpected rewards—a phenomenon termed credit assignment^{132,133,134,135}.

Traditional machine learning approaches, such as backpropagation, attempt to reinforce internal activity states leading to unexpected rewards¹³⁴. However, backpropagation relies on symmetric feedback weights and a separation of errors and activities, which are not observed in biological brains¹³⁵. Additionally, traditional artificial neural networks often struggle with crediting sensorimotor associations to the correct context across different contexts, leading to catastrophic forgetting^{136,137,138,139,140}.

To address these challenges, researchers have proposed a plethora of cellular, circuit, and system-level mechanisms for proper credit assignment^{141,142,143,144,145,146,147,148,149,150,151}. In our work, we integrate mechanisms at multiple levels to facilitate credit assignment.

At the cellular level, Hebbian-like learning in thalamocortical connections enables credit assignment by crediting associations to specific contexts only when the model is confident in its context inference. Circuit-level credit assignment is exemplified by dopamine-gated plasticity in the basal ganglia, where only corticostriatal connections corresponding to the chosen action undergo plasticity changes. This can be implemented by maintaining an eligibility trace from a motor action’s efference copy back to corticostriatal synapses.

Moreover, thalamocortical interactions via interneurons offer a system-level solution for credit assignment. In our model, the thalamus modulates cortical learning through cortical interneurons to correctly attribute sensorimotor associations to the appropriate context. PV neurons inhibit context-irrelevant cortical ensembles to prevent learning in the wrong context, while VIP neurons facilitate downstream learning when the model is confident in its inferred context.

These examples illustrate the brain’s utilization of diverse mechanisms operating at different levels to perform credit assignments effectively in complex natural environments.

CogLink network as a way to link molecular and behavioral changes in schizophrenia

Genetics are recognized as significant risk factors in schizophrenia¹⁵², and computational modeling has highlighted deficits in belief updating as a key aspect of the disorder^61,62,63,64. However, the intricate mechanisms bridging these genetic risk factors and belief updating deficits remain poorly understood. Our CogLink network, capable of linking mechanisms with normative behavior, creates a foundation to study these connections.

In constructing our schizophrenia model, we focus on a striatal D2R overexpression model because most antipsychotics targeting D2 receptors (D2Rs) are dopamine antagonists^75,76,77 and most SZ patients showed an elevated level of striatal D2Rs^78,79. We focus on the effects of striatal D2R overexpression on the PFC-MD circuit, given mounting evidence implicating alterations in these regions in schizophrenia pathology^{67,68,69,70,71,72,73,74}. Since the abundance of D2Rs increases the inhibition from BG to the thalamus, we model schizophrenia by reducing the excitability of MD neurons to mimic a high level of BG inhibition.

Our schizophrenia model replicates experimental findings in both patients and animal models, such as exploratory behavior following contextual switches and an elevated win-switch rate^16,62,81,82. The CogLink network further explains how circuit-level perturbation connects to these specific cognitive impairments. In particular, by examining the corresponding normative model, we can show that the model exhibits a much lower threshold for evidence accumulation, and the accumulation dynamic becomes leaky, indicating a strong bias toward environmental volatility. Additionally, decreased excitability in MD compromised the ability of PFC-MD connections to accurately learn the environmental model. To address this impairment, we applied current injections to MD to restore activity levels to a range conducive to Hebbian plasticity. Remarkably, the rescue model demonstrated reduced exploratory behavior following switches and exhibited a higher threshold for MD activity switching, indicative of a diminished bias towards environmental volatility. Moreover, the rescue model exhibited improved learning of the environmental model within its PFC-MD connections. These findings validate recent experiments in Schizophrenia-related animal models¹⁶ and demonstrate the utility of the CogLink network in computational psychiatry.

CogLink network vertically integrates and describes neural phenomena from different perspectives

Different modeling approaches offer distinct perspectives on understanding brain computation^153,154. Normative theories have traditionally elucidated animal behaviors and neural coding but often lack direct connections to lower-level neural correlates. In contrast, mechanistic models provide such links, allowing for testable predictions through symmetric perturbations on models and animals. However, understanding mechanistic models at the computational level can be challenging due to their complexity.

In this paper, our CogLink network aims to bridge the gap by constructing a mechanistic model capable of approximating normative models. By incorporating observed neural mechanisms into our model, we establish a direct connection to neural circuits. Simultaneously, approximating normative theories enables mathematical analysis, offering both quantitative and qualitative computational insights. Furthermore, our CogLink network provides distinct advantages in providing a model hypothesis. On the one hand, neural mechanisms provide strong biological prior to a normative model; on the other hand, connections to a normative model provide a guide to adjust the mechanistic parameters to achieve complex cognitive behaviors. We view this modeling approach as the initial step toward integrating Marr’s three levels of analysis-computational, algorithmic, and implementation levels. Furthermore, many neurological diseases have genetic origins along with cognitive symptoms. Since our modeling approach contains both mechanistic details and computational insights into behaviors, it can serve as a lens to study these neurological diseases.

Recently, population dynamic approaches have proven potent in uncovering underlying computations from electrophysiological data¹⁵⁵. However, these approaches often lack integration with connectivity and functional data information, limiting their ability to provide insights at the circuit level. In the future, we aim to develop a CogLink network that incorporates electrophysiological data as well as connectivity and functional data.

Methods

Model overview

Our model is specified by a differential equation governing the evolution of the neural activities (Equation (2.1)), a set of synaptic weights, and synaptic update rules (Equations (2.4), (2.11), (2.13)). In the subsequent section, we provide a more detailed specification of our model.

Basic CogLink model

This section presents the details of the basic CogLink model. Let A denote the number of alternatives, M represent the size of a premotor cortex ensemble, and K indicate the sparsity of cortical activities. In the basic CogLink model, there are A ensembles of premotor neurons. Within the ath ensemble, the premotor cortex activities ${{{{\bf{x}}}}}_{a}^{\,{\mbox{alm}}\,}\in {{\mathbb{R}}}^{M}$ evolve according to the following equation:

$${\tau }^{{{{\rm{alm}}}}}\frac{d{{{{\bf{x}}}}}_{a}^{\,{\mbox{alm}}\,}}{dt}=-{{{{\bf{x}}}}}_{a}^{\,{\mbox{alm}}\,}+{{{{\bf{W}}}}}_{a}^{\,{\mbox{alm}}\,}g({{{{\bf{x}}}}}_{a}^{\,{\mbox{alm}}\,})+I+0.2\frac{d{{{{\bf{B}}}}}_{t}}{dt}.$$

(4.1)

Here, the membrane time constant τ ^alm = 1/6, excitatory inputs I = K − 0.25, and recurrent synaptic weights ${{{{\bf{W}}}}}_{a}^{\,{\mbox{alm}}\,}\in {{\mathbb{R}}}^{M\times M}$ are defined as:

$${{{{\bf{W}}}}}_{a}^{\,{\mbox{alm}}\,}=\left[\begin{array}{cccc}0.75&-1&\cdots \,&-1\\ -1&0.75&\cdots \,&-1\\ \vdots &\vdots &&\vdots \\ -1&-1&\cdots \,&0.75,\end{array}\right].$$

(4.2)

The nonlinearity function $g:{\mathbb{R}}\to {\mathbb{R}}$ is defined as:

$$g(x)=\left\{\begin{array}{ll}1\quad &\,{{\mbox{if}}}\,x > 1\hfill \\ x\quad &\,{{\mbox{if}}}\,1\ge x > 0\\ 0\quad &\,{\mbox{otherwise}}\,\end{array}\right.$$

(4.3)

B_t represents a standard Brownian motion with identity covariance. The selection of the recurrent weights ${{{{\bf{W}}}}}_{a}^{\,{\mbox{alm}}\,}$ and inputs I is designed to implement K-WTA dynamics¹⁵⁶.

The premotor cortex then projects to the BG. The activities of the BG at the ath ensemble, ${{{{\bf{x}}}}}_{a}^{\,{\mbox{bg}}\,}\in {{\mathbb{R}}}^{M}$, evolve according to the following equation:

$${\tau }^{{{{\rm{bg}}}}}\frac{d{{{{\bf{x}}}}}_{a}^{\,{\mbox{bg}}\,}}{dt}=-{{{{\bf{x}}}}}_{a}^{\,{\mbox{bg}}\,}+{{{{\bf{V}}}}}_{a}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}\circ g\left({{{{\bf{x}}}}}_{a}^{\,{\mbox{alm}}\,}\right)$$

(4.4)

Here, the membrane time constant τ^bg = 0.1, and the premotor cortex-BG synapses, ${{{{\bf{V}}}}}_{a}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}\in {{\mathbb{R}}}^{M}$, are initialized with:

$$\forall a\in [A],m\in [M],\,{{{{\bf{V}}}}}_{a,m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}\leftarrow {\bar{{{{\bf{V}}}}}}_{a,m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}},\,\,{{\mbox{where}}}\,{\bar{{{{\bf{V}}}}}}_{a,m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}=\frac{m}{M}.$$

(4.5)

The BG then projects to the motor cortex, and the recurrent competitive dynamics of the motor cortex determine the action a_t at trial t. Specifically, the activities of the motor cortex, denoted as ${{{{\bf{x}}}}}^{{{{\rm{mct}}}}}\in {{\mathbb{R}}}^{A}$, evolve according to the following equations:

$$\forall a\in [A],\,{{{{\bf{I}}}}}_{a}^{\,{\mbox{mct}}\,}={{{{\bf{W}}}}}_{a}^{\,{\mbox{bg/mct}}\,}{{{{\bf{x}}}}}_{a}^{bg}$$

(4.6)

and

$$\forall a\in [A],\,{\tau }^{{{{\rm{mct}}}}}\frac{d{{{{\bf{x}}}}}_{a}^{\,{\mbox{mct}}\,}}{dt}=-{{{{\bf{x}}}}}_{a}^{\,{\mbox{mct}}\,}+g\left(\mathop{\sum}_{i=1}^{A}{{{{\bf{W}}}}}_{a,i}^{\,{\mbox{mct}}\,}{{{{\bf{x}}}}}_{i}^{\,{\mbox{mct}}\,}\right)+{{{{\bf{I}}}}}_{a}^{\,{\mbox{mct}}\,}.$$

(4.7)

Here, the membrane time constant τ^mct = 1, and BG-motor cortex synapses that tuned to action a, ${{{{\bf{W}}}}}_{a}^{\,{\mbox{bg/mct}}\,}\in {{\mathbb{R}}}^{M}$, where ∀ m ∈ [M], a ∈ [A], ${{{{\bf{W}}}}}_{a,m}^{\,{\mbox{bg/mct}}\,}=1/K$. The recurrent synaptic weights, ${{{{\bf{W}}}}}^{{{{\rm{mct}}}}}\in {{\mathbb{R}}}^{A\times A}$, are defined as:

$${{{{\bf{W}}}}}^{{{{\rm{mct}}}}}=\left[\begin{array}{cccc}1&-1&\cdots \,&-1\\ -1&1&\cdots \,&-1\\ \vdots &\vdots &&\vdots \\ -1&-1&\cdots \,&1\end{array}\right].$$

(4.8)

The action a is chosen as a_t if either ${{{{\bf{x}}}}}_{a}^{\,{\mbox{mct}}\,}$ reaches the threshold = 1 within 5 s after the trial starts, or chosen stochastically from a softmax distribution, a_t~softmax(30x^mct), after that.

Once the model receives the reward r_t, it forms a distributional reward prediction error (RPE), denoted as ${{{\boldsymbol{\delta }}}}\in {{\mathbb{R}}}^{M}$:

$$\forall m\in [M],\,{{{{\boldsymbol{\delta }}}}}_{m}={r}_{t}-{{{{\bf{V}}}}}_{{a}_{t},m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}.$$

(4.9)

This RPE is then used to update the premotor cortex-BG synapses, ${{{{\bf{V}}}}}_{{a}_{t}}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}$, according to the equation:

$$\forall m\in [M],\,{{{{\bf{V}}}}}_{{a}_{t},m}^{\,{\mbox{alm}}/{\mbox{bg}}\,}\leftarrow {{{{\bf{V}}}}}_{{a}_{t},m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}+{{{{\boldsymbol{\eta }}}}}_{{a}_{t}}{{{{\boldsymbol{\delta }}}}}_{m}.$$

(4.10)

Here, the learning rate, η_a, is defined as follows: ${{{{\boldsymbol{\eta }}}}}_{a}=\frac{1}{7+{N}_{a}}$, where N_a represents the count of the number of times a was chosen up to trial t.

For the KO-sparseness model, we let K = 80, and for KO-distributional-RPE model, we let M = 1.

The simulation is conducted by discretizing the differential equation using dt = 0.005.

Approximation of the basic CogLink model to an algorithm with an analysis of the algorithm

In this section, we approximate the basic CogLink as an algorithm and conduct a mathematical analysis of its performance.

The stable fixed points ${S}_{a}^{\,{\mbox{alm}}\,}$ of the premotor cortex dynamics at the ath ensemble (see Equation (4.1)) are defined as:

$${S}_{a}^{\,{{\mbox{alm}}}\,}=\left\{{{{\bf{x}}}}:{{{\bf{x}}}}\in {{\mathbb{R}}}^{M},\,| \,{{\mbox{supp}}}\,({{{\bf{x}}}})|=K,\,\forall i\in [M],{{{{\bf{x}}}}}_{i}=1.5,\,{{\mbox{if}}}\,{{{{\bf{x}}}}}_{i}\ne 0\right\}.$$

(4.11)

Here, supp(x) = {x_i∣x_i ≠ 0}. Assuming the K-WTA sampling dynamic at the premotor cortex occurs instantaneously (i.e., τ^alm is small), the premotor cortex dynamic ${{{{\bf{x}}}}}_{a}^{\,{\mbox{alm}}\,}$ converges to one of the fixed points above for each ensemble. As the network is symmetric, it uniformly converges to one of the fixed points ${\hat{{{{\bf{x}}}}}}_{a}^{\,{\mbox{alm}}\,}$, i.e., ${\hat{{{{\bf{x}}}}}}_{a}^{\,{\mbox{alm}}} \sim {\mbox{unif}}({S}_{a}^{{\mbox{alm}}\,})$.

Similarly, assuming the BG dynamic (see Equation (4.4)) occurs instantaneously (i.e., τ^bg is small), we obtain:

$${{{{\bf{x}}}}}_{a}^{\,{\mbox{bg}}\,}={{{{\bf{V}}}}}_{a}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}\circ g({{{{\bf{x}}}}}_{a}^{\,{\mbox{alm}}\,})={{{{\bf{V}}}}}_{a}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}\circ g({\hat{{{{\bf{x}}}}}}_{a}^{\,{\mbox{alm}}\,}).$$

(4.12)

From Equation (4.6), we also have

$$\forall a\in [A],\,{{{{\bf{I}}}}}_{a}^{\,{\mbox{mct}}\,} ={{{{\bf{V}}}}}_{a}^{\,{\mbox{bg/mct}}\,}\cdot {{{{\bf{V}}}}}_{a}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}\circ f({\hat{{{{\bf{x}}}}}}_{a}^{\,{\mbox{alm}}}) \\ =\frac{1}{K}{\sum}_{m\in {\mbox{supp}}\,({\hat{{{{\bf{x}}}}}}_{a}^{\,{\mbox{alm}}\,})}{{{{\bf{V}}}}}_{a,m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}.$$

(4.13)

${{{{\bf{I}}}}}_{a}^{\,{\mbox{mct}}\,}$ is then the sample value ${\hat{v}}_{a}$ in our algorithm (Equation (2.6)).

Finally, assuming the WTA motor dynamic (see Equation (4.7)) occurs instantaneously (i.e., τ^mct is small), the motor cortex dynamic outputs action A_t:

$${a}_{t}=\arg {\max }_{a}{{{{\bf{I}}}}}_{a}^{\,{\mbox{mct}}\,}$$

(4.14)

This corresponds to Equation (2.7) in the algorithm.

The regret of the algorithm is nearly optimal

In this section, we analyze the performance of the algorithm. To simplify the notation for analysis, we present the pseudo-code of the algorithm and introduce a few notation changes (see Algorithm 1).

Algorithm 1

Algorithmic form of the CogLink model

1: Input parameters $A,K,M,{\left\{{{{{\boldsymbol{\eta }}}}}_{(t)}\right\}}_{t\in [T]},{\left\{{\bar{{{{\bf{V}}}}}}_{a,m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}\right\}}_{a\in [A],m\in [M]}$

2: For all a ∈ [A], for m ∈ [M], initialize N_1,a = 1 and ${{{{\bf{v}}}}}_{1,a,m}={\bar{{{{\bf{V}}}}}}_{a,m}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}$

3: for trial t = 1, …, T do

4: for action a = 1, …, A do

5: Let ${{{{\mathcal{V}}}}}_{t,a}$ be the uniform distribution over ${\left\{\frac{1}{K}{\sum }_{j=1}^{K}{{{{\bf{v}}}}}_{t,a,{i}_{j}}\right\}}_{1\le {i}_{1} < \ldots < {i}_{K}\le M}$

6: Sample ${\hat{{{{\bf{v}}}}}}_{t,a} \sim {{{{\bf{V}}}}}_{t,a}$

7: Output action ${a}_{t}\leftarrow \arg {\max }_{a}{\hat{{{{\bf{v}}}}}}_{t,a}$

8: Receive reward r_t

9: ${{{{\boldsymbol{\delta }}}}}_{t}\leftarrow {r}_{t}-{{{{\bf{v}}}}}_{t,{a}_{t}}$

10: if a = a_t then

11: N_(t+1),a ← N_t,a + 1

12: ${{{{\bf{v}}}}}_{(t+1),a}\leftarrow {{{{\bf{v}}}}}_{t,a}+{\eta }_{({N}_{t,a})}{{{{\boldsymbol{\delta }}}}}_{t}$

13: else

14: N_(t+1),a ← N_t,a

15: v_(t+1),a ← v_t,a

Let μ_i represent the probability of receiving rewards for choosing action i. Without loss of generality, let action 1 denote the optimal action. Define Δ_i = μ₁ − μ_i and $D=\frac{{{{{\mathbf{\Delta }}}}}_{1}}{{{{{\mathbf{\Delta }}}}}_{2}}$. The primary objective of this section is to establish the following theorem:

Theorem 2

Let K = 1, and for all a ∈ [A] and m ∈ [M], ${\bar{{{{\bf{V}}}}}}_{a,m}^{\,{\mbox{alm/bg}}\,}=\frac{Cm}{M}$, where $C=\frac{16\log (2ATD{{{{\mathbf{\Delta }}}}}_{2}^{2}\log T)}{{{{{\mathbf{\Delta }}}}}_{2}}$. Additionally, let ${\eta }_{(t)}=\frac{1}{1+t}$ for all t ∈ [T]. Under these conditions, the regret of this algorithm is bounded by $\sqrt{324ATD\log (2ATD{{{{\mathbf{\Delta }}}}}_{2}^{2}\log T)}$.

Let M denote the number of neurons in each ensemble, and let v_tam represent the synaptic strength of the mth neuron at the ath ensemble at trial t. The learning rate after an action is chosen t times is denoted by η_t. Additionally, N_t,a denotes the number of times action a has been chosen at the end of trial t, and T_n,a represents the trial when action a has been chosen for the nth time. If action $\hat{a}$ is chosen at trial t, we employ the standard reward prediction error update:

$${{{{\bf{v}}}}}_{(t+1),\hat{a},m}={{{{\bf{v}}}}}_{t,\hat{a},m}+{\eta }_{({N}_{t\hat{a}})}{{{{\boldsymbol{\delta }}}}}_{t},\,{{{{\boldsymbol{\delta }}}}}_{t}={r}_{t}-{{{{\bf{v}}}}}_{t,\hat{a},m}.$$

(4.15)

And for $a\ne \hat{a}$, v_(t+1),a,m = v_t,a,m. One can conceive of this ensemble of synapses as a quantile distribution ${{{{\mathcal{V}}}}}_{t,a}$ representing the values for each action. Each ensemble randomly samples ${\hat{{{{\bf{v}}}}}}_{t,a}$ from this quantile distribution ${{{{\mathcal{V}}}}}_{t,a}$ by selecting K synaptic strengths uniformly at random and averaging them:

$${\hat{{{{\bf{v}}}}}}_{t,a} \sim {{{{\mathcal{V}}}}}_{t,a}.$$

(4.16)

The action is then chosen based on the values of the samples through a mutual competition process:

$${a}_{t}=\arg {\max }_{a}{\hat{{{{\bf{v}}}}}}_{t,a}.$$

(4.17)

By recursively expanding Equation (4.15), we obtain:

$${{{{\bf{v}}}}}_{(t+1),a,m}=\left(\mathop{\prod}_{n=1}^{{N}_{t,a}}(1-{\eta }_{(n)})\right)\left({{{{\bf{v}}}}}_{0,a,m}+{\sum}_{j=1}^{{N}_{t,a}}{\eta }_{(\,\,j)}{\prod}_{n=1}^{j}{(1-{\eta }_{(n)})}^{-1}{r}_{{T}_{n,a}}\right).$$

(4.18)

For the theoretical analysis of this circuit, we consider the following simple setting: let K = 1, and for all a ∈ [A], ${{{{\bf{v}}}}}_{0,a,m}=\frac{Cm}{M}$, where C > 0 is a constant we will define later. For all m ∈ [M] and t ∈ [T], let ${\eta }_{(t)}=\frac{1}{1+t}$. By substituting these conditions into the equation, we obtain:

$${{{{\bf{v}}}}}_{(t+1)am}=\frac{Cm}{({N}_{t,a}+1)M}+\frac{1}{{N}_{t,a}+1}{\sum}_{n=1}^{{N}_{t,a}}{r}_{{T}_{n,a}}.$$

(4.19)

Now, we aim to bound the expectation of N_t,a for a ≠ 1. Demonstrating that the model selects suboptimal actions infrequently implies small regret. Given any $\epsilon \in {\mathbb{R}}$, we define ${E}_{a}(t)=\{{\hat{{{{\bf{v}}}}}}_{ta}\le {{{{\boldsymbol{\mu }}}}}_{1}-\epsilon \}$. We establish the following stopping time to capture the event when rewards are concentrated around the mean:

$$\tau=\inf \left\{t:\exists a\in [A],\,\left | t \frac{1}{{N}_{ta}}{\sum}_{n=1}^{{N}_{t,a}}{r}_{{T}_{n,a}}-{{{{\boldsymbol{\mu }}}}}_{a}\right | > \sqrt{\frac{1}{{N}_{t,a}}\log \frac{2A\log T}{\delta }}\right\}.$$

(4.20)

Applying the maximal Hoeffding inequality yields:

$$P\left(\exists t\in [T],\left | \frac{1}{{N}_{t,a}}{\sum}_{n=1}^{{N}_{t,a}}{r}_{{T}_{n,a}}-{{{{\boldsymbol{\mu }}}}}_{i}\right | > \sqrt{\frac{1}{{N}_{t,a}}\log \frac{2}{{\delta }^{{\prime} }}}\right) < {\delta }^{{\prime} }.$$

(4.21)

By union bounding over all intervals and actions:

$$P\left(\exists t\in [T],\exists a\in [A],\left | \frac{1}{{N}_{t,a}}{\sum}_{n=1}^{{N}_{t,a}}{r}_{{T}_{n,a}}-{{{{\boldsymbol{\mu }}}}}_{a}\right | > \sqrt{\frac{1}{{N}_{t,a}}\log \frac{2}{{\delta }^{{\prime} }}}\right) < {\delta }^{{\prime} }A\log T.$$

(4.22)

Setting ${\delta }^{{\prime} }=\frac{\delta }{A\log T}$, we obtain:

$$P\left(\tau < T\right) < \delta .$$

Now, let’s bound the expectation of N_t,a using this stopping time. We have:

$${\mathbb{E}}\left[{N}_{T,a}\right] \le {\mathbb{E}}\left[{\sum}_{t=1}^{T}{{{\bf{1}}}}\{{a}_{t}=a\}\right]\\ ={\mathbb{E}}\left[{\sum}_{t=1}^{T\wedge \tau }{{{\bf{1}}}}\{{a}_{t}=a\}\right]\\ ={\mathbb{E}}\left[{\sum}_{t=1}^{T\wedge \tau }{{{\bf{1}}}}\{{a}_{t}=a\}{{{\bf{1}}}}\{\tau \ge T\}\right]+{\mathbb{E}}\left[{\sum}_{t=1}^{T\wedge \tau }{{{\bf{1}}}}\{{a}_{t}=a\}{{{\bf{1}}}}\{\tau < T\}\right]\\ \le {\mathbb{E}}\left[{\sum}_{t=1}^{T}{{{\bf{1}}}}\{{a}_{t}=a\}{{{\bf{1}}}}\{\tau \ge T\}\right]+TP(\tau < T)\\ \le {\mathbb{E}}\left[{\sum}_{t=1}^{T}{{{\bf{1}}}}\{{a}_{t}=a\}{{{\bf{1}}}}\{\tau \ge T\}\right]+T\delta .$$

(4.23)

Now, let’s decompose the first term as follows:

$$\begin{array}{rc}&{\mathbb{E}}\left[{\sum }_{t=1}^{T}{{{\bf{1}}}}\{{a}_{t}=a\}{{{\bf{1}}}}\{\tau \ge T\}\right]\\ &={\mathbb{E}}\left[{\sum }_{t=1}^{T}{{{\bf{1}}}}\{{a}_{t}=a,{E}_{a}(t)\}{{{\bf{1}}}}\{\tau \ge T\}\right]+{\mathbb{E}}\left[{\sum }_{t=1}^{T}{{{\bf{1}}}}\{{a}_{t}=a,{E}_{a}^{c}(t)\}{{{\bf{1}}}}\{\tau \ge T\}\right].\end{array}$$

(4.24)

Let ${a}_{t}^{{\prime} }=\arg {\max }_{a\ne 1}{\hat{{{{\bf{v}}}}}}_{t,a}$, and let F_t,a be the cumulative distribution function of ${{{{\mathcal{V}}}}}_{t,a}$ conditioning on τ ≥ t. To bound the first term, let’s examine:

$$\begin{array}{rc}P\left({a}_{t}=a,{E}_{a}(t)| \tau \ge t,{{{{\mathcal{F}}}}}_{t-1}\right)&\le {F}_{t,1}(\,{{{{\boldsymbol{\mu }}}}}_{1}-\epsilon )P\left({a}_{t}^{{\prime} }=a,{E}_{a}(t)| \tau \ge t,{{{{\mathcal{F}}}}}_{t-1}\right)\\ &\le \frac{{F}_{t,1}(\,{{{{\boldsymbol{\mu }}}}}_{1}-\epsilon )}{1-{F}_{t,1}(\,{{{{\boldsymbol{\mu }}}}}_{1}-\epsilon )}P\left({a}_{t}=1,{E}_{a}(t)| \tau \ge t,{{{{\mathcal{F}}}}}_{t-1}\right).\end{array}$$

(4.25)

Now, we have:

$$ \quad{\mathbb{E}}\left[{\sum}_{t=1}^{T}{{{\bf{1}}}}\{{a}_{t}=a,{E}_{a}(t)\}{{{\bf{1}}}}\{\tau \ge T\}\right]\\ ={\mathbb{E}}\left[{\sum}_{t=1}^{T}P\left({a}_{t}=a,{E}_{a}(t)| \tau \ge t,{{{{\mathcal{F}}}}}_{t-1}\right)\right]\\ \le {\mathbb{E}}\left[{\sum}_{t=1}^{T}\frac{{F}_{t,1}({\mu }_{1}-\epsilon )}{1-{F}_{t,1}({\mu }_{1}-\epsilon )}P\left({a}_{t}=1,{E}_{a}(t)| \tau \ge t,{{{{\mathcal{F}}}}}_{t-1}\right)\right]\\ \le {\mathbb{E}}\left[{\sum}_{t=1}^{T}\frac{{F}_{{T}_{t,1}1}({{{{\boldsymbol{\mu }}}}}_{1}-\epsilon )}{1-{F}_{{T}_{t,1}1}({{{{\boldsymbol{\mu }}}}}_{1}-\epsilon )}\right].$$

(4.26)

Note that from Equation (4.19) and Equation (4.20), we have, conditioning on τ ≥ t:

$$ \frac{{N}_{t,a}}{{N}_{t,a}+1}\left(\frac{Cm}{{N}_{t,a}M}+{{{{\mathbf{\mu }}}}}_{a}-\sqrt{\frac{1}{{N}_{t,a}}\log \frac{2A\log T}{\delta }}\right)\le {{{{\bf{v}}}}}_{t,a,m}\\ \le \frac{{N}_{t,a}}{{N}_{t,a}+1}\left(\frac{Cm}{{N}_{t,a}M}+{{{{\mathbf{\mu }}}}}_{a}+\sqrt{\frac{1}{{N}_{t,a}}\log \frac{2A\log T}{\delta }}\right).$$

(4.27)

Notice that if ${N}_{t,1}\ge \frac{4\log \frac{2A\log T}{\delta }}{{\epsilon }^{2}}$, then we have

$$\forall m,\,{{{{\bf{v}}}}}_{t,1,m}\ge {{{{\boldsymbol{\mu }}}}}_{1}-\epsilon .$$

(4.28)

If ${N}_{t,1} < \frac{4\log \frac{2A\log T}{\delta }}{{\epsilon }^{2}}$ and $C\ge \frac{8\log \frac{2A\log T}{\delta }}{\epsilon }$, then

$$\forall m\ge \frac{M}{2},\,\frac{Cm}{{N}_{t,1}M}-\sqrt{\frac{1}{{N}_{t,1}}\log \frac{2A\log T}{\delta }} > 0.$$

(4.29)

This implies that ${F}_{{T}_{t,1}1}({{{{\boldsymbol{\mu }}}}}_{1}-\epsilon )\le \frac{1}{2}$, hence

$$\frac{{F}_{{T}_{1,t},1}({{{{\boldsymbol{\mu }}}}}_{1}-\epsilon )}{1-{F}_{{T}_{1,t},1}({{{{\boldsymbol{\mu }}}}}_{1}-\epsilon )}\le 1.$$

(4.30)

Consequently,

$${\mathbb{E}}\left[{\sum}_{t=1}^{T}\frac{{F}_{{T}_{1,t},1}({{{{\boldsymbol{\mu }}}}}_{1}-\epsilon )}{1-{F}_{{T}_{1,t},1}({{{{\boldsymbol{\mu }}}}}_{1}-\epsilon )}\right]\le \frac{4\log \frac{2A\log T}{\delta }}{{\epsilon }^{2}}.$$

(4.31)

Similarly, we can bound the second term:

$$ \,\,\,\,\,{\mathbb{E}}\left[{\sum }_{t=1}^{T}{{{\bf{1}}}}\{{a}_{t}=a,{E}_{a}^{c}(t)\}{{{\bf{1}}}}\{\tau \ge T\}\right]\\ \le {\mathbb{E}}\left[{\sum }_{t=1}^{T}(1-{F}_{ta}({{{{\boldsymbol{\mu }}}}}_{1}-\epsilon )){{{\bf{1}}}}\{{a}_{t}=a\}\right]\\ \le {\mathbb{E}}\left[{\sum }_{t=1}^{T}(1-{F}_{{T}_{ta}a}({{{{\boldsymbol{\mu }}}}}_{1}-\epsilon ))\right].$$

(4.32)

By Equation (4.27), if ${N}_{t,a} > \frac{16D\log \frac{2A\log T}{\delta }}{{({{{{\mathbf{\Delta }}}}}_{a}-\epsilon )}^{2}}$ and $C\le \frac{8D\log \frac{2A\log T}{\delta }}{{\Delta }_{a}-\epsilon }$, then ${F}_{{T}_{t,a},a}(\,{{{{\boldsymbol{\mu }}}}}_{1}-\epsilon )=1$. Hence,

$${\mathbb{E}}\left[{\sum}_{t=1}^{T}(1-{F}_{{T}_{t,a},a}({{{{\boldsymbol{\mu }}}}}_{1}-\epsilon ))\right]\le \frac{16D\log \frac{2A\log T}{\delta }}{{({{{{\mathbf{\Delta }}}}}_{a}-\epsilon )}^{2}}.$$

(4.33)

Now, let’s set $\epsilon=\frac{{{{{\mathbf{\Delta }}}}}_{a}}{2}$, $\delta=\frac{1}{TD{{{{\mathbf{\Delta }}}}}_{2}^{2}}$, and $C=\frac{16\log \frac{2A\log T}{\delta }}{{{{{\mathbf{\Delta }}}}}_{2}}$. This satisfies the condition for C:

$$\frac{4D\log \frac{2A\log T}{\delta }}{{{{{\mathbf{\Delta }}}}}_{a}-\epsilon }\ge C\ge \frac{8\log \frac{2A\log T}{\delta }}{\epsilon }.$$

(4.34)

Combining Equation (4.23), Equation (4.31), and Equation (4.33), we find:

$$\begin{array}{rc}{\mathbb{E}}[{N}_{T,a}]&\le \frac{(64D+16)\log (2ATD{{{{\mathbf{\Delta }}}}}_{2}^{2}\log T)}{{{{\mathbf{\Delta}}}_{a}^{2}}}+\frac{1}{D{{{{\mathbf{\Delta }}}}}_{2}^{2}}\\ &\le \frac{81D\log (2ATD{{{{\mathbf{\Delta }}}}}_{2}^{2}\log T)}{{{{{\mathbf{\Delta }}}}}_{a}^{2}}.\hfill\end{array}$$

(4.35)

Now, let’s bound the regret:

$${R}_{T}={\sum}_{a=1}^{A}{{{{\mathbf{\Delta }}}}}_{a}{\mathbb{E}}[{N}_{T,a}]\\ \le {\sum}_{a=1}^{A}\frac{81D\log (2ATD{{{{\mathbf{\Delta }}}}}_{2}^{2}\log T)}{{{{{\mathbf{\Delta }}}}}_{a}}.$$

(4.36)

For any Δ > 0, we can divide the sum as follows.

$$\begin{array}{rc}&={\sum }_{a:{{{{\mathbf{\Delta }}}}}_{a} < \Delta }^{A}\frac{81D\log (2ATD{{{{\mathbf{\Delta }}}}}_{2}^{2}\log T)}{{{{{\mathbf{\Delta }}}}}_{a}}+{\sum }_{a:{{{{\mathbf{\Delta }}}}}_{a}\ge \Delta }^{A}\frac{81D\log (2ATD{{{{\mathbf{\Delta }}}}}_{2}^{2}\log T)}{{{{{\mathbf{\Delta }}}}}_{a}}\\ &\le T\Delta+\frac{81AD\log (2ATD{{{{\mathbf{\Delta }}}}}_{2}^{2}\log T)}{\Delta }.\end{array}$$

(4.37)

By geometric inequality, we have

$$\le \sqrt{324ATD\log (2ATD{{{{\mathbf{\Delta }}}}}_{2}^{2}\log T)},$$

(4.38)

as desired.

Specifically, when A = 2, we can present the following simplified theorem.

Theorem 3

Let K = 1 and Δ = ∣μ₁ − μ₂∣ and $\forall a\in [2],\,{v}_{0am}=\frac{Cm}{M}$ where $C=\frac{16\log (4T{\Delta }^{2}\log T)}{\Delta }$. For all m ∈ [M], t ∈ [T], let ${\eta }_{(t)}=\frac{1}{1+t}$. Then the regret of this algorithm is bounded by $36\sqrt{T\log (4T\Delta \log T)}$.

Details on the augmented CogLink

This section provides details of the augmented CogLink model. At its core, the model comprises the PFC-MD-like circuit for contextual inferences and copies of basic CogLink models for dynamically switching behavioral strategies based on the inferred context.

At trial t, the prefrontal cortex activities ${{{{\bf{x}}}}}^{{{{\rm{pfc}}}}}\in {{\mathbb{R}}}^{A\times 2}$ jointly encode actions and rewards at the last trial, with the following formulation:

$${{{{\bf{x}}}}}_{a,r}^{\,{\mbox{pfc}}\,}=\left\{\begin{array}{cc}1,&{{\mbox{if}}}\,a={a}_{t-1},r={r}_{t-1}\\ 0,\,&{{\mbox{otherwise}}}\,\hfill\end{array}\right..$$

(4.39)

To form a line attractor in MD, we consider the following thalamocortical loop:

The MD activities, denoted as ${{{{\bf{x}}}}}^{{{{\rm{md}}}}}\in {{\mathbb{R}}}^{2}$, evolve according to the equation:

$${\tau }^{{{{\rm{eff}}}}}\frac{d{{{{\bf{x}}}}}^{{{{\rm{md}}}}}}{dt}=-\frac{2{{{{\bf{x}}}}}^{{{{\rm{md}}}}}}{D}+{\beta }_{{{{\rm{d2}}}}}\left({f}_{{{{\rm{md}}}}}\left(\frac{1}{2}{{{{\bf{x}}}}}^{{{{\rm{fc}}}}}-\frac{1}{4}{x}^{{{{\rm{trn}}}}}\right)+{{{{\bf{I}}}}}^{{{{\rm{pfc/md}}}}}\right)+{{{{\bf{I}}}}}^{{{{\rm{rescue}}}}},$$

(4.40)

the frontal cortex activities, denoted as ${{{{\bf{x}}}}}^{{{{\rm{fc}}}}}\in {{\mathbb{R}}}^{2}$, evolve according to the equation:

$${\tau }^{{{{\rm{fc}}}}}\frac{d{{{{\bf{x}}}}}^{{{{\rm{fc}}}}}}{dt}=-{{{{\bf{x}}}}}^{{{{\rm{fc}}}}}+{{{{\bf{x}}}}}^{{{{\rm{md}}}}},$$

(4.41)

and the TRN activity, denoted as ${x}^{{{{\rm{trn}}}}}\in {\mathbb{R}}$, evolves according to the equation:

$${\tau }^{{{{\rm{trn}}}}}\frac{d{x}^{{{{\rm{trn}}}}}}{dt}=-{x}^{{{{\rm{trn}}}}}+{\sum}_{i=1}^{2}{{{{\bf{x}}}}}_{i}^{\,{\mbox{md}}\,}.$$

(4.42)

Here, τ^eff = 5 represents the effective time constant for accumulation dynamics, τ^fc = τ^trn = 0.1 represents the membrane time constant of frontal neurons and TRN neurons, and D = 4 signifies the threshold for accumulation dynamics.

The nonlinearity function, ${f}_{{{{\rm{md}}}}}:{\mathbb{R}}\to {\mathbb{R}},$ is defined as

$${f}_{{{{\rm{md}}}}}(x)=\left\{\begin{array}{ll}x+1\quad &\,{{\mbox{if}}}\,-1\le x\le 1\\ 2\quad &\,{{\mbox{if}}}\,x > 1\hfill\\ 0\quad &\,{\mbox{otherwise}}\,\hfill\end{array}\right.,$$

(4.43)

and the PFC-MD inputs, denoted as ${{{{\bf{I}}}}}^{{{{\rm{pfc/md}}}}}\in {{\mathbb{R}}}^{2}$, are given by

$$\forall c\in [2],\,{{{{\bf{I}}}}}_{c}^{\,{\mbox{pfc/md}}}={f}_{{{{\rm{pfc}}}}}({{{{\bf{W}}}}}_{c}^{\,{\mbox{pfc/md}}\,}\cdot {{{{\bf{x}}}}}^{{{{\rm{pfc}}}}}).$$

Here, ${{{{\bf{W}}}}}_{c}^{\,{\mbox{pfc/md}}\,}\in {{\mathbb{R}}}^{A\times 2}$ represents PFC-MD connections projecting to MD neurons tuned to context c, and ⋅ signifies the matrix inner product. Additionally, ${f}_{{{{\rm{pfc}}}}}(x)={[2.7+\log (x)]}_{+}$, β_d2 = 0.85 in the D2R hyperactivation model and the rescue model, β_d2 = 1 otherwise, and I^rescue = 0.45 in the rescue model and I^rescue = 0 otherwise. We set x^md = 0 for the MD inhibition model throughout the experiment.

We update the PFC-MD connections through Hebbian learning as follows:

$$\forall c \in [2],a\in [A],r\in [2],\,\Delta {{{{\bf{W}}}}}_{c,a,r}^{\,{\mbox{pfc/md}}\,}={{{{\boldsymbol{\eta }}}}}_{c}\left({f}_{{{{\rm{hebb}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{md}}\,}){{{{\bf{x}}}}}_{a,r}^{\,{\mbox{pfc}}}-{f}_{{{{\rm{hebb}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{md}}\,}){{{{\bf{W}}}}}_{c,a,r}^{\,{\mbox{pfc/md}}\,}\right)$$

(4.44)

Here, ${{{{\bf{W}}}}}_{c,a,r}^{\,{\mbox{pfc/md}}\,}$ represents the synapse ${f}_{{{{\rm{hebb}}}}}:{\mathbb{R}}\to {\mathbb{R}}$ denotes the sigmoidal nonlinearity function,

$${f}_{{{{\rm{hebb}}}}}(x)={\left[\frac{1-{e}^{8-4(x-4)}}{1+{e}^{8-4(x-4)}}\right]}_{+}$$

(4.45)

The learning rate is determined by $\forall c\in [2],a\in [A],{{{{\boldsymbol{\eta }}}}}_{c}\in {{\mathbb{R}}}^{2}$, given by ${{{{\boldsymbol{\eta }}}}}_{c}=\frac{{f}_{{{{\rm{hebb}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{md}}\,})}{4+{N}_{c}}$, where N_c represents a rolling sum of ${f}_{{{{\rm{hebb}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{md}}\,})$, and is updated as ${N}_{c}\leftarrow {N}_{c}+{f}_{{{{\rm{hebb}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{md}}\,})$. For the KO-nonlinear Hebb, we replace f_hebb with a linear function

$${f}_{{{{\rm{KO}}}}{\mbox{-}}{{{\rm{hebb}}}}}(x)={\left[\frac{x-4}{4}\right]}_{+}.$$

(4.46)

MD neurons then modulate the downstream d-CS models via interneuron-mediated pathways. Specifically, the interneuron activities are defined as

$$\forall c\in [2],\,{\tau }_{{{{\rm{vip}}}}}{{{{\bf{x}}}}}_{c}^{\,{\mbox{vip}}\,}=-{{{{\bf{x}}}}}_{c}^{\,{\mbox{vip}}\,}+{{{{\bf{x}}}}}_{c}^{\,{\mbox{md}}\,}$$

(4.47)

and

$$\forall c\in [2],\,{\tau }_{{{{\rm{pv}}}}}{{{{\bf{x}}}}}_{c}^{\,{\mbox{pv}}\,}=-{{{{\bf{x}}}}}_{c}^{\,{\mbox{pv}}\,}+{{{{\bf{x}}}}}_{\bar{c}}^{\,{\mbox{md}}\,}$$

(4.48)

Here, the interneuron membrane time constant is τ_vip = τ_pv = 0.1, and $\bar{c}$ represents a context different from c. These activities modulate the downstream d-CS models as follows:

$$\forall c\in [2],a\in [A],\,{\tau }^{{{{\rm{bg}}}}}\frac{d{{{{\bf{x}}}}}_{c,a}^{\,{\mbox{bg}}\,}}{dt}=-{{{{\bf{x}}}}}_{c,a}^{\,{\mbox{bg}}\,}+f({{{{\bf{x}}}}}_{c,a}^{\,{\mbox{alm}}})\circ {f}_{{{{\rm{in}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{vip}}\,}-{{{{\bf{x}}}}}_{c}^{\,{\mbox{pv}}\,}){{{{\bf{V}}}}}_{c,a}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}.$$

(4.49)

where

$${f}_{{{{\rm{in}}}}}(x)={\left[\frac{1-{e}^{-2(x+0.25)}}{1+{e}^{-2(x+0.25)}}\right]}_{+}.$$

(4.50)

For KO-interneuron gating, we replace f_in with direct MD modulation

$$\forall c\in [2],a\in [A],\,{\tau }^{{{{\rm{bg}}}}}\frac{d{{{{\bf{x}}}}}_{c,a}^{\,{\mbox{bg}}\,}}{dt}=-{{{{\bf{x}}}}}_{c,a}^{\,{\mbox{bg}}\,}+f({{{{\bf{x}}}}}_{c,a}^{\,{\mbox{alm}}})\circ {f}_{{{{\rm{KO}}}}{\mbox{-}}{{{\rm{in}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{md}}\,}){{{{\bf{V}}}}}_{c,a}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}.$$

(4.51)

where

$${f}_{{{{\rm{KO}}}}{\mbox{-}}{{{\rm{in}}}}}={\left[\frac{x-4}{4}\right]}_{+}.$$

(4.52)

These interneuron-mediated pathways also modulate plasticity. Specifically,

$$\forall c\in [2],{{{{\boldsymbol{\delta }}}}}_{c}={r}_{t}-{{{{\bf{V}}}}}_{c,{a}_{t}}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}},\,\Delta {{{{\bf{V}}}}}_{c,{a}_{t}}^{{{{\rm{alm}}}}/{{{\rm{bg}}}}}={{{{\boldsymbol{\eta }}}}}_{c,{a}_{t}}{f}_{{{{\rm{in}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{vip}}\,}-{{{{\bf{x}}}}}_{c}^{\,{\mbox{pv}}\,}){{{{\boldsymbol{\delta }}}}}_{c}.$$

(4.53)

where ${{{{\boldsymbol{\eta }}}}}_{c,a}=\frac{1}{4+{N}_{c,a}}$ and N_c,a is the rolling sum of $0.5*{{{{\bf{1}}}}}_{a={a}_{t}}\,{f}_{{{{\rm{in}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{vip}}\,}-{{{{\bf{x}}}}}_{c}^{\,{\mbox{pv}}\,})$, with ${N}_{a,c}\leftarrow {N}_{a,c}+0.5*{{{{\bf{1}}}}}_{a={a}_{t}}\,{f}_{{{{\rm{in}}}}}({{{{\bf{x}}}}}_{c}^{\,{\mbox{vip}}\,}-{{{{\bf{x}}}}}_{c}^{\,{\mbox{pv}}\,})$. The remainder of the model consists of two copies of the basic CogLink models.

Approximation of the thalamocortical model to an algorithm

In this section, we approximate the thalamocortical model as an algorithm and demonstrate its connection to the CUSUM algorithm^56,57. Additionally, we illustrate that the D2R hyperactive impaired model corresponds to a leaky evidence integrator.

We recall that the MD circuit can be described by (Equation (4.40), Equation (4.41), Equation (4.42)). Notice that by letting the dynamics of x^fc and x^trn be instantaneous, we can describe the effective MD circuit dynamics as follows:

$$\left\{\begin{array}{l}{\tau }^{{{{\rm{eff}}}}}\dot{{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}\,}}=-\frac{2{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}}}{D}+{f}_{{{{\rm{md}}}}}(w{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}\,}-w{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}\,})+{{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}\quad \\ {\tau }^{{{{\rm{eff}}}}}\dot{{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}\,}}=-\frac{2{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}}}{D}+{f}_{{{{\rm{md}}}}}(w{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}\,}-w{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}\,})+{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}\quad \end{array}\right..$$

(4.54)

Here, D = 4, and we will show that D represents the threshold of accumulation dynamics. τ ^eff = 5 represents the effective time constant for accumulation dynamics, τ ^md = τ ^effD/2 represents the membrane time constant, $w=\frac{1}{D}$, and ${{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,},{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}$ represent the PFC inputs to MD. The nonlinearity function is defined as:

$${f}_{{{{\rm{md}}}}}(x)=\left\{\begin{array}{ll}x+1,\quad &\,{{\mbox{for}}}\,-1\le x\le 1\\ 2,\quad &\,{{\mbox{for}}}\,x > 1\hfill\\ 0,\quad &\,{{\mbox{for}}}\,x < -1\hfill\\ \quad\end{array}\right..$$

(4.55)

Let $X={{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}\,}-{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}\,}$. Then, we have:

$${\tau }^{{{{\rm{eff}}}}}\frac{dX}{dt}={{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}-{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}},{{\mbox{when}}}\,| X| < D$$

(4.56)

and

$${\tau }^{{{{\rm{eff}}}}}\frac{dX}{dt}=-\frac{2X}{D}+{{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}-{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}}+2,{{\mbox{when}}}\,| X| > D.$$

(4.57)

If $| {{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}-{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}| \ll 1$, then at the stationary point, X remains approximately ±D. This corresponds to a drift-diffusion process with a threshold at ±D and inputs of ${{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}-{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}$. To be precise, we discretize the differential equation and threshold X at ±4 to derive the following algorithm:

$${X}_{t}=\min \left\{D,\max \left\{-D,{X}_{t-1}+\frac{dt}{{\tau }^{{{{\rm{eff}}}}}}(\underset{1}{\overset{\,{\mbox{pfc/md}}\,}{{{{\bf{I}}}}}}-\underset{2}{\overset{\,{\mbox{pfc/md}}\,}{{{{\bf{I}}}}}})\right\}\right\}.$$

(4.58)

We prove the following theorem:

Theorem 4

Let dt = τ^eff. If we set X₀ = −D, S_t = X_t + D and assume $| {{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}-{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}| \ll 1$, the evolution of X₀ approximates to

$${S}_{t}=\min \left(2D,\max \left(0,{S}_{t-1}+{{{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}}-{{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}}\right)\right)$$

We can prove the theorem by substituting the variable into Equation (4.58) and adding D to both sides:

$${S}_{t}=\min \left(2D,\max \left(0,{S}_{t-1}+{{{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}}-{{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}}\right)\right)$$

(4.59)

as desired. Notably, when ${{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}(t)=\log P({a}_{t},{r}_{t}| c=1)+\alpha,{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}(t)=\log P({a}_{t},{r}_{t}| c=2)+\alpha$ for any α > 0 and S_n < 2D, this corresponds exactly to the CUSUM algorithm:

$${S}_{t}=\max (0,{S}_{t-1}+\log P({a}_{t},{r}_{t}| c=1)-\log P({a}_{t},{r}_{t}| c=2)).$$

(4.60)

To analyze the impaired model, we recall the equations:

$$\left\{\begin{array}{l}{\tau }^{{{{\rm{eff}}}}}\dot{{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}\,}}=-\frac{2{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}}}{D}+{\beta }_{{{{\rm{d2}}}}}\,{f}_{{{{\rm{md}}}}}(w{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}\,}-w{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}})+{\beta }_{{{{\rm{d2}}}}}{{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}\quad \\ {\tau }^{{{{\rm{eff}}}}}\dot{{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}\,}}=-\frac{2{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}}}{D}+{\beta }_{d2}\,{f}_{{{{\rm{md}}}}}(w{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}\,}-w{{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}})+{\beta }_{{{{\rm{d2}}}}}{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}\quad \end{array}\right..$$

(4.61)

Let $X={{{{\bf{x}}}}}_{1}^{\,{\mbox{md}}\,}-{{{{\bf{x}}}}}_{2}^{\,{\mbox{md}}\,}$. Then, we have:

$${\tau }^{{{{\rm{eff}}}}}\frac{dX}{dt}=\frac{2}{D}({\beta }_{{{{\rm{d2}}}}}-1)X+{\beta }_{{{{\rm{d2}}}}}({{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}-{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}}),{{\mbox{when}}}\,| X| < D.$$

(4.62)

This indicates that the evidence accumulation dynamic is a leaky integrator. At the stationary point, we have

$$| X|=\frac{{\beta }_{{{{\rm{d2}}}}}D}{2(1-{\beta }_{{{{\rm{d2}}}}})}| \langle {{{{\bf{I}}}}}_{1}^{\,{\mbox{pfc/md}}\,}-{{{{\bf{I}}}}}_{2}^{\,{\mbox{pfc/md}}\,}\rangle | .$$

(4.63)

By plugging in the model learned by the impaired model in Fig. 7k, we have

$${\langle {{{{\bf{I}}}}}_{1}^{{{{\rm{pfc/md}}}}}-{{{{\bf{I}}}}}_{2}^{{{{\rm{pfc/md}}}}}\rangle }_{c=0} \approx {{\mathbb{E}}}_{{r}_{t}| c=0,{a}_{t}=0}\left[\log p({a}_{t}=0,{r}_{t},| c=0)-\log p({a}_{t}=0,{r}_{t},| c=1)\right]\\ \approx 0.7(\log 0.57-\log 0.49)+0.3(\log (1-0.57)-\log (1-0.49))$$

(4.64)

$$\approx 0.055.$$

(4.65)

So we have the threshold $X\approx \frac{0.055{\beta }_{{{{\rm{d2}}}}}D}{2(1-{\beta }_{{{{\rm{d2}}}}})}=0.62\ll 4=D$, consistent with Fig. 7i. This demonstrates that the impaired model has a much lower evidence accumulation threshold compared to the normal model, thereby inducing a strong prior on environmental volatility.

Details of other models

This section contains details on other models used in the paper. For Thompson sampling, let γ be the discounted factor. Initialize for all a ∈ [A], α_a = β_a = 0. Then, we sample from the posterior

$${\hat{{{{\bf{v}}}}}}_{a} \sim \,{\mbox{Beta}}\,({{{{\boldsymbol{\alpha }}}}}_{a}+1,{{{{\boldsymbol{\beta }}}}}_{a}+1).$$

(4.66)

We output action

$${a}_{t}\leftarrow \arg {\max }_{a}{\hat{{{{\bf{v}}}}}}_{a}.$$

(4.67)

and receive reward r_t. We then update the parameter

$$\forall a\in [A],{{{{\boldsymbol{\alpha }}}}}_{a}=\gamma {{{{\boldsymbol{\alpha }}}}}_{a}+{{{{\bf{1}}}}}_{a={a}_{t}}{{{{\bf{1}}}}}_{{r}_{t}=1},\,{{{{\boldsymbol{\beta }}}}}_{a}=\gamma {{{{\boldsymbol{\beta }}}}}_{a}+{{{{\bf{1}}}}}_{a={a}_{t}}{{{{\bf{1}}}}}_{{r}_{t}=0}$$

(4.68)

In all simulations in the paper, we use γ = 1 for Thompson sampling and γ = 0.93 for discounted Thompson sampling.

For the hidden Markov model with Thompson sampling, Initialize for all a ∈ [A], c ∈ [2], α_a,c = β_a,c = 0. We use HMM with known environmental parameters to infer the current contextual likelihood, ${{{{\bf{p}}}}}_{c}\in {{\mathbb{R}}}^{2}$

$${{{{\bf{p}}}}}_{c}=\,{\mbox{HMM}}\,({a}_{ < t},{r}_{ < t}).$$

(4.69)

And then we sample from the posterior,

$${\hat{{{{\bf{v}}}}}}_{a} \sim {{{{\bf{p}}}}}_{1}*\,{\mbox{Beta}}\,({{{{\boldsymbol{\alpha }}}}}_{a,1}+1,{{{{\boldsymbol{\beta }}}}}_{a,1}+1)+{{{{\bf{p}}}}}_{2}*\,{\mbox{Beta}}\,({{{{\boldsymbol{\alpha }}}}}_{a,2}+1,{{{{\boldsymbol{\beta }}}}}_{a,2}+1).$$

(4.70)

We output action

$${a}_{t}\leftarrow \arg {\max }_{a}{\hat{{{{\bf{v}}}}}}_{a}.$$

(4.71)

and receive reward r_t. We then update the parameter

$$\forall c\in [2],\,{{{{\boldsymbol{\alpha }}}}}_{{a}_{t},c}={{{{\boldsymbol{\alpha }}}}}_{{a}_{t},c}+{{{{\bf{p}}}}}_{c}{{{{\bf{1}}}}}_{1={r}_{t}},\,{{{{\boldsymbol{\beta }}}}}_{{a}_{t},c}={{{{\boldsymbol{\beta }}}}}_{{a}_{t},c}+{{{{\bf{p}}}}}_{c}{{{{\bf{1}}}}}_{0={r}_{t}}.$$

(4.72)

For the Deep Q-Network (DQN)²⁷, we use a multilayer perceptron with a hidden layer size of 10 and an ϵ-greedy exploration strategy. To balance exploration and exploitation, at trial t, given state s_t, we define ${{{\bf{N}}}}\in {{\mathbb{R}}}^{S}$, where S is the total number of states. The visit count for each state is updated as: ${{{{\bf{N}}}}}_{s}\leftarrow {{{{\bf{N}}}}}_{s}+{{{{\bf{1}}}}}_{s={s}_{t}}0.2$. The model explores uniformly at random with probability $\epsilon=\frac{1}{{{{{\bf{N}}}}}_{{s}_{t}}}$ and otherwise selects the action with the highest Q-value.

A-AFC task

This section contains details for the stationary A-AFC task. The task contains two parameters: the expected difference in reward probability between the most and the least rewarding actions (Δ) and the number of alternatives (A). The reward probability θ_a of action a ∈ [A] is specified by

$$\forall a\in [A],\,{{{{\boldsymbol{\theta }}}}}_{a}=(0.7-\Delta )+\frac{a-1}{A-1}\Delta .$$

(4.73)

Each session contains 500 trials, and for each simulation, we run the task for 50 sessions.

Cued 2-AFC task

This section contains details for the cue 2-AFC task (Fig. S2a). The task contains one parameter, the expected difference in reward probability between the most and the least rewarding actions (Δ). In each trial, with uniform probability, the model will be presented with cue 1 or cue 2. The reward probability of action 1 after seeing cue 1 is 70% while action 2 is (70 − Δ)%. On the other hand, the reward probability of action 1 after seeing cue 2 is (70 − Δ)% while action 2 is 70%. Each session contains 500 trials, and for each simulation, we run the task for 50 sessions.

Binary tree maze task

This section contains details for the binary tree maze task (Fig. S2c). The task consists of a depth-2 binary tree maze with 4 end locations. Upon reaching each end location, the model will receive a reward with probability $1,\frac{2}{3},\frac{1}{3},0$ respectively. The task contains one parameter a; at the start, if the model chooses left, it receives a reward with probability a, and if the model chooses right, it receives a reward with probability (1 − a). Each session contains 500 trials, and for each simulation, we run the task for 50 sessions.

Probabilistic reversal task

This section contains details for the probabilistic reversal task. There are two alternatives, left or right, in the task, and the reward probability in the context 1 is θ_R = 0.3, θ_L = 0.7, and the reward probability in the context 2 is θ_R = 0.7, θ_L = 0.3. The task starts with context 1 and switches to an alternative context for every 200 trials. The task consists of 1000 trials, and for each simulation, we run the task for 50 sessions. For the low outcome uncertainty environment in Fig. 5i, we replace the 70%, 30% reward probability with 90%, 10%.

Regret and contextual uncertainty

Let ${\theta }_{t}\in {{\mathbb{R}}}^{A}$ be the probability of getting a reward for each action at trial t. We define regret at trial T, R_T, as the expected differences in rewards between the retrospectively optimal action and the action taken,

$${R}_{T}={\sum}_{t=1}^{T}{({{{{\boldsymbol{\theta }}}}}_{t})}_{{a}_{t}^{*}}-{({\theta }_{t})}_{{a}_{t}},$$

(4.74)

where ${a}_{t}^{*}$ is the retrospectively optimal action.

To decode the contextual uncertainty, U, at Fig. 4g and Fig. S7b, we consider the following nonlinear transformation of the MD activities x^md:

$$U=2-\frac{2}{1+{e}^{-| {{{{\bf{x}}}}}_{1}^{{{{\rm{md}}}}}-{{{{\bf{x}}}}}_{2}^{{{{\rm{md}}}}}| }}.$$

(4.75)

Notice that when two MD populations have the same activity, the uncertainty is 1, and when they have a large difference in activity, the uncertainty is close to 0.

Statistic test

Data were first tested for normality using the Shapiro–Wilk test. All data presented in this paper are non-normally distributed; therefore, all statistical tests were conducted using nonparametric statistics. For all comparisons of two groups, we used a two-way rank sum test. For comparison of more than two groups, we used the Bonferroni-corrected Kruskal–Wallis test with post hoc Dunn’s test. All permutation tests are done using 10⁶ resamples.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The behavioral and neural activity data of the models in this study have been deposited at FigShare and are publicly available. DOI is https://doi.org/10.6084/m9.figshare.26065372.

Code availability

All original code has been deposited at Zenodo and is publicly available as of the date of publication. DOI is https://doi.org/10.5281/zenodo.13152289. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

References

Lam, N. H. et al. Prefrontal transthalamic uncertainty processing drives flexible switching. Nature 637, 127–136 (2024).
Article ADS PubMed Google Scholar
Sarafyazd, M. & Jazayeri, M. Hierarchical reasoning by neural circuits in the frontal cortex. Science 364, eaav8911 (2019).
Article PubMed Google Scholar
Bill, J., Gershman, S. J. & Drugowitsch, J. Visual motion perception as online hierarchical inference. Nat. Commun. 13, 7403 (2022).
Article ADS PubMed PubMed Central Google Scholar
Rohe, T., Ehlis, A. C. & Noppeney, U. The neural dynamics of hierarchical Bayesian causal inference in multisensory perception. Nat. Commun. 10, 1907 (2019).
Article ADS PubMed PubMed Central Google Scholar
Tenenbaum, J. B., Kemp, C., Griffiths, T. L. & Goodman, N. D. How to grow a mind: statistics, structure, and abstraction. Science 331, 1279–1285 (2011).
Article ADS MathSciNet PubMed Google Scholar
Mathys, C. D. et al. Uncertainty in perception and the hierarchical Gaussian filter. Front. Hum. Neurosci. 8, 825 (2014).
Article PubMed PubMed Central Google Scholar
Knill, D. C. & Pouget, A. The Bayesian brain: the role of uncertainty in neural coding and computation. Trends Neurosci. 27, 712–719 (2004).
Article PubMed Google Scholar
Kording, K. P. & Wolpert, D. M. Bayesian integration in sensorimotor learning. Nature 427, 244–247 (2004).
Article ADS PubMed Google Scholar
Nott, D. J., Drovandi, C. & Frazier, D. T. Bayesian inference for misspecified generative models. Annu. Rev. Stat. Appl. 11, 179–202 (2024).
Article MathSciNet Google Scholar
Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Article ADS PubMed Google Scholar
Yamins, D. L. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl. Acad. Sci. USA 111, 8619–8624 (2014).
Article ADS PubMed PubMed Central Google Scholar
Monosov, I. E. How outcome uncertainty mediates attention, learning, and decision-making. Trends Neurosci. 43, 795–809 (2020).
Article PubMed PubMed Central Google Scholar
Knill, D. C. & Whitman, R. (eds) Perception as Bayesian Inference (Cambridge Univ. Press, 1996).
Stocker, A. A. & Simoncelli, E. P. Noise characteristics and prior expectations in human visual speed perception. Nat. Neurosci. 9, 578–585 (2006).
Article PubMed Google Scholar
Wang, B. A. et al. Thalamic regulation of reinforcement learning strategies across prefrontal-striatal networks. Nat. Commun. https://doi.org/10.1038/s41467-025-63995-x (2025).
Zhou, T. et al. Enhancement of mediodorsal thalamus rescues aberrant belief dynamics in a mouse model with schizophrenia-associated mutation. Preprint at bioRxiv https://doi.org/10.1101/2024.01.08.574745 (2024).
Niv, Y. Reinforcement learning in the brain. J. Math. Psychol. 53, 139–154 (2009).
Article MathSciNet Google Scholar
Soltani, A. & Wang, X. J. A biophysically based neural model of matching law behavior: melioration by stochastic synapses. J. Neurosci. 26, 3731–3744 (2006).
Article PubMed PubMed Central Google Scholar
Ma, W. J., Beck, J. M., Latham, P. E. & Pouget, A. Bayesian inference with probabilistic population codes. Nat. Neurosci. 9, 1432–1438 (2006).
Article PubMed Google Scholar
Hoyer, P. & Hyvärinen, A. Interpreting neural response variability as Monte Carlo sampling of the posterior. In Advances in Neural Information Processing Systems Vol. 15 (eds Becker, S. et al.) (MIT Press, 2002).
Rao, R. P. Bayesian computation in recurrent neural circuits. Neural Comput. 16, 1–38 (2004).
Article PubMed Google Scholar
Dabney, W. et al. A distributional code for value in dopamine-based reinforcement learning. Nature 577, 671–675 (2020).
Article ADS PubMed PubMed Central Google Scholar
Roitman, J. D. & Shadlen, M. N. Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. J. Neurosci. 22, 9475–9489 (2002).
Article PubMed PubMed Central Google Scholar
Lattimore, T. & Szepesvári, C. Bandit Algorithms (Cambridge Univ. Press, 2020).
Auer, P., Cesa-Bianchi, N., Freund, Y. & Schapire, R. Gambling in a rigged casino: the adversarial multi-armed bandit problem. In Proc. IEEE 36th Annual Foundations of Computer Science 322–331 (1995).
Korda, N., Kaufmann, E. & Munos, R. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems Vol. 26 (eds Burges, C. et al.) (Curran Associates, Inc., 2013).
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
Article ADS PubMed Google Scholar
Vul, E. Sampling in Human Cognition. Ph.D. dissertation, Massachusetts Institute of Technology https://dspace.mit.edu/handle/1721.1/62097 (2010).
Battaglia, P. W., Kersten, D. & Schrater, P. R. How haptic size sensations improve distance perception. PLoS Comput. Biol. 7, e1002080 (2011).
Article ADS MathSciNet PubMed PubMed Central Google Scholar
Acerbi, L., Vijayakumar, S. & Wolpert, D. M. On the origins of suboptimality in human probabilistic inference. PLoS Comput. Biol. 10, e1003661 (2014).
Article ADS PubMed PubMed Central Google Scholar
Prat-Carrabin, A., Wilson, R. C., Cohen, J. D. & Azeredo da Silveira, R. Human inference in changing environments with temporal structure. Psychol. Rev. 128, 879–912 (2021).
Article PubMed Google Scholar
Prat-Carrabin, A., Meyniel, F. & Azeredo da Silveira, R. Resource-rational account of sequential effects in human prediction. eLife 13, e81256 (2024).
Article PubMed PubMed Central Google Scholar
Chen, C. S., Mueller, D., Knep, E., Ebitz, R. B. & Grissom, N. M. Dopamine and norepinephrine differentially mediate the exploration-exploitation tradeoff. J. Neurosci. 44, e1194232024 (2024).
Article PubMed PubMed Central Google Scholar
Doya, K. Metalearning and neuromodulation. Neural Netw. 15, 495–506 (2002).
Article PubMed Google Scholar
Sakai, K. & Passingham, R. E. Prefrontal interactions reflect future task operations. Nat. Neurosci. 6, 75–81 (2003).
Article PubMed Google Scholar
Miller, E. K. & Cohen, J. D. An integrative theory of prefrontal cortex function. Annu. Rev. Neurosci. 24, 167–202 (2001).
Article PubMed Google Scholar
Wolff, M. & Halassa, M. M. The mediodorsal thalamus in executive control. Neuron 112, 893–908 (2024).
Article PubMed Google Scholar
Wang, M. B. & Halassa, M. M. Thalamocortical contribution to flexible learning in neural systems. Netw. Neurosci. 6, 980–997 (2022).
Article PubMed PubMed Central Google Scholar
Scott, D. N., Mukherjee, A., Nassar, M. R. & Halassa, M. M. Thalamocortical architectures for flexible cognition and efficient learning. Trends Cogn. Sci. 28, 739–756 (2024).
Article PubMed PubMed Central Google Scholar
Nakajima, M. & Halassa, M. M. Thalamic control of functional cortical connectivity. Curr. Opin. Neurobiol. 44, 127–131 (2017).
Article PubMed PubMed Central Google Scholar
Halassa, M. M. & Kastner, S. Thalamic functions in distributed cognitive control. Nat. Neurosci. 20, 1669–1679 (2017).
Article PubMed Google Scholar
Halassa, M. M. & Sherman, S. M. Thalamocortical circuit motifs: a general framework. Neuron 103, 762–770 (2019).
Article PubMed PubMed Central Google Scholar
Schmitt, L. I. et al. Thalamic amplification of cortical connectivity sustains attentional control. Nature 545, 219–223 (2017).
Article ADS PubMed PubMed Central Google Scholar
Rikhye, R. V., Gilra, A. & Halassa, M. M. Thalamic regulation of switching between cortical representations enables cognitive flexibility. Nat. Neurosci. 21, 1753–1763 (2018).
Article PubMed PubMed Central Google Scholar
Mukherjee, A., Lam, N. H., Wimmer, R. D. & Halassa, M. M. Thalamic circuits for independent control of prefrontal signal and noise. Nature 600, 100–104 (2021).
Article ADS PubMed PubMed Central Google Scholar
Chen, X., Sorenson, E. & Hwang, K. Thalamocortical contributions to working memory processes during the n-back task. Neurobiol. Learn. Mem. 197, 107701 (2023).
Article PubMed Google Scholar
Zheng, W. L., Wu, Z., Hummos, A., Yang, G. R. & Halassa, M. M. Rapid context inference in a thalamocortical model using recurrent neural networks. Nat Commun. 15, 8275 (2024).
Hummos, A., Wang, B. A., Drammis, S., Halassa, M. M. & Pleger, B. Thalamic regulation of frontal interactions in human cognitive flexibility. PLoS Comput Biol. 18, e1010500 (2022).
Zhang, X., Mukherjee, A., Halassa, M. M. & Chen, Z. S. Mediodorsal thalamus regulates task uncertainty to enable cognitive flexibility. Nat Commun. 16, 2640 (2025).
Halassa, M. M. & Acsády, L. Thalamic Inhibition: Diverse Sources, Diverse Scales. Trends Neurosci. 39 680–693 (2016).
Halassa, M. M. et al. State-dependent architecture of thalamic reticular subnetworks. Cell. 158, 808–821 (2014).
Wimmer, R. D. et al. Thalamic control of sensory selection in divided attention. Nature 526, 705–709 (2015).
Article ADS PubMed PubMed Central Google Scholar
Nakajima, M., Schmitt, L. I. & Halassa, M. M. Prefrontal cortex regulates sensory filtering through a basal ganglia-to-thalamus pathway. Neuron 103, 445–458 (2019).
Article PubMed PubMed Central Google Scholar
Canto-Bustos, M., Friason, F. K., Bassi, C. & Oswald, A. M. Disinhibitory circuitry gates associative synaptic plasticity in olfactory cortex. J. Neurosci. 42, 2942–2950 (2022).
Article PubMed PubMed Central Google Scholar
Williams, L. E. & Holtmaat, A. Higher-order thalamocortical inputs gate synaptic long-term potentiation via disinhibition. Neuron 101, 91–102 (2019).
Article PubMed Google Scholar
Moustakides, G. V. Optimal stopping times for detecting changes in distributions. Ann. Stat. 14, 1379–1387 (1986).
Article MathSciNet Google Scholar
Lorden, G. Procedures for reacting to a change in distribution. Ann. Math. Stat. 42, 1897–1908 (1971).
Article MathSciNet Google Scholar
Chakraborty, S., Kolling, N., Walton, M. E. & Mitchell, A. S. Critical role for the mediodorsal thalamus in permitting rapid reward-guided updating in stochastic reward environments. eLife 5, e13588 (2016).
Article PubMed PubMed Central Google Scholar
Alcaraz, F. et al. Dissociable effects of anterior and mediodorsal thalamic lesions on spatial goal-directed behavior. Brain Struct. Funct. 221, 79–89 (2016).
Article PubMed Google Scholar
Hwang, K., Bruss, J., Tranel, D. & Boes, A. D. Network localization of executive function deficits in patients with focal thalamic lesions. J. Cogn. Neurosci. 32, 2303–2319 (2020).
Article PubMed PubMed Central Google Scholar
Baker, S. C., Konova, A. B., Daw, N. D. & Horga, G. A distinct inferential mechanism for delusions in schizophrenia. Brain 142, 1797–1812 (2019).
Article PubMed PubMed Central Google Scholar
Sheffield, J. M., Suthaharan, P., Leptourgos, P. & Corlett, P. R. Belief updating and paranoia in individuals with schizophrenia. Biol. Psychiatry Cogn. Neurosci. Neuroimaging 7, 1149–1157 (2022).
PubMed PubMed Central Google Scholar
Adams, R. A., Napier, G., Roiser, J. P., Mathys, C. & Gilleen, J. Attractor-like dynamics in belief updating in schizophrenia. J. Neurosci. 38, 9471–9485 (2018).
Article PubMed PubMed Central Google Scholar
Nassar, M. R., Waltz, J. A., Albrecht, M. A., Gold, J. M. & Frank, M. J. All or nothing belief updating in patients with schizophrenia reduces precision and flexibility of beliefs. Brain 144, 1013–1029 (2021).
Article PubMed PubMed Central Google Scholar
Corlett, P. R. & Fletcher, P. Modelling delusions as temporally-evolving beliefs. Cogn. Neuropsychiatry 26, 231–241 (2021).
Article PubMed Google Scholar
Corlett, P., Taylor, J., Wang, X.-J., Fletcher, P. & Krystal, J. Toward a neurobiology of delusions. Prog. Neurobiol. 92, 345–369 (2010).
Article PubMed PubMed Central Google Scholar
Huang, A. S. et al. A prefrontal thalamocortical readout for conflict-related executive dysfunction in schizophrenia. Cell Rep. Med. 5, 101802 (2024).
Article PubMed PubMed Central Google Scholar
Anticevic, A. & Halassa, M. M. The thalamus in psychosis spectrum disorder. Front. Neurosci. 17, 1163600 (2023).
Article PubMed PubMed Central Google Scholar
Mukherjee, A. & Halassa, M. M. The associative thalamus: a switchboard for cortical operations and a promising target for schizophrenia. Neuroscientist 30, 132–147 (2022).
Article PubMed PubMed Central Google Scholar
Paz, R. & nez Amaya, J. M. The mediodorsal thalamic nucleus and schizophrenia. J. Psychiatry Neurosci. 33, 489–498 (2008).
Article Google Scholar
Anticevic, A. et al. Mediodorsal and visual thalamic connectivity differ in schizophrenia and bipolar disorder with and without psychosis history. Schizophr. Bull. 40, 1227–1243 (2014).
Article PubMed PubMed Central Google Scholar
Byne, W. et al. Magnetic resonance imaging of the thalamic mediodorsal nucleus and pulvinar in schizophrenia and schizotypal personality disorder. Arch. Gen. Psychiatry 58, 133–140 (2001).
Article PubMed Google Scholar
Woodward, N. D., Karbasforoushan, H. & Heckers, S. Thalamocortical dysconnectivity in schizophrenia. Am. J. Psychiatry 169, 1092–1099 (2012).
Article PubMed Google Scholar
Pomarol-Clotet, E. et al. Medial prefrontal cortex pathology in schizophrenia as revealed by convergent findings from multimodal imaging. Mol. Psychiatry 15, 823–830 (2010).
Article PubMed PubMed Central Google Scholar
Seeman, P. & Lee, T. Antipsychotic drugs: direct correlation between clinical potency and presynaptic action on dopamine neurons. Science 188, 1217–1219 (1975).
Article ADS PubMed Google Scholar
Creese, I., Burt, D. R. & Snyder, S. H. Dopamine receptor binding predicts clinical and pharmacological potencies of antischizophrenic drugs. Science 192, 481–483 (1976).
Article ADS PubMed Google Scholar
Meltzer, H. Y., Matsubara, S. & Lee, J. C. Classification of typical and atypical antipsychotic drugs on the basis of dopamine D-1, D-2 and serotonin2 pKi values. J. Pharmacol. Exp. Ther. 251, 238–246 (1989).
Article PubMed Google Scholar
Wong, D. F. et al. Positron emission tomography reveals elevated D2 dopamine receptors in drug-naive schizophrenics. Science 234, 1558–1563 (1986).
Article ADS PubMed Google Scholar
Abi-Dargham, A. et al. Increased baseline occupancy of D2 receptors by dopamine in schizophrenia. Proc. Natl. Acad. Sci. USA 97, 8104–8109 (2000).
Article ADS PubMed PubMed Central Google Scholar
Cazorla, M., Shegda, M., Ramesh, B., Harrison, N. L. & Kellendonk, C. Striatal D2 receptors regulate dendritic morphology of medium spiny neurons via Kir2 channels. J. Neurosci. 32, 2398–2409 (2012).
Article PubMed PubMed Central Google Scholar
Waltz, J. A. The neural underpinnings of cognitive flexibility and their disruption in psychotic illness. Neuroscience 345, 203–217 (2017).
Article PubMed Google Scholar
Deserno, L. et al. Volatility estimates increase choice switching and relate to prefrontal activity in schizophrenia. Biol. Psychiatry Cogn. Neurosci. Neuroimaging 5, 173–183 (2020).
PubMed Google Scholar
Foster, N. N. et al. The mouse cortico-basal ganglia-thalamic network. Nature 598, 188–194 (2021).
Article ADS PubMed PubMed Central Google Scholar
Alexander, G. E., DeLong, M. R. & Strick, P. L. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu. Rev. Neurosci. 9, 357–381 (1986).
Article PubMed Google Scholar
Cox, J. & Witten, I. B. Striatal circuits for reward learning and decision-making. Nat. Rev. Neurosci. 20, 482–494 (2019).
Article PubMed PubMed Central Google Scholar
Petersen, C. C. & Crochet, S. Synaptic computation and sensory processing in neocortical layer 2/3. Neuron 78, 28–48 (2013).
Article PubMed Google Scholar
Barth, A. L. & Poulet, J. F. Experimental evidence for sparse firing in the neocortex. Trends Neurosci. 35, 345–355 (2012).
Article PubMed Google Scholar
Kerr, J. N. et al. Spatial organization of neuronal population responses in layer 2/3 of rat barrel cortex. J. Neurosci. 27, 13316–13328 (2007).
Article PubMed PubMed Central Google Scholar
Kato, S. et al. Action selection and flexible switching controlled by the intralaminar thalamic neurons. Cell Rep. 22, 2370–2382 (2018).
Article PubMed Google Scholar
Minamimoto, T., Hori, Y. & Kimura, M. Roles of the thalamic cm-pf complex-basal ganglia circuit in externally driven rebias of action. Brain Res. Bull. 78, 75–79 (2009).
Article PubMed Google Scholar
Wolpert, D. M. & Kawato, M. Multiple paired forward and inverse models for motor control. Neural Netw. 11, 1317–1329 (1998).
Article PubMed Google Scholar
Heald, J. B., Lengyel, M. & Wolpert, D. M. Contextual inference underlies the learning of sensorimotor repertoires. Nature 600, 489–493 (2021).
Article ADS PubMed PubMed Central Google Scholar
Kim, J.-H., Daie, K. & Li, N. A combinatorial neural code for long-term motor memory. Nature 637, 663–672 (2024).
Article ADS PubMed PubMed Central Google Scholar
Jones, E. G. The thalamic matrix and thalamocortical synchrony. Trends Neurosci. 24, 595–601 (2001).
Article PubMed Google Scholar
Sherman, S. M. & Guillery, R. W. Exploring the Thalamus and Its Role in Cortical Function 2nd edn (MIT Press, 2005), hardcover edn.
Tanaka, M. Cognitive signals in the primate motor thalamus predict saccade timing. J. Neurosci. 27, 12109–12118 (2007).
Article PubMed PubMed Central Google Scholar
Saalmann, Y. B. & Kastner, S. The cognitive thalamus. Front. Syst. Neurosci. 9, 39 (2015).
Article PubMed PubMed Central Google Scholar
Zhou, H., Schafer, R. J. & Desimone, R. Pulvinar-cortex interactions in vision and attention. Neuron 89, 209–220 (2016).
Article PubMed PubMed Central Google Scholar
Bolkan, S. S. et al. Thalamic projections sustain prefrontal activity during working memory maintenance. Nat. Neurosci. 20, 987–996 (2017).
Article PubMed PubMed Central Google Scholar
Guo, Z. V. et al. Maintenance of persistent activity in a frontal thalamocortical loop. Nature 545, 181–186 (2017).
Article ADS PubMed PubMed Central Google Scholar
Guo, W., Clause, A. R., Barth-Maron, A. & Polley, D. B. A corticothalamic circuit for dynamic switching between feature detection and discrimination. Neuron 95, 180–194 (2017).
Article PubMed PubMed Central Google Scholar
Mukherjee, A. et al. Variation of connectivity across exemplar sensory and associative thalamocortical loops in the mouse. eLife 9, e62554 (2020).
Article PubMed PubMed Central Google Scholar
Wang, Y. & Sun, Q.-Q. A prefrontal motor circuit initiates persistent movement. Nat. Commun. 15, 5264 (2024).
Article ADS PubMed PubMed Central Google Scholar
Gershman, S. J. Deconstructing the human algorithms for exploration. Cognition 173, 34–42 (2018).
Article PubMed Google Scholar
Cohen, J. D., McClure, S. M. & Yu, A. J. Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philos. Trans. R. Soc. Lond. B Biol. Sci. 362, 933–942 (2007).
Article PubMed PubMed Central Google Scholar
Wilson, R. C., Geana, A., White, J. M., Ludvig, E. A. & Cohen, J. D. Humans use directed and random exploration to solve the explore-exploit dilemma. J. Exp. Psychol. Gen. 143, 2074–2081 (2014).
Article PubMed PubMed Central Google Scholar
Ma, W. J. & Jazayeri, M. Neural coding of uncertainty and probability. Annu. Rev. Neurosci. 37, 205–220 (2014).
Article PubMed Google Scholar
Walker, E. Y. et al. Studying the neural representations of uncertainty. Nat. Neurosci. 26, 1857–1867 (2023).
Article PubMed Google Scholar
Akiti, K. et al. Striatal dopamine explains novelty-induced behavioral dynamics and individual variability in threat prediction. Neuron 110, 3789–3804 (2022).
Article PubMed PubMed Central Google Scholar
O’Neill, M. & Schultz, W. Coding of reward risk by orbitofrontal neurons is mostly distinct from coding of reward value. Neuron 68, 789–800 (2010).
Article PubMed Google Scholar
Masset, P., Ott, T., Lak, A., Hirokawa, J. & Kepecs, A. Behavior- and modality-general representation of confidence in orbitofrontal cortex. Cell 182, 112–126 (2020).
Article PubMed PubMed Central Google Scholar
Orban, G., Berkes, P., Fiser, J. & Lengyel, M. Neural variability and sampling-based probabilistic representations in the visual cortex. Neuron 92, 530–543 (2016).
Article PubMed PubMed Central Google Scholar
Walker, E. Y., Cotton, R. J., Ma, W. J. & Tolias, A. S. A neural basis of probabilistic computation in visual cortex. Nat. Neurosci. 23, 122–129 (2020).
Article PubMed Google Scholar
Geurts, L. S., Cooke, J. R. H., van Bergen, R. S. & Jehee, J. F. M. Subjective confidence reflects representation of Bayesian probability in cortex. Nat. Hum. Behav. 6, 294–305 (2022).
Article PubMed PubMed Central Google Scholar
Echeveste, R., Aitchison, L., Hennequin, G. & Lengyel, M. Cortical-like dynamics in recurrent circuits optimized for sampling-based probabilistic inference. Nat. Neurosci. 23, 1138–1149 (2020).
Article PubMed PubMed Central Google Scholar
Deneve, S. Bayesian spiking neurons I: inference. Neural Comput. 20, 91–117 (2008).
Article MathSciNet PubMed Google Scholar
Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P. & Dolan, R. J. Model-based influences on humans’ choices and striatal prediction errors. Neuron 69, 1204–1215 (2011).
Article PubMed PubMed Central Google Scholar
van der Meer, M. A., Johnson, A., Schmitzer-Torbert, N. C. & Redish, A. D. Triple dissociation of information processing in dorsal striatum, ventral striatum, and hippocampus on a learned spatial decision task. Neuron 67, 25–32 (2010).
Article PubMed PubMed Central Google Scholar
Doll, B. B., Duncan, K. D., Simon, D. A., Shohamy, D. & Daw, N. D. Model-based choices involve prospective neural activity. Nat. Neurosci. 18, 767–772 (2015).
Article PubMed PubMed Central Google Scholar
Scher, J., Daw, N., Dayan, P. & O’Doherty, J. P. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron 66, 585–595 (2010).
Article Google Scholar
Lee, S. W., Shimojo, S. & O’Doherty, J. P. Neural computations underlying arbitration between model-based and model-free learning. Neuron 81, 687–699 (2014).
Article PubMed PubMed Central Google Scholar
Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
Article PubMed Google Scholar
Akam, T. et al. The anterior cingulate cortex predicts future states to mediate model-based action selection. Neuron 109, 149–163 (2021).
Article PubMed PubMed Central Google Scholar
Witkowski, P. P., Park, S. A. & Boorman, E. D. Neural mechanisms of credit assignment for inferred relationships in a structured world. Neuron 110, 2680–2690 (2022).
Article PubMed Google Scholar
Schuck, N. W., Cai, M. B., Wilson, R. C. & Niv, Y. Human orbitofrontal cortex represents a cognitive map of state space. Neuron 91, 1402–1412 (2016).
Article PubMed PubMed Central Google Scholar
Soltani, A. & Koechlin, E. Computational models of adaptive behavior and prefrontal cortex. Neuropsychopharmacology 47, 58–71 (2022).
Article PubMed Google Scholar
Jones, E. G. (ed.) The Thalamus (Springer US, 1985).
Phillips, J. M., Kambi, N. A. & Saalmann, Y. B. A subcortical pathway for rapid, goal-driven, attentional filtering. Trends Neurosci. 39, 49–51 (2016).
Article PubMed Google Scholar
Montague, P. R., Dayan, P. & Sejnowski, T. J. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16, 1936–1947 (1996).
Article PubMed PubMed Central Google Scholar
Bayer, H. M. & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron 47, 129–141 (2005).
Article PubMed PubMed Central Google Scholar
Bamford, N. S., Wightman, R. M. & Sulzer, D. Dopamine’s effects on corticostriatal synapses during reward-based behaviors. Neuron 97, 494–510 (2018).
Article PubMed PubMed Central Google Scholar
Whittington, J. C. R. & Bogacz, R. Theories of error back-propagation in the brain. Trends Cogn. Sci. 23, 235–250 (2019).
Article PubMed PubMed Central Google Scholar
Minsky, M. Steps toward artificial intelligence. Proc. IRE 49, 8–30 (1961).
Article ADS MathSciNet Google Scholar
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
Article ADS Google Scholar
Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J. & Hinton, G. Backpropagation and the brain. Nat. Rev. Neurosci. 21, 335–346 (2020).
Article PubMed Google Scholar
McCloskey, M. & Cohen, N. J. Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation Vol. 24 109–165 (Academic Press, 1989).
French, R. M. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 3, 128–135 (1999).
Article PubMed Google Scholar
Kumaran, D., Hassabis, D. & McClelland, J. L. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends Cogn. Sci. 20, 512–534 (2016).
Article PubMed Google Scholar
Kemker, R., McClure, M., Abitino, A., Hayes, T. & Kanan, C. Measuring catastrophic forgetting in neural networks. In AAAI Conference on Artificial Intelligence https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16410 (2018).
Parisi, G. I., Kemker, R., Part, J. L., Kanan, C. & Wermter, S. Continual lifelong learning with neural networks: A review. Neural Netw. 113, 54–71 (2019).
Article PubMed Google Scholar
Fiete, I. R. & Seung, H. S. Gradient learning in spiking neural networks by dynamic perturbation of conductances. Phys. Rev. Lett. 97, 048104 (2006).
Article ADS PubMed Google Scholar
Schiess, M., Urbanczik, R. & Senn, W. Somato-dendritic synaptic plasticity and error-backpropagation in active dendrites. PLoS Comput. Biol. 12, 1–18 (2016).
Article Google Scholar
Kusmierz, L., Isomura, T. & Toyoizumi, T. Learning with three factors: modulating Hebbian plasticity with errors. Curr. Opin. Neurobiol. 46, 170–177 (2017).
Article PubMed Google Scholar
Richards, B. A. & Lillicrap, T. P. Dendritic solutions to the credit assignment problem. Curr. Opin. Neurobiol. 54, 28–36 (2019).
Article PubMed Google Scholar
Sacramento, J., Ponte Costa, R., Bengio, Y. & Senn, W. Dendritic cortical microcircuits approximate the backpropagation algorithm. In Advances in Neural Information Processing Systems Vol. 31 8735–8746 (Curran Associates, Inc., 2018).
Kornfeld, J. et al. An anatomical substrate of credit assignment in reinforcement learning. Preprint at bioRxiv https://doi.org/10.1101/2020.02.18.954354 (2020).
Liu, Y. H., Smith, S., Mihalas, S., Shea-Brown, E. & Sümbül, U. Cell-type–specific neuromodulation guides synaptic credit assignment in a spiking neural network. Proc. Natl. Acad. Sci. USA 118, e2111821118 (2021).
O’Reilly, R. C. Biologically plausible error-driven learning using local activation differences: the generalized recirculation algorithm. Neural Comput. 8, 895–938 (1996).
Article Google Scholar
Roelfsema, P. R. & van Ooyen, A. Attention-gated reinforcement learning of internal representations for classification. Neural Comput. 17, 2176–2214 (2005).
Article PubMed Google Scholar
Lillicrap, T. P., Cownden, D., Tweed, D. B. & Akerman, C. J. Random synaptic feedback weights support error backpropagation for deep learning. Nat. Commun. 7, 13276 (2016).
Article ADS PubMed PubMed Central Google Scholar
Roelfsema, P. R. & Holtmaat, A. Control of synaptic plasticity in deep cortical networks. Nat. Rev. Neurosci. 19, 166–180 (2018).
Article PubMed Google Scholar
Gejman, P. V., Sanders, A. R. & Duan, J. The role of genetics in the etiology of schizophrenia. Psychiatr. Clin. North Am. 33, 35–66 (2010).
Article PubMed PubMed Central Google Scholar
Levenstein, D. et al. On the role of theory and modeling in neuroscience. J. Neurosci. 43, 1074–1088 (2023).
Article PubMed PubMed Central Google Scholar
Dayan, P. & Abbott, L. F. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems (MIT Press, 2001).
Mante, V., Sussillo, D., Shenoy, K. V. & Newsome, W. T. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature 503, 78–84 (2013).
Article ADS PubMed PubMed Central Google Scholar
Majani, E., Erlanson, R. & Abu-Mostafa, Y. On the k-winners-take-all network. In Advances in Neural Information Processing Systems Vol. 1 (ed. Touretzky, D.) (Morgan-Kaufmann, 1988).

Download references

Acknowledgements

This work is supported by NIMH grants p50mh132642, r01mh134466, r01mh120118 (M.B.W. and M.M.H.) and NSF grants CCR-2139936, CCR-2003830, CCF-1810758 (M.B.W. and N.L.).

Author information

Authors and Affiliations

Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
Mien Brabeeba Wang & Nancy Lynch
School of Medicine, Tufts University, Boston, MA, USA
Michael M. Halassa

Authors

Mien Brabeeba Wang
View author publications
Search author on:PubMed Google Scholar
Nancy Lynch
View author publications
Search author on:PubMed Google Scholar
Michael M. Halassa
View author publications
Search author on:PubMed Google Scholar

Contributions

M.B.W. and M.M.H. conceived the project. M.B.W. developed the main models with inputs from M.M.H. M.B.W. developed the mathematical derivation and analysis with inputs from N.L. M.B.W. conducted the computational experiments and analyzed the data. M.B.W. wrote the manuscript with edits and feedback from M.M.H. and N.L. All the authors read the final version of the manuscript.

Corresponding author

Correspondence to Michael M. Halassa.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Transparent Peer Review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Wang, M.B., Lynch, N. & Halassa, M.M. The neural basis for uncertainty processing in hierarchical decision making. Nat Commun 16, 9096 (2025). https://doi.org/10.1038/s41467-025-63994-y

Download citation

Received: 16 August 2024
Accepted: 03 September 2025
Published: 16 October 2025
DOI: https://doi.org/10.1038/s41467-025-63994-y