Introduction

One of the challenging aspects of learning in naturalistic settings is that it is inherently unclear which features or attributes of choices are predictive of subsequent reward outcomes. Imagine successfully operating a new coffee machine after switching it on and pressing a flashing button on the left side of its screen. What should you press the next time you want to get coffee: the same button or any button that is flashing? In this specific example, reward outcomes can be equally attributed to either the identity of a choice option (e.g., flashing button) or the action needed to obtain that option (pressing the left button), corresponding to uncertainty about the correct model of the environment (stimulus-based vs. action-based). More generally, in most naturalistic settings, reward outcomes could be linked to any combination of features or attributes of a selected option or chosen action. It has been suggested that the brain tackles such uncertainty by running multiple predictive models of the environment, with each model predicting outcomes based on different attributes of choice options, and by using the reliability of these predictions to select the appropriate model to inform choice behavior1,2,3.

Although many conceptual and algorithmic solutions to model arbitration exist2,3,4, confirming implementation-level details in terms of the operation of neural circuits has remained a challenge for several reasons. First, most experimental paradigms manipulate uncertainty in one of two ways. In some paradigms, there is uncertainty about which of multiple choice options is more rewarding, introduced through probabilistic reward contingencies and contingency reversals5,6,7. In other paradigms, the correct option is deterministically linked to reward outcomes, but there is uncertainty about the correct model of the environment and when to choose that option8,9,10,11,12. Few, if any, studies have manipulated expected and unexpected uncertainty13 related to stimulus-action-outcome relationships together with uncertainty about which model of the reward environment is relevant at the time. Critically, in reward environments where the correct model of the environment does not change frequently, the reliabilities of different models can reach their asymptotes very quickly, concealing the contributions of circuits involved in dynamic arbitration and model selection. Second, it is intrinsically difficult to measure and track the contributions of multiple learning systems and their interactions because different learning systems can drive choice behavior at any given moment. Third, because computations required for learning and arbitration under uncertainty must interact with each other, many cortical and subcortical areas may appear to be similarly involved in different processes (lack of specialization). For example, the amygdala has been shown to contribute to reward learning under uncertainty, with reports of both improved and impaired learning performance14,15,16,17,18,19,20, and its contribution has been associated with different types of uncertainty13,21.

To overcome these challenges and reveal the circuit and neural mechanisms underlying arbitration, we applied multiple computational approaches to examine choice behavior in three groups of monkeys (a control group and two groups with bilateral lesions of either the amygdala or ventral striatum) performing a probabilistic learning task that involved multiple forms of uncertainty. These included uncertainty about the better option on a given trial (expected uncertainty), uncertainty about the correct model of the environment, and uncertainty about when reward associations change (unexpected uncertainty), thus creating a challenging task that could reveal the role of the amygdala and ventral striatum (VS) in all three of these processes. To track the simultaneous contributions of multiple learning systems and their interaction, we extended metrics based on information theory22,23 to quantify consistency in choice and learning based on stimulus- and action-based learning over time. Additionally, we developed several reinforcement learning (RL) models that, along with previous models, were used to fit choice behavior on a trial-by-trial basis. These models extended the previous ones by incorporating static or dynamic arbitration among alternative learning systems, based on different signals. We then examined the best models and their estimated parameters, particularly those related to the arbitration process, to pinpoint the roles of the amygdala and VS in reward learning and arbitration. Moreover, by modulating the key parameters of the model, we simulated and qualitatively replicated the distinct behavioral signatures of amygdala or VS lesions. Together, by utilizing the above methods, we provide evidence for interactions between stimulus-based and action-based learning under uncertainty, uncover mechanisms underlying arbitration between the two systems, explore how arbitration and learning processes interact, and determine the amygdala’s contributions to arbitration and overall behavior.

Results

Behavioral paradigm with multiple forms of uncertainty

We examined monkeys’ choice behavior when performing a variant of a probabilistic learning paradigm that involves multiple forms of uncertainty. In this paradigm, during each block of 80 trials, monkeys selected between two novel visual stimuli that were randomly presented on opposite sides of the screen (Fig. 1a; see Experimental paradigm in “Methods”). Selection of each option was rewarded with a certain probability (80:20, 70:30, or 60:40), but the probabilities for the better and worse options reversed at a random point within the block without any signal to the monkey. Critically, at the start of a block of trials the monkeys were unaware whether the assigned reward probabilities for a particular block were linked to the selection of specific stimuli or locations (Fig. 1b). This created uncertainty about the correct model of the environment (stimulus-based vs. action-based). In one task, rewards were exclusively based on stimulus-outcome associations (What-only task; Fig. 1c). In the second task, rewards in each block of trials were determined by either stimulus-outcome or action-outcome associations (What/Where task; Fig. 1d). Together, these components resulted in three types of uncertainty: uncertainty about the correct option on a given trial (expected uncertainty), uncertainty about the correct model of the environment, and uncertainty regarding when reward associations reverse (unexpected uncertainty).
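To make the block structure concrete, the following minimal sketch generates the reward contingencies of a single block under the assumptions described above. It is illustrative only (the names and the exact implementation are not the authors' task code).

```python
import numpy as np

def make_block(block_type="What", schedule=(0.8, 0.2), n_trials=80, seed=None):
    """Sketch of one 80-trial block (illustrative, not the authors' task code).
    In 'What' blocks reward depends on the chosen stimulus identity; in 'Where'
    blocks it depends on the chosen side.  An unsignaled reversal of the
    better/worse contingencies occurs on a random trial between 30 and 50."""
    rng = np.random.default_rng(seed)
    reversal = int(rng.integers(30, 51))
    left_stim = rng.integers(2, size=n_trials)   # which of the two novel stimuli appears on the left

    def reward_prob(t, chosen_stim, chosen_side):
        p_hi, p_lo = schedule if t < reversal else schedule[::-1]
        if block_type == "What":
            return p_hi if chosen_stim == 0 else p_lo    # stimulus 0 starts as the better option
        return p_hi if chosen_side == "left" else p_lo   # left side starts as the better option

    return reward_prob, left_stim, reversal
```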

Fig. 1: Experimental paradigm, block types, and time course of performance.
figure 1

a Timeline of each block and a single trial of the experiment. At the beginning of each block, two novel stimuli (abstract visual objects) were introduced. On each trial, animals indicated a choice by making a saccade toward one of the two options on the left and right sides of fixation. The selection of each stimulus was rewarded probabilistically based on three reward schedules. Reward contingencies were reversed between better and worse options on a randomly selected trial (between trials 30 and 50). b Different block types. In What blocks (top), reward probabilities were assigned based on stimulus identity, with a particular object having a higher reward probability. In Where blocks (bottom), reward probabilities were assigned based on the location of the stimuli, with a particular side having a higher reward probability regardless of the object appearing on that side. c Outline of the What-only task. Here, only What blocks were used for the entire experiment. Vertical dotted lines, rev, indicate a random reversal point within each block. d Outline of the What/Where task. In this task, What (black) and Where (gray) blocks were randomly interleaved. e–g Time course of performance of the monkeys in each group, measured as the probability of choosing the better option, P(Better), separately for different tasks and block types. Shaded regions indicate error bars. Insets show the averaged performance by reward schedules. Error bars = SEM across subjects (controls: n = 4 in What-only, n = 6 for What/Where; amygdala: n = 4 for both tasks; VS: n = 3 for both tasks). Source data are provided as a Source data file.

Evidence for multiple learning systems and their interaction

To examine the presence of multiple learning systems and reveal the effects of different forms of uncertainty, we first compared the performance of control monkeys across different tasks and reward schedules. Overall performance was best during the What-only task, in which only the stimulus identity was predictive of reward, and there was no inherent uncertainty about the model of the environment in terms of objective task structure. Using a mixed-effects analysis that accounts for subject variability, we observed that the probability of choosing the more rewarding option, P(Better), during the What-only task was significantly higher than in either What blocks (main effects of block type; βWhat = −0.074, p = 2.74 × 10−28) or Where blocks (βWhere = −0.087, p = 2.39 × 10−37) of the What/Where task, which involved additional uncertainty about the correct model of the environment. We also found that expected uncertainty affected performance, measured by P(Better): across all block types and tasks, performance improved as it became easier to discriminate between the reward probabilities of the two options (main effect of reward variance; βvar(P) = −1.950, p = 4.60 × 10−31).
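As an illustration of how such a mixed-effects analysis can be set up, the sketch below fits block-wise performance with fixed effects of block type (relative to the What-only task) and reward-schedule variance, and a random intercept per monkey, using statsmodels. The column names, the synthetic data, and the exact random-effects structure are assumptions for illustration and may differ from the analysis actually used.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real block-wise data (hypothetical column names):
#   p_better - P(Better) for each block
#   block    - 'WhatOnly', 'What', or 'Where'
#   var_p    - variance of the reward outcome given the better option (expected uncertainty)
#   subject  - monkey ID (grouping factor for the random intercept)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "block": rng.choice(["WhatOnly", "What", "Where"], n),
    "var_p": rng.choice([0.16, 0.21, 0.24], n),   # 0.8*0.2, 0.7*0.3, 0.6*0.4
    "subject": rng.choice(["M1", "M2", "M3", "M4"], n),
})
df["p_better"] = (0.7 - 0.05 * (df["block"] != "WhatOnly")
                  - 0.5 * df["var_p"] + rng.normal(0, 0.05, n))

# Fixed effects of block type and reward variance, random intercept per monkey
model = smf.mixedlm("p_better ~ C(block, Treatment('WhatOnly')) + var_p",
                    data=df, groups=df["subject"])
print(model.fit().summary())
```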

To capture the effect of reward feedback and how it was used to perform the task and adjust the behavior, we utilized information-theoretic metrics to quantify consistency in reward-dependent choice strategy on two attribute dimensions, stimulus identity and stimulus location22,23. Specifically, we examined the conditional entropy of reward-dependent strategy (ERDS), defined as the Shannon entropy of stay/switch strategy conditioned on the previous reward feedback, separately for stay/switch based on action or stimulus identity (see Eq. 1 in “Methods”). Lower values of ERDSStim suggest that the animals stayed or switched after reward feedback based on stimulus identity (stimulus-based learning), whereas lower values of ERDSAction indicate that the animals’ stay/switch strategy was based on assigning reward to the chosen action (action-based learning).
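As a concrete illustration of this metric, the sketch below computes the conditional entropy of the stay/switch strategy given the previous reward outcome. This is a minimal reading of Eq. 1, assuming binary feedback and base-2 logarithms; the handling of edge cases and any smoothing in the paper may differ.

```python
import numpy as np

def erds(stay, prev_reward):
    """Conditional entropy of the stay/switch strategy given previous reward.

    stay:        boolean array, True if the current choice repeats the previous one
                 (defined either on stimulus identity or on the chosen action/side)
    prev_reward: boolean array, True if the previous trial was rewarded
    """
    stay, prev_reward = np.asarray(stay, bool), np.asarray(prev_reward, bool)
    h = 0.0
    for r in (True, False):
        mask = prev_reward == r
        if not mask.any():
            continue
        p_r = mask.mean()                      # P(previous reward = r)
        p_stay = stay[mask].mean()             # P(stay | previous reward = r)
        for p in (p_stay, 1.0 - p_stay):
            if p > 0:
                h -= p_r * p * np.log2(p)      # accumulate p(r) * H(stay | r)
    return h

# ERDS_Stim uses stay/switch defined on stimulus identity,
# ERDS_Action uses stay/switch defined on the chosen side (action).
```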

To quantify the interaction between the two learning systems, we computed the correlation between ERDSStim and ERDSAction calculated for each 80-trial block (Supplementary Fig. 5). We found that even in the What-only task, for which action-based learning was irrelevant and minimally used, there was a negative correlation between ERDSStim and ERDSAction (Spearman’s correlation, r = −0.123, p = 4.29 × 10−7), suggesting that more consistency in using one model resulted in less consistency in using the other model. For the What/Where task, we observed stronger negative correlations between ERDSStim and ERDSAction for both block types (What: r = −0.602, p = 2.43 × 10−296; Where: r = −0.578, p = 1.32 × 10−261). Overall, these results reveal significant interactions between the stimulus- and action-based learning systems.

Considering previous findings on the influence of learning strategy on response time24,25, we hypothesized that reaction time (RT) is influenced by the dominant learning system at any given time. To test this hypothesis, we categorized trials as either stimulus- or action-dominant by directly comparing ERDSStim and ERDSAction (see Data analysis and statistical tests in “Methods” for details). Using this approach, we found that during the What/Where task, responses in stimulus-dominant trials were significantly slower than in action-dominant trials in both block types (Supplementary Note 1). This contrast between stimulus- and action-driven RTs was harder to identify in the What-only task, in which choices were dominated by the stimulus-based system. Nonetheless, rare action-dominant trials occurred when reward value estimates based on the two systems were close to each other, resulting in slower and more erroneous responses. Overall, these results show that entropy-based metrics could be used to identify the adopted model on a given trial and that RT reflected the adopted strategy, with the stimulus-based strategy resulting in longer RT than the action-based strategy.

Influences of amygdala and ventral striatum lesions on learning and choice behavior

Next, we compared the effects of amygdala and VS lesions on choice behavior to elucidate their contributions to decision-making, learning, and arbitration. During the What-only task, amygdala-lesioned monkeys exhibited the largest impairment in performance (P(Better) across all three reward schedules: M ± SD = 0.581 ± 0.11; main effect of group in mixed-effects analysis; βamyg = −0.172, p = 7.72 × 10−22; contrast between lesion groups: βamyg − βVS = −0.076, p = 1.11 × 10−4; Fig. 1e; Supplementary Table 1). VS-lesioned monkeys also showed impairment compared to the control monkeys (main effect of group in mixed-effects analysis; βVS = −0.096, p = 1.19 × 10−6; Fig. 1e; Supplementary Table 1). Although a previous study17 reported additional differences in the behavior of amygdala- and VS-lesioned monkeys, the better performance of VS- compared to amygdala-lesioned monkeys (Fig. 1e inset) is surprising, given the established role of VS in stimulus-based learning26,27, which is required for the What-only task.

During What blocks of the What/Where task, however, amygdala-lesioned monkeys performed significantly better than VS-lesioned monkeys (mixed-effects analysis, contrast between lesion groups: βamyg − βVS = 0.104, p = 0.0152; Fig. 1f inset; Supplementary Table 2), while both lesioned groups were impaired relative to control monkeys (amygdala: βamyg = −0.105, p = 0.00334; VS: βVS = −0.209, p = 1.05 × 10−7). In Where blocks that did not require stimulus-based learning, only amygdala-lesioned monkeys showed impairments in performance relative to controls (contrast from controls = −0.070, p = 0.0173; Fig. 1g inset; Supplementary Table 2). VS-lesioned performance was comparable to that of controls (contrast from controls = −0.037, p = 0.242) yet not significantly better than that of amygdala-lesioned monkeys (contrast between lesion groups = −0.033, p = 0.351).

Overall, these results demonstrate that in the absence of intrinsic uncertainty about the correct model of the environment and when this model was stimulus-based, VS-lesioned monkeys (with intact amygdala) were able to partially overcome the deficit in stimulus-based learning to a significantly larger degree than amygdala-lesioned monkeys. Under additional uncertainty about the correct model of the environment (the What/Where task), however, VS lesions caused significant impairment in What blocks only, consistent with the role of VS in stimulus-based learning. In contrast, amygdala-lesioned monkeys exhibited impaired performance in both What and Where blocks (but more strongly in What blocks) despite no clear evidence for a significant contribution of the amygdala to action-based learning (but see ref. 20), even though action encoding has been reported in both the amygdala and VS28,29. As noted in previous work18, the similar impairments during What and Where blocks observed in amygdala-lesioned monkeys cannot be explained by the amygdala’s currently assumed role in stimulus-based learning.

These results are also puzzling because the higher performance of VS-lesioned compared to amygdala-lesioned monkeys in the What-only task suggests a stronger contribution of the amygdala to stimulus-based learning. However, the higher performance of amygdala-lesioned compared to VS-lesioned monkeys in What blocks of the What/Where task contradicts this idea. These findings hint at a potential role of the amygdala in arbitration between stimulus- and action-based learning, in addition to its known role in stimulus-based learning.

To study the relative adoption of the two learning strategies according to the uncertainty of the reward environment, we examined the difference between ERDSStim and ERDSAction (ΔERDS) by block types and reward schedules (Supplementary Fig. 6). We found that the reward uncertainty (measured as variance13) was predictive of the relative degree of adoption between stimulus- and action-based strategies. More specifically, in the What-only and What blocks of the What/Where task, animals’ strategies became relatively more biased toward the action-based strategy (increasing ΔERDS) as the uncertainty of the reward schedule increased (Supplementary Fig. 6d, e). Consistently, in Where blocks, they tended to become relatively more stimulus-based under more uncertainty (decreasing ΔERDS; Supplementary Fig. 6f). These observations demonstrate that both control and lesioned monkeys adjusted to reward uncertainty by exploring the incorrect model of the environment, even though they started from different baselines. Critically, amygdala-lesioned monkeys exhibited the smallest distinction between the two types of learning strategies (Supplementary Fig. 6d–f).

Finally, we also examined RT in the two lesioned groups and found results consistent with those of the control animals (Supplementary Note 1). Together, our findings suggest that VS lesions biased behavior toward action-based learning by impairing stimulus-based learning. In contrast, amygdala lesions resulted in more nuanced impairment of both stimulus- and action-based learning, as well as their coordination. To reveal the underlying mechanisms, we developed multiple computational models to fit the choice behavior of both control and lesioned monkeys.

Mechanisms of arbitration between stimulus- and action-based learning systems

To uncover mechanisms underlying the interaction between the two systems, we developed several hybrid RL models to fit the choice behavior of control monkeys on a trial-by-trial basis (see Supplementary Table 15 for the list of all models). In the simplest model, signals from distinct action-based and stimulus-based learning systems were combined linearly using a fixed weight to control choice behavior. We also tested models with dynamic arbitration in which the relative weighting of the two systems, ω, was updated on each trial based on the reliability of the two systems. Drawing on previous literature, we compared multiple methods for computing reliability: (1) the magnitude of the reward prediction error (|RPE|), (2) the value of the chosen option (Vcho), (3) discernibility between two competing options (|ΔV|), and (4) the sum of value estimates within each system (ΣV) (see Eqs. 13–16 in “Methods”). Additionally, we considered a more general model in which the baseline (time-independent) ratio of value signals from the two learning systems (quantified by the parameter ρ) could be adjusted independently of ω (Fig. 2a; see Eq. 10 in “Methods”). As a result, this (Dynamic ω-ρ) model relaxes the assumption that an increase in signal strength from one system (or equivalently, the sensitivity of decision-making to those signals) is matched by an equal decrease in signal strength from the other system, and vice versa. To determine the best model, we computed the goodness-of-fit using five-fold cross-validation (see K-fold cross-validation of model performance in “Methods”).
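To make the structure of these models concrete, the sketch below gives one plausible Python implementation of a two-system learner with dynamic arbitration. It is not a reproduction of Eqs. 8–16: the specific forms of the value update, the forgetting of the unchosen option, the reliability-driven update of ω based on Vcho, and the way ρ and ω are combined into an effective weight are illustrative assumptions, loosely guided by the parameter names that appear in the figure captions (α+, α−, ζ, αω, ζω, β0, β1).

```python
import numpy as np

def p_choose_right(ov_right, ov_left, beta1, beta0=0.0):
    """Logistic choice rule on the overall values of the right vs. left saccade."""
    return 1.0 / (1.0 + np.exp(-(beta1 * (ov_right - ov_left) + beta0)))

class DynamicOmegaRho:
    """Sketch of a two-system RL model with dynamic arbitration (assumed update rules)."""

    def __init__(self, alpha_pos=0.5, alpha_neg=0.5, zeta=0.3,
                 alpha_w=0.2, zeta_w=0.05, rho=0.5, omega0=0.37):
        self.a_pos, self.a_neg, self.zeta = alpha_pos, alpha_neg, zeta
        self.a_w, self.z_w, self.rho = alpha_w, zeta_w, rho
        self.omega = omega0                 # arbitration weight (stimulus vs. action system)
        self.v_stim = np.full(2, 0.5)       # values of the two novel stimuli
        self.v_act = np.full(2, 0.5)        # values of the two actions (0 = left, 1 = right)

    def effective_omega(self):
        # assumed way of folding rho into omega; reduces to omega when rho = 0.5
        num = self.rho * self.omega
        return num / (num + (1 - self.rho) * (1 - self.omega))

    def overall_values(self, left_stim):
        """Overall value of the left and right saccade when stimulus `left_stim`
        (0 or 1) appears on the left and the other stimulus on the right."""
        w = self.effective_omega()
        stim_on = (left_stim, 1 - left_stim)              # stimulus identity on (left, right)
        return [w * self.v_stim[s] + (1 - w) * self.v_act[a]
                for a, s in enumerate(stim_on)]

    def update(self, chosen_act, chosen_stim, reward):
        # value updates in both systems from the same outcome, with forgetting of the unchosen option
        for v, idx in ((self.v_stim, chosen_stim), (self.v_act, chosen_act)):
            rpe = reward - v[idx]
            v[idx] += (self.a_pos if rpe > 0 else self.a_neg) * rpe
            v[1 - idx] += self.zeta * (0.5 - v[1 - idx])  # decay unchosen value toward baseline (assumed)
        # reliability comparison via the value of the chosen option in each system (V_cho)
        rel = self.v_stim[chosen_stim] - self.v_act[chosen_act]
        target = 1.0 if rel > 0 else 0.0
        self.omega += self.a_w * abs(rel) * (target - self.omega)  # arbitrate toward the more reliable system
        self.omega += self.z_w * (0.5 - self.omega)                # slow drift back toward indifference (assumed)
```

In this sketch, setting ρ = 0.5 makes the effective weight equal to ω, matching the statement in Fig. 2a that the Dynamic ω-ρ model then reduces to the Dynamic ω model.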

Fig. 2: Fit of choice behavior of control monkeys using various RL models.
figure 2

a Schematic of the RL model with two parallel learning systems, showing an example trial in which stimulus A appeared on the left side. In the static model, a constant ω is assumed to be fixed for each block of trials. In the dynamic models, ω is updated on each trial according to the relative reliability of the two systems. In a more general dynamic model, the fixed parameter ρ (estimated for each subject) adjusts the baseline ratio of two value signals. For ρ = 0.5, the Dynamic ω-ρ model reduces to the Dynamic ω model. The overall value (OV) of a left or right saccade is determined as a weighted combination of action and stimulus values. b Comparison of goodness-of-fit across models. Plotted is the mean negative log-likelihood over all cross-validation instances for each task: What-only (black), What/Where (gray). Numbers in parentheses indicate McFadden R2 (Eq. 20). c An example Where block in the What/Where task and estimated arbitration weight from the Static ω model (dotted line), and arbitration weights (ω, dashed line) and effective arbitration weights (Ω, solid line) from the best model (Dynamic ω-ρ model). In this example, ρ = 0.61, effectively biasing behavior toward a stimulus-based strategy. In this block, rightward action (R) was a better option than leftward action (L) before reversal (rev, horizontal dashed line). d Average trajectory of Ω from the Dynamic ω-ρ model during different tasks and blocks: What-only (solid), What (dashed), and Where (dotted). Different colors correspond to different reward schedules: 80/20 (black), 70/30 (dark gray), 60/40 (light gray). <rev> indicates reversal (horizontal dashed line), with positions normalized across blocks. e Relationship between Ω (block-averaged) and median reaction time for a given block during the What/Where task. Reported are Spearman’s correlation coefficient r and its p-value (two-sided) for all blocks during the What/Where task. Source data are provided as a Source data file.

Comparing the single-system models with the simplest two-system model, which assumes a fixed relative weighting for the two systems (RLStim+Action + Static ω), we found that the latter provided a better fit. Interestingly, this model improved the goodness-of-fit even in the What-only task, in which action learning was not predictive of reward. Overall, however, all the dynamic models provided a better fit than the model with fixed weighting. Ultimately, the Dynamic ω-ρ model, which uses the value of the chosen option (Vcho) to estimate reliability and incorporates a baseline weighting of the two systems quantified by ρ, provided the best fit across all tasks and for each monkey (Fig. 2b; Supplementary Table 16).
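For reference, a generic version of such a cross-validated model comparison is sketched below. Here fit_model is a hypothetical routine standing in for the actual fitting procedure described in Methods, and the fold structure over blocks is only illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_negative_log_likelihood(blocks, fit_model, n_splits=5, seed=0):
    """Mean held-out negative log-likelihood per trial, averaged over folds.

    `fit_model(train_blocks)` is a hypothetical routine that fits a model and
    returns a function mapping a block to per-trial probabilities of the
    observed choices."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(blocks):
        predict = fit_model([blocks[i] for i in train_idx])
        ll = [np.log(p) for i in test_idx for p in predict(blocks[i])]
        scores.append(-np.mean(ll))
    return float(np.mean(scores))
```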

To gain more insight into how dynamic arbitration improves the fit of choice behavior, we next examined the behavior of the Dynamic ω-ρ model and its arbitration weights over time. To that end, we computed the effective arbitration weight (“effective” ω denoted by Ω) to measure the overall relative weighting between two systems considering the parameter ρ (see Eq. 12 in “Methods”). Both the example block and the averaged trajectories of trial-by-trial Ω from the best model (Fig. 2c, d) showed dependence on the block type and uncertainty in the reward schedule, especially during the What/Where task that required arbitration between competing models of the environment. These results demonstrate that Ω can capture behavioral adjustments to uncertainty over time.

As shown above, stimulus-based choices lead to slower RTs (Supplementary Note 1). Motivated by this finding, we tested whether the Dynamic ω-ρ model could capture the differences in RT according to the dominant learning system. To that end, we computed the correlation between the median RTs of the block and the average estimated values of Ω, which measures the overall relative weighting of the stimulus-based to action-based system. For the What-only task, we found a small yet significant correlation between the effective arbitration weight and RT (Spearman’s r = 0.094, p = 1.24 × 10−4). In comparison, Ω and RT were highly correlated in the What/Where task (r = 0.414, p = 3.78 × 10−245; Fig. 2e). However, because these simple correlations do not control for other confounding factors such as choice confidence (measured by value difference), choice accuracy (choosing the better or worse option), and long-term drift in RT, we conducted further regression analyses. These analyses included these factors along with other model-derived measures to predict trial-by-trial RT (Supplementary Note 2). Using these analyses, we found that across all groups and conditions, higher Ω predicted longer RT (Supplementary Note 2). Together, these results indicate that slower RTs occurred when larger weights were assigned to the stimulus-based system, and faster RTs occurred with larger weights on the action-based system. This is consistent with the previous analysis, which showed that action-dominant trials (determined using ERDS) were accompanied by faster RTs.

As part of our exploration of arbitration mechanisms, we also compared multiple algorithms for estimating the reliability of the two systems, including Vcho, |RPE|, discernibility between two competing options (|ΔV|), and the sum of value estimates within each system (ΣV) (see Eqs. 13–16 in “Methods”). We found that among the four reliability measures considered, Vcho best explained the control monkeys’ choice behavior across all block types (Supplementary Fig. 7a). Specifically, the Dynamic ω model based on Vcho improved the fit over the Static ω model in all tasks, whereas the Dynamic ω model based on |RPE| improved the fit over the Static ω model only in the What/Where task (Fig. 2b). Through model recovery, we confirmed that our model fitting procedure could effectively discriminate between the alternative one-system and two-system models (Supplementary Fig. 8a–d).

Importantly, Vcho and |RPE| are conceptually related, as both signals measure the predictiveness of reward values in each system. However, the two signals differ in their sensitivity to negative feedback (i.e., no reward). For example, when the chosen value of the more reliable system (e.g., stimulus-based system in What blocks) is (correctly) estimated to be high, negative RPE (and consequently |RPE|) will also be high, undesirably facilitating the update toward the incorrect system. In comparison, Vcho by itself is less sensitive to negative feedback, as the updated value of Vcho after omission of reward will still reflect the high value estimates for the more reliable system. As a result, the Vcho signal distinguishes the more reliable system better than the |RPE| signal, especially for more uncertain reward schedules (Supplementary Fig. 7b–d). Ultimately, the difference in Vcho drives the arbitration process (Eqs. 8–9), and this difference is equal to the difference in signed RPE. This suggests that the reliability signal could be more closely linked to signed rather than unsigned RPE.
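A minimal derivation of this last point, assuming that on a given trial both systems are updated from the same reward outcome r and that their chosen values are compared directly:

```latex
\mathrm{RPE}_{\mathrm{stim}} = r - V^{\mathrm{cho}}_{\mathrm{stim}},
\qquad
\mathrm{RPE}_{\mathrm{act}} = r - V^{\mathrm{cho}}_{\mathrm{act}}
\;\;\Longrightarrow\;\;
V^{\mathrm{cho}}_{\mathrm{stim}} - V^{\mathrm{cho}}_{\mathrm{act}}
 = \mathrm{RPE}_{\mathrm{act}} - \mathrm{RPE}_{\mathrm{stim}} .
```

That is, the quantity separating the two systems is a difference of signed prediction errors, whereas taking |RPE| within each system discards the sign and, with it, part of this information.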

Finally, to further validate our models, we compared the predictions of different models regarding the observed negative interaction between ERDSStim and ERDSAction during the What-only task. This was to ensure that this relationship was due to competition between the two learning systems and not due to task structure, as the animals could not stay or switch along both the stimulus and location dimensions at the same time, given that the positions of the stimuli were pseudo-randomly assigned to either side. To that end, we simulated choice behavior using single-system or two-system models and computed regression weights between ERDSStim and ERDSAction (see Model fitting and simulation in “Methods” for more details). Competition between the two learning systems during the What-only task would suggest that weaker stimulus-based learning corresponds to stronger action-based learning and vice versa. We found that in the single-system model that learned the stimulus-outcome contingencies only (RLStim-only), ERDSStim was only weakly predictive of ERDSAction (Supplementary Fig. 9b; β = −0.007, p = 1.07 × 10−44). In contrast, in both the static and dynamic two-system models, ERDSStim was negatively predictive of ERDSAction, thus reproducing the competitive interaction between the two systems (Supplementary Fig. 9c; β = −0.0648, p = 1.09 × 10−56; Supplementary Fig. 9d; β = −0.0893, p = 6.73 × 10−57). These simulation results further support the presence of multiple learning systems and their dynamic interaction, even in an environment where one of the two systems was not beneficial for performing the task.

Deficits in arbitration due to amygdala but not VS lesions

Fitting the choice behavior of the lesioned monkeys revealed that the Static ω model explained the behavior of both lesioned groups better than models with a single learning system (Fig. 3a). Moreover, incorporating dynamic arbitration as in the Dynamic ω model further improved the fit beyond what the Static ω model achieved in both groups. Furthermore, including baseline relative weighting, as in the Dynamic ω-ρ model, resulted in the best overall fit (Fig. 3a). Finally, consistent with the results in controls, for both lesioned groups, the dynamic model that used Vcho to estimate reliability accounted for choice behavior better than the model using |RPE| for estimating reliability (Supplementary Fig. 7a).

Fig. 3: Fit of choice behavior for amygdala- and VS-lesioned monkeys.
figure 3

a Comparison of the models’ goodness-of-fit for the choice behavior of amygdala-lesioned (left) and VS-lesioned (right) monkeys. Plotted is the mean negative log-likelihood over all cross-validation instances for each task: What-only (black), What/Where (gray). Numbers in parentheses indicate McFadden R2. Averaged trajectory of estimated Ω (effective ω) in amygdala-lesioned (b) and VS-lesioned (c) monkeys, separately for each block type and reward schedule. Solid, dashed, and dotted curves indicate What-only, What, and Where blocks, respectively. <rev> indicates reversal (horizontal dashed line), normalized across blocks. Colors indicate different reward schedules: 80/20 (brown and navy), 70/30 (red and blue), 60/40 (orange and cyan). d Schematic of effective arbitration rates. ψ+ and ψ− represent the rate of update toward the stimulus-based (increase in Ω) and action-based system (decrease in Ω). e Plotted are the time courses of effective arbitration rates toward the stimulus-based or action-based system during the What-only task for each group of monkeys (controls: black; VS-lesioned: blue; amygdala-lesioned: red). Solid and dotted lines indicate effective arbitration rates toward stimulus-based (ψ+) and action-based systems (ψ−), respectively. Shaded regions indicate error bars. Insets show the mean paired differences between the two arbitration rates within each block after reversal (Δψ = ψ+ − ψ−). Asterisks indicate significant difference from zero within each group as determined by mixed-effects analysis (p < 0.05, two-sided, corrected for multiple comparisons using Benjamini–Hochberg procedure; control: p = 1.24 × 10−102; amygdala: p = 0.00116; VS: p = 1.20 × 10−4; see Supplementary Table 3 for the full statistics). Individual data points represent the mean of each monkey. Error bars = SEM across subjects (control: n = 4; amygdala: n = 4; VS: n = 3). f Same plot as in (e) but for What blocks of the What/Where task (control: p = 1.59 × 10−9; amygdala: p = 0.218; VS: p = 3.95 × 10−7; see Supplementary Table 4 for the full statistics). g Same plot as in (e) but for Where blocks of the What/Where task (control: p = 0.00968; amygdala: p = 0.824; VS: p = 2.03 × 10−5; see Supplementary Table 4 for the full statistics). Error bars = SEM across subjects (control: n = 6; amygdala: n = 4; VS: n = 3). Source data are provided as a Source data file.

To determine the mechanisms by which different lesions impact the arbitration process, we examined the estimated parameters in the best model. We first confirmed that the parameters of this model were recovered well (Supplementary Fig. 8e, f). The estimated trajectory of Ω revealed that, similar to controls, arbitration was modulated by reward uncertainty during the What/Where task in both amygdala-lesioned (Fig. 3b) and VS-lesioned monkeys (Fig. 3c). Nonetheless, Ω values were overall smaller than in controls, corresponding to a more action-based strategy in lesioned animals (compare Fig. 3b, c and Fig. 2d). Importantly, a key difference between the two lesioned groups demonstrates deficits in the arbitration process due to amygdala lesions. During the What/Where task, the effective arbitration weight (Ω) for amygdala-lesioned monkeys increased over time in both What and Where blocks (dashed and dotted curves in Fig. 3b). In contrast, VS-lesioned monkeys showed an increase in Ω during What blocks and a decrease in Ω during Where blocks (dashed and dotted curves in Fig. 3c), mirroring the pattern observed in control animals (dashed and dotted curves in Fig. 2d). Meanwhile, during the What-only task, Ω remained stable but at lower values for amygdala-lesioned (solid curves in Fig. 3b) compared to VS-lesioned monkeys (solid curves in Fig. 3c), and significantly lower than in controls (solid curves in Fig. 2d).

These results demonstrate that VS lesions biased behavior toward action-based learning while keeping the arbitration processes relatively intact, whereas amygdala lesions impaired arbitration in addition to biasing behavior toward action-based learning. These suggest that the deficits observed in amygdala-lesioned monkeys cannot solely be attributed to impairments in stimulus-based learning; instead, they involve a more complex interaction between stimulus- and action-based signals. Consistent with this interpretation, we found that the winning model with dynamic arbitration (Dynamic ω-ρ) more accurately captures the key aspects of behavioral strategy in the two lesioned groups compared to the single-system models (Supplementary Fig. 10).

To further investigate the dynamics of the arbitration weight, we next examined the rate of change in Ω across the three groups. To that end, we computed the “effective” arbitration rates as the ratio of the overall change in Ω toward 1 (favoring the stimulus-based system) or 0 (favoring the action-based system) relative to its original value (Fig. 3d; see Eqs. 17–19 in “Methods”). This quantity measures the rate of arbitration, analogous to the learning rate for updating value estimates. We found that in control monkeys, the effective arbitration rates toward the stimulus-based (ψ+) or action-based (ψ−) system diverged toward the end of a block, reflecting the adoption of the correct model of the environment. That is, when the stimulus-based system was more reliable, the effective arbitration rates toward the stimulus-based system were larger than those toward the action-based system (mixed-effects model with a single fixed intercept, representing the mean Δψ; What-only task: β0 = 0.0930, p = 1.24 × 10−102; What blocks of What/Where task: β0 = 0.0444, p = 1.59 × 10−9; Fig. 3e, f; Supplementary Tables 3 and 4). Similarly, in Where blocks, where the action-based system was more reliable, the effective arbitration rate toward the action-based system was significantly larger than that toward the stimulus-based system (β0 = −0.0301, p = 0.00968; Fig. 3g).
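One concrete (assumed) reading of this definition is sketched below: each trial-to-trial increase in the effective weight is normalized by the remaining distance to 1, and each decrease by the distance to 0. The exact normalization used in Eqs. 17–19 may differ.

```python
import numpy as np

def effective_arbitration_rates(omega_eff):
    """Assumed sketch of effective arbitration rates (cf. Eqs. 17-19).

    psi_plus:  normalized updates toward the stimulus-based system (Omega -> 1)
    psi_minus: normalized updates toward the action-based system  (Omega -> 0)
    """
    w = np.clip(np.asarray(omega_eff, float), 1e-6, 1 - 1e-6)
    d = np.diff(w)
    prev = w[:-1]
    psi_plus = np.where(d > 0, d / (1.0 - prev), 0.0)
    psi_minus = np.where(d < 0, -d / prev, 0.0)
    return psi_plus, psi_minus
```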

In contrast, amygdala-lesioned monkeys showed the least differentiation between adjustments toward the more and less reliable (correct and incorrect) learning systems. Notably, during the What/Where task, amygdala-lesioned monkeys exhibited no significant difference between the two arbitration rates during either block type (mixed-effects analysis on Δψ with a single intercept; What: β0 = 0.0114, p = 0.218; Where: β0 = −0.00347, p = 0.824; Fig. 3f, g; Supplementary Table 4). By comparison, VS-lesioned monkeys exhibited an overall large bias in arbitration rates toward the action-based system (i.e., higher ψ−) during both block types (mixed-effects analysis on Δψ with a single intercept; What: β0 = −0.0536, p = 3.95 × 10−7; Where: β0 = −0.0739, p = 2.03 × 10−5; Fig. 3f, g; Supplementary Table 4). Overall, amygdala-lesioned monkeys were characterized by the least amount of differentiation between the two arbitration rates (mixed-effects analysis on |Δψ|; contrast from controls = −0.0254, p = 0.0114; contrast from VS = −0.0282, p = 0.0177; Supplementary Table 5).

In the What-only task, with reduced uncertainty about the model of the environment, the difference between the two arbitration rates in amygdala-lesioned monkeys was positive (contrast on group means = 0.0142, p = 0.00116; Fig. 3e; Supplementary Table 3) but much smaller than that of control monkeys (main effect of group in mixed-effects analysis; βamyg = −0.0786, p = 1.15 × 10−37). The VS-lesioned group also exhibited higher arbitration rates toward the stimulus-based system (contrast on group means = 0.0226, p = 1.20 × 10−4), which aligns with the recovered performance observed in these monkeys.

Together, these results suggest that amygdala lesions impair arbitration between the two learning systems by eliminating differential updates for the correct and incorrect (more and less reliable) systems. This indicates that the amygdala is critical for identifying and/or retaining the correct model of the environment, or biasing arbitration toward it. In contrast, VS lesions mainly impair stimulus-based learning and increase the overall arbitration bias toward action-based learning.

Dynamic interaction between learning and arbitration processes and the impact of the initial state

Considering the observed effects of amygdala and VS lesions on arbitration dynamics, we next examined the estimated parameters from the best-fit model (Dynamic ω-ρ). In this model, ρ captures whether there is an overall reduction in baseline value signals from the stimulus-based system relative to the action-based system. Consistent with the hypothesized role of VS in stimulus learning, the estimated values of ρ were on average smaller in VS-lesioned monkeys compared to controls (permutation test for difference in group mean; p = 0.0044), indicating a larger baseline reduction in stimulus-value signals relative to action-value signals in VS-lesioned monkeys (Fig. 4a). In contrast, we found no such evidence for a reduction in ρ in amygdala-lesioned monkeys relative to controls (permutation test for difference in group mean; p = 0.869). As a result, VS-lesioned monkeys exhibited a smaller difference in choice sensitivity to stimulus- and action-value signals (Δβ = βstim − βaction), with a bias toward action-value signals, compared to controls in both the What-only task (mixed-effects analysis on Δβ, main effect of group; βVS = −7.48, p = 0.00658; Fig. 4b inset; Supplementary Table 6) and the What/Where task (βVS = −3.53, p = 0.0333; Fig. 4c inset). This was not the case for amygdala-lesioned monkeys, which exhibited no significant difference compared to controls in either task (What-only: βamyg = −2.69, p = 0.280; Fig. 4b inset; What/Where: βamyg = 0.542, p = 0.719; Fig. 4c inset; Supplementary Tables 6 and 7). These results suggest that, unlike VS lesions, amygdala lesions did not significantly alter the relative baseline strength of stimulus-value vs. action-value signals. Instead, amygdala lesions reduced sensitivity to both systems. Therefore, consistent with previous observations, the deficits observed in amygdala-lesioned monkeys cannot be solely attributed to impairments in stimulus-based learning. Instead, they suggest deficits in arbitration processes that subsequently affect learning and decision-making.

Fig. 4: Comparison of relative weighting and sensitivity of choice to stimulus- and action-based signals across control and lesion groups, and their simulations.
figure 4

a Plots show the mean and individual values (each point represents a monkey) of the relative strength of two systems on choice (ρ), separately for each group and task. Error bars = SEM across subjects (controls: n = 4 in What-only, n = 6 for What/Where; amygdala: n = 4 for both tasks; VS: n = 3 for both tasks). b Comparison of the sensitivity of choice to signals in the two systems, separately for each group during the What-only task. β1 is the common sensitivity of choice (inverse temperature) estimated for each session, and ρ is the relative strength of the two systems as in (a). Insets show violin plots of the paired differences (Δβ = βstim − βaction), and asterisks indicate significant group effects as determined by mixed-effects analysis (p < 0.05, two-sided, corrected for multiple comparisons using Benjamini–Hochberg procedure; cont vs. amyg: p = 0.280; amyg vs. VS: p = 0.080; cont vs. VS: p = 0.00658; see Supplementary Table 6 for the full statistics). c Same plot as in (b) but for the What/Where task (cont vs. amyg: p = 0.719; amyg vs. VS: p = 0.0232; cont vs. VS: p = 0.0333; see Supplementary Table 7 for the full statistics). Only VS-lesioned monkeys showed larger sensitivity to action- than stimulus-based systems during both tasks, consistent with the role of the VS in stimulus-based learning. df Plots show the simulated trajectory of ω and the difference in the effective arbitration rates (ψ+ − ψ−). Each line represents averaged trajectories (10,000 simulated blocks) of ω with different initial values (ω0) and specified values of ρ and β1 during What blocks. All other parameters are fixed (α+ = α− = 0.5, β0 = 0, ζ = 0.3, αω = 0.2, ζω = 0.05). Black, red, and gray arrows show the trajectory simulated with ω0 equal to the mean of control, amygdala-lesioned, and VS-lesioned monkeys, showing ψ+ > ψ−, ψ+ ≈ ψ−, and ψ+ < ψ−, respectively. Horizontal line (trial 40) indicates reversal. Inset in (e) shows distributions of ω0 across the three groups during the What/Where task, and asterisks indicate significant group difference (mixed-effects analysis). gi Plots show the simulated trajectory of ω and performance, P(Better). Conventions are the same as in (df). Source data are provided as a Source data file.

To confirm this point, we examined the initial arbitration weights that determine the weights of the two systems on choice at the beginning of each block, when the monkeys were unaware of the correct model of the environment during the What/Where task. We note that amygdala- and VS-lesioned monkeys did not significantly differ in the initial Ω (effective ω) values (planned contrast for group difference in Ω0 = 0.0242, p = 0.657; Supplementary Table 8; compare Ω of the first trial in Fig. 3b, c). However, by examining the initial arbitration weights (ω0) before scaling by ρ, we found that amygdala-lesioned monkeys had significantly smaller ω0 values compared to controls (mixed-effects analysis on ω0; βamyg = −0.194, p = 0.00419) while the VS-lesioned group did not (βVS = −0.119, p = 0.108; Supplementary Table 9). More directly, the amygdala group showed larger changes in ω0 after scaling by ρ compared to the VS group (group contrast in mixed-effects analysis on Ω0 − ω0 = 0.111, p = 0.0204; Supplementary Table 10). This means that larger values of ρ in amygdala-lesioned monkeys were offset by lower ω0 values to yield Ω0 comparable to VS-lesioned monkeys. Therefore, deficits due to amygdala lesions can be mainly attributed to the reduction in the initial weight (ω0) and subsequent interaction between arbitration and learning processes. In contrast, deficits due to VS lesions are largely caused by a reduction in the relative baseline strength of stimulus-value to action-value signals, measured by ρ.

To further validate this idea through model simulations, we generated the choice behavior of the Dynamic ω-ρ model by adjusting two key parameters to mimic the effects of brain lesions: the baseline ratio of the weights of the stimulus- to action-value signals (ρ) and the initial arbitration weight (ω0). These two parameters reflected the most consistent effect of lesions across the two tasks, with reduced ρ in VS-lesioned monkeys (Fig. 4a–c; βVS = −0.195, p = 0.00279; mixed-effects analysis on compiled ρ across all groups/tasks) and reduced ω0 in amygdala-lesioned monkeys (βamyg = −0.191, p = 0.0171; mixed-effects analysis on compiled ω0 across groups/tasks). We kept all other parameters constant except for the common inverse temperature, β1.
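In terms of the hypothetical DynamicOmegaRho sketch introduced earlier, these simulations amount to varying only ρ and ω0 across groups (and, in the actual analysis, the common inverse temperature β1) while holding the remaining parameters at the values listed in the Fig. 4 caption. The group means for ω0 below are those reported in the text, ρ = 0.4 for the VS-like case follows the text, and the control and amygdala values of ρ are assumptions.

```python
# Hypothetical re-use of the DynamicOmegaRho sketch defined earlier.
# Fixed parameters follow the Fig. 4 caption; omega0 values are the group
# means reported in the text; rho for the control and amygdala cases is an
# assumption (0.5), rho = 0.4 for the VS-like case follows the text.
groups = {
    "control":  dict(rho=0.5, omega0=0.374),
    "amygdala": dict(rho=0.5, omega0=0.179),   # reduced initial arbitration weight
    "VS":       dict(rho=0.4, omega0=0.276),   # reduced baseline weight on stimulus values
}
for name, pars in groups.items():
    model = DynamicOmegaRho(alpha_pos=0.5, alpha_neg=0.5, zeta=0.3,
                            alpha_w=0.2, zeta_w=0.05, **pars)
    # ...simulate many What blocks with `model` and compare psi_plus vs. psi_minus
```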

Trajectories of simulated ω during What blocks revealed that different values of ρ and ω0 can create different dynamics with respect to arbitration rates (Fig. 4d–f). More specifically, the simulated trajectory of ω based on mean ω0 in control monkeys during the What/Where task (Fig. 4d, black arrow) resulted in larger transitions toward the stimulus-based system (ψ+ > ψ−), whereas mean ω0 in amygdala-lesioned monkeys (Fig. 4e, red arrow) reduced the distinction between the two arbitration rates (ψ+ ≈ ψ−). In comparison, simulations using mean ω0 in VS-lesioned monkeys (Fig. 4f, gray arrow) resulted in an overall update bias toward the action-based system (ψ+ < ψ−). These results qualitatively mimic the pattern of effective arbitration rates in the three groups (Fig. 3e–g).

Finally, we also tested the causal contribution of the initial arbitration weight to performance using simulated choice behavior (Fig. 4g–i). Crucially, we found that lower values of ω0, as observed in amygdala-lesioned monkeys (ω0 = 0.18; red arrows in Fig. 4h), led to reduced performance when compared to higher ω0 values (e.g., ω0 = 0.40). This effect was reflected in a significant main effect of ω0 on the simulated performance (F(20,1659) = 40.6, p = 8.38 × 10−128). These simulation results demonstrate that reduced flexibility in the arbitration process––reflected by a lower ω0––could be the main cause of impaired performance, rather than just a secondary consequence.

Although control monkeys were also biased toward the action-based system at the start of the What/Where task (mean ± s.e.m; ω0 = 0.374 ± 0.058), lesions to the amygdala resulted in an even larger bias toward the action-based system (ω0 = 0.179 ± 0.053), and this consequently led to a lack of differential updates for the two systems. Therefore, our simulations indicate that amygdala-lesioned monkeys operate within a parameter space that produces smaller differences in arbitration rates favoring the correct system for a given environment, which ultimately reduces performance. Overall, these results suggest that the initial state of the system (ω0) is crucial for determining the later trajectory and rates of transition in the arbitration process.

In contrast, lesions to VS mainly decreased ρ to bias signals toward the action-based system, while affecting the initial state of arbitration to a lesser degree (ω0 = 0.276 ± 0.042). It is worth noting that the simulations using ρ = 0.4 (Fig. 4f, i), which mimics the reduction in the relative baseline strength of stimulus-value signals due to VS lesions, result in biased update rates toward the action-based system for many of the ω0 values (blue lines in Fig. 4f), including ω0 ~ 0.37, which matches the initial values for control monkeys. Therefore, the consistent adoption of an action-based strategy in the What/Where task can be sufficiently accounted for by a reduced ρ value, without the need for additional constraints on ω0. These results support the notion that the impairments observed in VS-lesioned monkeys during this task can be fully explained by a reduction in stimulus-value signals, with minimal direct impact on arbitration processes.

Diversity of behavior driven by the dynamic interaction between learning and arbitration processes

To demonstrate the impact of dynamic interaction between the learning and arbitration processes on behavior, we simulated the model within the task by adjusting parameters such as the learning and forgetting rates. These simulations revealed a wide range of dynamics in performance and arbitration weights, highlighting complex interactions between learning, arbitration, and decision-making processes (Fig. 5). Interestingly, we observed that higher initial arbitration weights, which would allow the animals to correctly bias their behavior toward the stimulus-based system during a stimulus-learning task, can both facilitate and impede learning after reversals depending on other parameters of the model.

Fig. 5: Complex interaction between arbitration and learning gives rise to diverse behavioral patterns.
figure 5

Each line represents averaged trajectories (10,000 simulated blocks) of Ω with different initial values during a stimulus-based learning task with reversal at trial 100. All simulations were performed with ρ = 0.5, causing Ω = ω. a–f Different behavioral patterns based on simulation of choice behavior using different model parameters, as indicated on the top. All non-specified parameters are fixed across panels at β1 = 20, β0 = 0, and ζω = 0.05. a Larger initial values (ω0) facilitate learning after reversal. b Larger initial values (ω0) impede learning after reversal. c Initial values ω0 > 0.15 increase stable points of ω toward 1, whereas small ω0 (<0.15) results in low performance. d Small decay or forgetting for the unchosen option (ζ) and large transition rate (αω) facilitate arbitration toward the correct model. e For certain model parameters, bifurcation of trajectories happens around ω of 0.5. f Steady state of arbitration is controlled by the initial value. Source data are provided as a Source data file.

However, in most cases, a larger initial bias toward the stimulus-based system helps both initial learning of stimuli and their reversals (Fig. 5a, c–f).

In contrast, in scenarios where the positive learning rate significantly exceeds the negative learning rate, a smaller initial arbitration weight—though it may incorrectly bias behavior toward an action-based strategy—can actually facilitate adjustments to reversals in stimulus values (Fig. 5b). This happens because lower values of ω0, as in the case of amygdala lesions, result in a dependence of choice on both stimulus- and action-based signals and thus more exploration, which greatly benefits responses to reversals. These results, based on our best dynamic arbitration model, can thus explain the paradoxical improvements in performance observed following amygdala lesions or inactivation.

Contribution of the amygdala to long-term adjustments of behavior

Lesions to certain brain areas are often accompanied by adjustments or compensation by other brain areas that reduce initial behavioral impairments over the long term. Considering the observed effects of amygdala and VS lesions on learning and decision-making behavior, we investigated long-term adjustments in these behaviors in the absence of task-imposed, objective uncertainty about the correct model of the environment. To that end, we examined ERDS, median RT, and the initial arbitration weight across all sessions of the What-only task using the proportion of sessions completed as an independent variable (Methods).

For consistency in stimulus-based strategy, we observed a long-term decrease in ERDSStim in VS-lesioned monkeys (planned contrast for the slope of VS group = −0.626, p = 4.94 × 10−324; Supplementary Fig. 11a), to a significantly greater extent than control monkeys (βVS:sess% = −0.549, p = 7.78 × 10−12, controls as a reference group; Supplementary Table 11). There was no evidence for such adjustment in control (βcont:sess% = −0.077, p = 0.0881) or amygdala-lesioned monkeys across time (planned contrast for the slope of amygdala group = −0.033, p = 0.478). That is, despite their impaired stimulus-based learning, monkeys with VS lesions were able to increase their adoption of stimulus-based strategy over time. Consistently, these monkeys also decreased their adoption of action-based strategy as reflected in the positive slope of ERDSAction over time (planned contrast for the slope of VS group = 0.192, p = 3.49 × 10−4; Supplementary Fig. 11b), which was significantly greater compared to controls (βVS:sess% = 0.211, p = 0.00208; Supplementary Table 12). There was no evidence of such an effect in control monkeys (βcont:sess% = −0.019, p = 0.652) or in monkeys with amygdala lesions (planned contrast for the slope of amygdala group = 0.014, p = 0.748). Interestingly, consistent with previous results, the complementary changes in model adoption in VS-lesioned monkeys were also reflected in increased median RT over time in these monkeys (planned contrast for the slope of VS group = 30.7, p = 1.61 × 10−6; Supplementary Fig. 11c; Supplementary Table 13) to a greater extent than controls (βVS:sess% = 24.3, p = 0.00236) or amygdala-lesioned monkeys (planned contrast for group difference in slopes = −29.6, p = 0.00224). This was accompanied by a long-term increase in the initial effective arbitration weights Ω0 (toward stimulus-based system) only in the VS-lesioned monkeys (planned contrast for the slope of VS group = 0.455, p = 2.88 × 10−4; Supplementary Fig. 11d; Supplementary Table 14), which was significantly greater compared to both controls (βVS:sess% = 0.568, p = 5.42 × 10−5) and amygdala group (planned contrast for group difference in slopes = −0.431, p = 0.00183).

These results provide evidence for adjustments on a long timescale in VS-lesioned but not amygdala-lesioned monkeys. They suggest that, in the absence of additional uncertainty about the model of the environment, intact amygdala in VS-lesioned monkeys (and not intact VS in amygdala-lesioned monkeys) enabled these animals to slowly improve their performance over time. This amygdala-driven mechanism enabled VS-lesioned monkeys to gradually suppress action-based strategy, resulting in an increase in overall RT and initial effective arbitration weight (Ω0) over time.

Discussion

Here, we applied a combination of computational approaches to reanalyze data from control monkeys and those with amygdala and VS lesions17,18,24 to explore the interaction between stimulus- and action-based learning and to uncover computational and neural mechanisms underlying arbitration processes. Our main goal was to investigate the interaction between stimulus- and action-based learning systems, instead of examining them in isolation as in the original studies. Using multiple behavioral metrics, we found evidence for competitive interaction between the two learning systems. Moreover, by developing various models with arbitration and fitting choice data to these and competing models, we tested the plausibility of various mechanisms for estimating reliability signals that guide arbitration processes. Using this approach, we mapped the distinct effects of two brain lesions onto two key parameters of the model: the initial state of the arbitration (ω0) for amygdala lesions and the relative baseline strength of stimulus-value to action-value signals (ρ) for VS lesions.

For amygdala-lesioned monkeys, the reduced initial arbitration weight was implicated in undifferentiated arbitration rates toward and away from the correct learning system for a given environment. This suggests that the amygdala may have a role in identifying and retaining the correct model of the environment, or in biasing model arbitration toward it. Our model simulations also illustrated that the interaction between learning and arbitration processes generates diverse behaviors with strong dependency on the initial state.

Previous studies using the same dataset have identified deficits in both stimulus-based and action-based learning due to amygdala lesions17,18, but they considered these deficits independently, as they did not examine the interaction between stimulus- and action-based learning. Using a single-system RL model, they found that amygdala lesions reduce choice consistency (sensitivity to value signals) for stimulus-based learning17 and increase sensitivity to negative feedback (α−) for action-based learning18. We found consistent results using our two-system model with dynamic arbitration (Supplementary Figs. 12b, 13d and 14–16). Moreover, we provided a unified account of the monkeys' choice behavior based on the dynamic interaction between learning and arbitration under uncertainty. Specifically, our simulation results mimicking amygdala lesions (Fig. 4) suggest that a biased initial state strongly favoring action-based learning is the key feature of the deficits observed in amygdala-lesioned monkeys. This strong initial bias altered the interaction between decision-making, learning, and arbitration processes, making the arbitration update rates for the two systems less distinguishable from each other than in controls. Importantly, VS-lesioned monkeys exhibited a smaller sensitivity to stimulus-based compared to action-based signals. As a result, VS lesions led to an overall bias in arbitration update rates toward action-based learning in both What and Where blocks.

More specifically, we found that the difference in the effective weighting of the two systems (βstim and βaction) in amygdala-lesioned monkeys was not significantly different from that of controls (Fig. 4b, c inset). This indicates that amygdala lesions reduced sensitivity to stimulus-based and action-based signals to a similar degree, unlike the pattern observed in VS-lesioned monkeys. Instead, the main deficits due to amygdala lesions were captured by a biased initial state of arbitration that favors action-based signals. When coupled with the overall reduced sensitivity to value signals, this effect diminishes differential effective arbitration rates for correct and incorrect (more reliable and less reliable) models (Fig. 4e). This suggests that in addition to its contribution to stimulus-based learning, the amygdala also plays a crucial role in identifying and/or retaining the more reliable model of the environment16, or in mediating the influence of such identification on arbitration processes. It further suggests that the amygdala, like the prefrontal cortex, is involved in learning to learn30, and can explain why amygdala lesions reduce the amount of evidence needed before the animals reverse their choice preference16.

Interestingly, VS-lesioned monkeys (with intact amygdala) were able to gradually overcome their impaired stimulus-based learning while showing a significantly larger arbitration rate for the stimulus-based system during the What-only task (Fig. 3e). This suggests that a signal to or from the amygdala, but not in the amygdala-to-VS pathway, could bias arbitration toward the more reliable model and lead to slow long-term behavioral adjustments. We found that arbitration was still present in amygdala-lesioned monkeys, suggesting that the amygdala is not required for arbitration per se but has a more nuanced role by setting and/or retaining the initial balance between models and improving the overall sensitivity to value signals. These two effects result in larger arbitration rates for the more reliable model of the environment, thus altering the trajectory of learning and choice behavior.

Moreover, we found that while VS-lesioned monkeys exhibit a bias toward the action-based strategy during the What/Where task, their response times are also shorter than those of control and amygdala-lesioned monkeys. Given the significant involvement of VS in effort exertion31,32,33,34, the shorter RT in VS-lesioned monkeys could be linked to the fact that the stimulus-based strategy requires more cognitive effort. This is reflected in the slightly longer RT in What blocks compared to Where blocks, as well as the positive correlation between RT and arbitration weight. As a result, VS-lesioned monkeys may default to the action-based strategy, allowing them to perform the task faster. The stronger reliance on the action-based strategy in VS-lesioned monkeys can be adequately explained by the reduction in the ρ parameter of our model, without necessarily suggesting an impaired arbitration process.

Arbitration between alternative models has garnered significant interest in cognitive, behavioral, and systems neuroscience. This includes arbitration between model-free vs. model-based RL35,36,37,38, Pavlovian vs. instrumental control39, habitual vs. goal-directed systems40,41, competing sets of strategies for solving complex stimulus-response mappings42,43, and during social decision-making44,45,46. Here, we explored a more basic form of arbitration required for any type of decision-making, as any choice option has to be selected by taking an action. Unlike arbitration between different types of learning systems––which often requires distinct reliability signals (e.g., model-free vs. model-based RL relying on unsigned reward prediction error and unsigned state prediction error, respectively36)––we found that the same reliability signal, based on the value of the chosen option or chosen action (Vcho), can be used for arbitration between stimulus- and action-based learning. Critically, we found that in both controls and lesioned monkeys, the reliability signal based on Vcho captured arbitration better than the reliability signal based on unsigned RPE. Because the difference in chosen values is equal to the difference in signed RPEs (Eq. 14), our results suggest that the reliability of alternative models may be more linked to signed RPE than to unsigned RPE.

Our proposal that the amygdala contributes to model arbitration to identify and reinforce the correct model of the environment is consistent with its postulated role in signaling attentional shifts for relevant control of behavior47,48. There are several pathways through which the amygdala could affect model arbitration. One major candidate is prefrontal-amygdala circuits49,50. In particular, the orbitofrontal cortex (OFC) receives substantial projections from the amygdala51,52 and could serve a central role in encoding and monitoring the reliability of multiple actor predictive models3. Given that amygdala-to-OFC input has been reported to be significantly involved in value coding by OFC neurons53,54, it is possible that this input also carries information for selective arbitration to appropriately bias behavior toward the relevant learning system in a given environment. Conversely, the PFC-to-amygdala pathway could signal  an internal state variable (arbitration weight in our model), serving as the necessary input to the amygdala for computing a differential adjustment in model arbitration that is relayed back to the PFC. This may explain why basolateral amygdala lesions could reduce OFC-induced impairment in reversal learning14. Thus, strong reciprocal connections between amygdala and PFC, in particular vlPFC, OFC, or ACC, could be crucial for proper arbitration between alternative models of the environment.

While earlier lesion studies have attributed varying degrees of behavioral deficits to the amygdala55,56, its role in instrumental learning has since been a matter of debate due to mixed evidence both in favor of17,18,19,57,58 and against14,15,59,60,61 its involvement in reward learning. Critically, our framework can account for the amygdala’s seemingly inconsistent role. Through simulations of our best-fitting model with dynamic interaction between the two systems, we found that in certain situations, the lower initial arbitration weight that biases behavior toward the action-based system can actually facilitate adjustments to reversals during stimulus-based learning, especially if performance has saturated before the reversal (Fig. 5b). This happens because greater reliance on the less reliable action-based system allows faster exploration of alternative stimuli and thus faster reversal. This could account for the puzzling improvement in performance due to basolateral amygdala lesions in rats15 or monkeys16,59, which has been attributed to increased benefits from negative feedback. In these examples, the stimulus-based system would still prefer the previously better option, which is no longer rewarding, but a less reliable action-based system would cause more switching away from that option. Other studies that reported null results from reversible amygdala inactivations60,61 also utilized object reversals after initial learning of stimulus-reward associations over a long period of time (referred to as discrimination learning).

What these studies have in common is that they all utilize visual discrimination learning over a long period of time before a reversal, which would suppress learning from unrewarded trials (i.e., small α−) and allow the reliability of the stimulus-based system to reach its asymptote, thus slowing down reversal. This is very different from our experimental paradigm, in which reversals happened on a short timescale before the reliability of the stimulus-based system could stabilize. Overall, our study suggests that observed discrepancies in the effects of amygdala lesions are due to a dynamic interaction between arbitration, learning, and decision-making processes.

The dynamic interaction between arbitration and learning processes is particularly relevant in interpreting results using  behavioral paradigms that were intended to parse the contributions of one learning system, but where competing learning systems could have strong unintended effects on behavior. Our results indicate that interpreting behaviors shaped by various learning systems should be approached with caution. This is particularly important when the manipulations in use might influence the arbitration process and thereby change the interplay among the learning systems. In principle, a multitude of simple learning strategies could underlie the heterogeneity in the so-called decision variables62,63,64, and careful examination of neural signals65,66,67 is needed to properly identify the neural substrates of corresponding learning systems.

Methods

Experimental paradigm

We examined two variants of a probabilistic reversal learning task in which monkeys selected between two visual stimuli to obtain a juice reward. During each block of the What-only task, reward was assigned stochastically according to stimulus identity only, while reward probabilities of the two stimuli (selected afresh on each block) switched between trials 30 and 50 of the block without any signal to the monkeys (Fig. 1a). In the What/Where task, reward was assigned based on either stimulus identity (What blocks) or stimulus location (Where blocks), with reversals similar to the What-only task (more details below). Data were collected from a total of 20 unique monkeys, some of whom received bilateral excitotoxic lesions to either the amygdala or the VS (Supplementary Fig. 17). All experimental procedures for all monkeys were performed in accordance with the Guide for the Care and Use of Laboratory Animals and were approved by the National Institute of Mental Health Animal Care and Use Committee. We describe each experimental setup in more detail below.

What-only task

Data from this task17 were collected in eleven male rhesus macaques weighing 6.5–10.5 kg (controls: n = 4; amygdala-lesioned: n = 4; VS-lesioned: n = 3). The monkeys completed an average of 26.73 sessions (SD = 5.98) and an average of 16.81 (SD = 6.72) blocks per session. In total, monkeys completed on average 372.9 blocks (SD = 89.6). Each block consisted of 80 trials and involved a single reversal of stimulus-outcome contingencies on a randomly selected trial between trials 30 and 50 from a uniform distribution (Fig. 1a). On each trial, monkeys were trained to fixate on a central point on a screen (500–750 ms) to initiate the trial. After fixation, two stimuli, a square and a circle of random colors, were assigned pseudo-randomly to the left and right of the fixation point (6° visual angle). Monkeys indicated their choice by making a saccade to the target stimulus and fixating for 500 ms. Reward (0.085 ml juice) was delivered according to the assigned reward schedule for a given block. Each trial was followed by a fixed 1.5 s inter-trial interval. Trials in which monkeys failed to fixate within 5 s or make a choice within 1 s were aborted and then repeated.

The reward schedule was determined by the probabilities of reward on two choice options selected from four possible values: 100/0, 80/20, 70/30, and 60/40. The reward schedule was randomly selected at the start of each block and remained constant within that block. Monkeys performed the deterministic task (100/0 reward schedule) after the data collection for the stochastic task had been completed. Here, we focus on our analyses of the task’s stochastic variant to match the reward schedules used in the What/Where task (which only contained stochastic schedules, as described below), and therefore, we have excluded the deterministic portion of the data from our analyses. All monkeys that received lesions were trained and tested following their recovery from surgery. For more detailed surgical information, see the Supplemental Experimental Procedures in the original study17. This experimental setup and some analyses of the data have been previously reported17.

What/Where task

Unlike the What-only task, the What/Where task involved both stimulus-based and action-based learning, and this feature introduced additional uncertainty about the correct model of the environment. The effects of lesions to the VS and amygdala during the What/Where task were examined in two different studies with separate sets of controls for each. The first study investigating the effect of VS lesions24 had a total of eight subjects weighing 6.5–11 kg (controls: n = 5; VS-lesioned: n = 3). One of the five control monkeys and all three of the VS-lesioned monkeys were the same monkeys used in the What-only task17. The second study investigating the effect of amygdala lesions18 had a total of 10 subjects weighing 6–11 kg (controls: n = 6; amygdala-lesioned: n = 4). One of the six unoperated controls was the same monkey used in the What-only task17 and the What/Where task involving VS lesions24. One additional control monkey was used as an unoperated control for the earlier What/Where task only24. The remaining control and the four amygdala-lesioned monkeys were additionally trained for the subsequent study using the What/Where task18. See Supplementary Fig. 17a for a summary diagram. Any monkeys that participated in both the What-only and What/Where tasks (i.e., one control and three VS-lesioned monkeys) first completed the What-only task and later completed the What/Where task. Notably, all newly trained monkeys that performed the What/Where task were first trained on a simple two-armed bandit task with stimulus-based reward associations (“What” condition). After learning this task, they were then trained on a deterministic version (100/0) of the What/Where task and gradually transitioned to the probabilistic outcomes used for this experiment (80/20, 70/30, 60/40). As such, all monkeys in the study shared the same prior training experience, specifically learning the “What” task first.

The monkeys in this task completed an average of 29.56 sessions (SD = 5.36), with an average of 18.41 (SD = 4.26) blocks per session. In total, monkeys completed on average 559.1 blocks (SD = 141.6). Each block consisted of 80 trials and involved a single reversal of the stimulus-based or action-based contingencies between trials 30 and 50. A given block was randomly assigned as a What or a Where block and remained constant within that block (Fig. 1b). In What blocks, reward probabilities were assigned based on stimulus identity, with a particular object having a higher reward probability. In Where blocks, reward probabilities were assigned based on location, with a particular side having a higher reward probability regardless of stimulus identity. What and Where blocks were randomly interleaved throughout the session, and the block type was not indicated to the monkey. The reward schedule was randomly selected from three schedules (80/20, 70/30, 60/40) at the start of each block and remained constant within that block.

On each trial, monkeys were trained to fixate on a central point on a screen (400–600 ms) to initiate the trial. After fixation, two visual objects were assigned pseudo-randomly to the left and right of the fixation point (6° visual angle). Each block used two novel images that the animal had never seen before. Monkeys indicated their choice by making a saccade to the target stimulus and fixating for 500 ms. Reward was delivered probabilistically according to the assigned reward schedule for a given block. Each trial was followed by a fixed 1.5 s inter-trial interval. Trials in which monkeys failed to fixate within 5 s or make a choice within 1 s were aborted and then repeated. This experimental setup, surgical information, and some analyses of the data have been previously reported18,24.

Quantification and statistical analysis

Entropy-based metrics

Here, we utilized information-theoretic metrics to quantify learning and choice behavior22,23. Specifically, we focused on the conditional entropy of reward-dependent strategy (ERDS) to measure  how monkeys associated reward feedback with stimulus identity or action.

Generally, ERDS is calculated as follows:

$$ERDS=H\left({str}\,|\,{rew}\right)=-\left\{P({stay},{win})\,{\log}_{2}\frac{P({stay},{win})}{P({win})}+P({switch},{win})\,{\log}_{2}\frac{P({switch},{win})}{P({win})}+P({stay},{lose})\,{\log}_{2}\frac{P({stay},{lose})}{P({lose})}+P({switch},{lose})\,{\log}_{2}\frac{P({switch},{lose})}{P({lose})}\right\}$$
(1)

where str is the adopted strategy coded as stay (1) or switch (0), and rew is the previous reward outcome coded as reward (1) or no reward (0). Noting that \(\frac{P(\mathrm{stay},\,\mathrm{win})}{P(\mathrm{win})}\) and \(\frac{P(\mathrm{switch},\,\mathrm{lose})}{P(\mathrm{lose})}\) measure the tendencies for win-stay and lose-switch strategies, one can see that ERDS combines these tendencies into a single quantity:

$$ERDS=-\left\{P({stay},{win})\,{\log}_{2}({WinStay})+P({switch},{win})\,{\log}_{2}(1-{WinStay})+P({stay},{lose})\,{\log}_{2}(1-{LoseSwitch})+P({switch},{lose})\,{\log}_{2}({LoseSwitch})\right\}.$$
(2)

As the equations above suggest, ERDS measures the consistency in response to reward feedback (win or lose). Lower ERDS values correspond to decreased randomness in the variable and thus more consistency in the utilized strategy, which could be stimulus-based and/or action-based. To detect these strategies, we defined two types of ERDS by considering choice and reward feedback in terms of stimulus identity or action, corresponding to ERDSStim and ERDSAction, respectively.

Therefore, lower values of ERDSStim suggest that the animals stay or switch consistently based on stimulus identity according to reward feedback, indicating the stronger adoption of the stimulus-based strategy. Conversely, lower values of ERDSAction indicate that the animals adopted the action-based strategy more strongly. Overall, comparison of ERDSStim and ERDSAction enables us to quantify the adopted strategy on a trial-by-trial basis, either by computing the average values across a block of trials or by aligning all trials relative to the beginning, reversal point, and the end of each block.
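For illustration, a minimal sketch of this computation is given below. All analyses in the study were implemented in MATLAB (see Data analysis and statistical tests); the Python function and variable names here are ours and serve only to make the definition in Eq. 1 concrete.

```python
import numpy as np

def erds(stay, win):
    """Conditional entropy of reward-dependent strategy, H(str | rew) (Eq. 1).

    stay : 0/1 array; 1 if the current choice repeats the previous stimulus
           (for ERDS_Stim) or the previous action (for ERDS_Action)
    win  : 0/1 array; 1 if the previous trial was rewarded
    """
    stay = np.asarray(stay, dtype=bool)
    win = np.asarray(win, dtype=bool)
    n = len(stay)
    h = 0.0
    for s in (True, False):                 # stay vs. switch
        for r in (True, False):             # win vs. lose
            p_joint = np.sum((stay == s) & (win == r)) / n
            p_rew = np.sum(win == r) / n
            if p_joint > 0:                 # 0 * log(0) is treated as 0
                h -= p_joint * np.log2(p_joint / p_rew)
    return h

# A perfect win-stay/lose-switch pattern is maximally consistent: ERDS = 0
win = np.array([1, 0, 1, 1, 0, 0, 1, 0])
stay = win.copy()                           # stay after every win, switch after every loss
print(erds(stay, win))                      # -> 0.0
```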

Computational models

Single-system reinforcement learning (RL) models

We first used two standard RL models that learn about only one type of reward contingency to fit the monkeys’ choice data. Specifically, the RLStim-only and RLAction-only models associate reward outcomes with choice options in terms of either stimulus identity or chosen action in order to estimate stimulus and action values, respectively. These values were used to determine choice on each trial and were updated based on the reward outcome at the end of the trial, as described below.

More specifically, the value of the chosen option (VC) is updated using the reward prediction error (RPE) and two separate learning rates for rewarded and unrewarded trials (α+ and α−, respectively), while the value of the unchosen option (VU) decays to zero:

$${V}_{C}(t+1)=\left\{\begin{array}{l}{V}_{C}(t)+{\alpha }_{+}\left(R\left(t\right)-{V}_{C}\left(t\right)\right)\,{{{\mathrm{if}}}}\,R(t)=1,\\ {V}_{C}(t)+{\alpha }_{-}\left(R\left(t\right)-{V}_{C}\left(t\right)\right)\,{{{\mathrm{if}}}}\,R(t)=0,\end{array}\right.$$
(3)
$${V}_{U}\left(t+1\right)={\left(1-\zeta \right)V}_{U}\left(t\right),$$
(4)

where R(t) is the reward feedback on trial t and ζ is the decay or forgetting rate for the unchosen option. Chosen and unchosen options are coded as {stimulus A, stimulus B} in the RLStim-only model and as {Left, Right} in the RLAction-only model. In these and other models, the probability of choosing the option on the right, PRight(t), was computed using a softmax function:

$${P}_{{{\mathrm{Right}}}}(t)=\frac{1}{1+\exp (-{\beta }_{1}\left({{OV}}_{{Right}}(t)-{{OV}}_{{Left}}(t)\right){-\beta }_{0})},$$
(5)

where OVLeft and OVRight denote the overall reward value of options on the left and right, β1 controls the steepness of the sigmoid function (inverse temperature) measuring the baseline sensitivity of choice to difference in value signals, and β0 is the side bias with positive values corresponding to a bias toward choosing right. For the RLStim-only model, OVLeft and OVRight were assigned based on the stimulus identity appearing on the respective side for a given trial. For example, if stimulus A appeared on the left of fixation, then OVLeft = VStimA and OVRight = VStimB. For the RLAction-only model, the overall reward values correspond to action values; i.e., OVLeft = VLeft and OVRight = VRight.
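As a minimal sketch (our own Python illustration, not the original MATLAB code), the per-trial operations of these single-system models described by Eqs. 3–5 can be written as:

```python
import numpy as np

def softmax_right(ov_right, ov_left, beta1, beta0=0.0):
    """Probability of choosing the option on the right (Eq. 5)."""
    return 1.0 / (1.0 + np.exp(-beta1 * (ov_right - ov_left) - beta0))

def update_values(V, chosen, reward, alpha_pos, alpha_neg, zeta):
    """One-trial value update for a single learning system (Eqs. 3 and 4).

    V      : dict of values, e.g. {'A': 0.5, 'B': 0.5} for the stimulus-based
             system or {'Left': 0.5, 'Right': 0.5} for the action-based system
    chosen : key of the chosen option
    reward : 1 if the trial was rewarded, 0 otherwise
    """
    alpha = alpha_pos if reward == 1 else alpha_neg
    V[chosen] += alpha * (reward - V[chosen])      # move chosen value toward the outcome
    for key in V:
        if key != chosen:
            V[key] *= (1.0 - zeta)                 # unchosen value decays toward zero
    return V
```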

Two-system model with static weighting of stimulus- and action-based learning

As an extension of the above RL models, we considered hybrid RL models comprising two value functions, VStim and VAction, that simultaneously track the reward values of alternative stimuli and actions and make choices based on a weighted sum of value signals from the two systems, with a weight that was fixed within each block of the experiment (RLStim+Action+Static ω or Static ω model for short). Specifically, the value functions were updated in parallel using Eqs. 3 and 4.

Therefore, the overall values in this model are computed as follows:

$${{OV}}_{i}\,={V}_{{Stim}\left(i\right)}\omega+{V}_{{Action}(i)}(1-\omega ),$$
(6)

where ω represents the relative weight of the stimulus-based system compared to the action-based system, i ∈ {Left, Right}, and VStim(i) indicates the stimulus value for the option appearing on side i. For example, if stimulus A appeared on the left side, then OVLeft = VStimAω + VLeft(1−ω) and OVRight = VStimBω + VRight(1−ω). Similar to the learning rates and other parameters, a single value of ω was estimated for each block of trials. In the special case where ω = 0.5, the stimulus values and action values exert equal influence on choice. We note that the overall reward value in our model is used mainly as a convenience for presenting the model and does not require stimulus and action values to be integrated. Rather, each system can first compare its own values (stimulus values against other stimulus values, and action values against other action values), and the results of these within-system comparisons are then combined, with different weights, to determine the choice (see Eq. 7).

Using the above OVs (Eq. 6), the decision rule in Eq. 5 can be rewritten as follows (omitting the trial index t for simplicity):

$$\begin{aligned}{\mathrm{logit}}\left({P}_{{Right}}\right)&={\beta}_{0}+{\beta}_{1}\left({{OV}}_{{Right}}-{{OV}}_{{Left}}\right)\\&={\beta}_{0}+{\beta}_{1}\left\{{V}_{{Stim}({Right})}\,\omega+{V}_{{Right}}\,(1-\omega)-\left({V}_{{Stim}({Left})}\,\omega+{V}_{{Left}}\,(1-\omega)\right)\right\}\\&={\beta}_{0}+{\beta}_{1}\left\{\left({V}_{{Stim}({Right})}-{V}_{{Stim}({Left})}\right)\omega+\left({V}_{{Right}}-{V}_{{Left}}\right)(1-\omega)\right\}\\&={\beta}_{0}+{\beta}_{1}\left\{\Delta{V}_{{Stim}}\,\omega+\Delta{V}_{{Action}}\,(1-\omega)\right\}\\&={\beta}_{0}+{\beta}_{1}\omega\,\Delta{V}_{{Stim}}+{\beta}_{1}(1-\omega)\,\Delta{V}_{{Action}},\end{aligned}$$
(7)

where ΔVStim = VStim(Right) – VStim(Left) and ΔVAction = VRight – VLeft. That is, β1ω and β1(1–ω) represent the sensitivity of choice to value signals from the stimulus- and action-based systems, respectively. Therefore, ω controls the relative sensitivity of choice to the two competing systems, with larger ω corresponding to a stronger influence of the stimulus-based system, and vice versa.
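A one-line sketch of this decision rule (our own illustration; variable names are assumptions):

```python
def logit_p_right(dv_stim, dv_action, omega, beta1, beta0=0.0):
    """Log odds of a rightward choice under the Static-omega model (Eq. 7);
    beta1*omega and beta1*(1 - omega) act as the two effective sensitivities."""
    return beta0 + beta1 * (omega * dv_stim + (1.0 - omega) * dv_action)
```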

Two-system models with dynamic weighting of stimulus- and action-based learning

To allow for dynamic arbitration, we constructed hybrid models in which ω was updated on a trial-by-trial basis using the relative reliability of the two systems. In this model (Dynamic ω), the difference in reliability of the two systems is computed at the end of each trial to update the value of ω toward the more reliable system. More specifically, the relative reliability, ΔRel, between two systems at the end of trial t is computed as follows:

$${\Delta Rel}(t)=\Delta {V}_{{cho}}(t)={V}_{C,{Stim}}(t)-{V}_{C,{Action}}(t),$$
(8)

where VC,Stim and VC,Action correspond to the value of the chosen option in the stimulus- and action-based systems, respectively, and ΔVcho denotes the (signed) difference between the two. ΔRel ranges between −1 and 1, with positive values indicating a more reliable stimulus-based system. Intuitively, ΔVcho signals which system assigns an overall larger value to the given choice and is thus more reliable in predicting reward.

Subsequently, the relative sensitivity of choice to the two systems, ω, is updated as follows:

$$\begin{array}{l}\omega (1)={\omega }_{0},\\ \omega (t+1)=\left\{\begin{array}{l}\omega (t)+{\alpha }_{\omega }{\varDelta Rel}(t)\left(1-\omega (t)\right)+{\zeta }_{\omega }({\omega }_{0}-\omega (t))\,{{{\mathrm{if}}}}\,{\varDelta Rel}(t) > 0,\\ \omega (t)+{\alpha }_{\omega }\left|{\varDelta Rel}(t)\right|\left(0-\omega (t)\right)+{\zeta }_{\omega }({\omega }_{0}-\omega (t))\,{{{\mathrm{if}}}}\,{\varDelta Rel}(t) < 0,\end{array}\right.\end{array}$$
(9)

where ω0 is the initial ω on the first trial (onset of each block), αω is the baseline model arbitration rate (distinct from the learning rates in Eq. 3), and ζω is the passive decay rate that pulls ω toward its initial value ω0 (distinct from the decay rate in Eq. 4). This additional decay mechanism assumes that ω defaults back to its initial bias in the absence of exogenous input signaling the reliability difference (ΔRel). We focus on this model with passive decay for all analyses, as it fits better than the variants without the passive decay term (Supplementary Fig. 18a). Importantly, the arbitration rates αω tend to be larger than the decay rates ζω across groups and tasks (Supplementary Fig. 18b, c).
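The following sketch shows one trial of this arbitration update (illustrative Python only; note that Eq. 9 does not specify the ΔRel = 0 case, which is left unchanged here):

```python
def update_omega(omega, v_cho_stim, v_cho_action, alpha_w, zeta_w, omega0):
    """One-trial update of the arbitration weight (Eqs. 8 and 9).

    v_cho_stim, v_cho_action : values of the chosen option in the stimulus- and
    action-based systems. The weight moves toward 1 (stimulus-based) when
    d_rel > 0, toward 0 (action-based) when d_rel < 0, and passively decays
    toward its initial value omega0.
    """
    d_rel = v_cho_stim - v_cho_action                 # Eq. 8, lies between -1 and 1
    if d_rel > 0:
        omega += alpha_w * d_rel * (1.0 - omega) + zeta_w * (omega0 - omega)
    elif d_rel < 0:
        omega += alpha_w * abs(d_rel) * (0.0 - omega) + zeta_w * (omega0 - omega)
    # d_rel == 0 is not specified by Eq. 9; omega is left unchanged in that case
    return omega
```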

Two-system models with dynamic weighting and separate baseline signals for stimulus- and action-based learning

To rule out the possibility that the observed effects in amygdala-lesioned monkeys are solely due to impairment in learning stimulus values (e.g., a reduced baseline strength of stimulus-value signals) without any changes to arbitration processes, we included an additional parameter to separate these two types of changes. In the Dynamic ω model, an increase in the sensitivity to the stimulus-based system, β1ω(t), is strictly tied to a decrease in the sensitivity to the action-based system, β1(1–ω(t)), and vice versa. This constraint can be removed by introducing a constant factor that further scales the value of a given model before it is combined with the value from the other model. In this new model, referred to as Dynamic ω-ρ, the overall values are equal to:

$${{OV}}_{i}={V}_{{Stim}\left(i\right)}\rho \omega (t)+{V}_{{Action}(i)}(1-\rho )(1-\omega (t)),$$
(10)

where ρ is a constant that measures the baseline ratio of signal strength from the stimulus-based system relative to the action-based system (estimated for each monkey), independent of the time-dependent arbitration weight (ω). The update for ω is the same as in the Dynamic ω model (Eq. 9). Therefore, similar to Eq. 7, the decision rule for the Dynamic ω-ρ model can be simplified as follows:

$$\begin{aligned}{\mathrm{logit}}\left({P}_{{\mathrm{Right}}}(t)\right)&={\beta}_{0}+{\beta}_{1}\left\{{{OV}}_{{Right}}(t)-{{OV}}_{{Left}}(t)\right\}\\&={\beta}_{0}+{\beta}_{1}\rho\,\omega(t)\,\Delta{V}_{{Stim}}(t)+{\beta}_{1}(1-\rho)\left(1-\omega(t)\right)\Delta{V}_{{Action}}(t).\end{aligned}$$
(11)

This shows that β1ρ and β1(1–ρ) can be interpreted as the baseline strength of the stimulus- and action-based signals on choice, respectively, prior to modulation by the time-dependent arbitration parameter ω. Including ρ therefore allows us to capture baseline (time-independent) differences in the strength of signals from the two learning systems: ρ < 0.5 (ρ > 0.5) corresponds to lower baseline activity or impairment in the stimulus-based (respectively, action-based) system. When ρ = 0.5, the Dynamic ω-ρ model reduces to the Dynamic ω model. Because ρ is assumed to capture baseline activity, we estimated a single value of ρ for each monkey over the entire duration of the experiment (see Model fitting and simulation for more details).

Because in this model, the arbitration weight is further weighted by ρ (or 1–ρ), we defined an “effective” arbitration weight to measure the overall relative weighting between the stimulus-based (βStim = β1ρω) and action-based (βAction = β1(1−ρ)(1−ω)) systems as follows:

$$\Omega (t)=\frac{{\beta }_{{Stim}}(t)}{{\beta }_{{Stim}}(t)+{\beta }_{{Action}}(t)}=\frac{\rho \omega (t)}{\rho \omega (t)+(1-\rho )(1-\omega (t))},$$
(12)

where Ω is the relative weight of the stimulus-based system with respect to the total weight of the two systems on choice. It is worth noting that the common baseline inverse temperature β1 is independent of Ω, which is determined only by ρ and ω, and that Ω reduces to ω when ρ = 0.5.
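A minimal sketch of Eq. 12 (illustrative only):

```python
def effective_weight(omega, rho):
    """Effective arbitration weight Omega (Eq. 12): the share of the total
    choice sensitivity carried by the stimulus-based system."""
    num = rho * omega
    return num / (num + (1.0 - rho) * (1.0 - omega))

print(effective_weight(0.7, 0.5))   # -> 0.7, since Omega reduces to omega when rho = 0.5
```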

Alternative signals for estimating the reliability of learning systems

In addition to Vcho as the reliability signal for updating ω in the dynamic models, we also considered several other quantities to estimate reliability. As the first alternative to ΔVcho for the relative reliability used to update the relative weight (Eq. 8), we considered the difference in magnitudes of RPE (|RPE|) of the action- and stimulus-based systems:

$${\varDelta Rel}(t)={|}{{RPE}}_{{Action}}(t)|-{|}{{RPE}}_{{Stim}}(t)|.$$
(13)

Conceptually, a system that yields a better prediction of reward on a given trial has a lower magnitude of RPE and thus is more reliable. Note that the difference in chosen stimulus and action values, ΔVcho, can also be rewritten as:

$$\begin{aligned}\Delta{V}_{{cho}}(t)&={V}_{C,{Stim}}(t)-{V}_{C,{Action}}(t)\\&=\left(R(t)-{V}_{C,{Action}}(t)\right)-\left(R(t)-{V}_{C,{Stim}}(t)\right)\\&={{RPE}}_{{Action}}(t)-{{RPE}}_{{Stim}}(t).\end{aligned}$$
(14)

This demonstrates that using ΔVcho for the relative reliability corresponds to the signed RPE instead of the unsigned RPE.

We also considered the difference between the value of chosen and unchosen options within each system (|ΔV|) as a measure of the reliability of that system. Intuitively, this reliability signal is larger for a system that yields a better discernibility between the two competing options on a given trial. In this case, the relative reliability can be written as:

$$\begin{aligned}{\Delta Rel}(t)&=\left|\Delta{V}_{{Stim}}(t)\right|-\left|\Delta{V}_{{Action}}(t)\right|\\&=\left|{V}_{{StimA}}(t)-{V}_{{StimB}}(t)\right|-\left|{V}_{{Left}}(t)-{V}_{{Right}}(t)\right|.\end{aligned}$$
(15)

Finally, we also considered the total sum of the two value estimates (|ΣV|) as a signal for estimating the reliability of a given system. For this measure, reliability is larger for a system that gives overall larger combined values from the two options. In this case, the relative reliability is equal to:

$$\begin{aligned}{\Delta Rel}(t)&=\Sigma{V}_{{Stim}}(t)-\Sigma{V}_{{Action}}(t)\\&=\left({V}_{{StimA}}(t)+{V}_{{StimB}}(t)\right)-\left({V}_{{Left}}(t)+{V}_{{Right}}(t)\right).\end{aligned}$$
(16)

For all these different versions of ΔRel, we used the same equation for updating ω (Eq. 9).
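The four candidate reliability signals can be summarized in a single sketch (our own Python illustration; the dictionary-based bookkeeping is an assumption made for brevity):

```python
def delta_rel(kind, V_stim, V_action, chosen_stim, chosen_action, reward=None):
    """Relative reliability of the stimulus- vs. action-based system under the
    different definitions considered in the text (Eqs. 8, 13, 15, and 16).

    V_stim, V_action : dicts holding the two stimulus values and the two action values
    chosen_stim, chosen_action : keys of the chosen stimulus and chosen action
    reward : outcome of the current trial (needed only for the |RPE| variant)
    """
    if kind == 'Vcho':                                   # Eq. 8 (default signal)
        return V_stim[chosen_stim] - V_action[chosen_action]
    if kind == 'absRPE':                                 # Eq. 13
        return abs(reward - V_action[chosen_action]) - abs(reward - V_stim[chosen_stim])
    if kind == 'absdV':                                  # Eq. 15
        vs, va = list(V_stim.values()), list(V_action.values())
        return abs(vs[0] - vs[1]) - abs(va[0] - va[1])
    if kind == 'sumV':                                   # Eq. 16
        return sum(V_stim.values()) - sum(V_action.values())
    raise ValueError(f"unknown reliability signal: {kind}")
```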

Effective arbitration rates

In the above formulation of the arbitration mechanism (Eqs. 8–12), the rate of update for ω depends on several factors, including the baseline ratio parameter ρ, the baseline model arbitration rate αω, the trial-by-trial difference in reliability (ΔRel), and the passive decay mechanism (ζω). To capture the overall update rates in the arbitration weight Ω, we defined “effective” arbitration rates that quantify the overall rate of change toward either the stimulus- or action-based system (Fig. 3d). Analogous to the update rule in the valuation system (Eq. 3), an update rule for the effective arbitration weight can be written as follows:

$$\Omega (t+1)=\left\{\begin{array}{l}\Omega (t)+{\psi }_{+}(t)\left(1-\Omega \left(t\right)\right)\,{{\mathrm{if}}}\,\Delta \varOmega (t) > 0,\\ \Omega \left(t\right)+{\psi }_{-}\left(t\right)\left(0-\Omega \left(t\right)\right)\,{{\mathrm{if}}}\,\Delta \varOmega \left(t\right) < 0.\end{array}\right.$$
(17)

where ψ+ and ψ− represent the effective arbitration rate toward the stimulus-based and action-based systems, respectively (ΔΩ = 0 corresponds to no change in arbitration). Unlike the learning rates (α+ and α−) that are fitted to each session, the effective arbitration rates (ψ±) are estimated for each trial (Fig. 3e–g). More specifically, the effective arbitration rate on trial t when Ω increases, shifting choice toward the stimulus-based system, is defined as follows:

$${\psi }_{+}\left(t\right)=\frac{{[\Delta \Omega \left(t\right)]}^{+}}{1-\Omega \left(t\right)},$$
(18)

whereas the effective arbitration rate when Ω decreases, biasing choice toward the action-based system, is defined as follows:

$${\psi }_{-}\left(t\right)=\frac{{[\Delta \Omega \left(t\right)]}^{-}}{0-\Omega \left(t\right)},$$
(19)

where [·]+ and [·]− indicate positive and negative changes in Ω, respectively. Similar to the learning rates for rewarded and unrewarded trials (α+ and α−), which capture differential updates based on different reward outcomes, the effective arbitration rates capture differential update rates toward the correct and incorrect learning systems based on the difference in reliability. Specifically, we tested whether the effective arbitration rate of the correct system in a given block type (i.e., ψ+ in What blocks and ψ− in Where blocks) is distinct from that of the incorrect system, as larger effective rates for the correct system would reduce noise in incorporating feedback, enabling more efficient arbitration and, ultimately, leading to improved performance.
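As an illustration (not the original analysis code), the trial-wise effective arbitration rates can be recovered from a trajectory of Ω as follows:

```python
import numpy as np

def effective_arbitration_rates(Omega):
    """Trial-wise effective arbitration rates psi+ and psi- (Eqs. 17-19),
    computed from a trajectory of the effective weight Omega."""
    Omega = np.asarray(Omega, dtype=float)
    dOmega = np.diff(Omega)                    # Omega(t+1) - Omega(t)
    psi_plus = np.full(dOmega.shape, np.nan)
    psi_minus = np.full(dOmega.shape, np.nan)
    up = dOmega > 0
    down = dOmega < 0
    psi_plus[up] = dOmega[up] / (1.0 - Omega[:-1][up])          # rate toward the stimulus-based system
    psi_minus[down] = dOmega[down] / (0.0 - Omega[:-1][down])   # rate toward the action-based system
    return psi_plus, psi_minus
```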

Model fitting and simulation

We used the standard maximum likelihood estimation method to fit choice data and estimate the best-fit parameters for the described models. One set of model parameters was fitted to each session (consisting of ~20 blocks) of monkeys’ choice data. Fitting was performed using the MATLAB optimization function fmincon, repeating the search for 100 sets of random initial parameter values to ensure global minima. See Supplementary Table 15 for the list of models and the ranges of parameters used. We report the mean Akaike Information Criterion (AIC) values of each model in Supplementary Fig. 14.
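The fitting procedure follows a standard multi-start maximum-likelihood recipe; a rough Python equivalent is sketched below (the study used MATLAB’s fmincon, and neg_log_likelihood here is a placeholder for a model-specific function returning the summed −LL of a session’s choices):

```python
import numpy as np
from scipy.optimize import minimize

def fit_session(neg_log_likelihood, bounds, n_restarts=100, seed=0):
    """Maximum-likelihood fitting with random restarts (illustrative sketch).

    neg_log_likelihood : function mapping a parameter vector to the summed -LL
                         of one session's choices under a given model
    bounds             : list of (low, high) tuples, one per parameter
    """
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        x0 = [rng.uniform(lo, hi) for lo, hi in bounds]
        res = minimize(neg_log_likelihood, x0, method='L-BFGS-B', bounds=bounds)
        if best is None or res.fun < best.fun:
            best = res
    return best.x, best.fun     # best-fit parameters and minimized -LL
```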

For the Dynamic ω-ρ model (Eqs. 10–12), we assumed that the value of ρ is fixed for the entirety of the experiment and does not vary across sessions, as it aims to capture the strength of one learning system relative to the other. Accordingly, to estimate a single value of ρ for each monkey, we fitted the entire dataset of a given monkey and obtained a single set of best-fit parameters. From this set, we kept only the value of ρ and fit the choice data again for the remaining parameters, allowing different values across sessions. The distributions of the other fitted parameters of this model are reported in Supplementary Figs. 12 and 13.

To test whether the observed relationship between stimulus- and action-based learning indeed required competition between the two learning systems and was not due to task structure, we simulated choice behavior using single-system or two-system models and computed model-simulated ERDSStim and ERDSAction (Eq. 1) (Supplementary Fig. 9). Specifically, we used the fitted parameters for each session and simulated choice behavior 100 times for each block, using the same random reversal position and reward schedule as the behavioral data, and then averaged the resulting ERDS values for stimulus and action. To measure the association between the two measures and isolate the within-subject effect, we mean-centered the predictor (ERDSStim) and fitted the mixed-effects model ERDSAction ~ ERDSStim + (1|subject) to account for subject variability.

K-fold cross-validation of model performance

To compare goodness-of-fit and determine the winning model, we used five-fold cross-validation, where each set of training/testing blocks was tested repeatedly with 50 unique instances. For each task, we created the training and testing sets as follows: for each subject, we randomly partitioned all the block data experienced by the monkey into five equal subsamples or “folds,” and selected each fold as the test data (20%) while using the remaining four folds as the training data (80%). For each instance, we obtained best-fit parameters from the training set that minimized the negative log-likelihood across all training blocks and used these to calculate the negative log-likelihood for the test blocks. Model performance was tested for each fold, thereby exhaustively testing all block data for a given subject. We repeated this procedure 50 times for each subject, each time using a unique partitioning of the data into five folds. Final mean negative log-likelihoods (-LL) were obtained by averaging across all tested blocks, reflecting the average total -LL per block. We report these values in Figs. 2b and 3a, Supplementary Fig. 7a, and Supplementary Fig. 18a. Model performance for individual monkeys is also reported in Supplementary Table 16.
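For concreteness, a simplified sketch of this block-wise cross-validation is given below (illustrative Python; fit_fn and nll_fn are placeholders for the model-fitting and likelihood routines described above):

```python
import numpy as np

def five_fold_neg_ll(blocks, fit_fn, nll_fn, n_instances=50, seed=0):
    """Block-wise five-fold cross-validation (illustrative sketch).

    blocks : list of per-block choice data for one subject
    fit_fn : fits a model to a list of training blocks, returns parameters
    nll_fn : returns the negative log-likelihood of one block given parameters
    """
    rng = np.random.default_rng(seed)
    test_nll = []
    for _ in range(n_instances):
        order = rng.permutation(len(blocks))
        folds = np.array_split(order, 5)              # five roughly equal folds
        for test_idx in folds:
            train_idx = np.setdiff1d(order, test_idx)
            params = fit_fn([blocks[i] for i in train_idx])
            test_nll += [nll_fn(blocks[i], params) for i in test_idx]
    return np.mean(test_nll)                          # mean -LL per tested block
```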

In these results, we note that even a small improvement in -LL implies a significant improvement in the predictability of the model for the tested blocks (Supplementary Note 3). Along with the -LL results, we also provide McFadden’s R2 values68 as an absolute measure for goodness-of-fit for each model in comparison to a null model with chance-level prediction. This quantity is calculated as follows:

$${\mathrm{McFadden}}\,{R}^{2}=1-\frac{{\sum}_{t}{{LL}}_{{model}}(t)}{{\sum}_{t}{{LL}}_{{null}}}=1-\frac{{\sum}_{t}{{LL}}_{{model}}(t)}{80\times\ln(0.5)},$$
(20)

where t indicates trial number within each block of 80 trials.
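A direct transcription of Eq. 20 (illustrative only):

```python
import numpy as np

def mcfadden_r2(ll_model_per_trial, n_trials=80):
    """McFadden's pseudo-R^2 relative to a chance-level null model (Eq. 20)."""
    ll_null = n_trials * np.log(0.5)
    return 1.0 - np.sum(ll_model_per_trial) / ll_null
```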

Model and parameter recovery

To perform model recovery (Supplementary Fig. 8a–d), we simulated the choice behavior of each model in a randomly created block environment similar to the experiment. Specifically, each session consisted of 30 blocks, each block having 80 trials with a reversal and with a randomly assigned reward schedule (80/20, 70/30, 60/40) and block type (What-only or What/Where). To ensure that choice behavior was simulated within a plausible range of parameters, we randomly sampled each parameter value from a kernel distribution fitted to all observed parameter values for a given model. We then fit all models and determined the best-fit model for each simulated session based on the AIC. We repeated this procedure 1000 times and report, for each simulated model, the proportion of sessions best accounted for by each fitted model.
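The recovery procedure amounts to building a confusion matrix between simulated and best-fitting models; a schematic sketch is given below (illustrative Python; simulators and fitters are placeholders for the model-specific simulation and AIC-based fitting routines):

```python
import numpy as np

def model_recovery(simulators, fitters, n_sessions=1000):
    """Confusion matrix for model recovery (illustrative sketch).

    simulators : dict name -> function that simulates one session of choices
    fitters    : dict name -> function returning the AIC of that model fit to a session
    """
    names = list(simulators)
    confusion = np.zeros((len(names), len(names)))
    for i, sim_name in enumerate(names):
        for _ in range(n_sessions):
            data = simulators[sim_name]()
            aics = [fitters[fit_name](data) for fit_name in names]
            confusion[i, np.argmin(aics)] += 1       # count the best-fitting model
    return confusion / n_sessions                    # rows: simulated model; columns: best-fitting model
```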

For parameter recovery (Supplementary Fig. 8e, f), we simulated the choice behavior of the best model (Dynamic ω-ρ) using the estimated parameters from the experimental data and then refit the simulated data with the same model. Each block environment was set up using the same reward schedule and reversal position as the actual experiment. We simulated each block once so that the total number of simulated trials matched that of the experiment from which the true parameters were estimated. We recovered the parameters from the simulated data using the same fitting procedure, repeating the search for 100 initial random parameter values to ensure global minima. For recovering the ρ parameter, we used the entire simulated dataset for each monkey and obtained a single set of best-fit parameters, as was done for the actual data.

Data analysis and statistical tests

All analyses were carried out using MATLAB (MathWorks, version 2021b). All comparisons were performed using appropriate statistical tests reported throughout the text. For each test, we report the exact p-values and effect sizes when appropriate. Unless otherwise noted, all statistical tests used in this study were two-sided. In cases where highly significant results caused the software to return p-values of zero due to numerical precision limits, we report the smallest representable floating-point number at machine precision (i.e., 4.94 × 10−324).

Because the data were collected across different animals, we primarily utilized linear mixed-effects regression analyses (using MATLAB fitlme function) for between-group comparisons, with the group assignment as a fixed effect and subjects as a random effect, to appropriately account for between-subject variance. To maximize the accuracy of models and reduce Type-I error, we considered subject-level intercepts and additional random slopes for long-term adjustment effects, both across the experiment (proportion of session completed) and within each session day (block within a session). When performing a within-group significance test to compare a paired set of samples (e.g., Δβ = βstimβaction, Δψ = ψ+ψ), we fitted mixed-effects models with a fixed intercept and subject-level random effects (data ~ 1 + (1 + sess_perc + block_in_sess|subject)), where the main intercept represents the mean value of the paired difference. We then tested whether this intercept significantly differed from zero, as indicated by the coefficient β0. For comparing block-wise performance within each group, we included random effects of subjects and fixed effects of block types (What-only, What, Where) and reward uncertainty (measured as the variance of the outcome13) with the following formula: P(Better) ~ variance + BlockType + (1 + variance+sess_perc+block_in_sess|subject). Specifically, the variance was calculated as pB*(1−pB), where pB represents the reward probability of the better option (i.e., pB = {0.8, 0.7, 0.6}). Variance was further mean-centered by subject for better interpretability of other coefficients.

To categorize each trial as either stimulus- or action-dominant, we directly compared ERDSStim and ERDSAction (computed from a moving window of 10 trials; Eq. 1). Trials with ERDSStim < ERDSAction and ERDSAction < ERDSStim were categorized as stimulus-dominant and action-dominant, respectively. We dropped trials with ERDSStim = ERDSAction, amounting to 11.1% of total trials in the What-only and 11.7% in the What/Where tasks. We used a mixed-effects regression model with subjects as random intercepts to test whether the dominant strategy significantly modulates RT. Fixed effects included the following predictors: dominant strategy, coded as stim-dominant (0) or action-dominant (1), whether the monkey had chosen the better option (1) or not (0), reward schedule or uncertainty (measured as variance), trial number within a block, block number within a session, and session number within the subject. We also included interaction between the dominant strategy and the choice of a better option, as the latter could depend on the adopted strategy. Variables were normalized by each subject to yield comparable standardized regression coefficients.

To study the long-term adjustment in behavior across the time course of the experiment (Supplementary Fig. 11), we calculated ERDSStim, ERDSAction (Eq. 1), and median RT for each block and regressed them on the proportion of the sessions (total block number) completed as the predictor variable. This regressor was further mean-centered by subject for better interpretability of other coefficients. Initial effective arbitration weight (Ω0), which was estimated for each session, was analyzed at the session level. To further account for the variability in adjustment at the subject level, we considered random slopes and intercepts for the effect of sessions within each subject using mixed-effects models. We included all group data from the What-only task and used planned contrasts to infer slopes for each group and the group differences in the slopes. Full results are reported in Supplementary Tables 1114. For visualization purposes only, we plot the simple least-squares lines and the estimated slopes for each group (βsess) in Supplementary Fig. 11.

To estimate the effective arbitration rate across time (Fig. 3e–g; Eqs. 18 and 19), we computed the mean trajectory for each of two transition rates (toward stimulus- or action-based system) across blocks by aligning all trials relative to the beginning, reversal point, and the end of each block. For the bar plots in the insets, we computed the difference between the mean of the two transition rates within each block. In particular, we focused on the trials after the reversal to avoid potential confounds with initial bias (ω0) and observed the transition behavior after adjusting to the initial uncertainty of the block.

To plot performance (Fig. 1e–g) and effective arbitration rates over time (Fig. 3e–g; Eqs. 18 and 19), we obtained the trajectories by concatenating 20 trials relative to the beginning, reversal point, and the end of each block, to account for the random reversal positions. These curves were then smoothed with a moving window of five trials, separately within the acquisition and reversal phases.

To identify the model parameters that are significantly modulated in the lesioned group with the consistent direction of effects across task conditions (What-only and What/Where), we adopted the following mixed-effects model: parameter ~ group + (group|task) + (1|subject:task). Namely, the model included a fixed effect of group (control, amygdala, VS) and random effects of subjects and tasks with a random slope of each group in each task. This formulation assumes a single, uniform effect of the group regardless of task condition, thereby identifying parameters that are consistently modulated by brain lesions. Note that subjects were nested within tasks to allow for a separate baseline for each task, as some of the monkeys participated in both the What-only and What/Where tasks. We tested this mixed-effects model on each of the parameters compiled across groups and tasks, with control monkeys as the reference group. For testing the group difference in the ρ parameter (Eqs. 10 and 11) across task conditions, which was fitted for each subject and therefore lacks adequate sample size, we utilized two-sided permutation tests. Specifically, we conducted permutation tests on the ρ values of each subject from a given pair of tested groups across both tasks and generated a null distribution of the test statistic (i.e., mean group difference in ρ) by randomly shuffling the group assignment 10,000 times. The p-value was calculated as the proportion of permuted test statistics that were as extreme or more extreme than the empirically observed test statistic.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.