Abstract
Promoting cooperation remains a major challenge in the natural sciences. While most studies focus on a single strategy update rule, individuals in real life often use multiple strategies in response to dynamic environments. This paper introduces a mixed update rule combining imitation and reinforcement learning (RL). In imitation learning (IL), individuals adopt strategies from higher-payoff opponents, while RL relies on personal experience. Simulations of the Prisoner’s Dilemma Game (PDG), Coexistence Game (CG), and Coordination Game (CoG), in both well-mixed populations and square lattice networks, show that: (i) cooperation and defection coexist in the PDG, resolving the dilemma of universal defection; (ii) cooperation exceeds the mixed Nash equilibrium level in the CG; and (iii) cooperators dominate in the CoG. The mixed update rule outperforms the single update rules in these games, highlighting its effectiveness in fostering cooperation.
Introduction
Cooperation is a widespread phenomenon observed in both nature and human society, encompassing levels ranging from molecular interactions to complex biological systems. However, Darwin’s theory of evolution, which emphasizes survival of the fittest and individual interest maximization, does not adequately account for the prevalence of cooperation. Although cooperative behavior enhances collective benefits, it often conflicts with individual self-interest, creating a social dilemma of cooperation1. Consequently, understanding the mechanisms by which cooperation emerges and persists among rational, self-interested individuals has become a fundamental question in game theory research2.
Evolutionary game theory3,4,5 serves as a powerful framework for quantitatively analyzing cooperative behavior. Within this field, five evolutionary mechanisms, namely kin selection6,7,8, direct reciprocity9,10, indirect reciprocity11, group selection12,13, and network reciprocity14,15,16,17, have been identified as pivotal contributors. In addition, punishment18,19,20, social exclusion21 and reputation22,23,24,25,26 have garnered significant attention due to their promising results in studies conducted on square lattice networks. In essence, the evolution of cooperation is ultimately studied from the perspective of strategy updating.
Strategy updating, or the learning mechanism, is a crucial factor influencing cooperation in evolutionary game theory. Successful strategies are typically assumed to spread more rapidly27,28, characterized by higher reproduction rates or an increased likelihood of being imitated. Various update rules have been proposed, including the Moran process29,30,31,32, pairwise comparison33, imitation34,35,36,37,38, aspiration-driven processes39,40,41,42,43,44, and RL45,46,47,48,49. The imitation mechanism spreads successful strategies through payoff comparisons, while the aspiration-driven mechanism39 adjusts strategies based on the gap between actual payoffs and predetermined expectations. In contrast, RL45 relies on individuals’ historical experiences and the corresponding payoffs to guide current strategy choices. This diversity of update rules provides a foundation for further exploration of cooperative behavior under combined mechanisms.
While most studies have focused on the emergence and maintenance of cooperation under a single update rule, heterogeneous strategy update rules are prevalent in nature. For example, ants combine IL and experiential learning when foraging for food50. In recent years, research on heterogeneous update rules has garnered increasing attention. For instance, the combination of imitation and aspiration-driven update rules has been used to study the evolution of cooperation51,52,53, and the integration of imitation with innovation-based update rules has also been explored54. These studies typically employ probabilistic models or group-specific mechanisms to investigate the interaction between different update rules51.
Inspired by previous studies, this paper proposes a mixed update mechanism that combines RL and IL. In this mechanism, RL guides decision-making based on an individual’s historical strategies and payoffs, while IL selects better strategies through payoff comparisons. Individuals adopt RL with probability \(\gamma\) and IL with probability \(1-\gamma\). When the two update rules lead to conflicting strategies, the final strategy is determined probabilistically, with \(\gamma\) for RL and \(1-\gamma\) for IL.
This study differs from Ref. 52 in three key ways. First, it employs a mixed mechanism that combines RL and IL, rather than aspiration-driven and imitation-based update rules. Second, the results are consistent across both well-mixed populations and square lattice networks: RL promotes cooperation in the PDG and the CG but suppresses it in the CoG, while IL has the opposite effect, promoting cooperation in the CoG and suppressing it in the PDG and the CG. In contrast, Ref. 52 reports opposite outcomes for these games in the two network structures. Third, under the mixed mechanism, cooperators dominate the population in the CoG, whereas cooperators and defectors coexist in Ref. 52. These contributions provide a novel perspective and methodology for studying the evolution of cooperation.
Additionally, this study differs from Ref. 54 in two key aspects. First, the mixed learning (ML) update mechanism in this paper combines IL and RL, whereas Ref. 54 explores a mechanism that integrates IL with innovation-based update processes. Second, in this study, agents select the RL update rule with probability \(\gamma\) (\(\gamma \in [0,1]\)) and the IL update rule with probability \(1-\gamma\). In contrast, Ref. 54 assumes that a subset of agents adopts the imitation update rule, while the remaining agents follow the innovation update rule.
The remainder of this paper is organized as follows. Section 2 introduces the game model. In Section 3, the strategy update rules are described and convergence analyses are provided. Simulation experiments are presented for a well-mixed population in Section 4 and for a square lattice network in Section 5. Finally, Section 6 concludes the study.
The game models
This section provides a detailed introduction to two models: well-mixed populations and square lattice networks. In both models, agents have two alternative strategies: cooperation (C) and defection (D). In each round, each agent receives a payoff based on a symmetric payoff matrix, defined as follows:
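The payoff matrix can be written as \(\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}\), where rows correspond to the focal agent’s strategy (C first, then D) and columns to the opponent’s strategy.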
A cooperator receives a reward, \(a_{11}\), when paired with another cooperator, but a sucker’s payoff, \(a_{12}\), when paired with a defector. A defector, on the other hand, receives a temptation payoff, \(a_{21}\), when paired with a cooperator, and a punishment payoff, \(a_{22}\), when paired with another defector.
This paper considers PDG \((a_{21}>a_{11}>a_{22}>a_{12})\), CG \((a_{21}>a_{11}\), \(a_{12}>a_{22})\) and CoG \((a_{11}>a_{21}\), \(a_{22}>a_{12})\).
Well-mixed populations
In a well-mixed population, agents do not have fixed neighbors, and any agent may interact with any other. See Fig. 5a. For a well-mixed population of size N, at round t, the population state is denoted as \(x(t) = (x_C(t), x_D(t))\), where \(x_C(t)\) (respectively, \(x_D(t)\)) represents the number of agents choosing strategy C (respectively, D), such that \(x_C(t) + x_D(t) = N\). The expected payoffs for agents choosing strategies C and D can be obtained using the following equations56,57:
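Consistent with the payoff difference used in the convergence analysis of Section 3 (self-interactions excluded), these read \(u_C(x(t))=\frac{a_{11}\left(x_C(t)-1\right)+a_{12}\,x_D(t)}{N-1}\) and \(u_D(x(t))=\frac{a_{21}\,x_C(t)+a_{22}\left(x_D(t)-1\right)}{N-1}\).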
Structured populations
This paper examines the structure of a square lattice network, where nodes represent agents and edges define the relationships between neighboring agents. Once the network is generated, the neighbors of all agents are fixed, and each agent can interact only with its neighbors. Agents located in the interior have four neighbors, while those on the boundary have either three or two neighbors, as shown in Fig. 5b. For agent \(i \in \{1, \cdots , N\}\), at round t, the average payoff is defined as follows:
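Consistent with the quantities defined next, this average payoff can be written as \(u_i(t)=a_{s_i 1}\,y^i_C(t)+a_{s_i 2}\,y^i_D(t)\), in which \(s_i=1\) if agent i cooperates and \(s_i=2\) if it defects,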
where \(y^i_C(t)\) (respectively, \(y^i_D(t))\) denotes the proportion of the neighbors for agent i choosing strategy C (respectively, strategy D) at round t. It clearly holds that \(y^i_C(t)+y^i_D(t)=1\).
Theoretical analyses of updating rules
IL update rule
Social learning theory, proposed by Fudenberg and Levine58, is a form of IL in which only certain agents actively learn from those with higher payoffs. The social learning update rule is described as follows: if all agents have the same payoff, no changes occur; otherwise, some of the agents with lower payoffs imitate those with higher payoffs, while the rest remain unchanged59.
Let \(\alpha\) be the fraction of active learners among the agents with lower payoffs. Then the population state is updated as follows:
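A form of this update consistent with the case analysis below is \(x_C(t+1)=\left[1-\alpha \left(1-q_1(x(t))\right)\right]x_C(t)+\alpha \,q_2(x(t))\,x_D(t)\), with \(x_D(t+1)=N-x_C(t+1)\),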
where
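\(q_1(x(t))=1\) if \(u_C(x(t))\ge u_D(x(t))\) and 0 otherwise, \(q_2(x(t))=1\) if \(u_C(x(t))>u_D(x(t))\) and 0 otherwise (a reading of Eq. (4) consistent with how it is applied in the convergence analysis below),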
and \(u_C(x(t))\) (respectively, \(u_D(x(t)))\) is the expected payoff of agents choosing strategy C (respectively, strategy D).
We take the PDG, CG and CoG as examples to analyze convergence under the IL update rule.
(I) PDG
It follows from Eqs. (1) and (2) that \(u_D(x(t))-u_C(x(t))=\frac{(a_{21}-a_{11})x_C(t)+(a_{22}-a_{12})x_D(t)+a_{11}-a_{22}}{N-1}\). Since \(a_{21}>a_{11}>a_{22}>a_{12}\), \(u_C(x(t))<u_D(x(t))\) holds for all x(t). By Eq. (4), \(q_1(x(t))=q_2(x(t))=0\) and \(x_C(t+1)=(1-\alpha )x_C(t)\). If \(\alpha \in (0,1)\), then \(\lim _{t \rightarrow \infty } x_C(t) \rightarrow 0\), which is the Nash equilibrium of this game.
(II) CG
By Eqs. (1) and (2), \(u_C(x(t))=u_D(x(t))\) holds when \(x_C(t)=\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}}\). According to Eq. (4), \(q_1(x(t))=1\), \(q_2(x(t))=0\) and \(x_C(t+1)=x_C(t)\), i.e., the population state remains the same. The result of convergence is the mixed strategy Nash equilibrium \((\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}},\frac{(N-1)a_{11}-N a_{21}+a_{22}}{a_{11}+a_{22}-a_{12}-a_{21}})\).
Since \(a_{21}>a_{11}\) and \(a_{12}>a_{22}\), when \(x_C(t)>\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}}\), \(u_C(x(t))<u_D(x(t))\) is obtained. It follows from Eq. (4) that \(q_1(x(t))=q_2(x(t))=0\) and \(x_C(t+1)=(1-\alpha )x_C(t)\). If \(\alpha \in (0,1)\), as time goes on, \(x_C(t)\) is gradually decreasing, and when it decreases to \(\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}}\), it follows from the first case that \(x_C(t)\) will remain constant. That is, the result of convergence is the mixed strategy Nash equilibrium \((\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}},\frac{(N-1)a_{11}-N a_{21}+a_{22}}{a_{11}+a_{22}-a_{12}-a_{21}})\).
When \(x_C(t)<\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}}\), \(u_C(x(t))>u_D(x(t))\) is obtained. By Eq. (4), \(q_1(x(t))=q_2(x(t))=1\) and \(x_C(t+1)=x_C(t)+\alpha x_D(t)\). When \(\alpha \in (0,1)\), as time goes on, \(x_C(t)\) is gradually increasing, and when it increases to \(\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}}\), it follows from the first case that \(x_C(t)\) will remain constant. That is, the result of convergence is the mixed strategy Nash equilibrium \((\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}},\frac{(N-1)a_{11}-N a_{21}+a_{22}}{a_{11}+a_{22}-a_{12}-a_{21}})\).
To conclude, when \(\alpha \in (0,1)\), \(\lim _{t \rightarrow \infty } x_C(t) \rightarrow \frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}}\), i.e., the result of convergence is the mixed strategy Nash equilibrium \((\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}},\frac{(N-1)a_{11}-N a_{21}+a_{22}}{a_{11}+a_{22}-a_{12}-a_{21}})\) regardless of the initial population state.
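As a concrete check, for the CG payoff matrix used in the simulations below (\(a_{11}=2\), \(a_{12}=0\), \(a_{21}=4\), \(a_{22}=-1\)), this equilibrium fraction is \(x_C=\frac{2-(-1)+N(-1-0)}{2+(-1)-0-4}=\frac{N-3}{3}\approx \frac{N}{3}\), which matches the mixed Nash equilibrium \((\frac{1}{3},\frac{2}{3})\) referred to in the simulation sections.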
(III) CoG
By Eqs. (1) and (2), \(u_C(x(t))=u_D(x(t))\) holds when \(x_C(t)=\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}}\). According to Eq. (4), \(q_1(x(t))=1\), \(q_2(x(t))=0\) and \(x_C(t+1)=x_C(t)\), i.e., the population state remains unchanged. The result of convergence is the mixed strategy Nash equilibrium \((\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}},\frac{(N-1)a_{11}-N a_{21}+a_{22}}{a_{11}+a_{22}-a_{12}-a_{21}})\).
Due to \(a_{11}>a_{21}\) and \(a_{22}>a_{12}\), when \(x_C(t)>\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}}\), \(u_C(x(t))>u_D(x(t))\) is obtained. It follows from Eq. (4) that \(q_1(x(t))=q_2(x(t))=1\) and \(x_C(t+1)=x_C(t)+\alpha x_D(t)\). When \(\alpha \in (0,1)\), \(\lim _{t \rightarrow \infty } x_C(t) \rightarrow N\), i.e., the result of convergence is the pure strategy Nash equilibrium (N, 0), where all agents choose to cooperate.
When \(x_C(t)<\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}}\), \(u_C(x(t))<u_D(x(t))\) is obtained. By Eq. (4), \(q_1(x(t))=q_2(x(t))=0\) and \(x_C(t+1)=(1-\alpha )x_C(t)\). When \(\alpha \in (0,1)\), \(\lim _{t \rightarrow \infty } x_C(t) \rightarrow 0\), i.e., the result of convergence is the pure strategy Nash equilibrium (0, N), where all agents choose to defect.
In summary, when \(x_C(t)=\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}}\), the population state remains unchanged. When \(x_C(t)>\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}}\), the population converges to the pure strategy Nash equilibrium (N, 0), where all agents choose to cooperate. When \(x_C(t)<\frac{a_{11}-a_{22}+N(a_{22}-a_{12})}{a_{11}+a_{22}-a_{12}-a_{21}}\), the result of convergence is the pure strategy Nash equilibrium (0, N), where all agents choose to defect.
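For the CoG payoff matrix used in the simulations below (\(a_{11}=3\), \(a_{12}=0\), \(a_{21}=0\), \(a_{22}=2\)), this threshold equals \(\frac{3-2+2N}{3+2-0-0}=\frac{2N+1}{5}\), the value invoked in the well-mixed simulation analysis below.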
RL update rule
RL is independent of the strategic environment and can be interpreted as a form of self-learning. In 1997, Camerer and Ho55 proposed a general model known as experience-weighted attraction (EWA) learning. In this model, agents update the preferences of all strategies, with the strategy having a higher preference being more likely to be selected. The preference of strategy \(a \in \{C, D\}\) for agent i is defined as follows:
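Following the standard EWA formulation, and consistent with the quantities defined in the next paragraph, this preference can be written as \(F_i^a(t)=\frac{\phi \,H(t-1)\,F_i^a(t-1)+\left[\delta +(1-\delta )\,I(a,s_i(t))\right]f_i^a(t)}{H(t)}\), where \(I(a,s_i(t))=1\) if agent i actually played strategy a in round t and 0 otherwise.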
In this model, \(F_i^a(t)\) represents the preference of agent i for strategy a after round t; \(\phi\) denotes the discount factor; \(\delta\) is the weighting coefficient for the assumed payoff obtained from the unselected strategy; \(f_i^a(t)\) represents the payoff of agent i when selecting strategy a in round t. If a denotes the cooperative strategy, then \(f_i^a(t) = a_{11}\) or \(a_{12}\); if a denotes the defection strategy, then \(f_i^a(t) = a_{21}\) or \(a_{22}\); \(H(t) = \rho H(t-1) + 1\) represents the number of ‘observation-equivalents’ of past experiences before round t; \(\rho\) is the depreciation rate or retrospective discount factor, which measures the impact of past experiences relative to the new period; H(0) is the value assigned before the game, and \(F_i^a(0)\) represents agent i’s initial preference for strategy a at the start of the game.
By Eq. (5), the probability of choosing strategy a (\(a \in \{C,D\}\)) is defined as follows:
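In the logit form standard in EWA learning, \(Pr_i^a=\frac{e^{\lambda ^a F_i^a(t)}}{\sum _{k\in \{C,D\}}e^{\lambda ^k F_i^k(t)}}\),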
where \(\lambda ^k ~ (k \in \{C,D\})\) represents sensitivity to strategy k.
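For concreteness, the following minimal Python sketch implements one EWA preference update and the logit choice rule in the forms reconstructed above. It is an illustration under those assumptions rather than the authors’ reference code; in particular, the function names are ours, and the agent is assumed to face a single opponent per round when computing the pure payoff.

```python
import numpy as np

def ewa_update(F, H_prev, phi, rho, delta, payoffs, chosen):
    """One EWA preference update (the reconstructed Eq. (5)) for a single agent.

    F       : length-2 array of preferences for (C, D) after the previous round
    H_prev  : experience weight H(t-1)
    payoffs : length-2 array, payoff each own strategy would have earned this round
    chosen  : index of the strategy actually played (0 = C, 1 = D)
    """
    H = rho * H_prev + 1.0                    # H(t) = rho * H(t-1) + 1
    weight = np.full(2, delta)                # unchosen strategy weighted by delta
    weight[chosen] = 1.0                      # chosen strategy: delta + (1 - delta) = 1
    F_new = (phi * H_prev * F + weight * payoffs) / H
    return F_new, H

def choice_probabilities(F, lam=(1.0, 1.0)):
    """Logit choice rule (the reconstructed Eq. (6)): Pr(a) proportional to exp(lambda_a * F_a)."""
    z = np.asarray(lam) * F
    z = z - z.max()                           # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Example: an agent that cooperated against a defector in the PDG used in this paper.
A = np.array([[-1.0, -10.0],                  # row C: payoffs against (C, D)
              [0.0,  -8.0]])                  # row D: payoffs against (C, D)
F, H = np.array([3.0, 3.0]), 3.0              # F^C(0) = F^D(0) = 3, H(0) = 3
payoffs = A[:, 1]                             # opponent defected: payoff of C and of D
F, H = ewa_update(F, H, phi=0.8, rho=0.01, delta=0.01, payoffs=payoffs, chosen=0)
print(choice_probabilities(F))                # probabilities of choosing C and D next round
```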
For all experiments analyzed in this paper, the payoff matrices for the PDG, CG and CoG are \(\begin{bmatrix} -1 & -10 \\ 0 & -8 \end{bmatrix}\), \(\begin{bmatrix} 2 & 0 \\ 4 & -1 \end{bmatrix}\) and \(\begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix}\), respectively.
To verify the convergence of these games under the RL update rule, different initial population states \(\frac{x_C(0)}{N}\), depreciation rates \(\rho\), weighting coefficients \(\delta\) and discount factors \(\phi\) are considered. We set the parameters \(N=1000\), \(H(0)=3\), \(maxgen=300\), \(\lambda ^C=\lambda ^D=1\) and \(F_i^C(0)=F_i^D(0)=3\) for all \(i \in \{1, \cdots , N\}\). When \(\delta =\rho =0.5\), the \(\phi\) vs \(\frac{x_C(0)}{N}\) phase planes for the mean fraction of cooperators (denoted briefly by MFC) are shown in Figs. 1a, 2a and 3a. In addition, cooperative traits in the \(\delta -\frac{x_C(0)}{N}\) parameter panel with \(\phi =\rho =0.5\) are depicted in Figs. 1b, 2b and 3b. The \(\rho\) vs \(\frac{x_C(0)}{N}\) phase planes for the MFC with \(\phi =\delta =0.5\) are visualized in Figs. 1c, 2c and 3c.
(I) PDG
From Fig. 1b,c, the MFC increases as \(\delta\) and \(\rho\) increase; that is, high values of \(\delta\) and \(\rho\) are more effective at promoting cooperation. However, Fig. 1a shows that the MFC decreases as \(\phi\) increases, i.e., a lower value of \(\phi\) is more effective at promoting cooperation.
(II) CG
From Fig. 2a, the MFC decreases as \(\phi\) increases, i.e., a lower value of \(\phi\) is more effective at promoting cooperation. However, from Fig. 2b,c, the MFC increases as \(\delta\) and \(\rho\) increase; that is, large values of \(\delta\) and \(\rho\) are more effective at promoting cooperation.
(III) CoG
From Fig. 3b,c, the MFC decreases as \(\delta\) and \(\rho\) increase, i.e., lower values of \(\delta\) and \(\rho\) are more effective at promoting cooperation. However, Fig. 3a shows that the MFC increases with \(\phi\); that is, a large value of \(\phi\) is more effective at promoting cooperation.
Heat maps of the MFC (colors) for the PDG (\(a_{11}=-1\), \(a_{12}=-10\), \(a_{21}=0\), \(a_{22}=-8\)) with different combinations of \(\phi\) and the initial population state \(\frac{x_C(0)}{N}\), \(\delta\) and \(\frac{x_C(0)}{N}\), and \(\rho\) and \(\frac{x_C(0)}{N}\). Panel (a) shows the \(\phi -\frac{x_C(0)}{N}\) parameter panel with \(\delta =\rho =0.5\). Panel (b) illustrates the MFC as a function of \((\delta ,\frac{x_C(0)}{N})\) under \(\phi =\rho =0.5\). Panel (c) depicts the MFC as a function of \((\rho ,\frac{x_C(0)}{N})\) under \(\phi =\delta =0.5\). The figure shows that high values of \(\delta\) and \(\rho\) but a lower value of \(\phi\) are more effective at promoting cooperation.
Heat maps of the MFC (colors) for the CG (\(a_{11}=2\), \(a_{12}=0\), \(a_{21}=4\), \(a_{22}=-1\)) with different values of \(\phi\) and the initial population state \(\frac{x_C(0)}{N}\), \(\delta\) and \(\frac{x_C(0)}{N}\), and \(\rho\) and \(\frac{x_C(0)}{N}\). Panel (a) shows the \(\phi -\frac{x_C(0)}{N}\) parameter panel with \(\delta =\rho =0.5\). Panel (b) shows the \(\delta -\frac{x_C(0)}{N}\) parameter panel with \(\phi =\rho =0.5\). Panel (c) shows cooperative traits in the \(\rho -\frac{x_C(0)}{N}\) parameter panel with \(\phi =\delta =0.5\). The figure shows that high values of \(\delta\) and \(\rho\) but a lower value of \(\phi\) are more effective at promoting cooperation.
Heat maps of the MFC (colors) for the CoG (\(a_{11}=3\), \(a_{12}=0\), \(a_{21}=0\), \(a_{22}=2\)) with different combinations of \(\phi\) and the initial population state \(\frac{x_C(0)}{N}\), \(\delta\) and \(\frac{x_C(0)}{N}\), and \(\rho\) and \(\frac{x_C(0)}{N}\). Panel (a) shows the \(\phi -\frac{x_C(0)}{N}\) parameter panel with \(\delta =\rho =0.5\). Panel (b) illustrates the MFC as a function of \((\delta ,\frac{x_C(0)}{N})\) under \(\phi =\rho =0.5\). Panel (c) shows cooperative traits in the \(\rho -\frac{x_C(0)}{N}\) parameter panel with \(\phi =\delta =0.5\). The figure shows that a high value of \(\phi\) but lower values of \(\delta\) and \(\rho\) are more effective at promoting cooperation.
The ML update rule
Given that agents possess a certain degree of rationality, they exhibit both IL and RL abilities in each round of the game. Based on the IL update rule outlined in Section 3.1 and the RL update rule described in Section 3.2, this paper proposes a ML update rule. Specifically, agents select the RL update rule with probability \(\gamma\) (\(\gamma \in [0,1]\)) and the IL update rule with probability \(1-\gamma\). When the strategies updated through IL and RL are inconsistent, the strategy updated via RL is selected with probability \(\gamma\), and the strategy updated through IL is chosen with probability \(1-\gamma\). The parameter \(\gamma\) represents the tradeoff between IL and RL. When \(\gamma = 0\), agents rely solely on IL; when \(\gamma = 1\), they adopt only RL. For values of \(\gamma \in (0,1)\), agents must consider both update rules. To examine the impact of the ML update rule on the evolution of cooperation, three games (PDG, CG, and CoG) are analyzed in both well-mixed populations and square lattice networks. Diagrams illustrating the adoption of the proposed ML update rule in a well-mixed population and a square lattice network are presented in Fig. 4a,b, respectively.
(a) Agents participate in a symmetric \(2 \times 2\) game within a finite, well-mixed population. Each agent selects the RL update rule with probability \(\gamma\) (\(\gamma \in [0,1]\)) and the IL update rule with probability \(1-\gamma\). In the left black box, the solid red circle represents a cooperator, while the solid green circle represents a defector. The agents’ neighbors are not fixed, and every agent in the population has the potential to interact with others. The learning update rules for the agents are shown in the right black box. (b) Agents engage in a symmetric \(2 \times 2\) game within a square lattice network. All agents are positioned at the vertices of the square lattice, and their neighbors are fixed. Each agent can only interact with their immediate neighbors. As in part (a), all agents choose the RL update rule with probability \(\gamma\) (\(\gamma \in [0,1]\)) and the IL update rule with probability \(1-\gamma\).
Based on the ML update rule proposed in this paper, the agent’s strategy update process is outlined as follows:
(a) Initialize parameters, including the introspection rate \(\alpha\), discount factor \(\phi\), depreciation rate \(\rho\), the weighting coefficient for unselected strategies \(\delta\), the trade-off coefficient \(\gamma\), initial preferences \(F^C(0)\) and \(F^D(0)\), the initial experience value H(0), population size N, and the maximum number of iterations maxgen.
(b) Initialize the population: each agent randomly selects a strategy from \(\{C,D\}\), i.e., each agent is randomly assigned a value of 0 (defection) or 1 (cooperation).
(c) Determine the population state: the population state is obtained by calculating \(\sum _i s_i\) and \(N - \sum _i s_i\), where \(i \in \{1, \dots , N\}\) and \(s_i \in \{0,1\}\). This state is used in Eqs. (1) and (2).
(d) Calculate average payoffs, pure payoffs, and probabilities: average payoffs are calculated under the IL update rule, while pure payoffs and probabilities are determined under the RL update rule. Expected payoffs of agents choosing strategies C and D are obtained from Eqs. (1) and (2) in a well-mixed population, and from Eq. (3) in a square lattice network. Pure payoffs are derived from the payoff matrix and opponents. Based on the preferences for strategies C and D, computed using Eq. (5), the probability of choosing strategy a (\(a \in \{C,D\}\)) is determined by Eq. (6).
(e) Calculate pending strategies: the updated strategy of an agent is defined as pending strategy 1 under the RL update rule, and pending strategy 2 under the IL update rule. In a well-mixed population, the expected payoff of an agent choosing strategy C is compared with that of an agent choosing strategy D, and \(P_{max} = \max _{s \in \{C,D\}} u_s(x(t))\). Agents whose expected payoffs are less than \(P_{max}\) are recorded, and a ratio \(\alpha\) of them is randomly selected to adjust their strategies. In a square lattice network, under the IL update rule, if an agent and all of its neighbors follow the same strategy, the agent remains unchanged. Otherwise, the agent’s average payoff is compared with those of neighbors holding a different strategy; if the agent’s average payoff is the smallest, the agent may adjust its strategy. The agents likely to adjust their strategies are recorded, and a ratio \(\alpha\) of them is randomly selected to adjust their strategies. Additionally, agent i selects strategy C (respectively, strategy D) with probability \(Pr_i^C\) (respectively, \(Pr_i^D\)) under the RL update rule in both the well-mixed population and the square lattice network.
(f) Each agent compares their pending strategy 1 with pending strategy 2. If the strategies are unequal, the agent selects pending strategy 1 with probability \(\gamma\) and pending strategy 2 with probability \(1-\gamma\). If the strategies are equal, the agent selects pending strategy 1.
(g) If the maximum number of iterations maxgen is reached, the process ends; otherwise, return to step (c). See Fig. 2 for the flow chart of the strategy update process; an illustrative code sketch of one round follows below.
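To complement the outline above, the following minimal Python sketch assembles steps (c)–(f) for one round in a well-mixed population. It relies on the expected payoffs and the EWA/logit rules in the forms reconstructed earlier, and it assumes that each agent’s pure payoff under RL comes from one randomly drawn opponent per round; the function name and data layout are illustrative rather than the authors’ implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def ml_round(s, F, H, A, alpha, gamma, phi, rho, delta, lam=1.0):
    """One round of the mixed-learning update in a well-mixed population (sketch).

    s : length-N array of strategy indices, 0 = C, 1 = D (rows of the payoff matrix A)
    F : (N, 2) array of preferences for (C, D); H : scalar experience weight
    A : 2x2 payoff matrix, A[own, opponent] with 0 = C, 1 = D
    """
    N = s.size
    xC = int((s == 0).sum()); xD = N - xC
    # Step (d), IL side: expected payoffs (reconstructed Eqs. (1)-(2)).
    uC = (A[0, 0] * (xC - 1) + A[0, 1] * xD) / (N - 1)
    uD = (A[1, 0] * xC + A[1, 1] * (xD - 1)) / (N - 1)

    # Step (e), pending strategy 2 (IL): a ratio alpha of the lower-payoff agents imitate.
    pend_il = s.copy()
    if uC != uD:
        better = 0 if uC > uD else 1
        losers = np.flatnonzero(s != better)
        switch = losers[rng.random(losers.size) < alpha]
        pend_il[switch] = better

    # Step (e), pending strategy 1 (RL): EWA update with one random opponent per agent,
    # followed by a logit draw over the updated preferences.
    opp = s[rng.integers(0, N, size=N)]
    payoffs = A[:, opp].T                      # (N, 2): payoff of C and of D against opp
    w = np.full((N, 2), delta)
    w[np.arange(N), s] = 1.0                   # chosen strategy fully weighted
    H_new = rho * H + 1.0
    F = (phi * H * F + w * payoffs) / H_new
    z = lam * F
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    pend_rl = np.where(rng.random(N) < p[:, 0], 0, 1)

    # Step (f): if the two pending strategies disagree, keep RL with probability gamma.
    keep_rl = rng.random(N) < gamma
    s_new = np.where(pend_il == pend_rl, pend_rl, np.where(keep_rl, pend_rl, pend_il))
    return s_new, F, H_new
```

A full simulation would simply iterate ml_round for maxgen rounds from a random initial state, as described in step (g).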
Simulation experiment of well-mixed populations
Convergence analysis of the mixed update rule in a well-mixed population
This subsection focuses on the convergence of the three games under the ML update rule in a well-mixed population. \(\phi\) is the discount factor, indicating the extent to which past experiences affect the present; in general, past experiences are instructive for the present. \(\delta\) is the weighting coefficient for the assumed payoff obtained from the unselected strategy: if a certain strategy has hardly been selected before, this suggests that the strategy is a poor one, and the preference for it will also be relatively low. According to the analysis in Section 3.2, when \(\delta =0.01\), \(\rho =0.01\) and \(\phi =0.8\), the proportion of cooperators in the three games is relatively high. We set the parameters \(N=10000\), \(maxgen=100\), \(F_i^C(0)=F_i^D(0)=3\) for all \(i \in \{1, \cdots , N\}\), \(\gamma =0.5\), \(\alpha =0.3\), \(\delta =0.01\), \(\rho =0.01\), \(\phi =0.8\), \(\lambda ^C=\lambda ^D=1\), \(H(0)=3\) and \(x_C(0)=x_D(0)=\frac{N}{2}\). Figure 6 shows how the MFC of the three games evolves over the iterative process in a well-mixed population.
The MFC varies with the iteration process for three games in a well-mixed population. The blue solid lines, red solid lines, and green solid lines correspond to the PDG (\(a_{11}=-1\), \(a_{12}=-10\), \(a_{21}=0\), \(a_{22}=-8\)), CG (\(a_{11}=2\), \(a_{12}=0\), \(a_{21}=4\), \(a_{22}=-1\)), and CoG (\(a_{11}=3\), \(a_{12}=0\), \(a_{21}=0\), \(a_{22}=2\)), respectively.
In the PDG, the MFC fluctuates around \(42\%\) as iterations progress under the ML update rule. Since the unique Nash equilibrium of the PDG is that defectors dominate the entire population, the ML update rule alleviates the dilemma to some extent, encouraging agents to cooperate at a proportion of approximately \(42\%\).
In the CG, the MFC fluctuates around \(35\%\) as iterations progress under the ML update rule. According to the convergence analysis of IL, regardless of the initial population state, the system will ultimately converge to the mixed Nash equilibrium state \((\frac{1}{3}, \frac{2}{3})\). Thus, the proportion of cooperators increases under the ML update rule.
In the CoG, the proportion of cooperators reaches 1 by the 20-th iteration and remains constant thereafter under the ML update rule. Based on the convergence analysis of IL, if \(x_C(0) > \frac{2N+1}{5}\), the system will converge to a state where all agents choose cooperation. Therefore, under the ML update rule, the convergence result is consistent with that of the IL update rule.
Sensitivity analysis of trade-off coefficient on MFC
This subsection focuses on the sensitivity analysis of trade-off coefficient \(\gamma\) for three games under the ML update rule. We set parameters \(N=10000\), \(F_i^C(0)=F_i^D(0)=3\) for all \(i \in \{1, \cdots , N\}\), \(\alpha =0.3\), \(\delta =0.01\), \(\rho =0.01\), \(\phi =0.8\), \(\lambda ^C=\lambda ^D=1\), \(H(0)=3\) and \(x_C(0)=x_D(0)=\frac{N}{2}\).
The MFC varies with trade-off coefficient \(\gamma\) for three games as the iteration progresses in a well-mixed population: (a) in the PDG (\(a_{11}=-1\), \(a_{12}=-10\), \(a_{21}=0\), \(a_{22}=-8\)) with \(maxgen=100\); (b) in the CG (\(a_{11}=2\), \(a_{12}=0\), \(a_{21}=4\), \(a_{22}=-1\)) with \(maxgen=50\); (c) in the CoG (\(a_{11}=3\), \(a_{12}=0\), \(a_{21}=0\), \(a_{22}=2\)) with \(maxgen=100\). The blue solid lines, red solid lines, green solid lines and black solid lines show the proposed processes, with \(\gamma =0\), \(\gamma =0.3\), \(\gamma =0.7\) and \(\gamma =1\), respectively.
For the PDG, Fig. 7a shows that the MFC increases as \(\gamma\) increases. When \(\gamma =0\), the MFC decreases to \(0\%\) and remains unchanged thereafter. When \(\gamma =1\), the MFC fluctuates around \(45\%\).
In the CG, as seen in Fig. 7b, as \(\gamma\) increases, so does the MFC. The MFC is around \(43\%\) when \(\gamma = 0.7\). The MFC fluctuates around \(57\%\) when \(\gamma = 1\).
In the CoG, as shown in Fig. 7c, it takes longer for cooperators to dominate the population as \(\gamma\) increases. Cooperators occupy the entire population by the 18-th iteration when \(\gamma = 0\), whereas they occupy the entire population by the 40-th iteration when \(\gamma = 1\).
Sensitivity analysis of introspection rate on MFC
This subsection focuses on the sensitivity analysis of introspection rate \(\alpha\) for three types of games under the ML update rule. We set parameters \(N=10000\), \(F_i^C(0)=F_i^D(0)=3\) for all \(i \in \{1, \cdots , N\}\), \(\gamma =0.5\), \(\delta =0.01\), \(\rho =0.01\), \(\phi =0.8\), \(\lambda ^C=\lambda ^D=1\), \(H(0)=3\) and \(x_C(0)=x_D(0)=\frac{N}{2}\).
The MFC varies with introspection rate \(\alpha\) for three games as the iteration progresses in a well-mixed population: (a) in the PDG (\(a_{11}=-1\), \(a_{12}=-10\), \(a_{21}=0\), \(a_{22}=-8\)) with \(maxgen=100\); (b) in the CG (\(a_{11}=2\), \(a_{12}=0\), \(a_{21}=4\), \(a_{22}=-1\)) with \(maxgen=50\); (c) in the CoG (\(a_{11}=3\), \(a_{12}=0\), \(a_{21}=0\), \(a_{22}=2\)) with \(maxgen=100\). The blue solid lines, red solid lines, green solid lines and black solid lines show the proposed processes, with \(\alpha =0\), \(\alpha =0.3\), \(\alpha =0.7\) and \(\alpha =1\), respectively.
As shown in Fig. 8a, the MFC in the PDG decreases as the introspection rate \(\alpha\) increases. When \(\alpha = 0\), the MFC fluctuates around \(45\%\), whereas it fluctuates around \(40\%\) when \(\alpha = 1\).
In Fig. 8b, the MFC in the CG decreases as the introspection rate \(\alpha\) increases. When \(\alpha = 0\), the MFC is around \(57\%\), whereas it fluctuates around \(35\%\) when \(\alpha = 0.3\).
In Fig. 8c, for the CoG, the time required for cooperators to dominate the entire population increases as \(\alpha\) decreases. When \(\alpha = 0\), cooperators occupy the entire population by the 70-th iteration, whereas at \(\alpha = 1\), they occupy the entire population by the 9-th iteration.
Figure 7 shows that the trade-off coefficient \(\gamma\) inhibits cooperation in the CoG but promotes cooperation in the PDG and the CG within a well-mixed population.
Figure 8 illustrates that the introspection rate \(\alpha\) promotes cooperation in the CoG but inhibits cooperation in the PDG and the CG within a well-mixed population.
Simulation experiment of square lattice network
To further substantiate the conclusions drawn in Section 4 and gain a more detailed understanding of this phenomenon at the microscopic level, snapshots of a square lattice in the three games were analyzed (see Figs. 9, 10, and 12). First, a \(100 \times 100\) square lattice network is constructed, where each node represents an agent and the edges between nodes represent neighbor relationships. Internal nodes have four neighbors, while boundary nodes have either two or three neighbors, as shown in Fig. 4b. Unlike in well-mixed populations, each agent can only interact with its fixed neighbors. Under the IL update rule, the average payoff of each agent is calculated based on interactions with all of its neighbors, while the pure payoff is used under the RL update rule. All agents choose the RL update rule with probability \(\gamma\) (\(\gamma \in [0,1]\)) and the IL update rule with probability \(1 - \gamma\).
Convergence analysis of the ML rule in a square network
We set parameters \(N=10000\), \(maxgen=200\), \(\gamma =0.5\), \(\alpha =0.3\), \(\delta =0.01\), \(\rho =0.01\), \(\phi =0.8\), \(\lambda ^C=\lambda ^D=1\), \(F_i^C(0)=F_i^D(0)=3\) for all \(i \in \{1, \cdots , N\}\) and \(H(0)=3\). The initial population state is given randomly.
Snapshots of the population states for three games in a \(100 \times 100\) square lattice network. From top to bottom, (a–c) represent the CoG with \(a_{11}=3\), \(a_{12}=0\), \(a_{21}=0\), \(a_{22}=2\); (d–f) denote the CG with \(a_{11}=2\), \(a_{12}=0\), \(a_{21}=4\), \(a_{22}=-1\); (g–i) show the PDG with \(a_{11}=-1\), \(a_{12}=-10\), \(a_{21}=0\), \(a_{22}=-8\). From left to right, (a, d, g) represent the population states of those games at 50-th iteration; (b, e, h) show the population states of those games at 100-th iteration; (c, f, i) denote the population states of those games at 150-th iteration. Cooperators are depicted in red and defectors in green.
Fig. 9 shows snapshots of the population states for the three games under the ML update rule at the 50-th, 100-th and 150-th iterations, respectively. By Fig. 9a–c, the MFC of the CoG gradually increases as iterations proceed, and eventually cooperators occupy the entire population. It follows from Fig. 9d–f that the MFC of the CG remains near \(40\%\), which improves the proportion of cooperators compared to the mixed Nash equilibrium \((\frac{1}{3},\frac{2}{3})\). By Fig. 9g–i, the MFC of the PDG fluctuates around \(44\%\), which resolves the dilemma of all agents choosing to defect in the unique Nash equilibrium.
Sensitivity analysis of trade-off coefficient on MFC
We set parameters \(N=10000\), \(maxgen=200\), \(\alpha =0.3\), \(\delta =0.01\), \(\rho =0.01\), \(\phi =0.8\), \(\lambda ^C=\lambda ^D=1\), \(F_i^C(0)=F_i^D(0)=3\) for all \(i \in \{1, \cdots , N\}\) and \(H(0)=3\). The initial population state is given randomly.
Snapshots of the population states for three games in a \(100 \times 100\) square lattice network. From top to bottom, the first row denotes the CoG with \(a_{11}=3\), \(a_{12}=0\), \(a_{21}=0\), \(a_{22}=2\); the second row shows the CG with \(a_{11}=2\), \(a_{12}=0\), \(a_{21}=4\), \(a_{22}=-1\); the third row denotes the PDG with \(a_{11}=-1\), \(a_{12}=-10\), \(a_{21}=0\), \(a_{22}=-8\). From left to right, the first, second, third and fourth columns represent the population states of these games with \(\gamma =0\), \(\gamma =0.3\), \(\gamma =0.7\) and \(\gamma =1\), respectively. Cooperators are depicted in red and defectors in green.
The MFC varies with trade-off coefficient \(\gamma\) for three games as iterations proceed in a \(100 \times 100\) square lattice network: (a) CoG with \(a_{11}=3\), \(a_{12}=0\), \(a_{21}=0\) and \(a_{22}=2\); (b) CG with \(a_{11}=2\), \(a_{12}=0\), \(a_{21}=4\) and \(a_{22}=-1\); (c) PDG with \(a_{11}=-1\), \(a_{12}=-10\), \(a_{21}=0\) and \(a_{22}=-8\). The blue solid lines, red solid lines, green solid lines, and black solid lines show the proposed processes with \(\gamma =0\), \(\gamma =0.3\), \(\gamma =0.7\), and \(\gamma =1\), respectively.
Figures 10 and 11 show, respectively, snapshots and the MFC of the three games as the iteration proceeds for different values of the trade-off coefficient \(\gamma\) in a square lattice network. From Figs. 10a–d and 11a, it takes more time for cooperators to occupy the population as \(\gamma\) grows when \(\gamma <1\). When \(\gamma =1\), the proportion of cooperators gradually increases as iterations proceed, reaching \(80\%\) at the 200-th iteration. Therefore, a lower value of \(\gamma\) is more effective at promoting cooperation in the CoG.
From Figs. 10e–h and 11b, if \(\gamma =1\), the MFC of the CG is the largest, converging to \(58\%\). If \(\gamma =0\), the MFC converges to \(36\%\). Therefore, a higher value of \(\gamma\) is more effective at promoting cooperation in the CG.
It follows from Figs. 10i–l and 11c that a higher value of \(\gamma\) is also more effective at promoting cooperation in the PDG. The MFC of the PDG is around \(47\%\) when \(\gamma =1\), whereas it is noticeably lower for smaller values of \(\gamma\), fluctuating around \(37\%\).
Sensitivity analysis of introspection rate on MFC
We set parameters \(N=10000\), \(maxgen=200\), \(\gamma =0.5\), \(\delta =0.01\), \(\rho =0.01\), \(\phi =0.8\), \(\lambda ^C=\lambda ^D=1\), \(F_i^C(0)=F_i^D(0)=3\) for all \(i \in \{1, \cdots , N\}\) and \(H(0)=3\). The initial population state is given randomly.
Figures 12 and 13 show, respectively, snapshots and the MFC of the three games as the iteration proceeds for different values of the introspection rate \(\alpha\) in a square lattice network. From Figs. 12a–d and 13a, in the CoG, it takes less time for cooperators to occupy the population as \(\alpha\) increases when \(\alpha >0\). When \(\alpha =0\), the proportion of cooperators gradually increases as the iteration progresses, reaching \(77\%\). Therefore, a high value of \(\alpha\) is more effective at promoting cooperation.
It follows from Figs. 12e–h and 13b that the MFC of the CG decreases as \(\alpha\) increases. The MFC fluctuates around \(58\%\) when \(\alpha =0\), while it fluctuates around \(32\%\) when \(\alpha =1\).
Similarly, it follows from Figs. 12i–l and 13c that a lower value of \(\alpha\) is more effective at promoting cooperation in the PDG.
Snapshots of the population states for three games in a \(100 \times 100\) square lattice network. From top to bottom, (a–d) denote the CoG with \(a_{11}=3\), \(a_{12}=0\), \(a_{21}=0\), \(a_{22}=2\); (e–h) show the CG with \(a_{11}=2\), \(a_{12}=0\), \(a_{21}=4\), \(a_{22}=-1\); (i–l) represent the PDG with \(a_{11}=-1\), \(a_{12}=-10\), \(a_{21}=0\), \(a_{22}=-8\). From left to right, the population states of these games are represented with \(\alpha =0\), \(\alpha =0.3\), \(\alpha =0.7\) and \(\alpha =1\), respectively. Cooperators are depicted in red and defectors in green.
The MFC varies with introspection rate \(\alpha\) for three games as iterations proceed in a \(100 \times 100\) square lattice network: (a) CoG with \(a_{11}=3\), \(a_{12}=0\), \(a_{21}=0\) and \(a_{22}=2\); (b) CG with \(a_{11}=2\), \(a_{12}=0\), \(a_{21}=4\) and \(a_{22}=-1\); (c) PDG with \(a_{11}=-1\), \(a_{12}=-10\), \(a_{21}=0\) and \(a_{22}=-8\). The blue solid lines, red solid lines, green solid lines, and black solid lines show the proposed processes, with \(\alpha =0\), \(\alpha =0.3\), \(\alpha =0.7\), and \(\alpha =1\), respectively.
Sensitivity analyses of trade-off coefficient and introspection rate on MFC
We set parameters \(N=10000\), \(maxgen=200\), \(\delta =0.01\), \(\rho =0.01\), \(\phi =0.8\), \(\lambda ^C=\lambda ^D=1\), \(F_i^C(0)=F_i^D(0)=3\) for all \(i \in \{1, \cdots , N\}\) and \(H(0)=3\). The initial population state is given randomly.
The MFC versus the trade-off coefficient \(\gamma\) in panel (a) and the MFC versus the introspection rate \(\alpha\) in panel (b) for the three games in a \(100 \times 100\) square lattice network. The blue hollow circles, magenta solid squares and black solid triangles represent the CoG (\(a_{11}=3\), \(a_{12}=0\), \(a_{21}=0\) and \(a_{22}=2\)), the PDG (\(a_{11}=-1\), \(a_{12}=-10\), \(a_{21}=0\) and \(a_{22}=-8\)), and the CG (\(a_{11}=2\), \(a_{12}=0\), \(a_{21}=4\) and \(a_{22}=-1\)), respectively.
From Figs. 10, 11 and 14a, the trade-off coefficient \(\gamma\) inhibits cooperation in the CoG but promotes cooperation in the CG and PDG in a square lattice network.
From Figs. 12, 13 and 14b, the introspection rate \(\alpha\) promotes cooperation in the CoG but inhibits cooperation in the CG and PDG in a square lattice network.
Conclusion
This paper examines the evolution of cooperation under a ML update mechanism, which combines the IL and RL update rules, within the context of a finite homogeneous population. RL agents are independent of the strategic environment and choose strategies with probabilities determined by their strategy preferences, while imitators compare their payoffs with those of their opponents and adopt the strategies of opponents with higher payoffs.
The results obtained in the well-mixed population are consistent with those observed in the square lattice network. Simulation analyses under the ML mechanism, conducted in both settings, reveal the following key findings: in the PDG, a stable proportion of cooperators emerges, overcoming the dilemma where defectors dominate the population; in the CG, the proportion of cooperators increases compared to the mixed Nash equilibrium; and in the CoG, cooperators eventually occupy the entire population. Sensitivity analyses of the ML update rule’s parameters show that increasing the probability of choosing RL (\(\gamma\)) enhances the proportion of cooperators in the CG and PDG, while prolonging the time for cooperators to occupy the entire population in the CoG. Additionally, increasing the proportion of imitators (\(\alpha\)) decreases the proportion of cooperators in the PDG and the CG, but shortens the time for cooperators to dominate the population in the CoG.
Furthermore, relative to the ML update mechanism, a sole RL update rule suppresses cooperation in the CoG, while the opposite effect is observed in the PDG and the CG. In contrast, a sole IL update rule promotes cooperation in the CoG relative to the ML update mechanism, whereas the opposite effect occurs in the PDG and the CG.
In the future, the case of infinite populations can be considered. In addition, factors such as reputation and social exclusion can be taken into account in agents’ payoffs or preferences.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Wang, S. Y., Liu, Y. P., Zhang, F. & Wang, R. W. Super-rational aspiration induced strategy updating promotes cooperation in the asymmetric prisoner’s dilemma game. Appl. Math. Comput. 403(5805), 126180 (2021).
Pennisi, E. How did cooperative behavior evolve?. Science 309, 93–93 (2005).
Amaral, M. A. et al. Evolutionary mixed games in structured populations: Cooperation and the benefits of heterogeneity. Phys. Rev. E 93(4), 042304 (2016).
Hofbauer, J. & Sigmund, K. Evolutionary Games and Population Dynamics (Cambridge University Press, 1998).
Sandholm, W. H. Population Games and Evolutionary Dynamics (MIT Press, 2010).
Michod, R. E. The theory of kin selection. Annu. Rev. Ecol. Syst. 13(1), 23–55 (2003).
Foster, K. R., Wenseleers, T. & Ratnieks, F. L. W. Kin selection is the key to altruism. Trends Ecol. Evol. 21(2), 57–60 (2006).
Gardner, A. Kin selection. International Encyclopedia of the Social and Behavioral Sciences (Second Edition) 26–31 (2015).
Maskin, E. & Fudenberg, D. Evolution and cooperation in noisy repeated games. Am. Econ. Rev. 80(2), 274–279 (1990).
Pacheco, J. M., Traulsen, A., Ohtsuki, H. & Nowak, M. A. Repeated games and direct reciprocity under active linking. J. Theor. Biol. 250(4), 723–731 (2008).
Panchanathan, K. & Boyd, R. Indirect reciprocity can stabilize cooperation without the second-order free rider problem. Nature 432(7016), 499–502 (2004).
Nowak, M. Five rules for the evolution of cooperation. Science 314(5805), 1560–1563 (2006).
Traulsen, A. & Nowak, M. A. Evolution of cooperation by multilevel selection. Proc. Natl. Acad. Sci. USA 103(29), 10952–10955 (2006).
Ohtsuki, H. & Nowak, M. A. The replicator equation on graphs. J. Theor. Biol. 243(1), 86–97 (2006).
Ohtsuki, H. & Nowak, M. A. Evolutionary stability on graphs. J. Theor. Biol. 251, 698–707 (2008).
Du, W. B., Cao, X. B., Hu, M. B. & Wang, W. X. Asymmetric cost in snowdrift game on scale-free networks. Europhys. Lett. 87(6), 60004 (2009).
Su, Q., McAvoy, A. & Plotkin, J. B. Strategy evolution on dynamic networks. Nat. Comput. Sci. 3, 763–776 (2023).
Gao, S. P., Du, J. M. & Liang, J. L. Evolution of cooperation under punishment. Phys. Rev. E 101, 062419 (2021).
Quan, J., Pu, Z. & Wang, X. Comparison of social exclusion and punishment in promoting cooperation: Who should play the leading role?. Chaos Solitons Fractals 151, 111229 (2021).
Pi, J. X., Yang, G. H. & Yang, H. Evolutionary dynamics of cooperation in N-person snowdrift games with peer punishment and individual disguise. Physica A 592, 126839 (2022).
Liu, L., Chen, X. J. & Perc, M. Evolutionary dynamics of cooperation in the public goods game with pool exclusion strategies. Nonlinear Dyn. 97, 749–766 (2019).
Pan, Q., Wang, L. & He, M. Social dilemma based on reputation and successive behavior. Appl. Math. Comput. 384, 125358 (2020).
Zhu, H. et al. Reputation-based adjustment of fitness promotes the cooperation under heterogeneous strategy updating rules. Phys. Lett. A 384(34), 126882 (2020).
Tang, W., Yang, H. & Wang, C. Reputation mechanisms and cooperative emergence in complex network games: Current status and prospects. EPL 149, 41001 (2025).
Tang, W. et al. Cooperative emergence of spatial public goods games with reputation discount accumulation. New J. Phys. 26, 013017 (2024).
Bi, Y. & Yang, H. Based on reputation consistent strategy times promotes cooperation in spatial prisoner’s dilemma game. Appl. Math. Comput. 444, 127818 (2023).
Santos, F. C. & Pacheco, J. M. Scale-free networks provide a unifying framework for the emergence of cooperation. Phys. Rev. Lett. 95, 098104 (2005).
Stojkoski, V., Utkovski, Z., Basnarkov, L. & Kocarev, L. Cooperation dynamics of generalized reciprocity in state-based social dilemmas. Phys. Rev. E 97(5), 052305 (2018).
Moran, P. A. P. The Statistical Processes of Evolutionary Theory (Clarendon Press, 1962).
Nowak, M. A., Sasaki, A., Taylor, C. & Fudenberg, D. Emergence of cooperation and evolutionary stability in finite populations. Nature 428(6983), 646 (2004).
Altrock, P. M. & Traulsen, A. Deterministic evolutionary game dynamics in finite populations. Phys. Rev. E 80, 011909 (2009).
Nowak, M. A. Evolutionary Dynamics: Exploring the Equations of Life (Harvard University Press, 2006).
Traulsen, A., Pacheco, J. M. & Nowak, M. A. Pairwise comparison and selection temperature in evolutionary game dynamics. J. Theor. Biol. 246(3), 522–529 (2007).
Sigmund, K. The Calculus of Selfishness (Princeton University Press, 2010).
Traulsen, A., Claussen, J. C. & Hauert, C. Coevolutionary dynamics: From finite to infinite populations. Phys. Rev. Lett. 95(23), 238701 (2005).
Traulsen, A., Nowak, M. A. & Pacheco, J. M. Stochastic payoff evaluation increases the temperature of selection. J. Theor. Biol. 244, 349–356 (2007).
Altrock, P. M. & Traulsen, A. Fixation times in evolutionary games under weak selection. New J. Phys. 11(1), 013012 (2009).
Szolnoki, A. & Chen, X. Gradual learning supports cooperation in spatial prisoner’s dilemma game. Chaos Solitons Fractals 130, 109447 (2020).
Du, J., Wu, B., Altrock, P. M. & Wang, L. Aspiration dynamics of multi-player games in finite populations. J. R. Soc. Interface 11(94), 20140077 (2014).
Du, J., Wu, B. & Wang, L. Aspiration dynamics in structured population acts as if in a well-mixed one. Sci. Rep. 5, 8014 (2015).
Liu, X., He, M., Kang, Y. & Pan, Q. Aspiration promotes cooperation in the prisoner’s dilemma game with the imitation rule. Phys. Rev. E 94(1), 012124 (2016).
Perc, M., Wang, Z. & Marshall, J. A. R. Heterogeneous aspirations promote cooperation in the prisoner’s dilemma game. PLoS ONE 5(12), e15117 (2010).
Platkowski, T. Enhanced cooperation in prisoner’s dilemma with aspiration. Appl. Math. Lett. 22(8), 1161–1165 (2009).
Lim, I. S. & Wittek, P. Satisfied-defect, unsatisfied-cooperate: A novel evolutionary dynamics of cooperation led by aspiration. Phys. Rev. E 98(6), 062113 (2018).
Wang, L. et al. Lévy noise promotes cooperation in the prisoner’s dilemma game with reinforcement learning. Nonlinear Dyn. 2, 1–10 (2022).
Jia, N. & Ma, S. Evolution of cooperation in the snowdrift game among mobile players with random-pairing and reinforcement learning. Physica A 392(22), 5700–5710 (2013).
Zhang, S. P., Zhang, J. Q., Chen, L. & Liu, X. D. Oscillatory evolution of collective behavior in evolutionary games played with reinforcement learning. Nonlinear Dyn. 99, 3301–3312 (2020).
Jia, D., Li, T., Zhao, Y., Zhang, X. & Wang, Z. Empty nodes affect conditional cooperation under reinforcement learning. Appl. Math. Comput. 413(6398), 126658 (2022).
Deng, Y. & Zhang, J. Memory-based prisoner’s dilemma game with history optimal strategy learning promotes cooperation on interdependent networks. Appl. Math. Comput. 390, 125675 (2021).
Amaral, M. A., Wardil, L., Perc, M. & Silva, J. K. L. D. Stochastic win-stay-lose-shift strategy with dynamic aspirations in evolutionary social dilemmas. Phys. Rev. E 94(3–1), 032317 (2016).
Arefin, M. R. & Tanimoto, J. Evolution of cooperation in social dilemmas under the coexistence of aspiration and imitation mechanisms. Phys. Rev. E 102, 032120 (2020).
Wang, X., Gu, C., Zhao, J. & Quan, J. Evolutionary game dynamics of combining the imitation and aspiration-driven update rules. Phys. Rev. E 100(2–1), 022411 (2019).
Xu, K., Li, K., Cong, R. & Wang, L. Cooperation guided by the coexistence of imitation dynamics and aspiration dynamics in structured populations. Europhys. Lett. 117(4), 48002 (2017).
Amaral, M. A. & Javarone, M. A. Heterogeneous update mechanisms in evolutionary games: mixing innovative and imitative dynamics. Phys. Rev. E 97(4–1), 042305 (2018).
Camerer, C. & Ho, T. Experience-Weighted Attraction Learning in Games: A Unifying Approach. Working Papers (1997).
Imhof, L. A. & Nowak, M. A. Evolutionary game dynamics in a wright-fisher process. J. Math. Biol. 52(5), 667–681 (2006).
Ashcroft, P., Altrock, P. M. & Galla, T. Fixation in finite populations evolving in fluctuating environments. J. R. Soc. Interface 11(100), 20140663 (2014).
Fudenberg, D. & Levine, D. K. The Theory of Learning in Games (MIT Press, 1998).
Bjornerstedt, J. & Weibull, J.W. Nash equilibrium and evolution by imitation. Working Paper Series (1994).
Acknowledgements
This study received support from the Guizhou Provincial Science and Technology Projects (ZK[2022] General 168), the National Natural Science Foundation of China (Grant 11271098), Scientific Research Project Results of Guizhou Open University (Guizhou Vocational and Technical College) (Project Number: 2023YB26) and Scientific Research Projects for the Introduced Talents of Guizhou University (No. [2021]90).
Author information
Authors and Affiliations
Contributions
Wei Tang: Formal analysis, Methodology, Writing – original draft, Writing – review & editing. Guolin Wang: Conceptualization, Formal analysis, Writing – review & editing. Zhiyan Xing: Formal analysis, Writing – review & editing.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Tang, W., Wang, G. & Xing, Z. Evolution of cooperation guided by the coexistence of imitation learning and reinforcement learning. Sci Rep 15, 26136 (2025). https://doi.org/10.1038/s41598-025-11557-y
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-025-11557-y