Introduction

GPT has been found to perform remarkably well on a wide range of sophisticated tasks, with few, if any, shortcomings, to the point that some researchers regard it as an early form of artificial general intelligence (Bubeck et al., 2023; Campello de Souza et al., 2023). Not only does GPT perform well in various real-world tasks (Kung et al., 2023; Choi et al., 2021; Terwiesch, 2023), but it also displays a high level of cognitive ability (Campello de Souza et al., 2023; Kosinski, 2023). Several recent studies have suggested that GPT has human-like abilities in many domains of individual and interpersonal decision making (Chen et al., 2023; Mei et al., 2024; Yin et al., 2024; Goli and Singh, 2024).

Humans are emotional animals, and emotions play an essential role in human communication and cognition (Zhao et al., 2022). As these models develop, whether GPT has human-like emotional abilities is also of interest to researchers. Existing studies of GPT's emotional abilities focus on emotion recognition (Yin et al., 2024; Zhao et al., 2022; Valor et al., 2022; Roberts et al., 2015; Martínez-Miranda and Aldea, 2005; Khare et al., 2023; Lian et al., 2024). Whether GPT has the ability to vent emotions, however, has not been investigated.

Inspired by the AI behavioral science approach (Chen et al., 2023; Yin et al., 2024; Camerer, 2019; Meng, 2024), we investigate GPT's emotion venting ability using classic behavioral economic paradigms. Behavioral economic games have consistently demonstrated that individuals often respond to violations of social norms with costly punishment. But why do individuals incur such personal costs to punish others? Behavioral economics explains this phenomenon through emotional motives: negative emotions provoked by unfair offers can lead people to sacrifice some financial gain to punish norm violators (Sanfey et al., 2003; Fehr and Gächter, 2002; Pillutla and Murnighan, 1996; Xiao and Houser, 2005; Dickinson and Masclet, 2015). For example, Sanfey et al. (2003) found that unfair offers that responders rejected in the ultimatum game (UG) evoked significantly heightened activity in the anterior insula, a brain region related to emotion processing, and that unfair offers were also associated with increased activity in the anterior cingulate cortex (ACC), which has been implicated in the detection of cognitive conflict. As unfair offers in the UG induce a conflict in the responder between cognitive ("accept") and emotional ("reject") motives, these neural findings suggest that emotions play an important role in decision-making. Pillutla and Murnighan (1996) found that responders in the UG are most likely to reject unfair offers when they attribute responsibility to the proposers, and that anger explains the rejections better than the mere perception that the offers are unfair. Fehr and Gächter (2002) suggested that free riding may cause strong negative emotions among cooperators, and that these emotions, in turn, may trigger their willingness to punish the free riders.

However, there are various ways to express emotions, and it is important to examine whether these modes of emotional expression influence punitive behavior. Xiao and Houser (2005) used the UG to study the effect of emotion venting on second-party punishment. They conducted two treatments: Emotion Expression (EE) and No Emotion Expression (NEE). NEE was the standard UG, while responders in EE were given an opportunity to write a message to the proposer. For unfair offers (20% or less), almost all players expressed negative emotions. Crucially, rejection rates of unfair offers were significantly lower in the EE treatment than in the NEE treatment, suggesting that the opportunity to express emotions mitigated punitive behavior. Dickinson and Masclet (2015) explored whether venting emotions in different ways could influence the level of punishment in a public goods game; their findings indicated that venting emotions reduced (excessive) punishment. Building on this line of research, Xu and Houser (2024) and others used the dictator game (DG) to study the effect of emotion expression on indirect reciprocity. Xu and Houser (2024) found that players who were treated unkindly behaved more generously towards others in subsequent interactions if they had the opportunity to convey their emotions to a third party. They suggested that opportunities to express emotion can break chains of negative reciprocity and promote generosity and social well-being even among those who have not themselves been previously well-treated. Li et al. (2020) found that, without message writing, the amounts subjects received in a previous stage were strongly correlated with their giving in the subsequent stage. However, if subjects had an opportunity to leave messages for their respective dictators, they gave more generously to unrelated parties, and previously received amounts and subsequent giving were no longer correlated. Strange et al. (2016) investigated two emotion regulation strategies, namely writing a message to the dictator and describing a neutral picture. They found that participants who successfully regulated their emotions by writing a message made higher allocations to a third person, suggesting that message writing as an emotion regulation strategy can interrupt the chain of unfairness.

People vary widely in how they respond emotionally to unfair treatment. Some react with intense anger, while others remain indifferent (Fehr and Fischbacher, 2004; Xu and Houser, 2024; Li et al., 2020; Strange et al., 2016; Bushman et al., 1999; Nils and Rimé, 2012). Allowing people to freely express their feelings might help some release their anger, but it may not affect those who are not initially angry. However, when people are asked to express their feelings in dissatisfied language, those who were not initially angry may become angrier. These individuals may originally have perceived the outcome as fair and thus felt no need for punishment. Yet being forced to articulate dissatisfaction can lead them to reinterpret the outcome as unfair, triggering negative emotions and increasing the likelihood of punishment behavior (Xu and Houser, 2024; Nils and Rimé, 2012; Bushman et al., 1999; Sanfey et al., 2003).

In previous studies, free-form message writing is the most common way of expressing emotions, and there is a general consensus that this type of emotion expression reduces individuals' punitive behavior (Xiao and Houser, 2005; Dickinson and Masclet, 2015; Xu and Houser, 2024; Li et al., 2020; Strange et al., 2016). However, does expressing emotions in dissatisfied language yield the same effect? While direct experimental evidence is limited, existing research suggests that the impact may differ. For example, Nils and Rimé (2012) found that venting negative emotions can lead individuals to reappraise a situation in the same negative way, ultimately reinforcing the original negative experience. Similarly, Bushman et al. (1999) demonstrated that participants who read articles supporting catharsis theory and were then angered showed a greater desire to hit a punching bag in subsequent activities, whereas those who read anti-catharsis articles did not exhibit such behavior.

In sum, since the UG and DG are the most widely used paradigms for studying emotion venting and punishment, we designed two experiments to explore whether GPT has emotion venting ability: third-party punishment based on the DG and second-party punishment based on the UG. Considering that different ways of expressing emotions may have different effects on punishment, each experiment included three conditions: a baseline condition, a self-expressed message condition, and a dissatisfied message condition. In the baseline condition, we asked GPT to participate in the classic DG or UG as a third or second party and to impose punishment. In the self-expressed message and dissatisfied message conditions, GPT was required to send a message to the dictator (or proposer) before making its punishment decision. The difference between these two conditions is that in the self-expressed message condition the message can contain anything GPT wants to say, while in the dissatisfied message condition GPT must express its anger in dissatisfied language.

Based on the existing literature, people are generally willing to incur costs to punish unfair behavior (Sanfey et al., 2003; Xiao and Houser, 2005; Fehr and Gächter, 2002; Pillutla and Murnighan, 1996). Studies have shown that GPT exhibits human-like abilities in many domains of individual and interpersonal decision-making (Chen et al., 2023; Mei et al., 2024; Yin et al., 2024; Goli and Singh, 2024) and demonstrates proficiency in emotion recognition (Yin et al., 2024; Zhao et al., 2022; Valor et al., 2022; Roberts et al., 2015; Martínez-Miranda and Aldea, 2005; Khare et al., 2023; Lian et al., 2024). Building on prior research on emotion venting and punishment, and in light of recent advances in GPT capabilities, we propose the following hypotheses:

H1: GPT will impose punishments for unfair outcomes in the baseline condition.

H2: GPT has the ability to vent emotions. Compared to the baseline condition, punishment will decrease in the self-expressed message condition and increase in the dissatisfied message condition.

H3: As GPT iterates, its emotion venting ability will increasingly resemble human behavior.

Methods

Experimental treatments

We instructed GPT to play as a third or second party in extended versions of the DG and UG and to punish dictators or proposers who propose unfair distribution schemes. There are two treatments in our experiment, each with three conditions, for a total of six sessions (SI Appendix, Table S1). To ensure that the models produced highly probable responses, to minimize variance across outputs, and to enhance the replicability of our results, each session consisted of 100 rounds with the temperature parameter set to 0 (Rathje et al., 2024; Hagendorff et al., 2023). The temperature parameter, provided as part of the input, controls the level of randomness in the model's responses. With this setting, the GPT output would not differ significantly if we ran our analysis a second time.

We used three specific model versions to examine time-series variation in the degree of behavioral biases: GPT-3.5 (2022), GPT-4 (2023) and GPT-4o (2024). Examining advanced versions of the model alongside an earlier version allows us to study how its behavioral biases change over time.
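To make the setup concrete, the following minimal sketch shows how one such session might be run against the OpenAI chat API. The prompt contents are passed in as `messages` (the verbatim prompts are given in the SI); the function name and loop structure are our own illustration, not the exact scripts used in the experiment.

```python
# Minimal sketch of one session: 100 independent rounds at temperature 0.
# Assumes the OPENAI_API_KEY environment variable is set; prompt wording
# is supplied by the caller (the verbatim prompts are in the SI).
from openai import OpenAI

client = OpenAI()

MODELS = ["gpt-3.5-turbo-0125", "gpt-4-0125-preview", "gpt-4o-2024-08-06"]

def run_session(model: str, messages: list[dict], rounds: int = 100) -> list[str]:
    """Send the same prompt `rounds` times and collect the replies."""
    replies = []
    for _ in range(rounds):
        resp = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0,  # minimizes randomness across outputs
        )
        replies.append(resp.choices[0].message.content)
    return replies
```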

The rules of the games

We study GPT's third-party punishment based on the DG and second-party punishment based on the UG. In the DG, there are three players: Player A (the dictator), Player B (the recipient), and Player C (the third-party punisher). Player A is endowed with $20 and can transfer any amount from $0 to $10 to Player B, who receives no endowment. Player C is given $10 as an endowment and has the option of punishing Player A after observing A's transfer to B. For each deduction dollar that Player C assigns to Player A, Player C's income decreases by $1, while Player A's income decreases by $3. Player C can assign between $0 and $10 in deduction dollars.

In the UG, two players have the opportunity to split $20. The first player, the “proposer,” suggests how to divide the money, and the second player, the “responder,” can either accept the offer, in which case the money is divided as proposed, or reject it, leaving both players with nothing.
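As a concreteness check on these rules, the payoff structure of both games can be written down directly. This is a sketch under our reading of the rules above; the variable names are ours, and the paper does not state whether Player A's income is floored at zero.

```python
def dg_payoffs(transfer: int, deduction: int) -> tuple[int, int, int]:
    """Extended DG with third-party punishment.

    transfer:  dollars Player A sends to Player B (0-10).
    deduction: deduction dollars Player C assigns to Player A (0-10).
    """
    assert 0 <= transfer <= 10 and 0 <= deduction <= 10
    a = 20 - transfer - 3 * deduction  # each deduction dollar costs A $3
    b = transfer
    c = 10 - deduction                 # and costs C $1 of C's $10 endowment
    return a, b, c

def ug_payoffs(offer: int, accept: bool) -> tuple[int, int]:
    """UG over $20: if the responder rejects, both players earn nothing."""
    assert 0 <= offer <= 20
    return (20 - offer, offer) if accept else (0, 0)
```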

The distributions and GPT's decisions

We use extended versions of the DG and UG that allow GPT, acting as a recipient or third-party observer, to punish the player who proposes an unfair distribution. In the control treatment, GPT players only make a punishment decision; in the experimental treatments, GPT players must express their emotions in a self-expressed or dissatisfied message before imposing punishment. By comparing the experimental treatments with the control treatment, we can examine GPT's emotion venting ability. In addition, the effects of venting emotions through different types of messages can be observed in GPT's punishment decisions.

We chose 11 distribution combinations for the DG: ($20, $0), ($19, $1), ($18, $2), ($17, $3), ($16, $4), ($15, $5), ($14, $6), ($13, $7), ($12, $8), ($11, $9) and ($10, $10), with the first number for the dictator and the second number for the recipient. For the UG, we chose five combinations: ($18, $2), ($16, $4), ($14, $6), ($12, $8) and ($10, $10), with the first number for the proposer and the second number for the responder. Three models accessed through the public OpenAI application programming interface (API) were used: gpt-3.5-turbo-0125, gpt-4-0125-preview and gpt-4o-2024-08-06. The framework for GPT is defined as follows: (1) the system role instructs GPT to act as a human decision maker, making decisions like a human (Chen et al., 2023; Goli and Singh, 2024), which defines GPT's behavioral style; (2) the assistant role implements the third-party or second-party punishment framework for dictators or proposers of unfair distributions in the DG or UG; (3) the user role asks GPT to make a specific decision (see SI for details).
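A sketch of how this three-role framework might be assembled for a single DG round is shown below. The prompt wording here is a hypothetical paraphrase for illustration, not the verbatim text, which is provided in the SI.

```python
# Prompt wording below is a hypothetical paraphrase; verbatim prompts are in the SI.
DG_SPLITS = [(20 - t, t) for t in range(11)]                 # (dictator keeps, recipient gets)
UG_SPLITS = [(18, 2), (16, 4), (14, 6), (12, 8), (10, 10)]   # (proposer, responder)

def build_dg_messages(keeps: int, gets: int, condition: str) -> list[dict]:
    """condition: 'baseline', 'self_expressed', or 'dissatisfied'."""
    system = "You are a human decision maker. Make decisions as a human would."
    assistant = (
        f"You are Player C with a $10 endowment. Player A had $20 and transferred "
        f"${gets} to Player B, keeping ${keeps}. Each deduction dollar you assign "
        f"costs you $1 and reduces Player A's income by $3."
    )
    if condition == "baseline":
        user = "How many deduction dollars (0-10) do you assign to Player A?"
    elif condition == "self_expressed":
        user = ("First write a message to Player A saying anything you want. "
                "Then state how many deduction dollars (0-10) you assign.")
    else:  # dissatisfied
        user = ("First write a message to Player A expressing your anger in "
                "dissatisfied language. Then state how many deduction dollars "
                "(0-10) you assign.")
    return [
        {"role": "system", "content": system},
        {"role": "assistant", "content": assistant},
        {"role": "user", "content": user},
    ]
```

A session then amounts to calling `run_session(model, build_dg_messages(...))` from the sketch above and parsing the stated deduction from each reply.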

Social norm elicitation method

Since the fairness motive is an important determinant of second-party and third-party punishment, it is necessary to examine the fairness norms of GPT to better understand its punitive behavior.

We used the social norm elicitation method of Bicchieri and Xiao (2009) to measure GPT's social norms. In a two-step procedure, this method first elicits non-incentivized reports of GPT's personal normative beliefs about what one ought to do in a given situation. GPT is then incentivized to indicate its empirical and normative expectations. Empirical expectations capture what GPT believes to be common behavior in the situation (i.e., what most others do). Normative expectations are second-order beliefs that describe GPT's beliefs about what others believe they ought to do (SI).
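Schematically, the two-step elicitation can be expressed as prompt templates of the following kind. The wording is our hypothetical paraphrase of Bicchieri and Xiao's items; the exact prompts are given in the SI.

```python
# Hypothetical paraphrases of the two-step elicitation (exact prompts in the SI).

# Step 1: non-incentivized personal normative belief.
STEP1_PERSONAL = ("In the dictator game just described, how much do you "
                  "personally believe Player A ought to transfer to Player B?")

# Step 2: incentivized empirical and normative expectations.
STEP2_EMPIRICAL = ("How much do you think most other dictators actually "
                   "transfer? You earn a bonus if your guess matches the "
                   "most common answer given by others.")
STEP2_NORMATIVE = ("How much do you think most others believe a dictator "
                   "ought to transfer? You earn a bonus if your guess "
                   "matches the most common answer given by others.")
```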

Data analysis

Data analysis was conducted using Stata 16.
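Although the analyses were run in Stata, the nonparametric test battery reported below (Kruskal–Wallis omnibus tests, pairwise Wilcoxon rank-sum tests with Bonferroni correction, and Pearson correlations) can be reproduced in Python. The sketch below uses dummy arrays in place of the session data.

```python
# Illustrative reproduction of the test battery (dummy data, not our sessions).
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, mannwhitneyu, pearsonr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
baseline, self_expr, dissatisfied = (rng.integers(0, 11, 100) for _ in range(3))

# Omnibus difference across the three conditions at one transfer level.
H, p_kw = kruskal(baseline, self_expr, dissatisfied)

# Pairwise Wilcoxon rank-sum (Mann-Whitney U) tests, Bonferroni-corrected.
raw_p = [mannwhitneyu(a, b).pvalue
         for a, b in combinations([baseline, self_expr, dissatisfied], 2)]
_, p_bonf, _, _ = multipletests(raw_p, method="bonferroni")

# Pearson correlation between two punishment profiles (e.g., GPT vs humans).
r, p_r = pearsonr(rng.random(6), rng.random(6))
```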

Results

GPT’s emotion venting and third-party punishment in DG

GPT’s emotion venting and third-party punishment intensity

GPT-3.5

Under the three conditions, GPT-3.5 implements third-party punishment for almost all outcomes, but its punishment behavior lacks regularity (Fig. 1A). There are large, irregular fluctuations in punishment expenditure as the recipient's share increases, especially in the baseline and dissatisfied message conditions. This indicates that although GPT-3.5 tries to vent emotions through punishment, its ability is rudimentary and still in an embryonic stage.

Fig. 1: Pattern of GPT players as third party’s punishment in DG.

Panel A shows the pattern of third-party punishment of GPT-3.5. Panel B shows the pattern of third-party punishment of GPT-4. Panel C shows the pattern of third-party punishment of GPT-4o. Punishment intensity is indexed by expenditure to sanction dictators. Expenditure to sanction dictators = third party's average expenditure to sanction dictators/third party's endowment; recipient's share = the amount transferred to the recipient/dictator's endowment. Error bars represent mean ± standard error (SE).

GPT-4

In contrast to GPT-3.5, GPT-4 exhibits orderly emotion venting behavior. In all conditions, GPT-4's third-party punishment decreases in proportion to the dictator's transfer when the recipient's share is <25% of the dictator's endowment (Fig. 1B). When GPT-4 players are given the opportunity to vent their emotions by sending messages, they display different patterns of third-party punishment. For the very unfair outcomes (i.e., the recipient's share is <10% of the dictator's endowment), there is a significant effect across the three conditions (Kruskal–Wallis test: ps < 0.001). After GPT-4 vents emotion through either a self-expressed or a dissatisfied message, third-party punishment increases relative to baseline (Wilcoxon rank-sum test with Bonferroni correction: ps < 0.001), and punishment levels for these transfers are similar between the self-expressed and dissatisfied message conditions (Wilcoxon rank-sum test with Bonferroni correction: ps = 0.951, 1.00).

For the relatively unfair outcomes (i.e., the recipient's share is between 10% and 20%), there are significantly distinct patterns of punishment across the three conditions (Kruskal–Wallis tests: ps < 0.001 for 10% and 15% shares, and p = 0.002 for the 20% share). In the self-expressed message condition, GPT-4's punishment decreases compared to the baseline condition (Wilcoxon rank-sum test with Bonferroni correction: ps < 0.001 for 10% and 15% shares, and p = 0.006 for the 20% share). In the dissatisfied message condition, GPT-4's punishment of the dictators does not decrease relative to baseline: it increases for the 10% transfer (Wilcoxon rank-sum test with Bonferroni correction: p < 0.001) and remains similar for the 15% and 20% transfers (Wilcoxon rank-sum test with Bonferroni correction: ps = 1.00). More importantly, GPT-4 enforces more punishment in the dissatisfied message condition than in the self-expressed message condition (Wilcoxon rank-sum test with Bonferroni correction: ps < 0.001 for 10% and 15% shares, and p = 0.001 for the 20% share).

For the relatively fair outcomes (i.e., the recipient's share is >25%), no significant differences in third-party punishment were found across the three conditions (Kruskal–Wallis: p = 0.368 for the 25% share; ps = 1.00 for shares greater than 25%; Wilcoxon rank-sum test with Bonferroni correction: ps = 1.00).

GPT-4o

Furthermore, GPT-4o exhibits emotion venting behavior similar to GPT-4's (Fig. 1C). However, GPT-4o's third-party punishment behavior differs from GPT-4's in several ways. First, for each outcome, GPT-4o's third-party punishment intensity is lower in the self-expressed message condition than in the baseline and dissatisfied message conditions. When the recipient's share is between 5% and 25%, punishment expenditures differ significantly (Kruskal–Wallis test: all ps < 0.001). Second, when the recipient's share is less than 20%, GPT-4o's third-party punishment intensity is lower in the baseline condition than in the dissatisfied message condition (for recipient's shares of 10% and 15%: Wilcoxon rank-sum test with Bonferroni correction: ps < 0.001), but this pattern reverses when the recipient's share is equal to or greater than 20% (for recipient's shares of 20% and 25%: Wilcoxon rank-sum test with Bonferroni correction: ps < 0.001).

GPT’s emotion venting and the percentage of third-party punishers

GPT-3.5

Figure 2A shows the percentage of third-party GPT-3.5 players who punish in DG. For most outcomes, in all three conditions either all GPT-3.5 players punish dictators or none do; as with punishment intensity, the percentage of punishers lacks regularity.

Fig. 2: Percentage of GPT players as third party who punish in DG.

Panel A shows the percentage of third-party GPT-3.5 players who punish in DG. Panel B shows the percentage of third-party GPT-4 players who punish in DG. Panel C shows the percentage of third-party GPT-4o players who punish in DG. Percentage of punishers = the number of third-party punishers/total players; recipient's share = the amount transferred to the recipient/dictator's endowment.

GPT-4

Figure 2B shows the percentage of third-party GPT-4 players who punish in DG. As with punishment intensity, whether and how venting emotions through sending messages affects punishment is closely related to the fairness of the distribution. For the very unfair outcomes, all GPT-4 players punish dictators in both the self-expressed and dissatisfied message conditions, although their punishment intensity differs. For the relatively unfair outcomes, the self-expressed and dissatisfied message conditions show different patterns of punishment relative to baseline. For the 10% transfer, all GPT-4 players punish the dictators in the baseline and dissatisfied message conditions, while no GPT-4 players punish the dictators in the self-expressed message condition. For the 15% and 20% transfers, a similar proportion of GPT-4 players punish the dictators in the baseline and dissatisfied message conditions (Wilcoxon rank-sum test with Bonferroni correction: ps = 1.00). In contrast, this proportion is significantly lower in the self-expressed message condition, where no GPT-4 players punish the dictator (for 15%: Wilcoxon rank-sum test with Bonferroni correction: ps < 0.001; for 20%: Wilcoxon rank-sum test with Bonferroni correction: ps = 0.006, 0.001). For the relatively fair outcomes, almost no GPT-4 players punish the dictator in any condition.

GPT-4o

Overall, the percentage of third-party GPT-4o punishers follows a trend similar to that of third-party GPT-4 punishers (Fig. 2C). However, GPT-4o differs from GPT-4 in three ways. First, in the baseline condition, the percentage of third-party GPT-4o punishers is greater than that of GPT-4 punishers when the recipient's share is 15%, 20%, or 25%. Second, the percentage of third-party GPT-4o punishers is greater in the baseline condition than in the dissatisfied message condition, whereas for GPT-4 this percentage is similar across the two conditions. Third, in the self-expressed message condition, the percentage of third-party GPT-4o punishers is the same as that of GPT-4, except when the recipient's share is 5%.

A person can release emotions in a variety of ways. If someone is prohibited from expressing negative emotions directly, they will find other ways to release them, such as intensifying their negative behavior. In this sense, punishment itself can be a relatively powerful way to release negative emotions: punishment is driven by strong negative emotion, and the stronger the negative emotion, the greater the punishment (Fehr and Fischbacher, 2004; Xu and Houser, 2024; Li et al., 2020; Strange et al., 2016; Waller, 1989; Elster, 1989).

In summary, punishment can be used to release the negative emotions caused by an unfair distribution. Whether and how emotion venting through sending messages affects third-party punishment is closely related to the degree of fairness of the distribution. For the very unfair distributions, expressing emotion through messages may induce more negative feelings, which in turn lead to stronger third-party punishment. For the relatively fair distributions, even though humans may regard them as unfair, GPT regards them as fair, so no matter how emotion is vented, it does not affect GPT's punishment.

Comparison between GPT’s and human third-party punishment in DG

Because third-party punishment is itself a way to vent negative emotions, we compare GPT's emotion venting behavior with that of humans to examine whether GPT's emotion venting is behaviorally similar to humans'. Because our third-party punishment experiment follows the design of Fehr and Fischbacher (2004, hereinafter "F-F"), we compare GPT's third-party punishment behavior to that reported by F-F. In addition, since GPT-3.5 shows essentially no systematic ability to vent emotion, we focus only on the comparison between GPT-4, GPT-4o, and human players.

GPT-4, GPT-4o and human players show similar patterns of punishment expenditure, i.e., the average expenditure decreases in proportion to the dictator's transfer (Fig. 3A). Although GPT-4's average punishment expenditure is higher than that of human players when the dictator's transfer is <10%, the pattern reverses when the dictator's transfer is greater than 10%; for GPT-4o, the crossover point is about 20%. Despite these differences, their punishment behavior shows significant similarities (Pearson correlation: GPT-4 vs F-F: r = 0.948, p = 0.004; GPT-4o vs F-F: r = 0.963, p = 0.002). More importantly, these results indicate that GPT-4o's third-party punishment behavior is closer to human behavior in F-F than GPT-4's is.

Fig. 3: Third-party punishment for GPT-4, GPT-4o and human players.

Panel A shows the intensity of third-party punishment of GPT-4, GPT-4o and human players. Panel B shows the percentage of third-party GPT-4, GPT-4o and human players who punish in DG. Punishment intensity is indexed by expenditure to sanction dictators. Expenditure to sanction dictators = third party's average expenditure to sanction dictators/third party's endowment; percentage of punishers = the number of third-party punishers/total players; recipient's share = the amount transferred to the recipient/dictator's endowment.

In terms of the percentage of punishers, for the very unfair outcomes (i.e., the dictator's transfer is 10% or less of the endowment), all GPT decision makers punish the dictators. For the relatively unfair outcomes (i.e., the dictator's transfer is between 10% and 30% of the endowment), the percentage of GPT-4 punishers decreases steadily, from 100% at the 10% transfer level to 9% at the 20% level and 1% at the 25% level, while the percentage of GPT-4o punishers decreases to 78% at the 20% level and 44% at the 25% level. At any transfer level above 25%, no GPT players punish the dictators. In contrast, in F-F's study, ~60% of all third parties punish the dictators at every transfer level below 50%, and only about 5% punish at the 50% transfer level (Fig. 3B). We performed a Pearson correlation test on the percentages of punishers. Although the correlation coefficients are not significant (Pearson correlation: GPT-4 vs F-F: r = 0.364, p = 0.478; GPT-4o vs F-F: r = 0.477, p = 0.339), the coefficient for GPT-4o vs F-F is higher than that for GPT-4 vs F-F. To some extent, this also indicates that GPT-4o has a more human-like emotion ability than GPT-4.

GPT’s emotion venting and punishment in UG

GPT’s emotion venting and rejection rate in UG

As with third-party punishment in the DG, GPT-3.5 exhibits punishment behavior in the UG, but that behavior lacks regularity (Fig. 4A).

Fig. 4: Percentage of GPT players as a second party who punish in UG.

Panel A shows the pattern of second-party punishment of GPT-3.5. Panel B shows the pattern of second-party punishment of GPT-4. Panel C shows the pattern of second-party punishment of GPT-4o. In each allocation pair, the first number represents the responder’s share and the second number represents the proposer’s share.

Compared to GPT-3.5, GPT-4 shows more orderly emotion venting behavior. We find that when GPT-4 vents emotion by sending a message to the proposer, punishment behavior increases significantly in the UG (Fig. 4B). When the proposer offers the responder 10% of the endowment, all such distribution schemes are rejected by GPT-4 responders in all conditions; when the proposer offers 20%, the rejection rate is 79% in the baseline condition and 100% in both the self-expressed and dissatisfied message conditions; when the proposer offers 30%, the rejection rate is 0% in the baseline condition, 100% in the self-expressed message condition and 29% in the dissatisfied message condition; when the proposer offers 40% or more, no such schemes are rejected by GPT-4 under any condition.

Overall, GPT-4o exhibits the same punishment behavior as GPT-4 in the baseline condition (Fig. 4C). In the self-expressed message condition, GPT-4o exhibits the same punishment behavior as GPT-4 except when the proposer offers 30% to the responder: GPT-4o rejects this offer 5% of the time, while GPT-4 rejects it 100% of the time. In the dissatisfied message condition, when the proposer offers 20% or less, the rejection rates of both GPT-4o and GPT-4 are 100%; when the proposer offers 30%, GPT-4o's rejection rate is 100%, while GPT-4's is 29%; when the proposer offers 40%, GPT-4o's rejection rate is 12%, while GPT-4 rejects none; when the proposer offers 50%, neither GPT-4 nor GPT-4o rejects any offers.

Comparison between GPT’s and human punishment in UG

Our data show that GPT players, as responders in the UG, punish proposers by rejecting their proposals to express dissatisfaction. However, it remains unknown whether GPT's second-party behavior is similar to that of humans. Since our UG experimental design follows that of Xiao and Houser (2005; hereinafter "X-H"), we compare our data with X-H's to investigate this question.

Our baseline condition corresponds to X-H's no emotion expression condition. For the 10% scheme (i.e., the proposer offers 10% of the endowment to the responder), GPT-4 and GPT-4o reject all such schemes, while human players reject only about 83% of them. For the 20% scheme, GPT-4 and GPT-4o reject 79% of such schemes, while human players reject about 50%. For the 40% scheme, GPT-4 and GPT-4o reject none of these schemes, while human players reject about 11% (Table 1). Despite these differences, GPT exhibits a trend similar to human players.

Table 1 Rejection rates in the baseline condition of UG.

Our self-expressed message condition corresponds to X-H's emotion expression condition. For the 10% scheme, GPT-4 and GPT-4o reject all such schemes, while human players reject only ~75% of them. For the 20% scheme, GPT-4 and GPT-4o reject 100% of such schemes, while human players reject ~20%. For the 40% scheme, both GPT-4 and GPT-4o reject none, while human players reject about 13% (Table 2).

Table 2 Rejection rates in the self-expressed message condition of UG.

Fairness norms and punishment of GPT

Although both GPT players and human players exhibit third-party punishment behavior, there are some significant differences. Comparing our data with F-F's, we find that GPT exhibits polarized third-party punishment behavior, with consistent punishment at very low transfer levels (10% or less) and consistent non-punishment at higher transfer levels (30% or more), whereas F-F's study shows a similar proportion of human punishers at each transfer level below 50%. Since the fairness motive is an important determinant of third-party punishment, it is necessary to examine GPT's fairness norms to explain its punitive behavior.

We used the social norm elicitation method of Bicchieri and Xiao (2009) to measure GPT's social norms. We find that all GPT-4 and GPT-4o players believe the social norm for a fair outcome to be a dictator transfer of 25% of the endowment.

Because GPT considers a transfer of 25% of the endowment to be the social norm in the DG, GPT players view any transfer of less than 25% as an unfair distribution that violates the social norm, and they punish this behavior. By contrast, a norm of 50-50 division has considerable power in a wide range of economic settings, both in the real world and in the lab. Even in settings where one party unilaterally determines the distribution of a prize (the dictator game), many subjects voluntarily cede exactly half to another individual (Andreoni and Bernheim, 2009). Thus, people typically consider a transfer of 50% to be the social norm, so they tend to impose punishment whenever a dictator's transfer is less than 50%.

For the second-party punishment in the UG, comparing our data with X-H's, we find that GPT's emotion venting behavior differs significantly from that of humans. Notably, this difference is smaller in the DG than in the UG, possibly because GPT holds a different fairness norm in the UG setting. Our fairness norm experiment finds that all GPT-4 and GPT-4o players hold a norm of 50-50 division in the UG. Although a norm of 50-50 division seems to be widely accepted, not everyone agrees with it, and people's fairness norms are heterogeneous. This difference in fairness norms may therefore lead to different patterns of punishment.

Discussion

We ask GPT to act as a second or third party making punishment decisions in the DG and UG to quantitatively investigate whether GPT has emotion venting ability. Our results show that observing unfair distributions can induce negative emotions in GPT, both when GPT participates in the games as a second party and as a third party. GPT can vent its negative emotions in several ways, including direct punishment and sending (self-expressed or dissatisfied) messages that express its feelings. Sending messages that vent emotions can replace, weaken, or strengthen the venting effect of punishment, depending on the specific situation. Moreover, GPT-3.5's emotion venting is chaotic and lacks regularity, whereas GPT-4's emotion venting behavior has become rational and orderly, showing a trend toward human-like behavior, and GPT-4o moves even further in this direction. We also find that the difference in fairness norms seems to explain why GPT's emotion venting behavior differs from that of humans.

Our study contributes to the ongoing discussion of the relationship between emotion venting and punishment behavior. In our DG experiment, for the unfair distributions, when GPT sends a self-expressed message to the dictator, the level of punishment is reduced relative to the baseline condition. This is consistent with the studies of Xiao and Houser (2005) and Dickinson and Masclet (2015), in which subjects freely expressed their opinions to other players and expressing dissatisfaction led to a reduction in punishment. Nyer (2000) likewise found that consumers who were encouraged to complain reported greater increases in satisfaction and product evaluation. In contrast, when GPT is forced to express its anger in dissatisfied language, its anger is intensified rather than reduced, leading to an increase in punishment. This is consistent with the findings of Bushman et al. (1999) and Bushman (2002), whose experiments examined whether reading cathartic information and hitting a punching bag were effective means of venting anger. The authors found that individuals were more aggressive after reading the cathartic messages and hitting the punching bag than the control group, in direct contradiction to catharsis theory.

Our study also contributes to the ongoing discussion of whether GPT has human-like decision-making abilities beyond language processing. The pioneering work of Mei et al. (2024) used a series of six games designed to illuminate various behavioral traits to test whether GPT is behaviorally similar to humans. They found that GPT exhibits signs of human-like complex behavior, such as learning and behavioral change from role-playing. Other studies have found that GPT exhibits high levels of economic rationality (Chen et al., 2023), emotion detection (Yin et al., 2024) and mind abilities (Mei et al., 2024). In contrast to these studies, our results suggest that although GPT can vent emotions in various ways, its emotion venting is not behaviorally similar to humans'. GPT exhibits more extreme behavior than humans. For example, in the DG, all GPT players punish the dictators for the very unfair distributions, implying that GPT consistently exhibits social preferences; for the relatively unfair distributions, the percentage of GPT punishers decreases steadily, and at any transfer level above 25% of the endowment, no GPT players punish the dictators. In contrast, in F-F's study, ~60% of all human third parties punish the dictators at any transfer level below 50%.

Furthermore, we find that GPT's fairness norm differs from that of humans, and this may be one reason why GPT's third-party punishment differs from humans'. Combining our data with data from the existing literature (Fehr and Fischbacher, 2004; Andreoni and Bernheim, 2009), GPT considers a dictator transfer of 25% of the endowment to be consistent with the fairness norm, while humans consider a 50% transfer to be the social norm. In addition, existing research has found that about 25% of human subjects have no social preferences and never punish others (Fehr and Fischbacher, 2004; Fischbacher et al., 2001; Fischbacher and Gächter, 2010), whereas we find that all GPT players have social preferences. This contrasts with the findings of Mei et al. (2024), who found that GPT shows human-like social preferences.

Our results show that GPT's emotion venting ability emerges later and does not behave similarly to humans, unlike its other abilities (Bubeck et al., 2023; Campello de Souza et al., 2023; Kung et al., 2023; Choi et al., 2021; Terwiesch, 2023; Kosinski, 2023; Chen et al., 2023; Mei et al., 2024; Yin et al., 2024; Goli and Singh, 2024; Rahwan et al., 2019; Hagendorff, 2024; Ye et al., 2023). One possible reason is the difference in the amount of available literature and in the consistency of its conclusions. The paradigms used in those studies tend to be widely used research paradigms, so the literature based on human subjects is extensive, and the conclusions of those classic studies are largely consistent. GPT can train its behavior patterns on this literature and thus mimic human behavior well. Our focus, by contrast, is GPT's ability to vent emotion. Although some studies have investigated the behavioral effects of emotion on punishment, only a few have investigated the behavioral effects of emotion venting on direct punishment (Xiao and Houser, 2005; Dickinson and Masclet, 2015; Xu and Houser, 2024; Li et al., 2020; Strange et al., 2016; Kersten and Greitemeyer, 2022; Bolle et al., 2014), and their results are somewhat mixed. GPT may therefore produce more errors, whether it imitates human emotion venting or generates emotion venting spontaneously, and its emotion venting behavior ends up dissimilar to that of humans. Previous studies have found that GPT and human behavior are remarkably similar in many domains beyond language processing (Chen et al., 2023; Mei et al., 2024; Yin et al., 2024), and they optimistically suggest that when GPT deviates from human behavior, the deviations are in a positive direction. Our study suggests that one should look not only at the optimistic side of GPT's deviations but also at their negative effects, i.e., GPT's looser fairness norm relative to humans.

Our work highlights the potential of large language models such as GPT to streamline experimental marketing research and generate new data and insights (Horton, 2023; Sudhir and Toubia, 2023; Sætra, 2023; Koc et al., 2023). As a state-of-the-art generative pre-trained transformer model, GPT can not only learn to imitate existing human behavior but also iteratively generate new human-like behavior, and these newly generated behaviors provide a useful benchmark for understanding natural human intelligence (Meng, 2024; Gui and Toubia, 2023). Our study is the first to design two situations that explore GPT's responses when asked to use different kinds of messages to vent its emotions; no studies in the existing literature include this type of emotion venting. By synthesizing findings from different domains, our work provides a unique perspective on the nature of emotion venting and expands the methods that can be used to study it.

Our finding that venting emotions through different types of messages can have varying effects provides a strong impetus for future research on the role of message expression in emotion management. It aligns with Nils and Rimé's (2012) finding that venting negative emotions can lead individuals to reappraise a situation in the same negative way, ultimately reinforcing the original negative experience, and with Bushman et al.'s (1999) demonstration that participants who read articles supporting catharsis theory and were then angered showed a greater desire to hit a punching bag, whereas those who read anti-catharsis articles did not. Adena and Huck (2022) found that small changes in wording can affect behavior: their crowdfunding experiments revealed that using the word "donation" resulted in higher revenue than "contribution," possibly because "donation" evokes more positive emotional responses, which are strongly associated with crowdfunding contributions. Christodoulides et al. (2021) also found that restricting consumers' freedom to express their frustration about a brand, by adding guidelines asking consumers to moderate their speech, can decrease the negativity in written complaints and lead to higher levels of consumer-brand forgiveness. Similarly, GPT exhibits sensitivity to the content of emotion venting, as indicated by changes in its subsequent behavior in response to different types of emotional expression.

Conclusion

Our study shows that GPT has the ability to vent emotions, and that this ability displays an evolutionary trend from chaos in GPT-3.5 to order in GPT-4 and GPT-4o. Although the emotion venting behavior of GPT-4 and GPT-4o has become rational and orderly, showing a trend toward human-like behavior, it is still not similar to that of humans. We find that the difference in fairness norms may provide a good explanation for this inconsistency.

While prior research has shown that GPT has exhibited excellence in various sophisticated tasks since the GPT-3.5 era or earlier, and that these abilities were already similar or superior to those of humans by the GPT-4 era, our study shows that GPT's emotion venting ability emerges later than these abilities. This suggests that GPT possesses an emotion venting ability that has traditionally been considered a uniquely human trait. We need to further probe its capabilities, limitations, and underlying mechanisms to better understand and utilize AI.