Children’s attribution of mental states to humans and social robots assessed with the Theory of Mind Scale

Goldman, Elizabeth J.; Baumann, Anna-Elisabeth; Pare, Laetitia; Beaudoin, Jenna; Poulin-Dubois, Diane

doi:10.1038/s41598-025-96229-7

Article
Open access
Published: 10 May 2025

Scientific Reports volume 15, Article number: 16357 (2025)

Abstract

The present work examined children’s attribution of psychological properties to inanimate agents in two experiments. In Study 1, an Interview Task and the Theory of Mind Scale (ToM Scale) were administered to 4-year-olds with either a human or a humanoid robot (NAO) protagonist. Parents also completed the Children’s Social Understanding Scale (CSUS) to assess children’s Theory of Mind skills. Overall, children performed similarly on the Interview and the ToM Scale. Theory of Mind skills (CSUS) did not predict performance on either task (ToM Scale or Interview). In Study 2, 5-year-olds were tested with figurines of different humanoid robots for the ToM Scale. Additionally, a Property Projection Task assessed biological, psychological, sensory, and artifact attributions to people, robots, animals, and artifacts. The results indicated that children attributed mental states similarly to the robots and the humans in the ToM Scale but did not anthropomorphize the robots in the Property Projection Task. In contrast to Study 1, the parental measure of children’s ToM skills (CSUS) predicted performance on the ToM Scale in the Human Condition. Overall, the present findings indicate that mentalizing is generalized to humanoid robots by preschoolers, particularly when child-friendly scenarios are used.

Introduction

Anthropomorphism is the attribution of human properties to non-human entities¹. In a first wave of developmental studies on anthropomorphism, children were often asked about the biological properties (e.g., grow, reproduce) of animals, artifacts, and plants^{2,3,4,5,6,7,8}. Much of this work was conducted through interviews in which children answered questions while presented with images of animals, artifacts, and plants.

Anthropomorphism of robots

As with plants, researchers have tested children on another ambiguous category: robots^{9,10,11,12,13,14,15}. Robots offer a unique way to test anthropomorphism because they are artifacts that can be designed to exhibit animate characteristics. Recent studies have used experimental tasks and detailed, structured questionnaires to test developmental changes in anthropomorphism. In two recent studies, children aged 3 and 5 attributed biological insides to animals and mechanical ones to artifacts. Notably, the younger children were unsure about the insides of robots, but the 5-year-olds knew that humanoid and non-humanoid robots had mechanical insides^{11,16}.

In addition to investigating the biological properties of animates and inanimates, researchers have also studied children’s anthropomorphism of mental properties (thinking, feeling)^{9,11,12,17,18}. For example, Saylor et al.¹⁵ found that 3- and 4-year-olds attributed properties of living things to an image of a girl more than to a camera or robot. However, only the 4-year-olds categorized the robot as a machine. These findings indicate that by 4 years, children understand that robots are machines that are different from people. A recent review by Goldman and Poulin-Dubois¹⁹ concludes that children anthropomorphize less as they age, but social robots appear to be an exception, with anthropomorphizing of robots occurring throughout childhood.

To better assess anthropomorphism, a team of Italian researchers has developed the standardized and widely used Attribution of Mental States Questionnaire (AMS-Q) to evaluate the attribution of sensory and mental states to various human and non-human agents^{9,12,18}. For example, in a recent study with the AMS-Q, Manzi et al.¹² asked 5-, 7-, and 9-year-olds about the mental states of two different robots (Robovie and NAO), which were depicted via visual images. The youngest children in the study, the 5-year-olds, attributed more mental states to both robots than the older age groups.

Jipson et al.¹⁷ also tested anthropomorphism by administering a questionnaire, the Property Projection Task, to 3- and 5-year-olds. Within the Property Projection Task, children responded to a series of interview questions about a robotic dog, a rodent, and a toy car to assess their understanding of these items across various domains (biological, psychological, sensory, artifact). Regardless of age, children tended to attribute biological properties more to the rodent than the robotic dog or toy car.

Beyond interviews

Although interviews offer an easy way to test anthropomorphism in children, they have methodological limitations: they cannot be used with the youngest children, and interviews with young children may yield a “yes bias.” Given these limitations, recent work has examined children’s ability to anthropomorphize with experimental tasks that provide context and better reflect children’s mentalizing in everyday life, increasing ecological validity. A review by van Straten et al.²⁰ found that younger children anthropomorphized robots more than older children.

In developmental science, the gold standard for measuring children’s ability to attribute mental states to others is the Theory of Mind Scale (ToM Scale) developed by Wellman and Liu (2004)^{21,22,23,24,25}. It consists of five vignettes that feature a human protagonist who is experiencing mental states ranging from simple desires to hidden emotions. To our knowledge, only one study has used the ToM Scale to assess anthropomorphism in robots. Zhang and colleagues²⁶ administered a change of location false belief task and an unexpected contents task, modified from the ToM Scale, which featured a humanoid robot (NAO) as the protagonist. The researchers found that children aged 5 to 7 attributed false beliefs to the robot.

The studies presented here aimed to examine whether 4- and 5-year-olds will attribute mental states to humanoid robots and whether this is comparable to how children attribute mental states to humans. Both studies were conducted over Zoom. Study 1 examined whether children attribute mental states to humanoid robots. Children responded to a series of interview questions; some were modified and adapted from the AMS-Q¹². Children also completed the ToM Scale²⁷ using human or robot figurines to illustrate the protagonist featured in the vignettes. To our knowledge, this is the first study to administer the complete ToM Scale with non-human protagonists. Finally, parents completed the Children’s Social Understanding Scale (CSUS)²⁸, which assessed young children’s Theory of Mind. As preschoolers have been shown to anthropomorphize, we predicted that children in both conditions would attribute mental states to the robot and human agent equally. We did not hypothesize that children in the Human Condition would perform differently than children assigned to the Robot Condition on the individual items of the ToM Scale. As direct tasks provide more context, we anticipated that children would anthropomorphize more on the ToM Scale than on the Interview Task.

A follow-up study was conducted to address some limitations of Study 1 and to examine mental state attribution in slightly older children. In Study 2, 5-year-olds were administered the ToM Scale, which featured humanoid robot figurines that varied in morphology. To better understand children’s willingness to anthropomorphize and attribute mental states, children were asked about various animate and inanimate items via an adapted version of Jipson et al.’s¹⁷ Property Projection Task. The CSUS was also administered in Study 2. As in Study 1, we expected children to attribute mental states equally to the human and robot protagonists on the ToM Scale. We did not predict differences between the human and humanoid robot groups on the items of the ToM Scale. Based on prior work, we hypothesized that children would judge animacy accurately for items they were familiar with (e.g., humans, animals, and artifacts) but be uncertain about the animacy of robots when responding to interview questions.

Study 1

Method

Participants

Participants included 102 children who were four years of age (mean age = 54 months, 16 days; 52 male). The children were recruited from a university participant pool and through social media advertisements. The sample consisted mainly of children of Caucasian (42%) or Asian (35%) descent, and the remainder of the sample was of mixed ethnicity (20%) or did not provide their ethnicity (3%). A majority of children in the sample were from high (more than $100,000; n = 64) or middle (more than $50,000 but less than $100,000; n = 22) socioeconomic status (SES) families. The remainder of the sample were from low-SES (family income less than $50,000) families (n = 9) or opted not to report family SES on the demographic form (n = 7). The sample consisted of Canadian and American children tested in French (n = 8) or English (n = 94).

Power was calculated for the General Linear Model using G*Power 3.1. The required sample size was estimated using the linear multiple regression: fixed model, R² deviation from zero analysis. A medium effect size (f² = 0.15) was assumed, the alpha error probability was set at 0.05, power at 0.80, and the number of predictors at 5 (Condition, Age, SES, CSUS, and Robot Exposure). This analysis requires a total sample size of 92. For the mixed-effects model (MEM), power was calculated the same way, with one additional variable capturing variation between the different interview subscales, for a total of 6 predictors; the required sample size for the MEM is 98. Our sample of 102 children is therefore sufficient for both analyses.
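The sample-size calculation described above can be approximated without G*Power. The sketch below is not the authors' procedure: it is a minimal standard-library illustration that assumes the usual convention for this test (noncentrality parameter λ = f² × N, numerator degrees of freedom equal to the number of predictors, denominator degrees of freedom N − predictors − 1). All function names are hypothetical.

```python
import math


def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    c, h = 1.0, d
    for m in range(1, max_iter + 1):
        even = m * (b - m) * x / ((a - 1.0 + 2 * m) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 1.0 + 2 * m))
        for term in (even, odd):
            d = 1.0 + term * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + term / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_cdf(x, d1, d2, nc=0.0):
    """CDF of the (non)central F distribution as a Poisson mixture of beta CDFs."""
    y = d1 * x / (d1 * x + d2)
    half = nc / 2.0
    weight = math.exp(-half)  # Poisson weight for j = 0
    total = 0.0
    for j in range(1000):
        total += weight * reg_inc_beta(d1 / 2.0 + j, d2 / 2.0, y)
        weight *= half / (j + 1)
        if j > half and weight < 1e-16:
            break
    return total


def f_critical(alpha, d1, d2):
    """Upper-alpha critical value of the central F distribution, by bisection."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, d1, d2) < 1.0 - alpha:
        hi *= 2.0
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if f_cdf(mid, d1, d2) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def required_n(f2, predictors, alpha=0.05, target_power=0.80):
    """Smallest total N whose power reaches the target (assumes lambda = f2 * N)."""
    n = predictors + 2
    while True:
        d1, d2 = predictors, n - predictors - 1
        power = 1.0 - f_cdf(f_critical(alpha, d1, d2), d1, d2, nc=f2 * n)
        if power >= target_power:
            return n
        n += 1


print(required_n(0.15, 5), required_n(0.15, 6))
```

Under these assumptions, the two calls correspond to the 5-predictor General Linear Model and the 6-predictor mixed-effects model, and can be compared against the sample sizes reported in the text.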

An additional four participants were tested but excluded due to experimenter error (n = 1), parental interference (n = 1), and distractedness/failure to complete the study (n = 2). We examined the Robot Exposure Questionnaire to determine whether children had regular exposure to or interactions with robots. According to the questionnaire, 16% of parents reported that their children watched a television show or movie that features robots, and 12% reported having a robot at home. Only 5% said their child had regular interactions with a robot, and none of these robots were similar to the one used in the study. Therefore, no participants were excluded due to their familiarity with robots. The study was approved by the University’s Human Research Ethics Committee (certificate of ethical acceptability #10000548) and was conducted in accordance with the guidelines and regulations outlined by the Ethics Committee.

Measures

Robot Exposure Questionnaire

Parents completed a short questionnaire about their child’s exposure to and experiences with robots. Questions included (1) whether the family possessed a robot at home, (2) whether the child plays with a robot at home, (3) whether the child interacts with a robot frequently, and (4) how often children play video games or watch TV/movies that feature robot characters. All questions used a forced-choice yes/no/unsure response format.

Wellman & Liu Scale

To measure Theory of Mind skills, the well-validated ToM Scale²⁷ was administered. It consists of five items of increasing difficulty (Diverse Desires, Diverse Beliefs, Knowledge Access, Contents False Belief, and Hidden Emotion). Figurines (Human and Robot) and props are used to illustrate the story and characters in each of the tasks; see Table 1 and Fig. 1.

Table 1 Brief descriptions of the items in the Theory of Mind Scale.

Following the standard procedure, the five items of the ToM Scale were always administered in the same order (Diverse Desires, Diverse Beliefs, Knowledge Access, Contents False Belief, Hidden Emotion), with the items presented in order of increasing difficulty, as described in the original study²⁷. All robot and human figurines were assigned a name (e.g., Dash, Sam). For a full list of proper names used, see Appendix A.

Interview Task

The other task was an interview consisting of 14 questions (see Appendix B for a full list of questions). Some questions were modified from Manzi and colleagues’¹² Attribution of Mental States Questionnaire (AMS-Q), a validated measure originally administered in Italian. For the present study, the interview was conducted in either English or French as these were the official languages of the population tested. Two subsets of items were administered from the AMS-Q: (1) Epistemic (e.g., “Do you think this robot/person can learn?”) and (2) Intentions and Desires (e.g., “Do you think this robot/person can make a wish?”). A third subset was created by the experimenters (False Beliefs, e.g., “Can this robot/person believe something that is incorrect?”). The subscales used and created were selected because they aligned with the constructs measured on the Wellman and Liu Scale. There were five questions for the Epistemic subset, five for the Intentions and Desires subset, and four for the False Beliefs subset. Depending upon the assigned condition, children were presented with a picture of either a human-looking robot (NAO) or an adult human matched to the participant’s gender, and were asked the interview questions about the robot or the human shown in the picture (see Fig. 2). The questions used a forced-choice format, so children could respond “yes” or “no” to each interview question. If the child responded “maybe” or failed to provide a clear answer, the experimenter re-prompted them and asked them to make their best guess. The order in which the three subsets of interview questions (Epistemic, Intentions and Desires, and False Beliefs) were asked was randomized across participants.

Children’s Social Understanding Scale (CSUS)

Parents completed the Children’s Social Understanding Scale (CSUS), a parent-report measure of ToM developed by Tahiroglu et al.²⁸. This parental report measure was used to measure a child’s ToM in a non-experimental context. Additionally, it allowed for the examination of whether parental perceptions of their child’s ToM aligned with children’s ToM skills, as tested experimentally. The questionnaire is comprised of six subscales: Beliefs (e.g., beliefs can differ about the same situation); Knowledge (e.g., people can have different levels of knowledge); Perception (e.g., reality and appearances are not always the same); Desire (e.g., people can desire different things); Intention (e.g., same intentions may have different outcomes); and Emotion (e.g., people may feel differently about the same situation). There are seven items per subscale for a total of 42 statements. Parents rated their child’s ability for each item on a Likert scale ranging from 1 (definitely untrue for my child) to 4 (definitely true for my child). Parents also had the option to respond “don’t know” if they lacked insight into their child’s behavior on a particular item. Parents received the questionnaire by email and filled it out before or after the testing session. Either the long form of the CSUS or the French adaptation, l’Échelle de compréhension sociale des enfants (ÉCSE;²⁹), was administered depending on the parent’s dominant language.

Materials

Distinct figurines (Human or Robot) were used for each of the five ToM Scale items. The figurines were approximately 5 inches tall and 2 inches wide. All the robot figurines varied in color, were 3D printed, and depicted the robot NAO. In addition to the figurines, printed images (see Table 1) were used as props in the vignettes (e.g., cookie, carrot, garage, bushes). Other materials included a 5 × 5 inch box with a toy dog used for the Knowledge Access item and a Crayola crayons box with a toy pig used for the Contents False Belief item. While the interview questions were asked, participants were shown an image of either a man or woman (gender-matched to the participant, Human Condition) or the robot NAO (Robot Condition).

Procedure

Prior to the Zoom session, parents completed a consent form, the CSUS, and a demographic form. Parents also completed a Robot Exposure Questionnaire to gauge whether children had regular interactions with robots or frequently watched television or movies that featured robot characters. The Zoom session began with a PowerPoint presentation during which the experimenter introduced the study and requested verbal informed consent from the parent and verbal assent from the child. Before the first task was administered, the experimenter verified that the participant’s environment was free of distractions and confirmed that the child could be seen on camera and heard on the microphone. Parents were asked to join the Zoom session on a tablet or computer with a minimum screen size of 8 inches so that the child could clearly see the stimuli.

Participants were randomly assigned to one of two conditions (Human or Robot). In the Human Condition, children were presented with different human figurines or props for each task. In the Robot Condition, the child was presented with figurines and props of a humanoid robot. The order of the two tasks (ToM Scale and Interview) was counterbalanced. After both tasks were completed, parents were debriefed on the study’s goals and invited to ask any questions.

Scoring

Children received a passing or failing score for each ToM Scale item. Therefore, scores on this task ranged from 0 to 5. For the interview, the child received a point for each “yes” answer, with the total score ranging from 0 to 14. Each child received a mean total score for the CSUS (out of 4). For the Robot Exposure Questionnaire, any data cells where parents had answered “unsure” were left blank. Otherwise, children received scores based on their robot exposure, as reported by the parent. For each question, “yes” was coded as 1 and “no” as 0. The scores for all 4 questions were added together to create an overall Robot Exposure Score per child, with higher scores indicating more familiarity and experience with robots. To use socioeconomic status as a variable, SES was split into 3 categories, with families making less than $50,000 classified as low SES, $50,000 to $100,000 classified as mid, and over $100,000 classified as high.
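The scoring rules above are simple enough to write down directly. The following is a minimal sketch rather than the authors' scoring script: the function names are hypothetical, and treating an “unsure” (blank) answer as contributing nothing to the sum is an assumption, since the text only says such cells were left blank.

```python
from typing import Optional, Sequence


def robot_exposure_score(answers: Sequence[str]) -> int:
    """Sum the four questionnaire answers: 'yes' = 1, 'no' = 0.

    Assumption: 'unsure' answers (left blank in the data) are skipped.
    """
    points = {"yes": 1, "no": 0}
    return sum(points[a] for a in answers if a in points)


def ses_category(family_income: Optional[float]) -> Optional[str]:
    """Map family income in dollars to the three SES categories in the text."""
    if family_income is None:  # income not reported
        return None
    if family_income < 50_000:
        return "low"
    if family_income <= 100_000:
        return "mid"
    return "high"


def interview_score(answers: Sequence[str]) -> int:
    """One point per 'yes' answer; the total ranges from 0 to 14."""
    return sum(1 for a in answers if a == "yes")


def as_proportion(raw_score: float, max_score: float) -> float:
    """Convert a raw score to a proportion, as done for cross-task analyses."""
    return raw_score / max_score


print(robot_exposure_score(["yes", "no", "unsure", "yes"]))
print(ses_category(75_000), as_proportion(3, 5))
```

Expressing both tasks as proportions (ToM Scale out of 5, Interview out of 14) puts them on a common 0 to 1 range for comparison.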

Results and Discussion

Data Analysis

The data for each task (ToM Scale, Interview) and the CSUS was analyzed independently. Then, correlations were run to determine if performance on one task was correlated with the CSUS. For cross-task analyses, raw scores were turned into proportions. Unless otherwise specified, statistical analyses were performed using JASP 0.18.3³⁰.

Theory of Mind Scale

Children in the Human Condition (M = 2.92, SD = 0.87; t(50) = 3.47, p = 0.001, d = 0.49) performed as expected for their age on the ToM Scale. Children in the Robot Condition (M = 2.67, SD = 0.95; t(50) = 1.25, p = 0.22, d = 0.18) had a slightly lower score. However, children assigned to the Human Condition (n = 51) performed statistically similarly on the ToM Scale compared to children assigned to the Robot Condition (n = 51), as shown by a Generalized Linear Model using the glm function run in R Version 4.3.3³¹ (χ²(1) = −0.24, p = 0.24, overall R² = 0.02). A number of factors were entered into the GLM (Age in months, Family SES, Child Robot Exposure, and the CSUS score) but none reached significance. Next, each item of the ToM Scale was analyzed independently to determine the success rate on each item and whether the children in one condition outperformed the other on that item.
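The per-condition t statistics and effect sizes above can be roughly recovered from the summary statistics alone. The sketch below assumes, as an inference not stated in the text, that each condition mean was compared with the scale midpoint of 2.5 (half of the 5 items) in a one-sample t-test; the function name is hypothetical. Because the published means and standard deviations are rounded, the recomputed values only approximate the reported ones.

```python
import math


def one_sample_from_summary(mean, sd, n, test_value):
    """Cohen's d, t statistic, and df for a one-sample t-test from summary stats."""
    d = (mean - test_value) / sd   # standardized distance from the test value
    t = d * math.sqrt(n)           # same as (mean - test_value) / (sd / sqrt(n))
    return d, t, n - 1


# Assumption: the comparison value is 2.5, the midpoint of the 0-5 scale.
for label, mean, sd in [("Human", 2.92, 0.87), ("Robot", 2.67, 0.95)]:
    d, t, df = one_sample_from_summary(mean, sd, n=51, test_value=2.5)
    print(f"{label}: d = {d:.2f}, t({df}) = {t:.2f}")
```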

As shown in Table 2, binomial tests revealed that children in the Human Condition performed above chance level on the Diverse Desires and the Diverse Beliefs items. Children assigned to the Human Condition were at chance level for the Knowledge Access item and below chance level for the Contents False Belief and the Hidden Emotion items. Regarding the Robot Condition, binomial tests showed that children performed at chance level on the Diverse Desires, Knowledge Access, and the Hidden Emotion items but performed above chance level on the Diverse Beliefs item and below chance level on the Contents False Belief item.
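For readers who want to see how an item-level comparison with chance works, here is a minimal sketch of an exact two-sided binomial test against a chance level of 0.5. The pass counts are hypothetical placeholders, not values from Table 2, and the function names are illustrative; the statistical package used by the authors may define the two-sided p-value slightly differently.

```python
import math


def binom_two_sided_p(successes, n, p0=0.5):
    """Exact two-sided binomial p-value: sum of all outcomes no more likely
    than the observed one under the null probability p0."""
    pmf = [math.comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = pmf[successes]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


def classify_against_chance(successes, n, alpha=0.05):
    """Label performance as above, below, or at chance (0.5)."""
    if binom_two_sided_p(successes, n) >= alpha:
        return "at chance"
    return "above chance" if successes / n > 0.5 else "below chance"


# Hypothetical counts out of 51 children per condition (not from the paper).
for passed in (45, 27, 12):
    print(passed, classify_against_chance(passed, 51))
```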

Table 2 Percentage of children who passed each item on the Theory of Mind Scale in each condition.

Children in the Human Condition outperformed children in the Robot Condition for the Diverse Desires item (χ²(1, N = 102) = 16.22, p < 0.001). In contrast, children in the Robot Condition outperformed children in the Human Condition for the Contents False Belief item (χ²(1, N = 102) = 4.29, p = 0.038). Importantly, children in both conditions performed below chance for the Contents False Belief item; thus, this difference is not considered meaningful. Despite their poor performance on the Contents False Belief item, when asked, “What do you think is inside the box of crayons?” a majority of children, across conditions, correctly responded with crayons (proportion = 0.86, p < 0.001). Thus, in both conditions, children knew what the box should contain but failed to recognize that the agent would also expect the box to contain crayons. Children’s performance did not differ based on condition for the Diverse Beliefs, Knowledge Access, and Hidden Emotion items (see Table 2 and Fig. 3).
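The between-condition comparisons on single items are 2 × 2 tests of independence (condition by pass/fail). The sketch below shows an uncorrected Pearson chi-square with one degree of freedom; whether the authors applied a continuity correction is not stated, so that choice is an assumption, and the counts are hypothetical rather than taken from Table 2.

```python
import math


def chi_square_2x2(pass_a, fail_a, pass_b, fail_b):
    """Uncorrected Pearson chi-square for a 2x2 table, with its p-value (df = 1)."""
    a, b, c, d = pass_a, fail_a, pass_b, fail_b
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For df = 1, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Hypothetical pass/fail counts for two conditions of 51 children each.
chi2, p = chi_square_2x2(pass_a=44, fail_a=7, pass_b=25, fail_b=26)
print(f"chi2(1, N = 102) = {chi2:.2f}, p = {p:.4f}")
```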

Interview Task

Children in the Human Condition (M = 9.26, SD = 3.24; t(50) = 4.97, p < 0.001, d = 0.70) and the Robot Condition (M = 8.14, SD = 3.75; t(50) = 2.16, p = 0.035, d = 0.30) performed well on the Interview. We ran a linear mixed-effects model, fit by REML, testing the effects of condition on children’s Interview performance in Jamovi version 2.3³². Interview Question Type (Epistemic, Intentions and Desires, False Beliefs), SES, and Robot Exposure were entered as factors into the model while Age in months and the CSUS score were entered as covariates. The model was clustered by participant. To capture random effects, an intercept for participant was added. Coding for all factors was simple. Condition did not affect children’s interview responses, with children performing similarly in the Human and Robot conditions; b = −1.85 [−9.85, 6.15], SE = 4.08, p = 0.65. The overall conditional R² was 0.52, with just over half of the variance explained by the model. However, none of the factors or covariates reached significance except for the difference between the False Belief and Epistemic interview questions (b = −0.93 [−1.27, −0.60], SE = 0.17, p_bonferroni < 0.001). Post-hoc tests reveal that children performed better for the Epistemic (t(178) = 5.51, p_bonferroni < 0.001), and Intentions and Desires (t(178) = −7.35, p < 0.001) subsets when compared to the False Belief subset.

When each set of questions was considered, children in the Human Condition performed above chance on both the Epistemic and Intentions and Desires subsets. However, children performed at chance for the False Belief subset. For the Robot Condition, children performed above chance on the Intentions and Desires subset of questions but at chance for the Epistemic and False Belief subsets (see Table 3). When comparing performance between the groups for each interview subset, we found that children assigned to the Human Condition outperformed children in the Robot Condition on the Epistemic subset. There was no difference in performance on the other subsets; see Table 3.

Table 3 Mean scores on the Interview (total scores and subsets) in each condition.

Children’s Social Understanding Scale (CSUS)

Two parents did not complete the CSUS²⁸, leaving a final sample of 100. Participants’ total CSUS score was calculated (M = 3.18, SD = 0.39). Overall, parents’ ratings of their children’s Theory of Mind were in line with those reported in prior research at a similar age^{28,29}.

Exploratory Inter-tasks Comparison

With a sample combining all children, the total score on the ToM Scale and the total score on the Interview were not correlated (r(100) = −0.15, p = 0.15). Split by condition, the Human Condition tasks (ToM Scale, Interview) were not significantly correlated (r(49) = 0.03, p = 0.84). There was a significant negative correlation, however, between the tasks for the Robot Condition (r(49) = −0.33, p = 0.019), with better performance on the ToM Scale predicting worse performance on the Interview.
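To make the link between a correlation coefficient and its significance test concrete, the sketch below computes a Pearson r from paired scores and converts an r with its degrees of freedom into the equivalent t statistic, t = r · sqrt(df / (1 − r²)). The paired proportions are hypothetical and the function names are illustrative; only the r and df passed to the final call come from the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_to_t(r, df):
    """t statistic equivalent to a correlation r with df = n - 2."""
    return r * math.sqrt(df / (1.0 - r * r))


# Hypothetical proportions correct on the two tasks for five children.
tom = [0.4, 0.6, 0.6, 0.8, 1.0]
interview = [0.9, 0.7, 0.8, 0.5, 0.4]
print(round(pearson_r(tom, interview), 2))

# Robot Condition correlation reported in the text: r(49) = -0.33.
print(round(r_to_t(-0.33, 49), 2))
```

The resulting t can be compared with the two-tailed critical value for 49 degrees of freedom (about 2.01 at α = 0.05) to judge significance.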

Correlations using the pooled sample of children from both conditions revealed that there were no significant correlations between the ToM Scale and the CSUS (all r(100) < 0.14, p > 0.17). Nor were any significant correlations found between the Interview scores and the CSUS (all r(100) < 0.14, p > 0.17). Correlations within each condition run separately revealed the same null results.

In this first experiment, children were tested on their attribution of a Theory of Mind to robots versus humans using a direct and interactive task (ToM Scale) and an indirect task, an Interview. Children performed similarly on both tasks across conditions, attributing an equal number of mental states to the robot and the human. In line with the scalability of these items, all children, regardless of condition, performed better on the first three items than on the more difficult ones^{21,25}. Overall, children’s performance in the Human Condition mirrors prior work, whereas the performance of those assigned to the Robot Condition indicates less mentalizing on some items.

One unexpected difference in performance between the conditions was found in the Diverse Desires item of the ToM Scale. Children passed this item at much higher rates in the Human Condition than in the Robot Condition. One explanation for this difference is that children understood that the robot could not eat and responded with their personal snack preference. Prior work supports this interpretation: 4- and 5-year-olds have stated that a robotic dog lacks biological attributes, including the ability to eat³³.

The performance on the Interview mirrored that of the ToM Scale and previous research using interviews^{12,34}. Regarding the three subsets of questions, children attributed Intentions and Desires as well as False Beliefs to robots at similar rates to humans. However, for the Epistemic subset, children attributed more mental states to the human than the robot. Perhaps children believe humans are more sentient than robots when asked about complex mental states (e.g., teaching, learning) as opposed to being wrong (False Belief subset). Additionally, much of the work using interviews and the AMS-Q measure specifically has been conducted with children older than our sample¹², which could explain the difference in performance on the Epistemic subset.

A limitation of the ToM Scale is that for the Robot Condition, the robots were given proper names. Using a proper name for the robot protagonists featured in the vignettes may have led children to anthropomorphize the robot. Another limitation was that the robot figurines used in the Robot Condition, despite being distinct colors, were replicas of the same robot, NAO. Study 2 addressed these limitations.

Study 2

As Theory of Mind is further developed by age 5, Study 2 aimed to investigate mental state attributions to robots in slightly older children. Although our sample in Study 2 was marginally older, we hypothesized that children would anthropomorphize robots at a rate equivalent to the 4-year-olds tested in Study 1. In Study 2, different humanoid robots that varied in appearance were used for each ToM Scale item. This methodological change was important, as some extant literature has highlighted how morphology plays a role in children’s anthropomorphism^{12,19} and that the robots children interact with in everyday life vary in appearance. Additionally, this made the Robot Condition more similar to the Human Condition, where visibly distinct human figurines that varied in morphology had been used in Study 1. As the parameters between the two conditions were now more equivalent, we expected no difference in ToM Scale performance between the Human and Robot Conditions. Also, a deviation from the ToM Scale’s original procedure was removed, as no proper names were assigned to the robots or the humans; instead, the experimenter labeled them as “a robot” or “a person.” This change was made in order to avoid a bias toward anthropomorphism.

A second false belief item, Location False Belief, was added to the ToM Scale to better assess children’s false belief reasoning. As in Study 1, we expected children to anthropomorphize the robot and did not predict a difference between the conditions on the individual items of the ToM Scale. As interviews are widely used in the literature, and to replicate the results obtained with the Interview Task, children completed the Property Projection Task. This interview assessed animacy attribution across various domains. The Property Projection Task allowed us to interview children about various items instead of just the human or the robot in the Interview task administered in Study 1. It was hypothesized that performance on the ToM Scale would match that reported in Study 1. Additionally, we predicted that children in the Property Projection Task would be unsure about the animacy of robots, as robots are unfamiliar to most children. It was expected that children, regardless of condition, would have higher scores on the ToM Scale (i.e., direct measure) than the Property Projection Task (i.e., indirect measure), as the ToM Scale provides additional context and is more interactive. As in Study 1, parents completed the CSUS.