
Goals as reward-producing programs

A preprint version of the article is available at arXiv.

Abstract

People are remarkably capable of generating their own goals, beginning with child’s play and continuing into adulthood. Despite considerable empirical and computational work on goals and goal-oriented behaviour, models are still far from capturing the richness of everyday human goals. Here we bridge this gap by collecting a dataset of human-generated playful goals (in the form of scorable, single-player games), modelling them as reward-producing programs and generating novel human-like goals through program synthesis. Reward-producing programs capture the rich semantics of goals through symbolic operations that compose, add temporal constraints and allow program execution on behavioural traces to evaluate progress. To build a generative model of goals, we learn a fitness function over the infinite set of possible goal programs and sample novel goals with a quality-diversity algorithm. Human evaluators found that model-generated goals, when sampled from partitions of program space occupied by human examples, were indistinguishable from human-created games. We also discovered that our model’s internal fitness scores predict games that are evaluated as more fun to play and more human-like.
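To make the idea of a reward-producing program concrete, here is a minimal sketch in Python (not the paper's DSL): a goal is represented as a program that is executed over a behavioural trace and returns a score. The predicate names, the `once_then` combinator and the toy trace below are hypothetical illustrations, not part of the released code.

```python
# Illustrative sketch only: a "goal" is a program that consumes a behavioural
# trace (a sequence of state dicts) and returns a score. All names are hypothetical.
from typing import Callable, Dict, List

State = Dict[str, bool]          # e.g. {"ball_in_air": True, "ball_in_bin": False}
Trace = List[State]
Goal = Callable[[Trace], int]    # a reward-producing program: trace -> score


def once_then(first: str, then: str) -> Goal:
    """Score 1 point each time `first` holds in some state and `then` holds later."""
    def score(trace: Trace) -> int:
        points, armed = 0, False
        for state in trace:
            if state.get(first):
                armed = True
            elif armed and state.get(then):
                points += 1
                armed = False
        return points
    return score


# Example goal: "throw the ball so that it ends up in the bin".
throw_ball_into_bin = once_then("ball_in_air", "ball_in_bin")

toy_trace = [
    {"ball_in_air": True},
    {"ball_in_air": False, "ball_in_bin": False},  # missed throw
    {"ball_in_air": True},
    {"ball_in_bin": True},                         # successful throw
]
print(throw_ball_into_bin(toy_trace))  # -> 1
```

Executing such a program on a trace yields a score, which is what lets the same representation express a goal, track progress towards it and define success.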


Fig. 1: Goals as reward-producing programs.
Fig. 2: Participants in our behavioural experiment create diverse games reflecting common sense and compositionality.
Fig. 3: GPG model.
Fig. 4: GPG model produces simple, coherent, human-like games.
Fig. 5: GPG model produces interesting, novel goals.

Data availability

All data for our study, including raw participant responses in the behavioural experiment, their translations to programs in our DSL and the specification for the DSL, are available via GitHub at https://github.com/guydav/goals-as-reward-producing-programs/ or via Zenodo at https://doi.org/10.5281/zenodo.14238893 (ref. 97).

Code availability

All code for our study, including code used to analyse and generate figures for our behavioural experiment, and the full implementation of our GPG model, are available via GitHub at https://github.com/guydav/goals-as-reward-producing-programs/ or via Zenodo at https://doi.org/10.5281/zenodo.14238893 (ref. 97). Our behavioural data collection experiment is publicly accessible at https://game-generation-public.web.app/. Code for the behavioural experiment is available via GitHub at https://github.com/guydav/game-creation-behavioral-experiment. Our human evaluation experiment is publicly accessible at https://exps.gureckislab.org/e/expert-caring-chemical/#/welcome. Code for the human evaluation experiment is available via GitHub at https://github.com/guydav/game-fitness-judgements.

References

  1. Dweck, C. S. Article commentary: the study of goals in psychology. Psychol. Sci. 3, 165–167 (1992).

  2. Austin, J. T. & Vancouver, J. B. Goal constructs in psychology: structure, process, and content. Psychol. Bull. 120, 338–375 (1996).

  3. Elliot, A. J. & Fryer, J. W. in Handbook of Motivation Science Vol. 638 (ed. Shah, J. Y.) 235–250 (The Guilford Press, 2008).

  4. Hyland, M. E. Motivational control theory: an integrative framework. J. Pers. Soc. Psychol. 55, 642–651 (1988).

  5. Eccles, J. S. & Wigfield, A. Motivational beliefs, values, and goals. Annu. Rev. Psychol. 53, 109–132 (2002).

  6. Brown, L. V. Psychology of Motivation (Nova Science Publishers, 2007); https://books.google.com/books?id=hzPCuKfpXLMC

  7. Fishbach, A. & Ferguson, M. J. in Social Psychology: Handbook of Basic Principles Vol. 2 (eds Kruglanski, A. W. & Higgins, E. T.) 490–515 (The Guilford Press, 2007).

  8. Pervin, L. A. Goal Concepts in Personality and Social Psychology (Taylor & Francis, 2015); https://books.google.com/books?id=lIXwCQAAQBAJ

  9. Moskowitz, G. B. & Grant, H. The Psychology of Goals Vol. 548 (Guilford Press, 2009).

  10. Molinaro, G. & Collins, A. G. E. A goal-centric outlook on learning. Trends Cogn. Sci. 27, 1150–1164 (2023).

  11. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 2018).

  12. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).

  13. Chu, J., Tenenbaum, J. B. & Schulz, L. E. In praise of folly: flexible goals and human cognition. Trends Cogn. Sci. 28, 628–642 (2024).

  14. Chu, J. & Schulz, L. E. Play, curiosity, and cognition. Annu. Rev. Dev. Psychol. 2, 317–343 (2020).

  15. Lillard, A. S. in Handbook of Child Psychology and Developmental Science Vol. 3 (eds Liben, L. & Mueller, U.) 425–468 (Wiley-Blackwell, 2015).

  16. Andersen, M. M., Kiverstein, J., Miller, M. & Roepstorff, A. Play in predictive minds: a cognitive theory of play. Psychol. Rev. 130, 462–479 (2023).

  17. Oudeyer, P.-Y., Kaplan, F. & Hafner, V. V. Intrinsic motivation systems for autonomous mental development. IEEE Trans. Evol. Comput. 11, 265–286 (2007).

  18. Nguyen, C. T. Games: Agency as Art (Oxford Univ. Press, 2020).

  19. Kolve, E. et al. AI2-THOR: an interactive 3D environment for visual AI. Preprint at https://arxiv.org/abs/1712.05474 (2017).

  20. Fodor, J. A. The Language of Thought (Harvard Univ. Press, 1979).

  21. Goodman, N. D., Tenenbaum, J. B., Feldman, J. & Griffiths, T. L. A rational analysis of rule-based concept learning. Cogn. Sci. 32, 108–154 (2008).

  22. Piantadosi, S. T., Tenenbaum, J. B. & Goodman, N. D. Bootstrapping in a language of thought: a formal model of numerical concept learning. Cognition 123, 199–217 (2012).

  23. Rule, J. S., Tenenbaum, J. B. & Piantadosi, S. T. The child as hacker. Trends Cogn. Sci. 24, 900–915 (2020).

  24. Wong, L. et al. From word models to world models: translating from natural language to the probabilistic language of thought. Preprint at https://arxiv.org/abs/2306.12672 (2023).

  25. Ghallab, M. et al. PDDL—The Planning Domain Definition Language Tech Report CVC TR-98-003/DCS TR-1165 (Yale Center for Computational Vision and Control, 1998).

  26. Chopra, S., Hadsell, R. & LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition 539–546 (IEEE, 2005).

  27. Le-Khac, P. H., Healy, G. & Smeaton, A. F. Contrastive representation learning: a framework and review. IEEE Access 8, 193907–193934 (2020).

  28. Pugh, J. K., Soros, L. B. & Stanley, K. O. Quality diversity: a new frontier for evolutionary computation. Front. Robot. AI https://doi.org/10.3389/frobt.2016.00040 (2016).

  29. Chatzilygeroudis, K., Cully, A., Vassiliades, V. & Mouret, J. B. Quality-diversity optimization: a novel branch of stochastic optimization. Springer Optim. Appl. 170, 109–135 (2020).

  30. Mouret, J.-B. & Clune, J. Illuminating search spaces by mapping elites. Preprint at https://arxiv.org/abs/1504.04909 (2015).

  31. Ward, T. B. Structured imagination: the role of category structure in exemplar generation. Cogn. Psychol. 27, 1–40 (1994).

  32. Allen, K. R. et al. Using games to understand the mind. Nat. Hum. Behav. https://doi.org/10.1038/s41562-024-01878-9 (2024).

  33. Liu, M., Zhu, M. & Zhang, W. Goal-conditioned reinforcement learning: problems and solutions. In Proc. 31st International Joint Conference on Artificial Intelligence: Survey Track (ed. De Raedt, L.) 5502–5511 (IJCAI, 2022).

  34. Colas, C., Karch, T., Sigaud, O. & Oudeyer, P.-Y. Autotelic agents with intrinsically motivated goal-conditioned reinforcement learning: a short survey. J. Artif. Intell. Res. 74, 1159–1199 (2022).

  35. Icarte, R. T., Klassen, T. Q., Valenzano, R. & McIlraith, S. A. Reward machines: exploiting reward function structure in reinforcement learning. J. Artif. Intell. Res. 73, 173–208 (2022).

  36. Pell, B. Metagame in Symmetric Chess-Like Games UCAM-CL-TR-277 (Univ. Cambridge, Computer Laboratory, 1992).

  37. Hom, V. & Marks, J. Automatic design of balanced board games. In Proc. AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment Vol. 3 (eds Schaeffer, J. & Mateasvol, M.) 25–30 (AAAI Press, 2007).

  38. Browne, C. & Maire, F. Evolutionary game design. IEEE Trans. Comput. Intell. AI Games 2, 1–16 (2010).

  39. Togelius, J. & Schmidhuber, J. An experiment in automatic game design. In 2008 IEEE Symposium On Computational Intelligence and Games 111–118 (IEEE, 2008).

  40. Smith, A. M., Nelson, M. J. & Mateas, M. Ludocore: a logical game engine for modeling videogames. In Proc. 2010 IEEE Conference on Computational Intelligence and Games 91–98 (IEEE, 2010).

  41. Zook, A. & Riedl, M. Automatic game design via mechanic generation. In Proc. AAAI Conference on Artificial Intelligence Vol. 28, https://doi.org/10.1609/aaai.v28i1.8788 (AAAI Press, 2014).

  42. Khalifa, A., Green, M. C., Perez-Liebana, D. & Togelius, J. General video game rule generation. In 2017 IEEE Conference on Computational Intelligence and Games 170–177 (IEEE, 2017).

  43. Lake, B. M., Ullman, T. D., Tenenbaum, J. B. & Gershman, S. J. Building machines that learn and think like people. Behav. Brain Sci. 40, e253 (2017).

  44. Cully, A. Autonomous skill discovery with quality-diversity and unsupervised descriptors. In Proc. Genetic and Evolutionary Computation Conference (ed. López-Ibáñez, M.) 81–89 (Association for Computing Machinery, 2019).

  45. Grillotti, L. & Cully, A. Unsupervised behavior discovery with quality-diversity optimization. IEEE Trans. Evol. Comput. 26, 1539–1552 (2022).

  46. Ullman, T. D., Spelke, E., Battaglia, P. & Tenenbaum, J. B. Mind games: game engines as an architecture for intuitive physics. Trends Cogn. Sci. 21, 649–665 (2017).

  47. Chen, T., Allen, K. R., Cheyette, S. J., Tenenbaum, J. & Smith, K. A. ‘Just in time’ representations for mental simulation in intuitive physics. In Proc. Annual Meeting of the Cognitive Science Society Vol. 45 (UC Merced, 2023); https://escholarship.org/uc/item/3hq021qs

  48. Tang, H., Key, D. & Ellis, K. WorldCoder, a model-based LLM agent: building world models by writing code and interacting with the environment. Preprint at https://arxiv.org/abs/2402.12275 (2024).

  49. Reed, S. et al. A generalist agent. Trans. Mach. Learn. Res. https://openreview.net/forum?id=1ikK0kHjvj (2022).

  50. Gallouédec, Q., Beeching, E., Romac, C. & Dellandréa, E. Jack of all trades, master of some, a multi-purpose transformer agent. Preprint at https://arxiv.org/abs/2402.09844 (2024).

  51. Florensa, C., Held, D., Geng, X. & Abbeel, P. Automatic goal generation for reinforcement learning agents. In Proc. 35th International Conference on Machine Learning Vol. 80 (eds Dy, J. & Krause, A.) 1515–1528 (PMLR, 2018).

  52. Open Ended Learning Team et al. Open-ended learning leads to generally capable agents. Preprint at https://arxiv.org/abs/2107.12808 (2021).

  53. Du, Y. et al. Guiding pretraining in reinforcement learning with large language models. In Proc. of the 40th International Conference on Machine Learning (eds Krause, A. et al.) 8657–8677 (JMLR, 2023).

  54. Colas, C., Teodorescu, L., Oudeyer, P.-Y., Yuan, X. & Côté, M.-A. Augmenting autotelic agents with large language models. Preprint at https://arxiv.org/abs/2305.12487v1 (2023).

  55. Littman, M. L. et al. Environment-independent task specifications via GLTL. Preprint at http://arxiv.org/abs/1704.04341 (2017).

  56. Leon, B. G., Shanahan, M. & Belardinelli, F. In a nutshell, the human asked for this: latent goals for following temporal specifications. In 10th International Conference on Learning Representations (OpenReview, 2022); https://openreview.net/forum?id=rUwm9wCjURV

  57. Ma, Y. J. et al. Eureka: Human-Level Reward Design via Coding Large Language Models (ICLR, 2023).

  58. Faldor, M., Zhang, J., Cully, A. & Clune, J. OMNI-EPIC: open-endedness via models of human notions of interestingness with environments programmed in code. In 12th International Conference on Learning Representations (OpenReview, 2024); https://openreview.net/forum?id=AgM3MzT99c

  59. Colas, C. et al. Language as a cognitive tool to imagine goals in curiosity-driven exploration. In Proc. 34th International Conference on Neural Information Processing Systems (NIPS ’20) (eds Larochelle, H. et al.) 3761–3774 (Curran Associates, 2020).

  60. Wu, C. M., Schulz, E., Speekenbrink, M., Nelson, J. D. & Meder, B. Generalization guides human exploration in vast decision spaces. Nat. Hum. Behav. 2, 915–924 (2018).

  61. Ten, A. et al. in The Drive for Knowledge: The Science of Human Information Seeking (eds. Dezza, I. C. et al.) 53–76 (Cambridge Univ. Press, 2022).

  62. Berlyne, D. E. Novelty and curiosity as determinants of exploratory behaviour. Br. J. Psychol. Gen. Sect. 41, 68–80 (1950).

  63. Gopnik, A. Empowerment as causal learning, causal learning as empowerment: a bridge between Bayesian causal hypothesis testing and reinforcement learning. PhilSci-Archive https://philsci-archive.pitt.edu/23268/ (2024).

  64. Addyman, C. & Mareschal, D. Local redundancy governs infants’ spontaneous orienting to visual-temporal sequences. Child Dev. 84, 1137–1144 (2013).

  65. Du, Y. et al. What can AI learn from human exploration? Intrinsically-motivated humans and agents in open-world exploration. In NeurIPS 2023 Workshop: Information-Theoretic Principles in Cognitive Systems (OpenReview, 2023); https://openreview.net/forum?id=aFEZdGL3gn

  66. Ruggeri, A., Stanciu, O., Pelz, M., Gopnik, A. & Schulz, E. Preschoolers search longer when there is more information to be gained. Dev. Sci. 27, e13411 (2024).

  67. Liquin, E. G., Callaway, F. & Lombrozo, T. Developmental change in what elicits curiosity. In Proc. Annual Meeting of the Cognitive Science Society Vol. 43 (UC Merced, 2021); https://escholarship.org/uc/item/43g7m167

  68. Taffoni, F. et al. Development of goal-directed action selection guided by intrinsic motivations: an experiment with children. Exp. Brain Res. 232, 2167–2177 (2014).

  69. Ten, A., Kaushik, P., Oudeyer, P.-Y. & Gottlieb, J. Humans monitor learning progress in curiosity-driven exploration. Nat. Commun. 12, 5972 (2021).

  70. Baldassarre, G. et al. Intrinsic motivations and open-ended development in animals, humans, and robots: an overview. Front. Psychol. 5, 985 (2014).

  71. Spelke, E. S. & Kinzler, K. D. Core knowledge. Dev. Sci. 10, 89–96 (2007).

  72. Jara-Ettinger, J., Gweon, H., Schulz, L. E. & Tenenbaum, J. B. The naïve utility calculus: computational principles underlying commonsense psychology. Trends Cogn. Sci. 20, 589–604 (2016).

  73. Liu, S., Brooks, N. B. & Spelke, E. S. Origins of the concepts cause, cost, and goal in prereaching infants. Proc. Natl Acad. Sci. USA 116, 17747–17752 (2019).

  74. Jara-Ettinger, J. Theory of mind as inverse reinforcement learning. Curr. Opin. Behav. Sci. 29, 105–110 (2019).

  75. Arora, S. & Doshi, P. A survey of inverse reinforcement learning: challenges, methods and progress. Artif. Intell. 297, 103500 (2021).

  76. Baker, C., Saxe, R. & Tenenbaum, J. Bayesian theory of mind: Modeling joint belief–desire attribution. In Proc. Annual Meeting of the Cognitive Science Society Vol. 33 (UC Merced, 2011); https://escholarship.org/uc/item/5rk7z59q

  77. Velez-Ginorio, J., Siegel, M. H., Tenenbaum, J. B. & Jara-Ettinger, J. Interpreting actions by attributing compositional desires. In Proc. Annual Meeting of the Cognitive Science Society Vol. 39 (eds Gunzelmann, G. et al.) (UC Merced, 2017); https://escholarship.org/uc/item/3qw110xj

  78. Ho, M. K. & Griffiths, T. L. Cognitive science as a source of forward and inverse models of human decisions for robotics and control. Annu. Rev. Control Robot. Auton. Syst. 5, 33–53 (2022).

  79. Palan, S. & Schitter, C. Prolific.ac—a subject pool for online experiments. J. Behav. Exp. Finance 17, 22–27 (2018).

  80. Icarte, R. T., Klassen, T., Valenzano, R. & McIlraith, S. Using reward machines for high-level task specification and decomposition in reinforcement learning. In Proc. 35th International Conference on Machine Learning Vol. 80 (eds Dy, J. & Krause, A.) 2107–2116 (PMLR, 2018).

  81. Brants, T., Popat, A. C., Xu, P., Och, F. J. & Dean, J. Large language models in machine translation. In Proc. 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (ed. Eisner, J.) 858–867 (Association for Computational Linguistics, 2007).

  82. Rothe, A., Lake, B. M. & Gureckis, T. M. Question asking as program generation. In Advances in Neural Information Processing Systems 30 (eds Von Luxburg, U. et al.) 1047–1056 (Curran Associates, 2017).

  83. LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M.’A. & Huang, F. J. in Predicting Structured Data (eds Bakir, G. et al.) Ch. 10 (MIT Press, 2006).

  84. van den Oord, A., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748v2 (2018).

  85. Charity, M., Green, M. C., Khalifa, A. & Togelius, J. Mech-elites: illuminating the mechanic space of GVG-AI. In Proc. 15th International Conference on the Foundations of Digital Games (eds Yannakakis, G. N. et al.) 8 (Association for Computing Machinery, 2020).

  86. GPT-4 Technical Report (OpenAI, 2023).

  87. Mann, H. B. & Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60 (1947).

  88. Castro, S. Fast Krippendorff: fast computation of Krippendorff’s alpha agreement measure. GitHub https://github.com/pln-fing-udelar/fast-krippendorff (2017).

  89. Raudenbush, S. W. & Bryk, A. S. Hierarchical Linear Models: Applications and Data Analysis Methods 2nd edn (Sage Publications, 2002).

  90. Hox, J., Moerbeek, M. & van de Schoot, R. Multilevel Analysis (Techniques and Applications) 3rd edn (Routledge, 2018).

  91. Agresti, A. Categorical Data Analysis 3rd edn (Wiley, 2018).

  92. Greene, W. H. & Hensher, D. A. Modeling Ordered Choices: A Primer (Cambridge Univ. Press, 2010).

  93. Christensen, R. H. B. ordinal—regression models for ordinal data. R package version 2023.12-4 https://CRAN.R-project.org/package=ordinal (2023).

  94. R Core Team. R: A Language and Environment for Statistical Computing Version 4.3.2 https://www.R-project.org/ (R Foundation for Statistical Computing, 2023).

  95. Long, J. A. jtools: analysis and presentation of social scientific data. J. Open Source Softw. 9, 6610 (2024).

  96. Lenth, R. V. emmeans: estimated marginal means, aka least-squares means. R package version 1.10.0 https://CRAN.R-project.org/package=emmeans (2024).

  97. Davidson, G., Todd, G., Togelius, J., Gureckis, T. M. & Lake, B. M. guydav/goals-as-reward-producing-programs: release for DOI. Zenodo https://doi.org/10.5281/zenodo.14238893 (2024).

Acknowledgements

G.D. thanks members of the Human and Machine Learning Lab and the Computation and Cognition Lab at NYU for their feedback at various stages of this project. We thank L. Wong for helpful discussions on which questions to prioritize in the human evaluations of our model outputs. We thank O. Timplaru and https://vecteezy.com for the use of the illustration in Fig. 1a. G.D. and B.M.L. are supported by the National Science Foundation (NSF) under NSF Award 1922658. G.T.’s work on this project is supported by the NSF GRFP under grant DGE-2234660. T.M.G.’s work on this project is supported by NSF BCS 2121102.

Author information

Authors and Affiliations

Authors

Contributions

G.D. designed and executed the behavioural experiments and analysed their data. G.D. and G.T. jointly designed and implemented the GPG model and designed the human evaluations. G.D. led human evaluation data collection and analysis. G.D. and G.T. led the writing of the paper. J.T. advised on computational modelling and helped write the paper. T.M.G. and B.M.L. jointly advised all work reported in this manuscript and helped write the paper.

Corresponding author

Correspondence to Guy Davidson.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Cédric Colas and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Online experiment interface.

The main part of the screen presents the AI2-THOR-based experiment room. Below it, we depict the controls. To the right, we show the text prompts for creating a new game (fonts enlarged for visualization). Our experiment is accessible online at https://game-generation-public.web.app/.

Extended Data Fig. 2 Common-sense behavioural analyses.

We plot similar information to Fig. 2b, but including additional object categories and predicates.

Extended Data Fig. 3 Our implementation of the Goal Program Generator model fills the archive quickly and finds examples with human-like fitness scores.

Left: Our model rapidly finds exemplars for all archive cells (that is, niches induced by our behavioural characteristics), reaching 50% occupancy after 400 generations (out of a total of 8192) and 95% occupancy after 794 generations; the archive is thus almost full one-tenth of the way through the search process. Middle: Our model reaches human-like fitness scores. After only three generations, the fittest sample in the archive has a higher fitness score than at least one participant-created game. By the end of the search, the mean fitness in the archive is close to the mean fitness of human games. Right: Our model generates the vast majority of its samples within the range of fitness scores occupied by participant-created games, though few samples approach the top of the range.
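The archive and occupancy statistics described in this caption come from a MAP-Elites-style search. The sketch below is a schematic, runnable toy version of that loop, assuming made-up stand-ins (integer vectors in place of goal programs, toy fitness and behavioural-characteristic functions); it illustrates niche-based elite replacement and occupancy tracking, not the paper's actual implementation.

```python
# Toy MAP-Elites-style loop: an archive keyed by discretised behavioural
# characteristics (BCs), where each niche keeps only its fittest occupant.
import random

DIM, BINS_PER_BC, GENERATIONS = 8, 4, 2000
TOTAL_CELLS = BINS_PER_BC ** 2            # two toy BCs -> 16 niches

def random_program():                     # toy "program": a vector of small integers
    return [random.randint(0, 9) for _ in range(DIM)]

def mutate(prog):
    child = prog[:]
    child[random.randrange(DIM)] = random.randint(0, 9)
    return child

def fitness(prog):                        # toy fitness: prefer large entries
    return sum(prog)

def behaviour_key(prog):                  # discretise two toy BCs into bins
    bc1 = min(prog[0] * BINS_PER_BC // 10, BINS_PER_BC - 1)
    bc2 = min(prog[-1] * BINS_PER_BC // 10, BINS_PER_BC - 1)
    return (bc1, bc2)

archive = {}                              # BC key -> (program, fitness)
for gen in range(GENERATIONS):
    parent = random.choice(list(archive.values()))[0] if archive else random_program()
    child = mutate(parent)
    key, fit = behaviour_key(child), fitness(child)
    if key not in archive or fit > archive[key][1]:   # elite replacement within the niche
        archive[key] = (child, fit)
    if gen % 500 == 0:
        print(f"gen {gen}: occupancy {len(archive) / TOTAL_CELLS:.0%}")
```

Occupancy here is simply the fraction of niches that hold at least one elite, which is the quantity plotted in the left panel.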

Extended Data Fig. 4 Human evaluations interface.

For each game, participants viewed the same four images of the environment, followed by the GPT-4 back-translated description of the game (see Human evaluation methods for details). They then answered the two free-response and seven multiple-choice questions on the right. In the web page version, the questions appeared below the game description; we present them side-by-side to save space.

Extended Data Fig. 5 Mixed model result summary.

We summarize the pairwise comparisons made in Table 1. Each panel corresponds to a set of columns in Table 1 and each colour to one of the seven human evaluation attributes we consider. We compare the estimated marginal mean scores under the fitted mixed-effects models between each pair of game types listed in the panel title. As in Table 1, we use the method of estimated (least-squares) marginal means to compare the three groups of games, accounting for the random effects fitted to particular games and human evaluators.

Extended Data Fig. 6 Proportion of human interactions activating only matched and real games in the same cell.

Each bar corresponds to a pair of corresponding matched and real games. In each bar, we plot the proportion of relevant interactions (state-action traces) that are unique to the matched game (blue), unique to the real game (green), or shared across both (purple). A few games (with the bar mostly or entirely in purple) show high similarity between the corresponding games — under 25% (7/30) share more than half of their relevant interactions. Most games, however, show substantial differences between the sets of relevant interactions, with some showing a higher fraction unique to human games and others to matched model games. The average Jaccard similarity between the sets of relevant interactions for the matched and real game is only 0.347 and the median similarity is 0.180 (identical games would score 1.0, entirely dissimilar games 0).
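As a concrete illustration of the overlap measure in this caption, the sketch below computes the Jaccard similarity between two hypothetical sets of relevant-interaction identifiers; the trace names are made up and the function is a generic implementation, not code from the released repository.

```python
# Jaccard similarity between two sets of "relevant interaction" identifiers:
# |A ∩ B| / |A ∪ B|, so identical sets score 1.0 and disjoint sets score 0.0.

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0

matched_game = {"trace_01", "trace_02", "trace_03", "trace_07"}  # hypothetical IDs
real_game    = {"trace_02", "trace_03", "trace_09"}

print(jaccard(matched_game, real_game))  # 2 shared / 5 total = 0.4
```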

Extended Data Fig. 7 Mixed model (including fitness) coefficient summary.

We summarize the fitted model coefficients listed in Extended Data Table 2. Each panel corresponds to a particular coefficient in Extended Data Table 2 and each colour to one of the seven human evaluation attributes we consider. We plot the fitted coefficient value and a standard error estimated using the Hessian, as implemented by the clmm function in the ordinal R package. We observe the same effects discussed in Extended Data Table 2.

Extended Data Table 1 Non-parametric significance test results mostly corroborate mixed-model results
Extended Data Table 2 Fitness scores significantly predict several attributes, including understandability and human-likeness

Supplementary information

Supplementary Information

A: Supplementary Fig. 1, mapping from pseudo-code to our DSL. B: analyses of the mapping from natural language to our DSL. C: a description of our full fitness feature set. D: algorithm 1 describing our fitness function objective. E: a full description of the MAP-Elites algorithm and our behavioural characteristics in Supplementary Table 1. F: a description of our approach to back-translation from our DSL to natural language. G: our analysis of model-generated sample edit distance from real games, including Supplementary Fig. 2. H: Supplementary Fig. 3 demonstrating the highest-fitness model-generated games. I: detailed descriptions of our human evaluations data analysis, including Supplementary Tables 2–6 and Supplementary Figs. 4–6. J: our model ablations, including Supplementary Figs. 7–9. K: the consent form from our online experiments. L: a full description of our DSL.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Davidson, G., Todd, G., Togelius, J. et al. Goals as reward-producing programs. Nat Mach Intell 7, 205–220 (2025). https://doi.org/10.1038/s42256-025-00981-4

