Fig. 2: Overview of WHAM.

From: World and Human Action Models towards gameplay ideation

We formulate human gameplay as sequences of discrete tokens, alternating between image observations and controller actions. We use \({{\bf{z}}}_{t}\) to refer to all tokens encoding an observation \({{\bf{o}}}_{t}\) at time step \(t\), and \({{\bf{a}}}_{t}\) for the controller action. Hatted variables denote model predictions. A VQGAN (ref. 51) tokenizes the images from observation space, \({{\bf{o}}}_{t}\in {{\mathbb{R}}}^{H\times W\times 3}\) (in which H, W and 3 refer to the height, width and number of channels of the video frames, respectively), to a compact discrete latent space \({{\bf{z}}}_{t}\in {\{1,2,\ldots ,{V}_{O}\}}^{{d}_{z}}\), for vocabulary size \({V}_{O}\) and bottleneck size \({d}_{z}\). A causal transformer (ref. 53) is then trained to predict the latent observation and discretized action tokens. The VQGAN encoder/decoder is trained using a reconstruction loss and a perceptual loss (ref. 61). No explicit delimiter is provided to distinguish whether an observation or action token should be predicted next; the model must infer this from learned position embeddings.
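To make the interleaved token layout concrete, the following is a minimal sketch, not the authors' implementation, of how per-frame VQGAN codes and discretized controller actions might be flattened into one stream for a causal transformer with learned position embeddings. All names and sizes here (V_O, V_A, D_Z, A_LEN, ToySequenceModel, interleave) are illustrative assumptions.

```python
# Sketch only: interleave observation tokens z_t and action tokens a_t
# into a single causal sequence (z_1, a_1, z_2, a_2, ...). Assumed sizes
# below are placeholders, not the values used by WHAM.
import torch
import torch.nn as nn

V_O, V_A = 4096, 256  # assumed observation / action vocabulary sizes
D_Z = 256             # assumed bottleneck size: VQGAN tokens per frame
A_LEN = 16            # assumed controller-action tokens per time step

class ToySequenceModel(nn.Module):
    """Causal transformer over a shared token vocabulary.

    Observation tokens occupy ids [0, V_O); action tokens are offset
    into [V_O, V_O + V_A) so one embedding table and one output head
    cover both. Because every frame contributes a fixed number of
    tokens (D_Z + A_LEN), learned position embeddings suffice for the
    model to infer whether the next token is an observation or an
    action token, with no explicit delimiter.
    """

    def __init__(self, d_model=512, n_layers=4, n_heads=8, max_len=8192):
        super().__init__()
        self.tok_emb = nn.Embedding(V_O + V_A, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, V_O + V_A)

    def forward(self, tokens):  # tokens: (B, T) of long ids
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # next-token logits over both vocabularies

def interleave(z_tokens, a_tokens):
    """Flatten per-frame codes into (z_1, a_1, z_2, a_2, ...).

    z_tokens: (B, T, D_Z) discrete VQGAN codes per frame
    a_tokens: (B, T, A_LEN) discretized actions, shifted into [V_O, V_O + V_A)
    """
    B, T, _ = z_tokens.shape
    seq = torch.cat([z_tokens, a_tokens + V_O], dim=-1)  # (B, T, D_Z + A_LEN)
    return seq.reshape(B, T * (D_Z + A_LEN))

if __name__ == "__main__":
    B, T = 2, 3
    z = torch.randint(0, V_O, (B, T, D_Z))
    a = torch.randint(0, V_A, (B, T, A_LEN))
    logits = ToySequenceModel()(interleave(z, a))
    print(logits.shape)  # (B, T * (D_Z + A_LEN), V_O + V_A)
```

Offsetting action ids by V_O is one simple way to share a single softmax head across both token types; the fixed per-frame token count is what lets position alone determine which type the model should predict next.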