Fig. 3: PLATO uses both a perceptual model and a dynamics model to make per-object predictions.
From: Intuitive physics learning in a deep-learning model inspired by developmental psychology

PLATO consists of two components: the perception module (left) and the dynamics predictor (right). The perception module converts visual input into a set of object codes; the dynamics module uses these object codes to predict future frames.

a, The perception module takes as input an image \(x\) and an associated set of segmentation masks \({m}_{1:K}\). Taking the elementwise product of the image with each mask yields a set of images showing just the visible parts of each object: \({x}_{1:K}\).

b, Given an object image-mask pair, the perception module produces an object code \({z}_{k}\) via an encoder module \(\phi\). The object code is decoded back into a reconstruction of the object image-mask pair via the decoder module \(\theta\). The discrepancy between the reconstruction and the original image-mask pair is used to train the parameters of \(\phi\) and \(\theta\), so that \({z}_{k}\) comes to represent informative aspects of each object image-mask pair.

c, After training, an entire image can be reconstructed from a set of object codes \({z}_{1:K}\) by independently running each image-mask pair through \(\phi\) and decoding via \(\theta\) (a minimal sketch of this masking-and-autoencoding pipeline appears below).

d, The dynamics module is trained on sequence data produced by running videos (and their segmentation masks) through the pretrained encoder \(\phi\). It must predict the object codes in the next frame given the object codes in the current frame \({z}_{1:K}^{t}\) and an object buffer of the codes from the preceding frames \({z}_{1:K}^{1:t-1}\). The dynamics module comprises two trainable components: a 'slotted' object-based LSTM and an interaction network (IN). Predictions are made by computing interactions from each slot in the LSTM's previous state (dotted arrow) to every other slot in the LSTM and to all input object codes and buffers \({z}_{1:K}^{1:t}\). The resulting interactions are used to make objectwise predictions and to update the LSTM state (see the second sketch below).
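
The paper does not provide implementation code for this figure, so the following is a minimal PyTorch-style sketch of panels a-c: masking each object out of the frame, encoding each image-mask pair with \(\phi\), and decoding with \(\theta\) under a reconstruction loss. The 32×32 input resolution, all layer sizes, and the class name ObjectAutoencoder are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAutoencoder(nn.Module):
    # Encoder phi and decoder theta for one object image-mask pair.
    # Assumes 32x32 RGB input; channel and code sizes are illustrative.
    def __init__(self, code_dim=16):
        super().__init__()
        self.phi = nn.Sequential(            # encoder: image-mask pair -> z_k
            nn.Conv2d(4, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, code_dim),
        )
        self.theta = nn.Sequential(          # decoder: z_k -> image-mask pair
            nn.Linear(code_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 4, 4, stride=2, padding=1),
        )

    def forward(self, x, masks):
        # x: (B, 3, H, W) frame; masks: (B, K, 1, H, W) per-object masks.
        B, K, _, H, W = masks.shape
        x_k = x.unsqueeze(1) * masks             # panel a: isolate each object
        pairs = torch.cat([x_k, masks], dim=2)   # stack mask as a 4th channel
        pairs = pairs.view(B * K, 4, H, W)
        z = self.phi(pairs)                      # panel b: object codes z_{1:K}
        recon = self.theta(z)                    # reconstructed image-mask pairs
        loss = F.mse_loss(recon, pairs)          # trains phi and theta jointly
        return z.view(B, K, -1), recon.view(B, K, 4, H, W), loss
```

Panel c's full-image reconstruction then amounts to decoding each of the K object codes independently and compositing the resulting object images, for example by summation over the K decoded reconstructions.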
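Panel d's predictor can be sketched in the same spirit. The caption specifies a slot-shared LSTM plus an interaction network; the sketch below implements one plausible reading, with LSTMCell weights shared across the K slots and summed pairwise effects, and it omits the multi-frame buffer \({z}_{1:K}^{1:t-1}\) for brevity. The class name SlottedDynamics, the hidden sizes, and the sum pooling are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class SlottedDynamics(nn.Module):
    # One step of a slot-shared LSTM plus a simple interaction network (IN).
    # Sizes, sum pooling, and omission of the frame buffer are assumptions.
    def __init__(self, code_dim=16, hidden=64):
        super().__init__()
        D = code_dim + hidden
        self.relation = nn.Sequential(       # IN: pairwise-effect MLP
            nn.Linear(2 * D, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.cell = nn.LSTMCell(code_dim + hidden, hidden)  # shared across slots
        self.readout = nn.Linear(hidden, code_dim)          # predicts z^{t+1}

    def step(self, z_t, h, c):
        # z_t: (B, K, code_dim) current codes; h, c: (B, K, hidden) slot states.
        B, K, _ = z_t.shape
        nodes = torch.cat([z_t, h], dim=-1)                  # (B, K, D)
        # Interactions from every slot (sender) to every slot (receiver).
        send = nodes.unsqueeze(2).expand(-1, -1, K, -1)
        recv = nodes.unsqueeze(1).expand(-1, K, -1, -1)
        effects = self.relation(torch.cat([send, recv], dim=-1)).sum(dim=2)
        # Objectwise update: each slot sees its code plus aggregated effects.
        inp = torch.cat([z_t, effects], dim=-1).reshape(B * K, -1)
        h, c = self.cell(inp, (h.reshape(B * K, -1), c.reshape(B * K, -1)))
        z_next = self.readout(h).reshape(B, K, -1)           # predicted codes
        return z_next, h.reshape(B, K, -1), c.reshape(B, K, -1)
```

At training time the predicted codes would be compared against the pretrained encoder's codes for the actual next frame; at evaluation time the step can be applied autoregressively, feeding each z_next back in as the next z_t.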