Fig. 1: Pipeline architecture.
From: Realistic morphology-preserving generative modelling of the brain

The two-stage VQ-VAE and transformer training and inference pipeline is shown. During VQ-VAE training (blue arrow), a Codebook representation is learned to minimise the reconstruction loss between the Input image and the Output image. For stability, a consistency loss is applied to the Codebook elements with respect to the Encoded image. During Transformer training, autoregressive conditional generation is learned via a cross-entropy loss applied to the Rasterized code. At inference, the Transformer is conditioned on the variables of interest and generates one token at a time; once the whole sequence has been generated, it is reshaped into a Tokenized encoded image and fed through the Codebook to the Decoder to obtain the Output image.
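A minimal PyTorch sketch of the two stages follows. It uses a toy 1-D setting with stand-in linear encoder/decoder modules; all sizes, the loss weight `beta`, and the conditioning scheme (a single prepended token standing in for the variables of interest) are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CODES, DIM, TOKENS = 512, 64, 16          # assumed toy sizes

class VectorQuantizer(nn.Module):
    """Snaps each encoded vector to its nearest Codebook entry."""
    def __init__(self, num_codes, dim, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta                       # weight of the commitment term

    def forward(self, z_e):                    # z_e: (B, T, D) Encoded image
        dist = torch.cdist(z_e, self.codebook.weight)
        idx = dist.argmin(-1)                  # Rasterized code, (B, T)
        z_q = self.codebook(idx)
        # codebook term pulls entries toward encodings; commitment term
        # keeps encodings consistent with their assigned entries
        vq_loss = F.mse_loss(z_q, z_e.detach()) \
                + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()       # straight-through gradients
        return z_q, idx, vq_loss

class TinyTransformer(nn.Module):
    """Causal transformer over the rasterized token sequence."""
    def __init__(self, num_codes, dim):
        super().__init__()
        self.tok = nn.Embedding(num_codes + 1, dim)   # +1: conditioning token
        self.pos = nn.Embedding(TOKENS + 1, dim)
        layer = nn.TransformerEncoderLayer(dim, 4, 128, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, seq):                    # seq: (B, T) token indices
        T = seq.size(1)
        h = self.tok(seq) + self.pos(torch.arange(T, device=seq.device))
        mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        return self.head(self.body(h, mask=mask))

# --- Stage 1: VQ-VAE training (blue arrow) ---
enc = nn.Linear(256, TOKENS * DIM)             # stand-in Encoder
dec = nn.Linear(TOKENS * DIM, 256)             # stand-in Decoder
vq = VectorQuantizer(NUM_CODES, DIM)
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters(),
                        *vq.parameters()], lr=1e-4)

x = torch.randn(8, 256)                        # stand-in Input images
z_q, idx, vq_loss = vq(enc(x).view(8, TOKENS, DIM))
recon = dec(z_q.view(8, -1))                   # Output image
opt.zero_grad()
(F.mse_loss(recon, x) + vq_loss).backward()    # reconstruction + consistency
opt.step()

# --- Stage 2: transformer training on the Rasterized code ---
tf = TinyTransformer(NUM_CODES, DIM)
cond = torch.full((8, 1), NUM_CODES)           # conditioning slot (assumed)
seq = torch.cat([cond, idx], dim=1)
logits = tf(seq[:, :-1])                       # predict each next token
ce = F.cross_entropy(logits.reshape(-1, NUM_CODES), idx.reshape(-1))

# --- Inference: generate one token at a time, then decode ---
gen = cond[:1]
for _ in range(TOKENS):
    nxt = tf(gen)[:, -1].softmax(-1).multinomial(1)   # sample next token
    gen = torch.cat([gen, nxt], dim=1)
codes = gen[:, 1:]                             # Tokenized encoded image
sample = dec(vq.codebook(codes).view(1, -1))   # decoded Output image
```

The straight-through estimator is what makes stage 1 trainable end to end despite the non-differentiable nearest-neighbour lookup; in stage 2, prepending a conditioning token is just one plausible way to inject the variables of interest into the autoregressive sequence.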