Figure 3

The X-shaped variational autoencoder architecture. Each column of nodes is a neural layer, and the circles it contains are neurons, each computed as a non-linear function of the neurons in the previous layer. The solid lines indicate full connectivity: every neuron in a layer can contribute to every neuron in the next layer. The dotted vertical lines indicate that some neurons are omitted to simplify the illustration. The numbers give each neuron's index within its layer. The X-shaped variational autoencoder (XVAE) consists of two main parts: the encoder (green box), which maps the input data into the smaller latent feature space, and the decoder (red box), which reconstructs the original variables from the latent features. The input data consists of two separate input sets: input S1, containing all numerical variables (blue box), and input S2, containing all categorical variables (orange box). Each input set is first fed into its own hidden layer. The two resulting hidden layers are then merged into a further hidden layer, which feeds into the encoding layer (green). The encoding layer generates the latent features on which the clustering is performed. It uses stochastic inference to approximate the latent features as probability distributions, which in this case are Gaussian. The encoding layer is therefore split into two parts, one for the mean and one for the standard deviation of those distributions (not visualised). The decoder begins where the encoding layer feeds into a hidden layer, which then splits into two separate hidden layers, each feeding into its own output layer to reconstruct the original variables. Finally, the reconstruction loss is determined by computing the mean squared error of the numerical variables and the cross-entropy of the categorical variables, both of which are scaled by the number of variables in the input data.
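
The reconstruction loss described in the caption can be illustrated with a minimal sketch in plain Python. It rests on several assumptions that the caption does not state: the function names are hypothetical, each categorical variable is represented by a predicted probability vector and the index of the observed category, and "scaled by the number of variables" is read as multiplying each mean loss by the number of variables in its own input set. The actual implementation may scale the terms differently.

```python
import math


def mse(x_true, x_pred):
    """Mean squared error over the numerical variables (input S1)."""
    return sum((t - p) ** 2 for t, p in zip(x_true, x_pred)) / len(x_true)


def cross_entropy(true_idx, pred_probs, eps=1e-12):
    """Mean categorical cross-entropy over the categorical variables (input S2).

    true_idx[i]   -- index of the observed category of variable i
    pred_probs[i] -- predicted probability vector for variable i
    eps           -- floor that avoids log(0)
    """
    losses = [-math.log(max(p[k], eps)) for k, p in zip(true_idx, pred_probs)]
    return sum(losses) / len(losses)


def reconstruction_loss(s1_true, s1_pred, s2_true, s2_pred):
    """Combine both terms.

    Assumption: each mean loss is weighted by the number of variables
    in its own input set; the caption does not specify the exact scaling.
    """
    numerical_term = len(s1_true) * mse(s1_true, s1_pred)
    categorical_term = len(s2_true) * cross_entropy(s2_true, s2_pred)
    return numerical_term + categorical_term


# Two numerical variables and one binary categorical variable (made-up values).
loss = reconstruction_loss(
    s1_true=[1.0, 2.0],
    s1_pred=[1.0, 4.0],
    s2_true=[0],
    s2_pred=[[0.5, 0.5]],
)
print(loss)
```

Under this weighting, each term equals the summed loss over the variables in its set, so an input set with more variables contributes proportionally more to the total.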