Fig. 1: Neural network schematic.

Schematic of the simple character-level language model used in this work. The model consists of three main parts: an embedding layer, an LSTM layer, and a dense output layer. The embedding layer is a linear layer that multiplies the one-hot input s(t) by a matrix to produce an embedding vector x(t). x(t) is then used as the input to the LSTM layer, in which the forget gate f(t), the input gate i(t), the output gate o(t), and the candidate value \({\tilde{{\bf{c}}}}^{(t)}\) are all controlled by (x(t), h(t−1)). The forget gate and input gate together determine the update of the cell state c(t), while the output gate decides how much information propagates to the next time step through h(t). The output layer predicts the probabilities \({\hat{{\bf{y}}}}^{(t)}\) by parametrizing the transformation from h(t) to \({\hat{{\bf{y}}}}^{(t)}\) with learned weights \({{\bf{D}}}_{d}\) and learned biases \({{\bf{b}}}_{d}\). Finally, we compute the cross-entropy between the predicted probability distribution \({\hat{{\bf{y}}}}^{(t)}\) and the true probability distribution y(t) = s(t+1).
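
A minimal PyTorch sketch of this embedding–LSTM–dense pipeline is given below. The vocabulary size, embedding dimension, and hidden dimension are illustrative assumptions, not the hyperparameters used in this work, and nn.Embedding replaces the explicit one-hot multiplication (the two operations are equivalent).

import torch
import torch.nn as nn
import torch.nn.functional as F

class CharLSTM(nn.Module):
    """Character-level language model: embedding -> LSTM -> dense output."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        # Embedding: equivalent to multiplying the one-hot s(t) by a matrix, giving x(t)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM: computes f(t), i(t), o(t), and the candidate cell state from (x(t), h(t-1))
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Dense output layer: maps h(t) to logits over the vocabulary (weights D_d, biases b_d)
        self.dense = nn.Linear(hidden_dim, vocab_size)

    def forward(self, s, state=None):
        x = self.embed(s)               # x(t)
        h, state = self.lstm(x, state)  # h(t) and the (h, c) state
        logits = self.dense(h)          # unnormalized scores for y-hat(t)
        return logits, state

# Illustrative usage: the training target is the next character, y(t) = s(t+1).
model = CharLSTM(vocab_size=65, embed_dim=32, hidden_dim=128)
s = torch.randint(0, 65, (8, 100))                 # batch of character-index sequences
logits, _ = model(s[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, 65), s[:, 1:].reshape(-1))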