Fig. 1: ALBERT model and the GLUE benchmark datasets.
From: Demonstration of transformer-based ALBERT model on a 14nm analog AI inference chip

a A Transformer model relies on the attention mechanism instead of recurrence for sequence processing. b The structure of the ALBERT-base model with 12 layers of shared weights. Each layer consists of a self-attention block and a feed-forward network (FFN) block, each followed by residual addition (Add) and layer normalization (LayerNorm). The four layer-blocks implemented in hardware include inProj (mapping input activations to queries (Q), keys (K), and values (V)), outProj (mapping the attention computation to output activations), and the two fully connected layer-blocks comprising the FFN (FC1, FC2). These four layer-blocks represent over 99% of the weights in the ALBERT-base model. c The seven GLUE benchmark tasks used in this paper have validation datasets that vary considerably in size (number of examples); all but one are binary classification tasks. d The distribution of sequence lengths for the validation datasets associated with the seven GLUE tasks. For sequences of 64 tokens, the analog accelerator performs 98% of the required operations; for longer sequences, this percentage is slightly lower because the compute in the fully connected layer-blocks scales linearly with sequence length, while the attention compute scales quadratically. e Simulated accuracy on the seven GLUE tasks as model weights are quantized from 6-bit precision down to 2-bit (2-bit: grey; 3-bit: teal; 4-bit: lime; 5-bit: orange; 6-bit: blue), revealing significant differences in difficulty and robustness between tasks. An effective precision of 4 bits is sufficient for almost all of the tasks.
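
As a rough check on the 98% figure in panel d, the sketch below tallies multiply-accumulate (MAC) operations per token for one ALBERT-base layer (hidden size 768, FFN size 3072), comparing the four weight-bearing layer-blocks against the attention compute. Softmax, LayerNorm, residual additions, and embedding lookups are ignored, so the numbers are illustrative back-of-envelope estimates rather than a reproduction of the paper's accounting.

```python
# Back-of-envelope MAC counts per token for one ALBERT-base layer
# (hidden size 768, FFN size 3072). Softmax, LayerNorm, residual adds,
# and embedding lookups are ignored.
D_MODEL, D_FFN = 768, 3072

def layer_block_fraction(seq_len: int) -> float:
    """Fraction of per-token MACs that fall in the four weight layer-blocks
    (inProj, outProj, FC1, FC2) rather than in the attention compute."""
    # Weight layer-blocks: per-token cost is independent of sequence length.
    in_proj = 3 * D_MODEL * D_MODEL          # Q, K, V projections
    out_proj = D_MODEL * D_MODEL
    ffn = 2 * D_MODEL * D_FFN                # FC1 + FC2
    weight_macs = in_proj + out_proj + ffn
    # Attention compute (Q·K^T and attention·V): 2*n*d MACs per token,
    # i.e. quadratic in sequence length n for the whole sequence.
    attn_macs = 2 * seq_len * D_MODEL
    return weight_macs / (weight_macs + attn_macs)

for n in (64, 128, 256, 512):
    print(f"seq_len={n:4d}: {100 * layer_block_fraction(n):.1f}% of MACs in layer-blocks")
```

At 64 tokens this estimate gives roughly 98–99% of MACs in the four layer-blocks, dropping toward ~90% at 512 tokens, consistent with the trend described in panel d.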
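
The quantization sweep in panel e can be illustrated with a minimal weight-quantization sketch. The caption does not state the exact quantization scheme used in the simulations, so a uniform symmetric per-tensor quantizer is assumed here purely for illustration, applied to a random weight matrix rather than actual ALBERT weights.

```python
import numpy as np

def quantize_weights(w: np.ndarray, n_bits: int) -> np.ndarray:
    """Uniform symmetric per-tensor quantization of a weight matrix to n_bits."""
    levels = 2 ** (n_bits - 1) - 1            # e.g. +/-7 integer levels for 4-bit
    scale = np.max(np.abs(w)) / levels
    return np.clip(np.round(w / scale), -levels, levels) * scale

# Illustrative sweep over the precisions shown in panel e,
# using a random Gaussian weight matrix as a stand-in.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(768, 768)).astype(np.float32)
for bits in (2, 3, 4, 5, 6):
    w_q = quantize_weights(w, bits)
    rel_err = np.sqrt(np.mean((w - w_q) ** 2)) / np.sqrt(np.mean(w ** 2))
    print(f"{bits}-bit: relative RMS quantization error = {rel_err:.3f}")
```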