Figure 2
From: Gun identification from gunshot audios for secure public places using transformer learning

The figure shows an image divided into patches of size \(32\times 32\). The outputs of the linear projection layer are combined with positional embeddings and a learnable class embedding for classification. The transformer encoder diagram was derived from the work of Vaswani et al.43.
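
The input stage described in this caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The only value taken from the caption is the \(32\times 32\) patch size. The \(224\times 224\times 3\) input, the embedding width of 64, the random initial values, and the names `patchify` and `embed_patches` are assumptions made for illustration.

```python
import numpy as np

def patchify(image, patch=32):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, read left to right, top to bottom.
    return x.reshape(gh * gw, patch * patch * c)

def embed_patches(image, w_proj, cls_token, pos_embed, patch=32):
    """Linear projection, then prepend the class token and add positions."""
    tokens = patchify(image, patch) @ w_proj                # (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens])   # (N + 1, D)
    return tokens + pos_embed                               # (N + 1, D)

# Assumed sizes: 224x224 RGB input and embedding width 64 (hypothetical).
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
dim = 64
num_patches = (224 // 32) ** 2                              # 7 * 7 = 49

# In a trained model these three arrays are learned parameters.
w_proj = rng.normal(scale=0.02, size=(32 * 32 * 3, dim))
cls_token = np.zeros(dim)
pos_embed = rng.normal(scale=0.02, size=(num_patches + 1, dim))

tokens = embed_patches(image, w_proj, cls_token, pos_embed)
print(tokens.shape)
```

The resulting sequence has one more row than the number of patches, because the class embedding is prepended before the positional embeddings are added. The row at index 0 is the one a classification head would read after the encoder.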