Figure 3
From: Gun identification from gunshot audios for secure public places using transformer learning

The diagram shows the VIT-32 model architecture, which contains 24 transformer encoder blocks. A single encoder block is shown in detail in Fig. 2. The arrows indicate the direction of forward propagation.
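
As a rough illustration of the stacked structure described in the caption, the following minimal sketch passes a token sequence through 24 blocks in order. Only the block count comes from the caption. The patch size of 32 (read from the model name), the 224-pixel input resolution, the extra class token, the toy embedding width, and the names `num_tokens`, `placeholder_block`, and `forward` are assumptions made for illustration. The placeholder block stands in for the encoder block of Fig. 2 and performs no attention or MLP computation.

```python
from typing import Callable, List

NUM_BLOCKS = 24   # stated in the caption
PATCH_SIZE = 32   # assumption: the "32" in the model name is the patch size
IMAGE_SIZE = 224  # assumption: input resolution, not given in the caption

Tokens = List[List[float]]


def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of patches per image plus one class token (assumed)."""
    per_side = image_size // patch_size
    return per_side * per_side + 1


def placeholder_block(tokens: Tokens) -> Tokens:
    """Hypothetical stand-in for one encoder block; keeps the shape unchanged."""
    return [row[:] for row in tokens]


def forward(tokens: Tokens, blocks: List[Callable[[Tokens], Tokens]]) -> Tokens:
    """Apply the blocks one after another, following the arrows in the figure."""
    for block in blocks:
        tokens = block(tokens)
    return tokens


blocks = [placeholder_block] * NUM_BLOCKS
# Toy embedding width of 4, chosen only to keep the example small.
tokens = [[0.0] * 4 for _ in range(num_tokens(IMAGE_SIZE, PATCH_SIZE))]
out = forward(tokens, blocks)
print(len(blocks), len(out), len(out[0]))  # prints: 24 50 4
```

Under these assumptions a 224 x 224 input yields 7 x 7 = 49 patches plus one class token, and the sequence length of 50 is preserved through all 24 blocks.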