Fig. 4

The clustered Transformer applies k-means clustering to the rows of the Query matrix Q, computes attention scores only for the cluster centroids, and then broadcasts each centroid's attention scores to every query in its cluster. The full attention scores are thus approximated from a small number of cluster centroids. The yellow, green, red, and blue colours of the Q matrix in the figure indicate the cluster to which each sample (row) belongs.
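
The mechanism in the figure can be sketched in a few lines of NumPy. This is a minimal illustration only, not the reference implementation: it assumes plain Lloyd's k-means on the raw queries, a single attention head, and scaled dot-product scores, and the helper names `kmeans` and `clustered_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kmeans(X, n_clusters, n_iter=10, seed=0):
    # Plain Lloyd's k-means (an assumption; the caption only says "k-means").
    rng = np.random.default_rng(seed)
    # Fancy indexing copies the rows, so X itself is never modified.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[c] = members.mean(axis=0)
    return labels, centroids

def clustered_attention(Q, K, V, n_clusters=4):
    # 1. Cluster the queries (the coloured rows of Q in the figure).
    labels, centroids = kmeans(Q, n_clusters)
    d = Q.shape[-1]
    # 2. Attention scores for the centroids only: shape (n_clusters, n_keys).
    A_c = softmax(centroids @ K.T / np.sqrt(d))
    # 3. Broadcast each centroid's scores to every query in its cluster.
    A = A_c[labels]  # shape (n_queries, n_keys)
    # Equivalent to A @ V, but multiplies by V once per cluster.
    out = (A_c @ V)[labels]
    return out, A, labels

rng = np.random.default_rng(0)
Q = rng.normal(size=(16, 8))
K = rng.normal(size=(20, 8))
V = rng.normal(size=(20, 5))
out, A, labels = clustered_attention(Q, K, V, n_clusters=4)
print(out.shape, A.shape)
```

Because all queries in a cluster share their centroid's row of scores, the score computation scales with the number of clusters rather than the number of queries. In this sketch, setting `n_clusters` equal to the number of (distinct) queries recovers full attention.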