Fig. 2
From: Fundamentals for predicting transcriptional regulations from DNA sequence patterns

Conceptual representation of self-attention. A Embedding from the original input sequence {x1, …, xN} to {h1, …, hN}. This embedding is trainable and is required before the first self-attention procedure. B Preparing the queries {q1, …, qN}, keys {k1, …, kN}, and values {v1, …, vN} from the embedded vectors, where the matrices Wq, Wk, and Wv contain trainable parameters. Two matrices drawn side by side denote matrix multiplication (the same applies hereafter). In the second and subsequent self-attention procedures, the output of the previous self-attention procedure can be used in place of the embedded matrix shown. C Calculating the attention weights for query i (qi). A vector (left) aligned with a matrix (right) likewise denotes matrix multiplication. The dot indicates the dot product. D Calculating the weighted value vector for query i from the attention weights and the values. E Weighted values for all queries (i = 1, …, N). The square brackets indicate that the collection of vectors (box) is treated as a matrix.
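As a concrete complement to panels A–E, the following is a minimal NumPy sketch of the depicted procedure. The dimensions (N, d_model, d_k), the random stand-ins for the trainable matrices, and the softmax normalization with the conventional 1/√d_k scaling are illustrative assumptions, not details taken from the figure, which shows only the dot products.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: sequence length N, embedding and key/query dimensions
N, d_model, d_k = 5, 8, 4

# A: embedded input vectors h_1, ..., h_N (trainable in practice; random here)
H = rng.normal(size=(N, d_model))

# B: trainable projection matrices Wq, Wk, Wv (random stand-ins)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))
Q, K, V = H @ W_q, H @ W_k, H @ W_v   # queries, keys, values

# C: attention weights from the dot product of each query with every key,
#    here scaled by 1/sqrt(d_k) and softmax-normalized per Transformer convention
scores = Q @ K.T / np.sqrt(d_k)                         # (N, N) dot products
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)           # rows sum to 1

# D/E: weighted sum of the value vectors for every query i = 1, ..., N,
#      collected into a single (N, d_k) matrix as in panel E
output = weights @ V
print(output.shape)  # (5, 4)
```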