
I have a more intuitive explanation of Q and K.

Q (Query) is like a search query. K (Key) is like a set of tags or attributes of each word that the query can look for.

Imagine the self-attention scores for the sentence "The chicken crossed the road because it wanted to get to the other side."

Let's look specifically at the word "it". We can think of the vector Q_it as a representation of the attributes of words we want "it" to gain context from. For example, "it" is a pronoun, so it gains context from nouns; the vector Q_it might therefore carry some representation of "noun".

In other words, one of the goals during the training process of a transformer network is to train a weight matrix W_q that maps a word to a vector representing the attributes of words it gains context from. So W_q should map any word that's a pronoun to a query that looks for nouns.

Similarly, K can be thought of as a representation of the attributes of each word itself. So K_chicken should also have a "noun" tag. Another goal during training, then, is to train W_k, which maps the latent-space representation of a word (chicken) to a vector representation (K_chicken) of its important attributes (noun).
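
To make this concrete, here is a minimal NumPy sketch of the two projections. Everything in it (the toy dimensions, the random embeddings standing in for "it" and "chicken", the random weights) is made up for illustration; trained weights would have actual structure:

  import numpy as np

  d_model, d_k = 8, 4          # toy dimensions, chosen arbitrarily
  rng = np.random.default_rng(0)

  # Hypothetical latent-space embeddings for two words in the sentence.
  x_it      = rng.normal(size=d_model)   # representation of "it"
  x_chicken = rng.normal(size=d_model)   # representation of "chicken"

  # Learned projection weights (random here; training would shape them
  # so that, e.g., pronouns map to queries that look for nouns).
  W_q = rng.normal(size=(d_model, d_k))
  W_k = rng.normal(size=(d_model, d_k))

  q_it      = x_it @ W_q        # query vector: what "it" is looking for
  k_chicken = x_chicken @ W_k   # key vector: what "chicken" offers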

When we take the dot product of Q and K, what we're finding is the similarity between those two vectors. In other words, Q_it * K_chicken should have a high value, because "it" is looking for nouns in its query, and "chicken" is a noun.
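
Continuing the sketch above, the raw attention score is just that dot product, scaled by the square root of the key dimension as in standard scaled dot-product attention, then softmax-normalized across all words (the second key here is a dummy stand-in for another word):

  # Score between "it" and "chicken": high when the query of "it"
  # aligns with the key of "chicken" (e.g. both encode "noun").
  score = q_it @ k_chicken / np.sqrt(d_k)

  # In a full layer, "it" gets one score per word in the sentence,
  # softmax-normalized into attention weights:
  keys = np.stack([k_chicken, rng.normal(size=d_k)])  # chicken + a dummy word
  scores = q_it @ keys.T / np.sqrt(d_k)
  weights = np.exp(scores) / np.exp(scores).sum()     # softmax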

Obviously this is a very human-centric explanation, and how the weights W_q and W_k end up organized in practice may not align perfectly with human-interpretable concepts, but hopefully it helps with understanding.


