Transformer Self-Attention


Let’s say we’re building a transformer-based model for text generation, and we want to understand how it computes self-attention at each layer.

How does the model mathematically compute self-attention using queries Q, keys K, and values V, and why do we need to apply a masking layer in the decoder when training on sequences?
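
For reference, the standard scaled dot-product formulation (from "Attention Is All You Need") is Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, and the decoder's causal (look-ahead) mask sets attention scores to future positions to −∞ before the softmax so that, when training on a full sequence in parallel, position i cannot peek at the tokens it is being trained to predict. Below is a minimal NumPy sketch of masked single-head self-attention; the function names, shapes, and random projection matrices are illustrative only, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V, causal=True):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    With causal=True, position i may only attend to positions <= i,
    which is the decoder mask used when training on whole sequences.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) similarity scores
    if causal:
        # Upper-triangular entries (future positions) get -inf,
        # so they receive zero weight after the softmax.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)          # attention weights, each row sums to 1
    return weights @ V                          # weighted sum of value vectors

# Toy usage: 4 tokens with hypothetical 8-dimensional projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                     # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = masked_self_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)                                # (4, 8)
```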
