Transformer Self-Attention
Let’s say we’re building a transformer-based model for text generation, and we want to understand how it computes self-attention at each layer.
How does the model mathematically compute self-attention using queries, keys, and values, and why do we need to apply a masking layer in the decoder when training on sequences?
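As a reference point for an answer, the standard formulation is scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where the causal (look-ahead) mask in the decoder zeroes out attention to future positions so the model cannot peek at tokens it is supposed to predict. Below is a minimal NumPy sketch of this, assuming single-head attention and toy dimensions (4 tokens, d_model = 8); the function name and the projection matrices W_q, W_k, W_v are illustrative choices, not something specified in the question.

import numpy as np

def scaled_dot_product_attention(Q, K, V, causal_mask=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K have shape (seq_len, d_k); V has shape (seq_len, d_v).
    If causal_mask is True, position i may only attend to positions <= i,
    which is what the decoder needs during training so it cannot see
    future tokens in the target sequence.
    """
    d_k = K.shape[-1]
    # Raw attention scores: similarity between each query and every key,
    # scaled by sqrt(d_k) to keep the softmax in a well-behaved range.
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len)

    if causal_mask:
        # Strictly upper-triangular entries correspond to future positions;
        # setting them to -inf makes their softmax weight exactly 0.
        seq_len = scores.shape[0]
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)

    # Row-wise softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each output position is a weighted average of the value vectors.
    return weights @ V

# Toy usage: project token embeddings into Q, K, V with learned-style matrices.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                         # 4 token embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
out = scaled_dot_product_attention(Q, K, V, causal_mask=True)
print(out.shape)                                    # (4, 8)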