Transformer Self-Attention


Let’s say we’re building a transformer-based model for text generation, and we want to understand how it computes self-attention at each layer.

How does the model mathematically compute self-attention using queries Q, keys K, and values V, and why do we need to apply a masking layer in the decoder when training on sequences?
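
For reference, the standard scaled dot-product formulation (from "Attention Is All You Need") is Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, and the decoder's causal (look-ahead) mask sets attention scores to future positions to −∞ before the softmax so that, when training on a full sequence in parallel, position i cannot peek at the tokens it is being trained to predict. Below is a minimal NumPy sketch of masked single-head self-attention; the function names, shapes, and random projection matrices are illustrative only, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V, causal=True):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    With causal=True, position i may only attend to positions <= i,
    which is the decoder mask used when training on whole sequences.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) similarity scores
    if causal:
        # Upper-triangular entries (future positions) get -inf,
        # so they receive zero weight after the softmax.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)          # attention weights, each row sums to 1
    return weights @ V                          # weighted sum of value vectors

# Toy usage: 4 tokens with hypothetical 8-dimensional projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                     # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = masked_self_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)                                # (4, 8)
```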
