present an efficient way of calculating this embedding in decoder blocks, in
which the self-attention is causal. In this article, the approach is generalized
to any attention mechanism, be it self- or cross-attention, full or causal.

The classical attention is formalized as follows:

$$
A = \text{softmax}\left( \frac{QK^{T}}{\sqrt{d_h}} \right) V
$$

where $$K$$, $$V$$, and $$Q$$ are the keys, values, and queries, respectively,
and $$d_h$$ is the dimensionality of the attention heads. The relative attention,
on the other hand, gains one additional term in the numerator:

$$
A = \text{softmax}\left( \frac{QK^{T} + S_\text{rel}}{\sqrt{d_h}} \right) V.
$$
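
For concreteness, below is a minimal NumPy sketch of the two formulas, assuming
the relative logits $$S_\text{rel}$$ have already been computed elsewhere, for
example by the procedure of Huang et al. (2018). The function and variable names
are illustrative, not part of any particular library.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row-wise maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    x = np.exp(x)
    return x / x.sum(axis=axis, keepdims=True)


def attention(Q, K, V, S_rel=None):
    # Q has shape (length_q, d_h); K and V have shape (length_k, d_h);
    # S_rel, if given, has shape (length_q, length_k).
    d_h = Q.shape[-1]
    logits = Q @ K.T
    if S_rel is not None:
        # The relative variant adds S_rel to the numerator before scaling.
        logits = logits + S_rel
    return softmax(logits / np.sqrt(d_h)) @ V
```

Omitting `S_rel` recovers the classical attention, while supplying it yields the
relative variant.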

The illustration below, taken from Huang et al. (2018), compares the two
techniques:

![](/assets/images/2024-02-01-relative-position/huang.jpeg)

# References

* Huang et al., “[Music transformer: Generating music with long-term structure](https://arxiv.org/abs/1809.04281),” 2018.