Commit

Give a functional summary
IvanUkhov committed Jan 16, 2024
1 parent f217a9a commit f411e4e
Showing 1 changed file with 24 additions and 3 deletions.
27 changes: 24 additions & 3 deletions _drafts/2024-02-01-relative-positional-embedding.md
@@ -36,7 +36,7 @@ _output_ sequence.
The relative attention obtains one additional term in the numerator:

$$
-A = \text{softmax}\left( \frac{QK^T + S}{\sqrt{d_h}} \right) V.
+A = \text{softmax}\left( \frac{QK^T + S}{\sqrt{d_h}} \right) V. \tag{1}
$$

In the above, $$S$$ is of shape $$n_s \times n_h \times n_{t_2} \times n_{t_1}$$
@@ -87,7 +87,8 @@ $$0$$ through $$n_{t_3} - 1$$ inclusively, while the last (position $$n_{t_3} -

Similarly to Huang et al. (2018), we note that multiplying $$Q$$ by $$E$$
results in a matrix that contains all the inner products necessary for
-assembling $$S$$ in the general case. For $$t_3 = 4$$, it is as follows:
+assembling $$S$$ in the general case. For $$t_3 = 4$$ and dropping the batch and
+head dimensions for clearer visualization, the product is as follows:

$$
QE = \left(
@@ -123,7 +124,7 @@ s_{0 + 3} & s_{1 + 2} & s_{2 + 1} & s_{3 + 0} \\
\right)
$$

-and transpose the result
+and then transpose the result

$$
S = \left(
@@ -136,6 +137,26 @@ s_{3 - 3} & s_{3 - 2} & s_{3 - 1} & s_{3 + 0} \\
\right).
$$

+More generally, the algorithm can be summarized as follows:
+
+$$
+S = \text{transpose}\left(
+\text{stack-diagonals}\left(
+QE, \, 0, \, n_{t_3} - 1
+\right)
+\right)
+$$
+
+where $$\text{stack-diagonals}$$ is a function taking a tensor and stacking its
+diagonals specified by a range with two offsets relative to the main diagonal
+from bottom up, and $$\text{transpose}$$ is a function taking a tensor and
+permuting its last two dimensions.
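
The two functions can be sketched in plain Python for a single square matrix, with the batch and head dimensions dropped. This is only an illustration under stated assumptions, not the implementation behind the post: the names `stack_diagonals` and `transpose` are hypothetical, "from bottom up" is read as placing the main diagonal in the last row, and shorter diagonals are assumed to be padded with zeros on the left so that the last row of the result is fully populated, as in the 4-by-4 illustration.

```python
def stack_diagonals(matrix, low, high):
    """Stack the diagonals with offsets low..high (inclusive) as rows.

    Assumptions of this sketch: the matrix is square, the offsets are
    non-negative (main diagonal and superdiagonals), the highest offset
    goes to the top row so that the lowest one ends up at the bottom, and
    shorter diagonals are left-padded with zeros.
    """
    size = len(matrix)
    rows = []
    for offset in range(high, low - 1, -1):
        diagonal = [matrix[i][i + offset] for i in range(size - offset)]
        rows.append([0] * (size - len(diagonal)) + diagonal)
    return rows


def transpose(matrix):
    """Swap the two dimensions of a rectangular list of lists."""
    return [list(column) for column in zip(*matrix)]


# A toy 4-by-4 stand-in for QE whose entry (i, j) is the label "ij".
n = 4
qe = [[f"{i}{j}" for j in range(n)] for i in range(n)]

stacked = stack_diagonals(qe, 0, n - 1)
s = transpose(stacked)

for row in s:
    print(row)
# [0, 0, 0, '00']
# [0, 0, '01', '11']
# [0, '02', '12', '22']
# ['03', '13', '23', '33']
```

Using labels instead of numbers makes it easy to trace where each entry of the result comes from. With a tensor library, the same operation would typically be applied to the last two dimensions of a batched tensor rather than to nested lists.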

+The matrix can then be plugged into Equation (1) to complete the calculation. In
+case the queries are shorter than the keys and values, which is what happens in
+the prediction mode, $$S$$ will have the right number of rows, but its last
+columns will be superfluous and hence have to be discarded.
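
As a rough illustration of this last step, the sketch below evaluates Equation (1) for a single head in plain Python and trims $$S$$ before adding it to $$QK^T$$. The function names are hypothetical, and it is an assumption of the sketch that the surplus columns are exactly those beyond the number of keys.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax of one row."""
    peak = max(row)
    weights = [math.exp(value - peak) for value in row]
    total = sum(weights)
    return [weight / total for weight in weights]


def relative_attention(q, k, v, s):
    """Evaluate softmax((Q K^T + S) / sqrt(d_h)) V for one head.

    Assumption: S has one row per query and at least one column per key;
    any columns beyond the number of keys are dropped.
    """
    d_h = len(q[0])
    s = [row[:len(k)] for row in s]  # discard the surplus last columns
    scores = matmul(q, [list(column) for column in zip(*k)])  # Q K^T
    logits = [[(x + y) / math.sqrt(d_h) for x, y in zip(score_row, s_row)]
              for score_row, s_row in zip(scores, s)]
    return matmul([softmax(row) for row in logits], v)


# One query attending to three keys; S arrives with a fourth, surplus column.
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [3.0]]
s = [[1.0, 0.0, 0.0, 9.0]]

out = relative_attention(q, k, v, s)
print(out)
```

The surplus entry has no effect on the result: passing `[[1.0, 0.0, 0.0]]` instead of `s` yields the same output.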

# References

* Huang et al., “[Music transformer: Generating music with long-term
