blocks, in which the self-attention is causal. In this article, the approach is
generalized to any attention mechanism, be it self- or cross-attention, full or
causal.

# Background

The classical attention is formalized as follows:

$$
\text{attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_h}}\right) V
$$

where $$Q$$, $$K$$, and $$V$$ are the matrices of queries, keys, and values,
respectively, and $$d_h$$ is the dimensionality of the attention head. Relative
positional information is injected by augmenting the scores $$QK^T$$ with a
matrix $$S$$ built from a matrix of relative positional embeddings $$E$$, under
the interpretation that the embeddings are running from position $$-n_{t_1} +
1$$ (the farthest into the past) to position $$0$$ (the present), with
$$n_{t_1}$$ being the length of the input sequence. The matrix $$S$$ is an
arrangement of the inner products between the queries in $$Q$$ and the
embeddings in $$E$$ so as to respect the arrangement in $$QK^T$$.
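
For concreteness, here is a minimal NumPy sketch of the classical formulation
above; the function name and the omission of masking and batching are
illustrative simplifications rather than part of the original formulation.

```python
import numpy as np

def attention(Q, K, V):
    # Classical scaled dot-product attention: softmax(Q K^T / sqrt(d_h)) V.
    d_h = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_h)                        # (n_q, n_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # (n_q, d_v)
```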

The original and the more memory-efficient calculations of $$S$$ in the case of
causal attention are compared in the illustration below, which is taken from
Huang et al. (2018).

![](/assets/images/2024-02-01-relative-position/huang.jpeg)

The matrix on the far right shows how $$S$$ is arranged. Since it is for causal
attention, the upper triangle above the main diagonal (gray circles) is
irrelevant and can contain arbitrary values, which it does in the algorithm
proposed in Huang et al. (2018). The main diagonal (green circles) contains the
inner products of the queries and the embedding corresponding to position $$0$$.
The first subdiagonal (pink circles) contains the inner products of all queries
but the first one, which has no past, and the embedding corresponding to
position $$-1$$. It continues in this way down to $$-n_{t_1} + 1$$, in which
case only the last query is involved, since it comes last in the sequence and
has the longest past.
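
One way to obtain this arrangement is the padding-and-reshaping trick commonly
used to implement the algorithm of Huang et al. (2018). The following NumPy
sketch assumes that convention, including that the last column of $$QE^T$$
corresponds to relative position $$0$$; these steps are an assumption here, not
taken from the text above.

```python
import numpy as np

def skew(scores):
    # scores = Q @ E.T of shape (n, n), where column j holds the inner products
    # with the embedding for relative position j - (n - 1): the last column
    # corresponds to position 0 and the first to -(n - 1).
    n = scores.shape[0]
    padded = np.pad(scores, ((0, 0), (1, 0)))  # prepend one dummy column
    reshaped = padded.reshape(n + 1, n)        # shifts each row one step further
    return reshaped[1:]                        # S, with an arbitrary upper triangle
```

In the causal case, the masked softmax discards the upper triangle, so the
arbitrary values it ends up holding are harmless.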

The memory-efficient calculation given in Huang et al. (2018) is limited to
self-attention with causal connectivity, which is what is found in decoder
blocks. It is not suitable for other attention patterns and hence cannot be used
in, for instance, encoder blocks or decoder blocks with cross-attention, which
usually have non-causal attention. In what follows, this limitation is lifted.

# Algorithm

Let us extend $$E$$ to be of shape $$d_h \times (2 n_{t_3} - 1)$$ so that it has
an embedding for any relative position not only in the past but also in the
future, with $$n_{t_3}$$ being the maximum relative distance as before. Let us
also interpret $$E$$'s columns as running from position $$n_{t_3} - 1$$
(farthest into the future) to position $$-n_{t_3} + 1$$ (farthest into the
past). For instance, when the output sequence is of length $$n_{t_3}$$ (the
longest possible), the first query (position $$0$$) will be “interested” only in
columns $$0$$ through $$n_{t_3} - 1$$ inclusive, while the last one (position
$$n_{t_3} - 1$$) only in columns $$n_{t_3} - 1$$ through $$2 n_{t_3} - 2$$
inclusive.
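
To make the column convention concrete, the following sketch tabulates the
mapping from relative shifts to columns, with `n` standing for $$n_{t_3}$$ and
the value of $$4$$ chosen only for illustration:

```python
n = 4                                        # n_{t_3}, the maximum relative distance
shifts = range(n - 1, -n, -1)                # 3, 2, 1, 0, -1, -2, -3
column = {t: n - 1 - t for t in shifts}      # relative shift -> column of E

# A query at position p in an output of length n sees relative shifts from -p
# up to n - 1 - p, that is, columns p through p + n - 1 of E.
print(column[n - 1], column[0], column[1 - n])   # 0 3 6
```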

Similarly to Huang et al. (2018), we note that multiplying $$Q$$ by $$E$$
results in a matrix that contains all the inner products necessary for
assembling $$S$$ in the general case. For $$n_{t_3} = 4$$, it is as follows,
with the entries that are not needed left blank:

$$
QE = \left(
\begin{matrix}
s_{0 + 3} & s_{0 + 2} & s_{0 + 1} & s_{0 + 0} & & & \\
& s_{1 + 2} & s_{1 + 1} & s_{1 + 0} & s_{1 - 1} & & \\
& & s_{2 + 1} & s_{2 + 0} & s_{2 - 1} & s_{2 - 2} & \\
& & & s_{3 + 0} & s_{3 - 1} & s_{3 - 2} & s_{3 - 3} \\
\end{matrix}
\right)
$$

where $$s_{i + t}$$ denotes query $$i$$ looking at relative time $$t$$, that is,
the inner product between the query at position $$i$$ and the embedding
corresponding to a relative attention shift of $$t$$, which is stored in column
$$n_{t_3} - 1 - t$$ of $$E$$. For instance, for $$s_{2 - 1}$$ with $$n_{t_3} =
4$$ still, the inner product is between row $$2$$ of $$Q$$ and column
$$4 - 1 - (-1) = 4$$ of $$E$$.
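
As a quick check of this indexing, consider the following NumPy sketch with
randomly generated matrices of illustrative sizes:

```python
import numpy as np

n, d_h = 4, 8                                # n stands for n_{t_3}
generator = np.random.default_rng(0)
Q = generator.normal(size=(n, d_h))          # one query per position
E = generator.normal(size=(d_h, 2 * n - 1))  # columns for shifts 3, 2, 1, 0, -1, -2, -3

QE = Q @ E

def s(i, t):
    # Inner product between query i and the embedding for relative shift t.
    return Q[i] @ E[:, n - 1 - t]

# Entry (i, j) of QE corresponds to shift n - 1 - j; in particular, s_{2 - 1}
# sits in column 4 - 1 - (-1) = 4.
assert np.isclose(QE[2, 4], s(2, -1))
```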

The target arrangement is then simply the one where we stack the “interesting”
diagonals of $$QE$$ on top of each other, from diagonal $$0$$ (the main
diagonal) at the bottom to diagonal $$n_{t_3} - 1$$ (the rightmost relevant
superdiagonal) at the top

$$
\bar{S} = \left(
\begin{matrix}
s_{0 + 0} & s_{1 - 1} & s_{2 - 2} & s_{3 - 3} \\
s_{0 + 1} & s_{1 + 0} & s_{2 - 1} & s_{3 - 2} \\
s_{0 + 2} & s_{1 + 1} & s_{2 + 0} & s_{3 - 1} \\
s_{0 + 3} & s_{1 + 2} & s_{2 + 1} & s_{3 + 0} \\
\end{matrix}
\right)
$$

and transpose the result

$$
S = \left(
\begin{matrix}
s_{0 + 0} & s_{0 + 1} & s_{0 + 2} & s_{0 + 3} \\
s_{1 - 1} & s_{1 + 0} & s_{1 + 1} & s_{1 + 2} \\
s_{2 - 2} & s_{2 - 1} & s_{2 + 0} & s_{2 + 1} \\
s_{3 - 3} & s_{3 - 2} & s_{3 - 1} & s_{3 + 0} \\
\end{matrix}
\right).
$$
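
For concreteness, the construction can be written down directly. The following
NumPy sketch mirrors the description for the square case illustrated above, with
randomly generated matrices; it extracts the diagonals explicitly rather than
implementing the memory-efficient rearrangement.

```python
import numpy as np

n, d_h = 4, 8                                # n stands for n_{t_3}
generator = np.random.default_rng(0)
Q = generator.normal(size=(n, d_h))          # one query per output position
E = generator.normal(size=(d_h, 2 * n - 1))  # shifts from n - 1 (future) to 1 - n (past)

QE = Q @ E                                   # all the needed inner products

# Stack the relevant diagonals of QE, from diagonal n - 1 at the top down to the
# main diagonal at the bottom, and transpose to obtain the target arrangement.
S_bar = np.stack([np.diagonal(QE, offset=k) for k in range(n - 1, -1, -1)])
S = S_bar.T

# Sanity check against the definition: S[i, j] is the inner product between
# query i and the embedding for relative shift j - i.
for i in range(n):
    for j in range(n):
        assert np.isclose(S[i, j], Q[i] @ E[:, n - 1 - (j - i)])
```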

# References

Huang et al., “Music Transformer,” 2018.
