Symplectic Attention
There are two ways of constructing symplectic attention: with a matrix-like softmax and a classical softmax.
Both approaches are, however, based on taking the derivative of an expression with respect to its input arguments (similar to gradient layers).
We first treat single-head attention[1], starting from the correlation matrix:
\[C = Z^TAZ \implies C_{mn} = \sum_{k\ell}Z_{km}A_{k\ell}Z_{\ell{}n}.\]
Its element-wise derivative is:
\[\frac{\partial{}C_{mn}}{\partial{}Z_{ij}} = \sum_{\ell}(\delta_{jm}A_{i\ell}Z_{\ell{}n} + \delta_{jn}Z_{\ell{}m}A_{\ell{}i}).\]
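This identity can be checked numerically; the following is a minimal sketch in plain Julia (arbitrary dimensions and indices, not library code) comparing the formula with a central finite difference.
```julia
# Minimal numerical check of ∂C_{mn}/∂Z_{ij}; dimensions and indices are arbitrary choices.
d, T = 4, 3                    # feature dimension and sequence length
A = randn(d, d)
Z = randn(d, T)

C(Z) = Z' * A * Z              # correlation matrix

i, j, m, n = 2, 2, 3, 2        # entry of Z to perturb and entry of C to inspect

# Analytic expression: δ_{jm} [A Z]_{in} + δ_{jn} [Aᵀ Z]_{im}
analytic = (j == m) * (A * Z)[i, n] + (j == n) * (A' * Z)[i, m]

# Central finite difference of C_{mn} with respect to Z_{ij}
h = 1e-6
E = zeros(d, T); E[i, j] = h
numeric = (C(Z + E)[m, n] - C(Z - E)[m, n]) / (2h)

@assert isapprox(analytic, numeric; atol = 1e-6)
```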
Matrix Softmax
Here we take as a starting point the expression:
\[\Sigma(Z) = \mathrm{log}\left(1 + \sum_{m,n}\exp(C_{mn})\right).\]
Its gradient (with respect to $Z$) is:
\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = \frac{1}{1 + \sum_{m,n}\exp(C_{mn})}\sum_{m'n'}\exp(C_{m'n'})\sum_{\ell}(\delta_{jm'}A_{i\ell}Z_{\ell{}n'} + \delta_{jn'}Z_{\ell{}m'}A_{\ell{}i}) = \frac{1}{1 + \sum_{m,n}\exp(C_{mn})}\{[AZ\exp.(C)^T]_{ij} + [A^TZ\exp.(C)]_{ij}\}.\]
Note that if A is a SymmetricMatrix, then C is symmetric as well and the expression simplifies to:
\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = 2\frac{1}{1 + \sum_{m,n}\exp(C_{mn})}[AZ\exp.(C)]_{ij},\]
or written in matrix notation:
\[\nabla_Z\Sigma(Z) = 2\frac{1}{1 + \sum_{m,n}\exp(C_{mn})}AZ\exp.(C).\]
Whether we use a SymmetricMatrix for $A$ or not can be set with the keyword symmetric in SymplecticAttention.
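As a sanity check for the matrix-softmax gradient, here is a short sketch in plain Julia (ad hoc names, assuming a symmetric $A$; not library code) that evaluates the matrix form derived above and compares one entry against a finite difference of $\Sigma(Z)$.
```julia
# Check ∇_Z Σ(Z) = 2 / (1 + Σ exp(C)) ⋅ A Z exp.(C) for a symmetric A.
d, T = 4, 3
A = let B = randn(d, d); (B + B') / 2 end     # symmetric weighting matrix
Z = randn(d, T)

Σ(Z) = log(1 + sum(exp.(Z' * A * Z)))         # matrix-softmax potential

function gradΣ(Z)
    C = Z' * A * Z
    (2 / (1 + sum(exp.(C)))) * A * Z * exp.(C)
end

# compare one entry against a central finite difference
i, j = 2, 3
h = 1e-6
E = zeros(d, T); E[i, j] = h
numeric = (Σ(Z + E) - Σ(Z - E)) / (2h)

@assert isapprox(gradΣ(Z)[i, j], numeric; atol = 1e-4)
```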
Vector Softmax
Here we take as a starting point the expression:
\[\Sigma(Z) = \sum_{n}\mathrm{log}\left(1 + \sum_{m}\exp(C_{mn})\right).\]
We then get:
\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = \sum_n\frac{\exp(C_{jn})}{1 + \sum_m\exp(C_{mn})}[AZ]_{in} + \frac{1}{1 + \sum_m\exp(C_{mj})}[A^TZ\exp.(C)]_{ij}.\]
The second term in this expression is equivalent to a standard attention step:
\[\mathrm{TermII:}\qquad A^TZ\mathrm{softmax}(C).\]
The first term is equivalent to:
\[\mathrm{TermI:}\qquad \sum_n [AZ]_{in}[\mathrm{softmax}(C)^T]_{nj} \equiv AZ(\mathrm{softmax}(C))^T.\]
If we again assume that the matrix A is a SymmetricMatrix, then the expression simplifies to:
\[\nabla_Z\Sigma(Z) = AZ\mathrm{softmax}(C).\]
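The expression before the simplification, i.e. the sum of Term I and Term II, can be checked with an analogous sketch in plain Julia (ad hoc names; note that the column-wise softmax here keeps the additional 1 in the denominator, as in the derivation above).
```julia
# Check the vector-softmax gradient as Term I + Term II (general, not necessarily symmetric A).
d, T = 4, 3
A = randn(d, d)
Z = randn(d, T)

Σ(Z) = sum(log.(1 .+ sum(exp.(Z' * A * Z); dims = 1)))   # vector-softmax potential

# column-wise softmax with the extra 1 in the denominator (as in the derivation above)
softmax1(C) = exp.(C) ./ (1 .+ sum(exp.(C); dims = 1))

function gradΣ(Z)
    C = Z' * A * Z
    A * Z * softmax1(C)' + A' * Z * softmax1(C)          # Term I + Term II
end

i, j = 1, 2
h = 1e-6
E = zeros(d, T); E[i, j] = h
numeric = (Σ(Z + E) - Σ(Z - E)) / (2h)

@assert isapprox(gradΣ(Z)[i, j], numeric; atol = 1e-4)
```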
Library Functions
GeometricMachineLearning.VectorSoftmax — Type
VectorSoftmax <: AbstractSoftmax
Turn an arbitrary vector into a probability vector with:
\[[\mathrm{softmax}(a)]_i = \frac{e^{a_i}}{\sum_{i'=1}^d e^{a_{i'}}}.\]
This is what is most often understood under the name "softmax". MatrixSoftmax is the matrix version.
GeometricMachineLearning.MatrixSoftmax — Type
MatrixSoftmax
Like VectorSoftmax but for matrices:
\[[\mathrm{softmax}(A)]_{ij} = \frac{e^{A_{ij}}}{\sum_{i'=1, j'=1}^{d,\bar{d}} e^{A_{i'j'}}}.\]
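For reference, the two formulas above correspond to the following one-line functions in plain Julia (a sketch of the formulas only, not the library implementation).
```julia
# Sketches of the two softmax variants as defined by the formulas above.
vector_softmax(a::AbstractVector) = exp.(a) ./ sum(exp.(a))   # normalizes over all vector entries
matrix_softmax(A::AbstractMatrix) = exp.(A) ./ sum(exp.(A))   # normalizes over all matrix entries
```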
GeometricMachineLearning.SymplecticAttention — Type
SymplecticAttention
Implements the symplectic attention layers. See LinearSymplecticAttention.
Keys
It stores the following key:
activation::AbstractSoftmax
Constructors
See SymplecticAttentionQ and SymplecticAttentionP.
Implementation
SymplecticAttention is similar to MultiHeadAttention or VolumePreservingAttention in that it computes the scalar products of all vectors in a sequence of input vectors:
\[C = Q^TAQ,\]
where $Q$ is the $q$-part of an input $Z$ (see QPT). The matrix $A$ is a weighting that can either be symmetric or skew-symmetric (this can be adjusted with the keyword symmetric::Bool).
Extended help
The symplectic attention mechanism is derived by computing the gradient of a separable Hamiltonian, as is also done in GSympNets.
GeometricMachineLearning.SymplecticAttentionQ — Type
SymplecticAttentionQ
A constant that is derived from SymplecticAttention. This only changes the $q$-part of the input.
Constructor
SymplecticAttentionQ(M; symmetric::Bool, activation)
The defaults for the keywords are true and MatrixSoftmax().
You may want to alter the activation function (either MatrixSoftmax or VectorSoftmax), but it is almost always better to set the keyword symmetric to true.
GeometricMachineLearning.SymplecticAttentionP — Type
SymplecticAttentionP
A constant that is derived from SymplecticAttention. This only changes the $p$-part of the input.
Constructor
SymplecticAttentionP(M; symmetric::Bool, activation)
The defaults for the keywords are true and MatrixSoftmax().
You may want to alter the activation function (either MatrixSoftmax or VectorSoftmax), but it is almost always better to set the keyword symmetric to true.
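A hypothetical construction sketch based on the constructor signatures documented above; the value of M (the dimension argument) and the chosen activations are illustrative assumptions.
```julia
using GeometricMachineLearning

M = 10                                               # dimension argument (assumed value)
layer_q = SymplecticAttentionQ(M)                    # documented defaults: symmetric = true, MatrixSoftmax()
layer_p = SymplecticAttentionP(M; symmetric = true, activation = VectorSoftmax())
```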
[1] The multi-head attention case is in principle not much more difficult, but care has to be taken with regard to the projection spaces.