Symplectic Attention
There are two ways of constructing symplectic attention: with a matrix-like softmax and with a classical (vector) softmax.
Both approaches are, however, based on taking the derivative of an expression with respect to its input arguments (similar to gradient layers).
We first treat single-head attention[1]. Our starting point is the correlation matrix:
\[C = Z^TAZ \implies C_{mn} = \sum_{k\ell}Z_{km}A_{k\ell}Z_{\ell{}n}.\]
Its element-wise derivative is:
\[\frac{\partial{}C_{mn}}{\partial{}Z_{ij}} = \sum_{\ell}(\delta_{jm}A_{i\ell}Z_{\ell{}n} + \delta_{jn}Z_{\ell{}m}A_{\ell{}i}).\]
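As a quick sanity check of this formula, the following plain-Julia sketch (illustrative only; the dimensions and variable names are made up) compares the element-wise derivative with a central finite difference:

```julia
using LinearAlgebra

d, T = 4, 3                       # feature dimension and sequence length (illustrative values)
A = randn(d, d)
Z = randn(d, T)

C(Z) = Z' * A * Z                 # correlation matrix: C[m, n] = Σ_{k,ℓ} Z[k, m] A[k, ℓ] Z[ℓ, n]

# element-wise derivative from the formula above:
# ∂C[m, n]/∂Z[i, j] = δ_{jm} (A * Z)[i, n] + δ_{jn} (A' * Z)[i, m]
dC(Z, i, j, m, n) = (j == m) * (A * Z)[i, n] + (j == n) * (A' * Z)[i, m]

# central finite-difference check for one choice of indices
i, j, m, n = 2, 1, 1, 3
h = 1e-6
E = zeros(d, T); E[i, j] = h
fd = (C(Z + E)[m, n] - C(Z - E)[m, n]) / 2h
@assert isapprox(fd, dC(Z, i, j, m, n); atol = 1e-6)
```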
Matrix Softmax
Here we take as a starting point the expression:
\[\Sigma(Z) = \mathrm{log}\left(1 + \sum_{m,n}\exp(C_{mn})\right).\]
Its gradient (with respect to $Z$) is:
\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = \frac{1}{1 + \sum_{m,n}\exp(C_{mn})}\sum_{m'n'}\exp(C_{m'n'})\sum_{\ell}(\delta_{jm'}A_{i\ell}Z_{\ell{}n'} + \delta_{jn'}Z_{\ell{}m'}A_{\ell{}i}) = \frac{1}{1 + \sum_{m,n}\exp(C_{mn})}\{[AZ\exp.(C)^T]_{ij} + [A^TZ\exp.(C)]_{ij}\}.\]
Note that if $A$ is a SymmetricMatrix, the expression then simplifies to:
\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = 2\frac{1}{1 + \sum_{m,n}\exp(C_{mn})}[AZ\exp.(C)^T]_{ij},\]
or written in matrix notation:
\[\nabla_Z\Sigma(Z) = 2\frac{1}{1 + \sum_{m,n}\exp(C_{mn})}AZ\exp.(C).\]
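The closed-form gradient can be checked numerically. The sketch below is illustrative, not library code; Symmetric is used as a stand-in for the library's SymmetricMatrix, and the dimensions are arbitrary:

```julia
using LinearAlgebra

d, T = 4, 3
A = Symmetric(randn(d, d))        # stand-in for the library's SymmetricMatrix
Z = randn(d, T)

Σ(Z) = log(1 + sum(exp.(Z' * A * Z)))                  # matrix-softmax potential

function gradΣ(Z)                                      # closed-form gradient from above
    C = Z' * A * Z
    return 2 / (1 + sum(exp.(C))) * A * Z * exp.(C)
end

# finite-difference check of a single entry of the gradient
i, j = 3, 2
h = 1e-6
E = zeros(d, T); E[i, j] = h
fd = (Σ(Z + E) - Σ(Z - E)) / 2h
@assert isapprox(fd, gradΣ(Z)[i, j]; atol = 1e-6, rtol = 1e-4)
```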
Whether we use a SymmetricMatrix for $A$ or not can be set with the keyword symmetric in SymplecticAttention.
Vector Softmax
Here we take as a starting point the expression:
\[\Sigma(Z) = \sum_{n}\mathrm{log}\left(1 + \sum_{m}\exp(C_{mn})\right).\]
We then get:
\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = \sum_n\frac{\exp(C_{jn})}{1 + \sum_m\exp(C_{mn})}[AZ]_{in} + \frac{1}{1 + \sum_m\exp(C_{mj})}[A^TZ\exp.(C)]_{ij}.\]
The second term in this expression is equivalent to a standard attention step:
\[\mathrm{TermII:}\qquad A^TZ\mathrm{softmax}_1(C).\]
The first term is equivalent to:
\[\mathrm{TermI:}\qquad \sum_n [AZ]_{in}[\mathrm{softmax}_1(C)^T]_{nj} \equiv AZ(\mathrm{softmax}_1(C))^T.\]
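As with the matrix softmax, the sum of the two terms can be checked against a finite difference of $\Sigma$. The sketch below is illustrative only; softmax1 denotes the column-wise "one softmax" discussed further down, and all names and dimensions are made up:

```julia
using LinearAlgebra

d, T = 4, 3
A = randn(d, d)
Z = randn(d, T)

softmax1(C) = exp.(C) ./ (1 .+ sum(exp.(C); dims = 1))   # "one softmax", applied column-wise

Σ(Z) = sum(log.(1 .+ sum(exp.(Z' * A * Z); dims = 1)))   # vector-softmax potential

function gradΣ(Z)
    S = softmax1(Z' * A * Z)
    return A * Z * S' + A' * Z * S                        # Term I + Term II
end

# finite-difference check of a single entry of the gradient
i, j = 1, 2
h = 1e-6
E = zeros(d, T); E[i, j] = h
fd = (Σ(Z + E) - Σ(Z - E)) / 2h
@assert isapprox(fd, gradΣ(Z)[i, j]; atol = 1e-6, rtol = 1e-4)
```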
If we again assume that the matrix $A$ is a SymmetricMatrix, the two terms combine to:
\[\nabla_Z\Sigma(Z) = AZ\left(\mathrm{softmax}_1(C) + \mathrm{softmax}_1(C)^T\right).\]
Note that we used the "one softmax" here instead of the standard softmax; the one softmax has been shown to compare favorably with the standard softmax in some cases.
A discussion of the one softmax can be found in [95].
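For reference, here is a plain-Julia sketch contrasting the standard column-wise softmax with the one softmax (the helper names are illustrative and not part of the library):

```julia
softmax_cols(C)  = exp.(C) ./ sum(exp.(C); dims = 1)          # standard: columns sum to exactly 1
softmax1_cols(C) = exp.(C) ./ (1 .+ sum(exp.(C); dims = 1))   # "one softmax": columns sum to less than 1

C = [1.0 2.0; 0.5 1.5]
sum(softmax_cols(C);  dims = 1)   # [1.0 1.0]
sum(softmax1_cols(C); dims = 1)   # entries strictly below 1
```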
Library Functions
GeometricMachineLearning.VectorSoftmax — Type
VectorSoftmax <: AbstractSoftmax
Turn an arbitrary vector into a probability vector with:
\[[\mathrm{softmax}(a)]_i = \frac{e^{a_i}}{\sum_{i'=1}^de^{a_{i'}}}. \]
This is what is most often understood under the name "softmax". MatrixSoftmax is the matrix version.
GeometricMachineLearning.MatrixSoftmax — Type
MatrixSoftmax
Like VectorSoftmax but for matrices:
\[[\mathrm{softmax}(A)]_{ij} = \frac{e^{A_{ij}}}{\sum_{i'=1, j'=1}^{d,\bar{d}}e^{A_{i'j'}}}. \]
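A minimal plain-Julia sketch of the two documented formulas (the helper names vector_softmax and matrix_softmax are illustrative and not part of GeometricMachineLearning):

```julia
vector_softmax(a) = exp.(a) ./ sum(exp.(a))   # probability vector: entries sum to 1
matrix_softmax(A) = exp.(A) ./ sum(exp.(A))   # all matrix entries together sum to 1

vector_softmax([1.0, 2.0, 3.0])   # ≈ [0.090, 0.245, 0.665]
sum(matrix_softmax(randn(3, 4)))  # ≈ 1.0
```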
GeometricMachineLearning.SymplecticAttention — Type
SymplecticAttention
Implements the symplectic attention layers. See LinearSymplecticAttention.
Keys
It stores the following key:
activation::AbstractSoftmax
Constructors
See SymplecticAttentionQ and SymplecticAttentionP.
Implementation
SymplecticAttention is similar to MultiHeadAttention or VolumePreservingAttention in that it computes the scalar products of all vectors in a sequence of input vectors:
\[C = Q^TAQ,\]
where $Q$ is the $q$-part of an input $Z$ (see QPT). The matrix $A$ is a weighting that can either be symmetric or skew-symmetric (this can be adjusted with the keyword symmetric::Bool).
Extended help
The symplectic attention mechanism is derived via computing the gradient of a separable Hamiltonian, as is also done in GSympNets.
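To make the connection to GSympNets concrete, the sketch below shows a GSympNet-style shear in which only the $q$-part is changed and the increment is the gradient of the matrix-softmax potential evaluated on the $p$-part. This illustrates the principle only and is not necessarily the exact forward pass of SymplecticAttention; the function name and dimensions are made up.

```julia
using LinearAlgebra

# GSympNet-style shear: (q, p) ↦ (q + ∇ₚΣ(p), p) with the matrix-softmax potential.
# Since the increment depends only on p, the map is symplectic.
function q_shear(q, p, A::Symmetric)
    C = p' * A * p
    return q + 2 / (1 + sum(exp.(C))) * A * p * exp.(C), p
end

d, T = 4, 3
A = Symmetric(randn(d, d))
q, p = randn(d, T), randn(d, T)
q_new, p_new = q_shear(q, p, A)    # p_new == p, only q is updated
```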
GeometricMachineLearning.SymplecticAttentionQ — Type
SymplecticAttentionQ
A constant that is derived from SymplecticAttention. This only changes the $q$-part of the input.
Constructor
SymplecticAttentionQ(M; symmetric::Bool, activation)
The defaults for the keywords are true and MatrixSoftmax().
You may want to alter the activation function (either MatrixSoftmax or VectorSoftmax), but it is almost always better to set the keyword symmetric to true.
GeometricMachineLearning.SymplecticAttentionP — Type
SymplecticAttentionP
A constant that is derived from SymplecticAttention. This only changes the $p$-part of the input.
Constructor
SymplecticAttentionP(M; symmetric::Bool, activation)
The defaults for the keywords are true and MatrixSoftmax().
You may want to alter the activation function (either MatrixSoftmax or VectorSoftmax), but it is almost always better to set the keyword symmetric to true.
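Putting the two constructors together, following the signatures documented above (the feature dimension 4 is an arbitrary illustrative value):

```julia
using GeometricMachineLearning

layer_q = SymplecticAttentionQ(4)                                           # defaults: symmetric = true, MatrixSoftmax()
layer_p = SymplecticAttentionP(4; symmetric = true, activation = VectorSoftmax())
```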
[1] The multi-head attention case is in principle not much more difficult, but care has to be taken with regard to the projection spaces.