Symplectic Attention
There are two ways of constructing symplectic attention: with a matrix softmax and with a classical (vector) softmax.
Both approaches are, however, based on taking the derivative of an expression with respect to its input arguments (similar to gradient layers).
We first treat single-head attention[1], starting from the correlation matrix:
\[C = Z^TAZ \implies C_{mn} = \sum_{k\ell}Z_{km}A_{k\ell}Z_{\ell{}n}.\]
Its element-wise derivative is:
\[\frac{\partial{}C_{mn}}{\partial{}Z_{ij}} = \sum_{\ell}(\delta_{jm}A_{i\ell}Z_{\ell{}n} + \delta_{jn}Z_{\ell{}m}A_{\ell{}i}).\]
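As a quick sanity check, this element-wise derivative can be verified with finite differences. The following sketch uses plain Julia only; all names in it are ad hoc and not part of GeometricMachineLearning.

```julia
using LinearAlgebra

# Finite-difference check of ∂C_{mn}/∂Z_{ij} for C = ZᵀAZ.
d, T = 3, 4                      # feature dimension and sequence length
A, Z = randn(d, d), randn(d, T)
C(Z) = Z' * A * Z

# analytic derivative from the formula above
dC_analytic(Z, i, j, m, n) =
    (j == m) * (A[i, :] ⋅ Z[:, n]) + (j == n) * (Z[:, m] ⋅ A[:, i])

# central finite difference in the (i, j) entry of Z
function dC_fd(Z, i, j, m, n; h = 1e-6)
    E = zeros(size(Z)); E[i, j] = h
    (C(Z + E)[m, n] - C(Z - E)[m, n]) / 2h
end

i, j, m, n = 2, 3, 3, 1
dC_analytic(Z, i, j, m, n) ≈ dC_fd(Z, i, j, m, n)   # should return true
```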
Matrix Softmax
Here we take as a starting point the expression:
\[\Sigma(Z) = \mathrm{log}\left(1 + \sum_{m,n}\exp(C_{mn})\right).\]
Its gradient (with respect to $Z$) is:
\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = \frac{1}{1 + \sum_{m,n}\exp(C_{mn})}\sum_{m'n'}\exp(C_{m'n'})\sum_{\ell}(\delta_{jm'}A_{i\ell}Z_{\ell{}n'} + \delta_{jn'}Z_{\ell{}m'}A_{\ell{}i}) = \frac{1}{1 + \sum_{m,n}\exp(C_{mn})}\{[AZ\exp.(C)^T]_{ij} + [A^TZ\exp.(C)]_{ij}\}.\]
Note that if $A$ is a SymmetricMatrix, then $C$ is symmetric as well and the expression simplifies to:
\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = \frac{2}{1 + \sum_{m,n}\exp(C_{mn})}[AZ\exp.(C)]_{ij},\]
or written in matrix notation:
\[\nabla_Z\Sigma(Z) = \frac{2}{1 + \sum_{m,n}\exp(C_{mn})}AZ\exp.(C).\]
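As a check of this matrix-notation gradient, the following sketch compares it against automatic differentiation; Zygote and the variable names are used here purely for verification and are not part of the library.

```julia
using Zygote  # used only to check the hand-derived gradient

# Verify ∇_Z Σ(Z) = 2 / (1 + Σ_{m,n} exp(C_{mn})) · A Z exp.(C) for symmetric A.
d, T = 3, 4
B = randn(d, d); A = (B + B') / 2   # symmetric weighting
Z = randn(d, T)

Σ(Z) = log(1 + sum(exp.(Z' * A * Z)))

C = Z' * A * Z
grad_analytic = 2 / (1 + sum(exp.(C))) * A * Z * exp.(C)
grad_ad = Zygote.gradient(Σ, Z)[1]

grad_analytic ≈ grad_ad   # should return true
```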
Whether we use a SymmetricMatrix for $A$ or not can be set with the keyword symmetric in SymplecticAttention.
Vector Softmax
Here we take as a starting point the expression:
\[\Sigma(Z) = \sum_{n}\mathrm{log}\left(1 + \sum_{m}\exp(C_{mn})\right).\]
We then get:
\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = \sum_n\frac{\exp(C_{jn})}{1 + \sum_m\exp(C_{mn})}[AZ]_{in} + \frac{1}{1 + \sum_m\exp(C_{mj})}[A^TZ\exp.(C)]_{ij}.\]
The second term in this expression is equivalent to a standard attention step:
\[\mathrm{TermII:}\qquad A^TZ\mathrm{softmax}(C).\]
The first term is equivalent to:
\[\mathrm{TermI:}\qquad \sum_n [AZ]_{in}[\mathrm{softmax}(C)^T]_{nj} \equiv AZ(\mathrm{softmax}(C))^T.\]
If we again assume that the matrix $A$ is a SymmetricMatrix, then the expression simplifies to:
\[\nabla_Z\Sigma(Z) = AZ\mathrm{softmax}(C).\]
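The gradient derived at the beginning of this subsection can again be checked numerically. The sketch below verifies the full expression (Term I plus Term II, with the additional 1 in the denominators) in matrix form; Zygote and all names here are used only for this check and are not part of the library.

```julia
using Zygote  # used only to check the hand-derived gradient

# Verify ∂Σ/∂Z in matrix form: ∇_Z Σ(Z) = A Z Wᵀ + Aᵀ Z W,
# where W_{mn} = exp(C_{mn}) / (1 + Σ_{m'} exp(C_{m'n})).
d, T = 3, 4
A = randn(d, d)                     # no symmetry assumed here
Z = randn(d, T)

Σ(Z) = sum(log.(1 .+ vec(sum(exp.(Z' * A * Z); dims = 1))))

C = Z' * A * Z
W = exp.(C) ./ (1 .+ sum(exp.(C); dims = 1))   # column-wise weights (Terms I and II)
grad_analytic = A * Z * W' + A' * Z * W
grad_ad = Zygote.gradient(Σ, Z)[1]

grad_analytic ≈ grad_ad   # should return true
```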
Library Functions
GeometricMachineLearning.VectorSoftmax — Type

VectorSoftmax <: AbstractSoftmax

Turn an arbitrary vector into a probability vector with:
\[[\mathrm{softmax}(a)]_i = \frac{e^{a_i}}{\sum_{i'=1}^de^{a_{i'}}}. \]
This is what is most often understood under the name "softmax". MatrixSoftmax is the matrix version.
GeometricMachineLearning.MatrixSoftmax — Type

MatrixSoftmax

Like VectorSoftmax but for matrices:
\[[\mathrm{softmax}(A)]_{ij} = \frac{e^{A_{ij}}}{\sum_{i'=1, j'=1}^{d,\bar{d}}e^{A_{i'j'}}}. \]
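To make the two definitions concrete, here is a small standalone sketch; the function names are ad hoc illustrations, not the library's implementation. The column-wise variant shows how the vector softmax enters the derivation above.

```julia
# Ad hoc illustrations of the two softmax definitions above (not the library API).
softmax_vec(a)  = exp.(a) ./ sum(exp.(a))            # VectorSoftmax-style: probability vector
softmax_mat(A)  = exp.(A) ./ sum(exp.(A))            # MatrixSoftmax-style: all entries sum to one
softmax_cols(A) = exp.(A) ./ sum(exp.(A); dims = 1)  # vector softmax applied to each column

a = [1.0, 2.0, 3.0]
A = [1.0 2.0; 3.0 4.0]
sum(softmax_vec(a)) ≈ 1.0                      # true
sum(softmax_mat(A)) ≈ 1.0                      # true
all(sum(softmax_cols(A); dims = 1) .≈ 1.0)     # true: every column sums to one
```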
GeometricMachineLearning.SymplecticAttention — Type

SymplecticAttention

Implements the symplectic attention layers. See LinearSymplecticAttention.
Keys
It stores the following key:
activation::AbstractSoftmax
Constructors
See SymplecticAttentionQ and SymplecticAttentionP.
Implementation
SymplecticAttention is similar to MultiHeadAttention or VolumePreservingAttention in that it computes the scalar products of all vectors in a sequence of input vectors:
\[C = Q^TAQ,\]
where $Q$ is the $q$-part of an input $Z$ (see QPT). The matrix $A$ is a weighting that can either be symmetric or skew-symmetric (this can be adjusted with the keyword symmetric::Bool).
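For concreteness, the correlation matrix can be formed as in the following sketch. Treating the $q$-part as the first half of the rows of $Z$ is an assumption made here for illustration; see QPT for how inputs are actually stored.

```julia
# Forming the correlation matrix C = QᵀAQ from a stacked input [q; p].
# The row split below is an assumption for this sketch; see QPT.
n, T = 4, 6                  # half the system dimension and the sequence length
Z = randn(2n, T)             # input sequence with q- and p-part stacked
Q = Z[1:n, :]                # q-part

B = randn(n, n)
A_sym  = (B + B') / 2        # symmetric weighting
A_skew = (B - B') / 2        # skew-symmetric weighting (selected via the keyword symmetric)

C = Q' * A_sym * Q           # correlation matrix used by the attention layer
```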
Extended help
The symplectic attention mechanism is derived by computing the gradient of a separable Hamiltonian, as is also done in GSympNets.
GeometricMachineLearning.SymplecticAttentionQ — Type

SymplecticAttentionQ

A constant that is derived from SymplecticAttention. This only changes the $q$-part of the input.
Constructor
SymplecticAttentionQ(M; symmetric::Bool, activation)

The defaults for the keywords are true and MatrixSoftmax().
You may want to alter the activation function (either MatrixSoftmax or VectorSoftmax), but it is almost always better to set the keyword symmetric to true.
GeometricMachineLearning.SymplecticAttentionP — Type

SymplecticAttentionP

A constant that is derived from SymplecticAttention. This only changes the $p$-part of the input.
Constructor
SymplecticAttentionP(M; symmetric::Bool, activation)

The defaults for the keywords are true and MatrixSoftmax().
You may want to alter the activation function (either MatrixSoftmax or VectorSoftmax), but it is almost always better to set the keyword symmetric to true.
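Based on the constructors documented above, a construction sketch might look as follows. It is assumed here that M stands for the system dimension and that VectorSoftmax, like MatrixSoftmax, takes no arguments; adjust this to your setup.

```julia
using GeometricMachineLearning: SymplecticAttentionQ, SymplecticAttentionP,
                                MatrixSoftmax, VectorSoftmax

M = 10   # assumed: the system dimension expected by the constructors

# q-part layer with the documented defaults (symmetric = true, MatrixSoftmax())
attention_q = SymplecticAttentionQ(M; symmetric = true, activation = MatrixSoftmax())

# p-part layer, switching to the vector softmax
attention_p = SymplecticAttentionP(M; symmetric = true, activation = VectorSoftmax())
```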
[1] The multi-head attention case is in principle not much more difficult, but care has to be taken with regard to the projection spaces.