Symplectic Attention

There are two ways of constructing symplectic attention: with a matrix softmax and with a classical (vector) softmax.

Both approaches are, however, based on taking the derivative of an expression with respect to its input arguments (similar to gradient layers).

We restrict the discussion to single-head attention[1] and start with the correlation matrix:

\[C = Z^TAZ \implies C_{mn} = \sum_{k\ell}Z_{km}A_{k\ell}Z_{\ell{}n}.\]

Its element-wise derivative is:

\[\frac{\partial{}C_{mn}}{\partial{}Z_{ij}} = \sum_{\ell}(\delta_{jm}A_{i\ell}Z_{\ell{}n} + \delta_{jn}Z_{\ell{}m}A_{\ell{}i}).\]
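As a quick sanity check, this derivative can be compared with a finite-difference approximation. The following is a minimal sketch in plain Julia; the array sizes and the index choice are arbitrary and not part of the library:

```julia
# Check ∂C_mn/∂Z_ij = δ_jm [AZ]_in + δ_jn [AᵀZ]_im against a central finite difference.
using LinearAlgebra

d, T = 4, 3        # dimension of the feature vectors and sequence length (arbitrary)
A = randn(d, d)
Z = randn(d, T)

# analytic expression from above
dC(i, j, m, n) = (j == m) * (A * Z)[i, n] + (j == n) * (A' * Z)[i, m]

# finite-difference approximation of ∂C_mn/∂Z_ij
function dC_fd(i, j, m, n; h = 1e-6)
    E = zeros(d, T); E[i, j] = h
    ((Z + E)' * A * (Z + E) - (Z - E)' * A * (Z - E))[m, n] / (2h)
end

i, j, m, n = 2, 1, 1, 3
isapprox(dC(i, j, m, n), dC_fd(i, j, m, n); atol = 1e-6)   # true
```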

Matrix Softmax

Here we take as a starting point the expression:

\[\Sigma(Z) = \mathrm{log}\left(1 + \sum_{m,n}\exp(C_{mn})\right).\]

Its gradient (with respect to $Z$) is:

\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = \frac{1}{1 + \sum_{m,n}\exp(C_{mn})}\sum_{m'n'}\exp(C_{m'n'})\sum_{\ell}(\delta_{jm'}A_{i\ell}Z_{\ell{}n'} + \delta_{jn'}Z_{\ell{}m'}A_{\ell{}i}) = \frac{1}{1 + \sum_{m,n}\exp(C_{mn})}\{[AZ\exp.(C)^T]_{ij} + [A^TZ\exp.(C)]_{ij}\}.\]

Note that if $A$ is a SymmetricMatrix, then $C$ is also symmetric and the expression simplifies to:

\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = 2\frac{1}{1 + \sum_{m,n}\exp(C_{mn})}[AZ\exp.(C)]_{ij},\]

or written in matrix notation:

\[\nabla_Z\Sigma(Z) = 2\frac{1}{1 + \sum_{m,n}\exp(C_{mn})}AZ\exp.(C).\]
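This matrix expression can be verified numerically. The sketch below is plain Julia; LinearAlgebra.Symmetric is used as a stand-in for the library's SymmetricMatrix and the closed-form gradient is compared with a finite-difference gradient of $\Sigma$:

```julia
# Check ∇_Z Σ(Z) = 2/(1 + Σ exp(C)) A Z exp.(C) for a symmetric weighting A.
using LinearAlgebra

d, T = 4, 3
A = Symmetric(randn(d, d))          # stand-in for SymmetricMatrix
Z = randn(d, T)

Σ(Z) = log(1 + sum(exp.(Z' * A * Z)))

function ∇Σ(Z)                      # closed-form gradient from above
    C = Z' * A * Z
    2 / (1 + sum(exp.(C))) * A * Z * exp.(C)
end

function ∇Σ_fd(Z; h = 1e-6)         # entry-wise central finite differences
    G = similar(Z)
    for idx in eachindex(Z)
        E = zeros(d, T); E[idx] = h
        G[idx] = (Σ(Z + E) - Σ(Z - E)) / (2h)
    end
    G
end

isapprox(∇Σ(Z), ∇Σ_fd(Z); atol = 1e-5)   # true
```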

Whether we use a SymmetricMatrix for $A$ or not can be set with the keyword symmetric in SymplecticAttention.

Vector Softmax

Here we take as a starting point the expression:

\[\Sigma(Z) = \sum_{n}\mathrm{log}\left(1 + \sum_{m}\exp(C_{mn})\right).\]

We then get:

\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = \sum_n\frac{\exp(C_{jn})}{1 + \sum_m\exp(C_{mn})}[AZ]_{in} + \frac{1}{1 + \sum_m\exp(C_{mj})}[A^TZ\exp.(C)]_{ij}.\]

The second term in this expression is equivalent to a standard attention step:

\[\mathrm{Term\ II:}\qquad A^TZ\mathrm{softmax}(C).\]

The first term is equivalent to:

\[\mathrm{Term\ I:}\qquad \sum_n [AZ]_{in}[\mathrm{softmax}(C)^T]_{nj} \equiv [AZ(\mathrm{softmax}(C))^T]_{ij}.\]

If we again assume that the matrix $A$ is a SymmetricMatrix, then the expression simplifies to:

\[\nabla_Z\Sigma(Z) = AZ\mathrm{softmax}(C).\]
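The two-term expression (Term I plus Term II) can be verified numerically for a general weighting $A$. The following is a minimal sketch in plain Julia; the column-wise softmax used here carries the extra $1$ in the denominator that comes from the $\mathrm{log}(1 + \sum\exp)$ formulation:

```julia
# Check Term I + Term II against a finite-difference gradient of
# Σ(Z) = Σ_n log(1 + Σ_m exp(C_mn)).
using LinearAlgebra

d, T = 4, 3
A = randn(d, d)                     # general, not necessarily symmetric
Z = randn(d, T)

Σ(Z) = sum(log.(1 .+ sum(exp.(Z' * A * Z); dims = 1)))

σ(C) = exp.(C) ./ (1 .+ sum(exp.(C); dims = 1))   # column-wise, with the extra 1

function ∇Σ(Z)
    C = Z' * A * Z
    A * Z * σ(C)' + A' * Z * σ(C)   # Term I + Term II
end

function ∇Σ_fd(Z; h = 1e-6)
    G = similar(Z)
    for idx in eachindex(Z)
        E = zeros(d, T); E[idx] = h
        G[idx] = (Σ(Z + E) - Σ(Z - E)) / (2h)
    end
    G
end

isapprox(∇Σ(Z), ∇Σ_fd(Z); atol = 1e-5)   # true
```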

Library Functions

GeometricMachineLearning.VectorSoftmax - Type
VectorSoftmax <: AbstractSoftmax

Turn an arbitrary vector into a probability vector with:

\[[\mathrm{softmax}(a)]_i = \frac{e^{a_i}}{\sum_{i'=1}^de^{a_{i'}}}. \]

This is what is most often understood under the name "softmax". MatrixSoftmax is the matrix version.
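For illustration only, a direct transcription of this formula in plain Julia (not the library's own implementation):

```julia
# Turn a vector into a probability vector with the softmax formula above.
softmax(a) = exp.(a) ./ sum(exp.(a))

a = [1.0, 2.0, 3.0]
softmax(a)              # ≈ [0.0900, 0.2447, 0.6652]
sum(softmax(a)) ≈ 1.0   # true
```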

GeometricMachineLearning.SymplecticAttention - Type
SymplecticAttention

Implements the symplectic attention layers. See LinearSymplecticAttention.

Keys

It stores the following key:

  • activation::AbstractSoftmax

Constructors

See SymplecticAttentionQ and SymplecticAttentionP.

Implementation

SymplecticAttention is similar to MultiHeadAttention or VolumePreservingAttention in that it computes the scalar products of all vectors in a sequence of input vectors:

\[C = Q^TAQ,\]

where $Q$ is the $q$-part of an input $Z$ (see QPT). The matrix $A$ is a weighting that can either be symmetric or skew-symmetric (this can be adjusted with the keyword symmetric::Bool).
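A minimal sketch of this correlation, assuming the input $Z$ is given as a named tuple with q and p components of size $d \times T$ (LinearAlgebra.Symmetric again stands in for SymmetricMatrix):

```julia
# Correlation of the q-part of an input Z, as described above.
using LinearAlgebra

d, T = 4, 3
Z = (q = randn(d, T), p = randn(d, T))   # assumed (q = ..., p = ...) layout
A = Symmetric(randn(d, d))               # stand-in for a symmetric weighting

C = Z.q' * A * Z.q      # scalar products of all q-vectors in the sequence
size(C)                 # (T, T): one weight for every pair of time steps
```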

Extended help

The symplectic attention mechanism is derived via computing the gradient of a separable Hamiltonian, as is also done in GSympNets.

[1]: The multi-head attention case is in principle not much more difficult, but care has to be taken with regard to the projection spaces.