Symplectic Attention
There are two ways of constructing symplectic attention: with a matrix-like softmax and with a classical (vector) softmax.
Both approaches are, however, based on taking the derivative of an expression with respect to its input arguments (similar to gradient layers).
We first treat single-head attention[1]. Our starting point is the correlation matrix:
\[C = Z^TAZ \implies C_{mn} = \sum_{k\ell}Z_{km}A_{k\ell}Z_{\ell{}n}.\]
Its element-wise derivative is:
\[\frac{\partial{}C_{mn}}{\partial{}Z_{ij}} = \sum_{\ell}(\delta_{jm}A_{i\ell}Z_{\ell{}n} + \delta_{jn}Z_{\ell{}m}A_{\ell{}i}).\]
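As a quick sanity check of this formula, the following plain-Julia sketch (illustrative only; the dimensions and variable names are made up) compares the element-wise derivative with a central finite difference:

```julia
using LinearAlgebra

d, T = 4, 3                       # feature dimension and sequence length (illustrative values)
A = randn(d, d)
Z = randn(d, T)

C(Z) = Z' * A * Z                 # correlation matrix: C[m, n] = Σ_{k,ℓ} Z[k, m] A[k, ℓ] Z[ℓ, n]

# element-wise derivative from the formula above:
# ∂C[m, n]/∂Z[i, j] = δ_{jm} (A * Z)[i, n] + δ_{jn} (A' * Z)[i, m]
dC(Z, i, j, m, n) = (j == m) * (A * Z)[i, n] + (j == n) * (A' * Z)[i, m]

# central finite-difference check for one choice of indices
i, j, m, n = 2, 1, 1, 3
h = 1e-6
E = zeros(d, T); E[i, j] = h
fd = (C(Z + E)[m, n] - C(Z - E)[m, n]) / 2h
@assert isapprox(fd, dC(Z, i, j, m, n); atol = 1e-6)
```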
Matrix Softmax
Here we take as a starting point the expression:
\[\Sigma(Z) = \mathrm{log}\left(1 + \sum_{m,n}\exp(C_{mn})\right).\]
Its gradient (with respect to $Z$) is:
\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = \frac{1}{1 + \sum_{m,n}\exp(C_{mn})}\sum_{m'n'}\exp(C_{m'n'})\sum_{\ell}(\delta_{jm'}A_{i\ell}Z_{\ell{}n'} + \delta_{jn'}Z_{\ell{}m'}A_{\ell{}i}) = \frac{1}{1 + \sum_{m,n}\exp(C_{mn})}\{[AZ\exp.(C)^T]_{ij} + [A^TZ\exp.(C)]_{ij}\}.\]
Note that if $A$ is a SymmetricMatrix, the expression then simplifies to:
\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = 2\frac{1}{1 + \sum_{m,n}\exp(C_{mn})}[AZ\exp.(C)^T]_{ij},\]
or written in matrix notation:
\[\nabla_Z\Sigma(Z) = 2\frac{1}{1 + \sum_{m,n}\exp(C_{mn})}AZ\exp.(C).\]
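The closed-form gradient can be checked numerically. The sketch below is illustrative, not library code; Symmetric is used as a stand-in for the library's SymmetricMatrix, and the dimensions are arbitrary:

```julia
using LinearAlgebra

d, T = 4, 3
A = Symmetric(randn(d, d))        # stand-in for the library's SymmetricMatrix
Z = randn(d, T)

Σ(Z) = log(1 + sum(exp.(Z' * A * Z)))                  # matrix-softmax potential

function gradΣ(Z)                                      # closed-form gradient from above
    C = Z' * A * Z
    return 2 / (1 + sum(exp.(C))) * A * Z * exp.(C)
end

# finite-difference check of a single entry of the gradient
i, j = 3, 2
h = 1e-6
E = zeros(d, T); E[i, j] = h
fd = (Σ(Z + E) - Σ(Z - E)) / 2h
@assert isapprox(fd, gradΣ(Z)[i, j]; atol = 1e-6, rtol = 1e-4)
```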
Whether we use a SymmetricMatrix for $A$ or not can be set with the keyword symmetric in SymplecticAttention.
Vector Softmax
Here we take as a starting point the expression:
\[\Sigma(Z) = \sum_{n}\mathrm{log}\left(1 + \sum_{m}\exp(C_{mn})\right).\]
We then get:
\[\frac{\partial\Sigma(Z)}{\partial{}Z_{ij}} = \sum_n\frac{\exp(C_{jn})}{1 + \sum_m\exp(C_{mn})}[AZ]_{in} + \frac{1}{1 + \sum_m\exp(C_{mj})}[A^TZ\exp.(C)]_{ij}.\]
The second term in this expression is equivalent to a standard attention step:
\[\mathrm{TermII:}\qquad A^TZ\mathrm{softmax}_1(C).\]
The first term is equivalent to:
\[\mathrm{TermI:}\qquad \sum_n [AZ]_{in}[\mathrm{softmax}_1(C)^T]_{nj} \equiv AZ(\mathrm{softmax}_1(C))^T.\]
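As with the matrix softmax, the sum of the two terms can be checked against a finite difference of $\Sigma$. The sketch below is illustrative only; softmax1 denotes the column-wise "one softmax" discussed further down, and all names and dimensions are made up:

```julia
using LinearAlgebra

d, T = 4, 3
A = randn(d, d)
Z = randn(d, T)

softmax1(C) = exp.(C) ./ (1 .+ sum(exp.(C); dims = 1))   # "one softmax", applied column-wise

Σ(Z) = sum(log.(1 .+ sum(exp.(Z' * A * Z); dims = 1)))   # vector-softmax potential

function gradΣ(Z)
    S = softmax1(Z' * A * Z)
    return A * Z * S' + A' * Z * S                        # Term I + Term II
end

# finite-difference check of a single entry of the gradient
i, j = 1, 2
h = 1e-6
E = zeros(d, T); E[i, j] = h
fd = (Σ(Z + E) - Σ(Z - E)) / 2h
@assert isapprox(fd, gradΣ(Z)[i, j]; atol = 1e-6, rtol = 1e-4)
```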
If we again assume that the matrix $A$ is a SymmetricMatrix, the two terms combine to:
\[\nabla_Z\Sigma(Z) = AZ\left(\mathrm{softmax}_1(C) + \mathrm{softmax}_1(C)^T\right).\]
Note that we used the "one softmax" here instead of the standard softmax; the one softmax has been shown to compare favorably with the standard softmax in some cases.
A discussion of the one softmax can be found in [95].
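For reference, here is a plain-Julia sketch contrasting the standard column-wise softmax with the one softmax (the helper names are illustrative and not part of the library):

```julia
softmax_cols(C)  = exp.(C) ./ sum(exp.(C); dims = 1)          # standard: columns sum to exactly 1
softmax1_cols(C) = exp.(C) ./ (1 .+ sum(exp.(C); dims = 1))   # "one softmax": columns sum to less than 1

C = [1.0 2.0; 0.5 1.5]
sum(softmax_cols(C);  dims = 1)   # [1.0 1.0]
sum(softmax1_cols(C); dims = 1)   # entries strictly below 1
```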
Library Functions
GeometricMachineLearning.VectorSoftmax — Type
VectorSoftmax <: AbstractSoftmax
Turn an arbitrary vector into a probability vector with:
\[[\mathrm{softmax}(a)]_i = \frac{e^{a_i}}{\sum_{i'=1}^de^{a_{i'}}}. \]
This is what is most often understood under the name "softmax". MatrixSoftmax is the matrix version.
GeometricMachineLearning.MatrixSoftmax — Type
MatrixSoftmax
Like VectorSoftmax but for matrices:
\[[\mathrm{softmax}(A)]_{ij} = \frac{e^{A_{ij}}}{\sum_{i'=1, j'=1}^{d,\bar{d}}e^{A_{i'j'}}}. \]
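A minimal plain-Julia sketch of the two documented formulas (the helper names vector_softmax and matrix_softmax are illustrative and not part of GeometricMachineLearning):

```julia
vector_softmax(a) = exp.(a) ./ sum(exp.(a))   # probability vector: entries sum to 1
matrix_softmax(A) = exp.(A) ./ sum(exp.(A))   # all matrix entries together sum to 1

vector_softmax([1.0, 2.0, 3.0])   # ≈ [0.090, 0.245, 0.665]
sum(matrix_softmax(randn(3, 4)))  # ≈ 1.0
```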
GeometricMachineLearning.SymplecticAttention — Type
SymplecticAttention
Implements the symplectic attention layers. See LinearSymplecticAttention.
Keys
It stores the following key:
activation::AbstractSoftmax
Constructors
See SymplecticAttentionQ and SymplecticAttentionP.
Implementation
SymplecticAttention is similar to MultiHeadAttention or VolumePreservingAttention in that it computes the scalar products of all vectors in a sequence of input vectors:
\[C = Q^TAQ,\]
where $Q$ is the $q$-part of an input $Z$ (see QPT). The matrix $A$ is a weighting that can either be symmetric or skew-symmetric (this can be adjusted with the keyword symmetric::Bool).
Extended help
The symplectic attention mechanism is derived via computing the gradient of a separable Hamiltonian, as is also done in GSympNets.
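To make the connection to GSympNets concrete, the sketch below shows a GSympNet-style shear in which only the $q$-part is changed and the increment is the gradient of the matrix-softmax potential evaluated on the $p$-part. This illustrates the principle only and is not necessarily the exact forward pass of SymplecticAttention; the function name and dimensions are made up.

```julia
using LinearAlgebra

# GSympNet-style shear: (q, p) ↦ (q + ∇ₚΣ(p), p) with the matrix-softmax potential.
# Since the increment depends only on p, the map is symplectic.
function q_shear(q, p, A::Symmetric)
    C = p' * A * p
    return q + 2 / (1 + sum(exp.(C))) * A * p * exp.(C), p
end

d, T = 4, 3
A = Symmetric(randn(d, d))
q, p = randn(d, T), randn(d, T)
q_new, p_new = q_shear(q, p, A)    # p_new == p, only q is updated
```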
GeometricMachineLearning.SymplecticAttentionQ — Type
SymplecticAttentionQ
A constant that is derived from SymplecticAttention. This only changes the $q$-part of the input.
Constructor
SymplecticAttentionQ(M; symmetric::Bool, activation)
The defaults for the keywords are true and MatrixSoftmax().
You may want to alter the activation function (either MatrixSoftmax or VectorSoftmax), but it is almost always better to set the keyword symmetric to true.
GeometricMachineLearning.SymplecticAttentionP — Type
SymplecticAttentionP
A constant that is derived from SymplecticAttention. This only changes the $p$-part of the input.
Constructor
SymplecticAttentionP(M; symmetric::Bool, activation)
The defaults for the keywords are true and MatrixSoftmax().
You may want to alter the activation function (either MatrixSoftmax or VectorSoftmax), but it is almost always better to set the keyword symmetric to true.
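Putting the two constructors together, following the signatures documented above (the feature dimension 4 is an arbitrary illustrative value):

```julia
using GeometricMachineLearning

layer_q = SymplecticAttentionQ(4)                                           # defaults: symmetric = true, MatrixSoftmax()
layer_p = SymplecticAttentionP(4; symmetric = true, activation = VectorSoftmax())
```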
[1] The multi-head attention case is in principle not much more difficult, but care has to be taken with regard to the projection spaces.