Linear Symplectic Attention
The attention layer introduced here is an extension of the SympNet gradient layer to the setting of time series data. We first have to define a notion of symplecticity for multi-step methods.
This definition is essentially taken from [30, 31] and is similar to the definition of volume preservation in [32].
A multi-step method $\times_T\mathbb{R}^{2n}\to\times_T\mathbb{R}^{2n}$ is called symplectic if it preserves the symplectic product structure.
The symplectic product structure is the following skew-symmetric non-degenerate bilinear form on $\times_T\mathbb{R}^{2n}$:
\[\mathbb{J}([z^{(1)}, \ldots, z^{(T)}], [\tilde{z}^{(1)}, \ldots, \tilde{z}^{(T)}]) := \sum_{i=1}^T (z^{(i)})^T\mathbb{J}_{2n}\tilde{z}^{(i)},\]
where $\mathbb{J}_{2n} = \begin{pmatrix} \mathbb{O} & \mathbb{I} \\ -\mathbb{I} & \mathbb{O} \end{pmatrix}$ is the canonical symplectic matrix on $\mathbb{R}^{2n}$ (without it the sum would be a symmetric inner product, not a skew-symmetric form).
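The defining property of this bilinear form can be checked numerically. The following is a minimal NumPy sketch (not part of the library; the name `product_form` is illustrative) that builds the product structure from the canonical symplectic matrix and verifies its skew-symmetry:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 4  # half phase-space dimension and sequence length

# Canonical symplectic matrix J_2n on R^{2n}.
J2n = np.block([[np.zeros((n, n)), np.eye(n)],
                [-np.eye(n), np.zeros((n, n))]])

def product_form(zs, zs_tilde):
    """Symplectic product structure on (R^{2n})^T:
    sum_i (z^(i))^T J_2n (z~^(i))."""
    return sum(z @ J2n @ zt for z, zt in zip(zs, zs_tilde))

zs = [rng.standard_normal(2 * n) for _ in range(T)]
ws = [rng.standard_normal(2 * n) for _ in range(T)]

# Skew-symmetry: J(z, w) = -J(w, z), and in particular J(z, z) = 0.
assert np.isclose(product_form(zs, ws), -product_form(ws, zs))
assert np.isclose(product_form(zs, zs), 0.0)
```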
In order to construct a symplectic attention mechanism we extend the principle behind the SympNet gradient layer, i.e. we construct scalar functions that only depend on $[q^{(1)}, \ldots, q^{(T)}]$ or $[p^{(1)}, \ldots, p^{(T)}]$. The specific choice we make here is the following:
\[F(q^{(1)}, \ldots, q^{(T)}) = \frac{1}{2}\mathrm{Tr}(QAQ^T),\]
where $Q := [q^{(1)}, \ldots, q^{(T)}]$. We therefore have for the gradient:
\[\nabla_QF = \frac{1}{2}Q(A + A^T) =: Q\bar{A},\]
where $\bar{A}\in\mathcal{S}_\mathrm{sym}(T)$ is a symmetric $T\times{}T$ matrix. So the map performs:
\[[q^{(1)}, \ldots, q^{(T)}] \mapsto \left[ \sum_{i=1}^T\bar{a}_{1i}q^{(i)}, \ldots, \sum_{i=1}^T\bar{a}_{Ti}q^{(i)} \right].\]
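The gradient formula and the resulting column mixing can be verified with a short NumPy sketch (illustrative only, not the library's Julia implementation; the helper `F` mirrors the scalar function above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 3, 4
Q = rng.standard_normal((n, T))   # columns are q^(1), ..., q^(T)
A = rng.standard_normal((T, T))   # T x T parameter matrix

def F(Q):
    # F(q^(1), ..., q^(T)) = 1/2 Tr(Q A Q^T)
    return 0.5 * np.trace(Q @ A @ Q.T)

A_bar = 0.5 * (A + A.T)           # symmetric part of A
grad = Q @ A_bar                  # claimed gradient: nabla_Q F = Q A_bar

# Finite-difference check of the gradient formula.
eps = 1e-6
num = np.zeros_like(Q)
for i in range(n):
    for j in range(T):
        E = np.zeros_like(Q)
        E[i, j] = eps
        num[i, j] = (F(Q + E) - F(Q - E)) / (2 * eps)
assert np.allclose(num, grad, atol=1e-6)

# Column j of Q A_bar is sum_i a_bar[j, i] q^(i) (A_bar is symmetric),
# i.e. each output vector is a linear combination of all q^(i).
j = 2
col = sum(A_bar[j, i] * Q[:, i] for i in range(T))
assert np.allclose(col, grad[:, j])
```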
Library Functions
GeometricMachineLearning.LinearSymplecticAttention — Type

Implements the linear symplectic attention layers. Analogous to GradientLayer it performs mappings that only change the $Q$ or the $P$ part. For more information see LinearSymplecticAttentionQ and LinearSymplecticAttentionP.
Constructor
For the constructors simply call
LinearSymplecticAttentionQ(sys_dim, seq_length)
or
LinearSymplecticAttentionP(sys_dim, seq_length)
where sys_dim is the system dimension and seq_length is the sequence length.
GeometricMachineLearning.LinearSymplecticAttentionQ — Type

Performs:
\[\begin{pmatrix} Q \\ P \end{pmatrix} \mapsto \begin{pmatrix} Q + \nabla_PF \\ P \end{pmatrix},\]
where $Q,\, P\in\mathbb{R}^{n\times{}T}$ and $F(P) = \frac{1}{2}\mathrm{Tr}(P A P^T)$.
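Because the layer is linear in $(Q, P)$, its symplecticity with respect to the product structure can be checked by verifying $M^T\mathbb{J}M = \mathbb{J}$ for its Jacobian $M$. The sketch below (NumPy, illustrative; `layer_q` is a hypothetical name mirroring the map above) does this for the $Q$-type layer:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 2, 3
A = rng.standard_normal((T, T))
A_bar = 0.5 * (A + A.T)           # symmetric part of A

def layer_q(Q, P):
    # (Q, P) -> (Q + nabla_P F, P) with F(P) = 1/2 Tr(P A P^T),
    # so nabla_P F = P A_bar.
    return Q + P @ A_bar, P

# Jacobian of the linear map on x = [vec(Q); vec(P)] (column-major vec):
# M = [[I, A_bar ⊗ I_n], [0, I]].
S = np.kron(A_bar, np.eye(n))
I = np.eye(n * T)
M = np.block([[I, S], [np.zeros_like(I), I]])

# Product symplectic structure in the same ordering.
J = np.block([[np.zeros_like(I), I], [-I, np.zeros_like(I)]])

# Symplecticity: M^T J M = J. This holds precisely because A_bar
# is symmetric, so S^T = S and the lower-right block cancels.
assert np.allclose(M.T @ J @ M, J)
```

The same argument applies to the $P$-type layer, whose Jacobian is the transpose-block counterpart of $M$.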
GeometricMachineLearning.LinearSymplecticAttentionP — Type

Performs:
\[\begin{pmatrix} Q \\ P \end{pmatrix} \mapsto \begin{pmatrix} Q \\ P + \nabla_QF \end{pmatrix},\]
where $Q,\, P\in\mathbb{R}^{n\times{}T}$ and $F(Q) = \frac{1}{2}\mathrm{Tr}(Q A Q^T)$.