Standard Transformer

The transformer is a relatively modern neural network architecture [54] that has come to dominate the field of natural language processing (NLP, [87]) and has replaced the previously dominant long short-term memory cells (LSTMs, [83]). Its success is due to a variety of factors:

  • unlike LSTMs it consists of very simple building blocks and hence is easier to interpret mathematically,
  • it is very flexible in its application and the data it is fed with do not have to conform to a rigid pattern,
  • transformers utilize modern hardware (especially GPUs) very effectively.

The transformer architecture is sketched below:

It is nothing more than a combination of a multihead attention layer and a residual neural network[1] (ResNet).

As was explained when we talked about the attention module, the attention layer performs a convex reweighting of the input sequence:

\[\mathrm{Attention}: Z \equiv [z^{(1)}, \ldots, z^{(T)}] \mapsto [\sum_{i=1}^Tp^{(1)}_iz^{(i)}, \ldots, \sum_{i=1}^Tp^{(T)}_iz^{(i)}] = \mathrm{Attention}(Z),\]

where the coefficients $p^{(j)}_i$ depend on $Z$ and are learnable. In the case of multihead attention a greater number of these reweighting coefficients is learned, but the layer is otherwise not much more complicated than single-head attention.
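The following is a rough, parameter-free sketch of such a convex reweighting, together with the residual add connection that follows it (this is only an illustration; in the actual MultiHeadAttention layer the coefficients are computed from learnable projection matrices):

    # illustrative sketch only: the reweighting coefficients of the actual
    # MultiHeadAttention layer come from learnable projection matrices.
    softmax(x) = exp.(x) ./ sum(exp.(x))

    function attention_reweighting(Z::AbstractMatrix)
        T = size(Z, 2)
        scores = Z' * Z                                    # similarities between time steps
        P = hcat([softmax(scores[:, j]) for j in 1:T]...)  # column j holds the convex weights p^(j)
        Z * P                                              # output column j is sum_i p^(j)_i z^(i)
    end

    Z = rand(4, 10)                      # d = 4, sequence length T = 10
    Z_out = Z + attention_reweighting(Z) # the "add connection" of the transformer block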

The green arrow in the figure above indicates that this first add connection can be left out. Whether it is used can be specified via the keyword argument add_connection in the MultiHeadAttention layer and in the StandardTransformerIntegrator.
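For instance, a standard transformer without this first add connection could be set up as follows (a minimal sketch; the system dimension 4 is an arbitrary choice and the keyword is the one documented in the library functions below):

    using GeometricMachineLearning

    # leave out the first add connection after the multihead attention layer
    model_architecture = StandardTransformerIntegrator(4; add_connection = false)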

We should also note that such transformers have been used for the online phase in reduced order modeling before [85].

Classification Transformer

Instead of using the transformer for integration, it can also be used as an image classifier; in this setting it is often referred to as a vision transformer [88]. For classification we append a ClassificationLayer to the output of the transformer. This will be used in one of the examples.
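A minimal sketch of setting up such a classification transformer (here dl is assumed to be an existing instance of DataLoader holding, e.g., MNIST images and labels; n_heads, L and Stiefel are the keywords documented below, and the concrete values are arbitrary):

    using GeometricMachineLearning

    # `dl` is assumed to be a DataLoader built from images and labels
    model_architecture = ClassificationTransformer(dl; n_heads = 7, L = 4, Stiefel = true)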

The Upscaling

When using the transformer one typically also benefits from defining a transformer_dim that is greater than the system dimension and a corresponding upscaling_activation (see the docstring of StandardTransformerIntegrator).

In the figure above we call

\[ \Psi^\mathrm{up}:\mathbb{R}^{\mathtt{sys\_dim}}\to\mathbb{R}^{\mathtt{transformer\_dim}}\]

the upscaling layer and

\[ \Psi^\mathrm{down}:\mathbb{R}^{\mathtt{transformer\_dim}}\to\mathbb{R}^{\mathtt{sys\_dim}}\]

the downscaling layer. Both of these layers are dense layers. The activation function of the downscaling layer is the identity (for better expressivity), whereas the activation function of the upscaling layer can be specified via the keyword upscaling_activation.
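For example, a two-dimensional system could be lifted to a sixteen-dimensional transformer space as follows (a sketch; the dimensions and the choice of tanh are arbitrary):

    using GeometricMachineLearning

    # upscale a 2-dimensional system to a 16-dimensional transformer space
    model_architecture = StandardTransformerIntegrator(2; transformer_dim = 16,
                                                          upscaling_activation = tanh)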

GeometricMachineLearning does not have an implementation of such an upscaling for the volume-preserving transformer and the linear symplectic transformer. Symplectic liftings for learning higher-dimensional Hamiltonian representations of given data have, however, recently been discussed [72].

Library Functions

GeometricMachineLearning.StandardTransformerIntegrator — Type
StandardTransformerIntegrator(sys_dim)

Make an instance of StandardTransformerIntegrator for a specific system dimension.

Here the standard transformer is used as an integrator (see TransformerIntegrator).

It is a composition of MultiHeadAttention layers and ResNetLayer layers.

Arguments

The following are optional keyword arguments:

  • transformer_dim::Int = sys_dim: this is the dimension after the upscaling.
  • n_blocks::Int = 1: the number of ResNetLayer blocks.
  • n_heads::Int = sys_dim: the number of heads in the multihead attention layer.
  • L::Int = 2: the number of transformer blocks.
  • upscaling_activation = identity: the activation used in the upscaling layer.
  • resnet_activation = tanh: the activation used for the ResNetLayer.
  • add_connection::Bool = true: specifies if the input should be added to the output.
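Putting these keywords together, a fully specified call could look like this (a sketch; all values are arbitrary and merely override the defaults listed above):

    using GeometricMachineLearning

    model_architecture = StandardTransformerIntegrator(4;
        transformer_dim = 8,         # dimension after the upscaling
        n_blocks = 2,                # ResNetLayer blocks per transformer block
        n_heads = 4,                 # heads in the multihead attention layer
        L = 3,                       # number of transformer blocks
        upscaling_activation = tanh,
        resnet_activation = tanh,
        add_connection = true)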
GeometricMachineLearning.Transformer — Function
Transformer(dim, n_heads, L)

Make an instance of the Transformer with n_heads for dimension dim and L blocks.

Arguments

Transformer takes the following optional keyword arguments:

  • activation = tanh: the activation function used for the ResNetLayer.
  • Stiefel::Bool = false: if the matrices $P^V$, $P^Q$ and $P^K$ should live on a manifold.
  • add_connection::Bool = false: if the input should be added to the output after the MultiHeadAttention layer.
  • use_bias::Bool = true: specifies if the ResNetLayer should use a bias.
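A minimal usage sketch (the dimensions are arbitrary; n_heads is chosen to divide dim):

    using GeometricMachineLearning

    # a transformer of dimension 10 with 5 attention heads and 3 blocks;
    # the projection matrices are constrained to the Stiefel manifold
    model = Transformer(10, 5, 3; Stiefel = true)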
GeometricMachineLearning.ClassificationTransformer — Type
ClassificationTransformer(dl)

Make an instance of the ClassificationTransformer based on an instance of DataLoader.

This is a transformer neural network for classification purposes. At the moment this is only used for training on MNIST, but can in theory be used for any classification problem.

Arguments

The optional keyword arguments are:

  • n_heads::Int=7: The number of heads in the MultiHeadAttention (mha) layers.
  • L::Int=16: The number of transformer blocks.
  • activation=softmax: The activation function.
  • Stiefel::Bool=true: Whether the matrices in the mha layers are on the StiefelManifold.
  • add_connection::Bool=true: Whether the input is added to the output of the mha layer (skip connection).
GeometricMachineLearning.assign_output_estimate — Function
assign_output_estimate(full_output, prediction_window)

Crop the output to get the correct number of output vectors.

The function assign_output_estimate is closely related to the Transformer. It takes the last prediction_window columns of the output and uses them for the prediction.

i.e.

\[\mathbb{R}^{N\times{}T}\to\mathbb{R}^{N\times\mathtt{pw}}, \quad \begin{bmatrix} z^{(1)}_1 & \cdots & z^{(T)}_1 \\ \vdots & \ddots & \vdots \\ z^{(1)}_N & \cdots & z^{(T)}_N \end{bmatrix} \mapsto \begin{bmatrix} z^{(T - \mathtt{pw} + 1)}_1 & \cdots & z^{(T)}_1 \\ \vdots & \ddots & \vdots \\ z^{(T - \mathtt{pw} + 1)}_N & \cdots & z^{(T)}_N \end{bmatrix} \]

If prediction_window is equal to sequence_length, then this is not needed.
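A small sketch of this cropping, assuming a matrix-valued output as in the mapping above (the array values are arbitrary):

    using GeometricMachineLearning

    full_output = rand(3, 10)                            # N = 3 features, T = 10 time steps
    prediction = assign_output_estimate(full_output, 4)  # keeps the last 4 columns (a 3×4 array)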


References

[54]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin. Attention is all you need. Advances in neural information processing systems 30 (2017).
[85]
A. Solera-Rico, C. S. Vila, M. Gómez, Y. Wang, A. Almashjary, S. Dawson and R. Vinuesa. $\beta$-Variational autoencoders and transformers for reduced-order modelling of fluid flows. arXiv preprint arXiv:2304.03571 (2023).
  • [1] A layer of type GeometricMachineLearning.ResNetLayer is nothing more than a neural network to whose output we again add the input, i.e. every ResNet layer is of the form $\mathrm{ResNet}(x) = x + \mathcal{NN}(x)$.