Neural Network Optimizers

In this section we present the general Optimizer framework used in GeometricMachineLearning. For more information on the particular steps involved in this consult the documentation on the various optimizer methods such as the gradient optimizer, the momentum optimizer and the Adam optimizer, and the documentation on retractions.

During optimization we aim at changing the neural network parameters in such a way to minimize the loss function. A loss function assigns a scalar value to the weights that parametrize the neural network:

\[ L: \mathbb{P}\to\mathbb{R}_{\geq0},\quad \Theta \mapsto L(\Theta),\]

where $\mathbb{P}$ is the parameter space. We can then phrase the optimization task as:

Definition

Given a neural network $\mathcal{NN}$ parametrized by $\Theta$ and a loss function $L:\mathbb{P}\to\mathbb{R}$ we call an algorithm an iterative optimizer (or simply optimizer) if it performs the following task:

\[\Theta \leftarrow \mathtt{Optimizer}(\Theta, \text{past history}, t),\]

with the aim of decreasing the value $L(\Theta)$ in each optimization step.

The past history of the optimization is stored in a cache (AdamCache, MomentumCache, GradientCache, $\ldots$) in GeometricMachineLearning.

Optimization for neural networks is (almost always) some variation on gradient descent. The most basic form of gradient descent is a discretization of the gradient flow equation:

\[\dot{\Theta} = -\nabla_\Theta{}L,\]

by means of an Euler time-stepping scheme:

\[\Theta^{t+1} = \Theta^{t} - h\nabla_{\Theta^{t}}L,\]

where $\eta$ (the time step of the Euler scheme) is referred to as the learning rate.

This equation can easily be generalized to manifolds with the following two steps:

  1. modify $-\nabla_{\Theta^{t}}L\implies{}-h\mathrm{grad}_{\Theta^{t}}L,$ i.e. replace the Euclidean gradient by a Riemannian gradient and
  2. replace addition with the geodesic map.

To sum up, we then have:

\[\Theta^{t+1} = \mathrm{geodesic}(\Theta^{t}, -h\mathrm{grad}_{\Theta^{t}}L).\]

In practice we very often do not use the geodesic map but approximations thereof. These approximations are called retractions.

Generalization to Homogeneous Spaces

In order to generalize neural network optimizers to homogeneous spaces we utilize their corresponding global tangent space representation $\mathfrak{g}^\mathrm{hor}$.

When introducing the notion of a global tangent space we discussed how an element of the tangent space $T_Y\mathcal{M}$ can be represented in $\mathfrak{g}^\mathrm{hor}$ by performing two mappings:

  1. the first one is the horizontal lift $\Omega$ (see the docstring for GeometricMachineLearning.Ω) and
  2. the second one is performing the adjoint operation[1] of $\lambda(Y),$ the section of $Y$, on $\Omega(\Delta).$

The two steps together are performed as global_rep in GeometricMachineLearning. So we lift to $\mathfrak{g}^\mathrm{hor}$:

\[\mathtt{global\_rep}: T_Y\mathcal{M} \to \mathfrak{g}^\mathrm{hor},\]

and then perform all the steps of the optimizer in $\mathfrak{g}^\mathrm{hor}.$ We can visualize all the steps required in the generalization of the optimizers:

This picture summarizes all steps involved in an optimization step:

  1. map the Euclidean gradient $\nabla{}L\in\mathbb{R}^{N\times{}n}$ that was obtained via automatic differentiation to the Riemannian gradient $\mathrm{grad}L\in{}T_Y\mathcal{M}$ with the function rgrad,
  2. obtain the global tangent space representation of $\mathrm{grad}L$ in $\mathfrak{g}^\mathrm{hor}$ with the function global_rep,
  3. perform an update!; this consists of two steps: (i) update the cache and (ii) output a final velocity,
  4. use this final velocity to update the global section $\Lambda\in{}G,$
  5. use the updated global section to update the neural network weight $\in\mathcal{M}.$ This is done with apply_section.

The cache stores information about previous optimization steps and is dependent on the optimizer. Typically the cache is represented as one or more elements in $\mathfrak{g}^\mathrm{hor}$. Based on this the optimizer method (represented by update! in the figure) computes a final velocity. This final velocity is again an element of $\mathfrak{g}^\mathrm{hor}$. The particular form of the cache and the updating rule depends on which optimizer method we use.

The final velocity is then fed into a retraction[2]. For computational reasons we split the retraction into two steps, referred to as "Retraction" and apply_section above. These two mappings together are equivalent to:

\[\mathrm{retraction}(\Delta) = \mathrm{retraction}(\lambda(Y)B^\Delta{}E) = \lambda(Y)\mathrm{Retraction}(B^\Delta), \]

where $\Delta\in{}T_\mathcal{M}$ and $B^\Delta$ is its representation in $\mathfrak{g}^\mathrm{hor}$ as $B^\Delta = \lambda(Y)^{-1}\Omega(\Delta)\lambda(Y).$

Library Functions

GeometricMachineLearning.OptimizerType
Optimizer(method, cache, step, retraction)

Store the method (e.g. AdamOptimizer with corresponding hyperparameters), the cache (e.g. AdamCache), the optimization step and the retraction.

It takes as input an optimization method and the parameters of a network.

Before one can call Optimizer a OptimizerMethod that stores all the hyperparameters of the optimizer needs to be specified.

Functor

For an instance o of Optimizer, we can call the corresponding functor as:

o(nn, dl, batch, n_epochs, loss)

The arguments are:

  1. nn::NeuralNetwork
  2. dl::DataLoader
  3. batch::Batch
  4. n_epochs::Integer
  5. loss::NetworkLoss

The last argument is optional for many neural network architectures. We have the following defaults:

In addition there is an optional keyword argument that can be supplied to the functor:

  • show_progress=true: This specifies whether a progress bar should be shown during training.

Implementation

Internally the functor for Optimizer calls GlobalSection once at the start and then optimize_for_one_epoch! for each epoch.

source
GeometricMachineLearning.optimize_for_one_epoch!Function
optimize_for_one_epoch!(opt, model, ps, dl, batch, loss, λY)

Sample the data contained in dl according to batch and optimize for these batches.

This step also performs automatic differentiation on loss.

The output of optimize_for_one_epoch! is the average loss over all batches of the epoch:

\[output = \frac{1}{\mathtt{steps\_per\_epoch}}\sum_{t=1}^\mathtt{steps\_per\_epoch}\mathtt{loss}(\theta^{(t-1)}).\]

This is done because any reverse differentiation routine always has two outputs; for Zygote:

loss_value, pullback = Zygote.pullback(ps -> loss(model, ps, input, output), ps)

So we get the value for the loss for free whenever we compute the pullback with AD.

Arguments

All the arguments are mandatory (there are no defaults):

  1. an instance of Optimizer.
  2. the neural network model.
  3. the neural network parameters ps.
  4. the data (i.e. an instance of DataLoader).
  5. batch::Batch: stores batch_size (and optionally seq_length and prediction_window).
  6. loss::NetworkLoss.
  7. the section λY of the parameters ps.

Implementation

Internally this calls optimization_step! for each minibatch.

The number of minibatches can be determined with number_of_batches:

using GeometricMachineLearning
using GeometricMachineLearning: number_of_batches

data = [1, 2, 3, 4, 5]
batch = Batch(2)
dl = DataLoader(data; suppress_info = true)

number_of_batches(dl, batch)

# output

3
source
GeometricMachineLearning.optimization_step!Function
optimization_step!(o, λY, ps, dx)

Update the weights ps based on an Optimizer, a cache and first-order derivatives dx.

optimization_step! is calling update! internally. update! has to be implemented for every OptimizerMethod.

Arguments

All arguments into optimization_step! are mandatory:

  1. o::Optimizer,
  2. λY::NamedTuple: this named tuple has the same keys as ps, but contains GlobalSections,
  3. ps::NamedTuple: the neural network parameters,
  4. dx::NamedTuple: the gradients stores as a NamedTuple.

All the arguments are given as NamedTuples as the neural network weights are stores in that format.

using GeometricMachineLearning
using GeometricMachineLearning: params

l = StiefelLayer(3, 5)
ps = params(NeuralNetwork(Chain(l), Float32)).L1
cache = apply_toNT(MomentumCache, ps)
o = Optimizer(MomentumOptimizer(), cache, 0, geodesic)
λY = GlobalSection(ps)
dx = (weight = rand(Float32, 5, 3), )

# call the optimizer
optimization_step!(o, λY, ps, dx)

_test_nt(x) = typeof(x) <: NamedTuple

_test_nt(λY) & _test_nt(ps) & _test_nt(cache) & _test_nt(dx)

# output

true

Extended help

The derivatives dx here are usually obtained via an AD routine by differentiating a loss function, i.e. dx is $\nabla_xL$.

source

References

[7]
B. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).
  • 1By the adjoint operation $\mathrm{ad}_A:\mathfrak{g}\to\mathfrak{g}$ for an element $A\in{}G$ we mean $B \mapsto A^{-1}BA$.
  • 2A retraction is an approximation of the geodesic map