Neural Network Optimizers
In this section we present the general `Optimizer` framework used in GeometricMachineLearning. For more information on the particular steps involved, consult the documentation on the various optimizer methods, such as the gradient optimizer, the momentum optimizer and the Adam optimizer, as well as the documentation on retractions.
During optimization we aim at changing the neural network parameters so as to minimize the loss function. A loss function assigns a scalar value to the weights that parametrize the neural network:
\[ L: \mathbb{P}\to\mathbb{R}_{\geq0},\quad \Theta \mapsto L(\Theta),\]
where $\mathbb{P}$ is the parameter space. We can then phrase the optimization task as:
Given a neural network $\mathcal{NN}$ parametrized by $\Theta$ and a loss function $L:\mathbb{P}\to\mathbb{R}$ we call an algorithm an iterative optimizer (or simply optimizer) if it performs the following task:
\[\Theta \leftarrow \mathtt{Optimizer}(\Theta, \text{past history}, t),\]
with the aim of decreasing the value $L(\Theta)$ in each optimization step.
The past history of the optimization is stored in a cache (`AdamCache`, `MomentumCache`, `GradientCache`, $\ldots$) in GeometricMachineLearning.
Optimization for neural networks is (almost always) some variation on gradient descent. The most basic form of gradient descent is a discretization of the gradient flow equation:
\[\dot{\Theta} = -\nabla_\Theta{}L,\]
by means of an Euler time-stepping scheme:
\[\Theta^{t+1} = \Theta^{t} - h\nabla_{\Theta^{t}}L,\]
where $h$ (the time step of the Euler scheme) is referred to as the learning rate.
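For concreteness, here is one such Euler step in plain Julia for a toy quadratic loss (a generic sketch, not library code; for $L(\Theta) = \|\Theta\|^2/2$ the Euclidean gradient is simply $\Theta$):

```julia
h = 0.1                 # learning rate (Euler step size)
Θ = rand(4, 2)          # current weights
∇L = Θ                  # gradient of the toy loss L(Θ) = ‖Θ‖²/2
Θ = Θ - h * ∇L          # Θᵗ⁺¹ = Θᵗ - h ∇L
```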
This equation can easily be generalized to manifolds with the following two steps:
- modify $-h\nabla_{\Theta^{t}}L\implies{}-h\mathrm{grad}_{\Theta^{t}}L,$ i.e. replace the Euclidean gradient by a Riemannian gradient, and
- replace addition with the geodesic map.
To sum up, we then have:
\[\Theta^{t+1} = \mathrm{geodesic}(\Theta^{t}, -h\mathrm{grad}_{\Theta^{t}}L).\]
In practice we very often do not use the geodesic map but approximations thereof. These approximations are called retractions.
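One classical retraction for matrix Lie groups is the Cayley transform. The following generic Julia sketch (purely illustrative, not the library implementation) compares it with the geodesic map, here the matrix exponential, for a small skew-symmetric velocity:

```julia
using LinearAlgebra

# Geodesic map (matrix exponential) versus the Cayley retraction; for
# skew-symmetric W both preserve orthonormality and agree up to 𝒪(‖W‖³).
geodesic_update(Y, W) = exp(W) * Y
cayley_update(Y, W) = (I - W / 2) \ ((I + W / 2) * Y)

Y = Matrix(1.0I, 5, 3)              # a matrix with orthonormal columns
A = rand(5, 5)
W = 0.01 * (A - A')                 # a small skew-symmetric velocity
norm(geodesic_update(Y, W) - cayley_update(Y, W))  # small difference
```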
Generalization to Homogeneous Spaces
In order to generalize neural network optimizers to homogeneous spaces we utilize their corresponding global tangent space representation $\mathfrak{g}^\mathrm{hor}$.
When introducing the notion of a global tangent space we discussed how an element of the tangent space $T_Y\mathcal{M}$ can be represented in $\mathfrak{g}^\mathrm{hor}$ by performing two mappings:
- the first one is the horizontal lift $\Omega$ (see the docstring for `GeometricMachineLearning.Ω`), and
- the second one is the adjoint operation[1] of $\lambda(Y),$ the section of $Y$, applied to $\Omega(\Delta).$
The two steps together are performed as `global_rep` in GeometricMachineLearning. So we lift to $\mathfrak{g}^\mathrm{hor}$:
\[\mathtt{global\_rep}: T_Y\mathcal{M} \to \mathfrak{g}^\mathrm{hor},\]
and then perform all the steps of the optimizer in $\mathfrak{g}^\mathrm{hor}.$ We can visualize all the steps required in the generalization of the optimizers:

[Figure: the mappings involved in one optimization step on a homogeneous space.]

This picture summarizes all steps involved in an optimization step:
- map the Euclidean gradient $\nabla{}L\in\mathbb{R}^{N\times{}n}$ that was obtained via automatic differentiation to the Riemannian gradient $\mathrm{grad}L\in{}T_Y\mathcal{M}$ with the function `rgrad`,
- obtain the global tangent space representation of $\mathrm{grad}L$ in $\mathfrak{g}^\mathrm{hor}$ with the function `global_rep`,
- perform an `update!`; this consists of two steps: (i) update the cache and (ii) output a final velocity,
- use this final velocity to update the global section $\Lambda\in{}G,$
- use the updated global section to update the neural network weight $\in\mathcal{M}.$ This is done with `apply_section` (see the self-contained sketch after this list).
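To make these mappings concrete, here is a self-contained toy version of one gradient step for the Stiefel manifold $St(n,N) = \{Y\in\mathbb{R}^{N\times{}n}:\,Y^TY=\mathbb{I}\}$. The names `rgrad_stiefel`, `Ω` and `stiefel_gradient_step` are didactic stand-ins written from the standard formulas for the canonical metric, not the library routines:

```julia
using LinearAlgebra

# Riemannian gradient and horizontal lift for the Stiefel manifold
# (didactic versions; the library's rgrad and Ω handle more general cases).
rgrad_stiefel(Y, ∇L) = ∇L - Y * (∇L' * Y)
Ω(Y, Δ) = (I - Y * Y' / 2) * Δ * Y' - Y * Δ' * (I - Y * Y' / 2)  # skew-symmetric

function stiefel_gradient_step(Y, ∇L; h = 0.1)
    gradL = rgrad_stiefel(Y, ∇L)     # Euclidean → Riemannian gradient
    exp(-h * Ω(Y, gradL)) * Y        # geodesic update with the lifted velocity
end

Y = Matrix(1.0I, 5, 3)               # a point on St(3, 5)
∇L = rand(5, 3)                      # a mock Euclidean gradient
Y⁺ = stiefel_gradient_step(Y, ∇L)
norm(Y⁺' * Y⁺ - I)                   # ≈ 0: the update stays on the manifold
```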
The cache stores information about previous optimization steps and is dependent on the optimizer. Typically the cache is represented as one or more elements in $\mathfrak{g}^\mathrm{hor}$. Based on this, the optimizer method (represented by `update!` in the figure) computes a final velocity, which is again an element of $\mathfrak{g}^\mathrm{hor}$. The particular form of the cache and the updating rule depend on which optimizer method we use.
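For instance, a momentum-style method could realize the cache and the velocity computation as follows (an illustrative sketch with made-up names, not the library's `MomentumCache` or `update!` implementation):

```julia
# The cache holds one matrix in 𝔤ʰᵒʳ: a running average of past gradients.
mutable struct ToyMomentumCache{T<:AbstractMatrix}
    B::T
end

function toy_update!(cache::ToyMomentumCache, B_grad; α = 0.9, h = 0.01)
    cache.B = α * cache.B + B_grad   # (i) update the cache
    -h * cache.B                     # (ii) output the final velocity
end

cache = ToyMomentumCache(zeros(4, 4))
velocity = toy_update!(cache, rand(4, 4))   # the final velocity, again in 𝔤ʰᵒʳ
```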
The final velocity is then fed into a retraction[2]. For computational reasons we split the retraction into two steps, referred to as "Retraction" and `apply_section` above. These two mappings together are equivalent to:
\[\mathrm{retraction}(\Delta) = \mathrm{retraction}(\lambda(Y)B^\Delta{}E) = \lambda(Y)\mathrm{Retraction}(B^\Delta), \]
where $\Delta\in{}T_Y\mathcal{M}$ and $B^\Delta$ is its representation in $\mathfrak{g}^\mathrm{hor}$ as $B^\Delta = \lambda(Y)^{-1}\Omega(\Delta)\lambda(Y).$
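That the two factorizations agree can be traced back to the identity $\exp(\Omega)\lambda = \lambda\exp(\lambda^{-1}\Omega\lambda)$ for the geodesic retraction. A quick numerical sanity check with random matrices (generic Julia, purely illustrative):

```julia
using LinearAlgebra

λ = exp(rand(4, 4))                    # an invertible matrix standing in for λ(Y)
Ω = rand(4, 4)                         # stands in for the horizontal lift Ω(Δ)
norm(exp(Ω) * λ - λ * exp(λ \ Ω * λ))  # ≈ 0 up to floating point error
```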
Library Functions
GeometricMachineLearning.Optimizer — Type

```julia
Optimizer(method, cache, step, retraction)
```

Store the method (e.g. `AdamOptimizer` with corresponding hyperparameters), the cache (e.g. `AdamCache`), the optimization step and the retraction.

It takes as input an optimization method and the parameters of a network.

Before one can call `Optimizer`, an `OptimizerMethod` that stores all the hyperparameters of the optimizer needs to be specified.
Functor
For an instance `o` of `Optimizer`, we can call the corresponding functor as:

```julia
o(nn, dl, batch, n_epochs, loss)
```

The arguments are:
- `nn::NeuralNetwork`
- `dl::DataLoader`
- `batch::Batch`
- `n_epochs::Integer`
- `loss::NetworkLoss`
The last argument is optional for many neural network architectures. We have the following defaults:
- A `TransformerIntegrator` uses `TransformerLoss`.
- A `NeuralNetworkIntegrator` uses `FeedForwardLoss` (this loss is defined in `AbstractNeuralNetworks`).
- An `AutoEncoder` uses `AutoEncoderLoss`.
In addition there is an optional keyword argument that can be supplied to the functor:
- `show_progress=true`: this specifies whether a progress bar should be shown during training.
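Putting this together, a training call could look as follows (a hedged sketch: `nn`, `dl` and `batch` are assumed to have been constructed beforehand as described above; `params` is the same helper imported in the example further below):

```julia
method = AdamOptimizer()            # an OptimizerMethod with its hyperparameters
o = Optimizer(method, params(nn))   # optimization method + network parameters
o(nn, dl, batch, 100)               # train for 100 epochs with the default loss
```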
Implementation
Internally the functor for `Optimizer` calls `GlobalSection` once at the start and then `optimize_for_one_epoch!` for each epoch.
GeometricMachineLearning.optimize_for_one_epoch! — Function

```julia
optimize_for_one_epoch!(opt, model, ps, dl, batch, loss, λY)
```

Sample the data contained in `dl` according to `batch` and optimize for these batches.

This step also performs automatic differentiation on `loss`.
The output of optimize_for_one_epoch! is the average loss over all batches of the epoch:
\[output = \frac{1}{\mathtt{steps\_per\_epoch}}\sum_{t=1}^\mathtt{steps\_per\_epoch}\mathtt{loss}(\theta^{(t-1)}).\]
This is done because any reverse differentiation routine always has two outputs; for Zygote:

```julia
loss_value, pullback = Zygote.pullback(ps -> loss(model, ps, input, output), ps)
```

So we get the value of the loss for free whenever we compute the pullback with AD.
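As a self-contained illustration of these two outputs with plain Zygote (independent of the library's types): the first return value is the loss itself, and calling the returned pullback with a seed of one yields the gradient:

```julia
using Zygote

W = rand(2, 2)
toy_loss(W) = sum(abs2, W)                     # L(W) = ‖W‖²

loss_value, pb = Zygote.pullback(toy_loss, W)  # two outputs: value and pullback
dW = pb(one(loss_value))[1]                    # gradient ∇_W L = 2W
dW ≈ 2W                                        # true
```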
Arguments
All the arguments are mandatory (there are no defaults):
- an instance of `Optimizer`,
- the neural network `model`,
- the neural network parameters `ps`,
- the data (i.e. an instance of `DataLoader`),
- `batch::Batch`: stores `batch_size` (and optionally `seq_length` and `prediction_window`),
- `loss::NetworkLoss`,
- the section `λY` of the parameters `ps`.
Implementation
Internally this calls `optimization_step!` for each minibatch.
The number of minibatches can be determined with `number_of_batches`:
```julia
using GeometricMachineLearning
using GeometricMachineLearning: number_of_batches

data = [1, 2, 3, 4, 5]
batch = Batch(2)
dl = DataLoader(data; suppress_info = true)

number_of_batches(dl, batch)

# output

3
```

GeometricMachineLearning.optimization_step! — Function

```julia
optimization_step!(o, λY, ps, dx)
```

Update the weights `ps` based on an `Optimizer`, a cache and first-order derivatives `dx`.
`optimization_step!` calls `update!` internally. `update!` has to be implemented for every `OptimizerMethod`.
Arguments
All arguments into optimization_step! are mandatory:
- `o::Optimizer`,
- `λY::NamedTuple`: this named tuple has the same keys as `ps`, but contains `GlobalSection`s,
- `ps::NamedTuple`: the neural network parameters,
- `dx::Union{NamedTuple, NeuralNetworkParameters}`: the gradients stored as a `NamedTuple`.

All the arguments are given as `NamedTuple`s as the neural network weights are stored in that format.
```julia
using GeometricMachineLearning
using GeometricMachineLearning: params

l = StiefelLayer(3, 5)
ps = params(NeuralNetwork(Chain(l), Float32)).L1
cache = apply_toNT(MomentumCache, ps)
o = Optimizer(MomentumOptimizer(), cache, 0, geodesic)
λY = GlobalSection(ps)
dx = (weight = rand(Float32, 5, 3), )

# call the optimizer
optimization_step!(o, λY, ps, dx)

_test_nt(x) = typeof(x) <: NamedTuple

_test_nt(λY) & _test_nt(ps) & _test_nt(cache) & _test_nt(dx)

# output

true
```

Extended help
The derivatives `dx` here are usually obtained via an AD routine by differentiating a loss function, i.e. `dx` is $\nabla_\Theta{}L$.
- [1] By the adjoint operation $\mathrm{ad}_A:\mathfrak{g}\to\mathfrak{g}$ for an element $A\in{}G$ we mean $B \mapsto A^{-1}BA$.
- [2] A retraction is an approximation of the geodesic map.