Neural Network Optimizers
In this section we present the general Optimizer framework used in GeometricMachineLearning
. For more information on the particular steps involved in this consult the documentation on the various optimizer methods such as the gradient optimizer, the momentum optimizer and the Adam optimizer, and the documentation on retractions.
During optimization we aim at changing the neural network parameters in such a way to minimize the loss function. A loss function assigns a scalar value to the weights that parametrize the neural network:
\[ L: \mathbb{P}\to\mathbb{R}_{\geq0},\quad \Theta \mapsto L(\Theta),\]
where $\mathbb{P}$ is the parameter space. We can then phrase the optimization task as:
Given a neural network $\mathcal{NN}$ parametrized by $\Theta$ and a loss function $L:\mathbb{P}\to\mathbb{R}$ we call an algorithm an iterative optimizer (or simply optimizer) if it performs the following task:
\[\Theta \leftarrow \mathtt{Optimizer}(\Theta, \text{past history}, t),\]
with the aim of decreasing the value $L(\Theta)$ in each optimization step.
The past history of the optimization is stored in a cache (AdamCache
, MomentumCache
, GradientCache
, $\ldots$) in GeometricMachineLearning
.
Optimization for neural networks is (almost always) some variation on gradient descent. The most basic form of gradient descent is a discretization of the gradient flow equation:
\[\dot{\Theta} = -\nabla_\Theta{}L,\]
by means of an Euler time-stepping scheme:
\[\Theta^{t+1} = \Theta^{t} - h\nabla_{\Theta^{t}}L,\]
where $\eta$ (the time step of the Euler scheme) is referred to as the learning rate.
This equation can easily be generalized to manifolds with the following two steps:
- modify $-\nabla_{\Theta^{t}}L\implies{}-h\mathrm{grad}_{\Theta^{t}}L,$ i.e. replace the Euclidean gradient by a Riemannian gradient and
- replace addition with the geodesic map.
To sum up, we then have:
\[\Theta^{t+1} = \mathrm{geodesic}(\Theta^{t}, -h\mathrm{grad}_{\Theta^{t}}L).\]
In practice we very often do not use the geodesic map but approximations thereof. These approximations are called retractions.
Generalization to Homogeneous Spaces
In order to generalize neural network optimizers to homogeneous spaces we utilize their corresponding global tangent space representation $\mathfrak{g}^\mathrm{hor}$.
When introducing the notion of a global tangent space we discussed how an element of the tangent space $T_Y\mathcal{M}$ can be represented in $\mathfrak{g}^\mathrm{hor}$ by performing two mappings:
- the first one is the horizontal lift $\Omega$ (see the docstring for
GeometricMachineLearning.Ω
) and - the second one is performing the adjoint operation[1] of $\lambda(Y),$ the section of $Y$, on $\Omega(\Delta).$
The two steps together are performed as global_rep
in GeometricMachineLearning.
So we lift to $\mathfrak{g}^\mathrm{hor}$:
\[\mathtt{global\_rep}: T_Y\mathcal{M} \to \mathfrak{g}^\mathrm{hor},\]
and then perform all the steps of the optimizer in $\mathfrak{g}^\mathrm{hor}.$ We can visualize all the steps required in the generalization of the optimizers:
This picture summarizes all steps involved in an optimization step:
- map the Euclidean gradient $\nabla{}L\in\mathbb{R}^{N\times{}n}$ that was obtained via automatic differentiation to the Riemannian gradient $\mathrm{grad}L\in{}T_Y\mathcal{M}$ with the function
rgrad
, - obtain the global tangent space representation of $\mathrm{grad}L$ in $\mathfrak{g}^\mathrm{hor}$ with the function
global_rep
, - perform an
update!
; this consists of two steps: (i) update the cache and (ii) output a final velocity, - use this final velocity to update the global section $\Lambda\in{}G,$
- use the updated global section to update the neural network weight $\in\mathcal{M}.$ This is done with
apply_section
.
The cache
stores information about previous optimization steps and is dependent on the optimizer. Typically the cache is represented as one or more elements in $\mathfrak{g}^\mathrm{hor}$. Based on this the optimizer method (represented by update!
in the figure) computes a final velocity. This final velocity is again an element of $\mathfrak{g}^\mathrm{hor}$. The particular form of the cache and the updating rule depends on which optimizer method we use.
The final velocity is then fed into a retraction[2]. For computational reasons we split the retraction into two steps, referred to as "Retraction" and apply_section
above. These two mappings together are equivalent to:
\[\mathrm{retraction}(\Delta) = \mathrm{retraction}(\lambda(Y)B^\Delta{}E) = \lambda(Y)\mathrm{Retraction}(B^\Delta), \]
where $\Delta\in{}T_\mathcal{M}$ and $B^\Delta$ is its representation in $\mathfrak{g}^\mathrm{hor}$ as $B^\Delta = \lambda(Y)^{-1}\Omega(\Delta)\lambda(Y).$
Library Functions
GeometricMachineLearning.Optimizer
— TypeOptimizer(method, cache, step, retraction)
Store the method
(e.g. AdamOptimizer
with corresponding hyperparameters), the cache
(e.g. AdamCache
), the optimization step and the retraction.
It takes as input an optimization method and the parameters of a network.
Before one can call Optimizer
a OptimizerMethod
that stores all the hyperparameters of the optimizer needs to be specified.
Functor
For an instance o
of Optimizer
, we can call the corresponding functor as:
o(nn, dl, batch, n_epochs, loss)
The arguments are:
nn::NeuralNetwork
dl::
DataLoader
batch::
Batch
n_epochs::Integer
loss::NetworkLoss
The last argument is optional for many neural network architectures. We have the following defaults:
- A
TransformerIntegrator
usesTransformerLoss
. - A
NeuralNetworkIntegrator
usesFeedForwardLoss
. - An
AutoEncoder
usesAutoEncoderLoss
.
In addition there is an optional keyword argument that can be supplied to the functor:
show_progress=true
: This specifies whether a progress bar should be shown during training.
Implementation
Internally the functor for Optimizer
calls GlobalSection
once at the start and then optimize_for_one_epoch!
for each epoch.
GeometricMachineLearning.optimize_for_one_epoch!
— Functionoptimize_for_one_epoch!(opt, model, ps, dl, batch, loss, λY)
Sample the data contained in dl
according to batch
and optimize for these batches.
This step also performs automatic differentiation on loss
.
The output of optimize_for_one_epoch!
is the average loss over all batches of the epoch:
\[output = \frac{1}{\mathtt{steps\_per\_epoch}}\sum_{t=1}^\mathtt{steps\_per\_epoch}\mathtt{loss}(\theta^{(t-1)}).\]
This is done because any reverse differentiation routine always has two outputs; for Zygote
:
loss_value, pullback = Zygote.pullback(ps -> loss(model, ps, input, output), ps)
So we get the value for the loss for free whenever we compute the pullback with AD.
Arguments
All the arguments are mandatory (there are no defaults):
- an instance of
Optimizer
. - the neural network model.
- the neural network parameters
ps
. - the data (i.e. an instance of
DataLoader
). batch
::Batch
: storesbatch_size
(and optionallyseq_length
andprediction_window
).loss::NetworkLoss
.- the section
λY
of the parametersps
.
Implementation
Internally this calls optimization_step!
for each minibatch.
The number of minibatches can be determined with number_of_batches
:
using GeometricMachineLearning
using GeometricMachineLearning: number_of_batches
data = [1, 2, 3, 4, 5]
batch = Batch(2)
dl = DataLoader(data; suppress_info = true)
number_of_batches(dl, batch)
# output
3
GeometricMachineLearning.optimization_step!
— Functionoptimization_step!(o, λY, ps, dx)
Update the weights ps
based on an Optimizer
, a cache
and first-order derivatives dx
.
optimization_step!
is calling update!
internally. update!
has to be implemented for every OptimizerMethod
.
Arguments
All arguments into optimization_step!
are mandatory:
o::
Optimizer
,λY::NamedTuple
: this named tuple has the same keys asps
, but containsGlobalSection
s,ps::NamedTuple
: the neural network parameters,dx::NamedTuple
: the gradients stores as a NamedTuple.
All the arguments are given as NamedTuple
s as the neural network weights are stores in that format.
using GeometricMachineLearning
using GeometricMachineLearning: params
l = StiefelLayer(3, 5)
ps = params(NeuralNetwork(Chain(l), Float32)).L1
cache = apply_toNT(MomentumCache, ps)
o = Optimizer(MomentumOptimizer(), cache, 0, geodesic)
λY = GlobalSection(ps)
dx = (weight = rand(Float32, 5, 3), )
# call the optimizer
optimization_step!(o, λY, ps, dx)
_test_nt(x) = typeof(x) <: NamedTuple
_test_nt(λY) & _test_nt(ps) & _test_nt(cache) & _test_nt(dx)
# output
true
Extended help
The derivatives dx
here are usually obtained via an AD routine by differentiating a loss function, i.e. dx
is $\nabla_xL$.
References
- [7]
- B. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).
- 1By the adjoint operation $\mathrm{ad}_A:\mathfrak{g}\to\mathfrak{g}$ for an element $A\in{}G$ we mean $B \mapsto A^{-1}BA$.
- 2A retraction is an approximation of the geodesic map