Neural Network Optimizers
In this section we present the general Optimizer framework used in GeometricMachineLearning. For more information on the particular steps involved, consult the documentation on the various optimizer methods, such as the momentum optimizer and the Adam optimizer, and the documentation on retractions.
During optimization we aim to change the neural network parameters in such a way that the loss function is minimized. So if we express the loss function $L$ as a function of the neural network weights $\Theta$ in a parameter space $\mathbb{P}$, we can phrase the task as follows:
Given a neural network $\mathcal{NN}$ parametrized by $\Theta$ and a loss function $L:\mathbb{P}\to\mathbb{R}$ we call an algorithm an iterative optimizer (or simply optimizer) if it performs the following task:
\[\Theta \leftarrow \mathtt{Optimizer}(\Theta, \text{past history}, t),\]
with the aim of decreasing the value $L(\Theta)$ in each optimization step.
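Abstractly, an optimization run is a loop over such steps. The following is a minimal sketch of this interface with a momentum-style "past history"; the names `optimizer_step!`, `loss` and `gradient` are placeholders for illustration, not part of GeometricMachineLearning:

```julia
# A toy loss L(Θ) = ‖Θ‖² and its Euclidean gradient.
loss(Θ) = sum(abs2, Θ)
gradient(Θ) = 2 .* Θ

# One step of Optimizer(Θ, past history, t); the past history is a
# simple velocity cache here (all names are hypothetical).
function optimizer_step!(Θ, cache, t; h = 0.1, α = 0.9)
    cache .= α .* cache .+ gradient(Θ)   # update the past history
    Θ .-= h .* cache                     # aim to decrease L(Θ)
    return Θ
end

Θ = [1.0, -2.0]
cache = zero(Θ)
for t in 1:200
    optimizer_step!(Θ, cache, t)
end
```

Each step only sees the current parameters, the cache and the step counter, which is exactly the signature given above.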
The past history of the optimization is stored in a cache (AdamCache, MomentumCache, GradientCache etc.) in GeometricMachineLearning.
Optimization for neural networks is (almost always) some variation on gradient descent. The most basic form of gradient descent is a discretization of the gradient flow equation:
\[\dot{\Theta} = -\nabla_\Theta{}L,\]
by means of an Euler time-stepping scheme:
\[\Theta^{t+1} = \Theta^{t} - h\nabla_{\Theta^{t}}L,\]
where $h$ (the time step of the Euler scheme) is referred to as the learning rate.
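For a toy quadratic loss the Euler step above can be sketched as follows (a minimal illustration, not library code):

```julia
# Toy loss with minimum at Θ = (1, …, 1) and its Euclidean gradient.
L(Θ) = sum(abs2, Θ .- 1)
∇L(Θ) = 2 .* (Θ .- 1)

h = 0.1          # learning rate, i.e. the Euler time step
Θ = zeros(3)
for _ in 1:100
    Θ .-= h .* ∇L(Θ)   # Θ^{t+1} = Θ^t - h ∇_{Θ^t} L
end
```

Each step moves the parameters a distance $h$ along the negative gradient, i.e. one explicit Euler step of the gradient flow.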
This equation can easily be generalized to manifolds by replacing the Euclidean gradient $\nabla_{\Theta^{t}}L$ with the Riemannian gradient $\mathrm{grad}_{\Theta^{t}}L$, and the addition of $-h\nabla_{\Theta^{t}}L$ with the application of the exponential map to $-h\mathrm{grad}_{\Theta^{t}}L$. In practice, however, we often use approximations of the exponential map. These are called retractions.
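As a concrete, library-independent illustration, consider minimizing a quadratic form on the unit sphere: the Riemannian gradient is the Euclidean gradient projected onto the tangent space, and renormalizing after the step is a simple retraction (a first-order approximation of the exponential map). All names here are illustrative:

```julia
using LinearAlgebra

# Minimize L(x) = x' * A * x on the unit sphere; the minimizer is the
# eigenvector of the smallest eigenvalue of A (here the third axis).
A = Diagonal([3.0, 2.0, 1.0])
L(x) = dot(x, A * x)
∇L(x) = 2 * (A * x)

# Riemannian gradient: project the Euclidean gradient onto T_x S².
riemannian_grad(x) = (g = ∇L(x); g - dot(x, g) * x)

# A simple retraction on the sphere: step, then renormalize.
retract(x, v) = (y = x + v; y / norm(y))

h = 0.1
x = normalize([1.0, 1.0, 1.0])
for _ in 1:500
    global x = retract(x, -h * riemannian_grad(x))
end
```

The iterate stays exactly on the sphere at every step, which is the point of using a retraction instead of a plain Euclidean update.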
Generalization to Homogeneous Spaces
In order to generalize neural network optimizers to homogeneous spaces we utilize their corresponding global tangent space representation $\mathfrak{g}^\mathrm{hor}$.
When introducing the notion of a global tangent space we discussed how an element of the tangent space $T_Y\mathcal{M}$ can be represented in $\mathfrak{g}^\mathrm{hor}$ by performing two mappings: the first one is the horizontal lift $\Omega$ (see the docstring for GeometricMachineLearning.Ω) and the second one is the adjoint operation[1] with the lift of $Y$ called $\lambda(Y)$. We can visualize the steps required in performing this generalization:
The cache stores information about previous optimization steps and is dependent on the optimizer. In general the cache is represented as one or more elements in $\mathfrak{g}^\mathrm{hor}$. Based on this, the optimizer method (represented by update! in the figure) computes a final velocity. This final velocity is again an element of $\mathfrak{g}^\mathrm{hor}$.
The final velocity is then fed into a retraction[2]. For computational reasons we split the retraction into two steps, referred to as "Retraction" and apply_section above. These two mappings together are equivalent to:
\[\mathrm{retraction}(\Delta) = \mathrm{retraction}(\lambda(Y)B^\Delta{}E) = \lambda(Y)\mathrm{Retraction}(B^\Delta), \]
where $\Delta\in{}T_Y\mathcal{M}$ and $B^\Delta$ is its representation in $\mathfrak{g}^\mathrm{hor}$ as $B^\Delta = \lambda(Y)^{-1}\Omega(\Delta)\lambda(Y).$
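The inner map Retraction acts directly on elements of $\mathfrak{g}^\mathrm{hor}$. A typical example of such a map is the Cayley transform, which approximates the matrix exponential and maps skew-symmetric matrices to orthogonal ones. The following is a library-independent sketch, not GeometricMachineLearning's actual implementation:

```julia
using LinearAlgebra

# Cayley transform: a retraction that agrees with exp(B) up to
# second order in B and is exactly orthogonal for skew-symmetric B.
cayley(B) = (I - B / 2) \ (I + B / 2)

B = 1e-2 * [0.0 1.0; -1.0 0.0]   # a small skew-symmetric element
R = cayley(B)                    # ≈ exp(B), a rotation matrix
```

Because only a linear solve is needed, such retractions are much cheaper than computing the exponential map exactly while preserving the manifold structure.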
Library Functions
GeometricMachineLearning.Optimizer — Type

Optimizer(method, cache, step, retraction)

Store the method (e.g. AdamOptimizer with corresponding hyperparameters), the cache (e.g. AdamCache), the optimization step and the retraction.

It takes as input an optimization method and the parameters of a network.

For technical reasons we first specify an OptimizerMethod that stores all the hyperparameters of the optimizer.
AbstractNeuralNetworks.update! — Function

update!(o, cache, dx::AbstractArray)

Update the cache based on the gradient information dx, compute the final velocity and store it in dx.

The optimizer o is needed because some updating schemes (such as AdamOptimizer) also need information on the current time step.
update!(o, cache, B)

First update the cache and then update the array B based on the optimizer o.

Note that $B\in\mathfrak{g}^\mathrm{hor}$ in general.
update!(o::Optimizer{<:BFGSOptimizer}, C, B)

Perform an update with the BFGS optimizer.

C is the cache, B contains the gradient information (the output of global_rep in general).

First we compute the final velocity with

vecS = -o.method.η * C.H * vec(B)

and then we update H:

C.H .= (𝕀 - ρ * SY) * C.H * (𝕀 - ρ * SY') + ρ * vecS * vecS'

where SY is vecS * Y' and 𝕀 is the identity.
Implementation

For stability we use δ for computing ρ:

ρ = 1. / (vecS' * Y + o.method.δ)

This is similar to the δ used in the AdamOptimizer.
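Put together, the H-update quoted above can be sketched as a self-contained function. Here Y denotes the gradient-difference vector appearing in the docstring, and B, H are plain arrays, whereas in the library B lives in $\mathfrak{g}^\mathrm{hor}$; the function name and defaults are illustrative:

```julia
using LinearAlgebra

# Sketch of the BFGS H-update described above (not the library's code).
function bfgs_update_H(H, B, Y; η = 0.1, δ = 1e-8)
    vecS = -η * H * vec(B)              # final velocity
    ρ = 1.0 / (vecS' * Y + δ)           # stabilized with δ
    SY = vecS * Y'
    H_new = (I - ρ * SY) * H * (I - ρ * SY') + ρ * vecS * vecS'
    return H_new, vecS
end

H = Matrix{Float64}(I, 3, 3)
B = [0.5, -1.0, 0.25]
Y = [0.3, -0.6, 0.1]
H_new, vecS = bfgs_update_H(H, B, Y)
```

Note that the update preserves the symmetry of H: the two outer factors are transposes of each other, and the rank-one term is symmetric by construction.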
Extend Help

If we have weights on a Manifold then the updates are slightly more difficult. In this case the vec operation has to be generalized to the corresponding global tangent space.
References
- [48]
- B. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).
- 1By the adjoint operation $\mathrm{ad}_A:\mathfrak{g}\to\mathfrak{g}$ for an element $A\in{}G$ we mean $B \mapsto A^{-1}BA$.
- 2A retraction is an approximation of the exponential map.