Neural Network Optimizers

In this section we present the general Optimizer framework used in GeometricMachineLearning. For more information on the particular steps involved, consult the documentation on the various optimizer methods, such as the momentum optimizer and the Adam optimizer, as well as the documentation on retractions.

During optimization we aim to change the neural network parameters so as to minimize the loss function. So if we express the loss function $L$ as a function of the neural network weights $\Theta$ in a parameter space $\mathbb{P}$, we can phrase the task as follows:

Definition

Given a neural network $\mathcal{NN}$ parametrized by $\Theta$ and a loss function $L:\mathbb{P}\to\mathbb{R}$, we call an algorithm an iterative optimizer (or simply optimizer) if it performs the following task:

\[\Theta \leftarrow \mathtt{Optimizer}(\Theta, \text{past history}, t),\]

with the aim of decreasing the value $L(\Theta)$ in each optimization step.

The past history of the optimization is stored in a cache (AdamCache, MomentumCache, GradientCache etc.) in GeometricMachineLearning.
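To make the role of the cache concrete, here is a minimal sketch of a momentum-style cache and update. The names SimpleMomentumCache and momentum_step! are illustrative and not part of GeometricMachineLearning:

    # Illustrative momentum cache: stores the previous velocity (the "past history").
    mutable struct SimpleMomentumCache{T <: AbstractArray}
        velocity::T
    end

    # One momentum step: mix the old velocity with the new gradient, then update Θ.
    function momentum_step!(cache::SimpleMomentumCache, Θ, grad; h = 0.01, α = 0.9)
        cache.velocity .= α .* cache.velocity .+ grad
        Θ .-= h .* cache.velocity
        return Θ
    end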

Optimization for neural networks is (almost always) some variation on gradient descent. The most basic form of gradient descent is a discretization of the gradient flow equation:

\[\dot{\Theta} = -\nabla_\Theta{}L,\]

by means of an Euler time-stepping scheme:

\[\Theta^{t+1} = \Theta^{t} - h\nabla_{\Theta^{t}}L,\]

where $h$ (the time step of the Euler scheme) is referred to as the learning rate.
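As a simple illustration, one such Euler step can be written in a single line of Julia (a generic sketch, not the implementation used in GeometricMachineLearning):

    # Generic Euler step for gradient descent: Θ ← Θ - h∇L.
    gradient_descent_step(Θ::AbstractArray, grad::AbstractArray, h::Real) = Θ .- h .* grad

    # Example: one step for L(Θ) = |Θ|²/2, whose gradient is Θ itself.
    Θ = [1.0, 2.0, 3.0]
    Θ = gradient_descent_step(Θ, Θ, 0.1)   # returns [0.9, 1.8, 2.7]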

This equation can easily be generalized to manifolds by replacing the Euclidean gradient $\nabla_{\Theta^{t}}L$ with the Riemannian gradient $\mathrm{grad}_{\Theta^{t}}L$, and by replacing the addition of $-h\nabla_{\Theta^{t}}L$ with the application of the exponential map to $-h\mathrm{grad}_{\Theta^{t}}L$. In practice we often use approximations of the exponential map, however. These are called retractions.
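As an example consider the unit sphere: projecting the Euclidean gradient onto the tangent space gives the Riemannian gradient, and renormalizing after a Euclidean step is a valid retraction. This is a generic sketch, independent of the retractions implemented in GeometricMachineLearning:

    using LinearAlgebra

    # Riemannian gradient on the sphere: project the Euclidean gradient onto T_ΘS².
    riemannian_gradient(Θ, ∇L) = ∇L - (Θ' * ∇L) * Θ

    # Retraction on the sphere: take a Euclidean step, then normalize back onto the manifold.
    function retraction_step(Θ, ∇L, h)
        v = -h * riemannian_gradient(Θ, ∇L)
        (Θ + v) / norm(Θ + v)
    end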

Generalization to Homogeneous Spaces

In order to generalize neural network optimizers to homogeneous spaces we utilize their corresponding global tangent space representation $\mathfrak{g}^\mathrm{hor}$.

When introducing the notion of a global tangent space we discussed how an element of the tangent space $T_Y\mathcal{M}$ can be represented in $\mathfrak{g}^\mathrm{hor}$ by performing two mappings: the first one is the horizontal lift $\Omega$ (see the docstring for GeometricMachineLearning.Ω) and the second one is the adjoint operation[1] with respect to the lift of $Y$, which we call $\lambda(Y)$. We can visualize the steps required in performing this generalization:

The cache stores information about previous optimization steps and is dependent on the optimizer. In general the cache is represented as one or more elements in $\mathfrak{g}^\mathrm{hor}$. Based on this, the optimizer method (represented by update! in the figure) computes a final velocity. This final velocity is again an element of $\mathfrak{g}^\mathrm{hor}$.
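Because elements of $\mathfrak{g}^\mathrm{hor}$ are matrices, an update rule such as the momentum step sketched earlier carries over essentially unchanged; the function name below is again illustrative:

    # The momentum update carries over verbatim: the cache and the gradient
    # representation B are now matrices in g^hor instead of flat arrays.
    function final_velocity!(velocity::AbstractMatrix, B::AbstractMatrix; α = 0.9)
        velocity .= α .* velocity .+ B   # update the cache ...
        return velocity                  # ... which also serves as the final velocity
    end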

The final velocity is then fed into a retraction[2]. For computational reasons we split the retraction into two steps, referred to as "Retraction" and apply_section above. These two mappings together are equivalent to:

\[\mathrm{retraction}(\Delta) = \mathrm{retraction}(\lambda(Y)B^\Delta{}E) = \lambda(Y)\mathrm{Retraction}(B^\Delta), \]

where $\Delta\in{}T_Y\mathcal{M}$ and $B^\Delta$ is its representation in $\mathfrak{g}^\mathrm{hor}$ as $B^\Delta = \lambda(Y)^{-1}\Omega(\Delta)\lambda(Y).$
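As a concrete instance of this splitting, the sketch below uses the Cayley transform for $\mathrm{Retraction}$; which retraction GeometricMachineLearning actually applies depends on the chosen method, so this is purely illustrative:

    using LinearAlgebra

    # Cayley transform as an example of the mapping "Retraction": g^hor → G.
    cayley(B::AbstractMatrix) = (I - B / 2) \ (I + B / 2)

    # The full retraction then factors as λ(Y) applied to Retraction(B^Δ),
    # i.e. retraction(Δ) = λ(Y) * Retraction(B^Δ) as in the equation above.
    apply_section_sketch(λY::AbstractMatrix, B::AbstractMatrix) = λY * cayley(B)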

Library Functions

GeometricMachineLearning.Optimizer — Type
Optimizer(method, cache, step, retraction)

Store the method (e.g. AdamOptimizer with corresponding hyperparameters), the cache (e.g. AdamCache), the optimization step and the retraction.

The constructor takes as input an optimization method and the parameters of a network.

For technical reasons we first specify an OptimizerMethod that stores all the hyperparameters of the optimizer.
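A typical construction, based on the description above, might look as follows; the exact signature and the existence of nn.params are assumptions here:

    using GeometricMachineLearning

    # `nn` is assumed to be an already constructed neural network whose weights
    # live in `nn.params`. The method stores the hyperparameters; the Optimizer
    # constructor builds the matching cache (here: AdamCache) from the parameters.
    method = AdamOptimizer()
    o = Optimizer(method, nn.params)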

AbstractNeuralNetworks.update! — Function
update!(o, cache, dx::AbstractArray)

Update the cache based on the gradient information dx, compute the final velocity and store it in dx.

The optimizer o is needed because some updating schemes (such as AdamOptimizer) also need information on the current time step.

update!(o, cache, B)

First update the cache and then update the array B based on the optimizer o.

Note that $B\in\mathfrak{g}^\mathrm{hor}$ in general.

update!(o::Optimizer{<:BFGSOptimizer}, C, B)

Perform an update with the BFGS optimizer.

C is the cache, B contains the gradient information (the output of global_rep in general).

First we compute the final velocity with

    vecS = -o.method.η * C.H * vec(B)

and then we update H

    C.H .= (𝕀 - ρ * SY) * C.H * (𝕀 - ρ * SY') + ρ * vecS * vecS'

where SY is vecS * Y' and 𝕀 is the identity.

Implementation

For stability we use δ for computing ρ:

    ρ = 1. / (vecS' * Y + o.method.δ)

This is similar to the AdamOptimizer.

Extended help

If we have weights on a manifold, then the updates are slightly more difficult. In this case the vec operation has to be generalized to the corresponding global tangent space.
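Putting the pieces of this docstring together, a self-contained sketch of the update for flat (Euclidean) weights could look as follows; Y plays the role of the gradient difference in standard BFGS and the keyword values are illustrative:

    using LinearAlgebra

    # One BFGS-style update as described above: compute the final velocity from
    # the approximate inverse Hessian H, then update H itself.
    function bfgs_update!(H::AbstractMatrix, B::AbstractArray, Y::AbstractVector;
                          η = 0.01, δ = 1e-8)
        vecS = -η * H * vec(B)        # final velocity (flattened)
        ρ = 1.0 / (vecS' * Y + δ)     # δ stabilizes the division, as noted above
        SY = vecS * Y'
        H .= (I - ρ * SY) * H * (I - ρ * SY') + ρ * vecS * vecS'
        return vecS
    end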


[1] By the adjoint operation $\mathrm{ad}_A:\mathfrak{g}\to\mathfrak{g}$ for an element $A\in{}G$ we mean $B \mapsto A^{-1}BA$.
[2] A retraction is an approximation of the exponential map.