Standard Neural Network Optimizers

In this section we discuss optimization methods that are often used in training neural networks. The BFGS optimizer can also be viewed as a standard neural network optimizer, but is treated in a separate section because of its complexity. From the manifold perspective, all optimizer methods outlined here operate on $\mathfrak{g}^\mathrm{hor}$ only. Each of them has a cache associated with it[1], and this cache is updated with the function update!. The precise role of this function is described below.

The Gradient Optimizer

The gradient optimizer is the simplest optimization algorithm used to train neural networks. It was already briefly discussed when we introduced Riemannian manifolds.

It simply does:

\[\mathrm{weight} \leftarrow \mathrm{weight} + (-\eta\cdot\mathrm{gradient}),\]

where addition has to be replaced with appropriate operations in the manifold case[2].
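For weights on a vector space this is just an in-place gradient step. A minimal sketch in plain Julia (the names are illustrative and not part of the library API):

lr = 0.01                      # the learning rate η
weight = randn(4, 4)           # a weight on a vector space
gradient = randn(4, 4)         # ∇_weight L, e.g. obtained from an AD backend

weight .-= lr .* gradient      # weight ← weight + (-η ⋅ gradient)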

When calling GradientOptimizer we can specify a learning rate $\eta$ (or use the default).

const η = 0.01
method = GradientOptimizer(η)
GradientOptimizer{Float64}(0.01)

In order to use the optimizer we need an instance of Optimizer that is called with the method and the weights of the neural network:

weight = (A = zeros(4, 4), )
o = Optimizer(method, weight)
Optimizer{GradientOptimizer{Float64}, @NamedTuple{A::GradientCache{Float64}}, typeof(cayley)}(GradientOptimizer{Float64}(0.01), (A = GradientCache{Float64}(),), 0, GeometricMachineLearning.cayley)

If we operate on a derivative with update!, this computes a final velocity that is then used as input to a retraction (or is simply added to the weight if we do not deal with a manifold):

dx = (A = one(weight.A), )
update!(o, o.cache, dx)

dx.A
4×4 Matrix{Float64}:
 -0.01  -0.0   -0.0   -0.0
 -0.0   -0.01  -0.0   -0.0
 -0.0   -0.0   -0.01  -0.0
 -0.0   -0.0   -0.0   -0.01

So what has happened here is that the gradient dx was simply multiplied by $-\eta$, as the cache of the gradient optimizer is trivial.
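Because A lives on a vector space here, applying this velocity would amount to a simple addition; a minimal sketch of that step (not the library's internal routine):

weight.A .+= dx.A              # weight ← weight + velocity; no retraction is needed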

The Momentum Optimizer

The momentum optimizer is similar to the gradient optimizer but additionally stores past gradient information in the form of first moments. We let these first moments decay with a decay parameter $\alpha$:

\[\mathrm{weights} \leftarrow \mathrm{weights} + (\alpha\cdot\mathrm{moment} - \eta\cdot\mathrm{gradient}),\]

where addition has to be replaced with appropriate operations in the manifold case.
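For a weight on a vector space the momentum rule above can be sketched in plain Julia as follows (W, M and ∇L are illustrative names; the moment M plays the role of the cache):

W  = randn(4, 4)               # a weight on a vector space
M  = zero(W)                   # the first moment (the cache), initialized with zeros
∇L = randn(4, 4)               # the gradient for the current step

M .= 0.5 .* M .- 0.01 .* ∇L    # moment ← α ⋅ moment - η ⋅ gradient (with α = 0.5, η = 0.01)
W .+= M                        # weight ← weight + moment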

In the case of the momentum optimizer the cache is non-trivial:

const α = 0.5
method = MomentumOptimizer(η, α)
o = Optimizer(method, weight)

o.cache.A # the cache is stored for each array in `weight` (which is a `NamedTuple`)
`MomentumCache` that currently stores `B` as ...
4×4 Matrix{Float64}:
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

But as the cache is initialized with zeros it will lead to the same result as the gradient optimizer in the first iteration:

dx = (A = one(weight.A), )

update!(o, o.cache, dx)

dx.A
4×4 Matrix{Float64}:
 -0.01  -0.0   -0.0   -0.0
 -0.0   -0.01  -0.0   -0.0
 -0.0   -0.0   -0.01  -0.0
 -0.0   -0.0   -0.0   -0.01

The cache has changed however:

o.cache.A
`MomentumCache` that currently stores `B` as ...
4×4 Matrix{Float64}:
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  0.0  1.0  0.0
 0.0  0.0  0.0  1.0

If we have weights on manifolds, calling Optimizer will automatically allocate the correct cache on $\mathfrak{g}^\mathrm{hor}$:

weight = (Y = rand(StiefelManifold, 5, 3), )

Optimizer(method, weight).cache.Y
`MomentumCache` that currently stores `B` as ...
5×5 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
 0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0

So if the weight is $Y\in{}St(n,N)$, the corresponding cache is initialized as the zero element of $\mathfrak{g}^\mathrm{hor}\subset\mathbb{R}^{N\times{}N}$, as this is the global tangent space representation corresponding to the StiefelManifold.

The Adam Optimizer

The Adam optimizer is one of the most widely used neural network optimizers. The cache of the Adam optimizer consists of first and second moments. The first moments $B_1$, similar to the momentum optimizer, store linear information about the current and previous gradients, and the second moments $B_2$ store quadratic information about the current and previous gradients. These second moments can be interpreted as approximating the curvature of the optimization landscape.

If all the weights are on a vector space, then we directly compute updates for $B_1$ and $B_2$:

  1. $B_1 \gets ((\rho_1 - \rho_1^t)/(1 - \rho_1^t))\cdot{}B_1 + (1 - \rho_1)/(1 - \rho_1^t)\cdot{}\nabla{}L,$
  2. $B_2 \gets ((\rho_2 - \rho_2^t)/(1 - \rho_2^t))\cdot{}B_2 + (1 - \rho_2)/(1 - \rho_2^t)\cdot\nabla{}L\odot\nabla{}L,$

where $\odot:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}^n$ is the Hadamard product: $[a\odot{}b]_i = a_ib_i.$ $\rho_1$ and $\rho_2$ are hyperparameters. Their defaults, $\rho_1=0.9$ and $\rho_2=0.99$, are taken from [43, page 301]. After having updated the cache (i.e. $B_1$ and $B_2$) we compute a velocity with which the parameters of the network are then updated:

  • $W^{(t)}\gets -\eta{}B_1/\sqrt{B_2 + \delta},$
  • $Y^{(t+1)} \gets Y^{(t)} + W^{(t)},$

where the last addition has to be replaced with appropriate operations when dealing with manifolds. Further $\eta$ is the learning rate and $\delta$ is a small constant that is added for stability. The division, square root and addition in the computation of $W^{(t)}$ are performed element-wise.
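For weights on a vector space the complete Adam update can be sketched in a few lines of plain Julia; B₁, B₂ and the step count t together play the role of the cache (all names are illustrative):

W = randn(4, 4)                # a weight on a vector space
B₁, B₂ = zero(W), zero(W)      # first and second moments (the cache), initialized with zeros

for t in 1:10                  # a few steps with dummy gradients; here ρ₁ = 0.9, ρ₂ = 0.99
    ∇L = randn(4, 4)
    B₁ .= ((0.9 - 0.9^t) / (1 - 0.9^t)) .* B₁ .+ ((1 - 0.9) / (1 - 0.9^t)) .* ∇L
    B₂ .= ((0.99 - 0.99^t) / (1 - 0.99^t)) .* B₂ .+ ((1 - 0.99) / (1 - 0.99^t)) .* (∇L .* ∇L)
    Wt = -0.01 .* B₁ ./ sqrt.(B₂ .+ 1e-8)    # velocity, with η = 0.01 and δ = 1e-8
    W .+= Wt                                 # Y⁽ᵗ⁺¹⁾ ← Y⁽ᵗ⁾ + W⁽ᵗ⁾
end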

[Figure: schematic of the update that Adam performs when no weights are on manifolds; compare with the general optimization framework.]

We demonstrate the Adam cache on the same example from before:

const ρ₁ = 0.9
const ρ₂ = 0.99
const δ = 1e-8

method = AdamOptimizer(η, ρ₁, ρ₂, δ)
o = Optimizer(method, weight)

o.cache.Y
`AdamCache` that currently stores `B₁` as ...
5×5 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
 0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0
and `B₂` as ...
5×5 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
 0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0

Weights on Manifolds

The problem with generalizing Adam to manifolds is that the Hadamard product $\odot$ as well as the other element-wise operations ($/$, $\sqrt{}$ and $+$) lack a clear geometric interpretation. In GeometricMachineLearning we get around this issue by utilizing the global tangent space representation. A similar approach is shown in [8].
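To see the issue, consider only the skew-symmetric block $A$ of an element of $\mathfrak{g}^\mathrm{hor}$: linear combinations (as they appear in the moment updates) preserve skew-symmetry, but the Hadamard product does not. A small plain-Julia illustration (not library code):

using LinearAlgebra

A = rand(3, 3); A = A - A'        # a skew-symmetric matrix: A' == -A
B = rand(3, 3); B = B - B'        # another skew-symmetric matrix

is_skew(M) = M ≈ -M'

is_skew(0.9A + 0.1B)              # true:  linear combinations stay skew-symmetric
is_skew(A .* A)                   # false: [A ⊙ A]ᵢⱼ = Aᵢⱼ² is symmetric, not skew-symmetric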

The Adam Optimizer with Decay

The Adam optimizer with decay is similar to the standard Adam optimizer with the difference that the learning rate $\eta$ decays exponentially. We start with a relatively high learning rate $\eta_1$ (e.g. $10^{-2}$) and end with a low learning rate $\eta_2$ (e.g. $10^{-6}$). If we want to use this optimizer we have to tell it beforehand how many epochs we train for, so that it can adjust the learning rate decay accordingly:

const η₁ = 1e-2
const η₂ = 1e-6
const n_epochs = 1000

method = AdamOptimizerWithDecay(n_epochs, η₁, η₂, ρ₁, ρ₂, δ)
o = Optimizer(method, weight)

The cache is however exactly the same as for the Adam optimizer:

o.cache.Y
`AdamCache` that currently stores `B₁` as ...
5×5 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
 0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0
and `B₂` as ...
5×5 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
 0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0
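Using the constants defined above we can also write the decay schedule out explicitly; a small sketch (the helper lr_at is purely illustrative and follows the formula given in the AdamOptimizerWithDecay docstring below):

γ = exp(log(η₂ / η₁) / n_epochs)                 # decay factor such that η(n_epochs) = η₂
lr_at(t) = γ^t * η₁                              # learning rate in epoch t

lr_at(0), lr_at(n_epochs ÷ 2), lr_at(n_epochs)   # ≈ (1.0e-2, 1.0e-4, 1.0e-6)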

Library Functions

GeometricMachineLearning.GradientOptimizer (Type)
GradientOptimizer(η)

Make an instance of a gradient optimizer.

This is the simplest neural network optimizer. Its cache is trivial and it computes the final velocity as:

\[ \mathrm{velocity} \gets - \eta\nabla_\mathrm{weight}L.\]

Implementation

The operations are carried out as memory-efficiently as possible. This means the provided $\nabla_WL$ is mutated via:

rmul!(∇L, -method.η)
GeometricMachineLearning.MomentumOptimizer (Type)
MomentumOptimizer(η, α)

Make an instance of the momentum optimizer.

The momentum optimizer is similar to the GradientOptimizer. It however has a nontrivial cache that stores past history (see MomentumCache). The cache is updated via:

\[ B^{\mathrm{cache}} \gets \alpha{}B^{\mathrm{cache}} + \nabla_\mathrm{weights}L\]

and then the final velocity is computed as

\[ \mathrm{velocity} \gets - \eta{}B^{\mathrm{cache}}.\]

Implementation

To save memory the velocity is stored in the input $\nabla_WL$. This is similar to the case of the GradientOptimizer.

GeometricMachineLearning.AdamOptimizer (Type)
AdamOptimizer(η, ρ₁, ρ₂, δ)

Make an instance of the Adam Optimizer.

Here the cache consists of first and second moments that are updated as

\[B_1 \gets ((\rho_1 - \rho_1^t)/(1 - \rho_1^t))\cdot{}B_1 + (1 - \rho_1)/(1 - \rho_1^t)\cdot{}\nabla{}L,\]

and

\[B_2 \gets ((\rho_2 - \rho_2^t)/(1 - \rho_2^t))\cdot{}B_2 + (1 - \rho_2)/(1 - \rho_2^t)\cdot\nabla{}L\odot\nabla{}L.\]

The final velocity is computed as:

\[\mathrm{velocity} \gets -\eta{}B_1/\sqrt{B_2 + \delta}.\]

Implementation

The velocity is stored in the input to save memory:

mul!(B, -o.method.η, /ᵉˡᵉ(C.B₁, scalar_add(racᵉˡᵉ(C.B₂), o.method.δ)))

where B is the input to the update! function.

The algorithm and suggested defaults are taken from [43, page 301].

GeometricMachineLearning.AdamOptimizerWithDecay (Type)
AdamOptimizerWithDecay(n_epochs, η₁=1f-2, η₂=1f-6, ρ₁=9f-1, ρ₂=9.9f-1, δ=1f-8)

Make an instance of the Adam optimizer with learning rate decay.

All except the first argument (the number of epochs) have defaults.

The difference to the standard AdamOptimizer is that we change the learning rate $\eta$ in each step. Apart from the time dependence of $\eta$ the two algorithms are equivalent. The learning rate starts at a high value $\eta_1$ and then decreases exponentially until it reaches $\eta_2$, following

\[ \eta(t) = \gamma^t\eta_1,\]

where $\gamma = \exp(\log(\eta_2 / \eta_1) / \mathtt{n\_epochs}).$

GeometricMachineLearning.AdamCache (Type)
AdamCache(Y)

Store the first and second moment for Y (initialized as zeros).

First and second moments are called B₁ and B₂.

If AdamCache is called with an element of a homogeneous space, e.g. the StiefelManifold $St(n,N)$, it initializes the moments as elements of $\mathfrak{g}^\mathrm{hor}$ (StiefelLieAlgHorMatrix).

Examples

using GeometricMachineLearning

Y = rand(StiefelManifold, 5, 3)
AdamCache(Y).B₁

# output

5×5 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
 0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0
AbstractNeuralNetworks.update! (Method)
update!(o, cache, B)

Update the cache and output a final velocity that is stored in B.

Note that $B\in\mathfrak{g}^\mathrm{hor}$ in general.

In the manifold case the final velocity is the input to a retraction.


References

[43]
I. Goodfellow, Y. Bengio and A. Courville. Deep Learning (MIT Press, Cambridge, MA, 2016).
  • [1] In the case of the gradient optimizer this cache is trivial.
  • [2] In the manifold case the expression $-\eta\cdot\mathrm{gradient}$ is an element of the global tangent space $\mathfrak{g}^\mathrm{hor}$ and a retraction maps from $\mathfrak{g}^\mathrm{hor}$. We then still have to compose it with the updated global section $\Lambda^{(t)}$.