Standard Neural Network Optimizers

In this section we discuss optimization methods that are often used in training neural networks. The BFGS optimizer may also be viewed as a standard neural network optimizer but is treated in a separate section because of its complexity. From the manifold perspective, the optimizer methods outlined here operate on $\mathfrak{g}^\mathrm{hor}$ only. Each of them has an associated cache[1], and this cache is updated with the function update!. The precise role of this function is described below.

The Gradient Optimizer

The gradient optimizer is the simplest optimization algorithm used to train neural networks. It was already briefly discussed when we introduced Riemannian manifolds.

It simply does:

\[\mathrm{weight} \leftarrow \mathrm{weight} + (-\eta\cdot\mathrm{gradient}),\]

where addition has to be replaced with appropriate operations in the manifold case[2].
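
In plain Julia, and without any of the package machinery, a single such step on a vector space is just (a minimal sketch, not part of GeometricMachineLearning's API):

W  = randn(10, 10)   # weights on a vector space
dW = randn(10, 10)   # gradient of the loss with respect to W

W .+= -0.01 .* dW    # weight ← weight + (-η ⋅ gradient) with η = 0.01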

When calling GradientOptimizer we can specify a learning rate $\eta$ (or use the default).

using GeometricMachineLearning

const η = 0.01
method = GradientOptimizer(η)
GradientOptimizer{Float64}(0.01)

In order to use the optimizer we need an instance of Optimizer that is called with the method and the weights of the neural network:

weight = (A = zeros(10, 10), )
o = Optimizer(method, weight)
Optimizer{GradientOptimizer{Float64}, @NamedTuple{A::GradientCache{Float64}}, typeof(cayley)}(GradientOptimizer{Float64}(0.01), (A = GradientCache{Float64}(),), 0, GeometricMachineLearning.cayley)

If we call update! on a derivative, it computes a final velocity that is then used in a retraction (or simply added to the weights if we are not dealing with a manifold):

dx = (A = one(weight.A), )
update!(o, o.cache, dx)

dx.A
10×10 Matrix{Float64}:
 -0.01  -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.0
 -0.0   -0.01  -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.0
 -0.0   -0.0   -0.01  -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.0
 -0.0   -0.0   -0.0   -0.01  -0.0   -0.0   -0.0   -0.0   -0.0   -0.0
 -0.0   -0.0   -0.0   -0.0   -0.01  -0.0   -0.0   -0.0   -0.0   -0.0
 -0.0   -0.0   -0.0   -0.0   -0.0   -0.01  -0.0   -0.0   -0.0   -0.0
 -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.01  -0.0   -0.0   -0.0
 -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.01  -0.0   -0.0
 -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.01  -0.0
 -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.0   -0.01

So what has happened here is that the gradient dx was simply multiplied by $-\eta$, as the cache of the gradient optimizer is trivial.
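
As a quick check, using only the objects defined above, we can verify this relation directly:

dx.A ≈ -η * one(weight.A)  # should be true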

The Momentum Optimizer

The momentum optimizer is similar to the gradient optimizer but further stores past information as first moments. We let these first moments decay with a decay parameter $\alpha$:

\[\mathrm{weights} \leftarrow \mathrm{weights} + (\alpha\cdot\mathrm{moment} - \eta\cdot\mathrm{gradient}),\]

where addition has to be replaced with appropriate operations in the manifold case.
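
As a minimal vector-space sketch (plain arrays, not the package API), the same update can be written in the cache form used by the package, where the moment above corresponds to $-\eta{}B$:

W  = randn(10, 10)   # weights on a vector space
B  = zero(W)         # first moment (the cache), initialized with zeros
dW = randn(10, 10)   # gradient of the loss with respect to W

B .= 0.5 .* B .+ dW  # cache update B ← αB + ∇L with α = 0.5
W .+= -0.01 .* B     # velocity -ηB with η = 0.01, added to the weights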

In the case of the momentum optimizer the cache is non-trivial:

const α = 0.5
method = MomentumOptimizer(η, α)
o = Optimizer(method, weight)

o.cache.A # the cache is stored for each array in `weight` (which is a `NamedTuple`)
`MomentumCache` that currently stores `B` as ...
10×10 Matrix{Float64}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

But as the cache is initialized with zeros it will lead to the same result as the gradient optimizer in the first iteration:

dx = (A = one(weight.A), )

update!(o, o.cache, dx)

dx
(A = [-0.01 -0.0 … -0.0 -0.0; -0.0 -0.01 … -0.0 -0.0; … ; -0.0 -0.0 … -0.01 -0.0; -0.0 -0.0 … -0.0 -0.01],)

The cache has changed however:

o.cache
(A = MomentumCache{Float64, Matrix{Float64}}([1.0 0.0 … 0.0 0.0; 0.0 1.0 … 0.0 0.0; … ; 0.0 0.0 … 1.0 0.0; 0.0 0.0 … 0.0 1.0]),)
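
If we call update! a second time with the same gradient, the accumulated moment now enters the velocity, so the result differs from a plain gradient step (from the formula above we expect $\alpha\cdot(-0.01) - \eta\cdot{}1 = -0.015$ on the diagonal):

dx = (A = one(weight.A), )
update!(o, o.cache, dx)

dx.A[1, 1]  # ≈ -0.015 (expected from the formula above)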

If we have weights on manifolds calling Optimizer will automatically allocate the correct cache on $\mathfrak{g}^\mathrm{hor}$:

weight = (Y = rand(StiefelManifold, 10, 5), )

Optimizer(method, weight).cache.Y
`MomentumCache` that currently stores `B` as ...
10×10 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
 0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0

The Adam Optimizer

The Adam optimizer is one of the most widely used neural network optimizers. The cache of the Adam optimizer consists of first and second moments. The first moments $B_1$, similar to the momentum optimizer, store linear information about the current and previous gradients, and the second moments $B_2$ store quadratic information about the current and previous gradients (all computed from a first-order gradient).

If all the weights are on a vector space, then we directly compute updates for $B_1$ and $B_2$:

  1. $B_1 \gets ((\rho_1 - \rho_1^t)/(1 - \rho_1^t))\cdot{}B_1 + (1 - \rho_1)/(1 - \rho_1^t)\cdot{}\nabla{}L,$
  2. $B_2 \gets ((\rho_2 - \rho_2^t)/(1 - \rho_2^t))\cdot{}B_2 + (1 - \rho_2)/(1 - \rho_2^t)\cdot\nabla{}L\odot\nabla{}L,$

where $\odot:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}^n$ is the Hadamard product: $[a\odot{}b]_i = a_ib_i.$ $\rho_1$ and $\rho_2$ are hyperparameters. Their defaults, $\rho_1=0.9$ and $\rho_2=0.99$, are taken from [20, page 301]. After having updated the cache (i.e. $B_1$ and $B_2$) we compute a velocity with which the parameters of the network are then updated:

  • $W^{(t)}\gets -\eta{}B_1/\sqrt{B_2 + \delta},$
  • $Y^{(t+1)} \gets Y^{(t)} + W^{(t)},$

where the last addition has to be replaced with appropriate operations when dealing with manifolds. Further $\eta$ is the learning rate and $\delta$ is a small constant that is added for stability. The division, square root and addition in the computation of $W^{(t)}$ are performed element-wise.
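
The following is a plain-array sketch of these update steps for a single time step (not the package implementation, which is invoked through update! as before):

t  = 1                                  # current time step
W  = randn(10, 10)                      # weights on a vector space
B1 = zero(W); B2 = zero(W)              # first and second moments
dW = randn(10, 10)                      # gradient of the loss with respect to W

B1 .= ((0.9 - 0.9^t) / (1 - 0.9^t)) .* B1 .+ ((1 - 0.9) / (1 - 0.9^t)) .* dW
B2 .= ((0.99 - 0.99^t) / (1 - 0.99^t)) .* B2 .+ ((1 - 0.99) / (1 - 0.99^t)) .* dW .* dW
W .+= -0.01 .* B1 ./ sqrt.(B2 .+ 1e-8)  # velocity with η = 0.01 and δ = 1e-8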

(Figure: a schematic of the update Adam performs when none of the weights are on manifolds; also compare it with the general optimization framework.)

We demonstrate the Adam cache on the same example from before:

const ρ₁ = 0.9
const ρ₂ = 0.99
const δ = 1e-8

method = AdamOptimizer(η, ρ₁, ρ₂, δ)
o = Optimizer(method, weight)

o.cache.Y
`AdamCache` that currently stores `B₁` as ...
10×10 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
 0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
and `B₂` as ...
10×10 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
 0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0

Weights on manifolds

The problem with generalizing Adam to manifolds is that the Hadamard product $\odot$, as well as the other element-wise operations ($/$, $\sqrt{}$ and $+$) in the computation of the velocity $W^{(t)}$ above, lack a clear geometric interpretation. In GeometricMachineLearning we get around this issue by utilizing a so-called global tangent space representation. A similar approach is shown in [21].

Remark

The optimization framework presented here manages to generalize the Adam optimizer to manifolds without knowing an underlying differential equation. From a mathematical perspective this is not entirely satisfactory: ideally we would want the optimizers to emerge as discretizations of differential equations, as is the case for the gradient and the momentum optimizer, since this makes them easier to interpret.

The Adam Optimizer with Decay

The Adam optimizer with decay is similar to the standard Adam optimizer, with the difference that the learning rate $\eta$ decays exponentially. We start with a relatively high learning rate $\eta_1$ (e.g. $10^{-2}$) and end with a low learning rate $\eta_2$ (e.g. $10^{-8}$). If we want to use this optimizer we have to specify beforehand how many epochs we train for, so that the learning rate decay can be adjusted accordingly:

const η₁ = 1e-2
const η₂ = 1e-6
const n_epochs = 1000

method = AdamOptimizerWithDecay(n_epochs, η₁, η₂, ρ₁, ρ₂, δ)
o = Optimizer(method, weight)
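
To illustrate the decay schedule, here is a small sketch of the exponential interpolation between $\eta_1$ and $\eta_2$ described above (η_schedule is a hypothetical helper, not a function of the package):

γ = (η₂ / η₁)^(1 / n_epochs)           # decay factor, so that γ^n_epochs ⋅ η₁ = η₂
η_schedule(t) = γ^t * η₁               # learning rate after t epochs

(η_schedule(0), η_schedule(n_epochs))  # ≈ (1.0e-2, 1.0e-6)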

The cache is however exactly the same as for the Adam optimizer:

o.cache.Y
`AdamCache` that currently stores `B₁` as ...
10×10 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
 0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
and `B₂` as ...
10×10 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
 0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0  -0.0  -0.0  -0.0  -0.0  -0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
 0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0

Library Functions

GeometricMachineLearning.GradientOptimizerType
GradientOptimizer(η)

Make an instance of a gradient optimizer.

This is the simplest neural network optimizer. It has no cache and computes the final velocity as:

\[ \mathrm{velocity} \gets - \eta\nabla_\mathrm{weight}L.\]

Implementation

The operations are done as memory-efficiently as possible. This means the provided $\nabla_WL$ is mutated via:

    rmul!(∇L, -method.η)
source
GeometricMachineLearning.MomentumOptimizerType
MomentumOptimizer(η, α)

Make an instance of the momentum optimizer.

The momentum optimizer is similar to the GradientOptimizer. It however has a nontrivial cache that stores past history (see MomentumCache). The cache is updated via:

\[ B^{\mathrm{cache}} \gets \alpha{}B^{\mathrm{cache}} + \nabla_\mathrm{weights}L\]

and then the final velocity is computed as

\[ \mathrm{velocity} \gets - \eta{}B^{\mathrm{cache}}.\]

Or the Riemannian manifold equivalent of these operations, if applicable.

Implementation

To save memory the velocity is stored in the input $\nabla_WL$. This is similar to the case of the GradientOptimizer.

source
GeometricMachineLearning.AdamOptimizerType
AdamOptimizer(η, ρ₁, ρ₂, δ)

Make an instance of the Adam Optimizer.

Here the cache consists of first and second moments that are updated as

\[B_1 \gets ((\rho_1 - \rho_1^t)/(1 - \rho_1^t))\cdot{}B_1 + (1 - \rho_1)/(1 - \rho_1^t)\cdot{}\nabla{}L,\]

and

\[B_2 \gets ((\rho_2 - \rho_2^t)/(1 - \rho_2^t))\cdot{}B_2 + (1 - \rho_2)/(1 - \rho_2^t)\cdot\nabla{}L\odot\nabla{}L.\]

The final velocity is computed as:

\[\mathrm{velocity} \gets -\eta{}B_1/\sqrt{B_2 + \delta}.\]

Implementation

The velocity is stored in the input to save memory.

Algorithm and suggested defaults are taken from [20, page 301].

source
GeometricMachineLearning.AdamOptimizerWithDecayType
AdamOptimizerWithDecay(n_epochs, η₁=1f-2, η₂=1f-6, ρ₁=9f-1, ρ₂=9.9f-1, δ=1f-8)

Make an instance of the Adam optimizer with learning rate decay.

All except the first argument (the number of epochs) have defaults.

The difference to the standard AdamOptimizer is that we change the learning rate $\eta$ in each step. Apart from the time dependence of $\eta$ the two algorithms are equivalent. The learning rate starts at a high value $\eta_1$ and then decreases exponentially until it reaches $\eta_2$, following

\[ \eta(t) = \gamma^t\eta_1,\]

where $\gamma = \exp(\log(\eta_2 / \eta_1) / \mathtt{n\_epochs}),$ so that $\eta(0) = \eta_1$ and $\eta(\mathtt{n\_epochs}) = \eta_2.$

source
GeometricMachineLearning.init_optimizer_cacheFunction
init_optimizer_cache(method, x)

Initialize the optimizer cache based on input x for the given method.

source
init_optimizer_cache(method, x)

Initialize the cache corresponding to the weights x for a specific method.

Implementation

Wrapper for the functions setup_adam_cache, setup_momentum_cache, setup_gradient_cache, setup_bfgs_cache. These appear outside of optimizer_caches.jl because the OptimizerMethods first have to be defined.

source
AbstractNeuralNetworks.update!Function
update!(o, cache, dx::AbstractArray)

Update the cache based on the gradient information dx, compute the final velocity and store it in dx.

The optimizer o is needed because some updating schemes (such as AdamOptimizer) also need information on the current time step.

source
update!(o, cache, B)

First update the cache and then update the array B based on the optimizer o.

Note that $B\in\mathfrak{g}^\mathrm{hor}$ in general.

source
update!(o::Optimizer{<:BFGSOptimizer}, C, B)

Perform an update with the BFGS optimizer.

C is the cache, B contains the gradient information (the output of global_rep in general).

First we compute the final velocity with

    vecS = -o.method.η * C.H * vec(B)

and then we update H

    C.H .= (𝕀 - ρ * SY) * C.H * (𝕀 - ρ * SY') + ρ * vecS * vecS'

where SY is vecS * Y' and 𝕀 is the identity.

Implementation

For stability we use δ for computing ρ:

    ρ = 1. / (vecS' * Y + o.method.δ)

This is similar to the AdamOptimizer.

Extended help

If we have weights on a manifold then the updates are slightly more difficult. In this case the vec operation has to be generalized to the corresponding global tangent space.

source
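
The following is a plain-array sketch of the update described in this docstring, under the assumption that Y denotes the difference between the current and the previous (flattened) gradient, as in standard BFGS; it is not the package implementation:

using LinearAlgebra

m    = 16                          # length of the flattened gradient
H    = Matrix(1.0I, m, m)          # approximation of the inverse Hessian
B    = randn(m)                    # current (flattened) gradient information
Bold = randn(m)                    # (flattened) gradient from the previous step

vecS = -0.01 * H * B               # final velocity, with η = 0.01
Y    = B - Bold                    # assumed gradient difference
ρ    = 1.0 / (vecS' * Y + 1e-8)    # δ = 1e-8 added for stability
SY   = vecS * Y'
H   .= (I - ρ * SY) * H * (I - ρ * SY') + ρ * vecS * vecS'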

References

[20] I. Goodfellow, Y. Bengio and A. Courville. Deep Learning (MIT Press, Cambridge, MA, 2016).
[1] In the case of the gradient optimizer this cache is trivial.
[2] In the manifold case the expression $-\eta\cdot\mathrm{gradient}$ is an element of the global tangent space $\mathfrak{g}^\mathrm{hor}$ and a retraction maps it out of $\mathfrak{g}^\mathrm{hor}$; we then still have to compose it with the updated global section $\Lambda^{(t)}$.
  • 2In the manifold case the expression $-\eta\cdot\mathrm{gradient}$ is an element of the global tangent space $\mathfrak{g}^\mathrm{hor}$ and a retraction maps from $\mathfrak{g}^\mathrm{hor}$. We then still have to compose it with the updated global section $\Lambda^{(t)}$.