Standard Neural Network Optimizers
In this section we discuss optimization methods that are often used in training neural networks. The BFGS optimizer may also be viewed as a standard neural network optimizer but is treated in a separate section because of its complexity. From the perspective of manifolds, the optimizer methods outlined here operate on $\mathfrak{g}^\mathrm{hor}$ only. Each of them has a cache associated with it[1], and this cache is updated with the function update!. The precise role of this function is described below.
The Gradient Optimizer
The gradient optimizer is the simplest optimization algorithm used to train neural networks. It was already briefly discussed when we introduced Riemannian manifolds.
Its update rule is simply:
\[\mathrm{weight} \leftarrow \mathrm{weight} + (-\eta\cdot\mathrm{gradient}),\]
where addition has to be replaced with appropriate operations in the manifold case[2].
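To make this concrete, the following is a minimal sketch of the update in the vector space case (a standalone helper for illustration only, not the routine GeometricMachineLearning uses internally):
gradient_step(weight, gradient; η = 0.01) = weight .- η .* gradient   # hypothetical helper

W = zeros(4, 4)
gradient_step(W, ones(4, 4))   # every entry of the result is -0.01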
When calling GradientOptimizer we can specify a learning rate $\eta$ (or use the default).
const η = 0.01
method = GradientOptimizer(η)
GradientOptimizer{Float64}(0.01)
In order to use the optimizer we need an instance of Optimizer that is called with the method and the weights of the neural network:
weight = (A = zeros(4, 4), )
o = Optimizer(method, weight)
Optimizer{GradientOptimizer{Float64}, @NamedTuple{A::GradientCache{Float64}}, typeof(cayley)}(GradientOptimizer{Float64}(0.01), (A = GradientCache{Float64}(),), 0, GeometricMachineLearning.cayley)
If we operate on a derivative with update!, this will compute a final velocity that is then used to compute a retraction (or simply perform addition if we do not deal with a manifold):
dx = (A = one(weight.A), )
update!(o, o.cache, dx)
dx.A
4×4 Matrix{Float64}:
-0.01 -0.0 -0.0 -0.0
-0.0 -0.01 -0.0 -0.0
-0.0 -0.0 -0.01 -0.0
-0.0 -0.0 -0.0 -0.01
So what has happened here is that the gradient dx was simply multiplied by $-\eta$, as the cache of the gradient optimizer is trivial.
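We can verify this by continuing the example (a quick sanity check that reuses the variables defined above):
dx.A ≈ -η * one(weight.A)   # true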
The Momentum Optimizer
The momentum optimizer is similar to the gradient optimizer but further stores past information as first moments. We let these first moments decay with a decay parameter $\alpha$:
\[\mathrm{weights} \leftarrow \mathrm{weights} + (\alpha\cdot\mathrm{moment} - \eta\cdot\mathrm{gradient}),\]
where addition has to be replaced with appropriate operations in the manifold case.
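A minimal sketch of this update on plain arrays (a standalone helper for illustration only, not the routine GeometricMachineLearning uses internally) could look as follows:
function momentum_step!(weight, moment, gradient; η = 0.01, α = 0.5)
    moment .= α .* moment .- η .* gradient   # decayed moment plus the new gradient contribution
    weight .+= moment                        # replaced by a retraction in the manifold case
    return weight
end

W, M = zeros(4, 4), zeros(4, 4)
momentum_step!(W, M, ones(4, 4))   # the first step coincides with the gradient step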
In the case of the momentum optimizer the cache is non-trivial:
const α = 0.5
method = MomentumOptimizer(η, α)
o = Optimizer(method, weight)
o.cache.A # the cache is stored for each array in `weight` (which is a `NamedTuple`)
`MomentumCache` that currently stores `B` as ...
4×4 Matrix{Float64}:
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
But as the cache is initialized with zeros, the first iteration leads to the same result as for the gradient optimizer:
dx = (A = one(weight.A), )
update!(o, o.cache, dx)
dx.A
4×4 Matrix{Float64}:
-0.01 -0.0 -0.0 -0.0
-0.0 -0.01 -0.0 -0.0
-0.0 -0.0 -0.01 -0.0
-0.0 -0.0 -0.0 -0.01
The cache has changed however:
o.cache.A
`MomentumCache` that currently stores `B` as ...
4×4 Matrix{Float64}:
1.0 0.0 0.0 0.0
0.0 1.0 0.0 0.0
0.0 0.0 1.0 0.0
0.0 0.0 0.0 1.0
If we have weights on manifolds, calling Optimizer will automatically allocate the correct cache on $\mathfrak{g}^\mathrm{hor}$:
weight = (Y = rand(StiefelManifold, 5, 3), )
Optimizer(method, weight).cache.Y
`MomentumCache` that currently stores `B` as ...
5×5 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
0.0 -0.0 -0.0 -0.0 -0.0
0.0 0.0 -0.0 -0.0 -0.0
0.0 0.0 0.0 -0.0 -0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
So if the weight is $Y\in{}St(n,N)$, the corresponding cache is initialized as the zero element of $\mathfrak{g}^\mathrm{hor}\subset\mathbb{R}^{N\times{}N}$, as this is the global tangent space representation corresponding to the StiefelManifold.
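We can check this by continuing the example (the moment is stored in the field B, as documented for MomentumCache below):
B = Optimizer(method, weight).cache.Y.B
iszero(B), size(B)   # (true, (5, 5))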
The Adam Optimizer
The Adam optimizer is one of the most widely used neural network optimizers. The cache of the Adam optimizer consists of first and second moments. The first moments $B_1$, similar to the momentum optimizer, store linear information about the current and previous gradients, and the second moments $B_2$ store quadratic information about current and previous gradients. These second moments can be interpreted as approximating the curvature of the optimization landscape.
If all the weights are on a vector space, then we directly compute updates for $B_1$ and $B_2$:
- $B_1 \gets ((\rho_1 - \rho_1^t)/(1 - \rho_1^t))\cdot{}B_1 + (1 - \rho_1)/(1 - \rho_1^t)\cdot{}\nabla{}L,$
- $B_2 \gets ((\rho_2 - \rho_2^t)/(1 - \rho_2^t))\cdot{}B_2 + (1 - \rho_2)/(1 - \rho_2^t)\cdot\nabla{}L\odot\nabla{}L,$
where $\odot:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}^n$ is the Hadamard product: $[a\odot{}b]_i = a_ib_i.$ $\rho_1$ and $\rho_2$ are hyperparameters. Their defaults, $\rho_1=0.9$ and $\rho_2=0.99$, are taken from [43, page 301]. After having updated the cache (i.e. $B_1$ and $B_2$) we compute a velocity with which the parameters of the network are then updated:
- $W^{(t)}\gets -\eta{}B_1/\sqrt{B_2 + \delta},$
- $Y^{(t+1)} \gets Y^{(t)} + W^{(t)},$
where the last addition has to be replaced with appropriate operations when dealing with manifolds. Further $\eta$ is the learning rate and $\delta$ is a small constant that is added for stability. The division, square root and addition in the computation of $W^{(t)}$ are performed element-wise.
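For the vector space case these formulas can be condensed into a minimal sketch (a standalone helper for illustration only, not the routine GeometricMachineLearning uses internally):
function adam_step!(W, B₁, B₂, gradient, t; η = 0.01, ρ₁ = 0.9, ρ₂ = 0.99, δ = 1e-8)
    B₁ .= ((ρ₁ - ρ₁^t) / (1 - ρ₁^t)) .* B₁ .+ ((1 - ρ₁) / (1 - ρ₁^t)) .* gradient
    B₂ .= ((ρ₂ - ρ₂^t) / (1 - ρ₂^t)) .* B₂ .+ ((1 - ρ₂) / (1 - ρ₂^t)) .* (gradient .* gradient)
    W .-= η .* B₁ ./ sqrt.(B₂ .+ δ)          # element-wise division, square root and addition
    return W
end

W, B₁, B₂ = zeros(4, 4), zeros(4, 4), zeros(4, 4)
adam_step!(W, B₁, B₂, ones(4, 4), 1)   # in the first step B₁ = ∇L and B₂ = ∇L ⊙ ∇L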
[Figure: schematic of the update that Adam performs when no weights are on manifolds; compare with the general optimization framework.]
We demonstrate the Adam cache on the same example from before:
const ρ₁ = 0.9
const ρ₂ = 0.99
const δ = 1e-8
method = AdamOptimizer(η, ρ₁, ρ₂, δ)
o = Optimizer(method, weight)
o.cache.Y
`AdamCache` that currently stores `B₁` as ...
5×5 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
0.0 -0.0 -0.0 -0.0 -0.0
0.0 0.0 -0.0 -0.0 -0.0
0.0 0.0 0.0 -0.0 -0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
and `B₂` as ...
5×5 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
0.0 -0.0 -0.0 -0.0 -0.0
0.0 0.0 -0.0 -0.0 -0.0
0.0 0.0 0.0 -0.0 -0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
Weights on Manifolds
The problem with generalizing Adam to manifolds is that the Hadamard product $\odot$, as well as the other element-wise operations ($/$, $\sqrt{}$ and $+$), lacks a clear geometric interpretation. In GeometricMachineLearning we get around this issue by utilizing the global tangent space representation. A similar approach is shown in [8].
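Continuing the example from above, both moments of the Adam cache are stored on $\mathfrak{g}^\mathrm{hor}$ as structured $N\times{}N$ matrices, so the element-wise operations act on this global representation rather than on $St(n,N)$ itself:
typeof(o.cache.Y.B₁)   # a StiefelLieAlgHorMatrix, i.e. an N×N representation
size(o.cache.Y.B₁)     # (5, 5)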
The Adam Optimizer with Decay
The Adam optimizer with decay is similar to the standard Adam optimizer, with the difference that the learning rate $\eta$ decays exponentially. We start with a relatively high learning rate $\eta_1$ (e.g. $10^{-2}$) and end with a low learning rate $\eta_2$ (e.g. $10^{-6}$). If we want to use this optimizer, we have to tell it beforehand how many epochs we train for, so that it can adjust the learning rate decay accordingly:
const η₁ = 1e-2
const η₂ = 1e-6
const n_epochs = 1000
method = AdamOptimizerWithDecay(n_epochs, η₁, η₂, ρ₁, ρ₂, δ)
o = Optimizer(method, weight)
The cache is however exactly the same as for the Adam optimizer:
o.cache.Y
`AdamCache` that currently stores `B₁` as ...
5×5 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
0.0 -0.0 -0.0 -0.0 -0.0
0.0 0.0 -0.0 -0.0 -0.0
0.0 0.0 0.0 -0.0 -0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
and `B₂` as ...
5×5 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
0.0 -0.0 -0.0 -0.0 -0.0
0.0 0.0 -0.0 -0.0 -0.0
0.0 0.0 0.0 -0.0 -0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
Library Functions
GeometricMachineLearning.OptimizerMethod — Type
OptimizerMethod
Each Optimizer has to be called with an OptimizerMethod. This specifies how the neural network weights are updated in each optimization step.
GeometricMachineLearning.GradientOptimizer — Type
GradientOptimizer(η)
Make an instance of a gradient optimizer.
This is the simplest neural network optimizer. Its cache is trivial (see GradientCache) and it computes the final velocity as:
\[ \mathrm{velocity} \gets - \eta\nabla_\mathrm{weight}L.\]
Implementation
The operations are done as memory-efficiently as possible. This means the provided $\nabla_WL$ is mutated via:
rmul!(∇L, -method.η)
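For an ordinary array this in-place scaling can be reproduced directly (a minimal illustration using rmul! from LinearAlgebra):
using LinearAlgebra: rmul!

∇L = ones(2, 2)
rmul!(∇L, -0.01)   # ∇L now holds the velocity: every entry is -0.01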
GeometricMachineLearning.MomentumOptimizer — Type
MomentumOptimizer(η, α)
Make an instance of the momentum optimizer.
The momentum optimizer is similar to the GradientOptimizer. It however has a nontrivial cache that stores past history (see MomentumCache). The cache is updated via:
\[ B^{\mathrm{cache}} \gets \alpha{}B^{\mathrm{cache}} + \nabla_\mathrm{weights}L\]
and then the final velocity is computed as
\[ \mathrm{velocity} \gets - \eta{}B^{\mathrm{cache}}.\]
Implementation
To save memory the velocity is stored in the input $\nabla_WL$. This is similar to the case of the GradientOptimizer.
GeometricMachineLearning.AdamOptimizer — Type
AdamOptimizer(η, ρ₁, ρ₂, δ)
Make an instance of the Adam Optimizer.
Here the cache consists of first and second moments that are updated as
\[B_1 \gets ((\rho_1 - \rho_1^t)/(1 - \rho_1^t))\cdot{}B_1 + (1 - \rho_1)/(1 - \rho_1^t)\cdot{}\nabla{}L,\]
and
\[B_2 \gets ((\rho_2 - \rho_2^t)/(1 - \rho_2^t))\cdot{}B_2 + (1 - \rho_2)/(1 - \rho_2^t)\cdot\nabla{}L\odot\nabla{}L.\]
The final velocity is computed as:
\[\mathrm{velocity} \gets -\eta{}B_1/\sqrt{B_2 + \delta}.\]
Implementation
The velocity is stored in the input to save memory:
mul!(B, -o.method.η, /ᵉˡᵉ(C.B₁, scalar_add(racᵉˡᵉ(C.B₂), o.method.δ)))
where B is the input to the update! function.
The algorithm and suggested defaults are taken from [43, page 301].
GeometricMachineLearning.AdamOptimizerWithDecay — Type
AdamOptimizerWithDecay(n_epochs, η₁=1f-2, η₂=1f-6, ρ₁=9f-1, ρ₂=9.9f-1, δ=1f-8)
Make an instance of the Adam Optimizer with learning rate decay.
All except the first argument (the number of epochs) have defaults.
The difference to the standard AdamOptimizer is that the learning rate $\eta$ changes in each step. Apart from the time dependency of $\eta$, the two algorithms are equivalent. The learning rate starts at a high value $\eta(0) = \eta_1$ and then decreases exponentially until it reaches $\eta_2$, with
\[ \eta(t) = \gamma^{-t}\eta_1,\]
where $\gamma = \exp(\log(\eta_1 / \eta_2) / \mathtt{n\_epochs}),$ so that $\eta(0) = \eta_1$ and $\eta(\mathtt{n\_epochs}) = \eta_2.$
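A quick sketch of this schedule (a standalone computation, not the package's internal scheduler; the names are chosen so as not to clash with the constants defined above):
let lr_high = 1e-2, lr_low = 1e-6, n = 1000
    γ_decay = exp(log(lr_high / lr_low) / n)   # decay factor γ
    lr(t) = γ_decay^(-t) * lr_high             # η(t)
    lr(0), lr(n)                               # ≈ (0.01, 1.0e-6)
end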
GeometricMachineLearning.AbstractCache — Type
AbstractCache
AbstractCache has subtypes: AdamCache, MomentumCache, GradientCache and BFGSCache.
All of them can be initialized by providing an array (manifold types are also supported).
GeometricMachineLearning.GradientCache — Type
GradientCache(Y)
Do not store anything.
The cache for the GradientOptimizer does not consider past information.
GeometricMachineLearning.MomentumCache — Type
MomentumCache(Y)
Store the moment for Y (initialized as zeros). The moment is called B.
If the cache is called with an instance of a Manifold it initializes the moment as an element of $\mathfrak{g}^\mathrm{hor}$ (AbstractLieAlgHorMatrix). See AdamCache.
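Analogous to the AdamCache example below, the cache can also be constructed directly (assuming the constructor is accessible in the same way as AdamCache):
using GeometricMachineLearning

Y = rand(StiefelManifold, 5, 3)
MomentumCache(Y).B   # the zero element of g^hor, stored as a 5×5 StiefelLieAlgHorMatrix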
GeometricMachineLearning.AdamCache — Type
AdamCache(Y)
Store the first and second moment for Y (initialized as zeros). The first and second moments are called B₁ and B₂.
If the cache is called with an instance of a homogeneous space, e.g. the StiefelManifold $St(n,N)$, it initializes the moments as elements of $\mathfrak{g}^\mathrm{hor}$ (StiefelLieAlgHorMatrix).
Examples
using GeometricMachineLearning
Y = rand(StiefelManifold, 5, 3)
AdamCache(Y).B₁
# output
5×5 StiefelLieAlgHorMatrix{Float64, SkewSymMatrix{Float64, Vector{Float64}}, Matrix{Float64}}:
0.0 -0.0 -0.0 -0.0 -0.0
0.0 0.0 -0.0 -0.0 -0.0
0.0 0.0 0.0 -0.0 -0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
AbstractNeuralNetworks.update! — Method
update!(o, cache, B)
Update the cache and output a final velocity that is stored in B.
Note that $B\in\mathfrak{g}^\mathrm{hor}$ in general.
In the manifold case the final velocity is the input to a retraction.
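A minimal vector space usage, mirroring the example in the main text above (dx.A is overwritten with the velocity):
using GeometricMachineLearning

o = Optimizer(GradientOptimizer(0.01), (A = zeros(4, 4), ))
dx = (A = ones(4, 4), )
update!(o, o.cache, dx)
dx.A   # every entry is now -0.01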
References
- [43] I. Goodfellow, Y. Bengio and A. Courville. Deep Learning (MIT Press, Cambridge, MA, 2016).
- [1] In the case of the gradient optimizer this cache is trivial.
- [2] In the manifold case the expression $-\eta\cdot\mathrm{gradient}$ is an element of the global tangent space $\mathfrak{g}^\mathrm{hor}$ and serves as the input to a retraction. We then still have to compose the result with the updated global section $\Lambda^{(t)}$.