The BFGS Optimizer

The presentation shown here is largely taken from [44, chapters 3 and 6], with a derivation based on an online comment [45]. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a second-order optimizer that can also be used to train a neural network.

It is a quasi-Newton method and is therefore especially suited for convex problems. As is the case with any other (quasi-)Newton method[1], the BFGS algorithm approximates the objective with a quadratic function in each optimization step:

\[m^{(k)}(x) = L(x^{(k)}) + (\nabla_{x^{(k)}}L)^T(x - x^{(k)}) + \frac{1}{2}(x - x^{(k)})^TR^{(k)}(x - x^{(k)}),\]

where $R^{(k)}$ is referred to as the approximate Hessian. We further require $R^{(k)}$ to be symmetric and positive definite. Differentiating the above expression and setting the derivative to zero gives us:

\[\nabla_xm^{(k)} = \nabla_{x^{(k)}}L + R^{(k)}(x - x^{(k)}) = 0,\]

or written differently:

\[x - x^{(k)} = -(R^{(k)})^{-1}\nabla_{x^{(k)}}L.\]

From now on we call this quantity $p^{(k)} := -H^{(k)}\nabla_{x^{(k)}}L$, with $H^{(k)} := (R^{(k)})^{-1}$, and refer to it as the search direction. The new iterate then is:

\[x^{(k+1)} = x^{(k)} + \eta^{(k)}p^{(k)},\]

where $\eta^{(k)}$ is the step length. Techniques that describe how to pick an appropriate $\eta^{(k)}$ are called line-search methods and are discussed below. First we discuss what requirements we impose on $R^{(k)}$. A first reasonable condition would be to require the gradient of $m^{(k)}$ to be equal to that of $L$ at the points $x^{(k-1)}$ and $x^{(k)}$:

\[\begin{aligned} \nabla_{x^{(k)}}m^{(k)} & = \nabla_{x^{(k)}}L + R^{(k)}(x^{(k)} - x^{(k)}) & \overset{!}{=} & \nabla_{x^{(k)}}L & \text{ and } \\ \nabla_{x^{(k-1)}}m^{(k)} & = \nabla_{x^{(k)}}L + R^{(k)}(x^{(k-1)} - x^{(k)}) & \overset{!}{=} & \nabla_{x^{(k-1)}}L. & \end{aligned}\]

The first one of these conditions is automatically satisfied. The second one can be rewritten as:

\[x^{(k)} - x^{(k-1)} \overset{!}{=} H^{(k)}(\nabla_{x^{(k)}}L - \nabla_{x^{(k-1)}}L). \]

The following notations are often used:

\[s^{(k-1)} := \eta^{(k-1)}p^{(k-1)} = x^{(k)} - x^{(k-1)} \quad\text{ and }\quad y^{(k-1)} := \nabla_{x^{(k)}}L - \nabla_{x^{(k-1)}}L.\]

The condition mentioned above then becomes:

\[s^{(k-1)} \overset{!}{=} H^{(k)}y^{(k-1)},\]

and we call it the secant equation.
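For a quadratic loss the secant equation is satisfied exactly by the true inverse Hessian. The following small check (plain Julia, for illustration only; the names are not part of GeometricMachineLearning) makes this concrete:

using LinearAlgebra

A = [3.0 1.0; 1.0 2.0]            # Hessian of the toy loss L(x) = ½xᵀAx
∇L(x) = A * x

x_prev = [1.0, -2.0]              # x⁽ᵏ⁻¹⁾
x_curr = [0.5,  1.0]              # x⁽ᵏ⁾

s = x_curr - x_prev               # s⁽ᵏ⁻¹⁾ = x⁽ᵏ⁾ - x⁽ᵏ⁻¹⁾
y = ∇L(x_curr) - ∇L(x_prev)       # y⁽ᵏ⁻¹⁾ = ∇L(x⁽ᵏ⁾) - ∇L(x⁽ᵏ⁻¹⁾) = A s⁽ᵏ⁻¹⁾

H = inv(A)                        # the exact inverse Hessian
H * y ≈ s                         # the secant equation s⁽ᵏ⁻¹⁾ = H y⁽ᵏ⁻¹⁾ holds exactly: true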

In order to pick the ideal $H^{(k)}$ we solve the following problem:

\[\begin{aligned} & \min_H & & ||H - H^{(k-1)}||_W \\ & \text{s.t.} & & H = H^T\quad \\ & \text{and} & & s^{(k-1)} = Hy^{(k-1)}, \end{aligned}\]

where the first condition is symmetry and the second one is the secant equation. For the norm $||\cdot||_W$ we pick the weighted Frobenius norm:

\[||A||_W := ||W^{1/2}AW^{1/2}||_F,\]

where $||\cdot||_F$ is the usual Frobenius norm[2] and the matrix $W=\tilde{R}^{(k-1)}$ is the average Hessian:

\[\tilde{R}^{(k-1)} = \int_0^1 \nabla^2f(x^{(k-1)} + \tau\eta^{(k-1)}p^{(k-1)})d\tau.\]
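The weighted Frobenius norm is easy to evaluate for a symmetric positive definite weight matrix $W$. The following is a minimal sketch in plain Julia (the function name weighted_frobenius is hypothetical, not library code):

using LinearAlgebra

# ||A||_W = ||W^{1/2} A W^{1/2}||_F; norm(::Matrix) is the Frobenius norm in Julia.
weighted_frobenius(A, W) = norm(sqrt(Symmetric(W)) * A * sqrt(Symmetric(W)))

A = [1.0 2.0; 3.0 4.0]
W = [2.0 1.0; 1.0 3.0]                                 # symmetric positive definite weight
weighted_frobenius(A, W)
weighted_frobenius(A, Matrix(1.0I, 2, 2)) ≈ norm(A)    # with W = 𝕀 it reduces to the plain Frobenius norm: true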

We now state the solution to this minimization problem:

Theorem

The solution of the minimization problem is:

\[\begin{aligned} H^{(k)} = & \left(\mathbb{I} - \frac{1}{(s^{(k-1)})^Ty^{(k-1)}}s^{(k-1)}(y^{(k-1)})^T\right)H^{(k-1)}\left(\mathbb{I} - \frac{1}{(s^{(k-1)})^Ty^{(k-1)}}y^{(k-1)}(s^{(k-1)})^T\right) \\ & + \frac{1}{(s^{(k-1)})^Ty^{(k-1)}}s^{(k-1)}(s^{(k-1)})^T, \end{aligned}\]

with $y^{(k-1)} = \nabla_{x^{(k)}}L - \nabla_{x^{(k-1)}}L$ and $s^{(k-1)} = x^{(k)} - x^{(k-1)}$ as above.

Proof

In order to find the ideal $H^{(k)}$ under the conditions described above, we introduce some notation:

  • $\tilde{H}^{(k-1)} := W^{1/2}H^{(k-1)}W^{1/2}$,
  • $\tilde{H} := W^{1/2}HW^{1/2}$,
  • $\tilde{y}^{(k-1)} := W^{-1/2}y^{(k-1)}$,
  • $\tilde{s}^{(k-1)} := W^{1/2}s^{(k-1)}$.

With this notation we can rewrite the problem of finding $H^{(k)}$ as:

\[\begin{aligned} & \min_{\tilde{H}} & & ||\tilde{H} - \tilde{H}^{(k-1)}||_F \\ & \text{s.t.}\quad & & \tilde{H} = \tilde{H}^T\quad \\ & \text{and} & & \tilde{s}^{(k-1)} = \tilde{H}\tilde{y}^{(k-1)}. \end{aligned}\]

We further have $y^{(k-1)} = Ws^{(k-1)}$ and hence $\tilde{y}^{(k-1)} = \tilde{s}^{(k-1)}$ by a corollary of the mean value theorem: $\int_0^1 g'(\xi_1 + \tau(\xi_2 - \xi_1)) d\tau (\xi_2 - \xi_1) = g(\xi_2) - g(\xi_1)$ for a vector-valued function $g$.

Now we rewrite $H$ and $H^{(k-1)}$ in a new basis $U = [u|u_\perp]$, where $u := \tilde{y}^{(k-1)}/||\tilde{y}^{(k-1)}||$ and $u_\perp$ is an orthogonal complement of $u$ (i.e. we have $u^Tu_\perp=0$ and $u_\perp^Tu_\perp=\mathbb{I}$):

\[\begin{aligned} U^T\tilde{H}^{(k-1)}U - U^T\tilde{H}U = \begin{bmatrix} u^T \\ u_\perp^T \end{bmatrix}(\tilde{H}^{(k-1)} - \tilde{H})\begin{bmatrix} u & u_\perp \end{bmatrix} = \\ \begin{bmatrix} u^T\tilde{H}^{(k-1)}u - 1 & u^T\tilde{H}^{(k-1)}u_\perp \\ u_\perp^T\tilde{H}^{(k-1)}u & u_\perp^T(\tilde{H}^{(k-1)}-\tilde{H})u_\perp \end{bmatrix}. \end{aligned}\]

By a property of the Frobenius norm we can consider the blocks independently:

\[\begin{aligned} ||\tilde{H}^{(k-1)} - \tilde{H}||^2_F & = ||U^T(\tilde{H}^{(k-1)} - \tilde{H})U||^2_F \\ & = (u^T\tilde{H}^{(k-1)}u -1)^2 + ||u^T\tilde{H}^{(k-1)}u_\perp||_F^2 + ||u_\perp^T\tilde{H}^{(k-1)}u||_F^2 + ||u_\perp^T(\tilde{H}^{(k-1)} - \tilde{H})u_\perp||_F^2. \end{aligned}\]

We see that $\tilde{H}$ only appears in the last term, which should therefore be made zero, i.e. the projections of $\tilde{H}^{(k-1)}$ and $\tilde{H}$ onto the space spanned by $u_\perp$ should coincide. With the condition $\tilde{H}u \overset{!}{=} u$ we hence get:

\[\tilde{H} = U\begin{bmatrix} 1 & 0 \\ 0 & u^T_\perp\tilde{H}^{(k-1)}u_\perp \end{bmatrix}U^T = uu^T + (\mathbb{I}-uu^T)\tilde{H}^{(k-1)}(\mathbb{I}-uu^T).\]

If we now map back to the original coordinate system, the ideal solution for $H^{(k)}$ is:

\[\begin{aligned} H^{(k)} = & \left(\mathbb{I} - \frac{1}{(s^{(k-1)})^Ty^{(k-1)}}s^{(k-1)}(y^{(k-1)})^T\right)H^{(k-1)}\left(\mathbb{I} - \frac{1}{(s^{(k-1)})^Ty^{(k-1)}}y^{(k-1)}(s^{(k-1)})^T\right) \\ & + \frac{1}{(s^{(k-1)})^Ty^{(k-1)}}s^{(k-1)}(s^{(k-1)})^T, \end{aligned}\]

and the assertion is proved.

The cache and the parameters are updated with the following steps:

  1. Compute the gradient $\nabla_{x^{(k)}}L$,
  2. obtain the search direction $p^{(k)} \gets -H^{(k)}\nabla_{x^{(k)}}L$,
  3. compute $s^{(k)} = \eta^{(k)}p^{(k)}$,
  4. compute $y^{(k)} \gets \nabla_{x^{(k)}}L - \nabla_{x^{(k-1)}}L$,
  5. update $H^{(k + 1)} \gets (\mathbb{I} - \frac{1}{(s^{(k)})^Ty^{(k)}}s^{(k)}(y^{(k)})^T)H^{(k)}(\mathbb{I} - \frac{1}{(s^{(k)})^Ty^{(k)}}y^{(k)}(s^{(k)})^T) + \frac{1}{(s^{(k)})^Ty^{(k)}}s^{(k)}(s^{(k)})^T$,
  6. update $x^{(k + 1)} \gets x^{(k)} + s^{(k)}$.

The cache of the BFGS algorithm thus consists of the matrix $H^{(\cdot)}$ for each weight $x^{(\cdot)}$ in the neural network and the gradient for the previous time step $\nabla_{x^{(k-1)}}L$. $s^{(k)}$ here is again the velocity that we use to update the neural network weights.
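The steps above can be summarized in a short Euclidean sketch (plain Julia, for illustration only; a fixed step length $\eta$ stands in for a proper line search and bfgs_step is not a function of GeometricMachineLearning):

using LinearAlgebra

# One iteration of steps 1-6 above (Euclidean case).
function bfgs_step(∇L, x, H, grad_prev, η)
    g = ∇L(x)                                                   # 1. current gradient
    p = -H * g                                                  # 2. search direction
    s = η * p                                                   # 3. velocity s⁽ᵏ⁾
    y = g - grad_prev                                           # 4. gradient difference y⁽ᵏ⁾
    ρ = 1 / dot(s, y)                                           # (in practice a δ is added here for stability)
    H = (I - ρ * s * y') * H * (I - ρ * y * s') + ρ * s * s'    # 5. new approximate inverse Hessian
    x = x + s                                                   # 6. parameter update
    return x, H, g, s, y
end

# Toy quadratic loss L(x) = ½xᵀAx - bᵀx; the cache starts with H = 𝕀 and a zero gradient.
A = [3.0 1.0; 1.0 2.0]; b = [1.0, 0.0]
∇L(x) = A * x - b
x, H, grad_prev = [2.0, 2.0], Matrix(1.0I, 2, 2), zeros(2)

x, H, grad_prev, s, y = bfgs_step(∇L, x, H, grad_prev, 0.1)
H * y ≈ s               # the updated matrix satisfies the secant equation by construction: true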

The Riemannian Version of the BFGS Algorithm

Generalizing the BFGS algorithm to the setting of a Riemannian manifold is straightforward. All we have to do is replace the Euclidean gradients by Riemannian ones (composed with a lift via global_rep):

\[\nabla_{x^{(k)}}L \implies (\Lambda^{(k)})^{-1}(\Omega(x^{(k)}, \mathrm{grad}_{x^{(k)}}L))\Lambda^{(k)} = \mathtt{global\_rep}(\mathrm{grad}_{x^{(k)}}L),\]

and addition by a retraction:

\[ x^{(k+1)} \gets x^{(k)} + s^{(k)} \implies x^{(k+1)} \gets \mathrm{Retraction}(s^{(k)})x^{(k)}.\]
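The following is a minimal sketch of the second replacement in plain Julia, using the Cayley transform of a skew-symmetric matrix as a stand-in for the retraction (this is for illustration only and not GeometricMachineLearning's implementation):

using LinearAlgebra

# Instead of x + s we act on x with a retraction of a skew-symmetric (Lie-algebra) element.
cayley(S) = (I - S / 2) \ (I + S / 2)

N, n = 4, 2
x = Matrix(qr(randn(N, N)).Q)[:, 1:n]   # a point on the Stiefel manifold (xᵀx = 𝕀)
S = randn(N, N); S = (S - S') / 2       # a skew-symmetric element playing the role of s⁽ᵏ⁾

x_new = cayley(S) * x                   # x⁽ᵏ⁺¹⁾ ← Retraction(s⁽ᵏ⁾)x⁽ᵏ⁾
x_new' * x_new ≈ I                      # the update stays on the manifold: true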

The Hessian for the manifold BFGS algorithm is of size $\tilde{N}\times\tilde{N}$, where $\tilde{N} = \mathrm{dim}(\mathfrak{g}^\mathrm{hor})$. For the global tangent space belonging to the Stiefel manifold we have $\tilde{N} = (N - n)n + n(n - 1)/2$.

In order to multiply elements of the global tangent space (such as the stored gradient) with the matrix $H$ we use the vectorization operation vec:

A = SkewSymMatrix([1, 2, 3], 3)
B = [4 5 6; ]
B̄ = StiefelLieAlgHorMatrix(A, B, 4, 3)
B̄ |> vec

# output

vcat(3-element Vector{Int64}, 3-element Vector{Int64}):
 1
 2
 3
 4
 5
 6

The $H$ matrix in the cache is correspondingly initialized as:

BFGSCache(B̄)

# output

`BFGSCache` that currently stores `B` as ...
4×4 StiefelLieAlgHorMatrix{Int64, SkewSymMatrix{Int64, Vector{Int64}}, Matrix{Int64}}:
 0  0  0  0
 0  0  0  0
 0  0  0  0
 0  0  0  0
... and `H` as
6×6 Matrix{Int64}:
 1  0  0  0  0  0
 0  1  0  0  0  0
 0  0  1  0  0  0
 0  0  0  1  0  0
 0  0  0  0  1  0
 0  0  0  0  0  1

We see that $\bar{B}$ is of dimension $\tilde{N} = (N - n)n + n(n - 1)/2 = 3 + 3 = 6$ and $H$ is of dimension $\tilde{N}\times\tilde{N} = 6\times6.$
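The dimension count can be reproduced with a small helper (illustration only, not a library function):

dim_ghor(N, n) = (N - n) * n + n * (n - 1) ÷ 2   # Ñ = (N - n)n + n(n - 1)/2

dim_ghor(4, 3)   # 6, matching the example above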

The Curvature Condition and the Wolfe Conditions

In textbooks [44] an application of the BFGS algorithm typically also involves a line search subject to the Wolfe conditions. If these are satisfied, the curvature condition usually holds as well.

A condition that is closely related to the secant equation discussed before is that $R^{(k)}$ has to be positive definite along the direction $s^{(k-1)}$:

\[(s^{(k-1)})^Ty^{(k-1)} > 0.\]

This is referred to as the standard curvature condition. If we impose the Wolfe conditions, the standard curvature condition holds automatically. The Wolfe conditions are stated with respect to the parameter $\eta^{(k)}$.

Definition

The Wolfe conditions are:

\[\begin{aligned} L(x^{(k)}+\eta^{(k)}p^{(k)}) & \leq{}L(x^{(k)}) + c_1\eta^{(k)}(\nabla_{x^{(k)}}L)^Tp^{(k)} & \text{ for } & c_1\in(0,1) \quad\text{and} \\ (\nabla_{(x^{(k)} + \eta^{(k)}p^{(k)})}L)^Tp^{(k)} & \geq c_2(\nabla_{x^{(k)}}L)^Tp^{(k)} & \text{ for } & c_2\in(c_1,1). \end{aligned}\]

The two Wolfe conditions above are respectively called the sufficient decrease condition and the curvature condition.
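Checking the two conditions for a candidate step length is straightforward. Below is a small sketch in plain Julia (the helper satisfies_wolfe is hypothetical and not part of GeometricMachineLearning); the default values for $c_1$ and $c_2$ follow the common choice given below:

using LinearAlgebra

# Check the sufficient decrease and curvature conditions for a candidate step length η.
function satisfies_wolfe(L, ∇L, x, p, η; c₁ = 1e-4, c₂ = 0.9)
    g = ∇L(x)
    sufficient_decrease = L(x + η * p) ≤ L(x) + c₁ * η * dot(g, p)
    curvature           = dot(∇L(x + η * p), p) ≥ c₂ * dot(g, p)
    return sufficient_decrease && curvature
end

# Toy quadratic loss L(x) = ½xᵀAx with search direction p = -∇L(x).
A = [2.0 0.0; 0.0 4.0]
L(x)  = 0.5 * dot(x, A * x)
∇L(x) = A * x

x = [1.0, 1.0]
p = -∇L(x)
satisfies_wolfe(L, ∇L, x, p, 0.2)   # true for this step length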

A possible choice for $c_1$ and $c_2$ is $c_1 = 10^{-4}$ and $c_2 = 0.9$ [44]. We further have:

Theorem

The second Wolfe condition, also called the curvature condition, is stronger than the standard curvature condition under the assumption that the first Wolfe condition is true and $L(x^{(k+1)}) < L(x^{(k)})$.

Proof

We use the second Wolfe condition to write

\[(\nabla_{x^{(k)}}L)^Tp^{(k-1)} - c_2(\nabla_{x^{(k-1)}}L)^Tp^{(k-1)} = (y^{(k-1)})^Tp^{(k-1)} + (1 - c_2)(\nabla_{x^{(k-1)}}L)^Tp^{(k-1)} \geq 0,\]

and we can apply the first Wolfe condition on the second term in this expression:

\[(1 - c_2)(\nabla_{x^{(k-1)}}L)^Tp^{(k-1)}\geq\frac{1-c_2}{c_1\eta^{(k-1)}}(L(x^{(k)}) - L(x^{(k-1)})),\]

which is negative if the value of $L$ is decreasing.

It is noteworthy that line search has not been used much in deep learning in the past. This is however beginning to change [46, 47]. We also note that the BFGS optimizer combined with the global tangent space representation offers one way of performing second-order optimization on manifolds; it is however not the only way to do so [48, 49].

Stability of the Algorithm

Similar to the Adam optimizer, in practice we also add a $\delta$ term to two of the terms appearing in the update rule of the BFGS algorithm for stability.

Library Functions

GeometricMachineLearning.BFGSCacheType
BFGSCache(B)

Make the cache for the BFGS optimizer based on the array B.

It stores an array for the gradient of the previous time step (B) and the inverse of the Hessian matrix (H).

The cache for the inverse of the Hessian is initialized with the identity. The cache for the previous gradient information is initialized with the zero vector.

Note that the cache for H is changed iteratively, whereas the cache for B is newly assigned at every time step.

source
AbstractNeuralNetworks.update!Method
update!(o::Optimizer{<:BFGSOptimizer}, C, B)

Perform an update with the BFGS optimizer.

C is the cache, B contains the gradient information (the output of global_rep in general).

First we compute the final velocity with

vecS = -o.method.η * C.H * vec(B)

and then we update H

C.H .= (𝕀 - ρ * SY) * C.H * (𝕀 - ρ * SY') + ρ * vecS * vecS'

where SY is vecS * Y' and 𝕀 is the identity.

Implementation

For stability we use δ for computing ρ:

ρ = 1. / (vecS' * Y + o.method.δ)

This is similar to the AdamOptimizer.

Extended help

If we have weights on a manifold then the updates are slightly more difficult. In this case the vec operation has to be generalized to the corresponding global tangent space.

source
Base.vecMethod
vec(A::StiefelLieAlgHorMatrix)

Vectorize A.

Examples

using GeometricMachineLearning

A = SkewSymMatrix([1, ], 2)
B = [2 3; ]
B̄ = StiefelLieAlgHorMatrix(A, B, 3, 2)
B̄ |> vec

# output

vcat(1-element Vector{Int64}, 2-element Vector{Int64}):
 1
 2
 3

Implementation

This is using Vcat from the package LazyArrays.

source

References

[44]
J. Nocedal and S. J. Wright. Numerical Optimization (Springer Science+Business Media, New York, NY, 2006).
[45]
A.Γ. (math.stackexchange user 253273). Quasi-newton methods: Understanding DFP updating formula, https://math.stackexchange.com/q/2279304 (2017). Accessed on September 19, 2024.
[48]
W. Huang, P.-A. Absil and K. A. Gallivan. A Riemannian BFGS method for nonconvex optimization problems. In: Numerical Mathematics and Advanced Applications ENUMATH 2015 (Springer, 2016); pp. 627–634.
  • [1] Various Newton methods and quasi-Newton methods differ in how they model the approximate Hessian.
  • [2] The Frobenius norm is $||A||_F^2 = \sum_{i,j}a_{ij}^2$.