Example of a Neural Network with a Grassmann Layer

Here we show how to implement a neural network that contains a layer whose weight is an element of the Grassmann manifold, and we demonstrate a setting in which this is useful. Recall that the Grassmann manifold $Gr(n, N)$ is the set of $n$-dimensional vector subspaces of $\mathbb{R}^N$. If we optimize on the Grassmann manifold, we therefore search for an ideal $n$-dimensional subspace of the bigger space $\mathbb{R}^N$.
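Concretely (this is the standard characterization), every element $\mathcal{A}\in{}Gr(n, N)$ can be written as the span of a matrix with full column rank,

\[\mathcal{A} = \mathrm{span}(A), \quad A\in\mathbb{R}^{N\times{}n}, \quad \mathrm{rank}(A) = n,\]

and two such matrices represent the same element of $Gr(n, N)$ if and only if they differ by multiplication with an invertible $n\times{}n$ matrix from the right. This is also the representation used by the GrassmannLayer further below.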

So assume that we are given data on a nonlinear manifold:

\[\begin{pmatrix} x_1^{(1)} \\ x_2^{(1)} \\ \vdots \\ x_N^{(1)} \end{pmatrix}, \begin{pmatrix} x_1^{(2)} \\ x_2^{(2)} \\ \vdots \\ x_N^{(2)} \end{pmatrix}, \cdots, \begin{pmatrix} x_1^{(\mathtt{nop})} \\ x_2^{(\mathtt{nop})} \\ \vdots \\ x_N^{(\mathtt{nop})} \end{pmatrix} =: \mathcal{D} \subset \mathcal{M}\subset\mathbb{R}^N,\]

and $\mathrm{dim}(\mathcal{M}) = n.$ We want to obtain additional data on this manifold by sampling from an $n$-dimensional distribution. Here we do not want to identify an isomorphism $\mathbb{R}^n \overset{\approx}{\to} \mathcal{M}$ (as is the case with autoencoders for example), but find a mapping from a distribution on $\mathbb{R}^n$ to a distribution on $\mathcal{M}$. This is where the Grassmann manifold is useful: each element $V$ of the Grassmann manifold is an $n$-dimensional subspace of $\mathbb{R}^N$ from which we can easily sample. We can then construct a mapping from this space $V$ onto a space that contains the data points $\mathcal{D}$.
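Schematically, the neural network we construct below therefore has the form (the symbols $\iota_A$ and $\psi$ are introduced here only for illustration and do not appear in the code):

\[\mathcal{NN} = \psi\circ\iota_A, \quad \iota_A: \mathbb{R}^n\to{}V\subset\mathbb{R}^N, \quad z\mapsto{}Az,\]

where $A$ is a matrix representative of an element of $Gr(n, N)$ with $V = \mathrm{span}(A)$, and $\psi$ is a standard (dense) neural network that maps $V$ onto a set containing the data points $\mathcal{D}$.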

Remark

This example of learning weights on the Grassmann manifold also shows how GeometricMachineLearning can be used together with other packages: here we use, among other things, the Wasserstein distance from the package BrenierTwoFluid.

Graph of Rosenbrock Function as Example

Consider the following toy example: we want to sample from the graph of the (scaled) Rosenbrock function

\[f(x,y) = ((1 - x)^2 + 100(y - x^2)^2)/1000\]

without using the explicit form of the function during sampling. We show the graph of $f$ for $(x, y)\in[-1.5, 1.5]^2$ in the following picture:
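The code further below assumes that the function and the evaluation grid are available under the names rosenbrock, x and y; a minimal sketch of these definitions (matching the formula above and the grid $\mathtt{-1.5:0.1:1.5}$ used below) is:

# scaled Rosenbrock function f from above
rosenbrock(x::Vector) = ((1. - x[1])^2 + 100. * (x[2] - x[1]^2)^2) / 1000.

# evaluation grid for the graph of f
x = -1.5:0.1:1.5
y = -1.5:0.1:1.5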

We now build a neural network whose task it is to map a product of two Gaussians $\mathcal{N}(0,1)\times\mathcal{N}(0,1)$ onto the graph of the Rosenbrock function:

\[ \Psi: \mathcal{N}(0,1)\times\mathcal{N}(0,1) \to \mathcal{W}(\{(x, y, z): (x, y)\in[-1.5, 1.5]\times[-1.5, 1.5], z = f(x, y)\}) =: \tilde{\mathcal{W}},\]

where $\mathcal{W}$ is a distribution on the graph of $f.$ For computing the loss between the two distributions, i.e. between $\Psi(\mathcal{N}(0,1)\times\mathcal{N}(0,1))$ and $\tilde{\mathcal{W}},$ we use the Wasserstein distance[2]; for more details see [91, 92].
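For reference (this is the standard definition and not specific to any of the packages used here): for two probability measures $\mu$ and $\nu$ on $\mathbb{R}^N$ the squared Wasserstein-2 distance is

\[W_2(\mu, \nu)^2 = \inf_{\gamma\in\Gamma(\mu, \nu)}\int_{\mathbb{R}^N\times\mathbb{R}^N}|x - y|^2\,\mathrm{d}\gamma(x, y),\]

where $\Gamma(\mu, \nu)$ denotes the set of all couplings of $\mu$ and $\nu$, i.e. joint probability measures whose marginals are $\mu$ and $\nu$.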

Before we can use the Wasserstein distance to train the neural network, however, we first need to set the network up. It consists of three layers:

using GeometricMachineLearning, Zygote, BrenierTwoFluid
using LinearAlgebra: norm   # norm is used for the cost function below

model = Chain(GrassmannLayer(2,3), Dense(3, 8, tanh), Dense(8, 3, identity))

nn = NeuralNetwork(model, CPU(), Float64)
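As a quick check of the dimensions (a minimal sketch), the network maps a batch of two-dimensional samples to three-dimensional points:

nn(randn(2, 5))   # a 3×5 array, i.e. five points in ℝ³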

We then lift the neural network parameters via GlobalSection; the resulting object λY is needed for the optimization step further below:

λY = GlobalSection(params(nn))

As the cost function $c$ for the Wasserstein loss[3] we simply use:

# cost function associated with the Wasserstein distance (see footnote [3])
c = (x,y) -> .5 * norm(x - y)^2
# gradient of the cost function with respect to its first argument
∇c = (x,y) -> x - y

We define a function compute_wasserstein_gradient that returns the value of the Wasserstein distance together with its gradient with respect to the first of the two point clouds it is supplied with (this is further explained below):

function compute_wasserstein_gradient(ensemble1::AT, ensemble2::AT) where AT<:AbstractArray
    number_of_particles1 = size(ensemble1, 2)
    number_of_particles2 = size(ensemble2, 2)
    # interpret each point cloud as an empirical measure with uniform weights;
    # the transposes convert from one point per column to one point per row
    V = SinkhornVariable(copy(ensemble1'), ones(number_of_particles1) / number_of_particles1)
    W = SinkhornVariable(copy(ensemble2'), ones(number_of_particles2) / number_of_particles2)
    # sinkhorn_params is a SinkhornParameters object from BrenierTwoFluid that has to be set up
    # beforehand (see the BrenierTwoFluid documentation); we give it a name that does not shadow
    # the params function used elsewhere in this example
    S = SinkhornDivergence(V, W, c, sinkhorn_params; islog = true)
    initialize_potentials!(S)
    compute!(S)
    # return the value of the divergence and its gradient with respect to the first point cloud
    value(S), x_gradient!(S, ∇c)'
end
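As a quick sanity check (a sketch; it assumes the Sinkhorn parameters from the comment above have been set up): the Sinkhorn divergence of a point cloud with itself vanishes, so both the returned value and the gradient should be close to zero:

test_cloud = randn(3, 10)
val, grad = compute_wasserstein_gradient(test_cloud, copy(test_cloud))
# val ≈ 0 and grad ≈ 0 (up to the solver tolerance)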

This function associates particles in two point clouds with each other. As an illustrative example we will compare the following two point clouds:

\[ \mathcal{D}_1 = \{(x, y, z): (x, y) \in \mathtt{-1.5:0.1:1.5}^2, z = f(x, y)\}\]

and

\[ \mathcal{D}_2 = \left\{ \begin{pmatrix} 2 \\ 2 \\ 2 \end{pmatrix} + x: x \sim \mathtt{rand} \right\},\]

where $x \sim \mathtt{rand}$ means that the components of $x$ are drawn uniformly from $[0, 1)$ with the function rand. In code the two sets are:

# x and y are the ranges -1.5:0.1:1.5 and rosenbrock is the scaled Rosenbrock function f (see above)
xyz_points = hcat([[x, y, rosenbrock([x,y])] for x in x for y in y]...)

and

point_cloud = rand(size(xyz_points)...) .+ [2., 2., 2.]

We then compute the Wasserstein gradients and plot 30 of those (picked at random):
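The gradient computation behind this plot amounts to a single call to the function defined above (a sketch; the name point_cloud_gradient is ours, and the plotting code is omitted). The gradient is taken with respect to the first argument, here the points of $\mathcal{D}_2$:

# value of the Wasserstein distance and its gradient with respect to the particles in point_cloud
val, point_cloud_gradient = compute_wasserstein_gradient(point_cloud, xyz_points)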

We now want to train a neural network based on this Wasserstein loss. The loss function is:

\[L_\mathcal{NN}(\theta) = W_2(\mathcal{NN}_\theta([x^{(1)}, \ldots, x^{(\mathtt{np})}]), \mathcal{D}_1),\]

where $x^{(1)}, \ldots, x^{(\mathtt{np})}$ are samples drawn from $\mathcal{N}(0,1)\times\mathcal{N}(0,1)$, np is the number of points in $\mathcal{D}_1$ (i.e. in xyz_points) and $W_2$ is the Wasserstein distance. We then have

\[\nabla_\theta{}L_\mathcal{NN} = (\nabla{}W_2)\nabla_\theta\mathcal{NN},\]

where $\nabla{}W_2$ is precisely what the function compute_wasserstein_gradient computes. In code the chain rule reads:

function compute_gradient(ps::NeuralNetworkParameters)
    # draw as many samples from 𝒩(0,1)×𝒩(0,1) as there are points in xyz_points
    samples = randn(2, size(xyz_points, 2))
    # forward pass through the network together with the pullback needed for ∇_θ𝒩𝒩
    estimate, nn_pullback = Zygote.pullback(ps -> model(samples, ps), ps)

    # ∇W₂ evaluated at the output of the network
    valS, wasserstein_gradient = compute_wasserstein_gradient(estimate, xyz_points)
    # chain rule: pull the Wasserstein gradient back through the network
    valS, nn_pullback(wasserstein_gradient)[1]
end

# note the very high value for the learning rate
optimizer = Optimizer(nn, AdamOptimizer(1e-1))

So we use Zygote [94] to compute $\nabla_\theta\mathcal{NN}$ and we use compute_wasserstein_gradient to obtain $\nabla{}W_2$. We can now train our network:

# note the small number of training steps
const training_steps = 80
loss_array = zeros(training_steps)
for i in 1:training_steps
    val, dp = compute_gradient(params(nn))
    loss_array[i] = val
    # optimization step that respects the manifold structure of the Grassmann weight
    optimization_step!(optimizer, λY, params(nn), dp.params)
end

and plot the training loss:

Now we plot a few points to check how well they match the graph:

const number_of_points = 35
coordinates = nn(randn(2, number_of_points))
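Besides the visual check we can also quantify how well the generated points match the graph, for example via their vertical distance to it (a sketch; rosenbrock is the function defined above):

# vertical distance of every generated point from the graph of f
errors = [abs(coordinates[3, i] - rosenbrock(coordinates[1:2, i])) for i in 1:number_of_points]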

If points appear in a darker color, this means that they lie behind the graph of the Rosenbrock function.

Library Functions

GeometricMachineLearning.GrassmannLayer - Type
GrassmannLayer(n, N)

Make an instance of GrassmannLayer.

This layer performs simple multiplication with an element of the Grassmann manifold, i.e.

\[ \mathtt{GrassmannLayer}: x \mapsto Ax,\]

where $A$ is a matrix representative of an element $\mathcal{A}$ of the Grassmann manifold, i.e. $\mathcal{A} = \mathrm{span}(A)$.
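A minimal usage sketch, mirroring the setup of the example above (the name single_layer_network is ours):

single_layer_network = NeuralNetwork(Chain(GrassmannLayer(2, 3)), CPU(), Float64)
single_layer_network(randn(2, 10))   # maps ten samples from ℝ² to ℝ³ via x ↦ Ax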

  • [2] The implementation of the Wasserstein distance is taken from [93].
  • [3] For each Wasserstein loss we need to define such a cost function. For this particular choice of $c$ the Wasserstein distance corresponds to a $W_2$ loss.