Deep learning involves lots of derivative computations in multi-dimensional spaces.

In particular, any network module, as well as a full network composed of multiple modules, is a mapping between input and output spaces, but also between parameter and output spaces, which is the mapping of most interest for gradient descent.

A loss function is a mapping to a 1-dimensional (scalar) space:
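One way to write these three mappings, as a sketch: $n$ parameters and $o$ outputs follow the dimensions used in the Jacobian section, while the input dimension $d$ and the input name $x$ are assumed names introduced only for illustration.

$$
x \mapsto \mathcal{N}(x; \vec{w}) : \mathbb{R}^{d} \rightarrow \mathbb{R}^{o}
\qquad
\vec{w} \mapsto \mathcal{N}(x; \vec{w}) : \mathbb{R}^{n} \rightarrow \mathbb{R}^{o}
\qquad
\mathcal{L} : \mathbb{R}^{o} \rightarrow \mathbb{R}
$$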

Linear algebra

A matrix multiplication (or a tensor dot product in higher dimensions) gives a new matrix:
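In components, assuming $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times p}$ (names chosen here for illustration):

$$
(A \times B)_{ij} = \sum_{l=1}^{k} A_{il} B_{lj}, \qquad A \times B \in \mathbb{R}^{m \times p}
$$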

The inner product, or scalar product, of two vectors outputs a scalar:
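For two column vectors $u, v \in \mathbb{R}^{m}$ (illustrative names), this standard definition reads:

$$
u \cdot v = u^T \times v = \sum_{i=1}^{m} u_i v_i \in \mathbb{R}
$$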

The same goes for matrices: the matrix dot product, or Frobenius inner product, outputs a scalar:
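For two matrices of the same shape, $A, B \in \mathbb{R}^{m \times p}$ (illustrative names), the standard definition is:

$$
A \cdot B = \sum_{i=1}^{m} \sum_{j=1}^{p} A_{ij} B_{ij} = \mathrm{tr}(A^T \times B) \in \mathbb{R}
$$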

The outer product of two vectors produces a matrix:
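For column vectors $u \in \mathbb{R}^{m}$ and $v \in \mathbb{R}^{p}$ (illustrative names; the symbol $\otimes$ is a common convention assumed here):

$$
u \otimes v = u \times v^T, \qquad (u \otimes v)_{ij} = u_i v_j, \qquad u \otimes v \in \mathbb{R}^{m \times p}
$$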

Jacobian

The Jacobian is the first-order derivative of a function.

For the network module, the Jacobian is a matrix $\in \mathbb{R}^{o \times n}$

where $j$ is the index of the output in the network output and $\vec{w}$ is a column vector.
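Written out in components, with $i$ as an assumed name for the parameter index (row $j$ holds the derivatives of output $j$):

$$
J_{\mathcal{N}} = \left( \frac{\partial \mathcal{N}_j}{\partial w_i} \right)_{j,i} =
\begin{pmatrix}
\frac{\partial \mathcal{N}_1}{\partial w_1} & \cdots & \frac{\partial \mathcal{N}_1}{\partial w_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial \mathcal{N}_o}{\partial w_1} & \cdots & \frac{\partial \mathcal{N}_o}{\partial w_n}
\end{pmatrix}
\in \mathbb{R}^{o \times n}
$$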

In the case of the loss function, the Jacobian is usually written as the gradient vector, but I’ll write it as a matrix with one row, $\in \mathbb{R}^{1 \times o}$, where $i$ is the index of the output in the network output.
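A sketch of the two notations side by side, assuming the loss is differentiated with respect to the network outputs $\mathcal{N}_i$; the gradient is a column vector and the one-row Jacobian is its transpose:

$$
\nabla \mathcal{L} = \left( \frac{\partial \mathcal{L}}{\partial \mathcal{N}_1}, \ldots, \frac{\partial \mathcal{L}}{\partial \mathcal{N}_o} \right)^T \in \mathbb{R}^{o},
\qquad
J_{\mathcal{L}} = \left( \frac{\partial \mathcal{L}}{\partial \mathcal{N}_i} \right)_{i} = (\nabla \mathcal{L})^T \in \mathbb{R}^{1 \times o}
$$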

For example, the Jacobian of a linear layer:
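As a sketch, taking the derivative with respect to the layer input $x$ (an assumption; the names $W$, $b$, $x$ and $z$ are illustrative), the linear layer $z = W \times x + b$ has a constant Jacobian:

$$
J_{\text{linear}} = \frac{\partial z}{\partial x} = W
$$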

and with an activation function $f$:
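Under the same assumption (derivative with respect to the layer input $x$, with $z = W \times x + b$ and $f$ applied element-wise; names are illustrative), the chain rule gives:

$$
J_{f \circ \text{linear}} = \mathrm{diag}\big(f'(z_1), \ldots, f'(z_o)\big) \times W
$$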

Introducing a non-linearity modifies the Jacobian so that each row (the Jacobian of one neuron, i.e. one output) is multiplied by the derivative of the non-linearity evaluated at that neuron's linear output (its pre-activation).
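The row-scaling effect can be checked numerically. The sketch below is not the author's code: it assumes a tanh activation, differentiates with respect to the layer input, and uses made-up values for `W`, `b` and `x`. It compares the row-scaled matrix against a central finite-difference estimate.

```python
import math

def linear(W, b, x):
    """Pre-activation z = W x + b, with W given as a list of rows."""
    return [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j
            for row, b_j in zip(W, b)]

def layer(W, b, x):
    """Linear layer followed by a tanh activation (tanh is an assumed choice)."""
    return [math.tanh(z_j) for z_j in linear(W, b, x)]

def analytic_jacobian(W, b, x):
    """Row j of W scaled by tanh'(z_j) = 1 - tanh(z_j)^2."""
    z = linear(W, b, x)
    return [[(1.0 - math.tanh(z_j) ** 2) * w_ji for w_ji in row]
            for row, z_j in zip(W, z)]

def numeric_jacobian(func, x, eps=1e-6):
    """Central finite differences; entry [j][i] approximates d func_j / d x_i."""
    columns = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += eps
        x_minus = list(x)
        x_minus[i] -= eps
        f_plus, f_minus = func(x_plus), func(x_minus)
        columns.append([(p - m) / (2 * eps) for p, m in zip(f_plus, f_minus)])
    # columns[i][j] holds d func_j / d x_i, so transpose to get one row per output
    return [list(row) for row in zip(*columns)]

# Illustrative values only: 2 outputs, 3 inputs
W = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]]
b = [0.1, -0.2]
x = [0.4, -0.6, 0.9]

J_analytic = analytic_jacobian(W, b, x)
J_numeric = numeric_jacobian(lambda v: layer(W, b, v), x)
max_error = max(abs(a - n)
                for row_a, row_n in zip(J_analytic, J_numeric)
                for a, n in zip(row_a, row_n))
print(max_error)  # expected to be tiny if the two Jacobians agree
```

The same row scaling holds when differentiating with respect to the parameters, since the chain rule places the same diagonal factor on the left.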

Hessian

The Hessian is the second-order derivative, following the same definition as for the Jacobian:

Since $J_{\mathcal{N}}$ is a matrix, $H_{\mathcal{N}}$ is a 3-dimensional tensor. $T2$ denotes the transpose to the third dimension.
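In components, one possible index convention (assumed here, with $i$ and $k$ as illustrative names for the two parameter indices) is:

$$
(H_{\mathcal{N}})_{j,i,k} = \frac{\partial^2 \mathcal{N}_j}{\partial w_i \, \partial w_k}, \qquad H_{\mathcal{N}} \in \mathbb{R}^{o \times n \times n}
$$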

Let’s write the special case of a scalar function, for which the Hessian is an $o \times o$ symmetric matrix:
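For the loss $\mathcal{L}$, differentiated twice with respect to the network outputs (the indices $i$ and $k$ are illustrative names), this reads:

$$
h_{\mathcal{L}} = \left( \frac{\partial^2 \mathcal{L}}{\partial \mathcal{N}_i \, \partial \mathcal{N}_k} \right)_{i,k} \in \mathbb{R}^{o \times o}
$$

The symmetry follows from the equality of mixed partial derivatives when $\mathcal{L}$ is twice continuously differentiable.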

Composition of functions

A composition is, for example, the loss function computed on the output of the network (softmax can be seen as a module inside either the network or the loss function).

In the general case, when $\mathcal{L}$ has a multi-dimensional output, the Jacobian of the composition is a simple matrix multiplication, known as the chain rule:
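As a sketch, assuming $\mathcal{L}$ has $p$ outputs ($p$ is an illustrative name) and its Jacobian is evaluated at the network output:

$$
J_{\mathcal{L} \circ \mathcal{N}} = J_{\mathcal{L}} \times J_{\mathcal{N}}, \qquad
J_{\mathcal{L}} \in \mathbb{R}^{p \times o}, \quad
J_{\mathcal{N}} \in \mathbb{R}^{o \times n}, \quad
J_{\mathcal{L} \circ \mathcal{N}} \in \mathbb{R}^{p \times n}
$$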

And in the scalar case (when $\mathcal{L}$ outputs a scalar), this can be rewritten with the vector notation:
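With gradients written as column vectors (a common convention, assumed here), transposing the one-row product $J_{\mathcal{L}} \times J_{\mathcal{N}}$ gives:

$$
\nabla_{\vec{w}} (\mathcal{L} \circ \mathcal{N}) = J_{\mathcal{N}}^T \times \nabla \mathcal{L}
$$

The factors appear in the opposite order compared with the Jacobian form, and the network Jacobian is transposed.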

The order of the factors differs between the matrix notation and the vector notation, which is why it can sometimes be a bit confusing.

Let’s move on to the Hessian of a composition of functions, considering the scalar case only (the multi-dimensional case is left as an exercise for the reader :) ). Keep in mind that the Jacobian of the loss function is evaluated at the output of the network:
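Making the evaluation points explicit (a sketch, with $\vec{w}$ as the variable of differentiation):

$$
J_{\mathcal{L} \circ \mathcal{N}}(\vec{w}) = J_{\mathcal{L}}\big(\mathcal{N}(\vec{w})\big) \times J_{\mathcal{N}}(\vec{w})
$$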

and differentiate (if you have a headache, it might not be an anomaly):
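Applying the product rule to each entry $\sum_j \frac{\partial \mathcal{L}}{\partial \mathcal{N}_j} \frac{\partial \mathcal{N}_j}{\partial w_i}$ and differentiating with respect to $w_k$ gives two terms. In the sketch below, $H_{\mathcal{N}_j} \in \mathbb{R}^{n \times n}$ is an assumed notation for the Hessian of the $j$-th network output, and $h_{\mathcal{L}}$ and the partial derivatives of $\mathcal{L}$ are evaluated at $\mathcal{N}(\vec{w})$:

$$
H_{\mathcal{L} \circ \mathcal{N}} = J_{\mathcal{N}}^T \times h_{\mathcal{L}} \times J_{\mathcal{N}} + \sum_{j=1}^{o} \frac{\partial \mathcal{L}}{\partial \mathcal{N}_j} \, H_{\mathcal{N}_j}
$$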

The first part, $G = J_{\mathcal{N}}^T \times h_{\mathcal{L}} \times J_{\mathcal{N}}$, is the Gauss-Newton matrix, which models the second-order interactions originating from the top part $\mathcal{L}$. It is positive semi-definite whenever $h_{\mathcal{L}}$ is (for instance when the loss is convex in the network output), and it is used as an approximation of the Hessian in some second-order optimization algorithms.
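The positive semi-definiteness can be spot-checked numerically. The sketch below is illustrative only: it uses a random stand-in for $J_{\mathcal{N}}$ and a hypothetical positive semi-definite $h_{\mathcal{L}}$ of the form $\mathrm{diag}(p) - p \times p^T$ for a probability vector $p$, then samples the quadratic form $v^T \times G \times v$ on random vectors. Sampling is a sanity check, not a proof.

```python
import random

def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def quadratic_form(M, v):
    """Computes v^T M v."""
    size = len(v)
    return sum(v[i] * M[i][k] * v[k] for i in range(size) for k in range(size))

random.seed(0)
o, n = 3, 4  # illustrative sizes: 3 outputs, 4 parameters

# Random stand-in for the network Jacobian, shape (o, n)
J = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(o)]

# Hypothetical PSD loss Hessian, shape (o, o): diag(p) - p p^T
p = [0.2, 0.5, 0.3]
h = [[(p[i] if i == k else 0.0) - p[i] * p[k] for k in range(o)]
     for i in range(o)]

# Gauss-Newton matrix G = J^T h J, shape (n, n)
G = matmul(matmul(transpose(J), h), J)

# Smallest sampled value of v^T G v over random directions
min_form = min(
    quadratic_form(G, [random.gauss(0.0, 1.0) for _ in range(n)])
    for _ in range(1000)
)
print(min_form)  # should not be meaningfully negative
```

The reason this works is that $v^T \times G \times v = (J_{\mathcal{N}} \times v)^T \times h_{\mathcal{L}} \times (J_{\mathcal{N}} \times v)$, so any non-negativity of $h_{\mathcal{L}}$ carries over to $G$.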

Matching loss function

Let’s consider the case where $\mathcal{N}$ is the output non-linearity module, such as softmax.

It is easy to see that
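One identity that fits here is the softmax Jacobian. Which formula the author had in mind is an assumption, and $z \in \mathbb{R}^{o}$ is an illustrative name for the softmax input:

$$
\mathcal{N}_i(z) = \frac{e^{z_i}}{\sum_{k=1}^{o} e^{z_k}}, \qquad
\frac{\partial \mathcal{N}_i}{\partial z_j} = \mathcal{N}_i \, (\delta_{ij} - \mathcal{N}_j)
$$

where $\delta_{ij}$ is 1 when $i = j$ and 0 otherwise.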

Hence, for the loss function $\mathcal{L} = Y \cdot \log \mathcal{N}$, where $Y$ is the one-hot encoding of the correct label:
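A sketch of the derivation, with $z$ as an assumed name for the softmax input, using the softmax derivative $\frac{\partial \mathcal{N}_i}{\partial z_j} = \mathcal{N}_i (\delta_{ij} - \mathcal{N}_j)$ and the fact that the entries of $Y$ sum to 1:

$$
\frac{\partial \mathcal{L}}{\partial z_j}
= \sum_{i=1}^{o} \frac{Y_i}{\mathcal{N}_i} \, \mathcal{N}_i \, (\delta_{ij} - \mathcal{N}_j)
= Y_j - \mathcal{N}_j \sum_{i=1}^{o} Y_i
= Y_j - \mathcal{N}_j,
\qquad
J_{\mathcal{L} \circ \mathcal{N}} = (Y - \mathcal{N})^T
$$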

We say the log-likelihood loss function matches the softmax output non-linearity since its Jacobian is an affine transformation of the output.
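The affine form $Y - \mathcal{N}$ can be checked numerically. The sketch below is not from the author: the input values are made up and the variable names are illustrative. It compares $Y - \mathcal{N}$ with a central finite-difference gradient of $Y \cdot \log \mathcal{N}$ taken with respect to the softmax input.

```python
import math

def softmax(z):
    """Softmax with the usual max-shift for numerical stability."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(z, y):
    """L = Y . log N, with N = softmax(z)."""
    return sum(y_i * math.log(n_i) for y_i, n_i in zip(y, softmax(z)))

def numeric_gradient(z, y, eps=1e-6):
    """Central finite differences of L with respect to each z_j."""
    grad = []
    for j in range(len(z)):
        z_plus = list(z)
        z_plus[j] += eps
        z_minus = list(z)
        z_minus[j] -= eps
        grad.append((log_likelihood(z_plus, y) - log_likelihood(z_minus, y))
                    / (2 * eps))
    return grad

# Illustrative values: three classes, the third one is the correct label
z = [1.0, -0.5, 2.0]
y = [0.0, 0.0, 1.0]

analytic = [y_j - n_j for y_j, n_j in zip(y, softmax(z))]
numeric = numeric_gradient(z, y)
max_error = max(abs(a - g) for a, g in zip(analytic, numeric))
print(max_error)  # expected to be tiny if the gradient is Y - N
```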

In the same way, the mean squared error loss matches a linear output module.
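As a sketch, assuming an identity output module $\mathcal{N}(z) = z$ and the common $\frac{1}{2}$ scaling convention for the squared error ($z$ and $Y$ are illustrative names for the module input and the target):

$$
\mathcal{L} = \frac{1}{2} \, \| Y - \mathcal{N} \|^2
\quad \Rightarrow \quad
J_{\mathcal{L} \circ \mathcal{N}} = (\mathcal{N} - Y)^T
$$

which is again an affine transformation of the output.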