Deep learning involves a lot of derivative computations in multi-dimensional spaces.

In particular, any network module, as well as a full network composed of multiple modules, is a mapping function between input and output spaces, but also between parameter and output spaces, the latter being the mapping of interest for gradient descent.
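To fix notation for the rest of the post (the symbols g, θ, x, y, n, p and o are choices made here), a module g with parameters θ maps an n-dimensional input x to an o-dimensional output, and, for a fixed input, maps its p-dimensional parameter vector to the same output space:

$$ x \mapsto g_\theta(x), \quad \mathbb{R}^n \to \mathbb{R}^o \qquad \text{and} \qquad \theta \mapsto g_\theta(x), \quad \mathbb{R}^p \to \mathbb{R}^o. $$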

A loss function is a mapping to a 1-dimensional space (a scalar).
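With the network output y = g_θ(x) as its argument:

$$ L : \mathbb{R}^o \to \mathbb{R}, \qquad y \mapsto L(y). $$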

Linear algebra

A matrix multiplication (or, in higher dimensions, a tensor contraction) gives a new matrix.
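In index notation, with A and B of compatible shapes (m × d and d × n, say):

$$ (AB)_{ij} = \sum_{k=1}^{d} A_{ik} B_{kj}. $$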

The inner product (or scalar product) of two vectors outputs a scalar.
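For two vectors u and v of the same dimension d:

$$ \langle u, v \rangle = u^T v = \sum_{i=1}^{d} u_i v_i. $$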

The same holds for matrices: the matrix dot product, or Frobenius inner product, outputs a scalar.
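For two matrices A and B of the same shape:

$$ \langle A, B \rangle_F = \sum_{i,j} A_{ij} B_{ij} = \operatorname{tr}(A^T B). $$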

The outer product of two vectors produces a matrix.
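For u of dimension m and v of dimension n:

$$ (u \otimes v)_{ij} = (u v^T)_{ij} = u_i v_j. $$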

Jacobian

The Jacobian is the first-order derivative of the function.

For a network module, the Jacobian is a matrix with one row per output, where j is the index of the output in the network output and the gradient of each output with respect to the input is a column vector.
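With the notation above, writing y = g(x) for the module output (the θ subscript is dropped for readability), this matrix can be sketched as:

$$ J_g = \frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial y_o}{\partial x_1} & \cdots & \frac{\partial y_o}{\partial x_n} \end{bmatrix} = \begin{bmatrix} (\nabla_x y_1)^T \\ \vdots \\ (\nabla_x y_o)^T \end{bmatrix} \in \mathbb{R}^{o \times n}, $$

the j-th row being the transposed gradient of the output y_j.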

In the case of the loss function, the Jacobian is usually written as a vector (the gradient), but I’ll write it as a matrix with one row, where i is the index of the output in the network output.
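In this row convention:

$$ J_L = \frac{\partial L}{\partial y} = \begin{bmatrix} \frac{\partial L}{\partial y_1} & \cdots & \frac{\partial L}{\partial y_o} \end{bmatrix} = (\nabla_y L)^T \in \mathbb{R}^{1 \times o}. $$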

As an example, take the Jacobian of a linear layer.
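Writing the layer as y = Wx + b (the symbols W and b are chosen here), its Jacobian with respect to the input is simply the weight matrix:

$$ J = \frac{\partial y}{\partial x} = W. $$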

Now add an activation function f on top of the linear layer.
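Assuming f is applied elementwise (a sketch, with the same W and b as above):

$$ y = f(Wx + b), \qquad J = \frac{\partial y}{\partial x} = \operatorname{diag}\!\big(f'(Wx + b)\big)\, W. $$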

The introduction of a non-linearity modifies the Jacobian so that each row (corresponding to the Jacobian of one neuron / output) is multiplied by the derivative of the non-linearity evaluated at this neuron’s linear output (its pre-activation).

Hessian

The Hessian is the second-order derivative, following the same definition as for the Jacobian.
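For a module, differentiating each entry of the Jacobian once more gives, in index form (the ordering of the indices is a choice made here):

$$ (H_g)_{j,k,l} = \frac{\partial (J_g)_{j,k}}{\partial x_l} = \frac{\partial^2 y_j}{\partial x_k \, \partial x_l}. $$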

Since the Jacobian is a matrix, the Hessian is a 3-dimensional tensor (T2 denoting the transpose with respect to the third dimension).

Let’s write the special case of a scalar function, for which the Hessian is an (o x o) symmetric matrix.
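For the loss L, which outputs a scalar:

$$ H_L = \frac{\partial^2 L}{\partial y \, \partial y^T} = \left[ \frac{\partial^2 L}{\partial y_i \, \partial y_j} \right]_{i,j = 1,\dots,o} \in \mathbb{R}^{o \times o}. $$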

Composition of functions

A composition is, for example, the loss function computed on the output of the network (the softmax can be seen either as a module inside the network or as part of the loss function).
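In symbols, with the notation used above:

$$ (L \circ g)(x) = L\big(g(x)\big). $$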

In the general case, when g has a multi-dimensional output, the Jacobian of the composition is the product of the Jacobians, a simple matrix multiplication known as the chain rule.
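With the row-Jacobian convention above:

$$ J_{L \circ g}(x) = J_L\big(g(x)\big) \, J_g(x). $$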

And in the scalar case (when L outputs a scalar), this can be rewritten in vector notation.
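Transposing the row form gives the usual gradient, the gradient of L being taken at y = g(x):

$$ \nabla_x (L \circ g)(x) = J_g(x)^T \, \nabla_y L. $$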

Switching between these two notations is why it can sometimes be a bit confusing.

Let’s go for the Hessian of a composition of functions, considering the scalar case only (the multi-dimensional case is left as an exercise for the reader :) ). Let’s keep in mind that the Jacobian of the loss function is in fact evaluated at the output of the network.
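Starting from the gradient above, with the evaluation point written explicitly:

$$ \nabla_x (L \circ g)(x) = J_g(x)^T \, \nabla_y L\big(g(x)\big). $$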

Now differentiate once more with respect to x (if you have a headache, it might not be an anomaly).
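Applying the product rule, each factor contributes one term (a sketch, with H_L the Hessian of the loss and H_{g_j} the Hessian of the j-th output of g):

$$ \nabla_x^2 (L \circ g)(x) = J_g(x)^T \, H_L\big(g(x)\big) \, J_g(x) \;+\; \sum_{j=1}^{o} \frac{\partial L}{\partial y_j}\bigg|_{y = g(x)} H_{g_j}(x). $$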

The first term is the Gauss-Newton matrix, modeling the second-order interactions originating from the top part of the composition (the loss). It is positive semi-definite whenever the loss is convex in the network output, and it is used as an approximation of the Hessian in some second-order optimization algorithms.

Matching loss function

Let’s consider the case where the network’s last module is the output non-linearity, such as the softmax.

It is easy to see that the Jacobian of the softmax has a simple closed form.
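Writing y = softmax(z) for its output on the pre-activations z (symbols chosen here):

$$ \frac{\partial y_i}{\partial z_j} = y_i \, (\delta_{ij} - y_j), \qquad J_{\mathrm{softmax}} = \operatorname{diag}(y) - y\,y^T. $$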

Hence, consider the negative log-likelihood loss function, where Y is the one-hot encoding of the correct label.
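Its gradient with respect to the pre-activations z, obtained through the softmax Jacobian above, is affine in the output y:

$$ L(y) = -\sum_i Y_i \log y_i, \qquad \frac{\partial L}{\partial z} = \frac{\partial L}{\partial y} \, J_{\mathrm{softmax}} = (y - Y)^T. $$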

We say the log-likelihood loss function matches the softmax output non-linearity since its Jacobian is an affine transformation of the output.

In the same way, the mean squared error loss matches a linear output module.
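As a quick check of that claim (with t denoting the regression target, a symbol chosen here), for the mean squared error on a linear output module:

$$ L(y) = \tfrac{1}{2}\,\|y - t\|^2, \qquad \frac{\partial L}{\partial y} = (y - t)^T, $$

again an affine function of the output.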