Deep learning involves a lot of derivative computations in multi-dimensional spaces.
In particular, any network module, as well as a full network composed of multiple modules, is a mapping function $g : \mathbb{R}^n \to \mathbb{R}^o$ between an input space and an output space, but also between a parameter space and the output space, which is what gradient descent relies on.
A loss function is a mapping to a one-dimensional space (a scalar):

$$ \mathcal{L} : \mathbb{R}^o \to \mathbb{R} $$

where $\mathbb{R}^o$ is the output space of the network.
A matrix multiplication (or tensor product in higher dimensions) gives a new matrix:

$$ C = AB, \qquad C_{ij} = \sum_k A_{ik} B_{kj} $$
The inner product or scalar product of 2 vectors outputs a scalar:

$$ \langle x, y \rangle = x^T y = \sum_i x_i y_i $$
The same for matrices: the matrix dot product, or Frobenius inner product, outputs a scalar:

$$ \langle A, B \rangle_F = \operatorname{tr}(A^T B) = \sum_{i,j} A_{ij} B_{ij} $$
The outer product of 2 vectors produces a matrix:

$$ x \otimes y = x y^T, \qquad (x y^T)_{ij} = x_i y_j $$
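To make these products concrete, here is a minimal NumPy sketch (the shapes and seed are my own toy choices, not from the post) computing each of them:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
A, B = rng.normal(size=(2, 3)), rng.normal(size=(3, 4))
M, N = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))

C = A @ B                  # matrix multiplication: (2, 3) x (3, 4) -> (2, 4)
inner = x @ y              # inner product of two vectors -> scalar
frobenius = np.sum(M * N)  # Frobenius inner product -> scalar
assert np.isclose(frobenius, np.trace(M.T @ N))
outer = np.outer(x, y)     # outer product -> (3, 3) matrix
```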
The Jacobian is the first-order derivative of the function. For a network module $g : \mathbb{R}^n \to \mathbb{R}^o$, the Jacobian is an $(o \times n)$ matrix

$$ J_g = \frac{\partial g}{\partial x} = \begin{pmatrix} \nabla^T g_1 \\ \vdots \\ \nabla^T g_o \end{pmatrix}, \qquad (J_g)_{ji} = \frac{\partial g_j}{\partial x_i} $$

where $j$ is the index of the output $g_j$ in the network output and each gradient $\nabla g_j$ is a column vector.
In the case of the loss function, the Jacobian is usually the gradient vector $\nabla \mathcal{L}$, but I’ll write it as a matrix with one row

$$ J_{\mathcal{L}} = \nabla^T \mathcal{L} = \left( \frac{\partial \mathcal{L}}{\partial y_1} \;\cdots\; \frac{\partial \mathcal{L}}{\partial y_o} \right) $$

where $i$ is the index of the output $y_i$ in the network output.
For example, the Jacobian of a linear layer $y = Wx$:

$$ J = \frac{\partial y}{\partial x} = W $$
and with an activation function $f$, i.e. $y = f(Wx)$:

$$ J = \operatorname{diag}\big(f'(Wx)\big)\, W $$
The introduction of a non-linearity modifies the Jacobian so that each row (corresponding to the Jacobian of one neuron / output) is multiplied by the derivative of the non-linearity evaluated at that neuron’s pre-activation $Wx$.
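As a sanity check, here is a small NumPy sketch (the layer sizes and the choice of `tanh` as the activation are my assumptions) comparing this analytic Jacobian against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # linear layer: 3 inputs -> 4 outputs
x = rng.normal(size=3)

def layer(x):
    return np.tanh(W @ x)

# Analytic Jacobian: diag(f'(Wx)) W, with tanh'(z) = 1 - tanh(z)^2
J = np.diag(1.0 - np.tanh(W @ x) ** 2) @ W

# Finite-difference check, one input dimension (column) at a time
eps = 1e-6
J_fd = np.stack([(layer(x + eps * e) - layer(x - eps * e)) / (2 * eps)
                 for e in np.eye(3)], axis=1)
assert np.allclose(J, J_fd, atol=1e-6)
```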
The Hessian is the second-order derivative, following the same definition as for the Jacobian:

$$ H_g = \frac{\partial J_g^{T_2}}{\partial x}, \qquad (H_g)_{jik} = \frac{\partial^2 g_j}{\partial x_i \, \partial x_k} $$

Since $J_g$ is a matrix, $H_g$ is a 3-dimensional tensor. $T_2$ stands for the transpose into the third dimension.
Let’s write the special case of a scalar function, for which the Hessian is an $(o \times o)$ symmetric matrix:

$$ H_{\mathcal{L}} = \nabla^2 \mathcal{L}, \qquad (H_{\mathcal{L}})_{ik} = \frac{\partial^2 \mathcal{L}}{\partial y_i \, \partial y_k} $$
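For instance (a toy scalar function of my choosing, not from the post), the Hessian of the log-sum-exp function can be estimated by finite differences of its gradient, and it comes out symmetric as expected:

```python
import numpy as np

def loss(y):  # scalar function R^3 -> R
    return np.log(np.sum(np.exp(y)))

def grad(y):  # its gradient is the softmax of y
    e = np.exp(y)
    return e / e.sum()

y = np.array([0.1, -0.4, 0.7])
eps = 1e-6
# H[i, k] = d grad_i / d y_k, estimated column by column
H = np.stack([(grad(y + eps * e) - grad(y - eps * e)) / (2 * eps)
              for e in np.eye(3)], axis=1)
assert np.allclose(H, H.T, atol=1e-6)  # (o x o) and symmetric
```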
Composition of functions
A composition is, for example, the loss function computed on the output of the network, $\mathcal{L} \circ g$ (softmax can be seen as a module inside the network or inside the loss function).
In the general case, when the outer function has a multi-dimensional output, the Jacobian of the composition is a simple matrix multiplication known as the chain rule:

$$ J_{\mathcal{L} \circ g}(x) = J_{\mathcal{L}}\big(g(x)\big) \, J_g(x) $$
And in the scalar case (when $\mathcal{L}$ outputs a scalar), this can be rewritten with the vector notation:

$$ \nabla (\mathcal{L} \circ g)(x) = J_g(x)^T \, \nabla \mathcal{L}\big(g(x)\big) $$
That is why notations can sometimes be a bit confusing: the gradient of the composition uses the transposed Jacobian, while the row-matrix form does not.
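Here is a minimal numerical check of the vector form of the chain rule (module, loss, and shapes are my own toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))

g = lambda x: np.tanh(W @ x)        # module: R^3 -> R^4
L = lambda y: 0.5 * np.sum(y ** 2)  # loss: R^4 -> R

x = rng.normal(size=3)
J_g = np.diag(1.0 - np.tanh(W @ x) ** 2) @ W  # Jacobian of g at x, (4, 3)
grad_L = g(x)                                 # gradient of L evaluated at g(x)

grad_comp = J_g.T @ grad_L                    # chain rule in vector notation

eps = 1e-6
grad_fd = np.array([(L(g(x + eps * e)) - L(g(x - eps * e))) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(grad_comp, grad_fd, atol=1e-6)
```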
Let’s go for the Hessian of a composition of functions, but considering the scalar case only (the multi-dimensional case is left as an exercise for the reader :) ). Let’s keep in mind that the Jacobian of the loss function is in fact evaluated at the output of the network:

$$ \nabla (\mathcal{L} \circ g)(x) = J_g(x)^T \, \nabla \mathcal{L}\big(g(x)\big) $$

and differentiate (if you have a headache, it might not be an anomaly):

$$ H_{\mathcal{L} \circ g}(x) = J_g(x)^T \, H_{\mathcal{L}}\big(g(x)\big) \, J_g(x) + \sum_j \frac{\partial \mathcal{L}}{\partial y_j}\big(g(x)\big) \, H_{g_j}(x) $$
The first part, $J_g^T \, H_{\mathcal{L}} \, J_g$, is the Gauss-Newton matrix, modeling the second-order interactions originating from the top part $\mathcal{L}$. It is positive semi-definite (as soon as the loss is convex) and is used as an approximation of the Hessian in some second-order optimization algorithms.
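The two terms can be checked numerically on a toy problem (my choice of $g(x) = \tanh(Wx)$ and $\mathcal{L}(y) = \frac{1}{2}\|y\|^2$, not from the post); for $\tanh$, each $H_{g_j} = \tanh''(z_j)\, w_j w_j^T$ with $\tanh''(z) = -2 \tanh(z)\big(1 - \tanh^2(z)\big)$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)

y = np.tanh(W @ x)
J_g = np.diag(1.0 - y ** 2) @ W  # Jacobian of g at x, (4, 3)
H_L = np.eye(4)                  # Hessian of L(y) = 0.5 ||y||^2
grad_L = y                       # gradient of L at g(x)

gauss_newton = J_g.T @ H_L @ J_g  # first term: positive semi-definite

# Second term: sum_j dL/dy_j * H_{g_j}, with H_{g_j} = tanh''(z_j) w_j w_j^T
second = sum(grad_L[j] * (-2.0 * y[j] * (1.0 - y[j] ** 2)) * np.outer(W[j], W[j])
             for j in range(4))

H = gauss_newton + second  # full Hessian of L o g at x

# Finite-difference check against the gradient of the composition
def grad_comp(x):
    yy = np.tanh(W @ x)
    return (np.diag(1.0 - yy ** 2) @ W).T @ yy

eps = 1e-5
H_fd = np.stack([(grad_comp(x + eps * e) - grad_comp(x - eps * e)) / (2 * eps)
                 for e in np.eye(3)], axis=1)
assert np.allclose(H, H_fd, atol=1e-4)
```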
Matching loss function
Let’s consider the case where the network’s last module $\sigma$ is the output non-linearity, such as softmax:

$$ \sigma(z)_i = \frac{e^{z_i}}{\sum_k e^{z_k}} $$

It is easy to see that

$$ \frac{\partial \sigma_i}{\partial z_j} = \sigma_i \, (\delta_{ij} - \sigma_j) $$
Hence, for the loss function $\mathcal{L}(z) = - \sum_i Y_i \log \sigma(z)_i$, where $Y$ is the one-hot encoding of the correct label:

$$ \nabla_z \mathcal{L} = \sigma(z) - Y $$
We say the log-likelihood loss function matches the softmax output non-linearity since its Jacobian is an affine transformation of the output.
In the same way, the mean squared error loss matches a linear output module.
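The matching property is easy to verify numerically; in this sketch (toy logits and label of my choosing) the finite-difference gradient of the loss with respect to the pre-activations equals $\sigma(z) - Y$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shifted for numerical stability
    return e / e.sum()

def nll(z, Y):  # negative log-likelihood on top of softmax
    return -np.sum(Y * np.log(softmax(z)))

z = np.array([1.0, -0.5, 0.3])  # pre-activations (logits)
Y = np.array([0.0, 1.0, 0.0])   # one-hot label

eps = 1e-6
grad_fd = np.array([(nll(z + eps * e, Y) - nll(z - eps * e, Y)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(grad_fd, softmax(z) - Y, atol=1e-6)
```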