Linear algebra for derivatives in multi-dimensional spaces: tensor products, inner and outer products...
Deep learning involves a lot of derivative computations in multi-dimensional spaces.
In particular, any network module, as well as a full network composed of multiple modules, is a mapping between an input space and an output space
\[\mathbb{R}^m \rightarrow \mathbb{R}^o\]
but also between the parameter space and the output space
\[\mathcal{N} : \mathbb{R}^n \rightarrow \mathbb{R}^o\]
which is of much interest for gradient descent.
A loss function is a mapping to a 1-dimensional space (a scalar):
\[\mathcal{L} : \mathbb{R}^o \rightarrow \mathbb{R}\]
Linear algebra
A matrix multiplication (or tensor product in higher dimensions) gives a new matrix:
\[A B = A \times B = A \cdot B = [ \sum_k a_{i,k} b_{k,j} ]_{i,j}\]
The inner product (or scalar product) of 2 vectors outputs a scalar:
\[u \cdot v = \sum_i u_i v_i = u^T \times v\]
The same holds for matrices: the matrix dot product, or Frobenius inner product, outputs a scalar:
\[A \odot B = \sum_{i,j} a_{i,j} b_{i,j}\]
The outer product of 2 vectors produces a matrix:
\[u \otimes v = u \times v^T = [ u_i v_j ]_{i,j}\]
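As a quick sanity check, here is a minimal numpy sketch of these four products (the matrices and vectors below are arbitrary examples):

```python
import numpy as np

A = np.arange(6).reshape(2, 3).astype(float)   # 2x3 matrix
B = np.arange(12).reshape(3, 4).astype(float)  # 3x4 matrix
C = A + 1.0                                    # same shape as A
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

matmul = A @ B                # matrix product, shape (2, 4)
inner = u @ v                 # inner (scalar) product: sum_i u_i v_i
frobenius = np.sum(A * C)     # Frobenius inner product: sum_{i,j} a_ij c_ij
outer = np.outer(u, v)        # outer product, shape (3, 3): [u_i v_j]_{i,j}

assert np.isclose(inner, np.dot(u, v))
assert np.allclose(outer, u[:, None] * v[None, :])
```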
Jacobian
The Jacobian is the first-order derivative of the function.
For the network module, the Jacobian is a matrix \(\in \mathbb{R}^{o \times n}\)
\[J_{i,j} = \frac{ \partial \mathcal{N}_i }{ \partial w_j }\] \[J = \frac{ \partial \vec{\mathcal{N}} }{ \partial \vec{w}^T }\]
where i is the index of the output of the network, j is the index of the parameter, and \(\vec{w}\) is a column vector.
In the case of the loss function, the Jacobian is usually the vector
\[j_{i} = \frac{ \partial \mathcal{L} }{ \partial u_i }\] \[j = \frac{ \partial \mathcal{L} }{ \partial \vec{u} }\]
but I'll write it as a matrix with one row \(\in \mathbb{R}^{1 \times o}\)
\[J_{0,i} = \frac{ \partial \mathcal{L} }{ \partial u_i }\] \[J = \frac{ \partial \mathcal{L} }{ \partial \vec{u}^T }\]
where i is the index of the output of the network.
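To make the convention concrete, here is a hedged sketch that computes a Jacobian by finite differences (the toy function, names and step size are illustrative choices, not part of the derivation):

```python
import numpy as np

def jacobian_fd(f, w, eps=1e-6):
    """Finite-difference Jacobian of f: R^n -> R^o, returned with shape (o, n)."""
    w = np.asarray(w, dtype=float)
    f0 = np.asarray(f(w))
    J = np.zeros((f0.size, w.size))
    for j in range(w.size):
        dw = np.zeros_like(w)
        dw[j] = eps
        J[:, j] = (np.asarray(f(w + dw)) - f0) / eps  # column j: d f_i / d w_j
    return J

# Toy "network": two outputs, three parameters
f = lambda w: np.array([w[0] * w[1], np.sin(w[2])])
print(jacobian_fd(f, np.array([1.0, 2.0, 0.5])))
```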
As an example, the Jacobian of a linear layer:
\[\frac{\partial }{\partial \vec{h}^T } \left[ W \cdot \vec{h} \right] = W\]
and with an activation function f:
\[\frac{\partial }{\partial \vec{h}^T} \left[ f(W \cdot \vec{h}) \right] = diag( f' (W \cdot \vec{h}) ) \cdot W\]
The introduction of a non-linearity modifies the Jacobian so that each row (the Jacobian of one neuron / output) is multiplied by the derivative of the non-linearity evaluated at that neuron's pre-activation.
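A minimal numpy check of this formula (the random W and h, and the choice of tanh as f, are just an example):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
h = rng.normal(size=3)

f, f_prime = np.tanh, lambda x: 1.0 - np.tanh(x) ** 2

# Analytic Jacobian of f(W h) with respect to h: diag(f'(W h)) . W
J_analytic = np.diag(f_prime(W @ h)) @ W

# Finite-difference check, one column per input coordinate
eps = 1e-6
J_fd = np.stack([(f(W @ (h + eps * e)) - f(W @ h)) / eps
                 for e in np.eye(3)], axis=1)
assert np.allclose(J_analytic, J_fd, atol=1e-4)
```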
Hessian
The Hessian is the second-order derivative, following the same convention as for the Jacobian:
\[H_{\mathcal{N}} = \frac{ \partial J_{\mathcal{N}} }{ \partial \vec{w}^{T2} } = \frac{ \partial^2 \vec{\mathcal{N}} }{ \partial \vec{w}^T \partial \vec{w}^{T2} }\]
Since \(J_{\mathcal{N}}\) is a matrix, \(H_{\mathcal{N}}\) is a 3-dimensional tensor. T2 stands for the transpose along the third dimension.
\[(H_{\mathcal{N}})_{i,j,k} = \frac{ \partial^2 \mathcal{N}_i }{ \partial w_j \partial w_k }\]
Let's write the special case of a scalar function such as the loss, for which the Hessian is an (o x o) symmetric matrix, the derivatives being taken with respect to its input \(\vec{u}\):
\[h_{\mathcal{L}} = \frac{ \partial^2 \mathcal{L}}{ \partial \vec{u} \partial \vec{u}^T } = \frac{ \partial \vec{j_{\mathcal{L}}} }{ \partial \vec{u}^T }\] \[(h_{\mathcal{L}})_{i,j} = \frac{ \partial^2 \mathcal{L} }{ \partial u_i \partial u_j }\]
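For a scalar function, the Hessian can again be checked numerically; below is a hedged finite-difference sketch (the quadratic test function is illustrative):

```python
import numpy as np

def hessian_fd(f, u, eps=1e-4):
    """Finite-difference Hessian of a scalar function f: R^o -> R, shape (o, o)."""
    u = np.asarray(u, dtype=float)
    o = u.size
    H = np.zeros((o, o))
    for i in range(o):
        for j in range(o):
            ei = np.zeros(o); ei[i] = eps
            ej = np.zeros(o); ej[j] = eps
            H[i, j] = (f(u + ei + ej) - f(u + ei) - f(u + ej) + f(u)) / eps ** 2
    return H

# Example: f(u) = 0.5 u^T A u has Hessian 0.5 (A + A^T)
A = np.array([[2.0, 1.0], [0.0, 3.0]])
f = lambda u: 0.5 * u @ A @ u
print(hessian_fd(f, np.array([1.0, -1.0])))  # approximately 0.5 * (A + A.T)
```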
Composition of functions
A composition is, for example, the loss function computed on the output of the network (softmax can be seen as a module inside the network or inside the loss function)
\[\mathcal{C} = \mathcal{L} \circ \mathcal{N}\]
In the general case, when \(\mathcal{L}\) has a multi-dimensional output,
\[\frac{ \partial \mathcal{C}_i }{ \partial w_j } = \sum_k \frac{ \partial \mathcal{L}_i }{ \partial u_k } \times \frac{ \partial \mathcal{N}_k }{ \partial w_j }\]
which is a simple matrix multiplication known as the chain rule:
\[J_{\mathcal{C}} = J_{\mathcal{L}} \times J_{\mathcal{N}}\]
And in the scalar case (when \(\mathcal{L}\) outputs a scalar), this can be rewritten with the vector notation:
\[j_{\mathcal{C}} = J_{\mathcal{N}}^T \times j_{\mathcal{L}}\]
The transpose appearing in this vector form is why the chain rule can sometimes be a bit confusing.
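A numerical illustration of both forms of the chain rule (the toy modules below are arbitrary choices made for this sketch):

```python
import numpy as np

# Toy network N: R^2 -> R^3, and scalar loss L(u) = sum(u^2)
N = lambda w: np.array([w[0] ** 2, w[0] * w[1], np.sin(w[1])])

w = np.array([1.0, 2.0])
u = N(w)

# Analytic Jacobians at w
J_N = np.array([[2 * w[0], 0.0],
                [w[1],     w[0]],
                [0.0,      np.cos(w[1])]])   # shape (3, 2)
J_L = 2 * u[None, :]                         # row matrix, shape (1, 3)

# Chain rule: matrix form and gradient (column-vector) form
J_C = J_L @ J_N            # shape (1, 2)
grad_C = J_N.T @ (2 * u)   # shape (2,), same numbers transposed

assert np.allclose(J_C.ravel(), grad_C)
```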
Let's go for the Hessian of a composition of functions, but considering the scalar case only (the multi-dimensional case is left as an exercise for the reader :) ). Let's keep in mind that the Jacobian of the loss function is in fact evaluated at the output of the network:
\[\frac{ \partial \mathcal{C} }{ \partial w_j } = \sum_k \left( \frac{ \partial \mathcal{L} }{ \partial u_k } \circ \mathcal{N} \right) \times \frac{ \partial \mathcal{N}_k }{ \partial w_j }\]
and differentiate once more (if you have a headache, it might not be an anomaly):
\[\frac{ \partial^2 \mathcal{C} }{ \partial w_i \partial w_j } = \sum_k \sum_l \frac{ \partial^2 \mathcal{L} }{ \partial u_k \partial u_l } \times \frac{ \partial \mathcal{N}_l}{\partial w_i} \times \frac{ \partial \mathcal{N}_k }{ \partial w_j } + \sum_k \frac{ \partial \mathcal{L} }{ \partial u_k } \times \frac{ \partial^2 \mathcal{N}_k }{ \partial w_i \partial w_j }\] \[h_{\mathcal{C}} = J_{\mathcal{N}}^T \times h_{\mathcal{L}} \times J_{\mathcal{N}} + \sum_k (J_{\mathcal{L}})_k \times h_{\mathcal{N}_k}\]
The first part \(G = J_{\mathcal{N}}^T \times h_{\mathcal{L}} \times J_{\mathcal{N}}\) is the Gauss-Newton matrix, capturing the second-order interactions that originate from the top part \(\mathcal{L}\). It is positive semi-definite whenever \(h_{\mathcal{L}}\) is, and it is used as an approximation of the Hessian in some second-order optimization algorithms.
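A hedged sketch of this decomposition on the same kind of toy composition (all values below are illustrative; in practice \(J_{\mathcal{N}}\) is never formed explicitly for a large network):

```python
import numpy as np

# Toy modules: N: R^2 -> R^3, L(u) = sum(u^2), so h_L = 2 I
N = lambda w: np.array([w[0] ** 2, w[0] * w[1], np.sin(w[1])])
w = np.array([1.0, 2.0])
u = N(w)

J_N = np.array([[2 * w[0], 0.0],
                [w[1],     w[0]],
                [0.0,      np.cos(w[1])]])
h_L = 2.0 * np.eye(3)

# Gauss-Newton matrix: first term of the composition Hessian
G = J_N.T @ h_L @ J_N

# Second term: sum_k (J_L)_k * Hessian of N_k with respect to w
J_L = 2 * u
H_N = np.zeros((3, 2, 2))
H_N[0] = np.array([[2.0, 0.0], [0.0, 0.0]])            # Hessian of w0^2
H_N[1] = np.array([[0.0, 1.0], [1.0, 0.0]])            # Hessian of w0*w1
H_N[2] = np.array([[0.0, 0.0], [0.0, -np.sin(w[1])]])  # Hessian of sin(w1)

h_C = G + np.tensordot(J_L, H_N, axes=1)  # full Hessian of L o N
print(np.linalg.eigvalsh(G))              # G is positive semi-definite here
```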
Matching loss function
Let’s consider the case where \(\mathcal{N}\) is the output non-linearity module, such as softmax.
It is easy to see that
\[\frac{ \partial \mathcal{N}_i }{ \partial u_j } = \mathcal{N}_i \times (1_{i == j} - \mathcal{N}_j)\]
Hence, for the loss function \(\mathcal{L} = Y \cdot \log \mathcal{N}\), where Y is the one-hot encoding of the correct label:
\[\frac{ \partial \mathcal{L} \circ \mathcal{N} }{ \partial \vec{u} } = Y - \mathcal{N}\]
We say the log-likelihood loss function matches the softmax output non-linearity since its Jacobian is an affine transformation of the output.
In the same way, the mean squared error loss matches a linear output module.
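A small numerical check of the matching property for softmax with the log-likelihood (the logits and label below are arbitrary):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

u = np.array([0.5, -1.0, 2.0])
Y = np.array([0.0, 0.0, 1.0])          # one-hot label
L = lambda u: Y @ np.log(softmax(u))   # log-likelihood of the correct class

# Finite-difference gradient of L o softmax with respect to the logits u
eps = 1e-6
grad_fd = np.array([(L(u + eps * e) - L(u)) / eps for e in np.eye(3)])

# Matching-loss property: the gradient is simply Y - softmax(u)
assert np.allclose(grad_fd, Y - softmax(u), atol=1e-4)
```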