Matrix Gradients of Scalar Functions
Understanding the building blocks of reverse-mode automatic differentiation.
Lawrence Murray on 7 November 2022
These notes were written for the implementation of reverse-mode automatic differentiation in Birch, specifically the NumBirch library that provides its numerics on both CPU and GPU. The motivation is to:
- consolidate some essential gradients in one place, especially those commonly required for probabilistic modeling (e.g. Cholesky factorizations, logarithms of determinants), and
- present those gradients in the form required for reverse-mode automatic differentiation.
While some gradients are simply stated and cited, others are derived from first principles using element-wise calculations as an aid to understanding.
- Inverse of symmetric positive definite matrix
- Solve with symmetric positive definite matrix
- Logarithm of the determinant
- Logarithm of the determinant of a symmetric positive definite matrix
- Cholesky factorization
In the simplest setting, we are interested in a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ that accepts a matrix argument and returns a scalar result. A typical setting is model training, where $f$ is the objective function (e.g. log-likelihood or mean squared error), $X$ are some parameters (e.g. weights and biases of a neural network), and we wish to compute $\partial f/\partial X$, being the gradient of $f$ with respect to $X$. Having the gradient, we then might apply a gradient-based update to the parameters, as when optimizing with stochastic gradient descent or sampling with Langevin Monte Carlo.
We will focus in particular on reverse-mode automatic differentiation. For illustration, assume that $f$ can be decomposed into a single chain of simpler functions $f_1, \ldots, f_n$:
$$y = f(X) = f_n(f_{n-1}(\cdots f_1(X) \cdots)).$$
We first perform a forward pass to compute the intermediate results, first computing
$$Y_1 = f_1(X),$$
then proceeding recursively:
$$Y_i = f_i(Y_{i-1}), \quad i = 2, \ldots, n,$$
with the final result $y = Y_n$. We then perform a backward pass to compute the gradient by applying the chain rule, first computing
$$\frac{\partial f}{\partial Y_{n-1}} = \frac{\partial f_n}{\partial Y_{n-1}},$$
then proceeding recursively:
$$\frac{\partial f}{\partial Y_{i-1}} = \frac{\partial f}{\partial Y_i} \frac{\partial Y_i}{\partial Y_{i-1}}, \quad i = n-1, \ldots, 2,$$
ending with the desired gradient $\partial f/\partial X$. The intermediate results computed during the forward pass often appear in the computations required during the backward pass. They can be memoized for reuse rather than computed again.
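As a concrete sketch of the forward and backward passes, the short chain below uses NumPy in place of the NumBirch routines; the chain itself, the objective $f(X) = \sum_{ij} (A X^\top)_{ij}^2$, and the matrix sizes are made up purely for illustration:

```python
import numpy as np

# Made-up example: f(X) = sum of squared elements of A @ X.T, decomposed
# into a chain of transpose, multiplication and scalar reduction.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((5, 4))

# Forward pass: compute and memoize the intermediate results.
Y1 = X.T              # f1: transpose, 4x5
Y2 = A @ Y1           # f2: multiplication, 3x5
y = np.sum(Y2**2)     # f3: scalar reduction

# Backward pass: each gradient has the same shape as its intermediate,
# and the memoized Y2 is reused rather than recomputed.
dY2 = 2.0 * Y2        # df/dY2, 3x5
dY1 = A.T @ dY2       # df/dY1 via the multiplication property, 4x5
dX = dY1.T            # df/dX via the transpose property, 5x4

# Check one entry against a central finite difference.
eps = 1e-6
E = np.zeros_like(X); E[2, 1] = eps
fd = (np.sum((A @ (X + E).T)**2) - np.sum((A @ (X - E).T)**2)) / (2 * eps)
err = abs(fd - dX[2, 1])
```

Note how each gradient in the backward pass has exactly the shape of the corresponding intermediate from the forward pass, which is the sizing convenience discussed next.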
A particular convenience of working in reverse mode for scalar functions is that each intermediate gradient has the same size (rows and columns) as the corresponding intermediate result, i.e. $\partial f/\partial Y_i$ has the same size as $Y_i$, and $\partial f/\partial X$ the same size as $X$.
The above is a simplification in that it assumes $f$ can be decomposed into a single chain. Generally, owing to the presence of intermediate functions with multiple arguments, it will decompose into a tree or graph (often called a compute graph in machine learning, or an expression template elsewhere). The principle is nonetheless the same: we apply the chain rule along branches of the graph.
We limit ourselves to reverse mode here, but it is not the only approach. Forward-mode automatic differentiation works in the opposite order, using a single forward pass to recursively compute the intermediate results $Y_i$ and their derivatives $\partial Y_i/\partial X$ simultaneously, but for scalar functions this foregoes the sizing convenience of reverse mode. Mixed mode denotes any other order, which can be designed to exploit specific structure in the compute graph.
To derive a gradient of $f$ with respect to $X$, we assume that the upstream gradient $\partial f/\partial Y$ is known for some $Y = g(X)$ that denotes a matrix operation of interest, e.g. transpose $Y = X^\top$ or inverse $Y = X^{-1}$. We then apply the chain rule element-wise to compute the partial derivative of $f$ with respect to the $(i,j)$th element of $X$, denoted $x_{ij}$:
$$\frac{\partial f}{\partial x_{ij}} = \sum_{k,l} \frac{\partial f}{\partial y_{kl}} \frac{\partial y_{kl}}{\partial x_{ij}}.$$
We then simplify the element-wise expression into a matrix expression to allow the partial derivatives with respect to all elements to be computed simultaneously—also more efficiently using high-performance linear algebra routines—and so obtain the gradient.
The transpose is straightforward: the gradient of $f$ with respect to $X$ is the transpose of the gradient of $f$ with respect to $Y$. But to demonstrate use of the above formulation, let $f$ be a scalar function and $\partial f/\partial Y$ be given, where $Y = X^\top$; we then have, element-wise:
$$\frac{\partial f}{\partial x_{ij}} = \sum_{k,l} \frac{\partial f}{\partial y_{kl}} \frac{\partial y_{kl}}{\partial x_{ij}} = \sum_{k,l} \frac{\partial f}{\partial y_{kl}} \delta_{kj} \delta_{li} = \frac{\partial f}{\partial y_{ji}},$$
where $\delta$ denotes the Kronecker delta (i.e. 1 when its two subscripts are equal and 0 otherwise). We recognize this as a simple transpose operation on all elements simultaneously:
$$\frac{\partial f}{\partial X} = \left(\frac{\partial f}{\partial Y}\right)^\top.$$
Let $f$ be a scalar function and $\partial f/\partial Y$ be given, where $Y = AB$. Element-wise, we have:
$$\frac{\partial f}{\partial a_{ij}} = \sum_{k,l} \frac{\partial f}{\partial y_{kl}} \frac{\partial y_{kl}}{\partial a_{ij}} = \sum_{k,l} \frac{\partial f}{\partial y_{kl}} \delta_{ki} b_{jl} = \sum_l \frac{\partial f}{\partial y_{il}} b_{jl}.$$
We observe from this that $\partial f/\partial a_{ij}$ is the dot product between the $i$th row of $\partial f/\partial Y$ and the $j$th row of $B$, and consolidate in matrix form as:
$$\frac{\partial f}{\partial A} = \frac{\partial f}{\partial Y} B^\top.$$
Next we wish to compute $\partial f/\partial B$. We can derive it using the same element-wise approach, or simply apply the transpose property above to obtain:
$$\frac{\partial f}{\partial B} = A^\top \frac{\partial f}{\partial Y}.$$
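Both multiplication gradients can be checked numerically; the sketch below uses NumPy and a made-up stand-in objective $f = \sum \sin(Y)$, whose upstream gradient is simply $\cos(Y)$:

```python
import numpy as np

# Check dA = dY @ B.T and dB = A.T @ dY for Y = A @ B, against central
# finite differences; sizes and the objective are illustrative only.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
Y = A @ B

dY = np.cos(Y)       # upstream gradient df/dY for f = sum(sin(Y))
dA = dY @ B.T        # df/dA = (df/dY) B^T
dB = A.T @ dY        # df/dB = A^T (df/dY)

eps = 1e-6
E = np.zeros_like(A); E[1, 2] = eps
fdA = (np.sum(np.sin((A + E) @ B)) - np.sum(np.sin((A - E) @ B))) / (2 * eps)
E = np.zeros_like(B); E[2, 3] = eps
fdB = (np.sum(np.sin(A @ (B + E))) - np.sum(np.sin(A @ (B - E)))) / (2 * eps)
errA = abs(fdA - dA[1, 2])
errB = abs(fdB - dB[2, 3])
```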
Let $f$ be a scalar function and $\partial f/\partial Y$ be given, where $Y = X^{-1}$. We have (Petersen & Pedersen, 2016):
$$\frac{\partial f}{\partial X} = -X^{-\top} \frac{\partial f}{\partial Y} X^{-\top},$$
using the shorthand notation $X^{-\top} = (X^{-1})^\top = (X^\top)^{-1}$. We can derive this element-wise but it requires a non-obvious property given in (Petersen & Pedersen, 2016):
$$\frac{\partial y_{kl}}{\partial x_{ij}} = -y_{ki} y_{jl}.$$
Using this, the derivation proceeds:
$$\frac{\partial f}{\partial x_{ij}} = \sum_{k,l} \frac{\partial f}{\partial y_{kl}} \frac{\partial y_{kl}}{\partial x_{ij}} = -\sum_{k,l} \frac{\partial f}{\partial y_{kl}} y_{ki} y_{jl} = -\sum_{k,l} y_{ki} \frac{\partial f}{\partial y_{kl}} y_{jl},$$
from which we recognize the matrix form above.
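The inverse gradient can likewise be verified numerically; here NumPy stands in for the NumBirch routines, with a well-conditioned random matrix and the made-up objective $f = \sum Y$, so that the upstream gradient is a matrix of ones:

```python
import numpy as np

# Check dX = -X^{-T} dY X^{-T} for Y = inv(X) against a central finite
# difference; the shift by 4I just keeps X well conditioned.
rng = np.random.default_rng(2)
X = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)
Y = np.linalg.inv(X)

dY = np.ones_like(Y)            # df/dY for f = sum(Y)
XinvT = Y.T                     # X^{-T}, reusing the memoized Y
dX = -XinvT @ dY @ XinvT

eps = 1e-6
E = np.zeros_like(X); E[0, 3] = eps
fd = (np.sum(np.linalg.inv(X + E)) - np.sum(np.linalg.inv(X - E))) / (2 * eps)
err = abs(fd - dX[0, 3])
```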
Inverse of symmetric positive definite matrix
Let $f$ be a scalar function and $\partial f/\partial Y$ be given, where $Y = S^{-1}$ and $S$ is a symmetric positive definite matrix with Cholesky decomposition $S = LL^\top$, and so inverse $S^{-1} = L^{-\top} L^{-1}$, where $L$ is a lower-triangular matrix. We use the transpose, multiplication and inverse properties to obtain:
$$\frac{\partial f}{\partial S} = -S^{-\top} \frac{\partial f}{\partial Y} S^{-\top} = -L^{-\top} L^{-1} \frac{\partial f}{\partial Y} L^{-\top} L^{-1}.$$
When computing, matrix inversions are replaced with the solutions of systems of equations wherever possible, for numerical stability. We can think of $X^{-1} B$ as $\mathrm{solve}(X, B)$, i.e. the solution $Y$ of $XY = B$.
Let $f$ be a scalar function and $\partial f/\partial Y$ be given, where $Y = A^{-1}B$, i.e. the solution of $AY = B$. We use the multiplication and inverse properties above to obtain:
$$\begin{aligned}
\frac{\partial f}{\partial B} &= A^{-\top} \frac{\partial f}{\partial Y}, \\
\frac{\partial f}{\partial A} &= -A^{-\top} \frac{\partial f}{\partial Y} \left(A^{-1} B\right)^\top \\
&= -\frac{\partial f}{\partial B} \left(A^{-1} B\right)^\top \\
&= -\frac{\partial f}{\partial B} Y^\top.
\end{aligned}$$
In the second last line we substitute in $\partial f/\partial B$ from the first line, and in the last line $Y$ from the forward pass, to avoid repeating expensive computations.
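The solve gradients, including the reuse of $\partial f/\partial B$ and the memoized $Y$, can be sketched and checked numerically with NumPy; the sizes and the stand-in objective $f = \sum Y$ are again made up:

```python
import numpy as np

# Check dB = solve(A.T, dY) and dA = -dB @ Y.T for Y = solve(A, B),
# against central finite differences.
rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)  # well conditioned
B = rng.standard_normal((4, 3))
Y = np.linalg.solve(A, B)

dY = np.ones_like(Y)             # df/dY for f = sum(Y)
dB = np.linalg.solve(A.T, dY)    # a solve, rather than forming inv(A)
dA = -dB @ Y.T                   # reuses dB and the memoized Y

eps = 1e-6
E = np.zeros_like(A); E[1, 2] = eps
fdA = (np.sum(np.linalg.solve(A + E, B))
       - np.sum(np.linalg.solve(A - E, B))) / (2 * eps)
E = np.zeros_like(B); E[3, 0] = eps
fdB = (np.sum(np.linalg.solve(A, B + E))
       - np.sum(np.linalg.solve(A, B - E))) / (2 * eps)
errA = abs(fdA - dA[1, 2])
errB = abs(fdB - dB[3, 0])
```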
Solve with symmetric positive definite matrix
Let $f$ be a scalar function and $\partial f/\partial Y$ be given, where $Y = S^{-1}B$ and $S$ is a symmetric positive definite matrix with Cholesky decomposition $S = LL^\top$, and $L$ a lower-triangular matrix. Using the solve and multiplication properties above we obtain:
$$\begin{aligned}
\frac{\partial f}{\partial B} &= S^{-1} \frac{\partial f}{\partial Y} = L^{-\top} L^{-1} \frac{\partial f}{\partial Y}, \\
\frac{\partial f}{\partial S} &= -\frac{\partial f}{\partial B} Y^\top, \\
\frac{\partial f}{\partial L} &= \mathrm{tril}\!\left(\left(\frac{\partial f}{\partial S} + \frac{\partial f}{\partial S}^\top\right) L\right) = -\mathrm{tril}\!\left(\left(\frac{\partial f}{\partial B} Y^\top + Y \frac{\partial f}{\partial B}^\top\right) L\right),
\end{aligned}$$
where $\mathrm{tril}$ is a function that returns a matrix with its lower triangle set to that of its argument, and its strictly upper triangle set to zero. This form gives the gradient in the space of lower-triangular matrices, rather than the space of arbitrary dense matrices, to ensure that a gradient update maintains $L$ as lower triangular.
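The gradient with respect to the Cholesky factor can be checked by perturbing a lower-triangular entry of $L$ directly; NumPy stands in for the NumBirch routines, with a random SPD matrix and the made-up objective $f = \sum Y$:

```python
import numpy as np

# Check dL = tril((dS + dS.T) @ L) for Y = solve(L @ L.T, B), against a
# central finite difference on a strictly lower entry of L.
rng = np.random.default_rng(4)
G = rng.standard_normal((4, 4))
S = G @ G.T + 4.0 * np.eye(4)         # symmetric positive definite
L = np.linalg.cholesky(S)
B = rng.standard_normal((4, 3))
Y = np.linalg.solve(L @ L.T, B)

dY = np.ones_like(Y)                  # df/dY for f = sum(Y)
dB = np.linalg.solve(L @ L.T, dY)     # S^{-1} df/dY, as S is symmetric
dS = -dB @ Y.T                        # dense gradient with respect to S
dL = np.tril((dS + dS.T) @ L)         # gradient in lower-triangular space

def f(L):
    return np.sum(np.linalg.solve(L @ L.T, B))

eps = 1e-6
E = np.zeros_like(L); E[2, 1] = eps   # a strictly lower entry of L
fd = (f(L + E) - f(L - E)) / (2 * eps)
err = abs(fd - dL[2, 1])
```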
Let $f$ be a scalar function and $\partial f/\partial y$ be given, where $y = \operatorname{tr}(X)$. We have:
$$\frac{\partial f}{\partial x_{ij}} = \frac{\partial f}{\partial y} \frac{\partial y}{\partial x_{ij}} = \frac{\partial f}{\partial y} \delta_{ij}.$$
Or in matrix form:
$$\frac{\partial f}{\partial X} = \frac{\partial f}{\partial y} I.$$
Logarithm of the determinant
Let $f$ be a scalar function and $\partial f/\partial y$ be given, where $y = |X|$, the determinant of $X$. We have (Petersen & Pedersen, 2016):
$$\frac{\partial f}{\partial X} = \frac{\partial f}{\partial y} |X| X^{-\top}.$$
By extension, and assuming that the determinant is positive so that we can take the logarithm, let $\partial f/\partial y$ be given where $y = \log |X|$:
$$\frac{\partial f}{\partial X} = \frac{\partial f}{\partial y} X^{-\top}.$$
These seem non-trivial to derive element-wise.
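The log-determinant gradient can be verified numerically; here NumPy's `slogdet` is used as a stand-in, with $f = y$ itself so that $\partial f/\partial y = 1$:

```python
import numpy as np

# Check dX = X^{-T} for y = log|X|, against a central finite difference;
# the shift by 4I keeps X well conditioned with positive determinant.
rng = np.random.default_rng(5)
X = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)
y = np.linalg.slogdet(X)[1]

dX = np.linalg.inv(X).T    # dy/dX = X^{-T}

eps = 1e-6
E = np.zeros_like(X); E[1, 3] = eps
fd = (np.linalg.slogdet(X + E)[1] - np.linalg.slogdet(X - E)[1]) / (2 * eps)
err = abs(fd - dX[1, 3])
```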
Logarithm of the determinant of a symmetric positive definite matrix
Let $S$ be a symmetric positive definite matrix with Cholesky decomposition $S = LL^\top$, with $L$ a lower-triangular matrix consisting of positive elements along its main diagonal. The determinant of $L$ (or indeed any triangular matrix) is the product of the elements along its main diagonal, and so the logarithm of the determinant is the sum of the logarithms of the elements along its main diagonal:
$$\log |S| = \log\left(|L| |L^\top|\right) = 2 \sum_i \log l_{ii}.$$
Let $f$ be a scalar function and $\partial f/\partial y$ be given, where $y = \log |S| = 2 \sum_i \log l_{ii}$. We then have:
$$\frac{\partial f}{\partial l_{ij}} = \frac{\partial f}{\partial y} \frac{2 \delta_{ij}}{l_{ii}},$$
which can be written as:
$$\frac{\partial f}{\partial L} = 2 \frac{\partial f}{\partial y} \, \mathrm{dg}(L)^{-1},$$
where $\mathrm{dg}$ is a function that returns a diagonal matrix with its main diagonal set to that of the argument, and all off-diagonal elements set to zero.
As $S$ is symmetric, we have $S^{-\top} = S^{-1} = L^{-\top} L^{-1}$ and so:
$$\frac{\partial f}{\partial S} = \frac{\partial f}{\partial y} S^{-1} = \frac{\partial f}{\partial y} L^{-\top} L^{-1}.$$
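A quick numerical check of the two results above, with NumPy standing in for the NumBirch routines and a random SPD matrix that is illustrative only:

```python
import numpy as np

# Check that y = 2 * sum(log(diag(L))) matches slogdet, and check the
# gradient with respect to L by a finite difference on a diagonal entry.
rng = np.random.default_rng(6)
G = rng.standard_normal((4, 4))
S = G @ G.T + 4.0 * np.eye(4)             # symmetric positive definite
L = np.linalg.cholesky(S)

y = 2.0 * np.sum(np.log(np.diag(L)))      # log determinant of S
dL = 2.0 * np.diag(1.0 / np.diag(L))      # dy/dL = 2 dg(L)^{-1}
dS = np.linalg.solve(S, np.eye(4))        # dy/dS = S^{-1}, via a solve

y_ref = np.linalg.slogdet(S)[1]
eps = 1e-6
E = np.zeros_like(L); E[1, 1] = eps
fd = (2.0 * np.sum(np.log(np.diag(L + E)))
      - 2.0 * np.sum(np.log(np.diag(L - E)))) / (2.0 * eps)
err = abs(fd - dL[1, 1])
```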
Cholesky factorization
Let $f$ be a scalar function and $\partial f/\partial L$ be given, where $L = \mathrm{chol}(S)$, the Cholesky factor of a symmetric positive definite matrix $S = LL^\top$. Define (Murray, 2016):
$$P = \Phi\!\left(L^\top \frac{\partial f}{\partial L}\right),$$
where $\Phi$ is a function that returns a lower-triangular matrix with the strictly lower triangle set to that of its argument, the main diagonal set to half that of its argument, and the upper triangle set to zero. We then have (Murray, 2016):
$$\frac{\partial f}{\partial S} = \frac{1}{2} L^{-\top} \left(P + P^\top\right) L^{-1}.$$
This form gives the gradient in the space of symmetric matrices, rather than the space of arbitrary dense matrices. A gradient update on will only affect the lower triangle. This is suitable when using linear algebra routines for symmetric matrices, which typically reference only one triangle (in this case, the lower) and assume that the other matches by symmetry. If a dense matrix must be formed, however, the strictly lower triangle should be transposed and copied into the strictly upper.
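The Cholesky gradient can be checked by differencing along a random symmetric direction, which respects the symmetry constraint on $S$; NumPy stands in for the NumBirch routines, and the linear objective $f = \sum_{ij} c_{ij} l_{ij}$ with fixed lower-triangular coefficients $C$ is made up so that $\partial f/\partial L = C$ exactly:

```python
import numpy as np

# Check dS = 0.5 * L^{-T} (P + P^T) L^{-1} with P = Phi(L.T @ dL), for
# L = chol(S), against a finite difference along a symmetric direction V.
rng = np.random.default_rng(7)
G = rng.standard_normal((4, 4))
S = G @ G.T + 4.0 * np.eye(4)             # symmetric positive definite
L = np.linalg.cholesky(S)
C = np.tril(rng.standard_normal((4, 4)))  # df/dL for f = sum(C * L)

def phi(A):
    # lower triangle of A, with the main diagonal halved
    return np.tril(A) - 0.5 * np.diag(np.diag(A))

P = phi(L.T @ C)
Linv = np.linalg.inv(L)
dS = 0.5 * Linv.T @ (P + P.T) @ Linv

V = rng.standard_normal((4, 4)); V = V + V.T   # symmetric direction
eps = 1e-6
f = lambda S: np.sum(C * np.linalg.cholesky(S))
fd = (f(S + eps * V) - f(S - eps * V)) / (2 * eps)
err = abs(fd - np.sum(dS * V))
```

The computed gradient is symmetric, so only its lower triangle need be stored, consistent with the remarks above.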