blog

Gradients of Softmax and Logsumexp

Essential functions for categorical distributions and attention mechanisms in machine learning

4 min read / 4 May 24

Two mathematical functions that commonly arise in machine learning models are softmax and logsumexp. They occur when dealing with categorical and multinomial probability distributions, as well as in attention mechanisms used in neural network architectures (such as the transformer). The two functions are closely related in that they both involve sums of exponentials, with similar numerical considerations. In fact they are so closely related that deriving the gradient of softmax involves (or at least can involve) logsumexp, while the gradient of logsumexp is softmax. This note serves as a quick explanation of the functions as well as a derivation of their gradients for the purposes of implementation.

Logsumexp and Softmax
Gradient of Logsumexp
Gradient of Softmax

Logsumexp and Softmax

The two functions of interest are logsumexp and softmax, defined as:

\begin{aligned} \mathrm{logsumexp}(\mathbf{x}) & :=\log\left(\sum_{i}\exp(x_{i})\right)\\ \mathrm{softmax}(\mathbf{x}) & :=\frac{\exp(\mathbf{x})}{\sum_{i}\exp(x_{i})}. \end{aligned}

The first resolves to a scalar, whereas the second resolves to a vector of the same length as $\mathbf{x}$ (by $\exp(\mathbf{x})$ we mean the exponential function applied to each element of $\mathbf{x}$ ). The latter can be written in terms of the former as:

\begin{aligned} \mathrm{softmax}(\mathbf{x}) & =\frac{\exp(\mathbf{x})}{\sum_{i}\exp(x_{i})}.\\ & =\exp\left(\log\left(\frac{\exp(\mathbf{x})}{\sum_{i}\exp(x_{i})}\right)\right)\\ & =\exp\left(\mathbf{x}-\log\left(\sum_{i}\exp(x_{i})\right)\right)\\ & =\exp\left(\mathbf{x}-\mathrm{logsumexp}\left(\mathbf{x}\right)\right). \end{aligned}

The two functions arise in several contexts, including categorical and multinomial distributions (where $\mathbf{x}$ is a vector of logarithms of unnormalized probabilities) and attention mechanisms in neural network architectures (where $\mathbf{x}$ is a vector of unnormalized attention weights).

When computing the softmax and logsumexp functions, implementations are usually adapted from the above definitions, rather than taking them exactly as written, for reasons of numerical stability. The exponential function is very sensitive to the scale of $\mathbf{x}$ , and in floating point arithmetic has a propensity to underflow to zero for large negative arguments, and overflow to infinity for large positive arguments. The effect is much worse in single precision than double precision, and much worse again in half precision—critical as these lower precisions are commonly used in machine learning to reduce compute time and memory use, especially on GPUs. To mitigate the issue, observe that $\mathrm{softmax}(\mathbf{x} - c)=\mathrm{softmax}(\mathbf{x})$ for any scalar $c$ (allowing the addition to broadcast the scalar). This can be used to translate $\mathbf{x}$ into a better range for numerical stability of the exponential function. A common choice is to set $c$ to the maximum element of $\mathbf{x}$ . In practice the functions may then be computed as:

\begin{aligned} \mathrm{logsumexp}(\mathbf{x}) & =\log\left(\sum_{i}\exp(x_{i}-c)\right)+c\\ \mathrm{softmax}(\mathbf{x}) & :=\frac{\exp(\mathbf{x}-c)}{\sum_{i}\exp(x_{i}-c)}. \end{aligned}

For the purposes of gradients, however, we can go back to the original definitions. To simplify notation in what follows, let $\mathbf{s}=\mathrm{softmax}(\mathbf{x})$ and $l=\mathrm{logsumexp}(\mathbf{x})$ .

Gradient of Logsumexp

Element-wise, using the usual rule for derivatives of the logarithmic function, we have:

\begin{aligned} \frac{\partial l}{\partial x_{i}} & =\frac{\partial}{\partial x_{i}}\log\left(\sum_{j}\exp(x_{j})\right)\\ & =\frac{\exp(x_{i})}{\sum_{j}\exp(x_{j})}\\ & =s_{i}. \end{aligned}

Written in vector form, this is:

\frac{\partial l}{\partial\mathbf{x}}=\mathbf{s}.

This is, the gradient of $\mathrm{logsumexp}(\mathbf{x})$ is just $\mathrm{softmax}(\mathbf{x})$ .

Gradient of Softmax

Element-wise, using the property introduced above that $\mathrm{softmax}(\mathbf{x})=\exp\left(\mathbf{x}-\mathrm{logsumexp}\left(\mathbf{x}\right)\right)$ (and so $\mathbf{s}=\exp(\mathbf{x}-l)$ , and so $s_{i}=\exp(x_{i}-l)$ ), and the previously-derived $\partial l/\partial x_{i}=s_{i}$ , we have:

\begin{aligned} \frac{\partial s_{i}}{\partial x_{j}} & =\frac{\partial}{\partial x_{j}}\exp\left(x_{i}-l\right)\\ & =\begin{cases} \left(1-s_{j}\right)\exp\left(x_{i}-l\right) & i=j\\ -s_{j}\exp\left(x_{i}-l\right) & i\neq j \end{cases}\\ & =\begin{cases} (1-s_{j})s_{i} & i=j\\ -s_{j}s_{i} & i\neq j. \end{cases} \end{aligned}

We can write this as a matrix of all-pairs partial derivatives (the Jacobian) as:

\mathbf{J}:=\mathbf{I}-\mathbf{s}\mathbf{s}^{\top}.

According to the application, we may never need to compute this matrix explicitly. A typical use case is computing gradients of a scalar function with respect to some parameters of that function. In machine learning, during the training of a model, that scalar function is a loss function, and reverse-mode automatic differentiation is used to backpropagate losses onto parameters for the next gradient update. Let $f$ denote such a scalar function and let $\partial f/\partial\mathbf{s}$ be given (i.e. working backward from a loss given by $f$ , reverse mode automatic differentiation has arrived at the output of the softmax, for which the gradient is $\partial f/\partial\mathbf{s}$ ); we can write:

\frac{\partial f}{\partial x_{i}}=\sum_{j}\frac{\partial f}{\partial s_{j}}\cdot\frac{\partial s_{j}}{\partial x_{i}},

which in matrix form is:

\begin{aligned} \frac{\partial f}{\partial\mathbf{x}} & =\mathbf{J}\frac{\partial f}{\partial\mathbf{s}}\\ & =(\mathbf{I}-\mathbf{s}\mathbf{s}^{\top})\frac{\partial f}{\partial\mathbf{s}}\\ & =\frac{\partial f}{\partial\mathbf{s}}-\mathbf{s}\left(\mathbf{s}^{\top}\frac{\partial f}{\partial\mathbf{s}}\right). \end{aligned}

The last line is a more efficient way of computing the gradient than the first line, in both time and memory, as it avoids computing the whole Jacobian matrix.