Last updated: Dec 2, 2024

Through the course of my daily work and studies, I occasionally find myself writing small proofs and derivations. This post is an attempt to continuously catalogue interesting ones for future reference.

  1. Evidence Lower Bound (ELBO)
  2. Normal Equations
  3. Kullback-Leibler Divergence (Two Proofs)
  4. Show \(\log \det(\mathbf{A}) = \mathrm{Tr}(\log \mathbf{A})\)
  5. Posterior motivation for \(\sigma(\cdot)\)



1. Evidence Lower Bound (ELBO)

Suppose we have a joint distribution \(p(\mathbf{x}, \mathbf{z})\) where \(\mathbf{z}\) are latent variables, and let \(q(\mathbf{z}|\mathbf{x})\) be any variational distribution over them. Then the Evidence Lower Bound (ELBO) on the log-evidence \(\log p(\mathbf{x})\) can be derived as:

\[\begin{align} \log p(\mathbf{x}) &= \log \int p(\mathbf{z}, \mathbf{x}) d\mathbf{z} && \text{definition of marginalization} \\ &= \log \int q(\mathbf{z}|\mathbf{x}) \frac{p(\mathbf{z}, \mathbf{x})}{q(\mathbf{z}|\mathbf{x})} d\mathbf{z} && \text{identity trick} \ q(\mathbf{z}|\mathbf{x})/q(\mathbf{z}|\mathbf{x}) = 1 \\ &= \log \mathbb{E}_{q(\mathbf{z}| \mathbf{x})} \left[ \frac{p(\mathbf{z}, \mathbf{x})}{q(\mathbf{z}|\mathbf{x})} \right] && \text{definition of expectation} \\ &\ge \mathbb{E}_{q(\mathbf{z}| \mathbf{x})} \left[ \log \frac{p(\mathbf{z}, \mathbf{x})}{q(\mathbf{z}|\mathbf{x})} \right] && \text{Jensen's inequality} \\ &= \mathcal{L} \end{align}\]
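
As a quick numerical sanity check, here is a minimal Monte Carlo sketch of the bound. It assumes a toy model \(z \sim \mathcal{N}(0, 1)\), \(x \mid z \sim \mathcal{N}(z, 1)\) (so the exact marginal is \(p(x) = \mathcal{N}(0, 2)\)) together with a deliberately imperfect Gaussian \(q(z|x)\); the model, the variational parameters, and helper names like `log_normal` are choices made purely for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumed for illustration): z ~ N(0, 1), x | z ~ N(z, 1),
# so the exact marginal is p(x) = N(0, 2).
def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

x = 1.3                                   # a single observation
log_px = log_normal(x, 0.0, 2.0)          # exact log-evidence

# Deliberately imperfect variational posterior q(z | x) = N(mu, s2).
mu, s2 = 0.4, 1.5
z = rng.normal(mu, np.sqrt(s2), size=100_000)

# Monte Carlo estimate of the ELBO: E_q[ log p(z, x) - log q(z | x) ].
log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
log_q = log_normal(z, mu, s2)
elbo = np.mean(log_joint - log_q)

print(f"log p(x) = {log_px:.4f}, ELBO = {elbo:.4f}")  # ELBO <= log p(x)
```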


2. Normal Equations

Suppose we are given a design matrix \(\mathbf{X} \in \mathbb{R}^{N \times K}\) and a target vector \(\mathbf{y} \in \mathbb{R}^N\). We want to find the parameter vector \(\mathbf{w} \in \mathbb{R}^K\) that minimizes the residual sum of squares, i.e.

\[\begin{align} \mathop{\arg \min}\limits_\mathbf{w} \ \mathrm{RSS}(\mathbf{w}) &= \mathop{\arg \min}\limits_\mathbf{w} \ \sum_{i=1}^N (y_i - \mathbf{x}_i^T \mathbf{w})^2 \\ &= \mathop{\arg \min}\limits_\mathbf{w} \ (\mathbf{y} - \mathbf{Xw})^T(\mathbf{y} - \mathbf{Xw}) \end{align}\]

The solution to the (least squares) problem can be obtained using the normal equations. We can derive the normal equations from first principles as follows. First, let’s expand the \(\mathrm{RSS}\) definition so it’s easier to work with.

\[\begin{align} \mathrm{RSS}(\mathbf{w}) &= (\mathbf{y} - \mathbf{Xw})^T(\mathbf{y} - \mathbf{Xw}) \\ &= (\mathbf{y}^T - \mathbf{w}^T\mathbf{X}^T)(\mathbf{y} - \mathbf{Xw}) && \text{definition of transpose} \\ &= \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{Xw} - \mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{Xw} \\ &= \mathbf{y}^T\mathbf{y} - 2\mathbf{y}^T\mathbf{Xw} + \mathbf{w}^T\mathbf{X}^T\mathbf{Xw} \\ \end{align}\]

The last step follows because \(\mathbf{w}^T\mathbf{X}^T\mathbf{y}\) is a scalar and so it is equal to its transpose \(\mathbf{y}^T\mathbf{Xw}\).

Now, to minimize \(\mathrm{RSS}(\mathbf{w})\), we take its gradient w.r.t. \(\mathbf{w}\), set it to \(0\), and solve for \(\mathbf{w}\). The only identities we need are \(\nabla_\mathbf{w}(\mathbf{a}^T\mathbf{w}) = \mathbf{a}\) and \(\nabla_\mathbf{w}(\mathbf{w}^T\mathbf{A}\mathbf{w}) = 2\mathbf{A}\mathbf{w}\) for symmetric \(\mathbf{A}\).

\[\begin{align} \nabla_\mathbf{w} \mathrm{RSS}(\mathbf{w}) &= 0 \\ \nabla_\mathbf{w} ( \mathbf{y}^T\mathbf{y} - 2\mathbf{y}^T\mathbf{Xw} + \mathbf{w}^T\mathbf{X}^T\mathbf{Xw} ) &= 0 \\ 0 - 2 \mathbf{X}^T\mathbf{y} + 2 \mathbf{X}^T\mathbf{X}\mathbf{w} &=0 \\ \mathbf{X}^T\mathbf{X}\mathbf{w} &= \mathbf{X}^T\mathbf{y} \\ \mathbf{w} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \end{align}\]

We have arrived at exactly the normal equations, where \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) is known as the (Moore-Penrose) pseudo-inverse.
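
As a sanity check, the closed form can be verified numerically against a generic least-squares solver. The sketch below assumes a small synthetic regression problem; the shapes, noise level, and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem (assumed shapes: N = 100 samples, K = 3 features).
N, K = 100, 3
X = rng.normal(size=(N, K))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Normal equations: w = (X^T X)^{-1} X^T y.
# Solving the linear system is preferable to forming the explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_hat, w_lstsq))  # True
```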


3. Kullback-Leibler Divergence (Two Proofs)

We show two different ways to prove that \(D_\mathrm{KL}(p || q) \ge 0\) given distributions \(p(x), q(x)\).

Jensen’s Inequality

The first proof makes use of Jensen’s inequality and the fact that \(-\log(x)\) is a convex function. Jensen’s inequality states that \(\mathbb{E}[f(x)] \ge f(\mathbb{E}[x])\) for any convex function \(f(x)\). We can use this to show:

\[\begin{align} D_\mathrm{KL}(p||q) &= \mathbb{E}_{p(x)} \left[ -\log \left( \frac{q(x)}{p(x)} \right) \right] \\ &\ge -\log \left( \mathbb{E}_{p(x)} \left[\frac{q(x)}{p(x)} \right] \right) && \text{Jensen's inequality} \\ &= -\log \left( \int p(x) \frac{q(x)}{p(x)} dx \right) && \text{definition of expectation} \\ &= -\log \left( \int q(x)dx \right) \\ &= -\log(1) \\ &= 0 \end{align}\]
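
For intuition, here is a tiny numerical check of the non-negativity claim. It assumes two arbitrary discrete distributions (the proofs above are written for densities, but the discrete analogue is identical); the seed and support size are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary discrete distributions over 10 outcomes.
p = rng.random(10); p /= p.sum()
q = rng.random(10); q /= q.sum()

kl = np.sum(p * np.log(p / q))   # D_KL(p || q)
print(kl, kl >= 0)               # non-negative, and 0 only when p == q
```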

Natural Logarithm Bounds

The second proof makes use of a standard bound on the natural logarithm. More specifically, we use the fact that \(-\log(x) \ge 1 - x\) for all \(x > 0\) to show:

\[\begin{align} D_\mathrm{KL}(p||q) &= \mathbb{E}_{p(x)} \left[ -\log \left( \frac{q(x)}{p(x)} \right) \right] \\ &\ge \mathbb{E}_{p(x)} \left[ 1 -\frac{q(x)}{p(x)} \right] && \text{natural logarithm bound} \\ &=\int p(x) \left( 1 -\frac{q(x)}{p(x)} \right) dx && \text{definition of expectation} \\ &=\int \left( p(x) - q(x) \right) dx \\ &= 1 - 1 \\ &= 0 \end{align}\]
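
The key inequality \(-\log(x) \ge 1 - x\) is easy to convince yourself of numerically; the sketch below simply evaluates both sides on a grid of positive values (the grid itself is an arbitrary choice).

```python
import numpy as np

# Check -log(x) >= 1 - x on a grid of positive x (equality holds only at x = 1).
x = np.linspace(1e-6, 10.0, 1_000)
print(np.all(-np.log(x) >= 1.0 - x))  # True
```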


4. \(\log \det(\mathbf{A}) = \mathrm{Tr}(\log \mathbf{A})\)

We show that \(\log \det(\mathbf{A}) = \mathrm{Tr}(\log \mathbf{A})\) for diagonalizable matrices. The identity holds more generally, whenever the matrix logarithm \(\log \mathbf{A}\) is defined, but we do not prove that here.

Suppose that \(\mathbf{A}\) is a diagonalizable matrix with eigenvalues \(\lambda_i\). Then,

\[\begin{align} \log \det(\mathbf{A}) &= \log \left( \prod_i \lambda_i \right) \\ &= \sum_i \log \lambda_i \\ &= \mathrm{Tr}(\log \mathbf{A}) \end{align}\]

The first line makes use of the identity \(\det(\mathbf{A}) = \prod_i \lambda_i\). The third line makes use of the fact that if \(\mathbf{A} = \mathbf{P}^{-1}\boldsymbol{\Lambda}\mathbf{P}\), then \(\log \mathbf{A} = \mathbf{P}^{-1}(\log \boldsymbol{\Lambda})\mathbf{P}\), together with the invariance of the trace under similarity transformations, so that \(\mathrm{Tr}(\log \mathbf{A}) = \mathrm{Tr}(\log \boldsymbol{\Lambda}) = \sum_i \log \lambda_i\).
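
The identity is also easy to verify numerically, for example with `scipy.linalg.logm`. The sketch below assumes a random symmetric positive-definite matrix so that both sides are well defined; its size and construction are arbitrary.

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(0)

# Random symmetric positive-definite matrix (assumed 4x4 for illustration),
# so that both log(det(A)) and the matrix logarithm are well defined.
B = rng.normal(size=(4, 4))
A = B @ B.T + 4 * np.eye(4)

lhs = np.log(np.linalg.det(A))      # log det(A)
rhs = np.trace(logm(A)).real        # Tr(log A)
print(np.isclose(lhs, rhs))         # True
```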


5. Posterior motivation for \(\sigma(\cdot)\)

Using \(\sigma(\cdot)\) as the output function for classification, where \(\sigma\) is either the sigmoid or softmax function, can be motivated from a number of angles. One interesting motivation from Bishop uses the class posterior probability as follows.

Binary classification

Suppose we have data \(\mathbf{x}\) and labels \(\mathcal{C}\) sampled from \(p(\mathbf{x}, \mathcal{C})\), where \(\mathcal{C} \in \{0, 1\}\). Then the posterior probability that \(\mathcal{C}=1\) can be written as

\[\begin{align} p(\mathcal{C}=1|\mathbf{x}) &= \frac{p(\mathbf{x}|\mathcal{C}=1)p(\mathcal{C}=1)}{p(\mathbf{x}|\mathcal{C}=0)p(\mathcal{C}=0) + p(\mathbf{x}|\mathcal{C}=1)p(\mathcal{C}=1)} \\ &=\frac{1}{\frac{p(\mathbf{x}|\mathcal{C}=0)p(\mathcal{C}=0) + p(\mathbf{x}|\mathcal{C}=1)p(\mathcal{C}=1)}{p(\mathbf{x}|\mathcal{C}=1)p(\mathcal{C}=1)}} && \text{reciprocal rule} \\ &=\frac{1}{1+\frac{p(\mathbf{x}|\mathcal{C}=0)p(\mathcal{C}=0)}{p(\mathbf{x}|\mathcal{C}=1)p(\mathcal{C}=1)}} \\ &=\frac{1}{1+\exp(-a)} \\ &= \sigma(a) \end{align}\]

where we have defined \(a=\ln\frac{p(\mathbf{x}|\mathcal{C}=1)p(\mathcal{C}=1)}{p(\mathbf{x}|\mathcal{C}=0)p(\mathcal{C}=0)}\). Therefore, given some function that outputs a logit \(h = f(\mathbf{x})\), the value \(\sigma(h)\) directly corresponds to the posterior probability of the positive class \(\mathcal{C}=1\).
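
As a concrete check, the sketch below assumes toy Gaussian class-conditionals and priors (all values chosen arbitrarily for illustration) and confirms that \(\sigma(a)\), with \(a\) the log-odds defined above, matches the posterior computed directly from Bayes' rule.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy class-conditionals (assumed): p(x|C=0) = N(-1, 1), p(x|C=1) = N(2, 1),
# with priors p(C=0) = 0.7 and p(C=1) = 0.3.
def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

x = 0.5
lik0, lik1 = normal_pdf(x, -1.0, 1.0), normal_pdf(x, 2.0, 1.0)
prior0, prior1 = 0.7, 0.3

# Posterior directly from Bayes' rule.
post1 = lik1 * prior1 / (lik0 * prior0 + lik1 * prior1)

# Posterior via the log-odds a and the sigmoid.
a = np.log(lik1 * prior1) - np.log(lik0 * prior0)
print(np.isclose(post1, sigmoid(a)))  # True
```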

Multiclass classification

The same motivation extends to the case when we have \(K\) classes. The posterior probability that \(\mathcal{C}=k\) can be written as

\[\begin{align} p(\mathcal{C}=k|\mathbf{x}) &= \frac{p(\mathbf{x}|\mathcal{C}=k)p(\mathcal{C}=k)}{\sum_j p(\mathbf{x}|\mathcal{C}=j)p(\mathcal{C}=j)} \\ &= \frac{\exp(a_k)}{\sum_j \exp(a_j)} \\ &= \sigma(a_k) \end{align}\]

where \(\sigma\) now denotes the softmax function and we have defined \(a_k = \ln \left( p(\mathbf{x}|\mathcal{C}=k)\,p(\mathcal{C}=k) \right)\). Interestingly, the multiclass case is even simpler to derive than the binary one.
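
The same check works in the multiclass case, where \(\sigma\) is the softmax. Below is a sketch with three assumed Gaussian class-conditionals and arbitrary priors, confirming that the softmax of the \(a_k\) recovers the Bayes posterior.

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)               # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

# Toy three-class problem (assumed): unit-variance Gaussian class-conditionals.
def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

x = 0.5
means = np.array([-1.0, 0.0, 2.0])
priors = np.array([0.5, 0.3, 0.2])
liks = normal_pdf(x, means, 1.0)

# Posterior directly from Bayes' rule vs. softmax of a_k = ln p(x|C=k)p(C=k).
post = liks * priors / np.sum(liks * priors)
a = np.log(liks * priors)
print(np.allclose(post, softmax(a)))  # True
```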