Linear Sandwich
Creating a Linear
sandwich by stacking a bunch of Linear
layers on top of each other
just results in another linear transformation. Below, we show that this is indeed the
case.
Suppose we have \(3\) Linear
layers with parameters \(\mathbf{W}_i\), \(\mathbf{b}_i\)
\(, i \in \{1, 2, 3\}\). Then given input, target \(\mathbf{x}\), \(\mathbf{y}\) we have
To make the notation cleaner, we can define \(\mathbf{A}_1 = \mathbf{W}_1\mathbf{W}_2\) and \(\boldsymbol{\beta}_1 = \mathbf{W}_1\mathbf{b}_2 + \mathbf{b}_1\). It’s clear that \(\mathbf{A}_1\) and \(\boldsymbol{\beta}_1\) are also linear because they’re just compositions of linear functions. Substituting them into the last equation we have
\[\begin{align} \mathbf{y} &= \mathbf{A}_1(\mathbf{W}_3\mathbf{x} + \mathbf{b}_3) + \boldsymbol{\beta}_1 \\ &= \mathbf{A}_1\mathbf{W}_3\mathbf{x} + \mathbf{A}_1\mathbf{b}_3 + \boldsymbol{\beta}_1. \end{align}\]Again, we can define \(\mathbf{A}_2 = \mathbf{A}_1\mathbf{W}_3\), \(\boldsymbol{\beta}_2 = \mathbf{A}_1\mathbf{b}_3 + \boldsymbol{\beta}_1\) and substitute into the last equation to get
\[\mathbf{y} = \mathbf{A}_2\mathbf{x} + \boldsymbol{\beta}_2\]which is yet another linear function.
Obviously, the process above can be extended to any number of layers. This shows than
any size stack of Linear
layers is linear.