Let there be a design matrix $X$ that contains one training sample in each row, with one feature in each column:

$$X = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \dots & x_j^{(1)} \\ x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \dots & x_j^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_1^{(i)} & x_2^{(i)} & x_3^{(i)} & \dots & x_j^{(i)} \end{bmatrix}$$
And similarly, there is a vector $y$ that contains all the training sample outputs:

$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ \vdots \\ y^{(i)} \end{bmatrix}$$
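As a concrete illustration, here is a minimal sketch (using NumPy, with made-up numbers) of such a design matrix and target vector; only the shapes matter here:

```python
import numpy as np

# Hypothetical data: 4 training samples, 3 features each.
# Row k of X is the feature vector of the k-th sample;
# column j holds feature j across all samples.
X = np.array([
    [1.0, 2.0, 0.5],
    [1.5, 0.3, 2.2],
    [0.7, 1.8, 1.1],
    [2.4, 0.9, 0.4],
])

# y holds the corresponding training outputs, one per sample.
y = np.array([3.1, 4.0, 2.5, 3.3])

print(X.shape, y.shape)  # (4, 3) (4,)
```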
Therefore, we can write the vector of differences as $X\theta - y$, and the error function $J(\theta)$ can be written in this matrix form as
$$J(\theta) = \frac{1}{2}\,(X\theta - y)^T (X\theta - y)$$
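As a sketch of how this looks in code (reusing the hypothetical `X` and `y` from above; `cost` is just an illustrative name, not a library function):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * (X theta - y)^T (X theta - y)."""
    residual = X @ theta - y            # the differences vector X*theta - y
    return 0.5 * (residual @ residual)  # inner product gives the scalar cost

# Evaluate the cost at an arbitrary parameter vector:
print(cost(np.zeros(3), X, y))
```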
And then, to find the minimum, we can take the matrix derivative and set it to zero:

$$\nabla_\theta J(\theta) = 0$$
and by opening up the factors and applying the derivative, we get the normal equation

$$\nabla_\theta J = X^T X\,\theta - X^T y = 0 \quad\Longrightarrow\quad \theta = (X^T X)^{-1} X^T y$$
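A small numerical sketch of this closed-form solution, assuming the `X` and `y` defined earlier (using `np.linalg.solve` rather than forming the inverse explicitly, which is the usual numerically safer choice):

```python
import numpy as np

# Solve X^T X theta = X^T y, i.e. theta = (X^T X)^(-1) X^T y,
# assuming X^T X is invertible for this data.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check: the gradient X^T (X theta - y) should be numerically zero here.
grad = X.T @ (X @ theta_hat - y)
print(theta_hat)
print(np.allclose(grad, 0.0))  # expected: True
```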
In order to justify this choice of error function, we just need a fairly reasonable assumption about the error of each training sample, i.e. that $\epsilon^{(i)} = y^{(i)} - h(x^{(i)})$ is random and IID (independent and identically distributed).
A natural further assumption is that this error is normally distributed with zero mean and variance $\sigma^2$:

$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$$
Therefore from the relation,
$$y^{(i)} - \theta^T x^{(i)} = \epsilon^{(i)}$$
this probability density function becomes the conditional density of $y^{(i)}$ given $x^{(i)}$:

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$
At the maximum of this density the exponential term equals 1, which means the error is zero and the target variable is exactly equal to the hypothesis.
Therefore, we need to choose $\theta$ such that this function is maximized. Viewed as a function of $\theta$, it is called the likelihood function $L(\theta)$. Now, since the $\epsilon^{(i)}$ are IID, the probabilities of the individual training samples can be multiplied to get the total likelihood:
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$
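As a numerical sketch of this product (assuming the `X`, `y`, and `theta_hat` from the earlier snippets, and an arbitrary noise level `sigma`):

```python
import numpy as np

def gaussian_density(eps, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return np.exp(-eps ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def likelihood(theta, X, y, sigma=1.0):
    """L(theta): product over samples of p(y^(i) | x^(i); theta)."""
    residuals = y - X @ theta
    return np.prod(gaussian_density(residuals, sigma))

# The least-squares solution maximizes this product over theta:
print(likelihood(theta_hat, X, y))
print(likelihood(np.zeros(3), X, y))  # an arbitrary theta gives a smaller value
```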
Since any strictly increasing function of $L$ attains its maximum at the same parameters, we can just as well maximize $\log L(\theta)$, which turns the product of factors into a simple sum:
$$\log L(\theta) = \sum_{i=1}^{n}\left[\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right]$$

Summing the $\log\frac{1}{\sqrt{2\pi}\,\sigma}$ terms over all samples,

$$\log L(\theta) = n\,\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\sum_{i=1}^{n}\frac{1}{2}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
Maximizing $\log L(\theta)$ thus amounts to minimizing the second term, which, up to the constant factor $\frac{1}{\sigma^2}$, is nothing but $J(\theta)$. In other words, the least-squares solution is exactly the maximum-likelihood estimate under the Gaussian noise assumption.
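As a final check (a sketch under the same assumptions, reusing `cost` and the data from the snippets above), the decomposition $\log L(\theta) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{J(\theta)}{\sigma^2}$ can be verified numerically:

```python
import numpy as np

def log_likelihood(theta, X, y, sigma=1.0):
    """log L(theta), summed directly from the per-sample Gaussian densities."""
    residuals = y - X @ theta
    return np.sum(-residuals ** 2 / (2 * sigma ** 2)
                  - np.log(np.sqrt(2 * np.pi) * sigma))

def log_likelihood_via_cost(theta, X, y, sigma=1.0):
    """Same quantity written as n*log(1/(sqrt(2 pi) sigma)) - J(theta)/sigma^2."""
    n = len(y)
    return n * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - cost(theta, X, y) / sigma ** 2

theta = np.array([0.5, -0.2, 1.0])  # arbitrary parameters for the check
print(np.isclose(log_likelihood(theta, X, y),
                 log_likelihood_via_cost(theta, X, y)))  # expected: True
```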