Let there be a design matrix $X$ that contains one training sample in each row, with one feature in each column:

$$X = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \dots & x_j^{(1)} \\ x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \dots & x_j^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_1^{(i)} & x_2^{(i)} & x_3^{(i)} & \dots & x_j^{(i)} \end{bmatrix}$$
And similarly, there is a vector $y$ that contains all the training sample outputs:

$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ \vdots \\ y^{(i)} \end{bmatrix}$$
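As a concrete illustration, here is a minimal sketch (using NumPy, with made-up numbers) of such a design matrix and target vector; only the shapes matter here:

```python
import numpy as np

# Hypothetical data: 4 training samples, 3 features each.
# Row k of X is the feature vector of the k-th sample;
# column j holds feature j across all samples.
X = np.array([
    [1.0, 2.0, 0.5],
    [1.5, 0.3, 2.2],
    [0.7, 1.8, 1.1],
    [2.4, 0.9, 0.4],
])

# y holds the corresponding training outputs, one per sample.
y = np.array([3.1, 4.0, 2.5, 3.3])

print(X.shape, y.shape)  # (4, 3) (4,)
```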
Therefore, we can write the vector of differences as $X\theta - y$, and the error function $J(\theta)$ can be written in this matrix form as
$$J(\theta) = \frac{1}{2}\,(X\theta - y)^T (X\theta - y)$$
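As a sketch of how this looks in code (reusing the hypothetical `X` and `y` from above; `cost` is just an illustrative name, not a library function):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * (X theta - y)^T (X theta - y)."""
    residual = X @ theta - y            # the differences vector X*theta - y
    return 0.5 * (residual @ residual)  # inner product gives the scalar cost

# Evaluate the cost at an arbitrary parameter vector:
print(cost(np.zeros(3), X, y))
```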
And then, to find the minimum, we can take the matrix derivative and set it to zero:

$$\nabla_\theta J(\theta) = 0$$
and by opening up the factors and applying the derivative, we get the normal equation

$$\nabla_\theta J = X^T X\,\theta - X^T y = 0 \quad\Longrightarrow\quad \theta = (X^T X)^{-1} X^T y$$
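A small numerical sketch of this closed-form solution, assuming the `X` and `y` defined earlier (using `np.linalg.solve` rather than forming the inverse explicitly, which is the usual numerically safer choice):

```python
import numpy as np

# Solve X^T X theta = X^T y, i.e. theta = (X^T X)^(-1) X^T y,
# assuming X^T X is invertible for this data.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check: the gradient X^T (X theta - y) should be numerically zero here.
grad = X.T @ (X @ theta_hat - y)
print(theta_hat)
print(np.allclose(grad, 0.0))  # expected: True
```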
In order to justify this choice of error function, we just need a fairly reasonable assumption about the error of each training sample, i.e. that $\epsilon^{(i)} = y^{(i)} - h(x^{(i)})$ is random and IID (independent and identically distributed).
A natural further assumption is that this error is normally distributed with zero mean and variance $\sigma^2$:

$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$$
Therefore from the relation,
$$y^{(i)} - \theta^T x^{(i)} = \epsilon^{(i)}$$
this probability density function becomes the conditional density of $y^{(i)}$ given $x^{(i)}$:

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$
At the maximum of this density the exponential term equals 1, which means the error is zero and the target variable is exactly equal to the hypothesis.
Therefore, we need to choose $\theta$ such that this function is maximized. Viewed as a function of $\theta$, it is called the likelihood function $L(\theta)$. Now, since the $\epsilon^{(i)}$ are IID, the probabilities of the individual training samples can be multiplied to get the total likelihood:
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$
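As a numerical sketch of this product (assuming the `X`, `y`, and `theta_hat` from the earlier snippets, and an arbitrary noise level `sigma`):

```python
import numpy as np

def gaussian_density(eps, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return np.exp(-eps ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def likelihood(theta, X, y, sigma=1.0):
    """L(theta): product over samples of p(y^(i) | x^(i); theta)."""
    residuals = y - X @ theta
    return np.prod(gaussian_density(residuals, sigma))

# The least-squares solution maximizes this product over theta:
print(likelihood(theta_hat, X, y))
print(likelihood(np.zeros(3), X, y))  # an arbitrary theta gives a smaller value
```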
Since any strictly increasing function of $L$ attains its maximum at the same parameters, we can just as well maximize $\log L(\theta)$, which turns the product of factors into a simple sum:
$$\log L(\theta) = \sum_{i=1}^{n}\left[\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right]$$

Summing the $\log\frac{1}{\sqrt{2\pi}\,\sigma}$ terms over all samples,

$$\log L(\theta) = n\,\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\sum_{i=1}^{n}\frac{1}{2}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
Maximizing $\log L(\theta)$ thus amounts to minimizing the second term, which, up to the constant factor $\frac{1}{\sigma^2}$, is nothing but $J(\theta)$. In other words, the least-squares solution is exactly the maximum-likelihood estimate under the Gaussian noise assumption.
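As a final check (a sketch under the same assumptions, reusing `cost` and the data from the snippets above), the decomposition $\log L(\theta) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{J(\theta)}{\sigma^2}$ can be verified numerically:

```python
import numpy as np

def log_likelihood(theta, X, y, sigma=1.0):
    """log L(theta), summed directly from the per-sample Gaussian densities."""
    residuals = y - X @ theta
    return np.sum(-residuals ** 2 / (2 * sigma ** 2)
                  - np.log(np.sqrt(2 * np.pi) * sigma))

def log_likelihood_via_cost(theta, X, y, sigma=1.0):
    """Same quantity written as n*log(1/(sqrt(2 pi) sigma)) - J(theta)/sigma^2."""
    n = len(y)
    return n * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - cost(theta, X, y) / sigma ** 2

theta = np.array([0.5, -0.2, 1.0])  # arbitrary parameters for the check
print(np.isclose(log_likelihood(theta, X, y),
                 log_likelihood_via_cost(theta, X, y)))  # expected: True
```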