
How is the closed-form solution to linear regression derived using matrix derivatives, as opposed to the trace method that Andrew Ng uses in his machine learning lectures? Specifically, I am trying to understand how Nando de Freitas does it here.

We want to find the value of $ \theta $ that minimizes $ J(\theta)=(X\theta-Y)^{T}(X\theta-Y) $, where $\theta \in \mathbb{R}^{N \times 1}, X \in \mathbb{R}^{M \times N}$, and $Y \in \mathbb{R}^{M \times 1}$.

$\nabla_{\theta}J(\theta) = \nabla_{\theta} (X\theta-Y)^{T}(X\theta-Y)$

$ = \nabla_{\theta} (\theta^{T} X^{T}-Y^{T})(X\theta-Y)$

$ = \nabla_{\theta} (\theta^{T} X^{T}X\theta-\theta^{T} X^{T}Y - Y^{T}X\theta + Y^{T}Y) $

Note that $\theta^{T} X^{T}Y$ is a scalar, so $\theta^{T} X^{T}Y = (\theta^{T} X^{T}Y)^{T} = Y^{T} X \theta$

$\nabla_{\theta}J(\theta) = \nabla_{\theta}(\theta^{T} X^{T}X\theta-Y^{T} X \theta - Y^{T}X\theta + Y^{T}Y)$

$ = \nabla_{\theta}(\theta^{T} X^{T}X\theta- 2 Y^{T} X \theta + Y^{T}Y)$

$ = \nabla_{\theta} \theta^{T} X^{T}X\theta - \nabla_{\theta} 2 Y^{T} X \theta + \nabla_{\theta} Y^{T}Y$

$ = \nabla_{\theta} \theta^{T} X^{T}X\theta - \nabla_{\theta} 2 Y^{T} X \theta $

How do I apply the matrix derivatives described in that video to solve this? He skips steps.

Edit: Below is my attempt at the suggested strategy of differentiating, setting the result to zero, and then taking an inverse to isolate $\theta$. Looking at one term at a time, we have

$ \nabla_{\theta} \theta^{T} X^{T}X\theta = ? $ How do I differentiate this? It is like differentiating $x\alpha_{1} \alpha_{2} x$ w.r.t. $x$ in the scalar case. I feel I need to combine the two $\theta$ terms to hit them with the derivative, but transposing just gives back the same expression: $$ (\nabla_{\theta} \theta^{T} X^{T}X\theta)^{T} = \nabla_{\theta} \theta^{T} X^{T}X\theta$$

Looking at the second term, we have

$ \nabla_{\theta} 2 Y^{T} X \theta = 2 X^{T} Y$.

Setting the gradient to zero and putting these together, we have: $$\nabla_{\theta} \theta^{T} X^{T}X\theta = 2 X^{T} Y$$

Knowing the solution is $\theta = (X^{T}X)^{-1}X^{T}Y$, we can reverse-engineer the problem, but I am just not seeing it. And how do we get rid of that factor of 2?

user8919
  • Actually, you are on the right track and the rest is as the video says. Transpose both sides of the last equation and remove $\theta$ from both sides. Then multiplying the inverse of $X^{T}X$ with $X^{T}Y$ would give $\theta$ ... so I didn't get the question ... what are you looking for exactly? – Kasra Manshaei Feb 21 '18 at 10:55
  • For the more mathematically oriented questions, perhaps [CrossValidated](https://stats.stackexchange.com/) is a better site. – kingledion Feb 21 '18 at 14:31
  • Hi guys. Very valuable comments. Thank you very much. I made an edit to the original question and got one step further, but am still stuck. I have been bashing my head against the wall for weeks on this. Please help! – user8919 Feb 22 '18 at 00:57
  • @user8919 look at equations 43-47 here: https://atmos.washington.edu/~dennis/MatrixCalculus.pdf . Remember that you are only differentiating with respect to $\theta$ – Eumenedies Feb 23 '18 at 15:53
  • This matrix cookbook (Sec. 2) might come handy to you: http://compbio.fmph.uniba.sk/vyuka/ml/old/2008/handouts/matrix-cookbook.pdf – Paulo A. Ferreira Apr 06 '18 at 13:20

1 Answer


Matrix derivatives work a bit differently than regular ones. The scalar parallel you draw for $\nabla_{\theta} \theta^{T} X^{T}X\theta$ should be more like differentiating $\theta x^{2} \theta$, which is just $x^{2} \theta^{2}$. (Note that I changed $\alpha$ to $x$ to avoid confusion with $X$.) You do not need to rearrange $\theta^{T} X^{T}X\theta$ so that the two $\theta$'s sit next to each other; that form is actually what you want.

Just as you usually have $\frac{d}{d\theta} (x^2 \theta^2) = 2 x^2 \theta$ in the scalar case, in matrix notation the rule is $$\nabla_{\theta} \theta^{T} X^{T}X\theta = 2 X^{T}X \theta.$$
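(More generally, $\nabla_{\theta}\, \theta^{T} A \theta = (A + A^{T})\theta$, which reduces to $2A\theta$ when $A = X^{T}X$, since $X^{T}X$ is symmetric.)

To sketch the remaining algebra this implies for your question: substituting the rule into your simplified gradient and setting it to zero gives

$$\nabla_{\theta}J(\theta) = 2X^{T}X\theta - 2X^{T}Y = 0 \;\Longrightarrow\; X^{T}X\theta = X^{T}Y \;\Longrightarrow\; \theta = (X^{T}X)^{-1}X^{T}Y,$$

so the factor of 2 simply cancels from both sides, and the last step assumes $X^{T}X$ is invertible.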

Hence, I feel he's not really "skipping steps", but applying a different step than the one you expected.
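If it helps, here is a quick numerical sanity check (a minimal sketch on made-up data using NumPy; the sizes and variable names are only illustrative), comparing the closed form against NumPy's least-squares solver:

```python
import numpy as np

# Illustrative data: M = 100 samples, N = 3 features (values are made up).
rng = np.random.default_rng(0)
M, N = 100, 3
X = rng.normal(size=(M, N))
Y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=M)

# Closed form via the normal equations: (X^T X) theta = X^T Y.
theta_closed = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from NumPy's least-squares routine.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(theta_closed, theta_lstsq))  # expected: True
```

(Solving the normal equations with `np.linalg.solve` rather than forming the inverse explicitly is the numerically preferred way to evaluate $(X^{T}X)^{-1}X^{T}Y$.)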

Perochkin