
How is the closed-form solution to linear regression derived using matrix derivatives, as opposed to the trace method that Andrew Ng uses in his machine learning lectures? Specifically, I am trying to understand how Nando de Freitas does it here.

We want to find the value of $ \theta $ that minimizes $ J(\theta)=(X\theta-Y)^{T}(X\theta-Y) $, where $\theta \in \mathbb{R}^{N \times 1}, X \in \mathbb{R}^{M \times N}$, and $Y \in \mathbb{R}^{M \times 1}$.

$\nabla_{\theta}J(\theta) = \nabla_{\theta} (X\theta-Y)^{T}(X\theta-Y)$

$ = \nabla_{\theta} (\theta^{T} X^{T}-Y^{T})(X\theta-Y)$

$ = \nabla_{\theta} (\theta^{T} X^{T}X\theta-\theta^{T} X^{T}Y - Y^{T}X\theta + Y^{T}Y) $

Note that $\theta^{T} X^{T}Y$ is a scalar, so $\theta^{T} X^{T}Y = (\theta^{T} X^{T}Y)^{T} = Y^{T} X \theta$

$\nabla_{\theta}J(\theta) = \nabla_{\theta}(\theta^{T} X^{T}X\theta-Y^{T} X \theta - Y^{T}X\theta + Y^{T}Y)$

$ = \nabla_{\theta}(\theta^{T} X^{T}X\theta- 2 Y^{T} X \theta + Y^{T}Y)$

$ = \nabla_{\theta} \theta^{T} X^{T}X\theta - \nabla_{\theta} 2 Y^{T} X \theta + \nabla_{\theta} Y^{T}Y$

$ = \nabla_{\theta} \theta^{T} X^{T}X\theta - \nabla_{\theta} 2 Y^{T} X \theta $

How do I apply the matrix derivatives described in that video to solve this? He skips steps.

Edit: Below is my attempt at the suggested strategy of differentiating, setting the result to zero, and then taking an inverse to isolate $\theta$. Looking at one term at a time, we have

$ \nabla_{\theta} \theta^{T} X^{T}X\theta = ? $ How do I differentiate this? It is like differentiating $x\alpha_{1} \alpha_{2} x$ w.r.t. $x$ in the scalar case. I feel I need to combine the two $\theta$ terms to hit them with the derivative, but transposing just gives back the same expression: $$ (\nabla_{\theta} \theta^{T} X^{T}X\theta)^{T} = \nabla_{\theta} \theta^{T} X^{T}X\theta$$

Looking at the second term, we have

$ \nabla_{\theta} 2 Y^{T} X \theta = 2 X^{T} Y$.

Setting the gradient to zero and putting these together, we have: $$\nabla_{\theta} \theta^{T} X^{T}X\theta = 2 X^{T} Y$$

Knowing the solution is $\theta = (X^{T}X)^{-1}X^{T}Y$, we can reverse-engineer the problem, but I am just not seeing it. And how do we get rid of that factor of 2?

user8919
  • Actually, you are on the right track and the rest is as the video says. Transpose both sides of the last equation and remove $\theta$ from both sides. Then multiplying the inverse of $X^{T}X$ with $X^{T}Y$ would give $\theta$ ... so I didn't get the question ... what are you looking for exactly? – Kasra Manshaei Feb 21 '18 at 10:55
  • For the more mathematically oriented questions, perhaps [CrossValidated](https://stats.stackexchange.com/) is a better site. – kingledion Feb 21 '18 at 14:31
  • Hi guys. Very valuable comments. Thank you very much. I made an edit to the original question and got one step further, but am still stuck. I have been bashing my head against the wall for weeks on this. Please help! – user8919 Feb 22 '18 at 00:57
  • @user8919 look at equations 43-47 here: https://atmos.washington.edu/~dennis/MatrixCalculus.pdf . Remember that you are only differentiating with respect to $\theta$ – Eumenedies Feb 23 '18 at 15:53
  • This matrix cookbook (Sec. 2) might come handy to you: http://compbio.fmph.uniba.sk/vyuka/ml/old/2008/handouts/matrix-cookbook.pdf – Paulo A. Ferreira Apr 06 '18 at 13:20

1 Answer


Matrix derivatives work a bit differently than regular ones. The scalar parallel you draw for $\nabla_{\theta} \theta^{T} X^{T}X\theta$ should be more like differentiating $\theta x^{2} \theta$, which is just $x^{2} \theta^{2}$. (Note that I changed $\alpha$ to $x$ to avoid confusion with $X$.) You do not need to rearrange $\theta^{T} X^{T}X\theta$ so that the two $\theta$'s sit next to each other; that form is actually what you want.

Just as you usually have $\frac{d}{d\theta} (x^2 \theta^2) = 2 x^2 \theta$ in the scalar case, in matrix notation the rule is $$\nabla_{\theta} \theta^{T} X^{T}X\theta = 2 X^{T}X \theta.$$
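(More generally, $\nabla_{\theta}\, \theta^{T} A \theta = (A + A^{T})\theta$, which reduces to $2A\theta$ when $A = X^{T}X$, since $X^{T}X$ is symmetric.)

To sketch the remaining algebra this implies for your question: substituting the rule into your simplified gradient and setting it to zero gives

$$\nabla_{\theta}J(\theta) = 2X^{T}X\theta - 2X^{T}Y = 0 \;\Longrightarrow\; X^{T}X\theta = X^{T}Y \;\Longrightarrow\; \theta = (X^{T}X)^{-1}X^{T}Y,$$

so the factor of 2 simply cancels from both sides, and the last step assumes $X^{T}X$ is invertible.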

Hence, I feel he's not really "skipping steps", but applying a different step than the one you expected.
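If it helps, here is a quick numerical sanity check (a minimal sketch on made-up data using NumPy; the sizes and variable names are only illustrative), comparing the closed form against NumPy's least-squares solver:

```python
import numpy as np

# Illustrative data: M = 100 samples, N = 3 features (values are made up).
rng = np.random.default_rng(0)
M, N = 100, 3
X = rng.normal(size=(M, N))
Y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=M)

# Closed form via the normal equations: (X^T X) theta = X^T Y.
theta_closed = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from NumPy's least-squares routine.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(theta_closed, theta_lstsq))  # expected: True
```

(Solving the normal equations with `np.linalg.solve` rather than forming the inverse explicitly is the numerically preferred way to evaluate $(X^{T}X)^{-1}X^{T}Y$.)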

Perochkin