29

Some literature treats the L2 loss (least squared error) and the mean squared error (MSE) loss as two different kinds of loss functions.

However, it seems to me that these two loss functions essentially compute the same thing, up to a factor of 1/n.

So I am wondering whether I have missed anything. Is there any scenario in which one should prefer one of the two loss functions over the other?

ebrahimi
Edamame

6 Answers

27

The function $L_2(x):=\left \|x \right \|_2$ is a norm; it is not a loss by itself. It is called a "loss" when it is used inside a loss function, either to measure the distance between two vectors, $\left \| y_1 - y_2 \right \|^2_2$, or to measure the size of a vector, $\left \| \theta \right \|^2_2$. This goes with a loss minimization that tries to bring these quantities to the "least" possible value.

These are some illustrations:

  1. $L_p$ norm: $L_p(x) := \left \|x \right \|_p = (\sum_{i=1}^{D} |x_i|^p)^{1/p}$,
    where $D$ is the dimension of vector $x$,

  2. Squared error: $\mbox{SE}(A, \theta) =\sum_{n=1}^{N} \left \| y_n - f_{\theta}(x_n) \right \|^2_2$,
    where $A=\{(x_n, y_n)_{n=1}^{N}\}$ is a set of data points, and $f_{\theta}(x_n)$ is model's estimation of $y_n$,

  3. Mean squared error: $\mbox{MSE}(A, \theta) =\mbox{SE}(A, \theta)/N$,

  4. Least squares optimization: $\theta^*=\mbox{argmin}_{\theta}\, \mbox{MSE}(A, \theta) = \mbox{argmin}_{\theta}\, \mbox{SE}(A, \theta)$,

  5. Ridge loss: $\mbox{R}(A, \theta, \lambda) = \mbox{MSE}(A, \theta) + \lambda\left \| \theta \right \|^2_2$

  6. Ridge optimization (regression): $\theta^*=\mbox{argmin}_{\theta} \mbox{R}(A, \theta, \lambda)$.

In all of the above examples, the $L_2$ norm can be replaced with the $L_1$ norm, the $L_\infty$ norm, etc. However, the names "squared error", "least squares", and "Ridge" are reserved for the $L_2$ norm. For example, with $L_1$, "squared error" becomes "absolute error":

  1. Absolute error: $\mbox{AE}(A, \theta) =\sum_{n=1}^{N} \left \| y_n - f_{\theta}(x_n) \right \|_1$,
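To make these definitions concrete, here is a minimal NumPy sketch (the data, the linear model $f_{\theta}(x) = \theta^\top x$, and all variable names are illustrative assumptions, not part of any standard API) that evaluates the quantities above on a small synthetic dataset:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N points of dimension D, scalar targets from a linear model
N, D = 100, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(D)                 # some candidate parameter vector
residuals = y - X @ theta           # y_n - f_theta(x_n) for all n

SE = np.sum(residuals ** 2)         # 2. squared error
MSE = SE / N                        # 3. mean squared error
AE = np.sum(np.abs(residuals))      # absolute error (L1 version)

lam = 0.1
ridge = MSE + lam * np.sum(theta ** 2)   # 5. ridge loss: MSE + lambda * ||theta||_2^2

# 4. least squares optimization: the minimizer of SE (equivalently of MSE)
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
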
Esmailian
10

To be precise, the L2 norm of the error vector is the root mean-squared error, up to a constant factor ($\sqrt{N}$). Hence the squared L2-norm notation $\|e\|^2_2$ commonly found in loss functions.

However, $L_p$-norm losses should not be confused with regularizers. For instance, combining the L2 error with the L2 norm of the weights (both squared, of course) gives the well-known ridge regression loss, while combining the L2 error with the L1 norm of the weights gives rise to Lasso regression.
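As a minimal sketch (the data, weights, and $\lambda$ are illustrative assumptions), the two compositions look like this in NumPy:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                       # design matrix
w = rng.normal(size=4)                             # candidate weights
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + 0.05 * rng.normal(size=50)

err = y - X @ w
lam = 0.1

ridge_loss = np.sum(err ** 2) + lam * np.sum(w ** 2)     # squared L2 error + lambda * ||w||_2^2
lasso_loss = np.sum(err ** 2) + lam * np.sum(np.abs(w))  # squared L2 error + lambda * ||w||_1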

M0nZDeRR
10

They are different:

L2 = $\sqrt{\sum_{i=1}^{N}(y_i-y_{i}^{pred})^2}$

MSE = $\frac{\sum_{i=1}^{N}(y_i-y_{i}^{pred})^2}{N}$

The L2 norm involves a sum and a square root, while the MSE involves a sum and a mean!

We can check this with the following code:

import numpy as np
from sklearn.metrics import mean_squared_error

y = np.array(range(10, 20))  # array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
y_pred = np.array(range(10))  # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.linalg.norm(y_pred - y, ord=2)  # L2 norm: 31.622776601683793
mean_squared_error(y_pred, y)  # MSE: 100.0
Belter
0

By the theory of Riemann integration (taking $[a,b]=[0,1]$ so that $\Delta x = 1/n$),
\begin{align*} \int_a^b |f(x)-g(x)|^2\,dx &= \lim_{n \to \infty} \sum_{k=1}^n |f(x_k)-g(x_k)|^2 \,\Delta x\\ &= \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^n |f(x_k) - g(x_k)|^2 \\ & \approx \frac{1}{n} \sum_{k=1}^n |f(x_k) - g(x_k)|^2 \end{align*}
for $n$ sufficiently large. You can recognize the LHS as originating from the $L_2$ norm and the RHS as the MSE. If you work on function spaces and consider point-wise evaluation of functions, then the MSE essentially approximates the squared $L_2$ norm of the difference. In finite dimensions, on the other hand, the MSE is the squared norm divided by the dimension, i.e., $$ \|y - \hat{y}\|_2^2 = \sum_{k=1}^n |y_k - \hat{y}_k|^2, \qquad \text{MSE} = \frac{1}{n} \|y - \hat{y}\|_2^2. $$ The difference, if there is one, is measure-theoretic.
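A quick numerical sketch of this approximation, with illustrative choices $f(x)=\sin(2\pi x)$ and $g(x)=x$ on $[0,1]$ (assumed only for the example):

import numpy as np

f = lambda x: np.sin(2 * np.pi * x)
g = lambda x: x

n = 10_000
x = (np.arange(n) + 0.5) / n              # midpoints of a uniform grid on [0, 1]
mse = np.mean((f(x) - g(x)) ** 2)         # (1/n) * sum |f(x_k) - g(x_k)|^2

# Reference value of the integral from a much finer midpoint sum
x_fine = (np.arange(10 * n) + 0.5) / (10 * n)
integral = np.mean((f(x_fine) - g(x_fine)) ** 2)

print(mse, integral)                      # the two values agree closely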

Toonia
0

Belter is right, but, as observed by Toonia, we can see that: $$L_2 = \sqrt{N \times \text{MSE}} = \sqrt{\sum_{i=1}^{N}(y_i-y_{i}^{pred})^2}.$$
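A quick check of this identity, reusing the arrays from Belter's snippet above (NumPy and scikit-learn assumed):

import numpy as np
from sklearn.metrics import mean_squared_error

y = np.array(range(10, 20))
y_pred = np.array(range(10))

np.linalg.norm(y_pred - y, ord=2)                # L2 norm: 31.622776601683793
np.sqrt(len(y) * mean_squared_error(y_pred, y))  # sqrt(10 * 100) = 31.622776601683793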

-2

I think for computational purposes we use the L2 norm, because if we use MSE we have to use a "for loop", which takes more computation. On the other hand, we can compute the L2 norm with matrix operations, which saves computation in any programming language, especially if we have huge data. Overall, I think both are doing the same thing. Please correct me if I am wrong!
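A minimal NumPy sketch of a loop-based MSE next to the vectorized versions (the error vector is illustrative):

import numpy as np

e = np.random.randn(1_000_000)            # illustrative error vector

# Loop version of the MSE (slow in pure Python)
mse_loop = 0.0
for v in e:
    mse_loop += v * v
mse_loop /= len(e)

# Vectorized versions: neither requires a Python loop
mse_vec = np.mean(e ** 2)
l2_vec = np.linalg.norm(e, ord=2)         # equals sqrt(len(e) * mse_vec)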

  • I don't see why a `for` loop would be needed for MSE but not $L2$ norm. – Dave Jun 28 '21 at 17:20
  • MSE and the L2 norm are the same thing up to a square root and a constant factor. They both require summing over all squared errors. Also, their gradients are the same (up to a constant), hence the extrema (optimal solutions) are the same as well. – M0nZDeRR Oct 27 '21 at 03:18
  • I think you are engaging an aspect of this that the other answers are not: computational overhead. If the data is big, and you are optimizing something "hairy" over it, which means you have to go over it many times, then having a lower overhead is a requirement, not just "nice". Something like mean absolute error is N operations for N rows in complexity, while mean squared error is 2N operations for N rows. It also takes more bits to represent, so a higher-order representation. If you are in reduced fixed-width for big compute, you can hit a ceiling there. – EngrStudent Oct 30 '22 at 13:35