29

Some literature treats the L2 loss (least squared error) and the mean squared error (MSE) loss as two different kinds of loss functions.

However, it seems to me that these two loss functions essentially compute the same thing, up to a factor of 1/n.

So I am wondering whether I have missed anything. Is there any scenario in which one should prefer one of the two loss functions over the other?

ebrahimi
Edamame

6 Answers

27

The function $L_2(x):=\left \|x \right \|_2$ is a norm; it is not a loss by itself. It is called a "loss" when it is used inside a loss function, either to measure the distance between two vectors, $\left \| y_1 - y_2 \right \|^2_2$, or to measure the size of a vector, $\left \| \theta \right \|^2_2$. This goes with a loss minimization that tries to bring these quantities to the "least" possible value.

These are some illustrations:

  1. $L_p$ norm: $L_p(x) := \left \|x \right \|_p = (\sum_{i=1}^{D} |x_i|^p)^{1/p}$,
    where $D$ is the dimension of vector $x$,

  2. Squared error: $\mbox{SE}(A, \theta) =\sum_{n=1}^{N} \left \| y_n - f_{\theta}(x_n) \right \|^2_2$,
    where $A=\{(x_n, y_n)_{n=1}^{N}\}$ is a set of data points, and $f_{\theta}(x_n)$ is model's estimation of $y_n$,

  3. Mean squared error: $\mbox{MSE}(A, \theta) =\mbox{SE}(A, \theta)/N$,

  4. Least squares optimization: $\theta^*=\mbox{argmin}_{\theta}\, \mbox{MSE}(A, \theta) = \mbox{argmin}_{\theta}\, \mbox{SE}(A, \theta)$,

  5. Ridge loss: $\mbox{R}(A, \theta, \lambda) = \mbox{MSE}(A, \theta) + \lambda\left \| \theta \right \|^2_2$

  6. Ridge optimization (regression): $\theta^*=\mbox{argmin}_{\theta} \mbox{R}(A, \theta, \lambda)$.

In all of the above examples, the $L_2$ norm can be replaced with the $L_1$ norm, the $L_\infty$ norm, etc. However, the names "squared error", "least squares", and "Ridge" are reserved for the $L_2$ norm. For example, with $L_1$, "squared error" becomes "absolute error":

  1. Absolute error: $\mbox{AE}(A, \theta) =\sum_{n=1}^{N} \left \| y_n - f_{\theta}(x_n) \right \|_1$,
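To make these definitions concrete, here is a minimal NumPy sketch (the data, the linear model $f_{\theta}(x) = \theta^\top x$, and all variable names are illustrative assumptions, not part of any standard API) that evaluates the quantities above on a small synthetic dataset:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N points of dimension D, scalar targets from a linear model
N, D = 100, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(D)                 # some candidate parameter vector
residuals = y - X @ theta           # y_n - f_theta(x_n) for all n

SE = np.sum(residuals ** 2)         # 2. squared error
MSE = SE / N                        # 3. mean squared error
AE = np.sum(np.abs(residuals))      # absolute error (L1 version)

lam = 0.1
ridge = MSE + lam * np.sum(theta ** 2)   # 5. ridge loss: MSE + lambda * ||theta||_2^2

# 4. least squares optimization: the minimizer of SE (equivalently of MSE)
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
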
Esmailian
10

To be precise, the L2 norm of the error vector is the root mean-squared error, up to a constant factor ($\sqrt{N}$). Hence the squared L2-norm notation $\|e\|^2_2$ commonly found in loss functions.

However, $L_p$-norm losses should not be confused with regularizers. For instance, combining the L2 error with the L2 norm of the weights (both squared, of course) gives the well-known ridge regression loss, while combining the L2 error with the L1 norm of the weights gives rise to Lasso regression.
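As a minimal sketch (the data, weights, and $\lambda$ are illustrative assumptions), the two compositions look like this in NumPy:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                       # design matrix
w = rng.normal(size=4)                             # candidate weights
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + 0.05 * rng.normal(size=50)

err = y - X @ w
lam = 0.1

ridge_loss = np.sum(err ** 2) + lam * np.sum(w ** 2)     # squared L2 error + lambda * ||w||_2^2
lasso_loss = np.sum(err ** 2) + lam * np.sum(np.abs(w))  # squared L2 error + lambda * ||w||_1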

M0nZDeRR
10

They are different:

L2 = $\sqrt{\sum_{i=1}^{N}(y_i-y_{i}^{pred})^2}$

MSE = $\frac{\sum_{i=1}^{N}(y_i-y_{i}^{pred})^2}{N}$

The L2 norm involves a sum and a square root, while the MSE involves a sum and a mean!

We can check this with the following code:

import numpy as np
from sklearn.metrics import mean_squared_error

y = np.array(range(10, 20))  # array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
y_pred = np.array(range(10))  # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.linalg.norm(y_pred - y, ord=2)  # L2 norm: 31.622776601683793
mean_squared_error(y_pred, y)  # MSE: 100.0
Belter
0

By the theory of Riemann integration (taking $[a,b]=[0,1]$ so that $\Delta x = 1/n$),
\begin{align*} \int_a^b |f(x)-g(x)|^2\,dx &= \lim_{n \to \infty} \sum_{k=1}^n |f(x_k)-g(x_k)|^2 \,\Delta x\\ &= \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^n |f(x_k) - g(x_k)|^2 \\ & \approx \frac{1}{n} \sum_{k=1}^n |f(x_k) - g(x_k)|^2 \end{align*}
for $n$ sufficiently large. You can recognize the LHS as originating from the $L_2$ norm and the RHS as the MSE. If you work on function spaces and consider point-wise evaluation of functions, then the MSE essentially approximates the squared $L_2$ norm of the difference. In finite dimensions, on the other hand, the MSE is the squared norm divided by the dimension, i.e., $$ \|y - \hat{y}\|_2^2 = \sum_{k=1}^n |y_k - \hat{y}_k|^2, \qquad \text{MSE} = \frac{1}{n} \|y - \hat{y}\|_2^2. $$ The difference, if there is one, is measure-theoretic.
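A quick numerical sketch of this approximation, with illustrative choices $f(x)=\sin(2\pi x)$ and $g(x)=x$ on $[0,1]$ (assumed only for the example):

import numpy as np

f = lambda x: np.sin(2 * np.pi * x)
g = lambda x: x

n = 10_000
x = (np.arange(n) + 0.5) / n              # midpoints of a uniform grid on [0, 1]
mse = np.mean((f(x) - g(x)) ** 2)         # (1/n) * sum |f(x_k) - g(x_k)|^2

# Reference value of the integral from a much finer midpoint sum
x_fine = (np.arange(10 * n) + 0.5) / (10 * n)
integral = np.mean((f(x_fine) - g(x_fine)) ** 2)

print(mse, integral)                      # the two values agree closely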

Toonia
0

Belter is right, but, as observed by Toonia, we can see that: $$L_2 = \sqrt{N \times \text{MSE}} = \sqrt{\sum_{i=1}^{N}(y_i-y_{i}^{pred})^2}.$$
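A quick check of this identity, reusing the arrays from Belter's snippet above (NumPy and scikit-learn assumed):

import numpy as np
from sklearn.metrics import mean_squared_error

y = np.array(range(10, 20))
y_pred = np.array(range(10))

np.linalg.norm(y_pred - y, ord=2)                # L2 norm: 31.622776601683793
np.sqrt(len(y) * mean_squared_error(y_pred, y))  # sqrt(10 * 100) = 31.622776601683793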

-2

I think for computational purposes we use the L2 norm, because if we use MSE we have to use a "for loop", which takes more computation. On the other hand, we can compute the L2 norm with matrix operations, which saves computation in any programming language, especially if we have huge data. Overall, I think both are doing the same thing. Please correct me if I am wrong!
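A minimal NumPy sketch of a loop-based MSE next to the vectorized versions (the error vector is illustrative):

import numpy as np

e = np.random.randn(1_000_000)            # illustrative error vector

# Loop version of the MSE (slow in pure Python)
mse_loop = 0.0
for v in e:
    mse_loop += v * v
mse_loop /= len(e)

# Vectorized versions: neither requires a Python loop
mse_vec = np.mean(e ** 2)
l2_vec = np.linalg.norm(e, ord=2)         # equals sqrt(len(e) * mse_vec)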

  • I don't see why a `for` loop would be needed for MSE but not $L2$ norm. – Dave Jun 28 '21 at 17:20
  • MSE and the L2 norm are the same thing up to a square root and a constant factor. They both require summing over all squared errors. Also, their gradients are the same (up to a constant), hence the extrema (optimal solutions) are the same as well. – M0nZDeRR Oct 27 '21 at 03:18
  • I think you are engaging an aspect of this that the other answers are not: computational overhead. If the data is big, and you are optimizing something "hairy" over it, which means you have to go over it many times, then having a lower overhead is a requirement, not just "nice". Something like mean absolute error is N operations for N rows in complexity, while mean squared error is 2N operations for N rows. It also takes more bits to represent, so a higher-order representation. If you are in reduced fixed-width for big compute, you can hit a ceiling there. – EngrStudent Oct 30 '22 at 13:35