
L2 regularization leads to smaller values in the parameter vector, while L1 regularization leads to some coefficients in the parameter vector being set exactly to 0.

More generally, I've seen that non-differentiable regularization functions lead to some coefficients in the parameter vector being set to 0. Why is that the case?
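
For concreteness, the behaviour I mean can be reproduced with scikit-learn (an illustrative sketch only; the synthetic data and the penalty strength `alpha=5.0` are arbitrary choices of mine):

```python
# Illustrative sketch: fit Lasso (L1) and Ridge (L2) on synthetic data where
# only 3 of 10 features actually influence y. The non-informative features
# have true coefficient 0; Lasso typically estimates them as exactly 0.0,
# while Ridge only shrinks them towards 0.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=5.0).fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 2))  # several exact zeros
print("Ridge coefficients:", np.round(ridge.coef_, 2))  # small but non-zero
```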

Victor
  • Does this answer your question? [how Lasso regression helps to shrinks the coefficient to zero and why ridge regression dose not shrink the coefficient to zero?](https://datascience.stackexchange.com/questions/85220/how-lasso-regression-helps-to-shrinks-the-coefficient-to-zero-and-why-ridge-regr) – Ben Reiniger Jul 27 '21 at 03:10
  • @BenReiniger There is some additional value here: the generalization of L1 to "non differentiable" regularization. I was well aware of L1 but not the general case. – WestCoastProjects May 20 '23 at 14:07
  • @WestCoastProjects fair, but I think the general intuition is the same (e.g. L_alpha "norms" with alpha between 0 and 2)...except when it's false, e.g. using the infinity-norm would put the sharp corners of the constraint region along $x_1=x_2$ and thus encourage **non**-sparse solutions! – Ben Reiniger May 20 '23 at 14:46

2 Answers


ISLR discusses this topic in detail; it can be understood by looking at the contours of the RSS and the constraint regions of the penalties, shown in the figure below:

[Figure from ISLR: contours of the RSS around the least squares estimate $\hat\beta$, together with the diamond-shaped lasso constraint region and the circular ridge constraint region.]

Each of the ellipses centered around $\hat\beta$ represents a contour: this means that all of the points on a particular ellipse have the same RSS value. As the ellipses expand away from the least squares coefficient estimates, the RSS increases. Equations (6.8) and (6.9) indicate that the lasso and ridge regression coefficient estimates are given by the first point at which an ellipse contacts the constraint region. Since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero. However, the lasso constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero. In higher dimensions, many of the coefficient estimates may equal zero simultaneously.
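
A minimal calculation that complements this picture (a sketch, assuming a single coefficient, or equivalently an orthonormal design so the problem separates per coordinate; $z$ denotes the least squares estimate of that coefficient):

$$
\hat\beta_{\text{lasso}} \;=\; \arg\min_{\beta}\ \tfrac{1}{2}(\beta - z)^2 + \lambda|\beta| \;=\; \operatorname{sign}(z)\,\max(|z| - \lambda,\ 0),
\qquad
\hat\beta_{\text{ridge}} \;=\; \arg\min_{\beta}\ \tfrac{1}{2}(\beta - z)^2 + \lambda\beta^2 \;=\; \frac{z}{1 + 2\lambda}.
$$

So the lasso estimate is exactly $0$ whenever $|z| \le \lambda$, i.e. for a whole range of data, while the ridge estimate is only rescaled and is never exactly $0$ unless $z = 0$.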

Ashwiniku918

Look at the penalty terms in linear Ridge and Lasso regression:

Ridge (L2):

$$\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2} + \lambda\sum_{j=1}^{p}\beta_j^{2}$$

Lasso (L1):

$$\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2} + \lambda\sum_{j=1}^{p}|\beta_j|$$

Note the absolute value (L1 norm) in the Lasso penalty compared to the squared value (L2 norm) in the Ridge penalty.

In Introduction to Statistical Learning (Ch. 6.2.2) it reads: "As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection."

http://www-bcf.usc.edu/~gareth/ISL/
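
Regarding the general case asked about (any penalty $\Omega$ that is non-differentiable at $0$, not only the L1 norm), here is a sketch of the standard subgradient argument, in my notation rather than ISLR's. At a minimizer $\hat\beta$ of $\mathrm{RSS}(\beta) + \lambda\,\Omega(\beta)$, the optimality condition is

$$
0 \;\in\; \nabla \mathrm{RSS}(\hat\beta) + \lambda\,\partial\Omega(\hat\beta).
$$

If $\Omega$ is differentiable at $0$ with gradient $0$ there (e.g. the ridge penalty $\sum_j \beta_j^2$), then $\hat\beta_j = 0$ requires $\partial\,\mathrm{RSS}/\partial\beta_j = 0$ exactly, which essentially never happens with noisy data. If instead $\Omega$ has a kink at $0$, its subdifferential there is a whole interval (for $|\beta_j|$ it is $[-1, 1]$), so $\hat\beta_j = 0$ satisfies the condition whenever $|\partial\,\mathrm{RSS}/\partial\beta_j| \le \lambda$ at that point. The kink therefore makes exact zeros optimal for a whole range of data, which is why non-differentiable penalties produce sparse solutions.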

Peter
  • This I already know; I'm asking why that is the case, more like a mathematical intuition or proof. – Victor Jul 01 '19 at 09:09
  • 1
  • I thought you could directly see it from the math: have a look at this page (Section: Comparing regularization techniques — Intuition). It's all about the norms (L1 vs. L2) https://blog.alexlenail.me/what-is-the-difference-between-ridge-regression-the-lasso-and-elasticnet-ec19c71c9028 – Peter Jul 01 '19 at 11:39
  • This is not answering the question about the _general_ case of non-differentiable regularization functions. If that reference answers it then an excerpt would helpfully be included above – WestCoastProjects May 20 '23 at 14:08