Questions tagged [ridge-regression]

A regularization method for regression models that shrinks coefficients towards zero.

Ridge regression is a technique that penalizes the size of regression coefficients in order to deal with multicollinear variables or ill-posed statistical problems. It is based on Tikhonov regularization, named after the mathematician Andrey Tikhonov.

Given a set of training data $(x_1,y_1),...,(x_n,y_n)$ where $x_i \in \mathbb{R}^{J}$, the estimation problem is:

$$\min_\beta \sum\limits_{i=1}^{n} (y_i - x_i'\beta)^2 + \lambda \sum\limits_{j=1}^J \beta_j^2$$

for which the solution is given by

$$\widehat{\beta}_{ridge} = (X'X + \lambda I)^{-1}X'y$$

which is similar to the OLS estimator but includes the tuning parameter $\lambda$ and the Tikhonov matrix (here the identity matrix $I$, though other choices are possible). Note that, unlike $X'X$ in the OLS estimator, the matrix $X'X + \lambda I$ is always invertible for $\lambda > 0$, even when there are more parameters in the model than observations, so the estimation problem always has a unique solution.
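As a quick sanity check, here is a minimal sketch (the data and variable names are made up for illustration) that computes the closed-form estimator above and compares it with scikit-learn's Ridge, whose alpha parameter plays the role of $\lambda$:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=n)

lam = 1.0
# Closed-form ridge solution: (X'X + lambda * I)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# scikit-learn's Ridge: alpha plays the role of lambda; fit_intercept=False
# because the closed form above has no intercept term
model = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.allclose(beta_ridge, model.coef_))  # expect True
```

Setting fit_intercept=False makes the comparison exact, since the closed form above does not include an intercept.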

Bayesian derivation

Ridge regression is equivalent to maximum a posteriori (MAP) estimation in Bayesian linear regression with a Normal prior on $\beta$. Define the likelihood:

$$L(X,Y;\beta,\sigma^2) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - \beta^Tx_i)^2}{2\sigma^2}}$$

And using a normal prior with mean $0$ and covariance $\alpha I_p$ on $\beta$:

$$\beta \sim N(0,\alpha I_p)$$

Using Bayes rule, we calculate the posterior distribution:

$$P(\beta | X,Y) \propto L(X,Y;\beta,\sigma^2)\,\pi(\beta)$$ $$\propto \big[\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - \beta^Tx_i)^2}{2\sigma^2}}\big]e^{-\frac12\beta^T(\alpha I_p)^{-1}\beta}$$

Maximizing the posterior is equivalent to minimizing the negative log posterior, which (after some algebra) satisfies:

$$-\log P(\beta | X,Y) \propto \frac12\big(\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \beta^Tx_i)^2 + \frac{1}{\alpha}\beta^T\beta\big)$$ $$\propto \sum_{i=1}^{n}(y_i - \beta^Tx_i)^2 + \frac{\sigma^2}{\alpha}\sum_{j=1}^{p}\beta_j^2 $$

Where $\frac{\sigma^2}{\alpha}$ is the tuning parameter, corresponding to the choice of $\lambda$ from above.
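A small numerical check of this equivalence (simulated data; the values of sigma2 and alpha are illustrative): minimizing the negative log posterior directly should recover the ridge closed-form solution with $\lambda = \sigma^2/\alpha$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.normal(size=(n, p))
sigma2, alpha = 0.25, 2.0  # noise variance and prior variance (illustrative)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

# Negative log posterior, up to additive constants, as derived above
def neg_log_post(beta):
    return 0.5 * (np.sum((y - X @ beta) ** 2) / sigma2 + beta @ beta / alpha)

beta_map = minimize(neg_log_post, np.zeros(p)).x

# Ridge closed form with lambda = sigma^2 / alpha
lam = sigma2 / alpha
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(beta_map, beta_ridge, atol=1e-5))  # expect True
```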

The tuning parameter $\lambda$ determines the degree of shrinkage of the regression coefficients. The idea is to introduce some bias in order to reduce the variance (see the bias-variance trade-off). With highly multicollinear variables, accepting a small increase in bias in exchange for a large reduction in variance can have a substantial effect.

The bias of the ridge regression estimator is $$\text{Bias}(\widehat{\beta}_{ridge}) = -\lambda (X'X + \lambda I)^{-1} \beta$$ It is always possible to find a $\lambda > 0$ such that the MSE of the ridge regression estimator is smaller than that of the OLS estimator.
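A rough simulation sketch of that claim (a synthetic, nearly collinear design; the value of $\lambda$ is picked by hand): averaging the squared estimation error over many replications, ridge beats OLS substantially on this kind of design.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, reps = 30, 2, 5.0, 2000
beta_true = np.array([1.0, 1.0])

# Highly collinear fixed design: second column nearly copies the first
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])

mse_ols = mse_ridge = 0.0
for _ in range(reps):
    y = X @ beta_true + rng.normal(size=n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    mse_ols += np.sum((b_ols - beta_true) ** 2) / reps
    mse_ridge += np.sum((b_ridge - beta_true) ** 2) / reps

print(mse_ols, mse_ridge)  # ridge MSE should be far smaller here
```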

Note that as $\lambda \rightarrow 0$, $\widehat{\beta}_{ridge} \rightarrow \widehat{\beta}_{OLS}$, and as $\lambda \rightarrow \infty$, $\widehat{\beta}_{ridge} \rightarrow 0$. The choice of $\lambda$ is therefore important. Common methods for making it include information criteria (AIC or BIC) and (generalized) cross-validation.
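In scikit-learn, for example, RidgeCV performs this selection; with its default cv=None it uses an efficient leave-one-out (generalized) cross-validation, and the candidate grid below is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# With cv=None (the default), RidgeCV uses efficient leave-one-out
# (generalized) cross-validation over the supplied grid of alphas
model = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)
print(model.alpha_)  # the selected tuning parameter
```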

28 questions
4 votes, 2 answers

Extremely high MSE/MAE for Ridge Regression (sklearn) when the label is directly calculated from the features

Edit: Removing TransformedTargetRegressor and adding more info as requested. Edit2: There were 18K rows where the relation did not hold. I'm sorry :(. After removing those rows and upon @Ben Reiniger's advice, I used LinearRegression and the…
RAbraham
4 votes, 2 answers

What does a negative coefficient of determination mean for evaluating ridge regression?

Judging by the negative result being displayed from my ridge.score() I am guessing that I am doing something wrong. Maybe someone could point me in the right direction? # Create a practice data set for exploring Ridge Regression data_2 =…
Ethan
3 votes, 1 answer

Does ridge regression always reduce coefficients by equal proportions?

Below is an excerpt from the book Introduction to Statistical Learning in R (chapter: Linear Model Selection and Regularization): "In ridge regression, each least squares coefficient estimate is shrunken by the same proportion." On a simple dataset, I…
3 votes, 3 answers

Can ridge regression be used for feature selection?

I'm trying to figure out whether Ridge Regression for regularization can be used to produce a sparser hypothesis; however, it seems to me that ridge will never actually bring any coefficients to zero, only really close to it. So can ridge…
3 votes, 1 answer

Does it matter whether we put regularization parameter ($C$) with error or weight term in Kernel ridge regression?

Kernel ridge regression associates a regularization parameter $C$ with the weight term ($\beta$): $\text{Minimize}: {KRR}=C\frac{1}{2} \left \|\beta\right\|^{2} + \frac{1}{2}\sum_{i=1}^{\mathcal{N}}\left\|e_i \right \|_2^{2} \\ \text{Subject to}:\…
Chandan Gautam
3 votes, 2 answers

Constraining linear regressor parameters in scikit-learn?

I'm using sklearn.linear_model.Ridge to use ridge regression to extract the coefficients of a polynomial. However, some of the coefficients have physical constraints that require them to be negative. Is there a way to impose a constraint on those…
2 votes, 3 answers

How does Lasso regression shrink the coefficients to zero, and why does ridge regression not shrink the coefficients to zero?

How does Lasso regression help with model feature selection by shrinking coefficients to zero? I have seen a few explanations with the diagram below. Can anyone please explain in simple terms how to relate the diagram to how Lasso shrinks the…
star
2 votes, 2 answers

How do standardization and normalization impact the coefficients of linear models?

One benefit of creating a linear model is that you can look at the coefficients the model learns and interpret them. For example, you can see which features have the most predictive power and which do not. How, if at all, does feature…
1 vote, 1 answer

Why do we take $\alpha\sum B_j^2$ as the penalty in Ridge Regression?

$$RSS_{RIDGE}=\sum_{i=1}^n(\hat{y_i}-y_i)^2+\alpha\sum_{j=1}^p B_j^2$$ Why are we taking $\alpha\sum B_j^2$ as a penalty here? We are adding this term to minimize the variance of the machine learning model. But how does this term minimize variance? If I add…
1 vote, 1 answer

Do the benefits of ridge regression diminish with larger datasets?

I have a question about ridge regression and about its benefits (relative to OLS) when the datasets are big. Do the benefits of ridge regression disappear when the datasets are larger (e.g. 50,000 vs 1000)? When the dataset is large enough, wouldn't…
1 vote, 2 answers

What is the meaning of the sparsity parameter?

Sparse methods such as LASSO contain a parameter $\lambda$ which is associated with the minimization of the $l_1$ norm. The higher the value of $\lambda$ ($>0$), the more coefficients are shrunk to zero. What is unclear to me is how does…
Sm1
1 vote, 0 answers

What other metrics can I use to estimate the quality of a model predicting income range (interval estimation task)?

I trained a model that predicts a customer's income given the features: age, declared income, number of outstanding instalments, overdue total amount, active credit limit, total credit limit, total amount. The output is a prediction: lower-upper bound for…
1 vote, 1 answer

Why is Regularization after PCA or Factor Analysis a bad idea?

I have done Factor Analysis on my data and applied various machine learning models to it. I find that it gives particularly high MSE values for Ridge and Lasso Regression compared to other models. I want to know why this happens.
1 vote, 1 answer

How is learning rate calculated in sklearn Lasso regression?

I was applying different regression models to the Kaggle Housing dataset for advanced regression. I am planning to test out lasso, ridge and elastic net. However, none of these models has a learning rate parameter. How is the learning rate…
1 vote, 1 answer

Is there a reference data set for ridge regression?

In order to test an algorithm, I am looking for a reference data set for ridge regression in research papers. Kind of like the equivalent of MNIST but for regression.
Marie