I am trying to predict real estate prices (prices are in millions).

The mean price for the dataset is 4 million.

My dataset contains no negative values, but some of the predicted values are negative, for example -10 million.

XGBoost is also predicting negative values:

XGBoost: RMSE is 1.24 and $R^2$ is 0.81

Linear regression: RMSE is 1.54 and $R^2$ is 0.74

What am I doing wrong? I tried to use $\log(\text{price})$ as the target, but the RMSE is bigger. What solutions are there for this type of problem?

1 Answer

This can happen with regression, especially if the training data is too small and/or the test data differs from the training data in important ways. It can be caused by bias or overfitting, but overfitting is more likely in your case, so the solution is either to improve the training data or to simplify the model, for example by removing some features.
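To illustrate the point about test data that differs from the training data, here is a minimal sketch on synthetic data (not the OP's dataset). The `area` feature, the coefficients and the noise level are all made up for illustration, and the model is plain least squares fitted with NumPy. Every training price is positive, yet an instance far outside the training range gets a negative prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set (assumption): one feature, area in [50, 300] m^2,
# price in millions generated as 0.03 * area - 0.5 plus small noise.
area_train = rng.uniform(50, 300, size=1000)
price_train = 0.03 * area_train - 0.5 + rng.normal(0, 0.1, size=1000)

# Ordinary least squares with an intercept column.
A = np.column_stack([area_train, np.ones_like(area_train)])
coef = np.linalg.lstsq(A, price_train, rcond=None)[0]
slope, intercept = coef


def predict(area):
    """Linear prediction; nothing constrains the output to be positive."""
    return slope * np.asarray(area, dtype=float) + intercept


in_range = float(predict(150))    # resembles the training instances
out_of_range = float(predict(5))  # far below any training area

print("all training prices positive:", bool(price_train.min() > 0))
print("prediction inside training range:", in_range)
print("prediction outside training range:", out_of_range)
```

This only covers the linear case. It says nothing about why a particular XGBoost model produces negative values; that needs to be checked on the actual model and data.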

Erwan
  • It's a dataset with more than 5 million rows: 4 million rows for training and 1 million for testing. I tried many solutions (removed features, used the logarithm of the price) but nothing works. I don't have outliers or negative values. – Djakarta_zero Jun 18 '22 at 17:35
  • I'm new to this site, so please don't take this the wrong way, but I have noticed that most of your answers seem to be opinion-based with no apparent technical or theoretical support. Is it possible for you to offer references for some of your answers in the future? – user_2340102 Jun 18 '22 at 19:34
  • @Djakarta_zero you can investigate the cause by taking a few instances predicted with a negative value, checking whether there are any similar instances in the training set (probably not) and, if possible, finding which feature values are outliers with respect to the training set. You can also 'open the model' to see how it ends up predicting these values. Imho there's a good chance that you will find what I described: this happens for instances which differ too much from the training set. – Erwan Jun 19 '22 at 14:37
  • @user_2340102 this is a good question, but the answer is a bit complex. First, DSSE is a bit special in several ways compared to other SEs: (too) many questions, especially by newcomers who never come back; very few regular contributors; and a very broad topic, which means that contributors cannot be experts in everything. This results in a high proportion of unanswered questions, and a high proportion of questions/answers with zero votes (quite often even the OP doesn't bother upvoting or accepting). Additionally DSSE is clearly more open to practical questions ... – Erwan Jun 19 '22 at 14:46
  • ... compared to [cvSE](https://stats.stackexchange.com/) which is more about theoretical questions and more formal statistics stuff. I realized the issue a long time ago and [was wondering what to do](https://datascience.meta.stackexchange.com/questions/2468/why-so-few-people-vote-and-is-there-any-way-to-encourage-voting): taking the time to write well-researched answers quickly becomes frustrating when there's zero feedback. Additionally there are so many people asking questions, so for me the choice is like this: I spend 1h writing 1 really good answer with references, or I write 5-6 quick ... – Erwan Jun 19 '22 at 15:02
  • ... answers based on my experience and intuition. Quite often people (especially beginners) are looking for an intuitive understanding which is not provided in books/courses, so sometimes this kind of answer is useful. Sometimes it's not, and I can't always know this in advance. To be honest my personal taste also plays a role: I'm ok to do a serious job professionally of course, but when I work for free I do it the way I like ;) Obviously people can downvote my answers and/or propose their own answers and I'd be happy with this, but this never happens for the reasons mentioned above. – Erwan Jun 19 '22 at 15:09
  • I'm not a beginner, and I think he is right in what he is saying: answers are very poor and not even justified. If you take a look, you will easily see that beginners' questions get answered but questions like this one do not; all my questions are without any answer. Please note that I didn't take your answer into account because it was illogical. Many contributors consider people on this site to be beginners, which is why they give very low-quality answers. Another important thing: new members cannot downvote. – Djakarta_zero Jun 21 '22 at 09:51
  • We don't even know whether they accepted the answer or not, and the most popular response when you ask a difficult question is that the question is not well asked or formulated. – Djakarta_zero Jun 21 '22 at 09:53
  • @Djakarta_zero you realize that I'm not being paid for answering questions, right? I agree that my answers are not always relevant and sometimes even contain mistakes, it's unavoidable when one answers many questions. But if you want the level on this site to improve, don't hesitate: since you're not a beginner, you can answer questions. With more contributors offering good quality answers, the other contributors must improve their game too. We will see how good your answers are, and how motivated you are to keep contributing in the long run. – Erwan Jun 21 '22 at 10:19
  • FYI, contributors tend to avoid answering questions by a user who asked questions before but never gave any feedback. And some users thank a contributor for their answer even if it turns out not to be useful for them: first because they're grateful that somebody took the time to look at their problem, second because this can lead to a discussion to clarify details and eventually find the answer. – Erwan Jun 21 '22 at 10:26
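
The check suggested in the comments (take the instances predicted as negative and compare them with the training set) could be sketched roughly as follows. The function name `compare_to_training` and the toy arrays are hypothetical, and only NumPy is used. With 4 million training rows the pairwise distance step would need a subsample or a proper nearest-neighbour index.

```python
import numpy as np


def compare_to_training(X_train, X_flagged):
    """Compare flagged instances (e.g. negative predictions) with the training set.

    Returns, for each flagged instance:
      - per-feature z-scores relative to the training mean and std,
      - per-feature flags for values outside the training min/max,
      - Euclidean distance to the nearest training instance in z-score space.
    """
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # avoid dividing by zero

    z_train = (X_train - mu) / sigma
    z_flagged = (X_flagged - mu) / sigma

    out_of_range = (X_flagged < X_train.min(axis=0)) | (X_flagged > X_train.max(axis=0))

    # Pairwise distances: memory grows as n_flagged * n_train * n_features.
    diffs = z_flagged[:, None, :] - z_train[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
    return z_flagged, out_of_range, nearest


# Toy data (assumption): 3 standard-normal features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))
X_flagged = np.array([
    [0.0, 0.0, 0.0],   # typical instance
    [10.0, 0.0, 0.0],  # first feature far outside the training range
])

z_flagged, out_of_range, nearest = compare_to_training(X_train, X_flagged)
print(out_of_range)
print(nearest)
```

A flagged instance with large z-scores, out-of-range features, or a large nearest-neighbour distance would be consistent with the explanation in the answer; if the flagged instances look ordinary on all three measures, the cause is probably elsewhere.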