0

I'm working on a demand forecasting project, and my data contains a lot of zeros (75% of the dataset).

My target is highly right-skewed (skewness of 5.5).

So I decided to log transform my target: target = log(target + 1)
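For reference, the transform and a skewness check can be sketched as below. The data is synthetic (about 75% zeros plus a lognormal tail, both assumptions chosen only for illustration), and `sample_skewness` is a hypothetical helper, not something from the original post.

```python
import numpy as np

def sample_skewness(values):
    """Biased sample skewness: third central moment divided by variance**1.5."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

# Synthetic stand-in for the demand target: roughly 75% zeros, long right tail.
rng = np.random.default_rng(1)
n = 10_000
target = np.where(rng.random(n) < 0.25,
                  rng.lognormal(mean=1.0, sigma=1.0, size=n),
                  0.0)

# np.log1p(x) computes log(x + 1) and is defined at x = 0.
target_log = np.log1p(target)

print(sample_skewness(target), sample_skewness(target_log))
```

Note that `log1p(0)` is 0, so the zeros stay at exactly 0 after the transform: the skewness drops, but the spike at zero is still there.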

When I train my models (linear regression, LGBM, or RandomForest), performance (measured by RMSE) gets worse with the log transformation.

If I'm not wrong, tree-based algorithms don't care about skewed data. Even so, I don't understand how reducing skewness can hurt performance.

2 Answers

2

Most of the online discussions I looked up recommend log-transforming the target variable to get better results from tree-based regression algorithms. It is indeed intriguing that performance gets worse with the log transformation here. Maybe looking into the dataset would give a better idea.

Questions for @Dummy01:

  1. Are you converting the predicted target values back with the antilog before computing the RMSE? I am asking because this can also affect the performance score.
  2. Have you applied any log transformation to the independent variables? You didn't mention it in the original post, so I am just wondering whether that's the case.
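To illustrate the first question, here is a minimal sketch of scoring a log-trained model on the original scale. Everything in it is an assumption for illustration: the data is synthetic (about 75% zeros), the model is a plain least-squares fit, and `rmse` is a hypothetical helper. The point is only that predictions made in log space should be inverted with `np.expm1` (the inverse of `np.log1p`) before the RMSE is compared with that of a model trained on the raw target.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, on whatever scale the inputs are in."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic, illustrative demand data: roughly 75% zeros, right-skewed otherwise.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
is_nonzero = rng.random(n) < 0.25
y = np.where(is_nonzero,
             np.exp(1.0 + 0.5 * x + rng.normal(scale=0.5, size=n)),
             0.0)

# Fit ordinary least squares on the log1p-transformed target.
X = np.column_stack([np.ones(n), x])
y_log = np.log1p(y)
coef, *_ = np.linalg.lstsq(X, y_log, rcond=None)
pred_log = X @ coef

# Invert the transform so the predictions are back in original units.
pred = np.expm1(pred_log)

rmse_log_scale = rmse(y_log, pred_log)   # log units, not comparable to a raw-target RMSE
rmse_original_scale = rmse(y, pred)      # comparable to a model trained on raw y
print(rmse_log_scale, rmse_original_scale)
```

The two numbers are on different scales, so an RMSE computed in log space cannot be compared directly with one computed on the untransformed target.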

Apart from that, can you please provide a sample of your dataset and the code snippet you are using? That would help in looking into the problem.

Polymath
2

If the target is skewed, you could try oversampling, undersampling, or SMOTE (Synthetic Minority Oversampling Technique). Since 75% of the data is 0, you could bin the rows into two groups, those where the target equals 0 and those where it does not, and then apply over- or undersampling to balance them.
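A minimal sketch of the binning idea, assuming plain random oversampling with NumPy rather than SMOTE (which synthesizes new points instead of duplicating rows). The data, the feature matrix `X`, and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative target with about 75% zeros (synthetic, not the asker's data).
n = 1000
y = np.where(rng.random(n) < 0.25, rng.gamma(shape=2.0, scale=3.0, size=n), 0.0)
X = rng.normal(size=(n, 3))

# Bin rows into two groups: zero demand vs non-zero demand.
zero_idx = np.flatnonzero(y == 0)
nonzero_idx = np.flatnonzero(y != 0)

# Randomly oversample the smaller group (with replacement) up to the larger one's size.
minority, majority = sorted([zero_idx, nonzero_idx], key=len)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
balanced_idx = rng.permutation(np.concatenate([majority, minority, extra]))

X_bal, y_bal = X[balanced_idx], y[balanced_idx]
print((y_bal == 0).sum(), (y_bal != 0).sum())
```

If you try this, it is usually safer to resample only the training split, so that the validation RMSE is still measured on the original distribution.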