
I am trying to predict future stock market values using a gradient boosted tree model. As far as I know, gradient boosted trees use the data in one row, and only that row, to predict the target variable for that row.

Therefore, I am thinking that setting up the training dataset like this would not cause data leakage?

enter image description here

Would this count as roll-forward partitioning in some sense, because for each row, the last year's worth of historical values are provided?
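
For illustration, here is a minimal sketch of the layout described above, assuming each row holds the last `n_lags` values as features and the value `horizon` steps ahead as the target. The function names, the `gap` parameter, and the price values are hypothetical and not taken from the original image. The split keeps every training row strictly earlier than its test block, which is the roll-forward idea.

```python
def make_lagged_rows(values, n_lags, horizon):
    """Turn a series into (features, target) rows.

    The row for time t uses values[t-n_lags+1 .. t] as features and
    values[t+horizon] as the target, so no feature is later than t.
    """
    rows = []
    for t in range(n_lags - 1, len(values) - horizon):
        features = values[t - n_lags + 1 : t + 1]
        target = values[t + horizon]
        rows.append((features, target))
    return rows


def roll_forward_splits(n_rows, n_folds, test_size, gap=0):
    """Yield (train_indices, test_indices) with training strictly before test.

    `gap` drops the last rows before each test block. With gap = horizon - 1,
    no training target falls after the first test row's feature date.
    """
    for fold in range(n_folds):
        test_start = n_rows - (n_folds - fold) * test_size
        train_end = test_start - gap
        if train_end <= 0:
            continue  # not enough history for this fold
        train_idx = list(range(train_end))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Made-up prices, for illustration only.
prices = [float(p) for p in range(100, 120)]
rows = make_lagged_rows(prices, n_lags=5, horizon=3)
splits = list(roll_forward_splits(len(rows), n_folds=3, test_size=2, gap=2))
```

The rows themselves do not leak, because each feature is dated at or before the row's own date. Leakage would come from shuffling rows before splitting, since a training target could then lie in the test period.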

Darcey BM
  • This isn't data leakage, but it is a really inefficient way to model a time series. You are basically "hard-coding" earlier values and trying to use them as predictors in a "static model". When predicting time series you have to go about it in a different way. I suggest looking at ARIMA / Prophet / etc. to get going in that direction. – Fnguyen Jul 14 '20 at 13:02
  • Hi @Fnguyen, thanks for your feedback. I am also using ARIMA/Prophet but I want to use XGBoost as a comparison between classic methods and ML approaches for time series forecasting. There is precedent for doing that which I can see from research. Given this, would you structure the data differently for training an XGBoost model? And if yes, how so? – Darcey BM Jul 14 '20 at 17:28
  • No, not leakage. I've used this method and had great success with it. It certainly relies on you transforming the target so that it's stationary, though. A tree-based model won't predict a target value outside the range it observed in training, so remember that. Another thing to consider: are you only predicting one period ahead (that is, is the "3 days' time" value always a single fixed point relative to the date columns)? If not, you can still extend this by adding a "forecast distance" variable, which is the time between the date and the start of your target. – Josh Jul 14 '20 at 20:32
  • @Josh thank you for that. It is good to know you have done something similar before, and I did not know that tree-based methods won't predict outside observed values. Would you recommend using differencing to make stock market data stationary? – Darcey BM Jul 14 '20 at 21:13
  • I tend to avoid stock market problems in general :). I have played with predicting something like stock dispersion, which is generally easier because it tends to be high or low over longer periods. – Josh Jul 14 '20 at 21:26
  • Definitely transform the target if it is stock prices. Log might work since stocks never really hit 0. Here's a nice link that outlines how to test for stationarity: https://towardsdatascience.com/stock-market-forecasting-using-time-series-c3d21f2dd37f Just ignore the ARIMA part if you want to do it this way. Oh, one other thing: this technique seems to be more powerful for me when you have a lot of covariates for each day, including categorical ones. When it's just the lagged target, it's not as exciting to do... but have fun storming the castle! – Josh Jul 14 '20 at 21:32
  • Make sure you parse out day of week, month, holiday/weekends, and other fun date variables from that date! If you have a calendar of holidays you can add variables like "days until holiday X" etc! – Josh Jul 14 '20 at 21:33
  • That's so helpful, thank you! I will definitely add extra date variables for holidays etc., and I will give that article a read now. I am actually using a categorical variable as the target: whether the market is "up" or "down" on yesterday's price (based on yesterday's Twitter sentiment). Do you think I should switch to a regression problem, with the aim of predicting the actual value rather than just up/down, and apply differencing? – Darcey BM Jul 15 '20 at 10:38
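
The suggestions in the comments above (log transform plus differencing, an up/down label, and calendar variables) could be sketched roughly as follows. This is only an illustration using the Python standard library: `log_returns`, `date_features`, the prices, and the holiday dates are all hypothetical, and a real holiday calendar would have to be supplied by the user.

```python
import math
from datetime import date


def log_returns(prices):
    """Difference of log prices: r[t] = log(p[t]) - log(p[t-1])."""
    return [math.log(curr) - math.log(prev) for prev, curr in zip(prices, prices[1:])]


def date_features(day, holidays):
    """Calendar features for one date.

    `holidays` is a caller-supplied list of dates (an assumption here,
    not a built-in market calendar).
    """
    upcoming = [h for h in holidays if h >= day]
    return {
        "day_of_week": day.weekday(),  # Monday = 0, Sunday = 6
        "month": day.month,
        "is_weekend": day.weekday() >= 5,
        "days_until_holiday": (min(upcoming) - day).days if upcoming else None,
    }


# Made-up prices and holiday dates, for illustration only.
prices = [100.0, 102.0, 101.0, 105.0]
returns = log_returns(prices)

# Categorical target: "up" if the price rose relative to the previous day.
labels = ["up" if r > 0 else "down" for r in returns]

features = date_features(
    date(2020, 7, 1),
    holidays=[date(2020, 7, 3), date(2020, 12, 25)],
)
```

Log returns tend to stay within a similar range over time, which matters for a tree-based model because it cannot predict a value outside the range it saw in training. Whether the result is actually stationary still needs to be tested.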

0 Answers