Will setting up time series data in this way cause data leakage?

Question

I am trying to predict future stock market values using a gradient boosted tree model. As far as I know, gradient boosted trees use the data in one row, and only that row, to predict the target variable for that row.

Therefore, I am thinking that setting up the training dataset like this would not cause data leakage?

Would this count as roll-forward partitioning in some sense, because for each row, the last year's worth of historical values are provided?
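A minimal sketch of the row layout described above, on synthetic data. The names `make_lagged_frame`, `N_LAGS` and `HORIZON` are hypothetical, and five lags stand in for "the last year's worth" of history. Each row holds only values from its own date or earlier, plus a target a few days ahead. The split is chronological, and training rows whose target date would fall inside the test period are dropped, which is one cautious way to keep the two sets separate.

```python
import numpy as np
import pandas as pd

HORIZON = 3  # predict the value 3 days ahead (illustrative choice)
N_LAGS = 5   # stands in for a full year of history

def make_lagged_frame(series, n_lags, horizon):
    """One row per date: current value, previous n_lags values, future target."""
    frame = pd.DataFrame({"value": series})
    for k in range(1, n_lags + 1):
        frame[f"lag_{k}"] = series.shift(k)    # past values only
    frame["target"] = series.shift(-horizon)   # value `horizon` steps ahead
    return frame.dropna()                      # drop rows with incomplete history/target

# Synthetic daily series (not real market data).
dates = pd.date_range("2020-01-01", periods=30, freq="D")
prices = pd.Series(np.arange(30, dtype=float), index=dates)
frame = make_lagged_frame(prices, N_LAGS, HORIZON)

# Chronological split: never shuffle rows across time.
cutoff = frame.index[int(len(frame) * 0.8)]
# Keep only training rows whose target date lies before the test period starts.
train = frame[frame.index + pd.Timedelta(days=HORIZON) < cutoff]
test = frame[frame.index >= cutoff]
```

Whether dropping those boundary rows is necessary depends on how the model is evaluated; it is shown here as a conservative default rather than a requirement.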

This isn't data leakage, but it is also a really inefficient way to model a time series. You are basically "hard-coding" earlier values and trying to use them as predictors in a "static model". When predicting time series you have to go about it in a different way. I suggest looking at ARIMA / Prophet / etc. to get going in that direction. — Fnguyen, Jul 14 '20 at 13:02
Hi @Fnguyen, thanks for your feedback. I am also using ARIMA/Prophet but I want to use XGBoost as a comparison between classic methods and ML approaches for time series forecasting. There is precedent for doing that which I can see from research. Given this, would you structure the data differently for training an XGBoost model? And if yes, how so? — Darcey BM, Jul 14 '20 at 17:28
No, not leakage. I've used this method and had great success with it. It certainly relies on you transforming the target so that it's stationary, though. A tree-based model won't predict a target value outside of the range that it observed during training, so remember that. Another thing to consider: are you only predicting one period ahead (that is, is the "3 days' time" value a single point relative to the date columns)? If not, you can still extend this by adding a "forecast distance" variable, which is the time between the date and the start of your target. — Josh, Jul 14 '20 at 20:32
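One possible reading of the "forecast distance" idea, sketched on synthetic data: the same lag features are repeated once per horizon, with an extra column saying how far ahead the target lies. The function name `stack_horizons` and the column names are hypothetical, not taken from any library.

```python
import numpy as np
import pandas as pd

def stack_horizons(series, n_lags, horizons):
    """Repeat the lag features once per horizon, tagging each copy with its distance."""
    blocks = []
    for h in horizons:
        # lag_0 is the value on the row's own date; lag_k is k steps earlier.
        block = pd.DataFrame({f"lag_{k}": series.shift(k) for k in range(n_lags)})
        block["forecast_distance"] = h         # steps between the row date and the target
        block["target"] = series.shift(-h)     # value h steps ahead
        blocks.append(block.dropna())
    # Stable sort keeps the per-date ordering of horizons.
    return pd.concat(blocks).sort_index(kind="mergesort")

# Synthetic daily series (not real market data).
dates = pd.date_range("2020-01-01", periods=20, freq="D")
series = pd.Series(100.0 + np.arange(20), index=dates)
stacked = stack_horizons(series, n_lags=3, horizons=[1, 3, 5])
```

A single model trained on `stacked` can then be asked for any of the listed horizons by setting `forecast_distance` at prediction time.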
@Josh thank you for that. It is good to know you have done something similar before, and I did not know that tree-based methods won't predict outside observed values. Would you recommend using differencing to transform stock market data to make it stationary? — Darcey BM, Jul 14 '20 at 21:13
I tend to avoid stock market problems in general :). I have played with predicting something like stock dispersion which is generally easier because it tends to be high or low over longer periods. — Josh, Jul 14 '20 at 21:26
Definitely transform the target if it is stock prices. Log might work since stocks never really hit 0. Here's a nice link that outlines how to test for stationarity: https://towardsdatascience.com/stock-market-forecasting-using-time-series-c3d21f2dd37f Just ignore the ARIMA part if you want to do it this way. Oh, one other thing: this technique seems to be more powerful for me when you have a lot of covariates for each day, including categorical ones. When it's just the lagged target, it's not as exciting to do... but have fun storming the castle! — Josh, Jul 14 '20 at 21:32
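A small sketch of the log transform combined with differencing, using made-up prices. Differencing the log series gives log returns, which tend to be closer to stationary than raw prices, and the transform can be inverted to get back to price levels.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration only.
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0])

# Log, then first difference: log(p_t) - log(p_{t-1}), i.e. log returns.
log_returns = np.log(prices).diff().dropna()

# Inverting the transform: cumulative sum of log returns, exponentiated,
# scaled by the first price, recovers the later prices.
reconstructed = prices.iloc[0] * np.exp(log_returns.cumsum())
```

A stationarity check such as the augmented Dickey-Fuller test (for example `adfuller` in statsmodels) could then be run on `log_returns` instead of on `prices`.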
Make sure you parse out day of week, month, holiday/weekends, and other fun date variables from that date! If you have a calendar of holidays you can add variables like "days until holiday X" etc! — Josh, Jul 14 '20 at 21:33
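A hedged sketch of the date features mentioned above. The function name `add_date_features` and the column names are hypothetical, and the two-entry holiday list is purely illustrative; a real market calendar would need to be supplied (and must not be empty for this sketch to work).

```python
import numpy as np
import pandas as pd

def add_date_features(dates, holidays):
    """Calendar features per date, plus days until the next holiday (NaN if none left)."""
    out = pd.DataFrame(index=dates)
    out["day_of_week"] = np.asarray(dates.dayofweek)   # Monday=0 ... Sunday=6
    out["month"] = np.asarray(dates.month)
    out["is_weekend"] = np.asarray(dates.dayofweek) >= 5

    hol = np.sort(holidays.values)                     # assumes at least one holiday
    # Position of the first holiday on or after each date.
    pos = np.searchsorted(hol, dates.values, side="left")
    has_next = pos < len(hol)
    next_hol = hol[np.minimum(pos, len(hol) - 1)]
    delta_days = (next_hol - dates.values) / np.timedelta64(1, "D")
    out["days_until_holiday"] = np.where(has_next, delta_days, np.nan)
    return out

dates = pd.date_range("2020-07-01", periods=7, freq="D")
holidays = pd.to_datetime(["2020-07-03", "2020-09-07"])  # illustrative calendar
feats = add_date_features(dates, holidays)
```

All of these features are known in advance for any future date, so they can be added to each row without introducing leakage.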
That's so helpful, thank you! I will definitely add extra date variables for holidays etc., and I will give that article a read now. I am actually using a categorical variable as the target: just whether the market is "up"/"down" on yesterday's price (based on yesterday's Twitter sentiment). Do you think I should switch to a regression problem, with the aim of predicting the actual value rather than just up/down, and apply differencing? — Darcey BM, Jul 15 '20 at 10:38

0 Answers