Does it cause data leakage to train a bidirectional LSTM on data where a user can be a sample in the training data multiple times?

Each row is a snapshot at a different point in time for a given user. Their past N months of behavior are the features and their current month of behavior is the target.

Example data: the "Months Prior" columns are the only features. Both the features and the target are continuous values.

+------------------+---------+--------------------+----------------+----------------+----------------+-----------------+----------------+------------------------+
| Train Test Split | User Id | Current Month Date | 5 Months Prior | 4 Months Prior | 3 Months Prior | 2 Months Prior  | 1 Month Prior  | Target (Current Month) |
+------------------+---------+--------------------+----------------+----------------+----------------+-----------------+----------------+------------------------+
| test             |     123 | June               |              1 |              4 |              2 |               8 |              2 |                      6 |
| test             |     123 | May                |              0 |              1 |              4 |               2 |              8 |                      2 |
| training         |     123 | April              |              0 |              0 |              1 |               4 |              2 |                      8 |
| training         |     123 | March              |              0 |              0 |              0 |               1 |              4 |                      2 |
| training         |     123 | Feb                |              0 |              0 |              0 |               0 |              1 |                      4 |
+------------------+---------+--------------------+----------------+----------------+----------------+-----------------+----------------+------------------------+
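For concreteness, here is a minimal sketch of how rows like these could be built from one user's monthly series. It assumes the underlying series for user 123 is Jan..Jun = 1, 4, 2, 8, 2, 6 (read off the table) and that months before the first observation are zero-padded. The helper name `make_snapshots` is hypothetical, and the sketch also produces a January row that the table omits.

```python
import numpy as np

def make_snapshots(series, n_lags=5):
    """Build one (features, target) row per month, zero-padded on the left."""
    values = np.asarray(series, dtype=float)
    padded = np.concatenate([np.zeros(n_lags), values])
    # Row i: features are the n_lags months before month i, target is month i.
    features = np.stack([padded[i:i + n_lags] for i in range(len(values))])
    targets = padded[n_lags:]
    return features, targets

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
monthly = [1, 4, 2, 8, 2, 6]  # assumed series for user 123, inferred from the table

X, y = make_snapshots(monthly)
for name, row, target in zip(months, X, y):
    print(name, row.tolist(), "->", target)
```

Each row is the previous row shifted left by one month, with the previous target appended as the new "1 Month Prior" value.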

Would the bidirectional LSTM learn that some columns in the training data contain the target for other rows?

Example: in the April row, "2 Months Prior" and "1 Month Prior" (4 and 2) repeat the pattern of "1 Month Prior" and the target in the March row (4 and 2).

Intuitively, I don't think it would learn these relationships, and I believe the same holds for other machine learning models, such as tree models and linear regression. But I don't know enough about LSTMs to say for sure. I could verify this by creating simulated data with a random number generator, but I'd rather understand the math/intuition.

David Feldman