Questions tagged [data-leakage]

62 questions
7
votes
1 answer

How to deal with possible data leakage in time series data?

I have historical data on consumers who have taken out a loan at some point in time. The task is to predict if a consumer will default when requesting a loan. My issue is that for some customers in the data set, historical transactions are only available…
irkinosor
6
votes
4 answers

Does label encoding an entire dataset cause data leakage?

I have a dataset in which one of the features has a lot of different categorical values. Trying to use a LabelEncoder, OrdinalEncoder or a OneHotEncoder results in an error, since when splitting the data, the test set ends up having some values that…
kaylani2
4
votes
2 answers

Does using user-specific accumulative variables cause data leakage?

Let's say I have a scenario in which my observational unit is a bill that was issued after a certain service was given and my goal is to predict if this bill is going to be paid or not. I have users in the system so I include user-variables like…
Corel
4
votes
1 answer

What can I do when my test and validation scores are good, but the submission is terrible?

This is a very broad question, I understand, and I'm totally fine if someone believes it's not appropriate to ask it. But it's killing me not to understand this... Here's the thing, I'm building a machine learning model to predict the tweet topic. I'm…
Yuxxxxxx
3
votes
1 answer

Can data leakage sometimes be acceptable?

I have recently started using Kaggle and I have stumbled on a few examples of practices I would consider to be data leakage. Many of them were done by people well established on the platform and I could tell by their notebooks that they knew what…
Mateusz
3
votes
0 answers

Data leakage in bidirectional LSTM timeseries data

Does it cause data leakage to train a bidirectional LSTM on data where a user can be a sample in the training data multiple times? Each row is a snapshot at a different point in time for a given user. Their past N months of behavior are the…
David Feldman
3
votes
2 answers

Manual feature engineering based on the output

So, I'm working on an ML model that would have as potential predictors: age, a code for his city, his social status (married/single and so on), number of his children, and the output signed, which is binary (0 or 1). That's the initial dataset…
3
votes
1 answer

Is normalizing the validation set of time series a kind of look ahead bias?

Here's the data normalization process of a time series in a paper about stock prediction using LSTM: Split train and test set based on time (e.g. training set: 2001-2010, test set: 2011-2012). This looks fine to me. Normalize the training set by…
TQA
3
votes
2 answers

Can preprocessing the whole population cause data leakage?

Introduction: I understand the problem of data leakage that could be caused by the preprocessing step when our training and test sets are just samples of an unknown population. The preprocessing parameters should be calculated from the training set…
3
votes
2 answers

What is the difference between data leakage and endogeneity?

I have the impression the former is used in ML whereas the latter is used in econometrics. They both carry the idea that information from the target is "leaking" into explanatory variables. Is there any difference between those two notions?
Tanguy
2
votes
1 answer

Information leakage when train/test are truly i.i.d.?

I am well aware that to avoid information leakage, it is recommended to fit any transformation (e.g., standardization or imputation based on the median value) on the training dataset and apply it to the test dataset. However, I am not clear what…
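The recommendation in this excerpt (fit on the training data, then apply to the test data) can be sketched as follows. This is only a minimal illustration using the Python standard library; the helper names `fit_preprocessor` and `transform` and the toy values are hypothetical and not taken from the question.

```python
from statistics import mean, median, pstdev


def fit_preprocessor(train_values):
    """Learn imputation and scaling parameters from the training split only.

    Hypothetical helper: None marks a missing value.
    """
    observed = [v for v in train_values if v is not None]
    fill = median(observed)  # median imputation value
    filled = [fill if v is None else v for v in train_values]
    # Fall back to 1.0 if the training column is constant (std of zero).
    return {"fill": fill, "mean": mean(filled), "std": pstdev(filled) or 1.0}


def transform(values, params):
    """Apply already-fitted parameters; nothing is re-estimated here."""
    filled = [params["fill"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]


# Toy data, assumed for illustration only.
train = [1.0, 2.0, None, 4.0, 5.0]
test = [None, 10.0]

params = fit_preprocessor(train)       # statistics come from train only
train_scaled = transform(train, params)
test_scaled = transform(test, params)  # test reuses the train statistics
```

The point of the design is that `transform` never sees the test values when the median, mean, and standard deviation are computed, so nothing about the test distribution influences the preprocessing.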
2
votes
1 answer

How to split up my dataset into a train and test set, in order to prevent data leakage?

I realize that this could be considered a duplicate of this question, Is using samples from the same person in both trainset and testset considered data leakage?, where it is stated that "The testing data should not be linked to the training…
2
votes
1 answer

Is using samples from the same person in both trainset and testset considered data leakage?

Suppose a neural network is built for a binary classification problem, such as recognizing whether a face is smiling or not, using a dataset of 1000 persons where each person has ten images of his face. If the dataset is randomly split into trainset and…
AI_new2
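One common way to keep all images of a person on the same side of the split is to split by person rather than by image. The sketch below is a hedged illustration in plain Python, assuming the 1000 persons with ten images each described in the excerpt; the function name `group_train_test_split`, the 20% test fraction, and the seed are hypothetical choices, not something the question specifies.

```python
import random


def group_train_test_split(group_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no group appears in both train and test.

    Hypothetical helper: group_ids[i] is the person that sample i belongs to.
    """
    groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)  # shuffle persons, not individual images
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx


# 1000 persons, ten images each, as in the excerpt.
person_ids = [person for person in range(1000) for _ in range(10)]
train_idx, test_idx = group_train_test_split(person_ids)

train_persons = {person_ids[i] for i in train_idx}
test_persons = {person_ids[i] for i in test_idx}
```

With this split, `train_persons` and `test_persons` share no person, so the test score reflects performance on faces the model has never seen.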
2
votes
0 answers

Will setting up time series data in this way cause data leakage?

I am trying to predict future stock market values using a gradient boosted tree model. As far as I know, gradient boosted trees use the data in one row, and only that row, to predict the target variable for that row. Therefore, I am thinking that…
Darcey BM
2
votes
1 answer

Can I apply feature selection before splitting by requiring that selection occurs > 90% of the time?

I want to move the feature selection step to before splitting to save time and allow a bigger input dataset. If, in repeated subsamples, a feature is selected in over X percent of cases, I will keep it. Alternatively, use a very low X to remove…
ran8