Questions tagged [data-imputation]

Data imputation is the process of replacing missing data with substituted values. This could involve statistically representative data filling (e.g. local averages) or simply replacing the missing data with encoded values (e.g. replace NaNs with zeros).
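
As a rough illustration of the two strategies named in the description, the sketch below fills missing entries first with a fixed encoded value and then with the mean of the observed values. The helper names `fill_with_constant` and `fill_with_mean` and the sample `ages` list are hypothetical. The sketch uses only the standard library and a global mean rather than a local average; in practice a library routine such as pandas `fillna` or `sklearn.impute.SimpleImputer` would usually be used instead.

```python
import math
from statistics import mean


def fill_with_constant(values, constant=0.0):
    """Replace every NaN with a fixed encoded value (hypothetical helper)."""
    return [constant if math.isnan(v) else v for v in values]


def fill_with_mean(values):
    """Replace every NaN with the mean of the observed values (hypothetical helper)."""
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        raise ValueError("cannot compute a mean: all values are missing")
    fill = mean(observed)
    return [fill if math.isnan(v) else v for v in values]


# Made-up sample data with two missing entries.
ages = [20.0, float("nan"), 40.0, float("nan"), 30.0]

zero_filled = fill_with_constant(ages)  # encoded-value strategy
mean_filled = fill_with_mean(ages)      # statistically representative strategy

print(zero_filled)
print(mean_filled)
```

Zero filling keeps the data shape but can distort the distribution, while mean filling preserves the observed mean at the cost of shrinking the variance.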

122 questions
12
votes
5 answers

Please review my sketch of the Machine Learning process

It's amazingly difficult to find an outline of the end-to-end machine learning process. As a total beginner, this lack of information is frustrating, so I decided to try scraping together my own process by looking at a lot of tutorials that all do…
12
votes
2 answers

Which comes first? Multiple Imputation, Splitting into train/test, or Standardization/Normalization

I am working on a multi-class classification problem, with ~65 features and ~150K instances. 30% of features are categorical and the rest are numerical (continuous). I understand that standardization or normalization should be done after splitting…
Sarah
11
votes
5 answers

Retrieve dropped column names from `sklearn.impute.SimpleImputer`

The SimpleImputer class takes pandas dataframes and returns unlabeled numpy arrays. This means that the SimpleImputer drops some features at will, but has no way to communicate to the caller which features have been dropped. I've been trying to come…
lurscher
7
votes
1 answer

How to deal with missing data for only some categories

Or in other words, data for category A is irrelevant for category B, so it is not present. How can imputing missing data distort or affect learning models broadly? I can't find any logic for how to deal with this relative data. So I am sorry that I don't…
bacloud14
7
votes
2 answers

R's mice imputation alternative in Python

What is Python's alternative to missing data imputation with mice in R? Imputation using median/mean seems pretty lame, I'm looking for other methods of imputation, something like randomForest.
user25935
7
votes
5 answers

How to handle missing value if imputation doesn't make sense

I have a column/feature in my dataset showing the years a person has been married, "years_married". Since not every person is married, there are NaN fields. It does not make sense to fillna(0) "years_married", since 0 would mean the person just married. A…
methus
6
votes
2 answers

When to use missing data imputation in the data analysis problem?

I want to run a statistical analysis of a dataset and build a logistic regression model and a multinomial linear model in R according to the research question. But I was wondering at which step I should use missing value imputation to complete the…
Eileen
5
votes
3 answers

What predictive model to use to impute Gender?

My data looks like this: birth_date has 634,990 missing values; gender has 328,849 missing values. Both of these are substantial amounts, since I have 900k entries, so I can't discard empty rows. For birth_date someone recommended using Multivariate…
Bn.F76
4
votes
1 answer

XGBoost - Imputing Vs keeping NaN

What is the benefit of imputing numerical or categorical features when using DT methods such as XGBoost that can handle missing values? This question is mainly for when the values are missing not at random. An example of missing not at random…
4
votes
2 answers

What is the difference between Missing at Random and Missing not at Random data?

I have been working with a dataset where the missing data seem to follow a few particular patterns. I have gone through a lot of websites and articles related to missing data, but I haven't been able to understand the difference between MAR and…
4
votes
2 answers

Imputation of missing values and dealing with categorical values

I have a dataset (10 million rows, 55 columns) with many missing values. I need to predict those values somehow using other non-missing values, i.e. replace them with something that is not NaN. Mean and median are not the solution here. I tried to…
4
votes
1 answer

Computationally Inexpensive Imputation Techniques in R

I have a large data-frame (155257 x 21 to be specific) with only a few missing values. Say, some 2.16% of the values need to be imputed. The values are floating point numbers. I'd like to use a method that is much faster than it is accurate, because…
yad
4
votes
2 answers

How to measure the performance of an imputation technique

I would like to know how I can measure the performance of an imputation technique. I have read a lot about this. Most of the literature on the web applies a classifier after the data has been completed. So this classifier will be used in order to make…
3
votes
1 answer

Can data leakage be sometimes acceptable?

I have recently started using Kaggle and I have stumbled on a few examples of practices I would consider to be data leakage. Many of them were done by people well established on the platform, and I could tell by their notebooks that they knew what…
Mateusz
3
votes
4 answers

Dropping columns or imputing numbers

After looking at the various ways of imputing data to replace NaN in a dataset vs. dropping observations or columns based on a threshold, the right technique is still very confusing. I know that this must be treated on a case by case…