Highest Voted 'data-imputation' Questions - Data Science Stack Exchange

12

votes

5 answers

Please review my sketch of the Machine Learning process

It's amazingly difficult to find an outline of the end-to-end machine learning process. As a total beginner, this lack of information is frustrating, so I decided to try scraping together my own process by looking at a lot of tutorials that all do…

asked Apr 06 '20 at 01:10

rocksNwaves

309
1
10

12

votes

2 answers

Which comes first? Multiple Imputation, Splitting into train/test, or Standardization/Normalization

I am working on a multi-class classification problem, with ~65 features and ~150K instances. 30% of features are categorical and the rest are numerical (continuous). I understand that standardization or normalization should be done after splitting…

multiclass-classification normalization data-imputation

asked Jun 03 '19 at 14:59

Sarah

601
2
5
17
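
For the question above on ordering imputation, splitting, and scaling, here is a minimal sketch of one common leakage-avoiding order: split first, then fit the imputer and the scaler on the training split only. It is not taken from the question's answers. Simple mean imputation stands in for multiple imputation, and the toy values and the helper names `impute` and `standardize` are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical toy feature column, already split; None marks a missing value.
train = [2.0, None, 4.0, 6.0]
test = [None, 8.0]

# 1. Learn the imputation value from the training split only.
train_mean = mean(v for v in train if v is not None)


def impute(values, fill):
    """Replace missing entries with a fill value learned elsewhere."""
    return [fill if v is None else v for v in values]


train_imp = impute(train, train_mean)
test_imp = impute(test, train_mean)  # reuse the training statistic

# 2. Learn the scaling parameters from the imputed training split only.
mu = mean(train_imp)
sigma = pstdev(train_imp)


def standardize(values):
    """Scale values with the mean and standard deviation of the training split."""
    return [(v - mu) / sigma for v in values]


train_std = standardize(train_imp)
test_std = standardize(test_imp)
```

The test split never contributes to `train_mean`, `mu`, or `sigma`, so no information from it reaches the fitted preprocessing.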

11

votes

5 answers

Retrieve dropped column names from `sklearn.impute.SimpleImputer`

The SimpleImputer class takes pandas DataFrames and returns unlabeled NumPy arrays. This means that the SimpleImputer drops some features at will but has no way to communicate to the caller which features have been dropped. I've been trying to come…

scikit-learn data-imputation

asked Jan 06 '20 at 15:17

lurscher

213
2
5
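
For the `SimpleImputer` question above, one possible approach (a sketch, not necessarily what the answers recommend) relies on the fitted `statistics_` attribute. With the default settings and a strategy such as `"mean"`, a column that is entirely missing at fit time gets a NaN statistic and is discarded on transform. The DataFrame below is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: column "b" has no observed values at all.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [4.0, 5.0, np.nan],
})

imputer = SimpleImputer(strategy="mean")
values = imputer.fit_transform(df)  # plain array with "b" removed

# A NaN statistic marks a column that had nothing to average, hence was dropped.
kept_mask = ~np.isnan(imputer.statistics_)
kept = df.columns[kept_mask]
dropped = df.columns[~kept_mask]

# Re-attach the surviving column names to the imputed values.
imputed = pd.DataFrame(values, columns=kept, index=df.index)
```

Here `dropped` holds the names the imputer discarded and `imputed` is a labeled frame again. scikit-learn emits a warning about the skipped column during the fit.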

7

votes

1 answer

How to deal with missing data for only some categories

Or, in other words, data for category A is irrelevant for category B, so it is not present. How can imputing missing data distort or affect learning models broadly? I can't find any logic for how to deal with this relative data. So I am sorry that I don't…

categorical-data data-imputation

asked Sep 19 '18 at 22:08

bacloud14

453
5
13

7

votes

2 answers

R's mice imputation alternative in Python

What is Python's alternative to missing data imputation with mice in R? Imputation using median/mean seems pretty lame; I'm looking for other methods of imputation, something like randomForest.

python r data-imputation

asked Jun 19 '17 at 18:57

user25935

7

votes

5 answers

How to handle missing value if imputation doesn't make sense

I have a column/feature in my dataset showing the number of years a person has been married, "years_married". Since not every person is married, there are NaN fields. It does not make sense to fillna(0) "years_married", since 0 would mean the person just married. A…

data-science-model missing-data data-imputation

asked Mar 02 '23 at 16:41

methus

111
5
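
For the "years_married" question above, one frequently suggested workaround is to keep the fact of missingness as its own feature, so that a placeholder 0 is never confused with "just married". This is a sketch under assumptions: the records and the `is_married` and `encode` names are hypothetical and not taken from the question.

```python
# Hypothetical records; None means the person is not married.
people = [
    {"name": "a", "years_married": 12.0},
    {"name": "b", "years_married": None},
    {"name": "c", "years_married": 0.0},  # married this year
]


def encode(row):
    """Return an explicit indicator plus a placeholder-filled duration."""
    missing = row["years_married"] is None
    return {
        "is_married": 0 if missing else 1,
        "years_married": 0.0 if missing else row["years_married"],
    }


features = [encode(p) for p in people]
```

Persons "b" and "c" both end up with `years_married == 0.0`, but `is_married` tells them apart, so a model can learn separate behavior for the two cases.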

6

votes

2 answers

When to use missing data imputation in the data analysis problem?

I want to run a statistical analysis of a dataset and build a logistic regression model and a multinomial linear model in R according to the research question. But I was wondering at which step I should use missing value imputation to complete the…

dataset data-cleaning missing-data data-imputation

asked Aug 11 '19 at 22:39

Eileen

61
1

5

votes

3 answers

What predictive model to use to impute Gender?

My data looks like this: birth_date has 634,990 missing values and gender has 328,849 missing values. Both of these are substantial amounts since I have 900k entries, so I can't discard empty rows. For birth_date someone recommended using Multivariate…

predictive-modeling missing-data data-imputation

asked May 07 '19 at 19:39

Bn.F76

195
1
7

4

votes

1 answer

XGBoost - Imputing Vs keeping NaN

What is the benefit of imputing numerical or categorical features when using DT methods such as XGBoost that can handle missing values? This question is mainly for when the values are missing not at random. An example of missing not at random…

decision-trees xgboost data-imputation gradient-boosting-decision-trees

asked May 24 '21 at 15:25

thereandhere1

715
1
7
22

4

votes

2 answers

What is the difference between Missing at Random and Missing not at Random data?

I have been working with a dataset where the missing data seem to follow a few particular patterns. I have gone through a lot of websites and articles related to missing data, but I haven't been able to understand the difference between MAR and…

machine-learning r data-mining missing-data data-imputation

asked Sep 12 '18 at 07:48

AdeeThyag

71
2
3

4

votes

2 answers

Imputation of missing values and dealing with categorical values

I have a dataset (10 million rows, 55 columns) with many missing values. I need to predict those values somehow using other non-missing values, i.e. replace them with something that is not NaN. Mean and median are not the solution here. I tried to…

python scikit-learn pandas categorical-data data-imputation

asked May 23 '17 at 11:35

user32550

41
1
2

4

votes

1 answer

Computationally Inexpensive Imputation Techniques in R

I have a large data-frame (155257 x 21 to be specific) with only a few missing values. Say, some 2.16% of the values need to be imputed. The values are floating point numbers. I'd like to use a method that is much faster than it is accurate, because…

machine-learning r efficiency missing-data data-imputation

asked Jun 13 '16 at 02:37

yad

1,773
3
16
27

4

votes

2 answers

How to measure the performance of an imputation technique

I would like to know how I can measure the performance of an imputation technique. I have read a lot about this. Most of the literature on the web applies a classifier after the data has been completed. So this classifier will be used in order to make…

data-mining descriptive-statistics data-imputation

asked Mar 10 '16 at 21:48

user6046209

41
1
2

3

votes

1 answer

Can data leakage be sometimes acceptable?

I have recently started using Kaggle and I have stumbled on a few examples of practices I would consider to be data leakage. Many of them were done by people well established on the platform, and I could tell by their notebooks that they knew what…

preprocessing kaggle data-imputation data-leakage

asked Jul 18 '21 at 11:53

Mateusz

115
6

3

votes

4 answers

Dropping columns or imputing numbers

After looking at the various ways of imputing data to replace NaN in a dataset vs. dropping observations or columns based on a threshold, the right technique is still very confusing. I know that this must be treated on a case by case…

missing-data data-imputation

asked Jul 07 '21 at 02:27

Roger Steinberg

113
6
