How to modify a sparse survey dataset with empty data points?

Question

I am working on a data set where the categorical variables have lots of empty values (not "NA" but ""). For example, one variable has 14587 empty values out of 14644 observations. There are many such variables where most of the observations are empty. In fact, it is a survey dataset where the participant simply chose to skip a particular question.

I have never handled a similar dataset. I am looking for advice on how best to handle such datasets before any modeling is done. Deleting the rows or the variables with lots of empty values doesn't seem feasible.

Thanks a lot.

score 0 · Accepted Answer · edited Apr 13 '17 at 12:50


I would consider approaching this situation from the following two perspectives:

1. Missing data analysis. Although the values in question are formally empty strings rather than NA, I think that such incomplete data can (and should) be treated as missing. If that is the case, you need to recode those values automatically and then apply standard approaches to handling missing data, such as multiple imputation. If you use R, you can use the packages Amelia (if the data is multivariate normal), mice (supports non-normal data), or some others. For a nice overview of approaches, methods and software for multiple imputation of data with missing values, see the excellent 2007 article by Nicholas Horton and Ken Kleinman, "Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models".
2. Sparse data analysis, such as sparse regression. I'm not sure how well this approach would work for variables with high levels of sparsity, but you can find a lot of related information in my relevant answer.
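The recoding step from the first point can be sketched as follows. This is only an illustration in Python with the standard library, not the answerer's code: the sample data, column names and function names are all hypothetical. It turns empty strings into an explicit missing marker (`None`) and reports how much of each variable is missing. Imputation itself is not shown.

```python
import csv
import io

# Hypothetical survey extract: skipped questions appear as empty strings.
RAW = """id,q1,q2
1,yes,
2,,
3,no,
4,,maybe
"""


def recode_empty_as_missing(rows):
    """Return copies of the rows with empty or whitespace-only values set to None."""
    return [
        {key: (None if value is None or value.strip() == "" else value)
         for key, value in row.items()}
        for row in rows
    ]


def missing_fraction(rows):
    """Return the fraction of missing (None) values for each column."""
    columns = rows[0].keys()
    return {
        col: sum(row[col] is None for row in rows) / len(rows)
        for col in columns
    }


rows = recode_empty_as_missing(list(csv.DictReader(io.StringIO(RAW))))
fractions = missing_fraction(rows)

for column, fraction in fractions.items():
    print(f"{column}: {fraction:.0%} missing")
```

Checking the per-variable missing fraction first helps decide whether imputing a variable is even plausible. In R, the same recoding is usually done when the file is read, for example through the `na.strings` argument of `read.csv`.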

answered Apr 02 '15 at 23:38

Aleksandr Blekh


Thanks a lot Aleksandr Blekh. I am looking into your suggestions now. – user62198 Apr 02 '15 at 23:52
@user62198: You're very welcome. Good luck! – Aleksandr Blekh Apr 02 '15 at 23:55
By "automatically recode those values", do you mean that I should treat the missing values as a separate factor level? Could you please comment? Thanks. – user62198 Apr 03 '15 at 02:51
@user62198: No. I meant the recoding in `R` parlance: `''` => `NA`. – Aleksandr Blekh Apr 03 '15 at 02:56
@user62198: My pleasure. I'm glad I could help. Feel free to upvote as well :-). – Aleksandr Blekh Apr 03 '15 at 18:01
