I am working on a data set in which the categorical variables have many empty values (not "NA" but ""). For example, one variable has 14587 empty values out of 14644 observations. There are many such variables where most of the observations are empty. It is a survey dataset, and the empty values arise where a participant chose to skip a particular question.

I have never handled a dataset like this. I am looking for advice on how best to handle such data before any modeling is done. Deleting the rows or the variables with lots of empty values doesn't seem feasible.

Thanks a lot.

user62198

1 Answer

I would consider approaching this situation from the following two perspectives:

  • Missing data analysis. Although the values in question are formally empty strings rather than NA, I think they are effectively incomplete data and can (and should) be treated as missing. If that is the case, you need to recode those values as missing (this can be done programmatically) and then apply standard missing-data handling approaches, such as multiple imputation. If you use R, you can use the packages Amelia (which assumes the data are multivariate normal), mice (which supports non-normal data), or some others. For a nice overview of approaches, methods and software for multiple imputation of data with missing values, see the excellent 2007 article by Nicholas Horton and Ken Kleinman, "Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models".

  • Sparse data analysis, such as sparse regression. I'm not too sure how well this approach would work for variables with high levels of sparsity, but you can find a lot of corresponding information in my relevant answer.
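The recoding step from the first bullet can be sketched in a few lines. This is only an illustration in plain Python, not a full missing-data workflow: the toy rows, the column names, and the helper functions `recode_empty_as_missing` and `missing_fraction` are all hypothetical, and treating whitespace-only strings as missing is an assumption that may not suit every survey.

```python
# Hypothetical toy survey data; column names and values are made up for illustration.
rows = [
    {"age_group": "18-25", "smoker": "", "region": "north"},
    {"age_group": "", "smoker": "", "region": "south"},
    {"age_group": "26-35", "smoker": "yes", "region": "  "},
    {"age_group": "36-45", "smoker": "", "region": "north"},
]


def recode_empty_as_missing(rows):
    """Return a copy of rows with empty or whitespace-only strings replaced by None."""
    return [
        {
            key: (None if isinstance(value, str) and value.strip() == "" else value)
            for key, value in row.items()
        }
        for row in rows
    ]


def missing_fraction(rows):
    """Return the fraction of missing (None) values for each variable."""
    columns = rows[0].keys()
    return {
        col: sum(row[col] is None for row in rows) / len(rows)
        for col in columns
    }


recoded = recode_empty_as_missing(rows)
fractions = missing_fraction(recoded)

# Summarise missingness per variable before choosing an imputation strategy.
for column, fraction in fractions.items():
    print(f"{column}: {fraction:.0%} missing")
```

Looking at the per-variable missingness first helps decide which variables are realistic candidates for imputation. If you work in R instead, reading the file with `read.csv(..., na.strings = "")` should achieve the same recoding at import time.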

Aleksandr Blekh