
I have a large data frame (155257 x 21, to be specific) with only a few missing values: about 2.16% of the values need to be imputed. The values are floating-point numbers.

I'd like to use a method that is much faster than it is accurate, because of the size of the data-set and the fact that I don't have much to lose in a speed-accuracy tradeoff.

Running missForest() takes several hours while Hmisc's impute() function gives unsatisfactory results.

What functions in R might be useful in such a (or a similar) case?


<code>mice_plot</code> output

yad
  • The `mice` (multiple imputation by chained equations) package is very popular for missing data in R. It's not exactly inexpensive, but it may be a good balance between `missForest` and `impute` – TBSRounder Jun 13 '16 at 13:15
  • `mice` has been running for the past 4 hours, appears to be doing better than `missForest` (efficiency-wise). Are there any metrics to actually compare these imputation methods? – yad Jun 13 '16 at 13:28
  • Perhaps this is where I should mention that my computational resources are _very_ limited, hence the emphasis on speed and efficiency. – yad Jun 13 '16 at 13:36
  • Probably the best way to compare them would be to make a fake data set with variables that are similar to yours, artificially remove data, run both methods, and see which was closer to the true missing values (RMSE/accuracy). Probably not worth it in your limited-resources scenario though. – TBSRounder Jun 13 '16 at 14:54
  • You can try mean-filling all columns except one, then fit a model for that column on the rows where it is observed, and use that model to predict the values that are missing. So you would do this 4 times, with each iteration predicting a different column that contains missing values. Not perfect or super-efficient, but it might be easy with a simple algorithm like KNN – TBSRounder Jun 13 '16 at 14:57
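
The two suggestions in the comments above (benchmark on fake data with artificially removed values, and mean-fill then predict one column at a time) can be combined in a rough base-R sketch. Everything here is an assumption made for illustration: the fake data frame `truth`, the helper names `mean_fill`, `regress_fill` and `rmse`, and the use of `lm()` as the per-column model in place of the KNN mentioned in the comment. It is not the code anyone in this thread ran.

```r
set.seed(42)

# Hypothetical stand-in for the real data: 1000 rows, 4 correlated numeric columns.
n  <- 1000
x1 <- rnorm(n)
truth <- data.frame(
  x1 = x1,
  x2 = 2 * x1 + rnorm(n, sd = 0.5),
  x3 = -x1 + rnorm(n, sd = 0.5),
  x4 = 0.5 * x1 + rnorm(n, sd = 0.5)
)

# Artificially remove roughly 2% of the cells, remembering where they were.
mask  <- matrix(runif(n * ncol(truth)) < 0.02, nrow = n)
holed <- truth
holed[mask] <- NA

# Baseline: fill each column with the mean of its observed values.
mean_fill <- function(df) {
  for (j in seq_along(df)) {
    miss <- is.na(df[[j]])
    df[[j]][miss] <- mean(df[[j]], na.rm = TRUE)
  }
  df
}

# Mean-fill the predictors, then for each column with gaps fit a linear model
# on the rows where that column is observed and predict the rows where it is not.
regress_fill <- function(df) {
  filled <- mean_fill(df)
  out <- df
  for (col in names(df)) {
    miss <- is.na(df[[col]])
    if (!any(miss)) next
    form <- reformulate(setdiff(names(df), col), response = col)
    fit  <- lm(form, data = filled[!miss, , drop = FALSE])
    out[[col]][miss] <- predict(fit, newdata = filled[miss, , drop = FALSE])
  }
  out
}

# RMSE computed only over the cells that were artificially removed.
rmse <- function(imputed) {
  sqrt(mean((as.matrix(imputed)[mask] - as.matrix(truth)[mask])^2))
}

rmse_mean <- rmse(mean_fill(holed))
rmse_reg  <- rmse(regress_fill(holed))
print(c(mean_fill = rmse_mean, regress_fill = rmse_reg))
```

The same masking and `rmse()` scoring could in principle be wrapped around any other imputation function, run on a small subsample of the real data so that it stays cheap.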

1 Answer


Take a look at the h2o package: https://cran.r-project.org/web/packages/h2o/h2o.pdf.

Everything is designed with parallelization in mind. I've had great success with many of their implementations, in R and Scala.

If you have to do it in R and are going for pure speed, I doubt you'll find something faster.

  • I've up-voted because this would have suited my purpose earlier, when I was willing to trade accuracy for speed, but as it turns out, any standard impute function only messes up the classification later on. A regular `Hmisc::impute` works just as well for me in terms of speed. Can I somehow speed up the `mice`-based imputation? – yad Jun 27 '16 at 12:08