
I am working with a dataset of patient information and trying to calculate the propensity score from the data using MATLAB. After removing features with many missing values, I am still left with several missing (NaN) values. These cause errors: the values of my cost function and gradient vector become NaN when I try to perform logistic regression using the following MATLAB code (from Andrew Ng's Coursera Machine Learning class):

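[m, n] = size(X);
X = [ones(m, 1) X];              % prepend the intercept column
initial_theta = ones(n+1, 1);
% Cost and gradient at the starting point
[cost, grad] = costFunction(initial_theta, X, y);
% Tell fminunc that costFunction also returns the gradient
options = optimset('GradObj', 'on', 'MaxIter', 400);

[theta, cost] = ...
    fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);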

Note: sigmoid and costFunction are working helper functions I wrote for convenience.
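To illustrate the symptom, here is a minimal sketch of how a single NaN spreads through the computation. The toy data and the inline cost expression are assumptions for illustration only; the cost is written as the standard logistic cost and may differ in detail from the actual costFunction.

```matlab
% Toy data (hypothetical values): 3 patients, intercept + 1 feature.
X = [1 0.2; 1 NaN; 1 0.7];
y = [0; 1; 1];
theta = ones(2, 1);

% Inline logistic cost, assumed to match the course's costFunction.
sigmoid = @(z) 1 ./ (1 + exp(-z));
h = sigmoid(X * theta);              % the prediction for row 2 is NaN
cost = -mean(y .* log(h) + (1 - y) .* log(1 - h));
grad = (X' * (h - y)) / numel(y);    % every component picks up the NaN

disp(isnan(h'));                     % which predictions are NaN
disp(isnan(cost));                   % the averaged cost is NaN as well
```

Because the cost and gradient sum over all rows, one missing entry is enough to make both NaN, which is why fminunc cannot make progress.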

The calculations run smoothly if I replace all NaN values with 1 or 0. However, I am not sure that this is the best way to deal with the issue, and I am wondering what replacement value I should pick (in general) to get the best results when performing logistic regression with missing data. Are there any benefits or drawbacks to using a particular number (0, 1, or something else) to replace the missing values in my data?

Note: I have also normalized all feature values to be in the range of 0-1.

Any insight on this issue would be highly appreciated. Thank you.

stats_nerd

1 Answer


Use caution when removing features with missing values. Sometimes the fact that a feature has missing values is valuable data in and of itself.
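One way to keep that information is to add a 0/1 "was missing" indicator column for each affected feature before filling in the gaps. The following is only a sketch on made-up data; the variable names and the placeholder fill value of 0 are assumptions, not a recommendation.

```matlab
% Toy feature matrix (hypothetical values); NaN marks a missing entry.
X = [0.1 NaN 0.5;
     0.4 0.2 NaN;
     0.9 0.6 0.3];

missingMask = isnan(X);                  % logical, same size as X
hasMissing  = any(missingMask, 1);       % columns with at least one NaN

% One 0/1 indicator column per feature that has missing values.
indicators = double(missingMask(:, hasMissing));

% Placeholder fill so the optimizer sees no NaN; 0 is an arbitrary choice.
Xfilled = X;
Xfilled(missingMask) = 0;

% Filled features followed by the missingness indicators.
Xaug = [Xfilled indicators];
```

The model can then learn a separate coefficient for "this value was missing", whatever fill value is used.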

What you are asking about is called imputation. There is a large body of literature on imputation methods that a web search will turn up. Here are some of the most common:

-Mean imputation: replace each missing value with the mean of the observed values of that feature

-Mode imputation: same as above, with the mode

-Median imputation: same again, with the median

-Multiple Imputation by Chained Equations (MICE): fit a regression model for each feature with missing values, using the other features as predictors, and use it to predict the missing values; this is repeated in cycles and produces several imputed datasets. This is a high-variance solution, so some domain knowledge may be necessary. It can also take a long time. There are packages available to do this in R and Python.

-Replace missing values with a constant such as 0 or 1 (looks like you have already tried this)
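As a rough sketch of the three simplest options, the loop below imputes each column from its own observed values. The data are made up and the variable names are placeholders; it assumes every column has at least one observed value, since otherwise the statistics themselves are NaN.

```matlab
% Toy feature matrix (hypothetical values); NaN marks a missing entry.
X = [0.2 1.0;
     NaN 0.0;
     0.6 NaN;
     0.2 1.0];

Xmean = X; Xmedian = X; Xmode = X;
for j = 1:size(X, 2)
    missing  = isnan(X(:, j));           % rows to fill in column j
    observed = X(~missing, j);           % values actually recorded
    Xmean(missing, j)   = mean(observed);
    Xmedian(missing, j) = median(observed);
    Xmode(missing, j)   = mode(observed);
end
```

Mode imputation is usually more meaningful for categorical or binary features than for continuous ones, where exact repeats are rare.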

Try the different methods and compare the resulting models, ideally on held-out data, to see which one works best for your data!

bstrain