6

I am trying to learn data analysis and machine learning by trying out some problems.

I found the "House Prices" competition, which is a playground competition. Since I am very new to this field, I got confused after exploring the data. The data has 81 columns, one of which is the target column (the house value). Several columns consist mostly of "NaN" values. When I ran:

nulls = data.isnull().sum()
nulls[nulls > 0]

This shows the columns with missing values:

LotFrontage     259 
Alley           1369
MasVnrType      8   
MasVnrArea      8   
BsmtQual        37  
BsmtCond        37  
BsmtExposure    38  
BsmtFinType1    37  
BsmtFinType2    38  
Electrical      1   
FireplaceQu     690 
GarageType      81  
GarageYrBlt     81  
GarageFinish    81  
GarageQual      81  
GarageCond      81  
PoolQC          1453
Fence           1179
MiscFeature     1406

At this point I am totally lost and I don't know how to get rid of these "NaN" values.
Any help would be appreciated.

Ola Ström
Ahmed Dhanani

3 Answers

7

You can use the DataFrame.fillna function to fill the NaN values in your data. For example, assuming your data is in a DataFrame called df,

df.fillna(0, inplace=True)

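will replace the missing values with the constant value 0. You can also do more clever things, such as replacing the missing values in each numeric column with the mean of that column (numeric_only=True skips string columns, for which a mean is undefined):

df.fillna(df.mean(numeric_only=True), inplace=True)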

or take the last value seen for a column:

df.ffill(inplace=True)

Filling the NaN values is called imputation. Try a range of different imputation methods and see which ones work best for your data.
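As a minimal sketch of the three approaches above, the fills can be compared side by side on a small made-up DataFrame. The column names are borrowed from the question, but the values are invented for illustration. This version returns new DataFrames instead of using inplace=True, so the original stays available for comparison:

```python
import numpy as np
import pandas as pd

# Toy data: values are invented for illustration only.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan, 75.0],
    "MasVnrArea": [196.0, 0.0, np.nan, 350.0, 0.0],
    "Alley": ["Grvl", np.nan, np.nan, "Pave", np.nan],
})

numeric_cols = df.select_dtypes(include="number").columns

# Constant fill: every numeric NaN becomes 0; string columns are left alone.
filled_zero = df.copy()
filled_zero[numeric_cols] = filled_zero[numeric_cols].fillna(0)

# Mean fill: each numeric NaN becomes its column mean.
filled_mean = df.fillna(df.mean(numeric_only=True))

# Forward fill: each NaN takes the last value seen above it in the column.
filled_ffill = df.ffill()

print(filled_zero)
print(filled_mean)
print(filled_ffill)
```

Note that the mean fill leaves the string column Alley untouched, so categorical columns need separate handling.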

timleathart
  • Thanks for the response. The dataset also consists of string values. I think `df.fillna()` will work on float or integer values. Any pointers on converting string values to numeric values? – Ahmed Dhanani Dec 26 '16 at 13:07
  • 1
    Ah, I had assumed the data was numeric for some reason. By string values, do you mean categorical data i.e. strings from a particular set of values? Then, you can use [scikit-learn's LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). Natural language, on the other hand, is more difficult to deal with. Bag-of-words is probably the easiest to think about, but have a look at [these](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) options. – timleathart Dec 26 '16 at 22:01
  • "`inplace` is evil": https://stackoverflow.com/questions/45570984/in-pandas-is-inplace-true-considered-harmful-or-not – fantabolous May 29 '23 at 05:47
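A minimal sketch of the approach discussed in these comments, assuming the string columns are categorical: fill the missing entries with an explicit placeholder category, then encode. The placeholder "Missing" and the toy values are assumptions made for illustration. pd.get_dummies is included as one common alternative to integer codes, since integer codes imply an ordering that the categories may not have:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column; values are invented for illustration.
df = pd.DataFrame({"GarageType": ["Attchd", np.nan, "Detchd", "Attchd", np.nan]})

# Treat "missing" as its own category ("Missing" is an arbitrary placeholder).
df["GarageType"] = df["GarageType"].fillna("Missing")

# Integer codes, assigned in sorted order of the category names.
encoder = LabelEncoder()
df["GarageType_code"] = encoder.fit_transform(df["GarageType"])

# One-hot alternative: one indicator column per category.
one_hot = pd.get_dummies(df["GarageType"], prefix="GarageType")

print(df)
print(one_hot.columns.tolist())
```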
1
  # Taking care of missing data
  import numpy as np
  from sklearn.impute import SimpleImputer
  imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
  imputer = imputer.fit(X[:, 1:3])
  X[:, 1:3] = imputer.transform(X[:, 1:3])

Suppose the array is named $X$ and I want to handle the missing data in the columns indexed $1$ and $2$ by replacing each missing value with its column mean. scikit-learn's SimpleImputer class (which replaced the older Imputer class, removed in version 0.22) does this.
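A self-contained sketch of the same idea on a toy array, assuming a recent scikit-learn where the imputer class is SimpleImputer in sklearn.impute (the older sklearn.preprocessing.Imputer was removed in version 0.22). The array values are invented for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array: column 0 is complete, columns 1 and 2 have gaps.
X = np.array([
    [1.0, 44.0, 72000.0],
    [2.0, np.nan, 48000.0],
    [3.0, 30.0, np.nan],
    [4.0, 38.0, 61000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# Fit on columns 1 and 2 only, then write the filled values back.
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

print(imputer.statistics_)  # per-column means used as fill values
print(X)
```

Changing strategy to "median" or "most_frequent" swaps the fill value without changing the rest of the code.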

Siong Thye Goh
smit patel
0

While timleathart has already provided the answer, I would like to add that in some cases, rather than using df.mean() to fill your NA values, it is better to use df.median(), which computes the median of each column.

The mean is sensitive to outliers: a single extreme value can pull it far from the typical value, whereas the median is largely unaffected.

Since you are a beginner, you might want to try both.
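To see the difference, here is a small sketch on a made-up column with one extreme value (the numbers are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one outlier.
area = pd.Series([100.0, 110.0, 120.0, np.nan, 5000.0], name="MasVnrArea")

print(area.mean())    # pulled upward by the 5000 outlier
print(area.median())  # stays near the typical values

# NaN is skipped when computing both statistics, then filled with the result.
filled_mean = area.fillna(area.mean())
filled_median = area.fillna(area.median())

print(filled_mean.tolist())
print(filled_median.tolist())
```

With the mean, the missing entry is filled with a value far above every ordinary observation; with the median, it lands among them.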

Ethan