Histograms in Machine Learning

Question

I have a large data set with over 100k samples, and I want to predict a continuous target from 4 other continuous features using scikit-learn. For this project, I would like to visualize and analyze the data using both one-dimensional and two-dimensional histograms. I know how to plot histograms and what a histogram displays mathematically, but how can I make good use of them to analyze my data?

One thing that comes to mind is that I could spot regions with outliers, but this doesn't seem so useful/efficient (correct me if I'm wrong).

So what are useful ways to use histograms for analyzing Machine Learning data?

Thanks

Outlier analysis (for pre-screening ML input) and data transformations, e.g. normalisation, are two clear applications of histograms. — M__, Jun 05 '19 at 13:46

score 1 · Answer 1 · answered Jun 05 '19 at 09:12

Besides simple histograms, I would suggest visualizing how the variables are associated with each other using a pair plot, e.g. seaborn.pairplot(). This lets you check how correlated your explanatory variables are with each other. Multicollinearity can be a problem, which you can address with dimensionality reduction, for example.
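A minimal sketch of that check, assuming the data sits in a pandas DataFrame. The column names x1..x4 and y, the synthetic data, and the 0.8 correlation cutoff are all made up for illustration. With 100k rows a pair plot gets slow and overplotted, so the sketch draws it on a random subsample.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical column names).
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + 0.1 * rng.normal(size=n),  # deliberately collinear with x1
    "x3": rng.normal(size=n),
    "x4": rng.exponential(size=n),
})
df["y"] = 2 * df["x1"] - df["x3"] + rng.normal(scale=0.5, size=n)

# Numeric counterpart of the pair plot: pairwise Pearson correlations
# between the explanatory variables only.
corr = df.drop(columns="y").corr()

# Feature pairs whose absolute correlation exceeds an arbitrary cutoff.
threshold = 0.8  # assumption, tune for your data
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)


def plot_pairs(frame, sample_size=2000):
    """Draw a pair plot (histograms on the diagonal) on a random subsample."""
    # Imported here so the numeric check above runs without a plotting backend.
    import matplotlib.pyplot as plt
    import seaborn as sns

    sample = frame.sample(n=min(sample_size, len(frame)), random_state=0)
    sns.pairplot(sample, diag_kind="hist", plot_kws={"s": 5, "alpha": 0.3})
    plt.show()
```

Calling plot_pairs(df) shows the 1D histograms on the diagonal and the pairwise scatter plots elsewhere. In this toy data only the x1/x2 pair should stand out as strongly correlated.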

Outliers might not be a problem, but you can't tell before running a model. I therefore suggest running the same model more than once, with and without the outliers. Also, always normalize your data, since this can affect the "outlierness" of an observation.
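As a rough sketch of the with/without comparison using scikit-learn: the linear model, the synthetic data, and the "any feature more than 4 standard deviations from its mean" outlier rule are assumptions chosen for illustration, not a recommendation. The scaler sits inside the pipeline so it is refit on each training fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 4 continuous features, 1 continuous target.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=n)

# Corrupt one feature in 1% of the rows to create obvious outliers.
bad_rows = rng.choice(n, size=50, replace=False)
X[bad_rows, 0] += 15.0

# Simple outlier rule (an assumption): flag rows where any feature lies
# more than 4 standard deviations from that feature's mean.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
keep = (z < 4).all(axis=1)

# Same model both times; scaling happens inside each CV fold.
model = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)

r2_with = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
r2_without = cross_val_score(model, X[keep], y[keep], cv=cv, scoring="r2").mean()

print(f"rows kept: {keep.sum()} of {n}")
print(f"mean R^2 with outliers:    {r2_with:.3f}")
print(f"mean R^2 without outliers: {r2_without:.3f}")
```

Note that the two scores are computed on different subsets of the data, so they are only a first indication. For a stricter comparison, evaluate both fitted models on one common held-out set.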
