I have a large data set with over 100k samples, and I want to predict a continuous target from 4 other continuous features using scikit-learn. For this project, I would like to visualize and analyze the data using both one-dimensional and two-dimensional histograms. I know how to plot histograms and what a histogram displays mathematically, but how can I make good use of them to analyze my data?

One thing that comes to mind is that I could spot regions with outliers, but this doesn't seem so useful/efficient (correct me if I'm wrong).

So what are useful ways to use histograms for analyzing Machine Learning data?

Thanks

Brian Spiering
user120112
  • Outlier analysis for pre-screening ML input and data transformations, e.g. normalisation, are two clear applications of histograms. – M__ Jun 05 '19 at 13:46

1 Answer

Beyond simple histograms, I would suggest visualizing how the variables are associated with each other using a pair plot, e.g. seaborn.pairplot(). This lets you check how strongly your explanatory variables are correlated with each other. Strong correlation between them (multicollinearity) can be a problem, which you can address with dimensionality reduction, for example.
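As a rough sketch of that check, the snippet below computes 1D and 2D histogram counts and the pairwise correlation matrix with NumPy, then lists strongly correlated feature pairs. The data is synthetic, and the feature names `f1` to `f4`, the 0.8 threshold, and the helper `plot_pairs` are assumptions made for illustration, not part of the original question. The seaborn call sits inside the helper so that the numerical part runs without any plotting libraries installed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
names = ["f1", "f2", "f3", "f4"]  # hypothetical feature names

# Synthetic stand-in data (assumption): f2 is nearly collinear with f1.
f1 = rng.normal(size=n)
f2 = 0.9 * f1 + 0.1 * rng.normal(size=n)
f3 = rng.exponential(size=n)  # right-skewed feature
f4 = rng.uniform(size=n)
X = np.column_stack([f1, f2, f3, f4])

# 1D histogram of one feature and 2D histogram of one feature pair.
counts, edges = np.histogram(f3, bins=50)
H, xedges, yedges = np.histogram2d(f1, f2, bins=50)

# Pairwise Pearson correlations between the columns of X.
corr = np.corrcoef(X, rowvar=False)

# Feature pairs whose absolute correlation exceeds an arbitrary threshold.
threshold = 0.8
high = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print("strongly correlated pairs:", high)


def plot_pairs(X, names, max_points=5000):
    """Optional: draw a seaborn pair plot on a random subsample of the rows."""
    import pandas as pd
    import seaborn as sns

    idx = rng.choice(len(X), size=min(max_points, len(X)), replace=False)
    df = pd.DataFrame(X[idx], columns=names)
    return sns.pairplot(df, diag_kind="hist")
```

Subsampling before plotting is a practical choice here: a scatter matrix of all 100k rows is slow to draw and tends to overplot, whereas the 2D histogram counts in `H` summarize the full data.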

Outliers might not be a problem, but you can't tell before running a model. I therefore suggest fitting the same model more than once, with and without the outliers, and comparing the results. Also, always normalize your data, as this can affect the "outlierness" of an observation.
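A minimal sketch of the with/without comparison follows. It rests on several assumptions: synthetic data with a few deliberately corrupted targets, a plain least-squares fit via NumPy standing in for whatever scikit-learn regressor you actually use, and a simple rule that flags rows whose standardized target lies more than 3 standard deviations from the mean. The helpers `fit_linear` and `rmse` are hypothetical names introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 4 continuous features, 1 target.
n_train, n_test = 2000, 500
true_w = np.array([1.0, -2.0, 0.5, 3.0])
X_train = rng.normal(size=(n_train, 4))
X_test = rng.normal(size=(n_test, 4))
y_train = X_train @ true_w + rng.normal(scale=0.5, size=n_train)
y_test = X_test @ true_w + rng.normal(scale=0.5, size=n_test)

# Corrupt 1% of the training targets so there are outliers to find.
bad = rng.choice(n_train, size=20, replace=False)
y_train[bad] += 50.0


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def rmse(coef, X, y):
    """Root mean squared error of the linear model on (X, y)."""
    pred = np.column_stack([np.ones(len(X)), X]) @ coef
    return float(np.sqrt(np.mean((pred - y) ** 2)))


# Standardize the target and keep rows within 3 standard deviations.
z = (y_train - y_train.mean()) / y_train.std()
keep = np.abs(z) < 3.0

# Fit the same model twice: on all rows, and on the trimmed rows.
coef_all = fit_linear(X_train, y_train)
coef_trimmed = fit_linear(X_train[keep], y_train[keep])

rmse_all = rmse(coef_all, X_test, y_test)
rmse_trimmed = rmse(coef_trimmed, X_test, y_test)
print("rows dropped:", int(np.sum(~keep)))
print("test RMSE, all rows:    ", round(rmse_all, 3))
print("test RMSE, trimmed rows:", round(rmse_trimmed, 3))
```

Both fits are scored on the same held-out set, so any difference in error comes from the training rows that were dropped. On real data the flagged rows may be genuine observations rather than errors, so the comparison informs the decision rather than making it.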

Leevo