I assume you are asking about tabular data, not vision or NLP. It doesn't exist because 1) there are so many types of data and weird problems and 2) univariate EDA is generally not enough. I can detail what I usually do for univariate analysis if that helps.
In the context of supervised ML, I have a generic R Markdown file that generates a Word report on a given variable. I like the Word format because I can annotate, comment and share the reports easily in a professional context. I use another R script to generate those reports for all of my variables. At some point I tried to do the same in Python but didn't manage to get something with similarly easy formatting.
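The driver script is nothing fancy. A minimal sketch, assuming a parameterized template called `univariate_report.Rmd` with a `variable` parameter (all file and column names here are hypothetical, not my actual setup):

```r
# Hypothetical driver: one Word report per variable from a parameterized Rmd template.
library(rmarkdown)

df   <- read.csv("train.csv")                 # assumed training data
vars <- setdiff(names(df), "target")          # "target" is the assumed outcome column

for (v in vars) {
  render(
    input         = "univariate_report.Rmd",  # template reading params$variable
    output_format = "word_document",
    output_file   = paste0(v, "_report.docx"),
    params        = list(variable = v)
  )
}
```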
In this report I have by default, for continuous variables (possible tweaks in parentheses):
- the name of the variable and its format
- the univariate distribution plot and a log-scale version (removing the 0.001 and 0.999 quantile outliers and missing values; this gets tricky if the variable takes negative values - I often use a |x| * log(1 + c*|x|) transformation, not optimal but OK) - see the rough sketch after this list
- the same graphs but with the distribution split by positive / negative class (scaling can be tricky for unbalanced problems)
- some sort of partial dependency plot (group the variable into buckets and look at the positive ratio in each bucket), also covered in the sketch below
- a table with the main values (mean, median, min, max, extreme quantiles, count of missing values, count of weird encodings used for NA; it also detects if any single value is taken by more than 5% of instances and gives that count) and the associated positive ratio as calculated above
- the distribution (generally on a log scale) broken down by some of my main identification variables (think sex or race), by time (to look for drift), and the geographical distribution of some average value
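To make the plotting and bucketing bullets concrete, here is roughly what that code looks like. A simplified sketch, assuming a data frame `df` with a 0/1 `target` column and a non-negative continuous variable `x` (placeholder names), not my actual template:

```r
library(dplyr)
library(ggplot2)

df <- read.csv("train.csv")
x_name <- "x"

# Drop missing values and trim the 0.001 / 0.999 quantile outliers.
q <- quantile(df[[x_name]], c(0.001, 0.999), na.rm = TRUE)
d <- df %>%
  filter(!is.na(.data[[x_name]]),
         .data[[x_name]] >= q[1],
         .data[[x_name]] <= q[2])

# Distribution per class on a log(1 + x) scale (only valid for non-negative values).
ggplot(d, aes(x = log1p(.data[[x_name]]), fill = factor(target))) +
  geom_density(alpha = 0.4) +
  labs(title = paste("log(1 + x) distribution of", x_name), fill = "class")

# Bucketed "partial dependency": positive ratio per quantile bucket.
d %>%
  mutate(bucket = ntile(.data[[x_name]], 10)) %>%
  group_by(bucket) %>%
  summarise(n = n(),
            positive_ratio = mean(target),
            mean_x = mean(.data[[x_name]]))
```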
This is relatively difficult to do, as each variable is different, there is always some weird variable causing a bug, and formatting is a pain. As you can see, it is quite dependent on your variables and the problem. Nothing easy (it took me around 150 tries to get it working on all of my variables the first time).
Then you have to look at each report individually; this can take multiple days if you want to do it properly, as there is no rule on what exactly you are looking for. Sometimes it's a weird bump in a distribution, sometimes it's missing values encoded incorrectly, sometimes it's a discrepancy between categories, sometimes it's a very skewed variable. As the problems often depend on the data generating process, their solutions depend on it too, and there is no general rule for dealing with them.
At some point I tried to build something similar for categorical variables but didn't get anything satisfactory. Provided the number of categories is quite low, I just compute the count of the positive class by category and some of the main table mentioned above.
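Concretely, the categorical version boils down to something like this (again a sketch; `cat_var` and `target` are placeholder names):

```r
library(dplyr)

df <- read.csv("train.csv")

# Count, share of instances and positive ratio per category; NA forms its own group.
df %>%
  group_by(cat_var) %>%
  summarise(n = n(),
            share = n() / nrow(df),
            positive_ratio = mean(target)) %>%
  arrange(desc(n))
```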
Then things get weird as you move to multivariate EDA. What I usually do:
- Some dimension reduction (UMAP) to get a 2D plot and see whether there are clusters, and whether the output is clustered too (first sketch below).
- Some linear correlation analysis: correlation matrix plus some clustering / dendrograms, but I rarely remove anything, as I work with imbalanced data sets where the information may come from the difference between two highly correlated variables.
- I abandon any idea of an iterative variable selection process and go with one simple model on a dozen variables (selected by experts) to create a benchmark (think glmnet), then a model with strong regularisation on all my variables (xgboost or a vanilla NN) - second sketch below.
- Then I remove the 50%-80% of variables that have no importance in the big model.
- After that there is no rule (except answering your manager positively)
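For the first two multivariate bullets, a rough sketch, assuming all features are numeric and the target is 0/1 (the uwot package and all column names are just what I would reach for, not a prescription):

```r
library(uwot)

df  <- read.csv("train.csv")
num <- df[, setdiff(names(df), "target")]   # numeric features only
ok  <- complete.cases(num)                  # UMAP needs complete rows
X   <- scale(as.matrix(num[ok, ]))

# 2D UMAP embedding, colored by class, to eyeball clustering structure.
emb <- umap(X, n_neighbors = 15, min_dist = 0.1)
plot(emb, col = ifelse(df$target[ok] == 1, "red", "grey50"), pch = 19, cex = 0.5,
     xlab = "UMAP 1", ylab = "UMAP 2")

# Correlation matrix + hierarchical clustering / dendrogram on 1 - |cor|.
C  <- cor(X, use = "pairwise.complete.obs")
hc <- hclust(as.dist(1 - abs(C)), method = "average")
plot(hc, main = "Variable clustering on 1 - |correlation|", cex = 0.7)
```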
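And the benchmark / regularised model step, in the same spirit. The expert variable list and all hyperparameters are made up for the example:

```r
library(glmnet)
library(xgboost)

df <- read.csv("train.csv")
y  <- df$target
X  <- as.matrix(df[, setdiff(names(df), "target")])

# Benchmark: penalised logistic regression on a small expert-selected subset
# (assuming no missing values in these columns).
expert_vars <- c("var_a", "var_b", "var_c")            # placeholder names
bench <- cv.glmnet(X[, expert_vars], y, family = "binomial", alpha = 0.5)

# Strongly regularised gradient boosting on all variables.
dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(params = list(objective = "binary:logistic",
                               max_depth = 3, eta = 0.05,
                               subsample = 0.8, colsample_bytree = 0.8),
                 data = dtrain, nrounds = 300)

# Variables that never appear in the importance table are the ones I drop.
imp  <- xgb.importance(feature_names = colnames(X), model = bst)
drop <- setdiff(colnames(X), imp$Feature)   # typically 50%-80% of the variables
```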