EDA for analysis of nominal variable with high cardinality

Question

I have a nominal variable (car model) with very high cardinality (~8500 labels) and I would like to analyse its relation with a binary target variable. While I can create logical groups and compare the distribution of target variable for each of the groups, can anyone suggest if there are any superior techniques/visualization tools for this type of analysis?

score 1 · Answer 1 · answered Mar 01 '19 at 13:09

You can calculate the mean of the target for each level of the categorical variable and compare the values. In pandas this is a one-liner: df.groupby('categorical_feature').target.mean()

Then you can plot a histogram of these means to compare the levels. Seaborn also has catplot, which (with kind='bar') does the same thing as a bar plot, showing the mean of the target variable for each level of the categorical one.
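As a rough sketch of the idea above, the snippet below computes the per-level mean of a 0/1 target together with the row count for each level, then bins the means the way a histogram would. The data is synthetic, and the column names (`car_model`, `target`) and the `min_count` cutoff are assumptions made for illustration, not something from the question.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: column names and sizes are hypothetical.
rng = np.random.default_rng(0)
n_rows, n_models = 20_000, 500
df = pd.DataFrame({
    "car_model": rng.integers(0, n_models, size=n_rows).astype(str),
    "target": rng.integers(0, 2, size=n_rows),
})

# The mean of a 0/1 target per level is the share of positives in that level.
# Keeping the count alongside shows how much data backs each mean.
summary = (
    df.groupby("car_model")["target"]
      .agg(target_rate="mean", n="count")
      .sort_values("target_rate", ascending=False)
)

# Means from levels with few rows are noisy; min_count is an arbitrary cutoff.
min_count = 30
reliable = summary[summary["n"] >= min_count]

# Bin the per-level means to see how they are distributed across levels.
hist_counts, bin_edges = np.histogram(
    reliable["target_rate"], bins=10, range=(0.0, 1.0)
)

print(summary.head())
print(hist_counts)
```

With matplotlib installed, `reliable["target_rate"].plot.hist(bins=20)` should draw the same distribution. With thousands of levels, a bar plot per level is hard to read, so it may be more practical to plot only the top and bottom few rows of `summary`.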

Victor Oliveira

My target variable is dichotomous, so taking the mean is not an option. Maybe I can take counts, but the real problem is that I have around 8000 levels in one categorical attribute. How can I study that? – Rohit Gavval Mar 07 '19 at 09:43
@RohitGavval, you can still calculate the mean of a binary variable. For a 0/1 target it is the proportion of positives in each level, something like 0.333 or 0.67, and that is the point. See my answer to this question, where I link to further explanation of the methods mentioned: https://datascience.stackexchange.com/questions/46780/what-are-the-approaches-to-aggregate-categorical-variables/46787?noredirect=1#comment53607_46787 – Victor Oliveira Mar 07 '19 at 11:23
