Remove Outliers - Market Basket Analysis

Question

I'm wondering whether I should remove outliers. I'm trying to find the tags that are commonly used together. Imagine that I have the following dataset. The first column is the Tag_ID and the second column is the number of people who used that tag.

1   3472034  
2   1277918  
3   1249839  
4   1010770  
5   915099  
6   898292  
7   636792  
8   604352  
9   555673  
10  298495  
11  291511  
12  211074  
13  200868  
...

(This was copied from my actual dataset).

My question is: Should I remove a tag when it is much more frequent than the others? Is that regarded as good practice?

Many thanks!

I don't think this can be answered without knowing your experiment. Sometimes it's good to remove outliers, sometimes not. — Hobbes, Nov 15 '16 at 15:59

score 2 · Accepted Answer · answered Nov 15 '16 at 01:13

Since I cannot comment to ask for clarification, I am asking here. Why are you thinking about removing the most frequent value in your dataset? If the second column actually represents frequency of usage, you probably should not remove it; throwing away that piece of information would be illogical. That said, you may consider removing the tag if it is a "less meaningful" word (e.g. "a", "an").

Can you give a little bit more context on what you are trying to achieve?

In general, one way to find outliers is the interquartile range (IQR) rule: flag points that lie more than 1.5 times the IQR below the first quartile or above the third quartile of the distribution, which here is the distribution of the frequency counts in your data.
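As a rough illustration of that rule, here is a minimal Python sketch applied to the 13 rows shown in the question. The names `tag_counts` and `iqr_outliers` are made up for this example, the rows elided by "..." in the question are not included, and quartile conventions differ between tools (this uses the default method of `statistics.quantiles`), so the exact fences may vary slightly.

```python
from statistics import quantiles

# Tag_ID -> number of people who used the tag (only the 13 rows visible in the question).
tag_counts = {
    1: 3472034, 2: 1277918, 3: 1249839, 4: 1010770, 5: 915099,
    6: 898292, 7: 636792, 8: 604352, 9: 555673, 10: 298495,
    11: 291511, 12: 211074, 13: 200868,
}


def iqr_outliers(counts, k=1.5):
    """Return the entries whose count lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(list(counts.values()), n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {tag: c for tag, c in counts.items() if c < lower or c > upper}


print(iqr_outliers(tag_counts))  # {1: 3472034}
```

On these rows only Tag_ID 1 is flagged. Being flagged only means the count is unusually large compared with the rest; whether to drop the tag still depends on what it represents, as discussed above.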

Just a quick thought: did you try clustering to find similar tags? What approaches are you considering for finding similar tags?

Since that Tag_ID has a count very far from the others, I was thinking that ID could be an inconsistent value — Pedro Alves, Nov 15 '16 at 09:09
It can be something else as well. It can just be the most natural thing for your users. It depends on the nature of the data and the analysis. — Sal, Nov 15 '16 at 23:33
