Anomaly detection on sparse categorical data

Question

I have a large dataset with a "clientid" column and a categorical "choice" column. I want to find the clients that have strange (i.e. infrequent) combinations of choices, and to be able to flag strange combinations from new clients immediately in the future.

clientid	choice
cl1	a
cl2	b
cl2	c
cl3	d
cl4	b
cl4	c

If I transpose the table by clientid, I get one row per client and one column per choice, which becomes a sparse dataset of categorical variables (choices). Some clients have only one choice and some have several, and I want to find the outlier records (clientid).
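For illustration, the transposed table described above could be built roughly like this. This is only a sketch on the six sample rows from the question, assuming pandas is available; the variable names df and wide are made up for the example.

```python
import pandas as pd

# The sample rows from the question
df = pd.DataFrame({
    "clientid": ["cl1", "cl2", "cl2", "cl3", "cl4", "cl4"],
    "choice":   ["a",   "b",   "c",   "d",   "b",   "c"],
})

# One row per client, one column per choice.
# crosstab counts occurrences; clip(upper=1) turns the counts into 0/1 flags
# (assumption: a repeated choice for the same client should count once).
wide = pd.crosstab(df["clientid"], df["choice"]).clip(upper=1)
print(wide)
```

With many distinct choices most cells of this table are 0, which is the sparsity mentioned above.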

Which type of algorithm could help with this type of problem? It is unsupervised, so I don't know what the normal combinations are, and the data is sparse and categorical.

Are the rows ordered? i.e. would you consider 'a','b','c' different from 'b','a','c'? — WBM, Mar 04 '21 at 14:01
You probably need to do some kind of dimensionality reduction. Since you have users and choices, ALS seems reasonable: you will end up with a client_feature matrix and a choice_feature matrix. You will probably only care about client_feature, and you can use these latent features to do anomaly detection. — Akavall, Mar 05 '21 at 03:22

WBM · Answer 1 · 2021-03-16T16:38:19.637

1

No need for machine learning here.

After you've transposed the dataframe, count how often each unique combination occurs in the new column, and then rank the combinations by frequency. Set a suitable threshold of "rareness" (e.g. a freq of 2 or less in the example below) and you will have your list of strange combinations.

There's a method in Pandas for this called value_counts() (available on both Series and DataFrame).

e.g.

combination	freq
a,b	1
a,c	1
a,d	1
a,b,c	2
a,b,c,d	10
b,d	10

Then just compare your new combinations with your "bank of rare combinations", and update the bank when a combination is no longer rare.
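A minimal sketch of the whole approach, using the sample rows from the question. The helper names (combo_of, is_strange, register) and the max_freq threshold are made up for illustration. It also assumes that the order of choices does not matter, so 'a','b' and 'b','a' count as the same combination, and that a combination never seen before counts as rare.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "clientid": ["cl1", "cl2", "cl2", "cl3", "cl4", "cl4"],
    "choice":   ["a",   "b",   "c",   "d",   "b",   "c"],
})

def combo_of(choices):
    """Canonical key for a set of choices (order and duplicates ignored)."""
    return ",".join(sorted(set(choices)))

# One combination per client, then the frequency of each combination
client_combos = df.groupby("clientid")["choice"].agg(combo_of)
freq = client_combos.value_counts()
print(freq)

# Keep the frequencies in a Counter so they are easy to update later
counts = Counter(freq.to_dict())

def is_strange(choices, counts, max_freq=1):
    """True if the combination occurs max_freq times or fewer (unseen = 0)."""
    return counts[combo_of(choices)] <= max_freq

def register(choices, counts):
    """Record a new client's combination so rare ones can stop being rare."""
    counts[combo_of(choices)] += 1

print(is_strange(["c", "b"], counts))  # False: "b,c" occurs twice
print(is_strange(["a", "d"], counts))  # True: never seen before
```

Calling register() for every new client after checking it keeps the "bank" up to date, so a combination stops being flagged once it has been seen more than max_freq times. On real data the threshold would need tuning to the dataset size.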

edited Mar 16 '21 at 16:38

answered Mar 04 '21 at 14:00

WBM


Thanks but I want to find out for future clients if their combination is strange or not. – DataLover Mar 04 '21 at 23:46
Low frequency would mean strange i.e. rare or anomalous – WBM Mar 05 '21 at 08:52
And then just compare new clients to this rare event list – WBM Mar 05 '21 at 09:00
Please accept the answer if it resolved your issue or let me know if something is unclear – WBM Apr 29 '21 at 11:27
