
I have a big dataset with a "clientid" column and a categorical "choice" column. I want to find the clients that have strange (infrequent) combinations of choices, and to be able to flag strange combinations from new clients immediately in the future.

clientid choice
cl1 a
cl2 b
cl2 c
cl3 d
cl4 b
cl4 c

If I transpose the table by clientid, I get one row per client and one column per choice, which becomes a sparse dataset of categorical variables (choices). Some clients have only one choice and some have several, and I want to find the outlier records (clientids).

Which type of algorithm could help with this kind of problem? It is unsupervised, so I don't know what the normal combinations are, and the data is sparse and categorical.

DataLover
  • Are the rows ordered? i.e. would you consider 'a','b','c' different from 'b','a','c'? – WBM Mar 04 '21 at 14:01
  • A,b,c or c,b,a is the same in my situation – DataLover Mar 04 '21 at 23:44
  • You probably need to do some kind of dimensionality reduction. Since you have users and choices, ALS seems reasonable: you will end up with a client_feature matrix and a choice_feature matrix. You will probably only care about client_feature, and you can use these latent features to do anomaly detection. – Akavall Mar 05 '21 at 03:22

1 Answer


No need for machine learning here.

After you've transposed the dataframe so that each client has a single combination of choices, count how many times each unique combination occurs and rank the combinations by frequency. Set a suitable "rareness" threshold (e.g. freq <= 2 in the example below), and the combinations at or below it are your list of strange combinations.

Pandas has a tool for this: Series.value_counts() (or DataFrame.value_counts() to count unique rows).

e.g.

combination freq
a,b 1
a,c 1
a,d 1
a,b,c 2
a,b,c,d 10
b,d 10
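As a minimal sketch of the counting step, assuming the long-format table from the question: the column names come from the question, but the toy rows, the RARE_MAX_FREQ threshold, and the variable names are illustrative assumptions. Choices are sorted before joining so that "b,c" and "c,b" count as the same combination.

```python
import pandas as pd

# Toy data in the question's long format (one row per client/choice pair).
df = pd.DataFrame({
    "clientid": ["cl1", "cl2", "cl2", "cl3", "cl4", "cl4"],
    "choice":   ["a",   "b",   "c",   "d",   "c",   "b"],
})

# Collapse each client's choices into one order-independent key, e.g. "b,c".
combos = df.groupby("clientid")["choice"].agg(
    lambda s: ",".join(sorted(set(s)))
)

# Frequency of each combination across clients, most common first.
freq = combos.value_counts()

# Hypothetical rareness threshold: tune this for your own data.
RARE_MAX_FREQ = 1
rare_combos = set(freq[freq <= RARE_MAX_FREQ].index)

# Clients whose combination is rare.
strange_clients = sorted(combos[combos.isin(rare_combos)].index)
print(strange_clients)
```

With this toy data, cl2 and cl4 share the combination "b,c", so only the clients with a unique combination are flagged.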

Then just compare new combinations against your "bank of rare combinations", and update the bank when a combination is no longer rare.
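One way to keep such a bank up to date is to store the counts themselves and derive "rare" from them, so a combination stops being flagged once it has been seen often enough. The sketch below uses only the standard library; the ComboBank class, its method names, and the rare_max_freq parameter are hypothetical names chosen for illustration, not part of any library.

```python
from collections import Counter


def combo_key(choices):
    """Order-independent key for a client's set of choices."""
    return ",".join(sorted(set(choices)))


class ComboBank:
    """Keeps running counts of combinations (hypothetical helper)."""

    def __init__(self, rare_max_freq=2):
        # A combination seen at most this many times counts as rare.
        self.rare_max_freq = rare_max_freq
        self.counts = Counter()

    def add(self, choices):
        """Record one client with this combination."""
        self.counts[combo_key(choices)] += 1

    def observe(self, choices):
        """Return True if the combination is rare so far, then record it."""
        rare = self.counts[combo_key(choices)] <= self.rare_max_freq
        self.add(choices)
        return rare


# Seed the bank with historical clients.
bank = ComboBank(rare_max_freq=2)
for _ in range(10):
    bank.add(["b", "d"])
bank.add(["a", "b"])

# Score new clients as they arrive.
common_flag = bank.observe(["d", "b"])   # seen 10 times before
rare_flag = bank.observe(["b", "a"])     # seen once before
unseen_flag = bank.observe(["x"])        # never seen before
print(common_flag, rare_flag, unseen_flag)
```

Because the counts are updated on every observation, a combination that keeps appearing crosses the threshold on its own and no separate clean-up of the rare list is needed.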

WBM