I have a dataset where I have to detect anomalies. I take a subset of the data (let's call that subset A) and apply the DBSCAN algorithm to detect anomalies on set A. Once the anomalies are detected, I use the DBSCAN labels to create a label variable (anomaly: 1, non-anomaly: 0) in dataset A. I then train a supervised algorithm on dataset A to predict the anomalies, using this label as the dependent/target variable, and finally use the trained supervised model to predict anomalies on the rest of the data (A's complement).
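In code terms the workflow looks roughly like this (a minimal sketch with scikit-learn on synthetic stand-in data; eps, min_samples, and the choice of RandomForestClassifier are placeholders, not my actual setup):

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.cluster import DBSCAN
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in for the real features.
    X, _ = make_blobs(n_samples=2000, centers=3, random_state=0)

    # Subset A for DBSCAN; the rest is A's complement.
    idx = np.random.default_rng(0).permutation(len(X))
    A, rest = idx[:500], idx[500:]

    # Step 1: DBSCAN on A; noise points (label -1) are treated as anomalies.
    db = DBSCAN(eps=0.5, min_samples=5).fit(X[A])
    y_A = (db.labels_ == -1).astype(int)  # anomaly: 1, non-anomaly: 0

    # Step 2: train a supervised model on A with the derived labels.
    clf = RandomForestClassifier(random_state=0).fit(X[A], y_A)

    # Step 3: score A's complement with the trained classifier.
    anomaly_pred = clf.predict(X[rest])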

While this seems like a fair approach to me, I am wondering if there is any data leakage happening at any stage. Please note that I am using the same set of variables/features at both stages (unsupervised and supervised). The reason for posting is that when I train the supervised model, I get a very high ROC-AUC score, around 0.99XX, which is suspicious.

Note that I cannot apply the DBSCAN algorithm to the entire dataset because of computational constraints, and I cannot start with a supervised model because I do not have labels.

2 Answers

Please take care of stratification when sampling data for training: pass stratify=y to train_test_split.
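For example (a minimal sketch; X and y are dummy stand-ins for your features and the DBSCAN-derived labels):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(1000, 15)                # dummy features
    y = np.random.binomial(1, 0.05, size=1000)  # ~5% anomalies, illustrative

    # stratify=y keeps the anomaly ratio identical in the train and test
    # splits, which matters because anomalies are a small minority class.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )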

A high ROC-AUC may be pointing to an imbalanced dataset; see when-is-an-auc-score-misleadingly-high.

Also check this out: k-fold-cross-validation-auc-score-vs-test-auc-score.

Sandeep Bhutani
  • +1. I use 5-fold cross-validation with stratification for validation and compute my ROC-AUC score on the out-of-fold data. However, do you think that the high AUC score on the out-of-fold sample is not because of data leakage? – Indranil Bhattacharya Sep 17 '19 at 16:01
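A minimal sketch of the out-of-fold scheme described above, assuming scikit-learn (the data and classifier are stand-ins, not the asker's actual setup):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    X = np.random.rand(1000, 15)                # dummy features
    y = np.random.binomial(1, 0.05, size=1000)  # dummy labels, ~5% anomalies

    # Collect each row's predicted probability while it sits in the validation
    # fold, then compute ROC-AUC on these out-of-fold predictions.
    oof = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, val_idx in skf.split(X, y):
        clf = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        oof[val_idx] = clf.predict_proba(X[val_idx])[:, 1]

    print("out-of-fold ROC-AUC:", roc_auc_score(y, oof))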

Without knowing exactly your dependent variable, your explanatory variables, and the volume of data you have at hand, it is hard to give a good diagnosis.

However, 99%+ scores often hide problems within models. To rule out the data leakage problem: before doing anything else, keep a hold-out set that participates in neither the unsupervised learning nor the supervised one, and evaluate on it.
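For instance (a minimal sketch on synthetic data; labelling the hold-out via an independent DBSCAN run is an assumption about how reference labels could be obtained, and all parameter values are placeholders):

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    # Synthetic stand-in: dense clusters plus a sprinkle of uniform outliers.
    rng = np.random.default_rng(0)
    clusters = rng.normal(0, 1, size=(4800, 2)) + rng.choice([-6.0, 0.0, 6.0], size=(4800, 1))
    outliers = rng.uniform(-10, 10, size=(200, 2))
    X = rng.permutation(np.vstack([clusters, outliers]))

    # Carve out the hold-out set BEFORE any learning happens.
    hold, work = X[:1000], X[1000:]

    # Usual pipeline on the working data only.
    y_work = (DBSCAN(eps=0.7, min_samples=5).fit(work).labels_ == -1).astype(int)
    clf = RandomForestClassifier(random_state=0).fit(work, y_work)

    # Reference labels for the hold-out from a separate DBSCAN run; the
    # hold-out must stay small enough for DBSCAN to remain feasible.
    y_hold = (DBSCAN(eps=0.7, min_samples=5).fit(hold).labels_ == -1).astype(int)

    # Evaluate on data that participated in neither learning stage.
    print("hold-out ROC-AUC:", roc_auc_score(y_hold, clf.predict_proba(hold)[:, 1]))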

Blenz
  • +1, @Blenz, thanks for your comment. To give you a perspective, I have around 30 million records with 15 variables. I use 1% of this for DBSCAN, and the remaining 99% would be used for inference from the trained classifier. – Indranil Bhattacharya Oct 18 '19 at 17:11