
The accuracy before oversampling:

On training: 98.54%. On testing: 98.21%.

The accuracy after oversampling:

On training: 77.92%. On testing: 90.44%.

What does this mean, and how can I increase the accuracy?

Edit:

Classes before SMOTE:

dataset['Label'].value_counts()

BENIGN           168051
Brute Force        1507
XSS                 652
Sql Injection        21

Classes after SMOTE:

BENIGN           117679 
Brute Force      117679 
XSS              117679 
Sql Injection    117679 

I used the following models:

- Random Forest:
  Train score: 0.49   Test score: 0.85
- Logistic Regression:
  Train score: 0.72   Test score: 0.93
- LSTM:
  Train score: 0.79   Test score: 0.98
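
For reference, a minimal sketch of how the resampling is typically wired up (SMOTE on the training split only, as clarified in the comments below; the variable names and the stratified split are assumptions, not taken from the original code):

    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # Assumed names: features are everything except the 'Label' column
    X = dataset.drop(columns=['Label'])
    y = dataset['Label']

    # 80/20 split; SMOTE must only ever see the training portion
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    X_train, y_train = SMOTE(random_state=0).fit_resample(X_train, y_train)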


Mimi
  • It depends on the kind of imbalance you have. It seems oversampling is not needed, but again it depends on the amount of imbalance present. For example, if it is 99.9%-0.01%, then it is highly imbalanced and not much can be done – Nikos M. Jun 20 '21 at 18:05
  • I used SMOTE, and I used this method because some classes are very small compared to others; for example, class_3 has only 21 samples, while class_1 has 168051. – Mimi Jun 20 '21 at 19:04
  • This is weird. The accuracy on the test set is higher than on the training set. What is the imbalance ratio? How many samples are in the train and test sets? – Akash Dubey Jun 21 '21 at 04:04
  • Dear @AkashDubey, the dataset split is 80%/20%. For the imbalance ratio, I did not specify this argument to SMOTE, but after oversampling the number of samples in each class is the same. NB: SMOTE is applied only to the training set. – Mimi Jun 21 '21 at 12:10
  • @Mimi A possible explanation; please look at this: https://stackoverflow.com/questions/51464591/test-accuracy-is-greater-than-train-accuracy-what-to-do/51468429 Also, this: https://stats.stackexchange.com/questions/59630/test-accuracy-higher-than-training-how-to-interpret – Akash Dubey Jun 21 '21 at 12:14
  • @AkashDubey Thank you, but the linked questions did not use any oversampling method; as you can see in my question, the accuracy is low after using oversampling. – Mimi Jun 21 '21 at 12:34
  • @Mimi I don't think this is because of the oversampling technique that you used. There is something inherently wrong with the model. What was the distribution of class labels before oversampling? How many classes? What modelling technique are you using? Can you please add these details to the question? – Akash Dubey Jun 21 '21 at 12:38
  • @AkashDubey, I added some details. – Mimi Jun 21 '21 at 12:56
  • Why would you want to oversample instead of modeling probabilities? – Dave Jun 21 '21 at 13:55

2 Answers


Accuracy is generally not a very good metric, especially in the presence of serious class imbalance. In your case, always predicting BENIGN would achieve an accuracy of 98.72% yet be useless, while your models might be useful despite having lower accuracy.
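
For concreteness, that baseline follows directly from the class counts given in the question:

    # Majority-class baseline: always predict BENIGN
    counts = {'BENIGN': 168051, 'Brute Force': 1507, 'XSS': 652, 'Sql Injection': 21}
    print(counts['BENIGN'] / sum(counts.values()))  # ~0.9872, i.e. 98.72% accuracy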

That oversampling hurts your training accuracy is natural. The largest effect of oversampling is that the predicted probabilities are shifted: if you predict the class with the largest predicted probability, this results in many more predictions of the minority classes, which is wrong from an accuracy point of view. (One thing wasn't made clear in the post: are you measuring performance on the resampled data or the original? The resampled data won't suffer from the effect above, but might well suffer from poor predictions of the tiny Sql Injection class, which might not carry enough signal to identify properly.)
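
A minimal sketch of that shift, assuming arrays X_train, y_train, X_test from a split that SMOTE never touched (none of these names come from the post):

    from sklearn.linear_model import LogisticRegression
    from imblearn.over_sampling import SMOTE

    # Same model fit on the original vs. the SMOTE-resampled training data
    clf_raw = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
    clf_res = LogisticRegression(max_iter=1000).fit(X_res, y_res)

    # Average predicted probability per class on the untouched test set:
    print(clf_raw.predict_proba(X_test).mean(axis=0))  # mass concentrated on BENIGN
    print(clf_res.predict_proba(X_test).mean(axis=0))  # mass shifted toward minorities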

As Dave says in a comment, it is best to start without oversampling and to pick a metric that properly captures the cost/benefit trade-off of true/false positives/negatives for each class. Oversampling might then turn out to be beneficial, but if you are using the predicted probabilities, it is unlikely to give a huge lift.
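
As a sketch of such an evaluation (again assuming X_train, y_train, X_test, y_test from a stratified split; the model here is only an example):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, balanced_accuracy_score

    model = RandomForestClassifier(class_weight='balanced', random_state=0)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Per-class precision/recall/F1 shows how the minority classes actually fare
    print(classification_report(y_test, y_pred, digits=4))
    # Balanced accuracy is the unweighted mean of per-class recalls
    print(balanced_accuracy_score(y_test, y_pred))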

Ben Reiniger

It is strange that the test accuracy is greater than the training accuracy. However, after looking at the distribution of classes, some plausible observations/explanations are:

  1. The classes are highly imbalanced for a multi-class classification setting.
  2. You are using SMOTE for oversampling. You can also try the Adaptive Synthetic Sampling approach for imbalanced learning (ADASYN) and check if the result improves; see the sketch after this list.
  3. However, the most important thing is: you should optimise for recall or F1 score, given that your classes are highly imbalanced; accuracy is not a preferred metric in a highly imbalanced classification problem. I would recommend optimising for recall.
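
A minimal sketch of swapping SMOTE for ADASYN with imbalanced-learn (variable names are assumptions; as with SMOTE, resample the training split only):

    from imblearn.over_sampling import ADASYN

    # ADASYN generates synthetic minority samples adaptively, putting more
    # of them in regions where the minority classes are harder to learn
    X_res, y_res = ADASYN(random_state=0).fit_resample(X_train, y_train)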

Possible recommendations (points 1 and 3 are sketched below):

  1. Hyperparameter tuning
  2. Better regularisation
  3. K-fold cross-validation
  4. Making sure that the train, validation and test sets are different
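
A minimal sketch combining points 1 and 3 while optimising for recall (the parameter grid and variable names are assumptions):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keeps class ratios per fold
    grid = GridSearchCV(
        RandomForestClassifier(class_weight='balanced', random_state=0),
        param_grid={'n_estimators': [100, 200], 'max_depth': [5, 10, None]},
        scoring='recall_macro',  # unweighted mean of per-class recalls
        cv=cv,
    )
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)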
Akash Dubey
  • Thank you so much. ADASYN was tried; it gives almost the same result. What did you mean by 'better regularisation'? – Mimi Jun 21 '21 at 13:44
  • When you are using logistic regression, what are you setting the penalty argument to? Similarly, for the random forest, are you setting the depth of the trees, etc.? – Akash Dubey Jun 21 '21 at 13:51
  • -RandomForestClassifier(n_estimators=200, class_weight='balanced', criterion='entropy', random_state= 0, verbose= 1, max_depth=2) -LogisticRegression(class_weight='auto') – Mimi Jun 21 '21 at 13:57
  • Choose these hyperparameters using k-fold cross-validation on a validation set and optimise for recall. – Akash Dubey Jun 21 '21 at 14:00
  • cv_scores = [6.73861319e-01, 2.82771953e-01, 5.32073548e-01, 7.60492017e-01, 1.38087803e-04] – Mimi Jun 21 '21 at 15:17