
I'm working with an imbalanced multi-class dataset. I'm trying to tune the hyperparameters of a DecisionTreeClassifier, a RandomForestClassifier and a GradientBoostingClassifier using a randomized search and a Bayesian search.

For now, I have used plain accuracy as the scoring, which is not really suitable for assessing my models' performance (and I'm not using it for that). Is it also unsuitable for parameter tuning?

I found that, for example, `recall_micro` and `recall_weighted` yield exactly the same results as accuracy. The same should hold for other metrics such as `f1_micro`.

So my question is: is the scoring relevant for tuning? I see that `recall_macro` leads to lower scores since it doesn't take the number of samples per class into account. So which metric should I use?
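
For reference, a quick check of that equality with synthetic labels (not from the question): for single-label multi-class problems, micro-averaged and weighted-averaged recall coincide with plain accuracy.

```python
# Quick check with made-up labels: micro- and weighted-averaged recall
# equal accuracy for single-label multi-class predictions.
from sklearn.metrics import accuracy_score, recall_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 0, 1, 2, 2, 2, 0, 2, 2]

print(accuracy_score(y_true, y_pred))                    # 0.7
print(recall_score(y_true, y_pred, average="micro"))     # 0.7
print(recall_score(y_true, y_pred, average="weighted"))  # 0.7
```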

Christian
  • Yes, but for decision purposes. The score helps you decide when to stop training. – Green Falcon Apr 26 '18 at 13:25
  • So if I just use a maximum number of iterations to decide when to stop tuning, it's irrelevant whether I use accuracy or recall? – Christian Apr 26 '18 at 13:38
  • No, based on accuracy or recall you have to decide whether to stop your training or not, whether to increase the number of iterations or not. – Green Falcon Apr 26 '18 at 15:16
  • I think I don't fully understand your point. What does the scoring used in parameter tuning have to do with stopping the training? If I use a randomized parameter search, for example, the scoring metric is only used to rank the models, and they have the same rank using accuracy as with recall_weighted and recall_micro. – Christian Apr 26 '18 at 15:20
  • Suppose that you have an unbalanced dataset. `99%` of your training data has label `0` and `1%` of your data has label `1`. In this case, if your model always outputs `0`, you will have a model with `99%` accuracy and you won't train it anymore. If you use the `F1` score, your evaluation method tells you that you are on the wrong path and you continue training. :) (sketched in code after this comment thread) – Green Falcon Apr 26 '18 at 15:24
  • Ok, I'm aware of this problem and that's why I posted my question :). I thought recall might be a better choice and then I was surprised because it led to the exact same scores as accuracy. So you are suggesting I should use the F1 score for parameter tuning? Is it that important for parameter tuning anyway? I'm not sure whether a certain parameter setting favors F1 over accuracy, for example. I evaluate the final model using lots of different metrics and the confusion matrix. I'm just not sure about parameter tuning. – Christian Apr 26 '18 at 15:29
  • Choosing a good evaluation method depends on your task; for unbalanced datasets you have to use the `F1` score. As a friend, I suggest you always track the confusion matrix, which shows you everything for classification tasks. You can also take a look at [here](https://datascience.stackexchange.com/a/26855/28175). – Green Falcon Apr 26 '18 at 15:32
  • Alright, when you say evaluation, do you include the scoring metric used during parameter tuning? – Christian Apr 26 '18 at 15:34
  • The score does not participate in training or even tuning; it is just for informing the ML practitioner of the status of training. – Green Falcon Apr 26 '18 at 15:38
  • Well, in tuning the models are ranked based on my scoring metric for a given test set. So if that ranking is different for different metrics, it does participate? In the end I need to know which scoring metric I should use for my randomized and Bayesian parameter search. – Christian Apr 26 '18 at 15:42
  • To find the appropriate method you have to search for similar situations. As for your first sentence, I don't know what to say; if you do so, it may participate. Also, don't use the term *metric*; instead, use *evaluation method* or *criterion*. – Green Falcon Apr 26 '18 at 17:11
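
A sketch of the imbalance example from the comments above, with synthetic data (the numbers are only illustrative): a classifier that always predicts the majority class reaches 99% accuracy but a much lower macro-averaged F1.

```python
# Majority-class baseline on a 99%/1% imbalanced label set.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 990 + [1] * 10)  # 99% class 0, 1% class 1
y_pred = np.zeros_like(y_true)           # always predict the majority class

print(accuracy_score(y_true, y_pred))             # 0.99
# scikit-learn warns that precision is ill-defined for class 1 (never predicted).
print(f1_score(y_true, y_pred, average="macro"))  # ~0.50 (F1 for class 1 is 0)
```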

4 Answers


The evaluation metric depends on the goals of the project. Which outcomes are better and which are worse? Some projects value precision over recall, and other projects value recall over precision.

After you have clarity on the project goals, pick a single metric to provide a consistent scorecard when comparing different algorithms and hyperparameter combinations. One common evaluation metric for multi-class classification is the F-score. The F-score has a β hyperparameter which weights recall and precision differently. You will have to choose between micro-averaging (biased by class frequency) and macro-averaging (treating all classes as equally important). For macro-averaging, two different formulas can be used:

  1. The F-score of (arithmetic) class-wise precision and recall means.

  2. The arithmetic mean of class-wise F-scores (often more desirable).
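
For illustration, here is a small sketch of the difference (the labels below are made up, not from this answer). Scikit-learn's `average="macro"` corresponds to the second formula, and `make_scorer` turns the macro F-beta into something that can be passed as `scoring=` to a search.

```python
# Two macro-averaging formulas for the F-score (illustrative labels).
from sklearn.metrics import fbeta_score, make_scorer, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2, 2, 2]
beta = 1.0  # beta > 1 favours recall, beta < 1 favours precision

# 1. F-score of the (arithmetic) class-wise precision and recall means.
p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
f_of_means = (1 + beta**2) * p * r / (beta**2 * p + r)

# 2. Arithmetic mean of class-wise F-scores (scikit-learn's average="macro").
mean_of_f = fbeta_score(y_true, y_pred, beta=beta, average="macro")

print(f_of_means, mean_of_f)  # ~0.819 vs ~0.802: the two values generally differ

# A scorer that can be passed as `scoring=` to GridSearchCV / RandomizedSearchCV.
macro_fbeta_scorer = make_scorer(fbeta_score, beta=beta, average="macro")
```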

Brian Spiering

You should use the same metric to evaluate and to tune the classifiers. If you will evaluate the final classifier using accuracy, then you must use accuracy to tune the hyperparameters. If you think you should use macro-averaged F1 as the final evaluation of the classifier, use it to tune them as well.

As an aside, for multiclass problems I have not yet heard any convincing argument not to use accuracy, but that is just me.
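
A minimal sketch of that advice with scikit-learn (the dataset and parameter grid are placeholders): the same criterion, macro-averaged F1 here, is used for both the search and the final evaluation.

```python
# Tune and evaluate with the same criterion (macro-averaged F1 here).
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [3, 5, None]},
                      scoring="f1_macro",  # same criterion as the final evaluation
                      cv=5)
search.fit(X_train, y_train)

# Final evaluation on held-out data, with the same criterion.
print(f1_score(y_test, search.predict(X_test), average="macro"))
```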

Jacques Wainer
    There are many arguments against accuracy! https://stats.stackexchange.com/questions/368949/example-when-using-accuracy-as-an-outcome-measure-will-lead-to-a-wrong-conclusio https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email – Dave Jul 18 '21 at 03:35
  • Yes, if you can get calibrated probability estimates of belonging to each class, that is better than accuracy. That is the brunt of the arguments @Dave posted. But if you do not have these calibrated probability estimates and plan to use some of the other classification metrics (F1, AUC, G-means, etc.), I do not know of good arguments to use them in MULTICLASS problems, in part because there are no STANDARD extensions of these metrics to multiclass (see the macro- vs. micro- vs. weighted distinction for F1 and other metrics). – Jacques Wainer Jul 18 '21 at 15:41

If your dataset is imbalanced, then you can calculate Cohen's kappa score.
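
For example (a sketch, assuming a scikit-learn workflow like the one in the question), Cohen's kappa can be wrapped with `make_scorer` and passed to the search:

```python
# Cohen's kappa as the scoring criterion for hyperparameter tuning.
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

kappa_scorer = make_scorer(cohen_kappa_score)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={"max_depth": [3, 5, 10, None],
                         "min_samples_leaf": [1, 5, 10]},
    n_iter=5,
    scoring=kappa_scorer,  # ranks candidates by kappa instead of accuracy
    cv=5,
    random_state=0,
)
# search.fit(X_train, y_train)  # fit on your own data
```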

Rina

A simple solution is to set an importance weight for each class that is inversely related to the class's relative frequency in the training set, e.g. $\frac{1}{\text{freq}}$ or $e^{-\text{freq}}$. The choice of formula depends on how much importance you want to give to the less frequent classes,
e.g. $e^{-\text{freq}}$ gives more importance to less frequent classes than $\frac{1}{\text{freq}}$.
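
A minimal sketch of that idea with scikit-learn (toy labels, and `class_weight` as the assumed mechanism for passing the weights):

```python
# Class weights derived from training-set frequencies, passed via class_weight.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.rand(100, 4)                    # toy features
y_train = np.array([0] * 90 + [1] * 8 + [2] * 2)    # imbalanced toy labels

classes, counts = np.unique(y_train, return_counts=True)
freq = counts / counts.sum()                        # relative frequencies

weights_inv = {c: 1.0 / f for c, f in zip(classes, freq)}     # 1/freq
weights_exp = {c: np.exp(-f) for c, f in zip(classes, freq)}  # e^(-freq)

# class_weight="balanced" is scikit-learn's built-in variant of the 1/freq idea.
clf = RandomForestClassifier(class_weight=weights_inv, random_state=0)
clf.fit(X_train, y_train)
```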

Mikedev