Questions tagged [model-evaluations]
354 questions
Use this tag for questions about how to evaluate model performance, not only with standard metrics but also in the context of real use-case applications. What counts as a good model can depend on many factors that have to be weighed together before it yields a genuinely useful data science application.
284 votes · 8 answers
Micro Average vs Macro average Performance in a Multiclass classification setting
I am trying out a multiclass classification setting with 3 classes. The class distribution is skewed with most of the data falling in 1 of the 3 classes. (class labels being 1,2,3, with 67.28% of the data falling in class label 1, 11.99% data in…
SHASHANK GUPTA
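For reference, a minimal sketch of the difference, assuming scikit-learn and made-up labels for a skewed three-class problem (not the asker's data):

```python
from sklearn.metrics import f1_score

# Hypothetical labels for a skewed 3-class problem, for illustration only.
y_true = [1, 1, 1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 1, 1, 2, 2, 3, 3, 1]

# Micro-averaging pools all decisions, so the dominant class drives the score.
print(f1_score(y_true, y_pred, average="micro"))

# Macro-averaging computes F1 per class first, then takes an unweighted mean,
# so the two minority classes count as much as the majority class.
print(f1_score(y_true, y_pred, average="macro"))
```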
52 votes · 3 answers
What is the difference between bootstrapping and cross-validation?
I used to apply K-fold cross-validation for robust evaluation of my machine learning models. But I'm aware of the existence of the bootstrapping method for this purpose as well. However, I cannot see the main difference between them in terms of…
Fredrik
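A rough sketch of the two resampling schemes side by side, assuming scikit-learn and a placeholder dataset and estimator:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.utils import resample

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

# K-fold: each observation lands in exactly one held-out fold.
cv_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Bootstrap: train on a sample drawn with replacement, evaluate on the
# out-of-bag observations, and repeat.
boot_scores = []
rng = np.random.RandomState(0)
for _ in range(20):
    idx = resample(np.arange(len(y)), random_state=rng)
    oob = np.setdiff1d(np.arange(len(y)), idx)
    model.fit(X[idx], y[idx])
    boot_scores.append(accuracy_score(y[oob], model.predict(X[oob])))

print(cv_scores.mean(), np.mean(boot_scores))
```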
44 votes · 10 answers
When is precision more important than recall?
Can anyone give me some examples where precision is important and some examples where recall is important?
Rajat
20 votes · 4 answers
Train/Test Split after performing SMOTE
I am dealing with a highly unbalanced dataset, so I used SMOTE to resample it.
After SMOTE resampling, I split the resampled dataset into training/test sets using the training set to build a model and the test set to evaluate it.
However, I am…
Edamame
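The usual advice is to split first and oversample only the training portion, so the test set keeps the original class distribution; a minimal sketch assuming imbalanced-learn's SMOTE and a placeholder dataset:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset, for illustration only.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Split first: the test set must reflect the real class distribution.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the training data only; the test set is never resampled.
X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
```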
19 votes · 4 answers
Macro- or micro-average for imbalanced class problems
The question of whether to use macro- or micro-averages when the data is imbalanced comes up all the time.
Some googling shows that many bloggers tend to say that micro-average is the preferred way to go, e.g.:
Micro-average is preferable if there…
Krrr
16 votes · 1 answer
How many features to sample using Random Forests
The Wikipedia page which quotes "The Elements of Statistical Learning" says:
Typically, for a classification problem with $p$ features, $\lfloor \sqrt{p}\rfloor$ features are used in each split.
I understand that this is a fairly good educated…
Valentin Calomme
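In scikit-learn, one way to apply the $\lfloor \sqrt{p}\rfloor$ heuristic is the max_features parameter; a minimal sketch with a placeholder dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset with p = 25 features, for illustration only.
X, y = make_classification(n_samples=500, n_features=25, random_state=0)

# max_features="sqrt" samples floor(sqrt(p)) candidate features at each split,
# the heuristic quoted from "The Elements of Statistical Learning".
clf = RandomForestClassifier(max_features="sqrt", random_state=0).fit(X, y)
```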
16 votes · 1 answer
How to define a custom performance metric in Keras?
I tried to define a custom metric function (F1-Score) in Keras (Tensorflow backend) according to the following:
    def f1_score(tags, predicted):
        tags = set(tags)
        predicted = set(predicted)
        tp = len(tags & predicted)
        fp =…
Hendrik
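A set-based function like the one in the excerpt generally cannot run inside the Keras graph, because metrics receive tensors rather than Python sets. A common alternative is a batch-wise F1 built from backend ops; this is a sketch, not the asker's code, and it assumes binary labels with sigmoid outputs:

```python
from tensorflow.keras import backend as K

def f1_metric(y_true, y_pred):
    # Operates on tensors, batch-wise; assumes binary labels and sigmoid outputs.
    y_pred = K.round(y_pred)
    tp = K.sum(y_true * y_pred)
    fp = K.sum((1 - y_true) * y_pred)
    fn = K.sum(y_true * (1 - y_pred))
    precision = tp / (tp + fp + K.epsilon())
    recall = tp / (tp + fn + K.epsilon())
    return 2 * precision * recall / (precision + recall + K.epsilon())

# Usage (model is assumed to exist):
# model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[f1_metric])
```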
13 votes · 1 answer
Irregular Precision-Recall Curve
I'd expect that, for a precision-recall curve, precision decreases monotonically as recall increases. I have a plot that is not smooth and looks funny. I used scikit-learn to compute the values for plotting the curve. Is the curve below abnormal? If yes, why…
Anderlecht
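For reference, a minimal way to generate the values with scikit-learn (toy labels and scores); precision along this curve is not guaranteed to fall monotonically, so a jagged shape is not necessarily abnormal:

```python
from sklearn.metrics import precision_recall_curve

# Toy labels and scores, for illustration only.
y_true  = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.55, 0.7, 0.2, 0.6, 0.9, 0.3]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(list(zip(recall, precision)))
```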
12 votes · 3 answers
Why is the F-measure preferred for classification tasks?
Why is the F-measure usually used for (supervised) classification tasks, whereas the G-measure (or Fowlkes–Mallows index) is generally used for (unsupervised) clustering tasks?
The F-measure is the harmonic mean of the precision and recall.
The…
Bruno Lubascher
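For reference, writing $P$ for precision and $R$ for recall, the F-measure is the harmonic mean and the G-measure (Fowlkes–Mallows) is the geometric mean:

$$F_1 = \frac{2PR}{P + R} \qquad\qquad G = \sqrt{P \cdot R}$$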
12 votes · 2 answers
Neural Networks - Loss and Accuracy correlation
I'm a bit confused by the coexistence of Loss and Accuracy metrics in Neural Networks. Both are supposed to measure the "exactness" of the match between $y$ and $\hat{y}$, aren't they? So isn't the application of the two redundant in the training…
Hendrik
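A small sketch of why the two are not redundant: the two sets of hypothetical probabilities below yield the same accuracy but very different log loss, because the loss also reflects how confident the predictions are:

```python
from sklearn.metrics import accuracy_score, log_loss

# Toy binary labels and made-up predicted probabilities.
y_true = [1, 0, 1, 1]

confident = [0.99, 0.01, 0.99, 0.01]  # three confident correct answers, one confidently wrong
hesitant  = [0.60, 0.40, 0.60, 0.40]  # the same hard predictions, but low confidence everywhere

for probs in (confident, hesitant):
    preds = [int(p >= 0.5) for p in probs]
    # Same accuracy (0.75) both times, but the log loss differs sharply.
    print(accuracy_score(y_true, preds), log_loss(y_true, probs))
```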
12 votes · 3 answers
What are the disadvantages of accuracy?
I have been reading about evaluating a model with accuracy only, and I have found some disadvantages. Among them, I read that it treats all errors as equal. How could this problem be solved? Maybe by assigning costs to each type of failure? Thank you very much…
PicaR
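One remedy the question hints at is to weight each type of error by a cost; a minimal sketch with a hypothetical cost matrix and scikit-learn's confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

# Hypothetical costs: rows = true class, columns = predicted class.
# Here a false negative (true 1 predicted as 0) is assumed 5x worse than a false positive.
cost = np.array([[0, 1],
                 [5, 0]])

cm = confusion_matrix(y_true, y_pred)
print("total cost:", (cm * cost).sum())
```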
9 votes · 2 answers
Difference between using RMSE and nDCG to evaluate Recommender Systems
What kind of error measures do RMSE and nDCG give while evaluating a recommender system, and how do I know when to use one over the other? If you could give an example of when to use each, that would be great as well!
covfefe
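Roughly, RMSE scores how accurately predicted ratings match observed ratings, while nDCG scores how well the predicted scores rank the relevant items; a minimal sketch with toy numbers, assuming scikit-learn's ndcg_score is available:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, ndcg_score

# Made-up ratings for a single user, for illustration only.
true_ratings = [5, 3, 4, 1]
pred_ratings = [4.5, 3.5, 3.0, 2.0]

# RMSE: how far the predicted ratings are from the true ratings.
rmse = np.sqrt(mean_squared_error(true_ratings, pred_ratings))

# nDCG: how well the predicted scores order the items (2-D: one row per user/query).
ndcg = ndcg_score([true_ratings], [pred_ratings])

print(rmse, ndcg)
```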
9 votes · 3 answers
How do you evaluate an ML model already deployed in production?
To make this more concrete, let's consider the problem of loan default prediction. Let's say I have trained and tested multiple classifiers offline and ensembled them. Then I deployed this model to production.
But because people change, data and many other…
tomtom
8 votes · 1 answer
When do I have to use aucPR instead of auROC? (and vice versa)
I'm wondering whether, to validate a model, it is sometimes better to use aucPR instead of aucROC. Do these cases depend only on the "domain & business understanding"?
In particular, I'm thinking about the "unbalanced class problem", where, it seems…
jmvllt
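A quick sketch of computing both scores with scikit-learn on an imbalanced toy problem; average_precision_score is used here as the aucPR summary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset, for illustration only.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print("auROC:", roc_auc_score(y_te, scores))
print("aucPR (average precision):", average_precision_score(y_te, scores))
```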
8 votes · 2 answers
Do I need validation data if my train and test accuracy/loss is consistent?
I am trying to understand the purpose of a 3rd split in the form of a validation dataset. I am not necessarily talking about cross-validation here.
In the scenario below, it would appear that the model is overfit to the training dataset.
Train…
Kermit