Questions tagged [classification]

An instance of supervised learning that identifies the category or categories to which a new instance of a dataset belongs.

In machine learning and statistics, classification refers to the problem of predicting category memberships based on a set of pre-labeled examples. It is thus a type of supervised learning.

Some of the most important classification algorithms are support vector machines, logistic regression, naive Bayes, random forest and artificial neural networks.

When we wish to associate inputs with continuous values in a supervised framework, the problem is instead known as regression. The unsupervised counterpart to classification is known as clustering (or cluster analysis), and involves grouping data into categories based on some measure of inherent similarity.
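As a rough illustration of the definition above (predicting a category from pre-labeled examples), here is a minimal sketch of a 1-nearest-neighbor classifier in plain Python. The data, feature meanings, and the function name `predict_1nn` are made up for this example; it is one simple way to classify, not one of the algorithms listed above.

```python
import math

def predict_1nn(train_points, train_labels, query):
    """Return the label of the training point closest to `query`."""
    best_index = min(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    return train_labels[best_index]

# Hypothetical pre-labeled examples: (height_cm, weight_kg) -> label.
train_points = [(20.0, 4.0), (25.0, 5.0), (60.0, 30.0), (65.0, 35.0)]
train_labels = ["cat", "cat", "dog", "dog"]

# Classify two new, unlabeled instances.
print(predict_1nn(train_points, train_labels, (22.0, 4.5)))
print(predict_1nn(train_points, train_labels, (62.0, 31.0)))
```

If the labels were continuous numbers rather than categories, the same setup would be a regression problem instead.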

3226 questions
256
votes
10 answers

How to set class weights for imbalanced classes in Keras?

I know that there is a possibility in Keras with the class_weights parameter dictionary at fitting, but I couldn't find any example. Would somebody be so kind as to provide one? By the way, in this case the appropriate practice is simply to weight up the…
Hendrik
  • 8,377
  • 17
  • 40
  • 55
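For the class-weight question above, one common heuristic (an assumption here, not something stated in the question) is to weight each class by `n_samples / (n_classes * class_count)`, so rarer classes get larger weights. A minimal sketch in plain Python, with made-up labels and a hypothetical helper name:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Map each class to n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced labels: nine of class 0, one of class 1.
labels = [0] * 9 + [1] * 1
weights = balanced_class_weights(labels)
print(weights)
```

A dictionary of this shape (class index to weight) is what Keras accepts as the `class_weight` argument of `model.fit`, for example `model.fit(X, y, class_weight=weights)`, assuming `model`, `X`, and `y` already exist.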
80
votes
6 answers

Cosine similarity versus dot product as distance metrics

It looks like the cosine similarity of two features is just their dot product scaled by the product of their magnitudes. When does cosine similarity make a better distance metric than the dot product? I.e. do the dot product and cosine similarity…
ahoffer
  • 903
  • 1
  • 7
  • 7
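To make the relationship in the question above concrete, the following sketch computes both quantities for two made-up vectors that point in the same direction but differ in length. The function names are illustrative only.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector magnitudes."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a = [1.0, 2.0]
b = [10.0, 20.0]  # same direction as a, ten times the magnitude

print(dot(a, b))                # grows with vector length
print(cosine_similarity(a, b))  # depends only on the angle between a and b
```

The dot product is sensitive to magnitude, while cosine similarity stays the same when either vector is rescaled by a positive factor.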
62
votes
5 answers

How to get accuracy, F1, precision and recall, for a keras model?

I want to compute the precision, recall and F1-score for my binary KerasClassifier model, but can't find any solution. Here's my actual code: # Split dataset in train and test data X_train, X_test, Y_train, Y_test = train_test_split(normalized_X,…
ZelelB
  • 1,027
  • 2
  • 10
  • 14
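The metrics named in the question above can be computed directly from true and predicted labels. This is a minimal sketch in plain Python with made-up labels; in the Keras setting the predicted labels would presumably come from thresholding the model's output probabilities (for instance at 0.5), which is an assumption and not shown here.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical ground truth and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```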
51
votes
7 answers

Deep Learning vs gradient boosting: When to use what?

I have a big data problem with a large dataset (take for example 50 million rows and 200 columns). The dataset consists of about 100 numerical columns and 100 categorical columns and a response column that represents a binary class problem. The…
Nitesh
  • 1,615
  • 1
  • 12
  • 22
45
votes
4 answers

Early stopping on validation loss or on accuracy?

I am currently training a neural network and I cannot decide which to use to implement my Early Stopping criteria: validation loss or a metric like accuracy/f1score/auc/whatever calculated on the validation set. In my research, I came upon articles…
qmeeus
  • 1,239
  • 1
  • 10
  • 13
42
votes
6 answers

When would one use Manhattan distance as opposed to Euclidean distance?

I am trying to look for a good argument on why one would use the Manhattan distance over the Euclidean distance in machine learning. The closest thing I found to a good argument so far is on this MIT lecture. At 36:15 you can see on the slides the…
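For reference, the two distances compared in the question above differ only in how per-coordinate differences are combined. A minimal sketch with made-up points:

```python
import math

def manhattan(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    """Square root of the sum of squared coordinate differences (L2 distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

origin = (0.0, 0.0)
point = (3.0, 4.0)

print(manhattan(origin, point))
print(euclidean(origin, point))
```

Because Euclidean distance squares each difference, a single large coordinate gap dominates it more than it dominates the Manhattan distance.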
40
votes
6 answers

Unbalanced multiclass data with XGBoost

I have 3 classes with this distribution: Class 0: 0.1169 Class 1: 0.7668 Class 2: 0.1163 And I am using xgboost for classification. I know that there is a parameter called scale_pos_weight. But how is it handled for 'multiclass' case, and how can…
39
votes
5 answers

When to use Random Forest over SVM and vice versa?

When would one use Random Forest over SVM and vice versa? I understand that cross-validation and model comparison is an important aspect of choosing a model, but here I would like to learn more about rules of thumb and heuristics of the two…
Rohit
  • 545
  • 1
  • 4
  • 7
37
votes
5 answers

Are decision tree algorithms linear or nonlinear

Recently a friend of mine was asked whether decision tree algorithms are linear or nonlinear algorithms in an interview. I tried to look for answers to this question but couldn't find any satisfactory explanation. Can anyone answer and explain the…
35
votes
4 answers

Quick guide into training highly imbalanced data sets

I have a classification problem with approximately 1000 positive and 10000 negative samples in training set. So this data set is quite unbalanced. Plain random forest is just trying to mark all test samples as a majority class. Some good answers…
IgorS
  • 5,444
  • 11
  • 31
  • 43
35
votes
1 answer

What is the best Keras model for multi-class classification?

I am working on research where I need to classify one of three events WINNER=(win, draw, lose) WINNER LEAGUE HOME AWAY MATCH_HOME MATCH_DRAW MATCH_AWAY MATCH_U2_50 MATCH_O2_50 3 13 550 571 1.86 3.34 …
SpanishBoy
  • 557
  • 1
  • 5
  • 11
31
votes
4 answers

What algorithms should I use to perform job classification based on resume data?

Note that I am doing everything in R. The problem goes as follows: Basically, I have a list of resumes (CVs). Some candidates will have work experience before and some don't. The goal here is to: based on the text on their CVs, I want to classify…
user1769197
  • 431
  • 1
  • 5
  • 5
30
votes
3 answers

What is difference between text classification and topic models?

I know the difference between clustering and classification in machine learning, but I don't understand the difference between text classification and topic modeling for documents. Can I use topic modeling over documents to identify a topic? Can I…
Ali
  • 361
  • 1
  • 4
  • 6
30
votes
2 answers

How to interpret classification report of scikit-learn?

As you can see, it is about a binary classification with linearSVC. The class 1 has a higher precision than class 0 (+7%), but class 0 has a higher recall than class 1 (+11%). How would you interpret this? And two other questions: what does…
user77241
30
votes
6 answers

What is the reason behind taking log transformation of few continuous variables?

I have been doing a classification problem and I have read many people's code and tutorials. One thing I've noticed is that many people take np.log or log of continuous variables like loan_amount or applicant_income etc. I just want to understand…
Sai Kumar
  • 601
  • 1
  • 8
  • 14
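One effect of the log transform mentioned in the question above is that it compresses a heavily skewed range. The sketch below uses made-up income values and `math.log1p` (log of 1 + x, chosen here so that zeros would not fail); the question itself mentions `np.log`, so this is a stand-in for illustration.

```python
import math

# Hypothetical, heavily right-skewed values (one very large outlier).
incomes = [20_000, 35_000, 50_000, 80_000, 2_000_000]

# log1p(x) = log(1 + x); it is defined at x = 0, unlike log(x).
logged = [math.log1p(x) for x in incomes]

raw_ratio = max(incomes) / min(incomes)
log_ratio = max(logged) / min(logged)

print(raw_ratio)  # spread of the raw values
print(log_ratio)  # spread after the log transform is much smaller
```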