Questions tagged [dummy-variables]

44 questions
8
votes
2 answers

In which cases shouldn't we drop the first level of categorical variables?

As a beginner in machine learning, I'm looking into the one-hot encoding concept. Unlike in statistics, where you always want to drop the first level to have k-1 dummies (as discussed here on SE), it seems that some models need to keep it and have k…
6
votes
3 answers

How to give a higher importance to certain features in a (k-means) clustering model?

I am clustering data with numeric and categorical variables. To process the categorical variables for the cluster model, I create dummy variables. However, I feel like this results in a higher importance for these dummy variables because multiple…
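One possible way to look at the weighting concern raised here, as a minimal sketch rather than a recommended recipe: k-means works on squared Euclidean distance, so multiplying a column by the square root of a weight is equivalent to weighting that feature in the distance. The rows and the weights below are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical rows: [numeric feature, dummy_a, dummy_b]
x = np.array([0.2, 1.0, 0.0])
y = np.array([0.9, 0.0, 1.0])

# Hypothetical weights: the numeric feature keeps weight 1, each dummy gets 0.5
# so the two dummies of one categorical variable count like a single feature.
weights = np.array([1.0, 0.5, 0.5])

# Weighted squared Euclidean distance between the two rows.
weighted_sq_dist = np.sum(weights * (x - y) ** 2)

# Equivalent: rescale the columns once, then use the plain squared distance.
scale = np.sqrt(weights)
scaled_sq_dist = np.sum((x * scale - y * scale) ** 2)
```

Rescaling the columns before clustering therefore raises or lowers a feature's influence without touching the clustering algorithm itself.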
5
votes
3 answers

Obtaining consistent one-hot encoding of train / production data

I'm building an app that will require user input. Currently, on the training set, I run the following code, in which data is a pandas DataFrame with a combination of categorical and numerical columns. dummified_data = pd.get_dummies(data) train_data =…
Andrew Maurer
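A minimal sketch of one common way to keep the encoding consistent, assuming the training column layout is saved alongside the model: encode production rows with get_dummies and then reindex them to the training columns. The column names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical training data: one categorical and one numeric column.
train = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1.0, 2.0, 3.0]})
train_encoded = pd.get_dummies(train, columns=["color"])
train_columns = train_encoded.columns  # persist this alongside the model

# A production row containing a level never seen in training ("purple").
prod = pd.DataFrame({"color": ["purple"], "size": [4.0]})
prod_encoded = pd.get_dummies(prod, columns=["color"])

# Align to the training layout: dummies missing from the production data are
# filled with 0, and columns that did not exist at training time are dropped.
prod_aligned = prod_encoded.reindex(columns=train_columns, fill_value=0)
```

With this approach an unseen level ends up as a row of zeros across the dummy columns, which may or may not be the desired behaviour for a given model.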
3
votes
1 answer

Using pandas get_dummies() on real world unseen data

I made an ML model, trained and tested it with my data containing categorical variables. To create dummy variables I used pd.get_dummies() before the split. I now want to use my model on previously unseen data where, of course, I need to re-create my…
3nomis
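One possible approach, sketched under the assumption that the set of levels can be fixed once from the training data: cast the column to a categorical dtype with explicit categories before calling pd.get_dummies(), so the same dummy columns are produced for any later data. The `city` column and the `encode` helper are hypothetical names used for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

train = pd.DataFrame({"city": ["Paris", "Rome", "Oslo", "Rome"]})

# Fix the level set once, from the training data.
city_dtype = CategoricalDtype(categories=sorted(train["city"].unique()))

def encode(df):
    """One-hot encode `city` using the levels fixed at training time."""
    out = df.copy()
    out["city"] = out["city"].astype(city_dtype)  # unseen levels become NaN
    return pd.get_dummies(out, columns=["city"])

# New data with one known level and one level absent from training.
new = pd.DataFrame({"city": ["Rome", "Lima"]})
new_encoded = encode(new)
```

Here the unseen level "Lima" becomes NaN after the cast and is encoded as all zeros, while levels absent from the new data still get their own columns.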
3
votes
3 answers

How to obtain original feature names after using one-hot encoding

This question is on an implementation aspect of scikit-learn's DecisionTreeClassifier(). How do I get the feature names ranked in descending order, from the feature_importances_ returned by the scikit-learn DecisionTreeClassifier()? The problem is…
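A minimal sketch of one way to rank importances by name, assuming the tree was fitted on a DataFrame produced by pd.get_dummies(): feature_importances_ is aligned with the fitted columns, so it can be wrapped in a Series and, if wanted, summed back onto the original features. The toy data and the "__" separator are assumptions made for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; "__" is an arbitrary separator chosen so the original
# column name can be recovered from each dummy column name.
X = pd.DataFrame({
    "color": ["red", "green", "blue", "red", "green", "blue"],
    "size": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
})
y = [0, 1, 1, 0, 1, 1]
X_encoded = pd.get_dummies(X, columns=["color"], prefix_sep="__", dtype=float)

clf = DecisionTreeClassifier(random_state=0).fit(X_encoded, y)

# Importances are aligned with the columns the tree was fitted on.
importances = pd.Series(clf.feature_importances_, index=X_encoded.columns)
ranked_dummies = importances.sort_values(ascending=False)

# Sum the dummy importances back onto the original feature names.
original = pd.Index([name.split("__")[0] for name in X_encoded.columns])
ranked_original = importances.groupby(original).sum().sort_values(ascending=False)
```

`ranked_dummies` ranks the encoded columns, and `ranked_original` ranks the features as they were before encoding.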
2
votes
1 answer

Dummy variable only for character value in a column (Neglecting float and integers)

My dataset consists of 3000 rows and 50 columns. One column (ESTIMATE_FAMILY_CONTRIBUTION) contains numerical values (around 2000 different values such as 20, 30, 32, ...) plus one string value, 'No_information'. When I create…
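A sketch of one way this situation could be handled, assuming the goal is a single indicator for the string value while the rest of the column stays numeric. The sample values are hypothetical, and the median fill is only one of several possible imputation choices.

```python
import pandas as pd

# Hypothetical sample of the column described in the question.
col = "ESTIMATE_FAMILY_CONTRIBUTION"
df = pd.DataFrame({col: [20, 30, "No_information", 32]})

# Flag the sentinel string in its own 0/1 column.
df[col + "_missing"] = (df[col] == "No_information").astype(int)

# Parse everything else as a number; the sentinel becomes NaN.
df[col] = pd.to_numeric(df[col], errors="coerce")

# Assumption: fill the gap with the median; other strategies could go here.
df[col] = df[col].fillna(df[col].median())
```

This avoids creating one dummy column per distinct numeric value, which is what a direct get_dummies call on a mixed-type column would do.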
2
votes
1 answer

How to interpret dummy variable in ML prediction?

I am working on a binary classification problem where I have a mix of continuous and categorical variables. I created the dummy variables using the get_dummies function in pandas. My questions are: 1) I see that there is a parameter…
2
votes
3 answers

How to handle "year" variable for Machine Learning models

I have a "year" variable, but I don't know the best way to handle it for an ML model, since it is a numerical variable that implies an ordering. Should I treat it as a categorical variable?
2
votes
2 answers

Prediction after one hot encoding

I have a regression model that I want to use to make predictions based on values that I will get from an end user. In my dataset, I have one categorical variable, region, which I one-hot encoded, which generated 53 new columns (54 regions). Now my data has…
IngridX
2
votes
1 answer

How to deal with a potentially multiple categorical variable

I'm building a model that has some categorical variables as inputs. I have dealt with this sort of data before and applied different techniques, such as creating dummy variables and factor scoring. However, I now have a different type of problem…
2
votes
1 answer

Dummy Variable trap in Linear Regression

The dummy variable trap is a common problem in linear regression when dealing with categorical variables. Since one-hot encoding introduces redundancy, if we have m categories in our categorical variable we usually drop one dummy variable to…
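The redundancy mentioned here can be checked numerically. The following is a minimal sketch with a hypothetical three-level variable: with an intercept and all m dummies the design matrix is rank deficient, because the dummies sum to the intercept column, while dropping one dummy restores full column rank.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical variable with m = 3 levels.
levels = pd.Series(["a", "b", "c", "a", "b", "c"], name="cat")
intercept = np.ones((len(levels), 1))

full = pd.get_dummies(levels, dtype=float).to_numpy()                      # m columns
reduced = pd.get_dummies(levels, drop_first=True, dtype=float).to_numpy()  # m - 1 columns

X_full = np.hstack([intercept, full])        # intercept + m dummies
X_reduced = np.hstack([intercept, reduced])  # intercept + (m - 1) dummies

# The m dummies add up to the intercept column, so X_full loses one rank.
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
```

Comparing each rank with the number of columns shows that only `X_reduced` has full column rank, which is what ordinary least squares needs for a unique solution.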
1
vote
2 answers

What would be the correct representation of categorical variables like sex?

I am unsure of the right way to use or represent categorical variables with only two values, like "sex". I have checked different sources, but I was not able to find any solid reference. For example, if I have the…
Lila
1
vote
1 answer

Use dummy variables to create a rank variable in R

I have a series of multiple-response (dummy) variables describing causes for canceled visits. A visit can have multiple reasons for the cancellation. My goal is to create a single mutually exclusive variable using the dummy variables in a…
Mar355
1
vote
1 answer

What exactly is a dummy trap? Is dropping one dummy feature really a good practice?

So I'm going through a Machine Learning course, and this course explains that to avoid the dummy trap, a common practice is to drop one column. It also explains that since the info on the dropped column can be inferred from the other columns, we…
1
vote
1 answer

Should I include all dummy variables or N-1 dummy variables (keep one as reference) in neural networks

I have a categorical variable with N factor levels (e.g. gender has two levels) in a classification problem. I have converted it into dummy variables (male and female). I have to use a neural network (nnet) to classify. I have two options - Include any…
SiH