How to select features for a ML model

Question

I have a dataset with 5K records for a binary classification problem.

My features are min_blood_pressure, max_blood_pressure, min_heart_rate, max_heart_rate, etc. I have more than 15 such measurements, and each of them has a min and a max column, amounting to 30 variables.

When I ran a correlation on the data, I saw that these input features are highly correlated. For example, min_blood_pressure is highly correlated (>80%) with max_blood_pressure, and the same holds for the min and max columns of every measurement. However, their individual correlation with the target variable is low.

So in this case, which one should I drop or how should I handle this scenario?

I guess there are min and max variables for a reason. What would you do in a situation like this?

Should we find the average of all the measurements and create a new feature?

Can anyone help me with this?

Yes, converting those highly correlated features into new variables will definitely improve the model performance. But taking the average doesn't sound good here. Have you already seen this: https://datascience.stackexchange.com/questions/24452/in-supervised-learning-why-is-it-bad-to-have-correlated-features ? — IamTheRealFord, Dec 13 '19 at 10:23
Thanks for the link. But is there any other way to transform this other than Average? — The Great, Dec 13 '19 at 10:28
Could you please tell us more about the data? Does it have any categorical features? Have you tried any base models yet? — Piotr Rarus, Dec 13 '19 at 10:31
Yes, it has categorical features, but they are one-hot encoded. Refer to the comment below to see what I tried — The Great, Dec 13 '19 at 10:33

Piotr Rarus · Accepted Answer · 2019-12-13T10:50:21.263

I'd start here. The most basic idea is to run statistical tests to see how the target variable depends on each feature. These include tests like chi-square or ANOVA. Tree-based models can also output feature importance. Check this post. There are plenty of posts on Kaggle with code; it might be worth checking those.

As your data set isn't drastically large, you could run a grid search and check how your model behaves for different numbers of PCA components.
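A minimal sketch of such a grid search, assuming scikit-learn and a synthetic stand-in for the 5K x 30 dataset (the real data is not shown in the question). The logistic regression classifier and the candidate component counts are illustrative choices, not something prescribed above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 5K rows, 30 features,
# several of them redundant linear combinations of the informative ones.
X, y = make_classification(
    n_samples=5000, n_features=30, n_informative=8, n_redundant=12,
    random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),              # PCA is sensitive to feature scale
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate numbers of principal components to compare (illustrative values).
param_grid = {"pca__n_components": [2, 5, 10, 15, 20, 30]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)

# Report the mean cross-validated AUC for each setting.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(f"n_components={params['pca__n_components']:2d}  mean CV AUC={score:.3f}")
print("best:", search.best_params_)
```

Because the scaler and PCA sit inside the pipeline, they are refit on each training fold, so the cross-validated scores are not contaminated by the held-out fold.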

It's hard to tell a priori whether you should drop some features. I guess trying each combination of 30 features is completely out of scope (that is over a billion subsets), though you might try dropping the most redundant ones.
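One simple way to drop the most redundant features is a correlation threshold. The sketch below assumes pandas and uses made-up min/max columns; the helper name `drop_highly_correlated` and the 0.8 threshold (taken from the >80% figure in the question) are illustrative, not an established recipe.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Hypothetical measurements: each max column is the min column plus a small
# positive spread, so every min/max pair is strongly correlated.
data = {}
for name in ["blood_pressure", "heart_rate", "temp"]:
    low = rng.normal(size=n)
    data[f"min_{name}"] = low
    data[f"max_{name}"] = low + np.abs(rng.normal(scale=0.3, size=n))
df = pd.DataFrame(data)


def drop_highly_correlated(frame, threshold=0.8):
    """Drop every column whose absolute correlation with an earlier column
    exceeds the threshold. Which member of a pair survives depends only on
    column order, so this is a heuristic, not a statement about importance."""
    corr = frame.corr().abs()
    # Keep the strict upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


reduced, dropped = drop_highly_correlated(df, threshold=0.8)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

It is worth comparing model scores before and after such a drop, since a dropped column may still have carried some signal.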

As your data contains categorical features, it might be a good idea to give catboost a try. Its authors claim it handles categorical features better than other gradient boosters. Just keep in mind that its default number of estimators is 10 times that of xgboost, so you might lower it for experiments.

First, I'd create a base model with all the features. Now comes the question: which method to choose? Gradient boosters possess the ability to learn feature importance, so redundant features will get little weight and you might not see much of an improvement when dropping them. You might get more insight using more vanilla methods, but in the end you'll certainly be deploying gradient boosting to production, so I don't see much sense in it. I'd stick with xgboost or catboost and perform experiments using the same parameters.
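A rough sketch of that experiment: a baseline on all features compared against a reduced feature set, with the same booster parameters in both runs. To keep it dependency-free it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for xgboost or catboost, on synthetic data; the parameter values and the choice of keeping 15 features are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): 30 features, many of them redundant.
X, y = make_classification(
    n_samples=1000, n_features=30, n_informative=8, n_redundant=12,
    random_state=0,
)

# The same booster parameters are used for every experiment.
params = dict(n_estimators=50, max_depth=3, random_state=0)

# Baseline: all 30 features.
baseline_scores = cross_val_score(
    GradientBoostingClassifier(**params), X, y, cv=3, scoring="roc_auc"
)

# Reduced: keep the 15 most important features. Selection happens inside
# the pipeline, so it is redone on each training fold (no leakage).
reduced_model = make_pipeline(
    SelectFromModel(GradientBoostingClassifier(**params),
                    threshold=-np.inf, max_features=15),
    GradientBoostingClassifier(**params),
)
reduced_scores = cross_val_score(reduced_model, X, y, cv=3, scoring="roc_auc")

print(f"all features: {baseline_scores.mean():.3f}")
print(f"top 15 only:  {reduced_scores.mean():.3f}")

# Importances learned by the booster on the full data, for inspection.
importances = GradientBoostingClassifier(**params).fit(X, y).feature_importances_
print("most important feature indices:", np.argsort(importances)[::-1][:5])
```

If the two scores are close, dropping features buys little for this kind of model, which matches the expectation stated above.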

Please keep in mind: though some features might be highly redundant, they may still contribute some knowledge to your model.

Yes, I already ran a few feature selection algorithms like `SelectKBest`, `SelectFromModel`, `RFE`, `Feature Importance`, etc., which output both `min` and `max`. For example, `Min_bp` and `Max_bp`. When I did a sanity check by running a correlation, I saw that they are all correlated — The Great, Dec 13 '19 at 10:32
I guess whatever you suggested is done using the above algorithms that I mentioned. — The Great, Dec 13 '19 at 10:35
Sorry, can I make sure I understand correctly? Are you suggesting that I don't drop any correlated features and just apply the catboost algorithm for prediction, so that not much feature selection is required? Is that what you are suggesting? — The Great, Dec 13 '19 at 10:41
Both answers to this question are equally good and similar in terms of suggestions. But I can choose only one, so I am going with Piotr Rarus's answer. Nonetheless, thank you to both who kindly answered all my questions — The Great, Dec 15 '19 at 05:55

score 2 · Answer 2 · answered Dec 13 '19 at 10:51

You said:

Yes, I already ran few feature selection algorithms like SelectKBest, SelectFromModel, RFE, Feature Importance etc which outputs both min and max. For example - Min_bp and Max_bp. When I did a sanity check by running correlation, I was able to see that they all are correlated.

In general you have two options.

1. You can remove features that are not predictive of the target variable, using statistical tests such as ANOVA (see here). Based on the F-values, you then keep only the features with the highest F-values, meaning that they have high predictive ability for the target variable. Note that each feature is scored on its own, so this does not remove correlated features.

2. If you want to remove correlated features, for example when using a regression (where you ideally want uncorrelated variables), then dimensionality reduction such as PCA can be used. The new features will be uncorrelated, but each one is a linear combination of the original features, so you can no longer interpret them as individual original features.
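The two options can be contrasted in a short sketch, assuming scikit-learn and toy data: three made-up measurements with min/max columns, and a target driven by `min_bp` only. The column names and the target rule are hypothetical, chosen just to show that ANOVA-based selection can return a correlated pair while PCA components are uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 5000

# Toy data (assumption): each max column is its min column plus a small
# positive spread, so every min/max pair is highly correlated.
names, cols = [], []
for name in ["bp", "hr", "temp"]:
    low = rng.normal(size=n)
    high = low + np.abs(rng.normal(scale=0.3, size=n))
    names += [f"min_{name}", f"max_{name}"]
    cols += [low, high]
X = np.column_stack(cols)

# Hypothetical binary target that depends on min_bp plus noise.
y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)

# Option 1: univariate ANOVA F-test, keep the k best-scoring features.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected = [names[i] for i in selector.get_support(indices=True)]
# Each feature is scored independently, so a correlated pair can both win.
selected_corr = np.corrcoef(X[:, selector.get_support()], rowvar=False)[0, 1]
print("selected:", selected, "correlation between them:", round(selected_corr, 3))

# Option 2: PCA on standardized features; components are uncorrelated.
Z = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
component_corr = np.corrcoef(Z, rowvar=False)
max_off_diagonal = np.abs(component_corr - np.eye(3)).max()
print("largest correlation between components:", max_off_diagonal)
```

Option 1 keeps the original, interpretable columns but may keep redundant ones; option 2 removes the correlation at the cost of interpretability.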

seralouk


Thanks for the response. Upvoted. Regarding your point 1 about ANOVA: doesn't `SelectKBest` do that? I have identified the 10 best features through the `SelectKBest` algorithm. My question is that those 10 include 2 features which are highly correlated. – The Great Dec 13 '19 at 10:59
If I didn't have that, I could accommodate one more variable. Is it right to think this way? – The Great Dec 13 '19 at 11:01
If you used `f_classif` inside `SelectKBest`, then yes, it's an ANOVA. This does not mean that the returned features are going to be uncorrelated. It just returns the best features based on the ANOVA. Read my points 1 and 2 carefully. – seralouk Dec 13 '19 at 11:13
Yes, I used `f_classif`. So you're saying that I should pick the best features returned by `SelectKBest` and use that feature subset during model building. Have I understood correctly? – The Great Dec 13 '19 at 11:19
Yes. This would be option 1. Option 2 would be NOT to do ANOVA but instead do a PCA to reduce the features. You can try both. Do you care to track the original features OR you just need a good model? If the second, try also PCA – seralouk Dec 13 '19 at 11:20
Good question. I am interested in retaining the original features. When I do PCA (which I did), each new feature becomes a combination of multiple features, and I am not sure whether that is intuitive enough for end users – The Great Dec 13 '19 at 11:36
I have another question which I would like to check with you. Let's say you have a diabetes dataset whose output label is whether the patient had diabetes or not. Input features include min temp, max temp, min heart rate, max heart rate, min blood pressure, max blood pressure, etc. Now, after model building, I find out that "Min temp" and "Min blood pressure" are significant predictors. But how do you explain this? As we are trying to predict whether patients will develop diabetes or not, how would you phrase and communicate this? – The Great Dec 13 '19 at 11:40
I see your point. The significance comes from what? The weights of the model for these variables? Can you be more specific? What model do you use? Is it regression or classification? – seralouk Dec 13 '19 at 12:37
Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/102166/discussion-between-ssmk-and-makis). – The Great Dec 13 '19 at 12:46
