
I am using GridSearchCV for optimising my predictions

I am working with a fairly large dataset and I am afraid I have not optimised the parameters enough.

df_train.describe():    
         Unnamed: 0           col1           col2           col3           col4          col5
count  8.886500e+05  888650.000000  888650.000000  888650.000000  888650.000000  888650.000000
mean   5.130409e+05       2.636784       3.845549       4.105381       1.554918       1.221922
std    2.998785e+05       2.296243       1.366518       3.285802       1.375791       1.233717
min    4.000000e+00       1.010000       1.010000       1.010000       0.000000       0.000000
25%    2.484332e+05       1.660000       3.230000       2.390000       1.000000       0.000000
50%    5.233705e+05       2.110000       3.480000       3.210000       1.000000       1.000000
75%    7.692788e+05       2.740000       3.950000       4.670000       2.000000       2.000000
max    1.097490e+06      90.580000      43.420000      99.250000      22.000000      24.000000

df_test.describe():
         Unnamed: 0      col1        col2        col3        col4        col5
count  390.000000  390.000000  390.000000  390.000000         0.0         0.0
mean   194.500000    3.393359    4.016821    3.761385         NaN         NaN
std    112.727548    4.504227    1.720292    3.479109         NaN         NaN
min      0.000000    1.020000    2.320000    1.020000         NaN         NaN
25%     97.250000    1.792500    3.272500    2.220000         NaN         NaN
50%    194.500000    2.270000    3.555000    3.055000         NaN         NaN
75%    291.750000    3.172500    4.060000    4.217500         NaN         NaN
max    389.000000   50.000000   18.200000   51.000000         NaN         NaN

The way I am using GridSearchCV is as follows:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf_h = RandomForestRegressor()
rf_a = RandomForestRegressor()

# Using GridSearch for optimisation
param_grid = {
    'max_features': ['auto', 'sqrt', 'log2']
}

rf_g_h = GridSearchCV(estimator=rf_h, param_grid=param_grid, cv=3, n_jobs=-1)
rf_g_a = GridSearchCV(estimator=rf_a, param_grid=param_grid, cv=3, n_jobs=-1)

# Fitting the dataframes to the prediction engines
rf_g_h.fit(X_h, y_h)
rf_g_a.fit(X_a, y_a)

How can I optimise param_grid, and hence determine the best_params_, for these models?

What would be the best set of values for n_estimators for this dataset?

PyNoob

2 Answers


In general there's no way to know the best values to try for a parameter. The only thing one can do is to try many possible values, but:

  • this necessarily requires more computing time (see this question about how GridSearchCV works)
  • there is a risk of overfitting the parameters, i.e. selecting a value which is optimal by chance on the validation set.
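
To make "try many possible values" concrete, here is a minimal sketch using the X_h/y_h data from the question (the parameter values below are purely illustrative, not recommendations for this dataset):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# A wider (illustrative) grid: every combination is tried, so the cost
# grows multiplicatively with each added parameter or value.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [None, 10, 20],
}

grid = GridSearchCV(RandomForestRegressor(), param_grid, cv=3, n_jobs=-1)
grid.fit(X_h, y_h)
print(grid.best_params_, grid.best_score_)

# The randomized variant samples a fixed number of combinations (n_iter),
# which keeps the cost bounded when the grid gets large.
rand = RandomizedSearchCV(RandomForestRegressor(), param_grid, n_iter=10,
                          cv=3, n_jobs=-1, random_state=0)
rand.fit(X_h, y_h)
print(rand.best_params_, rand.best_score_)

RandomizedSearchCV caps the cost at n_iter sampled combinations, which is one way to try a wider grid without paying the full brute-force price.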
Erwan
  • This is exactly what I wanted to see and understand. I am a Python noob but a soccer fan; I wanted to build a prediction model and have built one. Too much detail, but I want to say **thank you**, this will help me build a better model. So beautifully written. I used dicts in my code, but when I saw your response I understood what a dict actually is. So, thank you. But I cannot accept it as the answer because it is not, and I **want** the best params. – PyNoob Jul 16 '21 at 14:55
  • Again, sorry, not being rude, just excited. I will wait 7 more days and, if nothing better comes, will mark yours as the best answer. Sorry, I did not mean to come out rude, but such a good answer, thank you. – PyNoob Jul 16 '21 at 14:56
  • Also, what is `'max_features': ['auto', 'sqrt', 'log2']` in my code? What are the various options there, and how do they work? I understand it is a lot to ask, but I want to predict soccer matches and want the best combinations possible in the model, so I **have** to know what goes in there and how it works. It is the `RandomForestRegressor` engine, so I am not going into more detail. – PyNoob Jul 16 '21 at 15:05
  • I am sorry, I am being verbose; this is possibly the wrong place to get in touch, but I want to ask. Any other place, email, GitHub page and so on, or chat forums, would be overkill. If you would like, I can ask a similar question on your page and continue the discussion there. – PyNoob Jul 16 '21 at 15:08
  • @PyNoob Technically the default grid search is a [brute-force](https://en.wikipedia.org/wiki/Brute-force_search) algorithm; it just tries every possible value. The randomized version can be a good option in case you want to try many values. A more manual approach is to plot the performance for some values of a single parameter, for instance n=50,100,150..., then by observing how the curve evolves you get a more precise idea of which range of values is optimal (a short sketch of this follows these comments). You can do that iteratively, but it takes time of course. – Erwan Jul 16 '21 at 20:10
  • About `max_features`, it's a technical parameter for [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). Btw `auto` and `sqrt` are the same, so you don't need to use both. It would be too long to explain this in a comment; ask a new question about it if you want more detail. Imho it's not a very important parameter. In any case parameter tuning rarely improves performance a lot compared to default parameters, usually it just improves a little bit. – Erwan Jul 16 '21 at 20:16
  • Just a general remark: predicting a problem like soccer games is a difficult task; in this kind of problem the best you can hope for is that the system performs better than a random guess, it's not going to be right most of the time. Btw it's not really a problem to use comments here, don't worry. – Erwan Jul 16 '21 at 20:22
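
A short sketch of the manual approach described in the comments above, assuming the X_h/y_h data from the question (the candidate values for n_estimators are just examples):

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

n_values = [50, 100, 150, 200, 250]
scores = []
for n in n_values:
    model = RandomForestRegressor(n_estimators=n, n_jobs=-1)
    # mean cross-validated score over 3 folds (R^2 by default for a regressor)
    scores.append(cross_val_score(model, X_h, y_h, cv=3).mean())

plt.plot(n_values, scores, marker='o')
plt.xlabel('n_estimators')
plt.ylabel('mean CV score')
plt.show()

Watching where the curve flattens out gives a rough idea of the range of n_estimators worth including in the grid.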

Instead of GridSearchCV, you should try Optuna. It is much faster than GridSearchCV.

But apart from that, coming to your question, there is no best value for a hyperparameter per se! Period! It depends on what kind of data you have. What hyperparameter value works for one dataset might not work for another dataset.

Also, another point to keep in mind: a model like Random Forest has many hyperparameters. Including all of them with a wide range of values in your grid search will take forever. Instead, include only those parameters that give the maximum improvement in your results (i.e. those that matter the most!). Here is a link to a blog that might help you: https://blog.dataiku.com/narrowing-the-search-which-hyperparameters-really-matter
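
Since Optuna is not a drop-in replacement for GridSearchCV, here is a minimal sketch of what the switch might look like, assuming a RandomForestRegressor and the X_h/y_h data from the question (the search ranges and n_trials are placeholders, not tuned values):

import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Optuna samples a value for each parameter on every trial
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2']),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
    }
    model = RandomForestRegressor(**params, n_jobs=-1)
    # mean cross-validated score is what Optuna is asked to maximise
    return cross_val_score(model, X_h, y_h, cv=3).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=25)
print(study.best_params)

So instead of passing a fixed param_grid, you write an objective function that Optuna calls once per trial, and read the result from study.best_params.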

Hope it helps!

spectre
  • How can I call and use Optuna? Can I just replace `GridSearchCV` with `optuna`? – PyNoob Jul 20 '21 at 01:30
  • Here is an excellent article about the implementation of Optuna: https://www.analyticsvidhya.com/blog/2020/11/hyperparameter-tuning-using-optuna/ If you found my answer useful, kindly mark it as best! Happy coding! – spectre Jul 20 '21 at 04:43