Opinions on an LSTM hyper-parameter tuning process I am using

Question

I am training an LSTM to predict a price chart. I am using Bayesian optimization to speed things slightly since I have a large number of hyperparameters and only my CPU as a resource.

Making 100 iterations from the hyperparameter space and 100 epochs for each when training is still taking too much time to find a decent set of hyperparameters.

My idea is this. If I only train for one epoch during the Bayesian optimization, is that still a good enough indicator of the best loss overall? This will speed up the hyperparameter optimization quite a bit and later I can afford to re-train the best 2 or 3 hyperparameter sets with 100 epochs. Is this a good approach?

The other option is to leave 100 epochs for each training but decrease the no. of iterations. i.e. decrease the number of training with different hyperparameters.

Any opinions and/or tips on the above two solutions?

( I am using keras for the training and hyperopt for the Bayesian optimisation)

German C M · Accepted Answer · 2020-05-07T21:47:56.927

First of all you might want to know there is a "new" Keras tuner, which includes BayesianOptimization, so building an LSTM with keras and optimizing its hyperparams is completely a plug-in task with keras tuner :) You can find a recent answer I posted about tuning an LSTM for time series with keras tuner here

So, 2 points I would consider:

I would not loop only once over your dataset, it does not sound like enough times to find the right weights. I would rather control the number of possible hyperparams configurations as you said, which is something you can indicate in keras tuner via max_trials param

About using keras tuner with Bayesian tuner, you can find some code below as an example for tuning the units (nodes) in the hidden layers and the learning rate:

from tensorflow import keras
from kerastuner.tuners import BayesianOptimization

n_input = 6
def build_model(hp):
    model = Sequential()
    model.add(LSTM(units=hp.Int('units',min_value=32,
                                    max_value=512,
                                    step=32), 
               activation='relu', input_shape=(n_input, 1)))
    model.add(Dense(units=hp.Int('units',min_value=32,
                                    max_value=512,
                                    step=32), activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mse', metrics=['mse'], optimizer=keras.optimizers.Adam(
        hp.Choice('learning_rate',
                  values=[1e-2, 1e-3, 1e-4])))

return model

bayesian_opt_tuner = BayesianOptimization(
    build_model,
    objective='mse',
    max_trials=3,
    executions_per_trial=1,
    directory=os.path.normpath('C:/keras_tuning'),
    project_name='kerastuner_bayesian_poc',
    overwrite=True)

bayesian_opt_tuner.search(train_x, train_y,epochs=n_epochs,
     #validation_data=(X_test, y_test)
     validation_split=0.2,verbose=1)


bayes_opt_model_best_model = bayesian_opt_tuner.get_best_models(num_models=1)
model = bayes_opt_model_best_model[0]

You would get something like this, informing you about the searched configurations and evaluation metrics:

I know its not enough times to find the right weights. But the purpose of Bayesian is to find the right hyperparameters not weights so my question was if my approach is a good tradeoff to find the right hyperparameters. Once I find those I will obviously loop over the dataset more times to find the right weights. Is this a good approach or not? and why? — user134132523, May 08 '20 at 09:13
So you can experiment by yourself :) Here you have the code for an LSTM changing the epochs and max_trials (hyperparms. combinations) — German C M, May 14 '20 at 16:08

German C M · Answer 2 · 2020-05-14T16:19:00.337

Here you can find the code to train an LSTM via keras and tune it via keras tuner, bayesian option:

#2 epoch con 20 max_trials
from kerastuner import BayesianOptimization

def build_model(hp):
    model = keras.Sequential()
    model.add(keras.layers.LSTM(units=hp.Int('units',min_value=8,
                                        max_value=64,
                                        step=8), 
                   activation='relu', input_shape=x_train_uni.shape[-2:]))
    model.add(keras.layers.Dense(1))

    model.compile(loss='mae', optimizer=keras.optimizers.Adam(
            hp.Choice('learning_rate',
                      values=[1e-2, 1e-3, 1e-4])),
                   metrics=['mae'])
    return model

# define model
bayesian_opt_tuner = BayesianOptimization(
    build_model,
    objective='mae',
    max_trials=20,
    executions_per_trial=1,
    directory=os.path.normpath('C:/keras_tuning'),
    project_name='timeseries_temp_ts_test_from_TF_ex',
    overwrite=True)

EVALUATION_INTERVAL = 200
EPOCHS = 2

bayesian_opt_tuner.search(train_univariate, #X_train, y_train,
             epochs=EPOCHS,
             validation_data=val_univariate,
             validation_steps=50,
             steps_per_epoch=EVALUATION_INTERVAL
             #batch_size=int(len(X_train)/2)
             #validation_split=0.2,verbose=1)
             )

I did it with a temperatures dataset, changing both epochs and hyperparams combinations. I think it depends also on the dataset you are playing with, for the one I quickly tried (with no representative results since it should be repeated enough times to get results distributions for each case), I saw no much difference (we should check it via a hypothesis tester for a robust conclusion), but there you can play with it. My quick results:

20 epochs, 2 hyperparams combinations:

2 epochs, 20 hyperparams combinations:

Your answer really helped me out. I have one question though. We get 4 options in KerasTuner (randomized search, sklearn, bayesian or hyperband). It is said that bayesian and hyperband are more likely to provide globally optimized solutions. According to you, out of these two which one is the best? — spectre, Jan 26 '22 at 11:26
I am glad it is useful. You can find more details on that comparison in this paper: https://jmlr.org/papers/v18/16-558.html, where they also focus on a speed-up advantage of hyperband over Bayesian optimisation — German C M, Jan 28 '22 at 07:08

Opinions on an LSTM hyper-parameter tuning process I am using

2 Answers2

Linked