
I understand this question may sound strange, but how do I pick the final random_seed for my classifier?

Below is some example code. It uses the SGDClassifier from scikit-learn on the iris dataset, and GridSearchCV to find the best random_state:

from sklearn.linear_model import SGDClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV

iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)


parameters = {'random_state':[1, 42, 999, 123456]}

sgd = SGDClassifier(max_iter=20, shuffle=True)
clf = GridSearchCV(sgd, parameters, cv=5)

clf.fit(X_train, y_train)

print("Best parameter found:")
print(clf.best_params_)
print("\nScore per grid set:")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

The results are the following:

Best parameter found:
{'random_state': 999}

Score per grid set:
0.732 (+/-0.165) for {'random_state': 1}
0.777 (+/-0.212) for {'random_state': 42}
0.786 (+/-0.277) for {'random_state': 999}
0.759 (+/-0.210) for {'random_state': 123456}

In this case, the difference between the best and the second-best score is 0.009. Of course, the train/test split also makes a difference.

This is just an example, where one could argue that it doesn't matter which one I pick: the random_state should not affect how the algorithm works. However, nothing prevents a scenario where the difference between the best and the second best is 0.1, 0.2, or even 0.99, a scenario where the random_seed makes a big impact.

  • In the case where the random_seed makes a big impact, is it fair to hyper-parameter optimize it?
  • When is the impact too small to care?
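
To make "impact" measurable, here is a small sketch that looks at the whole distribution of scores across many seeds instead of a grid of four (it reuses the X_train/X_test split from the code above; the range of 100 seeds is an arbitrary choice):

import numpy as np

# Fit the same model with many different seeds and look at the spread of test scores.
scores = []
for seed in range(100):
    model = SGDClassifier(max_iter=20, shuffle=True, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print("min %.3f / mean %.3f / max %.3f / std %.3f"
      % (np.min(scores), np.mean(scores), np.max(scores), np.std(scores)))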

2 Answers


TL;DR: I would suggest not optimising over the random seed. A better investment of your time would be to improve other parts of your model, such as the pipeline, the underlying algorithms, the loss function... heck, even optimise the runtime performance! :-)


This is an interesting question, even though (in my opinion) the random seed should not be a parameter to optimise.

I can imagine that researchers, in their struggle to beat the current state of the art on benchmarks such as ImageNet, may well run the same experiments many times with different random seeds, and just pick or average the best. However, the difference should not be considerable.

If your algorithm has enough data and goes through enough iterations, the impact of the random seed should tend towards zero. Of course, as you say, it may have a huge impact in extreme cases. Imagine I am categorising a batch of images into cat or dog. If I have a batch size of 1 and only 2 randomly sampled images, one of which is correctly classified and one of which is not, then the random seed governing which image is selected will determine whether I get 100% or 0% accuracy on that batch.
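
One rough way to check the "enough iterations" claim empirically is a sketch like the one below (it assumes the iris/SGDClassifier setup from your question; the 20 seeds and the two iteration counts are arbitrary choices). It compares how much the mean cross-validated score spreads across seeds for an under-trained versus a longer-trained model:

import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# Measure the across-seed spread of the mean CV score for two training budgets.
for max_iter in (20, 2000):
    scores = [cross_val_score(SGDClassifier(max_iter=max_iter, random_state=s),
                              X, y, cv=5).mean()
              for s in range(20)]
    print("max_iter=%4d: std of mean CV score across 20 seeds = %.3f"
          % (max_iter, np.std(scores)))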


Some more basic information:

The use of a random seed is simply to make results as reproducible as possible (or as close to reproducible as possible). All random number generators are really pseudo-random generators: the values appear to be random, but are not. This follows from the fact that (non-quantum) computers are deterministic machines, and so, given the same input, they will always produce the same output. Have a look here for some more information and relevant links to the literature.
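
As a quick illustration of that determinism (using NumPy's RandomState purely as an example of a seeded PRNG):

import numpy as np

# Two generators seeded identically produce exactly the same "random" sequence.
rng_a = np.random.RandomState(42)
rng_b = np.random.RandomState(42)

print(rng_a.rand(3))
print(rng_b.rand(3))  # identical to the line above, because the seed is identical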

n1k31t4
  • I understand that it makes no sense to pick the random seed of my train/test split, since in the end I will train with all the data I have. But in this example, the `random seed` is in the fitting procedure. I can raise `max_iter` to `2000` in the `SGDClassifier` and there will still be a difference in the results. In the end I need to pick a `seed` to fit my algorithm. Could it be that the `random seed` indeed has significant importance for the algorithm? – Bruno Lubascher Jul 22 '18 at 15:46
  • If you are doing everything right, and your dataset is not completely imbalanced in some way, the random seed really should not influence the results. Additionally, it does not help to gain trust in a model that delivers good or bad results depending on the random seed that was used. – n1k31t4 Jul 22 '18 at 17:23

You don't. It's random, you shouldn't control it. The parameter is only there so we can replicate experiments.

In cases where algorithms produce hugely different results with different randomness (such as the original K-Means [not the ++ version] and randomly seeded neural networks), it is common to run the algorithm multiple times and pick the run that performs best according to some metric. You can do that by just running the algorithm again, without re-seeding.
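
As a minimal sketch of that restart idea (the toy blobs dataset, the 10 restarts, and the use of scikit-learn's KMeans with inertia as the selection metric are all just illustrative choices):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Run K-Means several times without fixing its seed and keep the best run.
best_model = None
for _ in range(10):
    model = KMeans(n_clusters=4, n_init=1).fit(X)  # each call starts from different random centroids
    if best_model is None or model.inertia_ < best_model.inertia_:
        best_model = model

print("Best inertia over 10 restarts:", best_model.inertia_)

(scikit-learn's n_init parameter already does this internally; the loop is only spelled out to make the idea explicit.)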

But do not treat the random seed as something you can control. If you want to be able to replicate your model later, simply get the current seed (most operating systems use processor clock time, I think) and store it. Choosing a random seed because it performs best is completely overfitting/happenstance.
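
A minimal sketch of that "record it, don't tune it" idea (drawing the seed with Python's random module is just one possible way to do it):

import random
from sklearn.linear_model import SGDClassifier

# Draw a fresh seed once, log it, and reuse it whenever the run has to be replicated.
seed = random.randrange(2**32)
print("Seed used for this run (store this somewhere):", seed)

clf = SGDClassifier(max_iter=1000, random_state=seed)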

Note this all assumes a decent implementation of a random number generator with a decent random seed. Some pairs of RNG and seed may produce some predictable or less than useful random sequences.

Mephy
  • I agree I shouldn't control this parameter. But what about the case where some values perform very well and some poorly? In the end, I need to pick one for my 'production' model. Why should I pick just any value instead of one of the ones that perform well? – Bruno Lubascher Jul 22 '18 at 15:49
  • "Choosing a random seed because it performs best is completely overfitting/happenstance" - what is your justification for this statement please? – Matti Wens Jan 03 '19 at 22:05
  • @MattWenham Choosing a random seed manually means choosing all the "randomly" generated values manually (that's how PRNGs work). You're removing a parameter that was supposed to be random, and instead using the values that perform best on your data, thus making your final model biased towards the data at hand. Fitting to the data at hand instead of the overall distribution of the data is the very definition of overfitting. If you have a model with enough random parameters, you could just as well turn it into a lookup table for the test dataset. – Mephy Jan 03 '19 at 22:16
  • @Mephy Can you give an example of a '[hyper]parameter that was supposed to be random'? In such cases, I agree with your argument. But with e.g. Cross-Validation, the split of the data is determined by the random seed, and the actual results with different seeds can vary as much as using different hyperparameters. The seed, then, in some sense becomes another hyperparameter with a very large range of values! I am currently planning some experiments to determine whether averaging over otherwise identical runs using different seeds is advantageous. I can share the results if you're interested. – Matti Wens Jan 04 '19 at 19:38
  • @MattWenham Hyperparameters are never random (maybe randomly chosen, but not random). And a production model does not depend on the validation method used, cross-validation or otherwise. An example of a random parameter is the choice of features for a specific tree in a random forest classifier. This choice is made over and over again during the learning process, so changing the seed should not produce a meaningful change in performance. Other examples are the mutation operations in genetic algorithms. – Mephy Jan 04 '19 at 20:48
  • @Mephy In my experience, "should not" and "does not" are two very different things! The choice of seed will affect the structure of the model and hence its results. – Matti Wens Jan 05 '19 at 10:32
  • @MattWenham that's my point. It should not be different; any significant measurable difference is entirely dependent on your test data, thus it's overfitted. The results are different, but you have no basis for choosing one over the other. – Mephy Jan 05 '19 at 14:30
  • @Mephy So how should one choose between models when the performance of any such model is dependent upon the random seed? How should we choose a set of hyperparameters when the outcome is not dependent _only_ upon the hyperparameters themselves? – Matti Wens Jan 05 '19 at 19:00