3

Fundamentally, under what circumstances is it reasonable to do HPO only on a subsample of the training set?

I am using Population Based Training to optimise hyperparameters for a sequence model. My dataset consists of 20M sequences, and I was wondering whether it would make sense to optimise on a subsample due to a restricted budget.

hH1sG0n3
  • 1,978
  • 7
  • 27
  • Does this post answer your question? [Is sampling a valid way to reduce complexity?](https://datascience.stackexchange.com/questions/85120/is-sampling-a-valid-way-to-reduce-complexity) – etiennedm Dec 16 '20 at 08:07
  • Thanks @etiennedm, I am not sure whether this applies to supervised learning, as in my case. I have updated my question to make my problem clearer. – hH1sG0n3 Dec 16 '20 at 10:42
  • That is true. In your case, you could benefit from the fact that it is supervised learning by making sure to preserve the class distribution. – etiennedm Dec 16 '20 at 12:02

1 Answer

1

Your subsample has to be representative of your original dataset.

To do so, since you are in a supervised setting, I would draw a random subsample that preserves the class distribution (for instance, randomly taking 40% of each class).

Note
If you have classes with very few examples, I would not subsample those at all. The risk is that, even with random sampling, you can lose information when a class is too small. Moreover, if your problem is computation time, keeping the very small classes intact while sampling the large ones costs almost nothing.
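The strategy above (per-class random sampling, with small classes kept in full) can be sketched as follows. This is a minimal illustration, not from the original post; the 40% fraction and the `min_keep` threshold are illustrative assumptions.

```python
import numpy as np

def stratified_subsample(labels, frac=0.4, min_keep=50, seed=0):
    """Return sorted indices of a subsample that preserves the class distribution.

    Classes with at most `min_keep` examples are kept in full
    (illustrative threshold), so rare classes are never thinned out.
    """
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        if len(idx) <= min_keep:
            # Rare class: keep every example to avoid losing information.
            keep.append(idx)
        else:
            # Large class: draw `frac` of it uniformly at random.
            n = int(round(frac * len(idx)))
            keep.append(rng.choice(idx, size=n, replace=False))
    return np.sort(np.concatenate(keep))

# Example: two large classes and one rare class (20 examples).
labels = np.array([0] * 1000 + [1] * 600 + [2] * 20)
idx = stratified_subsample(labels, frac=0.4, min_keep=50)
sub = labels[idx]
```

Here the subsample contains 40% of classes 0 and 1 (400 and 240 examples) but all 20 examples of the rare class 2. `scikit-learn`'s `train_test_split(..., stratify=labels)` offers similar stratified behaviour, though without the keep-rare-classes-whole rule.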

etiennedm
  • 1,345
  • 5
  • 13
  • Sure, that makes sense, and that has been my strategy. However, I am not sure that a config that works for 1M datapoints generalises to 15M. I would expect it to carry over if model performance on 1M were the same as on 10M (same signals), but that is just not the case most of the time. – hH1sG0n3 Dec 16 '20 at 14:43
  • 1
    I just got your point: you'd like to retrain a final model on your 20M points, keeping the hyperparameters from an optimisation run on 1M, is that correct? In my answer I thought you wanted to do HP tuning to train a model on 1M and then use it on the 20M points. – etiennedm Dec 16 '20 at 14:56