
I am going through the "Text classification with TensorFlow Hub" tutorial. In this tutorial, a total of 50,000 IMDb reviews are split into 25,000 reviews for training and 25,000 reviews for testing.
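For reference, the split in the tutorial looks roughly like this (a sketch assuming the `tensorflow_datasets` loader, not the tutorial code verbatim):

```python
import tensorflow_datasets as tfds

# The IMDB reviews dataset is exposed with a fixed 25,000/25,000 train/test split.
train_data, test_data = tfds.load(
    "imdb_reviews",
    split=["train", "test"],  # 25,000 reviews each
    as_supervised=True,       # yields (text, label) pairs
)
```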

I am surprised by this way of splitting the data, since I learned in Andrew Ng's course that for fairly small datasets (<10,000 examples) the "old-fashioned" rule of thumb was to consider 60% or 70% of the data as training examples and the remainder as dev/test examples.

Is there a reason behind this 50:50 split?

  • Is it common practice when working with text?
  • Does it have anything to do with using a "pre-trained" TensorFlow Hub layer?
Sheldon

2 Answers


Is it common practice when working with text?

No, you can split the dataset however you wish; in general, for real-world problems you should use cross-validation.
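For example, a generic k-fold cross-validation loop (a sketch with scikit-learn and placeholder data, not the IMDB reviews) could look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Placeholder data standing in for whatever features/labels you actually have.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))  # validation accuracy per fold

print(f"mean validation accuracy: {np.mean(scores):.3f}")
```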

Does it have anything to do with using a "pre-trained" TensorFlow Hub layer?

No, it doesn't.

fuwiak
  • Thanks for these clarifications, fuwiak! – Sheldon Apr 01 '20 at 13:15
  • Only if the true distribution does not change significantly over time, e.g. with time-series data. You don't want to be validating today's economics with economics from the 1950s. – NoName Dec 04 '21 at 21:48

A safer method is to use the integer part of $n_c \approx n^{\frac{3}{4}}$ examples for training, and $n_v \equiv n - n_c$ for validation (a.k.a. testing). If you are doing cross-validation, you could perform that whole train-test split at least $n$ times (preferably $2n$ if you can afford it), recording the average validation loss at the end of each cross-validation "fold" (replicate), which is what TensorFlow records anyway (see this answer for how to capture it). With Monte Carlo cross-validation (MCCV), for each of the $n$ (or $2n$, if resource constraints permit) replicates you could randomly select (without replacement, to keep things simple) $n_c$ examples for training and use the remaining $n_v$ examples for validation, without even stratifying the subsets (by class, for example, if you are doing classification).
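A rough sketch of one such MCCV replicate (the model fitting and loss recording are left as comments; the helper name is illustrative, not from any library):

```python
import numpy as np

def mccv_split(n, rng):
    # One MCCV replicate: n_c ~ n^(3/4) training indices, the rest for validation.
    n_c = int(n ** 0.75)           # integer part of n^(3/4)
    perm = rng.permutation(n)      # random order, i.e. sampling without replacement
    return perm[:n_c], perm[n_c:]

rng = np.random.default_rng(0)
n = 640                            # total number of examples
for replicate in range(n):         # at least n replicates, 2n if you can afford it
    train_idx, val_idx = mccv_split(n, rng)
    # ... fit the model on train_idx, record the validation loss on val_idx ...
```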

This is based on a 1993 paper (look at my answer here for more information) by J. Shao in which he proves that $n_c \approx n^{\frac{3}{4}}$ is optimal for linear model selection. At that time, non-linear models such as those used in machine learning (see this answer for yet another discussion on that) were not as popular, but as far as I know (I would love to be proven wrong) nobody has taken the time to prove anything similar for what is in popular use today, so this is the best answer I can give you right now.

UPDATE: Knowing that GPUs work most efficiently when they are fed batches whose size is a power of two, I have calculated different ways to split data into training and validation sets which follow Jun Shao's strategy of making the training set size $n_c \approx n^{\frac{3}{4}}$ and where both $n_c$ and $n_v \equiv n - n_c$ are close to powers of two. An interesting note is that for $n = 640$, $n_c \approx 127$ and therefore $n_v \approx 513$; because $127 \approx 2^7$ and $513 \approx 2^9$, I plan to go ahead and use those as my training and validation set sizes whenever I am generating simulated data.
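A quick way to reproduce that arithmetic for candidate values of $n$ (a hypothetical helper that just restates the rule above):

```python
import math

def shao_split(n):
    # Shao's rule: training size is the integer part of n^(3/4).
    n_c = int(n ** 0.75)
    return n_c, n - n_c

for n in (512, 640, 1024, 4096):
    n_c, n_v = shao_split(n)
    print(f"n={n}: n_c={n_c} (~2^{round(math.log2(n_c))}), "
          f"n_v={n_v} (~2^{round(math.log2(n_v))})")
```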

brethvoice
  • Thanks for your detailed answer! – Sheldon Apr 02 '20 at 07:20
  • Hi @Sheldon, you should know that most people assume you need to train with more data than you validate/test with. I have never been able to figure out why, though; maybe someone else will answer/comment and elucidate. NOTE: When you actually USE the data to train a model for real predictions, you would train on all available data, possibly even augmented/bootstrapped. The intent of cross-validation is to avoid overfitting when selecting a base algorithm/model to use in the future on unknown data. – brethvoice Apr 02 '20 at 14:17
  • Indeed, my understanding is that the rule of thumb when working with a very large number of examples is actually to use more than 90% of those as training examples. – Sheldon Apr 02 '20 at 20:12
  • By the way, since you mention bootstrap in your latest comment: is bootstrap used as a way to validate the model performance or simply to "shuffle" the dataset prior to applying another cross-validation method (MCCV, K-fold, LOOCV)? – Sheldon Apr 02 '20 at 20:24
  • @Sheldon I believe bootstrapping is one specific form of so-called "data augmentation", and it does require shuffling, since existing data are randomly selected to be repeated. However, the only one of the three CV methods you mentioned that requires shuffling is MCCV, and that is by definition; bootstrapping happens before any cross-validation occurs, to "plus up" the size of the training *and* validation/testing data sets (see the sketch after these comments). – brethvoice Apr 02 '20 at 22:51
  • BTW @Sheldon I am currently thinking of using a Bayesian marginal likelihood (after some initial training to force a useful prior, details fuzzy) instead of Shao's method. Based solely on this paper: https://arxiv.org/pdf/1905.08737v1.pdf – brethvoice Sep 30 '20 at 19:22
  • Thanks for your feedback! Were you able to implement this strategy? How do the results compare to those obtained using Shao's method? – Sheldon Oct 20 '20 at 05:53
  • @Sheldon yes I am having some pretty good success avoiding cross-validation altogether by approximating Bayesian optimization another way: see https://stackoverflow.com/a/64361230/12763497 (SO) and https://stats.stackexchange.com/a/491268/272731 (stats.SE) to check whether anyone else has tried it. Github repo with demo code is at https://github.com/brethvoice/optuna_demo_MNIST if you want to give it a whirl (demo runs in about 30 minutes, or the "simple" one is set to run overnight) – brethvoice Oct 20 '20 at 12:47
  • @Sheldon I just posted this question which proposes using some of Shao's theory within the context of machine learning validation during the training loop. I encourage you to chime in there if you are a member of Cross-Validated SE: https://stats.stackexchange.com/q/494109/272731 – brethvoice Oct 28 '20 at 20:20
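For concreteness, here is a minimal sketch of the bootstrap resampling mentioned in the comments above (placeholder arrays, and the factor of 2 is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.random.rand(640, 20)               # placeholder features
y = np.random.randint(0, 2, size=640)     # placeholder labels

# Bootstrap: draw indices with replacement to "plus up" the dataset
# before any train/validation splitting takes place.
idx = rng.integers(0, len(X), size=2 * len(X))
X_boot, y_boot = X[idx], y[idx]
```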