
I have been reading an early paper on pre-training in NLP (https://arxiv.org/abs/1511.01432) and I can't understand what random word dropout means. The authors don't explain the method at all, as if it were a standard thing. Can someone explain what they actually do and what the purpose of it is?

WoofDoggy
  • Does this answer your question? [Meaning of dropout](https://datascience.stackexchange.com/questions/37835/meaning-of-dropout) – Tom M. Feb 24 '20 at 22:04
  • @Tom M. Not exactly. I know what dropout is, but how do I apply it to words? Do I just remove some of them at random (sampling uniformly or by frequency?), or set them to some special token? If one sentence is 5 words long and another is 150 words, removing 50% of the words at random may have a very different effect in the two cases. In standard dropout the size of the layer is the same for every training example. – WoofDoggy Feb 24 '20 at 22:11

1 Answer


It is not uncommon that we can make sense of a sentence without reading it completely. Likewise, when you have a quick look at a document, you tend to skip over some words and still understand the main point. This is the intuition behind word dropout.

Generally this is done by dropping each word in a sequence independently at random, for example following a Bernoulli distribution:

$X \leftarrow X \odot \vec{e}, \quad e_i \sim \mathrm{Bernoulli}(1-p), \; i = 1, \dots, n$

where $X$ is the sequence of word token indices, $n$ is the length of the sequence, $p$ is the dropout probability, and $\vec{e}$ is a binary mask indicating the dropout state of each word.

This is usually done after computing the word embeddings, and the words selected to be dropped are typically replaced with the embedding of the `<UNK>` token.
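
For concreteness, here is a minimal NumPy sketch of that masking step (not the paper's exact implementation; the names `word_dropout` and `unk_id` are just for illustration). Each token id is independently replaced by the `<UNK>` id with probability `p`:

```python
import numpy as np

def word_dropout(token_ids, p=0.1, unk_id=0, rng=None):
    """Replace each token id with `unk_id` independently with probability p."""
    rng = np.random.default_rng() if rng is None else rng
    token_ids = np.asarray(token_ids)
    # e_i ~ Bernoulli(1 - p): True keeps the word, False drops it
    keep = rng.random(token_ids.shape) >= p
    return np.where(keep, token_ids, unk_id)

# Example: a 5-token "sentence"; with p=0.2 each token has a 20% chance of becoming <UNK>
print(word_dropout([12, 57, 3, 901, 44], p=0.2))
```

Note that the length of the sequence is unchanged; dropped positions are masked rather than deleted, which addresses the concern about sentences of different lengths.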

By doing this, we allow our model to learn more flexible ways of writing and conveying meaning.

TitoOrt
  • All right then, so I choose words at random from the sequence and remove them. Yoav Goldberg in his NLP book says: "word dropout may also be beneficial for preventing overfitting and improving robustness by not letting the model rely too much on any single word being present". – WoofDoggy Feb 25 '20 at 11:38
  • that's the idea – TitoOrt Feb 25 '20 at 12:10