
I am reading this paper "Sequence to Sequence Learning with Neural Networks" http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

Under "2. The Model" it says:

The LSTM computes this conditional probability by first obtaining the fixed dimensional representation v of the input sequence (x1, . . . , xT ) given by the last hidden state of the LSTM, and then computing the probability of y1, . . . ,yT′ with a standard LSTM-LM formulation whose initial hidden state is set to the representation v of x1, . . . , xT:
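For reference, the equation that follows this sentence in the paper (its equation (1)) factorizes the conditional probability, with each factor represented by a softmax over the vocabulary:

$$p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \dots, y_{t-1})$$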

I know what an LSTM is, but what's an LSTM-LM? I've tried Googling it but can't find any good leads.

  • But this sentence is still puzzling to me. If I put it into an equation it gives [![](https://i.stack.imgur.com/0Lv8L.png)](https://i.stack.imgur.com/0Lv8L.png) [![](https://i.stack.imgur.com/et5Sf.png)](https://i.stack.imgur.com/et5Sf.png) with c the last hidden state of the encoder. Then the first hidden state represents the information provided by the encoder, but the next ones represent the probability distribution over the target sequence's elements: something of a radically different nature. Also, the cell state initialisation is not given, and Figure 1 suggests that the LSTM provid… – Charles Englebert Sep 18 '18 at 12:40

2 Answers


By definition, a Language Model (LM) is a probability distribution over sequences of words.

The simplest illustration of an LM is predicting the next word given the previous word(s).

For example, if I have a language model and some initial word(s):

  • I set my initial word to My
  • My model predicts there is a high probability that name appears after My.
  • By setting the initial words to My name, my model predicts there is a high probability that is appears after My name.
  • So it's like: My -> My name -> My name is -> My name is Tom, and so on.

You can think of the autocompletion on your smartphone keyboard. In fact, an LM is the heart of autocompletion.

So, an LSTM-LM simply uses an LSTM (plus a softmax output layer) to predict the next word given the previous words.
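Here is a minimal sketch of that idea in PyTorch (the class name, layer sizes, and token ids are illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Predicts a distribution over the next word given the previous words."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) integer word ids
        emb = self.embed(tokens)            # (batch, seq_len, embed_dim)
        out, state = self.lstm(emb, state)  # (batch, seq_len, hidden_dim)
        logits = self.proj(out)             # (batch, seq_len, vocab_size)
        return logits, state

# Greedy "autocomplete": feed the words so far, take the most likely next word.
model = LSTMLanguageModel(vocab_size=10000)
prefix = torch.tensor([[4, 17]])            # hypothetical ids for "My name"
logits, _ = model(prefix)
next_word_probs = torch.softmax(logits[0, -1], dim=-1)
next_word_id = next_word_probs.argmax().item()
```

The softmax at the end is what turns the LSTM's hidden state into a probability distribution over the vocabulary, which is exactly what makes it a language model.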

By the way, language models are not limited to LSTMs: you can use other RNNs (e.g. a GRU) or other structured models. In fact, you can also use a feedforward network with a context/sliding/rolling window to predict the next word given the previous words.
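For contrast, here is a sketch of that fixed-window feedforward variant (again, the names and sizes are my own illustrative choices): it concatenates the embeddings of the last few words and predicts the next one.

```python
import torch.nn as nn

class WindowLanguageModel(nn.Module):
    """Feedforward LM: predicts the next word from a fixed window of previous words."""
    def __init__(self, vocab_size, window=3, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(window * embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, window_tokens):
        # window_tokens: (batch, window) — the last `window` word ids
        emb = self.embed(window_tokens)   # (batch, window, embed_dim)
        flat = emb.flatten(start_dim=1)   # concatenate the window's embeddings
        return self.mlp(flat)             # (batch, vocab_size) logits
```

Unlike the LSTM, this model can only see a fixed number of previous words, which is why recurrent models became the standard choice for language modelling.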

Rizky Luthfianto

In this context I think it means you take the output representation and learn an additional softmax layer over the tokens of your language model (in this case, words).

Bhav Ashok