I've just realized my prediction approach for LSTM might not be correct.
I am trying to predict text character by character, reading through a book. The way I've approached the problem is as follows:
           (carry cell state forward)
   b              c              d              e
   ^              ^              ^              ^
LSTM_t0 -----> LSTM_t1 -----> LSTM_t2 -----> LSTM_t3
   ^              ^              ^              ^
   a              b              c              d
This means I have 4 timesteps, and at each one I feed the next letter into the LSTM, expecting it to immediately predict the letter that follows.
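To make the pairing concrete, here is a minimal sketch of how the shifted input/target sequences are built (plain Python; the helper name `make_char_pairs` is just for illustration):

```python
def make_char_pairs(text):
    """Build per-timestep (input, target) pairs for next-char prediction.

    For text "abcde" this yields inputs "abcd" and targets "bcde":
    at each timestep the LSTM sees one character and is asked to
    predict the character that follows it.
    """
    inputs = list(text[:-1])   # a b c d
    targets = list(text[1:])   # b c d e
    return inputs, targets

inputs, targets = make_char_pairs("abcde")
print(inputs)   # ['a', 'b', 'c', 'd']
print(targets)  # ['b', 'c', 'd', 'e']
```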
Should I instead do this:
 ignore         ignore         ignore           e
   ^              ^              ^              ^
LSTM_t0 -----> LSTM_t1 -----> LSTM_t2 -----> LSTM_t3
   ^              ^              ^              ^
   a              b              c              d
In the first case, I get 4 loss values, but in the second example I only have 1 source of gradient, at t3.
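The difference between the two setups comes down to how the per-timestep losses are aggregated. A minimal sketch, assuming plain Python and made-up probabilities that a hypothetical model assigns to the correct next character at each timestep:

```python
import math

def cross_entropy(prob_of_target):
    """Negative log-likelihood of the correct next character."""
    return -math.log(prob_of_target)

# Hypothetical probabilities the model assigns to the correct target
# character at each of the 4 timesteps (numbers made up for illustration).
p_correct = [0.10, 0.25, 0.40, 0.60]
step_losses = [cross_entropy(p) for p in p_correct]

# Option 1: every timestep is supervised, so all 4 terms contribute
# gradient; the total loss is their average.
loss_all_steps = sum(step_losses) / len(step_losses)

# Option 2: only the final timestep is supervised, so a single term
# contributes gradient.
loss_last_step = step_losses[-1]
```

With option 1, backpropagation through time receives an error signal at every timestep; with option 2, the only signal originates at t3 and must flow backward through the whole chain.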
My main concern is that in the first example, I demand the LSTM make predictions of 'b' and 'c' without supplying it enough previous context. That seems fine for 'd' and 'e', but isn't asking for an answer at timesteps 0 and 1 a bit unfair?
What would be best for this particular example?