
I'm trying to find an algorithm that would fit this use case:

My data: a bunch of fixed-size integer arrays, e.g.

[0,2,3,4,5]
[1,2,3,1,5]
[4,1,2,4,5]
...

Input: a partial array of integers; output: a prediction for the rest of the array, e.g.

[0,1] -> [2,4,2]
[3] -> [1,3,2,5,5]
[3,2,4,1] -> [4]
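
For example, training pairs could be built from the full arrays by splitting them at every position (just a sketch of what I mean, in Python):

    # Sketch: build (prefix, rest-of-array) pairs from the full arrays.
    sequences = [
        [0, 2, 3, 4, 5],
        [1, 2, 3, 1, 5],
        [4, 1, 2, 4, 5],
    ]

    pairs = []
    for seq in sequences:
        for split in range(1, len(seq)):
            pairs.append((seq[:split], seq[split:]))

    print(pairs[0])  # ([0], [2, 3, 4, 5])
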
nwaldo
  • cool question! wish i had the answer... – Dave Kielpinski Apr 02 '20 at 01:51
  • This is really too broad a question to answer without some knowledge of the domain (i.e. how the data was constructed). For example, is there significance to the position in the array - is the probability of 2 following an initial zero related to the probability of 2 after a zero in second place? – Josh Friedlander Apr 02 '20 at 21:18
  • If so, this sounds parallel to next-word / next-n-gram models in NLP, except with a vocab of only 10 words. HMMs and LSTMs could work for this. – Josh Friedlander Apr 02 '20 at 21:20
  • Are you using a rolling window as your input data? – Donald S Jun 14 '20 at 05:04
  • If so, do you expect the same performance for each test example? – Donald S Jun 14 '20 at 05:05

2 Answers


So, the question is how to model the following problem: given a subsection of a sequence as input, output the rest of the sequence. This is a sequence-to-sequence problem.

For this, as Derek O has alluded to, I would suggest an encoder-decoder architecture. The encoder encodes your input sequence (using an RNN/LSTM) into a "hidden representation"; the decoder (another RNN/LSTM) then decodes that hidden representation into the remaining sequence of integers.

Here is an article that provides more detail around encoder-decoder models: https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346
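
A minimal sketch of what this could look like in Keras (the layer sizes, the vocabulary of 10 integers, and the teacher-forcing setup are assumptions on my part, not requirements):

    # Minimal encoder-decoder sketch (assumed sizes; integers 0..9 as the vocabulary).
    from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
    from tensorflow.keras.models import Model

    vocab_size, emb_dim, hidden = 10, 32, 64

    # Encoder: read the known prefix and keep the final LSTM state.
    enc_in = Input(shape=(None,))
    enc_emb = Embedding(vocab_size, emb_dim)(enc_in)
    _, state_h, state_c = LSTM(hidden, return_state=True)(enc_emb)

    # Decoder: generate the remaining integers, conditioned on the encoder state.
    # During training it receives the target suffix shifted by one step (teacher forcing).
    dec_in = Input(shape=(None,))
    dec_emb = Embedding(vocab_size, emb_dim)(dec_in)
    dec_seq = LSTM(hidden, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
    dec_out = Dense(vocab_size, activation="softmax")(dec_seq)

    model = Model([enc_in, dec_in], dec_out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # At inference time the decoder is run one step at a time, feeding each
    # predicted integer back in until the full array length is reached.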

shepan6

It depends on what these numbers represent.

If there is a sequential / time dependence (i.e. subsequent outputs depend on previous inputs), I would suggest an LSTM-based RNN.
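
As a rough sketch of that idea (Keras here; the layer sizes and the greedy decoding loop are only illustrative assumptions): train the LSTM to predict the next integer from the prefix, then feed each prediction back in until the array is complete.

    # Sketch: predict the next integer from a prefix, then extend greedily.
    import numpy as np
    from tensorflow.keras.layers import Embedding, LSTM, Dense
    from tensorflow.keras.models import Sequential

    vocab_size = 10  # assumed: integers 0..9

    model = Sequential([
        Embedding(vocab_size, 16),
        LSTM(32),
        Dense(vocab_size, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # model.fit(...) would be trained on (padded prefix, next integer) pairs.

    def predict_rest(prefix, total_len=5):
        """Greedily extend `prefix` until it reaches the fixed array length."""
        seq = list(prefix)
        while len(seq) < total_len:
            probs = model.predict(np.array([seq]), verbose=0)[0]
            seq.append(int(probs.argmax()))
        return seq[len(prefix):]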

Derek O