
I have a database of 3190 DNA instances, each consisting of 60 sequential nucleotide positions and classified as one of 3 types: EI, IE, or Other.

I want to formulate a supervised classifier.

My present approach is to compute a 2nd-order Markov transition matrix for each instance and feed the resulting values to a neural network.
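To make the featurization concrete, here is a minimal sketch of how a 2nd-order transition matrix could be computed for one sequence. It assumes only the four unambiguous bases A, C, G, T matter (triplets containing any other symbol are skipped), and the function name `second_order_transitions` is just an illustrative choice, not part of any library.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: ambiguous symbols are ignored


def second_order_transitions(seq):
    """Return 64 features: P(next base | previous two bases), row-normalised.

    Rows are the 16 two-base contexts (AA, AC, ..., TT); columns are the
    4 possible next bases. The 16x4 matrix is flattened row by row.
    """
    contexts = ["".join(p) for p in product(ALPHABET, repeat=2)]
    counts = {c: {n: 0 for n in ALPHABET} for c in contexts}

    # Slide over the sequence: two bases of context, one base that follows.
    for i in range(len(seq) - 2):
        ctx, nxt = seq[i:i + 2], seq[i + 2]
        if ctx in counts and nxt in counts[ctx]:
            counts[ctx][nxt] += 1

    features = []
    for c in contexts:
        total = sum(counts[c].values())
        for n in ALPHABET:
            # Contexts never seen in this sequence get an all-zero row.
            features.append(counts[c][n] / total if total else 0.0)
    return features
```

Each 60-nucleotide instance yields only 58 transitions spread over 64 cells, so the per-instance matrix is very sparse, and the positions of the transitions within the sequence are lost.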

What is the best way to approach this classification problem, given that the order of the nucleotides in the sequence should be relevant? Is there a better approach than the one I came up with?

akellyirl
  • From a supervised learning perspective, is this a standard problem with 3190 instances and 60 columns, where each column is a categorical variable? If that's the case, why not feed in the transition probabilities as additional features and throw your favorite classifiers at it? It could be a neural net, but it might as well be a GBM. Also, how do you know that it's a 2nd-order Markov chain? What method are you using to estimate the order (e.g. BIC, AIC, etc.)? – Nitesh Nov 10 '14 at 18:10
  • @nitesh The 2nd-order Markov chain is based on the observation that DNA sequences are made up of codons, each consisting of 3 nucleotides. – akellyirl Nov 10 '14 at 21:27

1 Answer


One way would be to split each 60-nucleotide sequence into 20 consecutive, non-overlapping triplets and create 20 features (each feature representing a codon). In this way, you would have a dataset with 3190 instances and 20 categorical features. There is no need to treat the sequence as a Markov chain.
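As a rough sketch of that featurization, assuming each instance is a plain 60-character string and that the codon boundary sits at a known offset (`frame`, a hypothetical parameter added here for illustration):

```python
def codon_features(seq, frame=0):
    """Split a sequence into consecutive, non-overlapping triplets.

    `frame` (0, 1 or 2) is the assumed offset of the first codon.
    Trailing bases that do not fill a whole triplet are dropped.
    """
    usable = seq[frame:]
    return [usable[i:i + 3] for i in range(0, len(usable) - 2, 3)]


# Made-up 60-nucleotide sequence, used only to show the shape of the output.
example = "ACG" * 20
features = codon_features(example)
print(len(features))  # 20 categorical features for frame 0
```

Each triplet is then a categorical value that still has to be encoded (for example one-hot or as an integer code) before it goes into most classifiers.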

Once the dataset has been featurized as suggested above, any standard supervised classifier can be applied. I would suggest using a gradient boosting machine, as it might be better suited to handling categorical features.

Nitesh
  • One of the problems with this is that you don't know which nucleotide starts the codon. So you'd need to create 58 categorical features, one for each overlapping triplet. Worth a try though. – akellyirl Nov 17 '14 at 10:47
  • That is a good point. However, since 30 are in E and 30 are in I (or the reverse), can we assume that the first nucleotide starts the codon? I might be totally wrong here, so please correct my hypothesis/reasoning. – Nitesh Nov 17 '14 at 16:19
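The 58-feature variant raised in the comments can be sketched as a sliding window over the sequence. This is only an illustration, assuming 60-character string inputs; `overlapping_triplets` is a made-up helper name.

```python
def overlapping_triplets(seq, k=3):
    """Return every length-k window of seq, stepping one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Made-up 60-nucleotide sequence, used only to show the feature count.
example = "ACGT" * 15
windows = overlapping_triplets(example)
print(len(windows))  # 60 - 3 + 1 = 58 features
```

Taking every third window starting at offset 0, 1 or 2 recovers the codons for that reading frame, so this representation covers all three frames at the cost of strongly correlated features.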