Classification of DNA Sequences

Question

I have a database of 3190 instances of DNA consisting of 60 sequential DNA nucleotide positions classified according to 3 types: EI, IE, Other.

I want to formulate a supervised classifier.

My present approach is to compute a 2nd order Markov transition matrix for each instance and feed the resulting values to a neural network.
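For readers who want to see what such a per-instance matrix might look like, here is a minimal sketch. It is not the asker's actual code: the function names, the A/C/G/T alphabet and the choice to skip windows containing other symbols are all assumptions made for illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# All 16 possible two-nucleotide contexts: "AA", "AC", ..., "TT".
CONTEXTS = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def second_order_transition_matrix(seq):
    """Return a 16x4 row-normalised matrix estimating P(next | previous two)."""
    counts = {ctx: {n: 0 for n in NUCLEOTIDES} for ctx in CONTEXTS}
    for i in range(len(seq) - 2):
        ctx, nxt = seq[i:i + 2], seq[i + 2]
        # Assumption: windows containing symbols other than A/C/G/T are skipped.
        if ctx in counts and nxt in counts[ctx]:
            counts[ctx][nxt] += 1
    matrix = []
    for ctx in CONTEXTS:
        total = sum(counts[ctx].values())
        # Contexts never seen in this sequence get an all-zero row.
        matrix.append([counts[ctx][n] / total if total else 0.0
                       for n in NUCLEOTIDES])
    return matrix


def flatten(matrix):
    """Flatten the 16x4 matrix into a 64-value feature vector."""
    return [value for row in matrix for value in row]
```

A 60-nucleotide instance yields at most 58 transitions spread over 64 cells, so under these assumptions each per-instance matrix will be sparse and coarsely estimated.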

How best to approach this classification problem, given that the order of the nucleotides in each sequence should be relevant? Is there a better approach than the one I came up with?

From a supervised learning perspective, is this a standard problem with 3190 instances and 60 columns, where each column is a categorical variable? If that's the case, why not feed in the transition probabilities as additional features and throw your favorite classifiers at it? It could be a neural net, but it might as well be a GBM. Also, how do you know that it's a 2nd order Markov chain? What method are you using to estimate the order (e.g. BIC, AIC, etc.)? – Nitesh, Nov 10 '14 at 18:10
@nitesh the 2nd order Markov chain is based on the observation that DNA sequences are composed of codons, each made up of 3 nucleotides. – akellyirl, Nov 10 '14 at 21:27

score 3 · Accepted Answer · answered Nov 13 '14 at 07:17

One way would be to create 20 features, each representing a codon (a triplet of consecutive nucleotides, so 60 / 3 = 20). You would then have a dataset with 3190 instances and 20 categorical features, each taking one of up to 4^3 = 64 values. There is no need to treat the sequence as a Markov chain.

Once the dataset has been featurized as suggested above, any supervised classifier can work well. I would suggest using a gradient boosting machine as it might be better suited to handle categorical features.
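The codon featurization described above could be sketched roughly as follows. The function names, the `frame` parameter and the integer encoding are hypothetical choices for illustration, and the sketch assumes the codons start at the first nucleotide unless a different frame is passed.

```python
from itertools import product

# Map each of the 64 possible codons to an integer code (illustrative encoding).
CODON_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=3))}


def codon_features(seq, frame=0):
    """Split a sequence into consecutive non-overlapping triplets.

    `frame` is the assumed offset of the first codon (0, 1 or 2).
    """
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def encode(codons):
    """Turn codons into integer category codes; unknown triplets become -1."""
    return [CODON_INDEX.get(codon, -1) for codon in codons]
```

With `frame=0`, a 60-nucleotide instance gives 20 categorical features. Whether a given classifier treats the integer codes as categories or as ordered numbers depends on the library, so one-hot encoding may be the safer choice.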

Nitesh

1,615
1
12
22

One of the problems with this is that you don't know which nucleotide starts the codon. So you'd need to create 58 categorical features. Worth a try though. – akellyirl Nov 17 '14 at 10:47
That is a good point. However, since 30 are in E and 30 are in I (or the reverse), can we assume that the first nucleotide starts the codon? I might be totally wrong here, so please correct my hypothesis/reasoning. – Nitesh Nov 17 '14 at 16:19
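The 58 features mentioned in the comments above presumably correspond to every overlapping triplet in the sequence (60 - 3 + 1 = 58), which avoids having to know where the reading frame starts. A minimal sketch under that assumption, with a hypothetical function name:

```python
def overlapping_triplets(seq):
    """Return every triplet starting at each position (sliding window, step 1)."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


# Example on a made-up 60-nucleotide sequence (not taken from the dataset).
example = "ACGT" * 15
features = overlapping_triplets(example)
```

Here `features` has 58 entries, and the triplets from all three possible reading frames are interleaved in it, so the classifier is left to work out which positions matter.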