
If I have a matrix of co-occurring words in conversations of different lengths, is it appropriate to standardize / normalize the data prior to training?

My matrix is set up as follows: one row per two-person conversation, and columns are the words that co-occur between speakers. I cannot help but think that, since a longer conversation will likely contain more shared words than a shorter one, I should factor this in somehow.
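One way to "factor this in" is to convert raw co-occurrence counts into relative frequencies by dividing each row by its total. A minimal sketch with hypothetical data (the array values below are made up for illustration):

```python
import numpy as np

# Hypothetical co-occurrence matrix: one row per conversation,
# one column per shared word, entries are co-occurrence counts.
X = np.array([
    [4, 2, 0],    # short conversation
    [12, 6, 0],   # longer conversation, same word proportions
], dtype=float)

# Row-normalize so each row sums to 1: counts become relative
# frequencies, which removes the effect of conversation length.
X_rel = X / X.sum(axis=1, keepdims=True)

print(X_rel)
```

After normalization, the two rows above become identical, since they share the same word proportions despite different lengths.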

cookie1986
  • So your matrix comprises words as columns, and they take the value 1 if they appear in the conversation (row)? – Blenz Oct 21 '19 at 13:13
  • They appear as co-occurring word counts. So if the word 'hello' co-occurs more than once in the conversation, it is counted for each additional repetition. – cookie1986 Oct 21 '19 at 13:15

1 Answer


Thanks for clarifying in the comments. Tree-based models do not care about the absolute values a feature takes, only about the relative order of those values: a split threshold simply adapts to any monotonic rescaling of the feature. Normalization is needed mainly for linear models, k-NN, and neural networks, because those are affected by the absolute values of the features.
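The order-invariance claim can be sketched with a one-feature decision stump (a toy stand-in for a tree split; the data and the `best_stump` helper are illustrative, not from any library):

```python
import numpy as np

def best_stump(x, y):
    """Return the threshold t minimizing errors of the rule `1 if x > t`."""
    xs = np.sort(x)
    best_t, best_err = xs[0] - 1.0, np.inf
    # Candidate thresholds: midpoints between consecutive sorted values.
    for t in (xs[:-1] + xs[1:]) / 2:
        err = np.sum((x > t).astype(int) != y)
        if err < best_err:
            best_err, best_t = err, t
    return best_t

x = np.array([1.0, 3.0, 10.0, 40.0])
y = np.array([0, 0, 1, 1])

# Fit on the raw feature and on a standardized copy.
t_raw = best_stump(x, y)
x_std = (x - x.mean()) / x.std()
t_std = best_stump(x_std, y)

# The thresholds differ, but the induced predictions are identical,
# because standardization is monotonic and preserves the ordering.
pred_raw = (x > t_raw).astype(int)
pred_std = (x_std > t_std).astype(int)
print(pred_raw, pred_std)
```

The same argument extends to full trees: every split depends only on the sort order of a feature, which any monotonic transform preserves.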

You don't need to normalize/standardize.

Check this post.

Blenz