If I have a matrix of co-occurring words in conversations of different lengths, is it appropriate to standardize / normalize the data prior to training?
My matrix is set up as follows: one row per two-person conversation, and columns are the words that co-occur between the speakers. I can't help but think that, since a longer conversation will likely contain more shared words than a shorter one, I should factor this in somehow. Below is a sketch of the kind of normalization I have in mind.
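To make this concrete, here is a minimal sketch of two options I'm considering (the matrix and token counts are hypothetical toy values, assuming a NumPy count matrix): row-wise L1 normalization, which turns raw counts into within-conversation proportions, or dividing by conversation length in tokens, which turns counts into per-token rates.

```python
import numpy as np

# Hypothetical co-occurrence counts: one row per conversation,
# one column per shared word (toy numbers for illustration).
X = np.array([
    [4, 0, 2, 1],    # short conversation
    [20, 5, 9, 6],   # long conversation: larger raw counts overall
], dtype=float)

# Option 1: row-wise L1 normalization -- each count becomes the
# proportion of that conversation's total shared-word count.
row_sums = X.sum(axis=1, keepdims=True)
X_prop = X / row_sums

# Option 2: divide by conversation length in tokens (if known),
# so counts become rates per token rather than proportions.
lengths = np.array([[150], [900]], dtype=float)  # hypothetical token counts
X_rate = X / lengths

print(X_prop)
print(X_rate)
```

Is something along these lines appropriate, or is there a standard approach (e.g., tf-idf-style weighting) for this situation?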