I had a face-to-face interview for a data scientist job a few days ago. One of the questions I was asked was: for a classifier predicting the brand of a TV from some features (price, size, specs, ...) out of 4 possible brands, how do you encode the brand variable? My answer was one-hot encoding. It was accepted, but then they asked me to do it explicitly, and I sketched something like:
brand A -> [1,0,0,0]
brand B -> [0,1,0,0]
brand C -> [0,0,1,0]
brand D -> [0,0,0,1]
I was then corrected on the grounds that these columns are not independent, and told that the solution should have been three binary columns instead.
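To make the two encodings concrete, here is a minimal Python sketch of how I understand them. The helper `one_hot`, the `drop_last` flag, and the choice of brand D as the dropped reference level are assumptions for illustration only; any one of the four columns could be dropped instead.

```python
# Hypothetical brand labels. Treating the last brand ("D") as the
# dropped reference level is an assumption made for illustration.
BRANDS = ["A", "B", "C", "D"]


def one_hot(brand, drop_last=False):
    """Return the indicator vector for `brand`.

    With drop_last=True the last brand becomes the reference level and
    is encoded as all zeros, giving three columns instead of four.
    """
    levels = BRANDS[:-1] if drop_last else BRANDS
    return [1 if brand == level else 0 for level in levels]


# Four-column encoding (what I sketched) and three-column encoding
# (what the interviewers asked for).
full = {b: one_hot(b) for b in BRANDS}
reduced = {b: one_hot(b, drop_last=True) for b in BRANDS}

# In the four-column encoding every row sums to 1, so any one column
# is fully determined by the other three.
row_sums = [sum(vec) for vec in full.values()]

for b in BRANDS:
    print(b, full[b], reduced[b])
```

In the three-column version, brand D is identified by all three indicators being zero, so no information is lost by dropping the fourth column.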
Later it hit me that I do not know why independence is required, and also that three binary variables are not independent either; two would be.
Can someone provide some explanation to help with my confusion?