7

I would like to train a binary classifier on feature vectors. One of the features is a categorical feature with string values: the zip codes of a country.

Typically there are thousands of zip codes, and in my case they are strings. How can I convert this feature into a numerical one?

I do not think that one-hot encoding is a good solution in my case. Am I right in saying that? If so, what would be a suitable solution?

Rami
  • Do you need the granularity of zip codes? Could you calculate/infer the distances from them? The reason I ask is that you may not have enough observations for some zip codes depending on the density of the data – ImperativelyAblative Mar 03 '16 at 20:31
  • No, the distance between zip codes is irrelevant in my case. – Rami Mar 03 '16 at 20:32
  • Could you please elaborate on the problem and how the zip codes hold importance in the data? Are they helping in any inference that does not lead to a dead end? – RahulS Mar 03 '16 at 21:12
  • Yes, the zip code is information that might help in the classification task I am trying to solve. – Rami Mar 03 '16 at 21:16
  • what technology are you using for the problem? – RahulS Mar 03 '16 at 21:23
  • I am using Spark 1.6 – Rami Mar 03 '16 at 21:25
  • Did you vectorize the column containing the zipcodes? – RahulS Mar 03 '16 at 21:30
  • That is my question, actually: what is the best way to deal with this column? – Rami Mar 03 '16 at 21:31
  • OK, so you are talking about the hashing trick? And the TF-IDF vector will be a very sparse vector with a single non-zero dimension, which in this case corresponds to the IDF only, right? (See the hashing sketch after these comments.) – Rami Mar 03 '16 at 21:40
  • 1
    Welcome to the site @RahulS. Please leave comments such as this below the initial post, not as an "answer". – Emre Mar 03 '16 at 22:23
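For concreteness, here is a minimal scikit-learn sketch of the hashing trick discussed in the comments above. The feature width 2**18 and the "zip=" prefix are illustrative assumptions, not from the thread:

```python
from sklearn.feature_extraction import FeatureHasher

# Hashing trick: map each zip string to a fixed-width sparse vector
# without storing a vocabulary. 2**18 columns is an arbitrary choice.
hasher = FeatureHasher(n_features=2**18, input_type="string")
X = hasher.transform([["zip=10001"], ["zip=94105"], ["zip=60601"]])
print(X.shape)  # (3, 262144), one non-zero entry per row
```

As Rami notes, each row is extremely sparse: one non-zero entry per zip code.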

3 Answers

8

This is an old question, but I am surprised that no one has mentioned mean encoding (a.k.a. target encoding). It is very popular in supervised learning problems. Besides, I have seen people use the frequency, or the CDF of the frequency (to avoid noise generated by a heavy-tailed PDF), and they achieved pretty good results with LightGBM. However, I cannot rigorously explain why it works.
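As a concrete illustration, here is a minimal pandas sketch of smoothed mean encoding plus frequency encoding. The toy data, column names, and smoothing weight m are assumptions for the example:

```python
import pandas as pd

# Toy data: "zip" is the high-cardinality categorical, "y" the binary target.
df = pd.DataFrame({
    "zip": ["10001", "10001", "94105", "94105", "94105", "60601"],
    "y":   [1, 0, 1, 1, 0, 0],
})

global_mean = df["y"].mean()
stats = df.groupby("zip")["y"].agg(["mean", "count"])

# Smoothed mean encoding: shrink rare categories toward the global mean;
# the smoothing weight m is a tunable assumption.
m = 10.0
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["zip_mean_enc"] = df["zip"].map(smoothed)

# Frequency encoding, as also mentioned above.
df["zip_freq_enc"] = df["zip"].map(df["zip"].value_counts(normalize=True))
print(df)
```

In practice the per-category statistics should be computed on the training data only (or out-of-fold, as discussed in the comments below), so the encoding does not leak the target.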

Diansheng
  • 3
    Would mean encoding increase the risk of overfitting? Say, in reality the target label does not depend on the categorical feature, but by using mean encoding we introduce an artificial dependency of the target on the categorical feature. – Michael Larionov Jul 15 '18 at 20:54
  • 2
    @MichaelLarionov, good thought! I hadn't considered it until you mentioned it. The answer is no: if the target label does not depend on a feature, then the distribution of that feature is independent of the target, so asymptotically the mean-encoded values of all categories are the same. For example, if a disease has nothing to do with age groups, then the infection rate will be the same across all age groups, and this does not affect most models. (See the out-of-fold sketch below.) – Diansheng Jul 16 '18 at 03:18
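The overfitting concern raised in these comments is usually addressed with out-of-fold encoding. A minimal sketch, with illustrative data and names:

```python
import pandas as pd
from sklearn.model_selection import KFold

# Out-of-fold mean encoding: each row is encoded using statistics from
# the *other* folds only, so a row never sees its own label.
df = pd.DataFrame({
    "zip": ["10001", "10001", "94105", "94105", "60601", "60601"],
    "y":   [1, 0, 1, 1, 0, 1],
})

global_mean = df["y"].mean()
df["zip_enc"] = global_mean  # fallback for categories unseen in a fold
for train_idx, val_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(df):
    fold_means = df.iloc[train_idx].groupby("zip")["y"].mean()
    df.loc[df.index[val_idx], "zip_enc"] = (
        df.iloc[val_idx]["zip"].map(fold_means).fillna(global_mean).values
    )
print(df)
```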
7

One-hot-encoded zip codes shouldn't present a problem with modern tools, where feature vectors can be much wider (millions, even billions, of dimensions), but if you really want, you could aggregate zip codes into regions such as states. Of course, you should not use strings, but bit vectors. Two other dimensionality reduction options are MCA (PCA for categorical variables) and random projection.
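A minimal scikit-learn sketch of sparse one-hot encoding followed by random projection; the toy zip values and dimensions are illustrative, and in the asker's Spark stack, StringIndexer plus OneHotEncoder play the same role:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.random_projection import SparseRandomProjection

# Toy zip codes as strings (values are made up).
zips = np.array([["10001"], ["94105"], ["60601"], ["94105"]])

# Sparse one-hot encoding: each zip becomes a bit vector stored sparsely,
# so even thousands of zip codes are cheap to hold in memory.
ohe = OneHotEncoder(handle_unknown="ignore")
X = ohe.fit_transform(zips)  # scipy CSR matrix, shape (4, 3) here

# Optional dimensionality reduction via random projection.
rp = SparseRandomProjection(n_components=2, random_state=0)
X_low = rp.fit_transform(X)
print(X.shape, X_low.shape)
```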

Emre
  • 1
  • Thanks, so you mean that handling a binary vector with millions (even billions) of dimensions is not a problem for current tools (Spark, for example)? – Rami Mar 03 '16 at 21:12
  • 1
    Not in terms of the raw capacity to process it, but you have to be mindful of overfitting and the curse of dimensionality in such situations. If your training error significantly differs from your test error, use more regularization. – Emre Mar 03 '16 at 22:04
  • Thanks Emre, OK, so one-hot encoding + dimensionality reduction is a reasonable choice then. I also like what @RahulS suggested in applying TF-IDF to the column. I will try both and compare if I have time. Thanks again. – Rami Mar 04 '16 at 09:16
  • 2
    I wanted to add that while one-hot encoding zip codes will work just fine, a zip code is a content-rich feature that is ripe for value-added feature engineering. Think about what it could add to your data if you inner-join it to other zip code data sets: states can be extracted, latitudes and longitudes can be extracted, average summer high temperatures, days per year of rain, gun ownership, socio-economics, upward mobility, average home values, average education level, degree of urbanization, ... The list goes on and on, so zip can be highly rich in adding information. – AN6U5 Jul 29 '16 at 03:54
  • 1
    There are rather more effective algorithms than plain OHE. For example, you can use the empirical Bayes probability of the target per category; search for Daniele Micci-Barreca's paper on high-cardinality categorical preprocessing, or entity embeddings: https://arxiv.org/pdf/1604.06737.pdf – Novitoll Oct 02 '17 at 13:26
  • From a computation speed perspective, modern tools can handle the problem, but doesn't it still suffer in terms of accuracy, and also when clustering? – haneulkim Sep 08 '21 at 06:31
3

You can use embeddings, as mentioned in the comments; see, e.g., a general blog post on entity embeddings, or the Keras documentation for the Embedding layer, which can be used to learn the embedding. This approach is widely used by deep learning models when you need to reduce the number of features, and it works for a single categorical feature as well.
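A minimal Keras sketch of learning such an embedding; the vocabulary size, embedding width, and toy data are illustrative assumptions:

```python
import numpy as np
from tensorflow import keras

# Assumed setup: ~5000 distinct zip codes, already mapped to integer
# indices (e.g., via a string-to-index lookup); sizes are illustrative.
n_zips, emb_dim = 5000, 16

zip_input = keras.layers.Input(shape=(1,), dtype="int32")
x = keras.layers.Embedding(input_dim=n_zips, output_dim=emb_dim)(zip_input)
x = keras.layers.Flatten()(x)
output = keras.layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(zip_input, output)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Toy training data: random zip indices and binary labels.
X = np.random.randint(0, n_zips, size=(256, 1))
y = np.random.randint(0, 2, size=(256, 1))
model.fit(X, y, epochs=1, verbose=0)

# The learned vectors can be extracted and reused as dense features.
zip_vectors = model.layers[1].get_weights()[0]  # shape: (5000, 16)
```

The learned vectors can then be fed to any downstream classifier in place of the raw zip strings.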

eSadr