
What is the best encoder for categorical data in unsupervised learning?

I am using unsupervised learning (such as K-means) on mixed data.

Before running my unsupervised algorithm, I reduce the dimensionality of my data with FAMD (PCA for mixed data), which gives me coordinates for the observations in a lower-dimensional space.

FAMD requires one-hot encoding (a.k.a. dummy variables) and is based on SVD. The SVD can be very time-consuming when the number of dimensions is high, which is my case since I have categorical variables with a large number of modalities. A sketch of the kind of pipeline I mean is below.
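For context, a minimal sketch of this kind of pipeline (column names and data are placeholders; the FAMD step is approximated here with plain one-hot encoding plus TruncatedSVD from scikit-learn rather than a dedicated FAMD implementation, but the cost issue is the same since both are SVD-based):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Hypothetical mixed dataset: numeric columns plus categoricals
# that in practice have many modalities.
df = pd.DataFrame({
    "age": [23, 45, 31, 52],
    "income": [30_000, 80_000, 45_000, 70_000],
    "city": ["Paris", "Lyon", "Paris", "Nice"],
    "job": ["nurse", "engineer", "teacher", "nurse"],
})

# One-hot encode the categoricals: the matrix width grows with the number of modalities.
X = pd.get_dummies(df, columns=["city", "job"])
X[["age", "income"]] = StandardScaler().fit_transform(X[["age", "income"]])

# SVD-based reduction to a few coordinates.
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# K-means on the reduced coordinates.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
print(labels)
```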

So I am looking for an encoder suitable for categorical data in the context of unsupervised learning.

I've found some pretty good encoders for supervised learning (target encoding, for example), but they are not adapted to unsupervised learning.

related post: Unsupervised encoding of categorical features

1 Answer


Different unsupervised machine learning algorithms have different assumptions. K-means clustering requires computing Euclidean distances, so any encoding has to be consistent with Euclidean distance computation.

Encoding categorical variables into a continuous embedding space (e.g., word2vec) would be consistent with Euclidean distance computation.
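A minimal sketch of one way to do this with gensim (parameter names are for gensim 4.x; the data, dimensions, and the choice of treating each row's categorical values as a "sentence" are just illustrative assumptions):

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical rows of categorical values, treated as "sentences" so that
# values co-occurring in the same row end up close in the embedding space.
rows = [
    ["Paris", "nurse"],
    ["Lyon", "engineer"],
    ["Paris", "teacher"],
    ["Nice", "nurse"],
]

# Train a small embedding; min_count=1 keeps rare modalities.
model = Word2Vec(sentences=rows, vector_size=8, window=2, min_count=1, sg=1, seed=0)

# Represent each row as the average of its value embeddings; these vectors
# live in a continuous space where Euclidean distance (and hence K-means) makes sense.
X = np.array([np.mean([model.wv[v] for v in row], axis=0) for row in rows])
print(X.shape)  # (4, 8)
```

The row vectors `X` can then be fed directly to K-means, and the embedding dimension stays fixed regardless of how many modalities each variable has.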

One-hot encoding is not consistent with Euclidean distance computation: every pair of distinct categories ends up at exactly the same distance, so the geometry carries no information about which categories are similar to each other.
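To make that concrete, a quick sketch (modality labels are hypothetical):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# One-hot vectors for three modalities of a single categorical variable,
# e.g. "Paris", "Lyon", "Nice".
onehot = np.eye(3)

# Every pair of distinct modalities sits at the same Euclidean distance (sqrt(2)),
# so K-means cannot see any structure among the categories themselves.
print(squareform(pdist(onehot)))  # all off-diagonal entries ~1.414
```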

Brian Spiering
  • I'll have a look at word2vec. See FAMD, where you can use Euclidean distance with categorical features after the transformation: https://en.wikipedia.org/wiki/Factor_analysis_of_mixed_data http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/117-hcpc-hierarchical-clustering-on-principal-components-essentials/ – Julien PETOT Dec 05 '22 at 08:39