0

I've noticed that UMAP is often used in combination with other clustering algorithms, such as K-means, DBSCAN, HDBSCAN. However, from what I've understood, UMAP can be used for clustering tasks. So why I've noticed people using it primarily as a dimensionality reduction technique?

Here an example of what I'm talking about: [https://medium.com/grabngoinfo/topic-modeling-with-deep-learning-using-python-bertopic-cf91f5676504][1]

Am I getting something wrong? Can UMAP be used for clustering tasks alone? What are the benefits of using it in combination with other clustering algorithms?

coelidonum
  • 103
  • 2
  • please consider accepting one of the answers or, alternatively, please let us know why you think they are not correct or what is not clear enough. – noe Mar 31 '23 at 12:59

2 Answers2

2

UMAP is not for clustering. It is just for dimensionality reduction.

In the link you provided, UMAP is not used for clustering, just for dimensionality reduction. Specifically, they use BERTopic, which is a topic modeling technique that relies on UMAP. BERTopic takes sentence empeddings, applies dimensionality reduction with UMAP and does clustering with HDBSCAN.

noe
  • 22,074
  • 1
  • 43
  • 70
1

As mentioned Noe, UMAP is 100% for dimensional reduction, ONLY to group similar data and separate different ones, THEN we apply a clustering algorithm such as DBSCAN to identify those groups in the projected space.

Multi-dimensional reduction is the most difficult part because you need to detect similarities and differences among plenty of variables and then project them in a lower dimensional space, generally in 2D.

PCA was initially one of the first dimensional reduction algorithm (even if it is not exactly a dimensional reduction one) but it has a problem: it is linear.

UMAP is a non-linear reduction algorithm, meaning it can detect complex correlations between features. Therefore, it uses Riemannian manifolds that are useful for representing complex, non-linear geometries that are difficult to capture with simple Euclidean distance measures.

UMAP constructs a weighted graph representation of the data, where the weights between each pair of data points are determined by a distance measure that considers the geometry of the Riemannian manifold. This graph is then optimized using a gradient-based algorithm to produce a low-dimensional embedding that preserves the high-dimensional geometric structure of the data.

More info: https://pair-code.github.io/understanding-umap/

Nicolas Martin
  • 4,509
  • 1
  • 6
  • 15