3

My data are places for which I know all pairwise travel times (='distances'), and I want to cluster those places minimising the total pairwise travel time inside a cluster.

  • K-means can't be used because it's centroid-based and the 'distance' is a duration that is provided, not computed thanks to coordinates
  • DBSCAN can't be used because it excludes outliners and I want to include each places in a cluster (not 100% sure for this one)

Bonus: a Java library would be much appreciated

ThCollignon
  • 133
  • 3
  • have you tried hierarchical clustering? I'm not a java programmer but in python, you can input the distance matrix consisting of pair-wise distances. – jtitusj Oct 29 '20 at 15:23

1 Answers1

3

Yes, this is a classic Graph-Based Clustering problem in which each location is a node and you have the distances between them. Forgetting about concept of graphs and graph-based algorithm which might be complicating, I directly jump to your answer.

The most well-known algorithm is Spectral Clustering. There are tones of tutorial out there and it is well implemented in all programming languages including Java.

I briefly explain but do not Panic if you are not familiar with Math terms. They are pretty intuitive and easy. You just need to follow a good tutorial.

  1. Calculate the Similarity Matrix (often referred to as Affinity Matrix in literature). As you have distance matrix you have several ways to do it. Simplest starts from saying $sim = \frac{1}{dist}$ till more sophisticated ones which use Gaussian Kernels for calculating similarity.
  2. Calculate the Laplacian Matrix out of similarity matrix.
  3. Calculate eigenvectors and eigenvalues in order to embedd your data points into their eigen-space (like exactly what we do in PCA)
  4. Use a simple clustering algorithm like K-Means to cluster points in that space
Kasra Manshaei
  • 6,502
  • 1
  • 20
  • 45