I have generated a dataset of pairwise distances as follows:

id_1 id_2 dist_12
id_2 id_3 dist_23

I want to cluster this data to identify patterns in it. I have been looking at spectral clustering and DBSCAN, but I have not been able to decide between them, and I am unsure how to use the existing implementations of these algorithms. So far I have looked at Python and Java implementations.

Could anyone point me to a tutorial or demo on how to use these clustering algorithms for the situation at hand?

DaL
t_m
I just added an answer assuming that you want to cluster the samples `id_1`...`id_n` *based on their distances*. If you do want to cluster *the distances themselves*, you just need to use them as a 1-dimensional array. – logc Jul 08 '14 at 09:28
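For the second case in the comment above (clustering the distance values themselves), here is a minimal sketch. The distance values, `eps` and `min_samples` are made up for illustration; scikit-learn expects a 2-D input, so the 1-dimensional array is reshaped into a single-feature column first.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical distance values (the third column of the pairwise file).
distances = np.array([0.10, 0.12, 0.11, 0.95, 1.00, 0.98])

# scikit-learn estimators expect shape (n_samples, n_features),
# so the 1-D vector becomes a column with one feature.
X = distances.reshape(-1, 1)

# eps and min_samples are illustrative and need tuning on real data.
labels = DBSCAN(eps=0.06, min_samples=2).fit_predict(X)
print(labels)
```

Each label then refers to a row of the distance file (a pair of ids), not to an individual id.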

1 Answer

In the scikit-learn implementations of spectral clustering and DBSCAN, you do not need to precompute the distances; you can pass in the sample coordinates for all `id_1` ... `id_n` directly. Here is a simplified version of the documented example that compares clustering algorithms:

```python
import numpy as np
from sklearn import cluster
from sklearn.preprocessing import StandardScaler

## Prepare the data
X = np.random.rand(1500, 2)
# When reading from a file of the form: `id_n coord_x coord_y`
# you will need this call instead:
# X = np.loadtxt('coords.csv', usecols=(1, 2))
X = StandardScaler().fit_transform(X)

## Instantiate the algorithms
spectral = cluster.SpectralClustering(n_clusters=2,
                                      eigen_solver='arpack',
                                      affinity="nearest_neighbors")
dbscan = cluster.DBSCAN(eps=.2)

## Use the algorithms
spectral_labels = spectral.fit_predict(X)
dbscan_labels = dbscan.fit_predict(X)
```
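
If only the pairwise distances are available, as in the question, both estimators can also take a precomputed matrix: `DBSCAN` with `metric="precomputed"` expects a distance matrix, while `SpectralClustering` with `affinity="precomputed"` expects a similarity matrix, so the distances have to be converted first. Below is a minimal sketch, assuming every pair of ids appears in the list. The toy values, the Gaussian kernel and `sigma` are my own assumptions for illustration, not something taken from the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN, SpectralClustering

# Hypothetical rows in the question's `id_a id_b dist` format:
# two tight groups of three ids each.
pairs = [
    ("id_1", "id_2", 0.10), ("id_1", "id_3", 0.20), ("id_2", "id_3", 0.15),
    ("id_4", "id_5", 0.10), ("id_4", "id_6", 0.20), ("id_5", "id_6", 0.10),
]
# Cross-group pairs are far apart in this toy example.
pairs += [(a, b, 3.0)
          for a in ("id_1", "id_2", "id_3")
          for b in ("id_4", "id_5", "id_6")]

# Map each id to a row/column index of the square matrix.
ids = sorted({name for a, b, _ in pairs for name in (a, b)})
index = {name: k for k, name in enumerate(ids)}

# Symmetric distance matrix with zeros on the diagonal.
D = np.zeros((len(ids), len(ids)))
for a, b, d in pairs:
    D[index[a], index[b]] = D[index[b], index[a]] = d

# DBSCAN consumes the distances directly; eps and min_samples are illustrative.
dbscan_labels = DBSCAN(eps=0.3, min_samples=2,
                       metric="precomputed").fit_predict(D)

# Spectral clustering needs similarities, so convert the distances with a
# Gaussian kernel (sigma is an assumed bandwidth to tune).
sigma = 1.0
A = np.exp(-D ** 2 / (2.0 * sigma ** 2))
spectral_labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                     random_state=0).fit_predict(A)

for name, d_lab, s_lab in zip(ids, dbscan_labels, spectral_labels):
    print(name, d_lab, s_lab)
```

If some pairs are missing from the file, the zeros left in `D` would be read as "identical points", so those entries need an explicit fill value chosen for the data at hand.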
logc