0

I have data of the format [dataset_N, dataset_N+1, similarity] where two different datasets have been compared, resulting in a similarity score. A score is generated for each possible (unique) pair of datasets (~7e6 pairs).

I am trying to determine how to cluster these datasets using just the similarity scores.

Something like K-means can't be used directly because the available information (similarity scores and which pair was compared) doesn't match its inputs.

Is there an existing method that can be used for this?

Edit: reworded for clarity

Edit 2: removed references to labels, since that was causing confusion

Mandias
  • 1
  • 2
  • Show a few lines of the dataset itself and it might be easier to understand what you are asking – sconfluentus Oct 07 '22 at 03:09
  • What do you mean by different label sets? I am a little confused since you use the word label to describe your variables. Are you trying to cluster the samples? Also when you say "it interprets the labels as location coordinates," what is "it"? Are you using python, R , matlab or something else? why would k-means interpret the labels as location coordinates? Can you post your code and/or some of your dataset? Is this a biological dataset? Like @sconfluentus said, it will be easier to answer your question if you provide more detail – fractalnature Oct 07 '22 at 19:19
  • I've reworded the question. What I meant by "label sets" is that the clusters would group the dataset labels together based on dataset similarity, resulting in a set of labels per cluster. I'm using Python, though I assumed most common algorithms have been implemented in multiple languages, so that was a secondary concern. The datasets have already been compared and scores generated, so in theory they are not relevant for this part of the analysis. – Mandias Oct 08 '22 at 22:12
  • Why do you think that KNN does not work with 3 features? – liakoyras Oct 09 '22 at 17:28
  • The labels are not appropriate inputs because they just identify the dataset. Each is unique, so not appropriate for spatial clustering. – Mandias Oct 10 '22 at 18:37
  • Have you tried creating a similarity matrix and clustering that? – Andy Oct 10 '22 at 19:35
  • @Andy No, I haven't tried graphs yet. Thanks for the question since this has pointed me towards METIS / NetworkX, and weighted graph-clustering, which seems to be what I need. – Mandias Oct 10 '22 at 20:57

1 Answers1

0

Thanks to the clues from the comments and this stack exchange post outlining graph clustering and modularity scores I was able to find a suitable method. I went with the NetworkX implementation of the Louvain Communities algorithm.

Mandias
  • 1
  • 2