
I'm trying to use scikit-learn and pyssim for clustering a set of images (fewer than 100).

The end goal is to place the images into several buckets (clusters) according to the calculated similarity measure, CW-SSIM.

The task seems trivial, but I can't figure out the best way to handle "similarity-based" clustering in scikit-learn. K-Means clustering looks like a good choice, but it doesn't accept custom comparison or distance functions.

So how do I handle comparison-based (similarity-based) clustering in scikit-learn?

I was thinking about a "comparison matrix" with 1 (similar) or 0 (not similar) per cell, set according to the calculated CW-SSIM similarity values. This matrix would then be fed to K-Means clustering. But then we face a scalability issue, because such a matrix has dimensions equal to the number of images ... which might grow to 1+ million in the future.
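Here is a rough sketch of what I mean; `cw_ssim` is just a placeholder for whatever pyssim call returns the CW-SSIM score between two images, and the 0.8 threshold is an arbitrary choice:

```python
import numpy as np

def build_similarity_matrix(images, cw_ssim, threshold=0.8):
    """Pairwise matrix of 1 (similar) / 0 (not similar) built from CW-SSIM scores.

    `cw_ssim(a, b)` is assumed to return a score in [0, 1]; higher means more similar.
    """
    n = len(images)
    sim = np.eye(n)  # every image is fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = cw_ssim(images[i], images[j])
    # binarize into "similar" / "not similar" as described above
    return (sim >= threshold).astype(float)
```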

If there is an easier option in R than in Python, I'm ready to look at that as well.

Thanks in advance.

UPDATE from Jan 18, 2016

I've created some code on GitHub about this topic: https://github.com/llvll/imgcluster

The project also includes an IPython notebook with step-by-step instructions and extra comments: https://github.com/llvll/imgcluster/blob/master/ip%5By%5D/imgcluster.ipynb

Oleg Puzanov
  • If you want a really nice way of comparing images, I suggest reading [Supervised Learning of Semantics-Preserving Hashing via Deep Neural Networks for Large-Scale Image Search](http://arxiv.org/abs/1507.00101). There is also some [example code](https://github.com/kevinlin311tw/Caffe-DeepBinaryCode) – Martin Thoma Jan 12 '16 at 17:56
  • @moose, thanks for your help. I will definitely explore the neural networks for image clustering. Meanwhile, please see the latest update to the original post - I've shared my project on GitHub about this topic. – Oleg Puzanov Jan 17 '16 at 23:53

2 Answers


I would use a regular clustering algorithm and replace the objective function, which is usually the MSE, with a differentiable loss function of your choice. Another option is to learn an embedding that optimizes your similarity metric using a neural network, and then cluster in that embedding space.
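For the embedding route, the clustering step itself is just ordinary k-means once you have one feature vector per image; a minimal sketch, assuming `X` is an (n_images, n_features) array of embeddings produced by your network:

```python
import numpy as np
from sklearn.cluster import KMeans

# X: (n_images, n_features) embeddings from your network -- random placeholder here
X = np.random.rand(100, 128)

labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)
```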

If you would rather do similarity-based clustering, here are some papers:

  • A Similarity-Based Robust Clustering Method
  • A Discriminative Framework for Clustering via Similarity Functions
  • Similarity-Based Clustering by Left-Stochastic Matrix Factorization

scikit-learn implements two clustering methods that work directly from a similarity (affinity) matrix: Affinity Propagation and Spectral Clustering. Both accept `affinity='precomputed'`, so you can pass your CW-SSIM matrix instead of raw feature vectors.
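A minimal sketch of feeding a precomputed similarity matrix to both (here `sim` stands in for the symmetric CW-SSIM matrix from the question; random values are used only so the snippet runs):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, SpectralClustering

# sim: symmetric (n_images, n_images) matrix of CW-SSIM similarities (assumption)
sim = np.random.rand(50, 50)
sim = (sim + sim.T) / 2
np.fill_diagonal(sim, 1.0)

# Affinity Propagation chooses the number of clusters itself
ap_labels = AffinityPropagation(affinity='precomputed').fit_predict(sim)

# Spectral Clustering needs n_clusters, but works directly on the affinity matrix
sc_labels = SpectralClustering(n_clusters=5, affinity='precomputed').fit_predict(sim)
```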

Emre

It seems like you do not have a fixed number of centroids (clusters), so centroid-based clustering such as k-means cannot be used in your case. However, you can use density-based clustering, for example DBSCAN, which also accepts a precomputed distance matrix.
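A minimal sketch, assuming `sim` is the pairwise CW-SSIM similarity matrix described in the question; DBSCAN expects distances, so 1 - similarity is used as a simple stand-in, and `eps` would need tuning on real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# sim: symmetric (n_images, n_images) CW-SSIM similarity matrix in [0, 1] (assumption)
sim = np.random.rand(50, 50)
sim = (sim + sim.T) / 2
np.fill_diagonal(sim, 1.0)

dist = 1.0 - sim  # convert similarity to a dissimilarity
labels = DBSCAN(eps=0.3, min_samples=2, metric='precomputed').fit_predict(dist)
# label -1 marks images that DBSCAN treats as noise rather than cluster members
```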