Questions tagged [k-means]

k-means is a family of cluster analysis methods in which you specify the number of clusters you expect. This is as opposed to hierarchical cluster analysis methods.

442 questions
200
votes
13 answers

K-Means clustering for mixed numeric and categorical data

My data set contains a number of numeric attributes and one categorical. Say, NumericAttr1, NumericAttr2, ..., NumericAttrN, CategoricalAttr, where CategoricalAttr takes one of three possible values: CategoricalAttrValue1, CategoricalAttrValue2 or…
IgorS
  • 5,444
  • 11
  • 31
  • 43
67
votes
9 answers

Clustering geo location coordinates (lat,long pairs)

What is the right approach and clustering algorithm for geolocation clustering? I'm using the following code to cluster geolocation coordinates: import numpy as np import matplotlib.pyplot as plt from scipy.cluster.vq import kmeans2,…
rok
  • 813
  • 1
  • 7
  • 6
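
For lat/long pairs like the ones in the question above, one thing worth checking is the distance measure: plain Euclidean distance on raw degrees distorts distances away from the equator. Below is a minimal sketch of the haversine (great-circle) distance. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions for illustration, not something taken from the question.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, long) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    a = min(1.0, a)  # guard against rounding slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of latitude is roughly 111 km.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 2))
```

A distance like this could be passed to a clustering method that accepts a custom metric. Standard k-means averages coordinates directly, so it does not use such a metric as-is.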
25
votes
3 answers

K-means incoherent behaviour choosing K with Elbow method, BIC, variance explained and silhouette

I'm trying to cluster some vectors with 90 features with K-means. Since this algorithm asks me for the number of clusters, I want to validate my choice with some nice math. I expect to have from 8 to 10 clusters. The features are Z-score scaled. Elbow…
marcodena
  • 1,667
  • 4
  • 14
  • 17
20
votes
4 answers

K-means: What are some good ways to choose an efficient set of initial centroids?

When a random initialization of centroids is used, different runs of K-means produce different total SSEs, so the choice of initial centroids is crucial to the performance of the algorithm. What are some effective approaches toward solving this problem? Recent approaches are…
ngub05
  • 333
  • 1
  • 2
  • 8
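
To make the initialization issue in the question above concrete, here is a minimal sketch of k-means++ style seeding. It is one well-known approach, not necessarily one of those the question goes on to list. The helper names (`kmeans_pp_init`, `sse`) are hypothetical, and the code is a plain-Python illustration rather than any library's implementation.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """Pick k initial centroids; far-away points are more likely to be chosen."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(squared_distance(p, c) for c in centroids) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a chosen centroid: fall back to uniform.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


def sse(points, centroids):
    """Total squared distance from each point to its nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


points = [(0.0, 0.0), (0.1, 0.0), (9.9, 10.0), (10.0, 10.0)]
seeds = kmeans_pp_init(points, k=2, rng=random.Random(0))
print(seeds, sse(points, seeds))
```

Comparing `sse` across several seeds gives a quick check of how sensitive a data set is to the starting centroids.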
17
votes
2 answers

K-means vs. online K-means

K-means is a well-known algorithm for clustering, but there is also an online variation of this algorithm (online K-means). What are the pros and cons of these approaches, and when should each be preferred?
Rubens
  • 4,097
  • 5
  • 23
  • 42
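
For readers unfamiliar with the online variant mentioned above, here is a minimal sketch of a sequential (MacQueen-style) update: each incoming point moves only its nearest centroid, by a step of 1/count. The function name `online_kmeans_update` and the choice of starting counts are assumptions for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def online_kmeans_update(centroids, counts, point):
    """Update the nearest centroid in place with one new point; return its index."""
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(point, centroids[i]))
    counts[nearest] += 1
    rate = 1.0 / counts[nearest]  # running-mean step size
    centroids[nearest] = [c + rate * (x - c)
                          for c, x in zip(centroids[nearest], point)]
    return nearest


centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]  # assume each centroid already represents one point
for p in [(2.0, 2.0), (10.0, 12.0)]:
    online_kmeans_update(centroids, counts, p)
print(centroids)
```

Unlike batch k-means, this never revisits earlier points, so it needs one pass and constant memory per centroid, at the cost of depending on the order of the stream.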
14
votes
2 answers

Fast k-means like algorithm for $10^{10}$ points?

I am looking to do k-means clustering on a set of 10-dimensional points. The catch: there are $10^{10}$ points. I am looking for just the center and size of the largest clusters (let's say 10 to 100 clusters); I don't care about what cluster each…
Alex I
  • 3,142
  • 1
  • 21
  • 27
12
votes
1 answer

What are practical differences between kernel k-means and spectral clustering?

Lately I've been wondering about kernel k-means and spectral clustering algorithms and their differences. I know that spectral clustering is a broader term and different settings can affect the way it works, but one popular variant is using…
Kuba_
  • 264
  • 1
  • 10
12
votes
1 answer

How to measure the similarity between two images?

I have two groups of images, one of cats and one of dogs, and each group contains 2000 images. My goal is to cluster the images using k-means. Assume image1 is x and image2 is y. Here we need to measure the similarity between any…
jason
  • 309
  • 2
  • 4
  • 9
12
votes
2 answers

Clustering high dimensional data

TL;DR: Given a big image dataset (around 36 GiB of raw pixels) of unlabeled data, how can I cluster the images (based on the pixel values) without knowing the number of clusters K to begin with? I am currently working on an unsupervised learning…
sunside
  • 223
  • 1
  • 2
  • 8
12
votes
3 answers

How to get the probability of belonging to clusters for k-means?

I need to get the probability for each point in my data set. The idea is to compute a distance matrix (the first column contains distances to the first cluster, the second column contains distances to the second cluster, etc.). The closest point has probability =…
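
k-means itself produces hard assignments, so any probability has to come from a post-processing step. The sketch below builds the distance matrix described above and turns each row into weights with a softmax over negative distances. This is one common heuristic, shown as an assumption rather than the scheme the question describes. The names `distances_to_centroids` and `soft_memberships` and the `temperature` parameter are hypothetical.

```python
import math


def distances_to_centroids(points, centroids):
    """Row i, column j holds the Euclidean distance from point i to centroid j."""
    return [[math.dist(p, c) for c in centroids] for p in points]


def soft_memberships(distance_rows, temperature=1.0):
    """Convert each row of distances into weights that sum to 1."""
    memberships = []
    for row in distance_rows:
        shift = min(row)  # subtract the minimum for numerical stability
        weights = [math.exp(-(d - shift) / temperature) for d in row]
        total = sum(weights)
        memberships.append([w / total for w in weights])
    return memberships


points = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
for row in soft_memberships(distances_to_centroids(points, centroids)):
    print([round(w, 3) for w in row])
```

These weights are not calibrated probabilities. If real posterior probabilities are needed, a Gaussian mixture model is the usual alternative.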
11
votes
4 answers

Clustering for mixed numeric and nominal discrete data

My data includes survey responses that are binary (numeric) and nominal / categorical. All responses are discrete and at the individual level. Data is of shape (n=7219, p=105). A couple of things: I am trying to identify a clustering technique with a…
kms
  • 310
  • 1
  • 4
  • 14
10
votes
1 answer

Convergence in Hartigan-Wong k-means method and other algorithms

I have been trying to understand the different k-means clustering algorithms, mainly those implemented in the stats package of the R language. I understand Lloyd's algorithm and MacQueen's online algorithm. The way I understand them is as…
Sid
  • 101
  • 1
  • 5
10
votes
1 answer

Confused about how to apply KMeans on a dataset with features extracted

I am trying to make basic use of the scikit-learn KMeans clustering package to create different clusters that I could use to identify a certain activity. For example, in my dataset below, I have different usage events (0,...,11), and each event…
Gary
  • 529
  • 2
  • 5
  • 12
8
votes
2 answers

Image clustering by similarity measurement (CW-SSIM)

I'm trying to use scikit-learn and pyssim for clustering a set of images - less than 100. The end goal is to place the images into several buckets (clusters) according to the calculated similarity measures - CW-SSIM. The task seems to be trivial,…
Oleg Puzanov
  • 111
  • 1
  • 4
8
votes
1 answer

Bag of Visual Words

What I am trying to do: I am trying to classify some images using local and global features. What I have done so far: I have extracted SIFT descriptors for each image and I am using these as my input for k-means to create my vocabulary from all of…
Kevin
  • 261
  • 3
  • 7