Unsupervised k-NN
Unlike k-means, unsupervised k-NN does not associate a label with each instance. All it can do is tell you which instances in your training data are the k nearest to the point you are querying.
For example:
import numpy as np
from sklearn.neighbors import NearestNeighbors
samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]
neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
neigh.fit(samples)
neigh.kneighbors([[0, 0, 1.3]], 2, return_distance=False)
array([[2, 0]])
You can see that this returned the indices of the k nearest points, not their labels.
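If you also want the distances themselves, kneighbors returns them alongside the indices when return_distance is left at its default of True. Running the same query on the fit above:

dist, ind = neigh.kneighbors([[0, 0, 1.3]], 2, return_distance=True)
print(ind)   # array([[2, 0]])   -- indices of the two nearest samples
print(dist)  # array([[0.3, 0.7]]) -- their Euclidean distances to the query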
k-means
This algorithm is completely different. Here, k denotes the number of clusters assumed to exist in your dataset. For example, if you have unlabeled pictures of red and green apples, you know that $k=2$. The algorithm then iteratively moves the centroids (the means of the cluster distributions) until they reach a stable solution.
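Formally, k-means looks for centroids $\mu_1, \dots, \mu_k$ that minimize the within-cluster sum of squared distances, alternating between an assignment step (each point joins its nearest centroid) and an update step (each centroid moves to the mean of its points):

$$\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2, \qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$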
Here is an example:
Let's first make some artificial Gaussian-distributed data.
import numpy as np
import matplotlib.pyplot as plt

# One entry per cluster: [mean, std] for each dimension
params = [[[ 0, 1], [ 0, 1]],
          [[ 5, 1], [ 5, 1]],
          [[-2, 5], [ 2, 5]],
          [[ 2, 1], [ 2, 1]],
          [[-5, 1], [-5, 1]]]

n = 300                 # points per cluster
dims = len(params[0])

data = []
y = []
for ix, i in enumerate(params):
    # Draw standard-normal samples and scale/shift them per dimension
    inst = np.random.randn(n, dims)
    for dim in range(dims):
        inst[:, dim] = params[ix][dim][0] + params[ix][dim][1] * inst[:, dim]
    label = ix + np.zeros(n)

    if len(data) == 0: data = inst
    else: data = np.append(data, inst, axis=0)
    if len(y) == 0: y = label
    else: y = np.append(y, label)

num_clusters = len(params)

print(y.shape)
print(data.shape)
(1500,)
(1500, 2)
plt.scatter(data[:,0], data[:,1])
plt.show()

The k-means algorithm from scratch
class Kmeans(object):
    def __init__(self, k=1):
        self.k = k

    def train(self, data, verbose=1):
        shape = data.shape

        # Find the min/max of each dimension so the random
        # centroids are drawn from within the data's range
        ranges = np.zeros((shape[1], 2))
        for dim in range(shape[1]):
            ranges[dim, 0] = np.min(data[:, dim])
            ranges[dim, 1] = np.max(data[:, dim])
        if verbose == 1:
            print('Ranges: ')
            print(ranges)

        # Initialize k centroids uniformly at random inside those ranges
        centroids = np.zeros((self.k, shape[1]))
        for i in range(self.k):
            for dim in range(shape[1]):
                centroids[i, dim] = np.random.uniform(ranges[dim, 0], ranges[dim, 1], 1)
        if verbose == 1:
            print('Centroids: ')
            print(centroids)
            plt.scatter(data[:, 0], data[:, 1])
            plt.scatter(centroids[:, 0], centroids[:, 1], c='r')
            plt.show()

        count = 0
        while count < 100:
            count += 1
            if verbose == 1:
                print('-----------------------------------------------')
                print('Iteration: ', count)

            # Assignment step: label each point with its nearest centroid
            distances = np.zeros((shape[0], self.k))
            for ix, i in enumerate(data):
                for ic, c in enumerate(centroids):
                    distances[ix, ic] = np.sqrt(np.sum((i - c)**2))
            labels = np.argmin(distances, axis=1)

            # Update step: move each centroid to the mean of its cluster
            new_centroids = np.zeros((self.k, shape[1]))
            for centroid in range(self.k):
                temp = data[labels == centroid]
                if len(temp) == 0:
                    return 0  # a centroid got no points; caller should retry
                for dim in range(shape[1]):
                    new_centroids[centroid, dim] = np.mean(temp[:, dim])

            if verbose == 1:
                plt.scatter(data[:, 0], data[:, 1], c=labels)
                plt.scatter(new_centroids[:, 0], new_centroids[:, 1], c='r')
                plt.show()

            # Stop once the centroids no longer move
            if np.linalg.norm(new_centroids - centroids) < np.finfo(float).eps:
                print("DONE!")
                break
            centroids = new_centroids

        self.centroids = centroids
        self.labels = labels
        if verbose == 1:
            print(labels)
            print(centroids)
        return 1

    def getAverageDistance(self, data):
        # Mean Euclidean distance from each centroid to its assigned points
        dists = np.zeros((len(self.centroids),))
        for ix, centroid in enumerate(self.centroids):
            temp = data[self.labels == ix]
            dist = 0
            for i in temp:
                dist += np.linalg.norm(i - centroid)
            dists[ix] = dist / len(temp)
        return dists

    def getLabels(self):
        return self.labels
And the results:
kmeans = Kmeans(5)
kmeans.train(data)
Ranges: 
[[-15.42553872  14.88894099]
 [-13.33192554  16.15415347]]
Centroids: 
[[-11.39200726 -10.71208054]
 [  3.73634888  -8.9230959 ]
 [  6.17589734 -10.66376228]
 [  0.78973744  -0.44245535]
 [  9.29524253   9.59127574]]
At initialization, the red points are the randomly placed centroids:

Iteration 1



And after a few iterations, the centroids have converged to the cluster means:

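Once train returns 1 (it returns 0 when a randomly initialized centroid ends up with no points, which the caller can handle by simply rerunning), the helper methods let you inspect the fit. A minimal sketch reusing the data generated above:

kmeans = Kmeans(5)
while kmeans.train(data, verbose=0) == 0:
    pass  # retry on an unlucky initialization that starved a centroid

labels = kmeans.getLabels()             # cluster index for each point
print(kmeans.getAverageDistance(data))  # mean point-to-centroid distance per cluster

plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1], c='r')
plt.show()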