What's the difference between data classification and clustering (from a Data point of view)

Question

What are the differences and the similarities between data classification (using dedicated distance-based methods) and data clustering (which has certain defined methods such as k-means)?

Is data classification a sub-topic of data clustering?

score 5 · Accepted Answer · answered Dec 27 '20 at 09:59

Classification is a problem where each element of your input data has two parts:

- Some data features that reflect the traits of an entity
- A label that assigns the entity to a group or class

With that kind of data, you can train a model that receives the data features (first part) and generates the label (second part). This kind of training, in which a system learns to produce a given output when it receives a given input, is called "supervised learning".
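As a rough illustration of this train-then-predict workflow, here is a minimal pure-Python sketch of one distance-based classifier (nearest centroid). The toy data, the labels and the function names `fit_centroids` and `predict` are made up for the example and are not taken from any particular library.

```python
from math import dist

def fit_centroids(features, labels):
    """Training: compute the mean feature vector (centroid) of each labelled class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Prediction: return the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: dist(centroids[y], x))

# Hypothetical toy data: (height_cm, weight_kg) features with a made-up label.
features = [(150, 50), (155, 55), (180, 85), (185, 90)]
labels = ["small", "small", "large", "large"]

centroids = fit_centroids(features, labels)
print(predict(centroids, (152, 52)))  # small
print(predict(centroids, (183, 88)))  # large
```

The labels are what define the groups here: without the `labels` list, `fit_centroids` would have nothing to learn from.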

On the other hand, in clustering, your dataset only has the data features, that is, it does not have the labels. Clustering methods allow you to group the entities without having any labels, normally by defining a priori how many groups you want and then grouping the entities by their similarity. This kind of training, where there are no labels and you have to learn just from the entity data features, is called "unsupervised learning".
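For comparison, a minimal pure-Python sketch of k-means, which only ever sees the features. The toy points, the function name `kmeans` and the choice to initialise from the first k points are assumptions made for the example; real implementations usually use a smarter initialisation.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    # Assumption for reproducibility: start from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(centroids[c], p)) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical toy data: (height_cm, weight_kg), with no labels at all.
points = [(150, 50), (155, 55), (180, 85), (185, 90)]
assignment, centroids = kmeans(points, k=2)
print(assignment)  # [0, 0, 1, 1]
```

The only thing chosen in advance is the number of groups, `k`. The cluster ids 0 and 1 are arbitrary: nothing in the output says what each group means.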

noe


Thanks for the reply. So clustering has the same objective as classification, right? Except that classification is more precise about the data. Also, in this case clustering could come first, before classifying the data, right? – Sam Dec 27 '20 at 10:22
Yes, both try to assign data to groups. The difference is that for classification you can actually know how well your classifier works (because you have labelled data to train and test against), whereas for clustering you cannot. – noe Dec 27 '20 at 10:33
Is data classification a sub-topic of data clustering? This question seems interesting to me. Imagine a binary classification task: if you apply any clustering algorithm to this data (X features only) with the number of clusters = 2, you are essentially trying to separate both classes, but without taking the y label into account. So in practical terms, for this specific case, classification is indeed a sub-topic of clustering. – Multivac Dec 28 '20 at 15:56
Academically classification refers to data that has labels present. – Prometheus Dec 30 '20 at 16:07

Erwan · Answer 2 · 2020-12-27T11:15:06.040

[Note: essentially my answer is the same as @ncasas, just an alternative phrasing]

Classification belongs to supervised learning whereas clustering belongs to unsupervised learning:

In supervised learning there is a training stage during which some instances (examples) are provided together with their answer (the target). During training, the model "studies" all the examples in the training data (represented with features) in order to be able to find the target from the features. After it has been trained, the model can be applied to new instances and use their features to predict their target. In short, the main characteristics of supervised learning are:
- The goal is to predict a specific piece of information defined from the start (the target).
- It requires some training data: features and answers for a large set of instances.

In unsupervised learning the goal is to discover the patterns within the data. There is no predefined target and no training stage (thus no need for annotated data). Unsupervised learning can only do general tasks based on comparing instances, such as clustering (grouping similar instances together) or ranking (ordering instances relative to each other).

This is the fundamental difference between classification and clustering. Based on this understanding:

What's the difference between data classification and clustering (from a Data point of view)

From a strict data point of view, the difference is the requirement for annotated data in classification. There is no such requirement for clustering.

Is data classification a sub topic of data clustering ?

No, because they belong to different families of ML which have different goals.

Example:

- In spam classification (supervised task) a model is trained with some documents (usually emails) labelled as spam or not spam. The resulting model can predict whether a new document is spam or not.
- In topic modelling (unsupervised task) a model groups semantically similar documents together, based on the words they contain.

The first task separates documents into classes, but these classes are predefined: here spam vs. non-spam. The model uses features specifically as indicators for this goal. It would use features in a completely different way if the classes were news vs. entertainment, business vs. personal, or sci-fi vs. romance. Hence the term supervised learning: the model focuses on what it is told (trained) to focus on.
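To make the point about features being used differently more concrete, here is a small sketch that labels the same four documents in two different ways and checks which feature separates the classes best. Comparing per-class feature means is only a crude stand-in for what a trained classifier learns, and the word counts, labels and the helper `most_informative` are all invented for the illustration.

```python
def centroid(rows):
    """Mean of each feature column."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def most_informative(features, labels, names):
    """Return the feature whose class means differ the most (two classes assumed)."""
    a, b = sorted(set(labels))
    mean_a = centroid([x for x, y in zip(features, labels) if y == a])
    mean_b = centroid([x for x, y in zip(features, labels) if y == b])
    gaps = [abs(p - q) for p, q in zip(mean_a, mean_b)]
    return names[gaps.index(max(gaps))]

# Made-up word counts per document for the words "free" and "invoice".
names = ["free", "invoice"]
docs = [(5, 4), (4, 0), (0, 5), (1, 0)]

# Two hypothetical labellings of the same four documents.
spam_labels = ["spam", "spam", "ham", "ham"]
topic_labels = ["business", "personal", "business", "personal"]

print(most_informative(docs, spam_labels, names))   # free
print(most_informative(docs, topic_labels, names))  # invoice
```

The features are identical in both calls; only the labels change, and with them the feature that matters.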

Topic modelling separates documents into several clusters, but even if we assume exactly two clusters, these are extremely unlikely to correspond to spam vs. non-spam (or news vs. entertainment, etc.). A clustering algorithm follows a neutral similarity method which uses the features indiscriminately. The main outcome is the clusters themselves, which represent previously unknown patterns in the data. For example, applying topic modelling to a large collection of documents may reveal the main categories of documents: the new knowledge is the existence of these groups. Clustering is unsupervised because it doesn't follow a predetermined goal.

I am not talking about learning and prediction (nor even machine learning in my question) but rather the process defining each of classification and clustering data — Sam, Dec 27 '20 at 10:27
@Sam I added some details in my answer. It's necessary to understand the difference between supervised and unsupervised learning in order to understand why classification and clustering are fundamentally different. — Erwan, Dec 27 '20 at 11:17

score 0 · Answer 3 · answered Dec 29 '20 at 13:54

Just to put together the good answers and comments, and to answer more explicitly the part of the question about classification being a subtopic of clustering.

As pointed out by @ncasas, from the point of view of the data, classification requires labeled data to train the model (supervised learning), while clustering can make use of unlabeled data (unsupervised learning).

You can actually take a labeled dataset and apply a clustering algorithm to it (you would simply discard the information contained in the labels). This would indeed partition the samples into groups, as a classification algorithm would. However, the result is not guaranteed to be the same, or even similar, even with the same number of groups. This is because clustering algorithms try to build groups of samples that are similar to each other and different from the samples in other groups, while classification algorithms try to minimize some measure of misclassification (how different the proposed partition is from the one given by the labels).

As a simple example, imagine a dataset of face images of two individuals with different facial expressions (e.g. sad and happy). If you have labels for the facial expression, you can train a classifier, which will try to reproduce the sad/happy label of each image as well as possible. If you instead cluster the same data (without labels) using k=2 clusters, you might find that the two clusters correspond to the two individuals (since images of the same face tend to be very similar).
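The face example can be imitated with made-up numbers. In the sketch below each "image" is reduced to two hypothetical features: one that differs a lot between the two people and one that differs a little between expressions. The classifier is a simple nearest-centroid rule and the clustering is a bare-bones k-means started from the first and last image; both choices are assumptions made to keep the example short.

```python
from math import dist

def mean(rows):
    """Mean of each feature column."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def nearest(centroids, x):
    """Return the key of the centroid closest to x."""
    return min(centroids, key=lambda key: dist(centroids[key], x))

# Feature 0 encodes who the person is (large differences),
# feature 1 encodes the expression (small differences).
images = [(0.0, 0.0), (0.0, 1.0), (0.2, 0.1), (0.2, 0.9),      # person A
          (10.0, 0.0), (10.0, 1.0), (10.2, 0.1), (10.2, 0.9)]  # person B
expression = ["sad", "happy", "sad", "happy", "sad", "happy", "sad", "happy"]

# Classification: one centroid per label, so the labels decide the groups.
class_centroids = {
    lab: mean([x for x, y in zip(images, expression) if y == lab])
    for lab in set(expression)
}
predicted = [nearest(class_centroids, x) for x in images]

# Clustering (k-means, k=2, labels ignored). Assumption: the two starting
# centroids are the first and the last image.
cluster_centroids = {0: list(images[0]), 1: list(images[-1])}
for _ in range(10):
    clusters = [nearest(cluster_centroids, x) for x in images]
    for c in cluster_centroids:
        members = [x for x, a in zip(images, clusters) if a == c]
        if members:
            cluster_centroids[c] = mean(members)

print(predicted)  # matches the sad/happy labels
print(clusters)   # [0, 0, 0, 0, 1, 1, 1, 1], i.e. person A vs. person B
```

Note that k-means depends on its starting centroids, so a different initialisation can settle on a different partition. With this one, the two clusters follow the dominant similarity in the data (the person), not the expression.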

Without entering into a debate about what constitutes a "subtopic", I wanted to point out that clustering and classification actually differ in their objectives.
