I have been thinking about this lately. Suppose we have a very complex space, which makes it hard to learn a classifier that can efficiently split it. But what if this very complex space is actually made up of a bunch of "simple" subspaces? By simple, I mean that it would be easier to learn a classifier for each subspace on its own.

In this situation, would clustering my data first, in other words finding these subspaces, help me learn a better classifier? The final classifier would essentially be an ensemble of the per-subspace classifiers.

To clarify, I don't want to use the clusters as additional features and feed them to one big classifier. I want to train a separate classifier on each cluster.

Is this something that's already been done/proven to work/proven to not work? Are there any papers on it? I've been trying to search for things like this but couldn't find anything relevant so I thought I'd ask here.

Valentin Calomme
  • The method you describe is an existing technique, used to boost the accuracy of various classifiers. I recall it being called a hybrid approach. Yes, there are papers on it; I will link one in the comments as soon as I locate it. – Rahul Aedula Oct 02 '17 at 08:55

1 Answer

Yes, this is a valid way to improve your classifier's accuracy. In fact, a "strong" enough classifier, such as a neural network, could learn these clusters by itself. However, you would need a substantially deeper network.

The "smartest" way to do this, if you know there are many groups/clusters in your data, is to perform a two-step process:

  • Cluster your data
  • Train X models, one for each of your clusters

A nice way to visualise this is the following problem: you want to build a recommendation engine for a Netflix-like application, and you don't want to build one model per person. How would you do this?

  • First find clusters of similar users (geeks, SF fans, teenagers, etc.)
  • Fit one model for each of these clusters
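
The two-step process above can be sketched as follows. This is only an illustration under stated assumptions, not a reference implementation: the synthetic data (`make_data`), the small k-means with farthest-point initialisation, the nearest-class-mean classifier (`NearestMean`) and the wrapper `ClusterThenClassify` are all hypothetical stand-ins chosen so that the example runs with the Python standard library alone. In practice you would plug in your own clustering algorithm and classifier.

```python
import random
from collections import defaultdict

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def nearest(p, centers):
    return min(range(len(centers)), key=lambda i: dist2(p, centers[i]))

def kmeans(X, k, iters=20):
    # Farthest-point initialisation (an assumption that keeps the sketch deterministic).
    centers = [X[0]]
    while len(centers) < k:
        centers.append(max(X, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        groups = defaultdict(list)
        for p in X:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(groups[i]) if groups[i] else centers[i] for i in range(k)]
    return centers

class NearestMean:
    """Deliberately simple classifier: predict the class whose mean is closest."""
    def fit(self, X, y):
        by_class = defaultdict(list)
        for p, label in zip(X, y):
            by_class[label].append(p)
        self.labels = list(by_class)
        self.means = [mean(by_class[label]) for label in self.labels]
        return self

    def predict(self, p):
        return self.labels[nearest(p, self.means)]

class ClusterThenClassify:
    """Step 1: cluster the data. Step 2: train one model per cluster."""
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.centers = kmeans(X, self.k)
        parts = defaultdict(lambda: ([], []))
        for p, label in zip(X, y):
            part = parts[nearest(p, self.centers)]
            part[0].append(p)
            part[1].append(label)
        self.models = {c: NearestMean().fit(px, py) for c, (px, py) in parts.items()}
        self.fallback = NearestMean().fit(X, y)  # used if a cluster had no training data
        return self

    def predict(self, p):
        cluster = nearest(p, self.centers)
        return self.models.get(cluster, self.fallback).predict(p)

def make_data(n, seed):
    # Two groups along feature 0; the labelling rule on feature 1 flips between them.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        right = rng.random() < 0.5
        x0 = rng.gauss(10.0 if right else 0.0, 1.0)
        x1 = rng.uniform(-3.0, 3.0)
        X.append((x0, x1))
        y.append(int(x1 < 0) if right else int(x1 > 0))
    return X, y

def accuracy(model, X, y):
    return sum(model.predict(p) == label for p, label in zip(X, y)) / len(X)

X_train, y_train = make_data(400, seed=0)
X_test, y_test = make_data(400, seed=1)
global_acc = accuracy(NearestMean().fit(X_train, y_train), X_test, y_test)
clustered_acc = accuracy(ClusterThenClassify(k=2).fit(X_train, y_train), X_test, y_test)
print(f"single model: {global_acc:.2f}, one model per cluster: {clustered_acc:.2f}")
```

The toy data is built so that each group is easy for the simple classifier while the full space is not, which is the situation described in the question. On real data the benefit depends on whether the clusters actually line up with regions where the labelling rule is simpler.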
Jonathan DEKHTIAR
  • Thanks Jonathan for your answer. I'd also be interested to know whether there are "supervised" clustering techniques. By that, I mean clustering techniques that put elements into certain groups "because" doing so improves the overall accuracy. – Valentin Calomme Oct 02 '17 at 10:57
  • That is a bit contradictory. In a supervised approach, you already know what the clusters in your dataset are and how many of them you have. So it is just a two-step learning process: a classification step to infer the cluster, then a model trained for each cluster. You can have a look at GMM (Gaussian mixture models). I don't really understand the goal of your question. – Jonathan DEKHTIAR Oct 02 '17 at 11:03
  • To make my example a bit clearer, let's say that I have access to a black-box classifier, which I can apply to any dataset but cannot modify; I have to use it as is. I guess my question would be: how do I cluster my data such that my performance improves? Training the classifier on all the data may give worse performance than training it separately on two subsets of the data. It's a model selection problem, I guess: here my hyperparameter is my clustering algorithm as a whole. – Valentin Calomme Oct 02 '17 at 11:17
  • Just design as many solutions as you can think of and compare them with cross-validation. If you want more details, please ask a new question; the topic is different. – Jonathan DEKHTIAR Oct 02 '17 at 11:27
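
The comparison suggested in the last comment, treating the way the data is partitioned as a hyperparameter and picking it by cross-validation, might look like the minimal sketch below. Everything in it is an assumption made for illustration: `black_box_fit` is a hypothetical stand-in for the unmodifiable classifier (here a nearest-class-mean model), the three candidate partitions and the threshold 5.0 are invented for the synthetic data from `make_data`, and `cv_accuracy` is a plain k-fold loop rather than any library routine.

```python
import random
from collections import defaultdict

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def black_box_fit(X, y):
    """Hypothetical black box: fits a nearest-class-mean model, returns predict(point)."""
    by_class = defaultdict(list)
    for p, label in zip(X, y):
        by_class[label].append(p)
    means = {label: mean(pts) for label, pts in by_class.items()}
    return lambda p: min(means, key=lambda label: dist2(p, means[label]))

def fit_partitioned(X, y, assign):
    """Train one black-box model per group returned by assign(point)."""
    groups = defaultdict(lambda: ([], []))
    for p, label in zip(X, y):
        group = groups[assign(p)]
        group[0].append(p)
        group[1].append(label)
    models = {key: black_box_fit(px, py) for key, (px, py) in groups.items()}
    fallback = black_box_fit(X, y)  # used for groups never seen in training
    return lambda p: models.get(assign(p), fallback)(p)

def cv_accuracy(X, y, assign, folds=5):
    """Mean held-out accuracy over a simple k-fold split."""
    scores = []
    for f in range(folds):
        train = [i for i in range(len(X)) if i % folds != f]
        test = [i for i in range(len(X)) if i % folds == f]
        predict = fit_partitioned([X[i] for i in train], [y[i] for i in train], assign)
        scores.append(sum(predict(X[i]) == y[i] for i in test) / len(test))
    return sum(scores) / folds

def make_data(n, seed):
    # Two groups along feature 0; the labelling rule on feature 1 flips between them.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        right = rng.random() < 0.5
        x0 = rng.gauss(10.0 if right else 0.0, 1.0)
        x1 = rng.uniform(-3.0, 3.0)
        X.append((x0, x1))
        y.append(int(x1 < 0) if right else int(x1 > 0))
    return X, y

# Candidate "clusterings", each a function from a point to a group id.
candidates = {
    "no split": lambda p: 0,
    "split on feature 0": lambda p: p[0] > 5.0,
    "arbitrary split": lambda p: int(p[0] * 1000) % 2,  # unrelated to the structure
}

X, y = make_data(400, seed=0)
scores = {name: cv_accuracy(X, y, assign) for name, assign in candidates.items()}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print("selected:", best)
```

The partition is evaluated only through the held-out accuracy of the models trained on it, so the same loop applies to any clustering algorithm or number of clusters you want to compare.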