I'm facing an issue where I have a massive amount of data that I need to cluster. As we know, clustering algorithms can have very high asymptotic (big-O) complexity, and I'm looking for ways to reduce my algorithm's running time.
I want to try a few different approaches, such as pre-clustering (canopy clustering), subspace clustering, correlation clustering, etc.
However, there's one thing I haven't heard much about, and I wonder why: is it viable to simply take a representative sample of my dataset, run the clustering on that sample, and then generalize the resulting model to the whole dataset? Why is, or isn't, this a viable approach?
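For concreteness, here is roughly what I have in mind (just a minimal sketch, assuming scikit-learn's KMeans as the clusterer and a plain uniform random sample; the dataset, sample size, and number of clusters are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 10))   # stand-in for my full (much larger) dataset

# Draw a "representative" sample -- here just a uniform random subset
sample_idx = rng.choice(len(X), size=10_000, replace=False)
X_sample = X[sample_idx]

# Cluster only the sample
model = KMeans(n_clusters=8, random_state=0).fit(X_sample)

# Generalize to the full dataset: assign every point to its nearest centroid
labels_full = model.predict(X)
```

Thank you!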