What's the fastest clustering package in Python?

Question

I'd like to perform clustering analysis on a dataset with 1,300 columns and 500,000 rows.

I've seen that clustering algorithms are available in SciKit-Learn. But I'm worried that the algorithms will be inefficient on a dataset of this size.

Is SciKit-Learn slow, and, if it is, what's the best (fastest) clustering package available in Python?

score 2 · Accepted Answer · answered Mar 09 '23 at 09:11


Depending on your platform, processor, and memory, you may want to check out the Intel Extension for Scikit-learn: https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html. Some of its clustering algorithms are highly optimized.
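A minimal sketch of how the extension is typically switched on, assuming the scikit-learn-intelex package is installed; the synthetic data, its size, and the KMeans parameters are placeholders chosen for illustration, not recommendations. If the package is missing, the snippet falls back to stock scikit-learn.

```python
# Optional acceleration: patch scikit-learn if the Intel extension is available.
try:
    from sklearnex import patch_sklearn  # shipped by the scikit-learn-intelex package
    patch_sklearn()  # call this before importing the estimators below
    accelerated = True
except ImportError:
    accelerated = False  # fall back to stock scikit-learn

import numpy as np
from sklearn.cluster import KMeans

# Small synthetic stand-in for the real data (assumption: dense float features).
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 50)).astype(np.float32)

model = KMeans(n_clusters=8, n_init=1, random_state=0).fit(X)
print("accelerated:", accelerated, "inertia:", model.inertia_)
```

The rest of the code is unchanged scikit-learn, so the same script can be timed with and without the patch.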


brewmaster321


Excellent! Thank you, what do you think of the standard SciKit-Learn library? Is it good enough without the optimisations? – Connor Mar 09 '23 at 09:20

Scikit-learn has been around for a long time, is robust, and is widely used. However, many clustering algorithms are by their nature not blazingly fast, since they may need to measure the distance between every instance and every cluster centroid on every pass. There is also FAISS (https://towardsdatascience.com/how-to-speed-up-your-k-means-clustering-by-up-to-10x-over-scikit-learn-5aec980ebb72) and the k-means optimizations in scikit-learn 0.23 (https://scikit-learn.fondation-inria.fr/implementing-a-faster-kmeans-in-scikit-learn-0-23/), though I haven't used either of these directly. – brewmaster321 Mar 09 '23 at 12:26

nammerkage · Answer 2 · 2023-03-20T13:27:46.467

It should be fairly easy to assess the computational requirements: just try it out, without worrying about the accuracy of the model.
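One way to "just try it out" is to time a fit on growing subsamples and see how the runtime scales before committing to the full 500,000 rows. The sketch below is only an illustration: it uses synthetic data, KMeans as an arbitrary stand-in algorithm, and a reduced feature count so that it finishes quickly.

```python
import time

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_features = 100  # placeholder; the dataset in the question has 1,300 columns

timings = {}
for n_rows in (1_000, 2_000, 4_000):
    # Synthetic data of the given size; swap in a random subsample of the real data.
    X = rng.normal(size=(n_rows, n_features)).astype(np.float32)
    start = time.perf_counter()
    KMeans(n_clusters=10, n_init=1, random_state=0).fit(X)
    timings[n_rows] = time.perf_counter() - start

for n_rows, seconds in timings.items():
    print(f"{n_rows:>6} rows: {seconds:.3f} s")
```

If the time roughly doubles when the row count doubles, extrapolating to the full dataset is straightforward; a much steeper growth suggests the algorithm will not scale to 500,000 rows as is.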

I don't know whether these packages come with the specific clustering algorithm you are looking for, but you can implement a very fast clustering method by running your Python code on the GPU, provided you structure the code so that it is parallelizable. This can be done with packages such as Numba, CuPy, PyTorch or PyCUDA.
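As a rough sketch of what "parallelizable" can mean here, the k-means loop below is written purely as array operations (matrix products and reductions), so the same code can run on NumPy or, if CuPy and a usable GPU are present, on the GPU. The function name kmeans, the random initialization, and the fixed iteration count are illustrative assumptions, not a tuned implementation.

```python
import numpy as np

try:
    import cupy as xp  # GPU arrays with a NumPy-like API (optional)
    xp.cuda.runtime.getDeviceCount()  # raises if no usable GPU is present
except Exception:
    xp = np  # CPU fallback

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd iterations expressed only as array operations (hypothetical helper)."""
    X = xp.asarray(X, dtype=xp.float32)
    # Pick k distinct rows as the initial centroids.
    idx = np.random.default_rng(seed).choice(X.shape[0], size=k, replace=False)
    centroids = X[xp.asarray(idx)]
    x_sq = (X * X).sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Squared distances via ||x||^2 - 2 x.c + ||c||^2, one matrix product per pass.
        c_sq = (centroids * centroids).sum(axis=1)
        dist = x_sq - 2.0 * (X @ centroids.T) + c_sq[None, :]
        labels = dist.argmin(axis=1)
        # Centroid update as a matrix product with a one-hot assignment matrix.
        onehot = (labels[:, None] == xp.arange(k)[None, :]).astype(X.dtype)
        counts = onehot.sum(axis=0)
        means = (onehot.T @ X) / xp.maximum(counts, 1)[:, None]
        # Keep the old centroid if a cluster ended up empty.
        centroids = xp.where(counts[:, None] > 0, means, centroids)
    return labels, centroids

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 1.0, size=(500, 16)),
                  rng.normal(10.0, 1.0, size=(500, 16))])
labels, centroids = kmeans(data, k=2)
# Copy results back to host memory when running on CuPy.
labels_host = labels.get() if hasattr(labels, "get") else labels
print(np.bincount(np.asarray(labels_host), minlength=2))
```

The one-hot matrix costs n x k memory, which is fine for a small number of clusters; for large k a scatter-add style update would be the more economical choice.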

score 1 · Answer 3 · answered Mar 18 '23 at 12:28


I would go with HDBSCAN, a hierarchical version of the DBSCAN algorithm. It is not necessarily easy to install, so you might want to go with the sklearn DBSCAN implementation instead.
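For reference, a minimal sketch of the sklearn DBSCAN route on synthetic data; the eps and min_samples values are placeholders that would need tuning on real data, especially with 1,300 columns.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense synthetic blobs standing in for real data.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(200, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(200, 2)),
])

# n_jobs=-1 parallelizes the neighbor search across all CPU cores.
labels = DBSCAN(eps=0.5, min_samples=5, n_jobs=-1).fit_predict(X)

# DBSCAN marks noise points with the label -1.
n_clusters = len(set(labels.tolist()) - {-1})
print("clusters found:", n_clusters, "noise points:", int((labels == -1).sum()))
```

scikit-learn 1.3 and later also include sklearn.cluster.HDBSCAN, which avoids the separate install if a recent version is available.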


Lucas Morin


How fast is sklearn's DBSCAN? I've heard of hierarchical clustering, how does it compare to spectral clustering? – Connor Mar 18 '23 at 12:51
