2

I'd like to perform clustering analysis on a dataset with 1,300 columns and 500,000 rows.

I've seen that clustering algorithms are available in SciKit-Learn. But I'm worried that the algorithms will be inefficient on a dataset of this size.

Is SciKit-Learn slow, and, if it is, what's the best (fastest) clustering package available in Python?

Connor

3 Answers

2

Depending on your platform, processor, memory, etc., you may want to check out the Intel Extension for Scikit-learn: https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html. Some of its clustering algorithms are highly optimized.
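As a rough sketch of how this is typically wired in: the extension is distributed as the `scikit-learn-intelex` package and exposes `patch_sklearn()`, which has to be called before the scikit-learn estimators are imported. The random data, cluster count, and fallback to stock scikit-learn below are illustrative assumptions, not part of the original answer.

```python
# Try to enable the Intel extension; fall back to stock scikit-learn if it
# is not installed. patch_sklearn() must run before importing estimators.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
    patched = True
except ImportError:
    patched = False

import numpy as np
from sklearn.cluster import KMeans

# Small synthetic stand-in for the real data (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.standard_normal((2_000, 50)).astype(np.float32)

km = KMeans(n_clusters=4, n_init=3, random_state=0).fit(X)
print("patched:", patched, "inertia:", km.inertia_)
```

Whether the patched estimators are actually faster depends on the hardware and the algorithm, so it is worth timing both variants on a sample of your own data.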

brewmaster321
  • Excellent! Thank you, what do you think of the standard SciKit-Learn library? Is it good enough without the optimisations? – Connor Mar 09 '23 at 09:20
  • Scikit-learn has been around for a long time, is robust, and is widely used. However, many clustering algorithms are by nature not blazingly fast, since they may need to measure the distance between every instance and every cluster centroid on every pass. There is also FAISS (https://towardsdatascience.com/how-to-speed-up-your-k-means-clustering-by-up-to-10x-over-scikit-learn-5aec980ebb72) and the scikit-learn optimizations described at https://scikit-learn.fondation-inria.fr/implementing-a-faster-kmeans-in-scikit-learn-0-23/, though I haven't used either of these directly. – brewmaster321 Mar 09 '23 at 12:26
2

It should be fairly easy to assess the computational requirements: just try it out, without worrying about the accuracy of the model.
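One way to "just try it out" is to time a fit on growing subsamples and see how the cost scales before committing to all 500,000 rows. The sketch below uses random data with 1,300 columns as a stand-in for the real dataset, and `MiniBatchKMeans` with an arbitrary cluster count purely as an example estimator; the helper name `time_fit` is hypothetical.

```python
import time

import numpy as np
from sklearn.cluster import MiniBatchKMeans


def time_fit(n_rows, n_cols=1300, n_clusters=8, seed=0):
    """Fit MiniBatchKMeans on random data and return the elapsed seconds."""
    rng = np.random.default_rng(seed)
    # Synthetic stand-in; replace with a random sample of the real rows.
    X = rng.standard_normal((n_rows, n_cols)).astype(np.float32)
    model = MiniBatchKMeans(
        n_clusters=n_clusters, batch_size=1024, n_init=3, random_state=seed
    )
    start = time.perf_counter()
    model.fit(X)
    return time.perf_counter() - start


# Double the sample size each time to see how the runtime grows.
timings = {n: time_fit(n) for n in (1_000, 2_000, 4_000)}
for n, seconds in timings.items():
    print(f"{n:>6} rows: {seconds:.2f} s")
```

If the timings grow roughly linearly with the number of rows, extrapolating to the full dataset gives a usable first estimate; a steeper curve suggests the algorithm may not scale to the full data.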

I don't know whether the packages below come with the specific clustering algorithm you are looking for, but you can implement a very fast clustering method by running your Python code on the GPU, provided you structure the code to be parallelizable. This can be done with packages such as Numba, CuPy, PyTorch, or PyCUDA.
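To illustrate what "parallelizable" means here, below is a minimal k-means sketch written with whole-array NumPy operations, so that the distance computation is a single matrix product rather than a Python loop over rows. It runs on the CPU as written. Array libraries such as CuPy mirror much of the NumPy API, so a port to the GPU may be fairly direct, but that is an assumption and has not been tested here. The function names `assign` and `kmeans` are hypothetical.

```python
import numpy as np


def assign(X, centroids):
    """Return the index of the nearest centroid for every row of X."""
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2, computed for all
    # (row, centroid) pairs at once.
    d2 = (
        (X ** 2).sum(axis=1, keepdims=True)
        - 2.0 * X @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)


def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd iterations with randomly chosen initial centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return assign(X, centroids), centroids


# Two synthetic blobs as toy input (illustrative data only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 5)), rng.normal(10, 0.1, (50, 5))])
labels, centroids = kmeans(X, k=2)
print(labels.shape, centroids.shape)
```

The pairwise-distance matrix has one entry per (row, centroid) pair, so for very large inputs it would need to be computed in batches to fit in memory.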

nammerkage
1

I would go with HDBSCAN, a hierarchical version of the DBSCAN algorithm. It is not necessarily easy to install, so you might want to go with the sklearn DBSCAN implementation instead.
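For reference, a minimal sketch of the sklearn DBSCAN route on toy data. The blob layout and the `eps` and `min_samples` values are illustrative assumptions; on a real dataset with 1,300 columns they would need tuning, and it is worth trying a subsample first.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three well-separated 2-D blobs as toy input (illustrative data only).
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
labels = db.labels_

# DBSCAN marks noise points with the label -1; they are not a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print("clusters:", n_clusters, "noise points:", n_noise)
```

Unlike k-means, DBSCAN does not take the number of clusters as input; it is determined by the density parameters.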

Lucas Morin
  • How fast is sklearn's DBSCAN? I've heard of hierarchical clustering, how does it compare to spectral clustering? – Connor Mar 18 '23 at 12:51