sklearn & Meanshift for NLP only returns 1 cluster

Question

I am using sklearn.cluster and the MeanShift algorithm to work with some text data. I have:

Done all standard NLP data prep like lemmatizing, removing stop words, etc.
Used the TfidfVectorizer to create my word vectors on 80k-plus records
The vectorizer gives me a sparse matrix, so I converted it to a dense array with .toarray()
Called sklearn's MeanShift and accepted all of the default parameters. The call looks like meanshift = MeanShift().fit(fitted_vector_data.toarray()) and the fitted model prints as: MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True, min_bin_freq=1, n_jobs=1, seeds=None)
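A minimal sketch of the pipeline described above, assuming a small hypothetical `docs` list in place of the real preprocessed records. Because `bandwidth=None` makes `fit()` estimate the bandwidth itself, the sketch also calls `estimate_bandwidth` directly to show which value the defaults end up using:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real, preprocessed records.
docs = [
    "cat dog pet animal",
    "dog puppy pet leash",
    "cat kitten pet purr",
    "stock market trade price",
    "market price share stock",
    "trade share stock broker",
    "rain cloud storm weather",
    "storm wind weather rain",
    "cloud sun weather wind",
]

# TfidfVectorizer returns a scipy sparse matrix; convert it as in the question.
fitted_vector_data = TfidfVectorizer().fit_transform(docs)
X = fitted_vector_data.toarray()

# All defaults: the bandwidth is estimated from the data during fit().
meanshift = MeanShift().fit(X)

# The same estimate the default path relies on (quantile=0.3 by default).
default_bandwidth = estimate_bandwidth(X)

print("estimated bandwidth:", default_bandwidth)
print("clusters found:", len(np.unique(meanshift.labels_)))
```

The cluster count on a toy corpus like this says nothing about the real data; the point is only to make the implicitly chosen bandwidth visible.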

The problem is that no matter what data I pass in (whether it's 10 records or 10k records), it always gives me just 1 cluster when I should be getting hundreds of clusters.

This is my first time using MeanShift, so I'm guessing there is a problem with how I'm setting up my data and/or parameters? I should also point out that I have used other models, like k-means and affinity propagation, with the same data prep, and those models gave multiple clusters.

Choose other parameters. The defaults are not appropriate for such data (but I'd argue that it isn't an appropriate algorithm anyway). — Has QUIT--Anony-Mousse, Apr 20 '19 at 07:12
K-means *always* returns k clusters, because that is hardwired... No surprise there: you get what you ask for, whether it is in the data or not. You also get k clusters in uniform data that *shouldn't* have any clusters... — Has QUIT--Anony-Mousse, Apr 20 '19 at 07:13
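One way to act on the first comment is to pass `bandwidth` explicitly instead of relying on the default estimate. The sketch below assumes a hypothetical toy corpus and arbitrarily chosen bandwidth values. It relies on the fact that TfidfVectorizer L2-normalizes rows by default and produces non-negative weights, so no two documents are more than sqrt(2) apart; any bandwidth above that puts every point inside a single kernel window and collapses the result to one cluster.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# Hypothetical toy corpus; the real data would be the preprocessed records.
docs = [
    "cat dog pet animal",
    "dog puppy pet leash",
    "cat kitten pet purr",
    "stock market trade price",
    "market price share stock",
    "trade share stock broker",
    "rain cloud storm weather",
    "storm wind weather rain",
    "cloud sun weather wind",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

# Rows are unit-length and non-negative, so distances are bounded by sqrt(2).
max_dist = pairwise_distances(X).max()
print("largest pairwise distance:", max_dist)

# Sweep a few illustrative bandwidths and count the resulting clusters.
counts = {}
for bandwidth in (0.5, 1.0, 2.0):
    labels = MeanShift(bandwidth=bandwidth).fit(X).labels_
    counts[bandwidth] = len(np.unique(labels))
    print(f"bandwidth={bandwidth}: {counts[bandwidth]} cluster(s)")
```

On real data the useful range has to be found empirically, for example by inspecting the distribution of pairwise distances on a sample, and the second half of the comment still applies: a workable bandwidth may not exist for sparse, high-dimensional text vectors.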
