I am using sklearn.cluster to cluster some text data with the MeanShift algorithm. So far I have:

  1. Done the standard NLP data prep: lemmatizing, removing stop words, etc.
  2. Used TfidfVectorizer to create my word vectors from 80k-plus records.
  3. Converted the sparse matrix returned by the vectorizer to a dense array with .toarray().
  4. Called sklearn's MeanShift with all of the default parameters. The call is meanshift = MeanShift().fit(fitted_vector_data.toarray()), and printing the model gives: MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True, min_bin_freq=1, n_jobs=1, seeds=None)
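
For reference, the steps above can be reproduced with something like the following. This is only a minimal sketch: the docs list is a made-up toy corpus standing in for my cleaned records, and n_clusters is a name I introduced here to count the distinct labels.

```python
from sklearn.cluster import MeanShift
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the 80k+ cleaned records.
docs = [
    "cat sits on mat",
    "dog chases cat",
    "dog barks at cat",
    "kitten sleeps on mat",
    "stock market falls",
    "market rallies on earnings",
    "earnings lift stock prices",
    "investors sell stock",
    "rain floods the city",
    "storm brings heavy rain",
]

# TfidfVectorizer returns a sparse matrix.
fitted_vector_data = TfidfVectorizer().fit_transform(docs)

# MeanShift is fitted on the dense array, as in step 3.
dense = fitted_vector_data.toarray()

# All defaults: bandwidth=None, so the bandwidth is estimated internally.
meanshift = MeanShift().fit(dense)

# Count the distinct cluster labels that were assigned.
n_clusters = len(set(meanshift.labels_))
print(n_clusters)
```

The number printed depends on the data and on the bandwidth that MeanShift estimates, so I am not quoting an expected value for the toy corpus.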

The problem is that no matter how much data I pass in (whether it's 10 records or 10k records), I always get exactly 1 cluster, when I would expect hundreds.

This is my first time using MeanShift, so I'm guessing the problem is in how I'm setting up my data and/or parameters. I should also point out that I have used other models, such as k-means and affinity propagation, with the same data prep, and those models gave multiple clusters.

Ethan
I_Play_With_Data
  • Choose other parameters. The defaults are not appropriate for such data (but I'd argue that it isn't an appropriate algorithm anyway). – Has QUIT--Anony-Mousse Apr 20 '19 at 07:12
  • K-means *always* returns k clusters, because k is hardwired... No surprise there: you get what you ask for, whether it is in the data or not. You also get k clusters in uniform data that *shouldn't* have any clusters... – Has QUIT--Anony-Mousse Apr 20 '19 at 07:13
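
One way to act on the first comment is to pass an explicit bandwidth instead of relying on the default. The sketch below assumes sklearn's estimate_bandwidth helper with a smaller quantile than its default of 0.3; the value 0.2, the toy docs list, and the variable names are illustrative assumptions, not a recommendation for the real data. A smaller bandwidth generally yields more clusters, but whether the result is meaningful for TF-IDF vectors is a separate question, as the comment notes.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; substitute the real cleaned records.
docs = [
    "cat sits on mat",
    "dog chases cat",
    "dog barks at cat",
    "kitten sleeps on mat",
    "stock market falls",
    "market rallies on earnings",
    "earnings lift stock prices",
    "investors sell stock",
    "rain floods the city",
    "storm brings heavy rain",
    "city prepares for storm",
    "heavy snow closes roads",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

# quantile=0.2 is an arbitrary illustrative value (the default is 0.3).
# Lower quantiles give a smaller bandwidth.
bandwidth = estimate_bandwidth(X, quantile=0.2)

# Fit MeanShift with the explicit bandwidth instead of bandwidth=None.
meanshift = MeanShift(bandwidth=bandwidth).fit(X)

n_clusters = len(set(meanshift.labels_))
print(f"bandwidth={bandwidth:.3f}, clusters={n_clusters}")
```

Trying a few quantile values and comparing the resulting cluster counts is a cheap way to see how sensitive the outcome is to the bandwidth.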