Great question, @gsamaras! The way you've set up this experiment makes a lot of sense to me from a design point of view, but I think there are a couple of aspects you can still examine.
First, it's possible that uninformative features are distracting your classifier, leading to poorer results. In text analytics, we often talk about stop word filtering, which is just the process of removing such text (e.g., the, and, or, etc.). There are standard stop word lists you can easily find online (e.g., this one), but they can sometimes be heavy-handed. The best approach is to build a table relating feature frequency to class, as this will surface domain-specific features that you won't likely find in such look-up tables. The evidence for the efficacy of stop word removal in the literature is mixed, but I think those findings are mostly classifier-specific (for example, support vector machines tend to be less affected by uninformative features than a naive Bayes classifier is; I suspect k-means falls into the latter category).
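For a concrete starting point, here's a rough sketch (Python, scikit-learn) of both ideas: filtering with the built-in English stop word list, and tabulating term frequency per class to spot domain-specific noise words. The `docs` and `labels` variables are placeholder data for illustration, not anything from your setup.

```python
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents and labels, purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks rallied and the market rose",
]
labels = ["pets", "pets", "finance"]

# Option 1: scikit-learn's built-in English stop word list (heavy-handed, as noted above).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
print(X.shape, vectorizer.get_feature_names_out())

# Option 2: a per-class term frequency table; terms that are frequent in
# every class are candidates for a domain-specific stop word list.
per_class = defaultdict(Counter)
for doc, label in zip(docs, labels):
    per_class[label].update(doc.split())

for label, counts in per_class.items():
    print(label, counts.most_common(5))
```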
Second, you might consider a different feature modeling approach rather than tf-idf. Nothing against tf-idf (it works fine for many problems), but I like to start with binary feature modeling unless I have experimental evidence showing that a more complex approach leads to better results. That said, it's possible that k-means could respond strangely to the switch from a floating-point feature space to a binary one. It's certainly an easily testable hypothesis!
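Here's a quick sketch of what that comparison might look like with scikit-learn; again, `docs` is placeholder data and the cluster count is arbitrary, so treat this as a template rather than a recipe.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Placeholder documents, purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks rallied and the market rose",
]

tfidf_X = TfidfVectorizer().fit_transform(docs)               # floating-point weights
binary_X = CountVectorizer(binary=True).fit_transform(docs)   # 0/1 presence features

# Fit k-means on each representation and compare the resulting assignments.
for name, X in [("tf-idf", tfidf_X), ("binary", binary_X)]:
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(name, km.labels_)
```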
Finally, you might look at the expected class distribution in your data set. Are all classes equally likely? If not, you may get better results from either a sampling approach or a different distance metric. k-means is known to respond poorly to skewed class distributions, so this is something to consider as well! There is probably research available in your specific domain describing how others have handled this situation.
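If you want to eyeball the skew, something like the following rough sketch would do it; the labels and the naive undersampling step are purely illustrative assumptions, not your data.

```python
import random
from collections import Counter

# Placeholder labels with a deliberate 90/10 skew, purely for illustration.
labels = ["a"] * 90 + ["b"] * 10
docs = [f"doc {i}" for i in range(len(labels))]

counts = Counter(labels)
print(counts)  # Counter({'a': 90, 'b': 10}) -> heavily skewed

# Naive fix: undersample every class down to the size of the smallest one.
smallest = min(counts.values())
keep = []
for cls in counts:
    idx = [i for i, y in enumerate(labels) if y == cls]
    keep.extend(random.sample(idx, smallest))

balanced_docs = [docs[i] for i in keep]
print(Counter(labels[i] for i in keep))  # now balanced
```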