
Hi, I am currently trying to apply various algorithms to a classification problem, to assess which perform best and then fine-tune the best candidates from this first pass. I am a beginner, so I use Weka for now. I have a basic understanding of ML concepts but am not yet familiar with the details of the algorithms.

I observed that on my problem, RBF networks performed vastly worse than IBk and the other k-based methods.

From what I read about RBF networks, "it implements a normalized Gaussian radial basis function network. It uses the k-means clustering algorithm to provide the basis functions and learns either a logistic regression (discrete class problems) or linear regression (numeric class problems) on top of that. Symmetric multivariate Gaussians are fit to the data from each cluster. If the class is nominal it uses the given number of clusters per class. It standardizes all numeric attributes to zero mean and unit variance."

So basically, it also uses k-means as a first step. But for some reason, I get the worst results with it on my metric (ROC area), while the k-based methods are among the best. Can I deduce something important about my data from that, for instance that it does not have a Gaussian distribution, or that it is not suited to logistic regression, or something else I can't figure out?
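To make the comparison concrete, here is a minimal sketch (in scikit-learn rather than Weka, and on a synthetic dataset rather than your real data) of the kind of RBF network that description outlines: k-means picks the centers, Gaussian basis functions turn each point into cluster-similarity features, and a logistic regression is fit on top. All parameter values here (cluster count, gamma, k for k-NN) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Standardize to zero mean / unit variance, as Weka's RBFNetwork does.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Step 1: k-means provides the basis-function centers.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_tr_s)

def rbf_features(X, centers, gamma=0.5):
    # Gaussian similarity of each sample to each center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

# Step 2: logistic regression on top of the basis functions.
lr = LogisticRegression().fit(rbf_features(X_tr_s, km.cluster_centers_), y_tr)
auc_rbf = roc_auc_score(
    y_te, lr.predict_proba(rbf_features(X_te_s, km.cluster_centers_))[:, 1])

# Plain k-NN (the analogue of Weka's IBk) for comparison.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr_s, y_tr)
auc_knn = roc_auc_score(y_te, knn.predict_proba(X_te_s)[:, 1])

print(f"RBF-style pipeline AUC: {auc_rbf:.3f}")
print(f"k-NN AUC:               {auc_knn:.3f}")
```

One thing the sketch makes visible: the RBF pipeline compresses every sample into just a handful of cluster-similarity features before the classifier ever sees it, whereas k-NN works directly in the full feature space, so the two can behave very differently on the same data.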

I also observed that random forests get results similar to the k-based methods, and that adding a filter to reduce dimensionality improved the random forests, with random projection doing better than PCA.

Could this last point mean that there is a lot of randomness in my data, so that random dimensionality reduction works better than a "principled" reduction like PCA? And what can I deduce from the fact that random forests perform on par with the k-based methods?
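For reference, the comparison described above can be sketched like this (again via scikit-learn rather than Weka's filters, on a made-up dataset; the component counts and forest size are placeholder assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import GaussianRandomProjection

# Synthetic stand-in: many features, few of them informative.
X, y = make_classification(n_samples=400, n_features=30, n_informative=5,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

results = {}
for name, reducer in [
    ("PCA", PCA(n_components=10, random_state=1)),
    ("Random projection", GaussianRandomProjection(n_components=10,
                                                   random_state=1)),
]:
    # Same random forest after each dimensionality-reduction filter.
    model = make_pipeline(reducer,
                          RandomForestClassifier(n_estimators=100,
                                                 random_state=1))
    model.fit(X_tr, y_tr)
    results[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {results[name]:.3f}")
```

Which filter wins depends heavily on the data and the random seed, so a single run like this proves little; repeating it over several seeds or cross-validation folds gives a fairer comparison.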

I feel there is some significance here, but I am not skilled enough to see what it is, and I would be very glad for any insights. Thanks in advance.

Brian Spiering
Ando Jurai
  • The first thing I want to "ask my data" is "is there a relationship in here". If my inputs are all equivalent to random numbers and have no relationship with the output, then I shouldn't be able to get a relationship no matter which "universal function approximator" I use. There are some relatively simple models that can show that there is relationship, even if they can't tell what it is. Once you know there is "information in that-thar data" you can look at dimensionality reduction, or other data-polishing and data-fracking steps. – EngrStudent Apr 22 '17 at 01:21
  • Thanks, it makes sense. I think that sometimes it is less easy than that, because without dimensionality reduction you won't be able to perform your analysis at all, but I get your point. I still don't get why RBF networks could perform worse than the k-methods, or what I could conclude from random forests being on par with them, but I guess that's something we can't really say. – Ando Jurai Apr 26 '17 at 11:57
  • 1
    RBF has a lot more parameters. That means that the "wilderness of the lost" is much larger than for k-means. K-means has "which group does the point belong to" while RBF has not only membership, but a multivariate centroid and covariance matrix. Instead of being a 1-d wilderness, it can be much more highly dimensioned. You can think about it in terms of condition, if it helps. – EngrStudent Apr 27 '17 at 16:16

0 Answers