
I have an upcoming publication in which I present an algorithm for clustering tabular datasets with missing values. I want to do both a quantitative and a qualitative evaluation of my algorithm. In this question I am asking for ideas for attractive qualitative experiments that "wow" the audience.

  1. Quantitative evaluation: An obvious experiment is to take a complete dataset, remove some values, and compare the clustering of the complete dataset with that of the incomplete one. This allows for a clean quantitative evaluation, which is why I will do it (a minimal sketch of this protocol follows this list). However, it is not very attractive.
  2. Qualitative evaluation: Ignoring the missing-values problem, an attractive use case for clustering is image segmentation, like this and that. However, missing values would have little meaning in those datasets: in practice, all pixel values are available. There might be noise in the color values, but no missing data. Setting noisy values to NA would first require a noise detector, which is not part of my publication. I have trouble finding another use case that one can "see" and that is not the setup I mentioned in 1.
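For point 1, here is a minimal sketch of the masking protocol, assuming a scikit-learn setup. Mean imputation followed by KMeans is only a stand-in for my own algorithm, and the adjusted Rand index quantifies how much the partition degrades:

```python
# Sketch of the quantitative protocol from point 1.
# "cluster_incomplete" is a placeholder for the paper's own algorithm;
# mean imputation + KMeans merely stands in for it here.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = load_iris().data

# Reference clustering on the complete data.
labels_complete = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Remove 20% of the entries completely at random (MCAR).
X_missing = X.copy()
mask = rng.random(X.shape) < 0.2
X_missing[mask] = np.nan

def cluster_incomplete(X_nan, k=3):
    """Placeholder for the clustering-with-missing-values algorithm: impute, then cluster."""
    X_imp = SimpleImputer(strategy="mean").fit_transform(X_nan)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_imp)

labels_incomplete = cluster_incomplete(X_missing)

# Agreement between the two partitions (1.0 = identical up to relabeling).
print("ARI:", adjusted_rand_score(labels_complete, labels_incomplete))
```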
Make42
  • What about clustering time-series data and showing that the inferred values make sense w.r.t. the intuitively expected values based on the trend? – Sean Owen Feb 02 '20 at 22:50
  • @SeanOwen: Let's see if I understand: I take sequences of real numbers which all have the same length. I consider them to be points in real coordinate space, such that the n-th value of each sequence corresponds to the n-th coordinate of the space. I cluster the sequences in this -- very high-dimensional -- data space. I forecast the time series by... extrapolation of the sequences? How do I do this with the clustering? Do you mean that I train a model for each cluster and show that the results are better than if I had trained on the entire dataset? – Make42 Feb 03 '20 at 11:07
  • Ah... that would not make much sense regarding the missing values... Do you perhaps mean that I infer some of the missing values within the time series using my method instead of using classical interpolation? Maybe you can elaborate in more detail -- I am intrigued :-). – Make42 Feb 03 '20 at 11:10
  • Yes. The idea is that you and readers intuitively understand there's some time-oriented generating process; the series are not arbitrary (e.g. daily temperatures), so readers can judge what is qualitatively the right kind of answer to fill in the blanks with. A clustering process doesn't know that, so it would be impressive if it did a decent job of filling in the blanks in a non-trivial time series. – Sean Owen Feb 03 '20 at 21:50
  • @SeanOwen: So, if I have heterogeneous time series, I cannot run a regression on the entire set of time series, because the mean curve would be too smooth. So I would first cluster the time series and do a regression per cluster (a rough sketch of this comparison is appended below the comments). I like the idea very much, but couldn't I just use non-linear regression (e.g. with [Gaussian Processes](https://en.wikipedia.org/wiki/Gaussian_process) or [Kernel Regression](https://en.wikipedia.org/wiki/Kernel_regression)) on each incomplete time series and infer the values that way? What would be the advantage of using a bunch of time series instead of just the single one? – Make42 Feb 04 '20 at 14:56
  • I put this into a question: https://datascience.stackexchange.com/questions/67533/non-parametric-regression-on-set-of-time-series-one-model-for-each-or-one-for-a – Make42 Feb 04 '20 at 15:19
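
A rough sketch of the comparison discussed in the comments, under strong simplifying assumptions: the series are synthetic, the clustering-with-missing-values step is again only a placeholder (mean imputation plus KMeans rather than my algorithm), gaps are filled with the cluster-wise mean curve, and the baseline is plain per-series linear interpolation.

```python
# Cluster heterogeneous time series with gaps, fill each gap with its
# cluster's mean curve, and compare against per-series linear interpolation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 50)

# Two qualitatively different generating processes (rising trend vs. seasonal).
rising = np.array([2 * t + rng.normal(0, 0.1, t.size) for _ in range(30)])
seasonal = np.array([np.sin(4 * np.pi * t) + rng.normal(0, 0.1, t.size) for _ in range(30)])
X = np.vstack([rising, seasonal])

# Hide 30% of the observations.
mask = rng.random(X.shape) < 0.3
X_obs = np.where(mask, np.nan, X)

# Placeholder clustering of the incomplete series (not the paper's algorithm).
X_imp = SimpleImputer(strategy="mean").fit_transform(X_obs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_imp)

# Cluster-based fill: replace each gap with the cluster-wise mean at that time point.
filled_cluster = X_obs.copy()
for k in np.unique(labels):
    cluster_mean = np.nanmean(X_obs[labels == k], axis=0)
    for i in np.where(labels == k)[0]:
        gaps = np.isnan(filled_cluster[i])
        filled_cluster[i, gaps] = cluster_mean[gaps]

# Per-series baseline: linear interpolation within each series.
filled_interp = X_obs.copy()
for i in range(X.shape[0]):
    gaps = np.isnan(filled_interp[i])
    filled_interp[i, gaps] = np.interp(t[gaps], t[~gaps], filled_interp[i, ~gaps])

print("RMSE cluster fill:   ", np.sqrt(np.mean((filled_cluster[mask] - X[mask]) ** 2)))
print("RMSE interpolation:  ", np.sqrt(np.mean((filled_interp[mask] - X[mask]) ** 2)))
```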

0 Answers