
I have an upcoming publication in which I present an algorithm for clustering tabular datasets with missing values. I want to do both a quantitative and a qualitative evaluation of my algorithm. In this question I am asking for ideas for attractive qualitative experiments that "wow" the audience.

  1. Quantitative evaluation: An obvious experiment is to take a complete dataset, remove some values, and compare the clustering of the complete dataset with that of the incomplete one. This allows for a clean quantitative evaluation, which is why I will do it (a minimal sketch of this protocol follows this list). However, it is not very attractive.
  2. Qualitative evaluation: Ignoring the missing-values problem, an attractive use case for clustering is image segmentation, like this and that. However, missing values would have little meaning in those datasets: in practice, all pixel values are available. There might be noise in the color values, but no missing data. Setting noisy values to NA would first require a noise detector, which is not part of my publication. I have trouble finding another use case that one can "see" and that is not the setup I mentioned in 1.
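For point 1, here is a minimal sketch of the masking protocol, assuming a scikit-learn setup. Mean imputation followed by KMeans is only a stand-in for my own algorithm, and the adjusted Rand index quantifies how much the partition degrades:

```python
# Sketch of the quantitative protocol from point 1.
# "cluster_incomplete" is a placeholder for the paper's own algorithm;
# mean imputation + KMeans merely stands in for it here.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = load_iris().data

# Reference clustering on the complete data.
labels_complete = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Remove 20% of the entries completely at random (MCAR).
X_missing = X.copy()
mask = rng.random(X.shape) < 0.2
X_missing[mask] = np.nan

def cluster_incomplete(X_nan, k=3):
    """Placeholder for the clustering-with-missing-values algorithm: impute, then cluster."""
    X_imp = SimpleImputer(strategy="mean").fit_transform(X_nan)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_imp)

labels_incomplete = cluster_incomplete(X_missing)

# Agreement between the two partitions (1.0 = identical up to relabeling).
print("ARI:", adjusted_rand_score(labels_complete, labels_incomplete))
```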
Make42
  • What about clustering time-series data and showing that the inferred values make sense w.r.t. the intuitively expected values based on the trend? – Sean Owen Feb 02 '20 at 22:50
  • @SeanOwen: Let's see if I understand: I take sequences of real numbers which all have the same length. I consider them to be points in real coordinate space, such that the n-th value of each sequence corresponds to the n-th coordinate of the space. I cluster the sequences in this -- very high-dimensional -- data space. I forecast the time series by... extrapolation of the sequences? How do I do this with the clustering? Do you mean that I train a model for each cluster and show that the results are better than if I had trained on the entire dataset? – Make42 Feb 03 '20 at 11:07
  • Ah... that would not make much sense regarding the missing values... Do you perhaps mean that I infer some of the missing values within the time series using my method instead of using classical interpolation? Maybe you can elaborate in more detail -- I am intrigued :-). – Make42 Feb 03 '20 at 11:10
  • Yes. The idea is that you and readers intuitively understand there's some time-oriented generating process; the series are not arbitrary (e.g. daily temperatures), so readers can judge what is qualitatively the right kind of answer to fill in the blanks with. A clustering process doesn't know that, so it would be impressive if it did a decent job of filling in the blanks in a non-trivial time series. – Sean Owen Feb 03 '20 at 21:50
  • @SeanOwen: So, if I have heterogeneous time series, I cannot run a regression on the entire set of time series, because the mean curve would be too smooth. So I would first cluster the time series and do a regression per cluster (a rough sketch of this comparison is appended below the comments). I like the idea very much, but couldn't I just use non-linear regression (e.g. with [Gaussian Processes](https://en.wikipedia.org/wiki/Gaussian_process) or [Kernel Regression](https://en.wikipedia.org/wiki/Kernel_regression)) on each incomplete time series and infer the values that way? What would be the advantage of using a bunch of time series instead of just the single one? – Make42 Feb 04 '20 at 14:56
  • I put this into a question: https://datascience.stackexchange.com/questions/67533/non-parametric-regression-on-set-of-time-series-one-model-for-each-or-one-for-a – Make42 Feb 04 '20 at 15:19
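
A rough sketch of the comparison discussed in the comments, under strong simplifying assumptions: the series are synthetic, the clustering-with-missing-values step is again only a placeholder (mean imputation plus KMeans rather than my algorithm), gaps are filled with the cluster-wise mean curve, and the baseline is plain per-series linear interpolation.

```python
# Cluster heterogeneous time series with gaps, fill each gap with its
# cluster's mean curve, and compare against per-series linear interpolation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 50)

# Two qualitatively different generating processes (rising trend vs. seasonal).
rising = np.array([2 * t + rng.normal(0, 0.1, t.size) for _ in range(30)])
seasonal = np.array([np.sin(4 * np.pi * t) + rng.normal(0, 0.1, t.size) for _ in range(30)])
X = np.vstack([rising, seasonal])

# Hide 30% of the observations.
mask = rng.random(X.shape) < 0.3
X_obs = np.where(mask, np.nan, X)

# Placeholder clustering of the incomplete series (not the paper's algorithm).
X_imp = SimpleImputer(strategy="mean").fit_transform(X_obs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_imp)

# Cluster-based fill: replace each gap with the cluster-wise mean at that time point.
filled_cluster = X_obs.copy()
for k in np.unique(labels):
    cluster_mean = np.nanmean(X_obs[labels == k], axis=0)
    for i in np.where(labels == k)[0]:
        gaps = np.isnan(filled_cluster[i])
        filled_cluster[i, gaps] = cluster_mean[gaps]

# Per-series baseline: linear interpolation within each series.
filled_interp = X_obs.copy()
for i in range(X.shape[0]):
    gaps = np.isnan(filled_interp[i])
    filled_interp[i, gaps] = np.interp(t[gaps], t[~gaps], filled_interp[i, ~gaps])

print("RMSE cluster fill:   ", np.sqrt(np.mean((filled_cluster[mask] - X[mask]) ** 2)))
print("RMSE interpolation:  ", np.sqrt(np.mean((filled_interp[mask] - X[mask]) ** 2)))
```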

0 Answers