Are cluster features and micro-clusters good summary statistics for outlier detection in high-dimensional data streams?

Question

I'm dealing with outlier detection in data streams. I'm looking for a way to summarize my data and obtain important statistics such as means and variances. I want to know whether cluster features or micro-clusters are suitable for this.

score 2 · Answer 1 · answered Dec 26 '19 at 08:40


No.

Because assignment to micro-clusters is distance-based, and distances lose their discriminative power in high-dimensional data. Most likely one micro-cluster will become the most central by chance and collect all the samples.
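To make the distance-based assignment concrete, here is a minimal sketch of a BIRCH-style cluster feature, i.e. the additive triple (N, LS, SS) from which per-dimension means and variances can be recovered. The names `ClusterFeature` and `nearest` are hypothetical and chosen only for illustration; this is not the API of any particular library, and real micro-clusters (e.g. in CluStream) carry extra fields such as timestamps.

```python
import math


class ClusterFeature:
    """Additive summary (N, LS, SS) of the points absorbed so far."""

    def __init__(self, dim):
        self.n = 0               # number of points
        self.ls = [0.0] * dim    # per-dimension linear sum
        self.ss = [0.0] * dim    # per-dimension sum of squares

    def add(self, x):
        """Absorb one point; O(dim) time, no raw data is stored."""
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def mean(self):
        return [s / self.n for s in self.ls]

    def variance(self):
        # Population variance per dimension: E[x^2] - E[x]^2,
        # clamped at 0 to guard against floating-point round-off.
        return [max(ss / self.n - (ls / self.n) ** 2, 0.0)
                for ls, ss in zip(self.ls, self.ss)]


def nearest(clusters, x):
    """Index of the (non-empty) micro-cluster whose centroid is closest to x.

    This Euclidean nearest-centroid step is the part that degrades
    in high-dimensional data.
    """
    return min(range(len(clusters)),
               key=lambda k: math.dist(clusters[k].mean(), x))


cf = ClusterFeature(2)
for p in [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]:
    cf.add(p)
print(cf.mean())      # [3.0, 4.0]
print(cf.variance())
```

So the summary itself gives you means and variances cheaply; the weak point is the `nearest` step that decides which summary a new sample updates.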

Has QUIT--Anony-Mousse


Thank you for your answer. – I Sui Dec 27 '19 at 10:32

score 1 · Answer 2 · answered Jan 19 '22 at 05:16

Traditional clustering algorithms that rely on Euclidean distance fail to yield good results on high-dimensional data because of the curse of dimensionality.

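As the number of dimensions grows, the distances between data points grow and concentrate: the nearest and the farthest neighbour of a point end up at almost the same distance. Euclidean distance, the most common distance used for clustering, then loses its meaning as a way to tell close points from far ones.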
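A quick way to see this effect is to measure the relative contrast, (d_max - d_min) / d_min, of Euclidean distances from one query point to a set of random points as the dimension grows. The sketch below assumes uniformly random data in the unit cube, which is only an illustration and not your stream; the function name `relative_contrast` and its parameters are made up for this example.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min for distances from one random query point
    to n_points uniform random points in the unit cube [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


# The contrast should shrink towards 0 as the dimension increases.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

When the contrast is close to zero, "nearest centroid" is decided almost by noise, which is why Euclidean assignment becomes unreliable.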

So if you are using any Euclidean-based clustering algorithm, I would strongly suggest not doing that.

But if your clustering algorithm is not affected by the high-dimensionality problem, like hierarchical DBSCAN (HDBSCAN), you can do what you are suggesting.
