I'm dealing with outlier detection in data streams. I'm looking for a way to summarize my data and obtain important statistics such as the mean and variance. I want to know whether cluster features or microclusters are suitable for this.
2 Answers
No.
Assignment to microclusters is distance-based, and distances stop being meaningful in high-dimensional data. Most likely, one microcluster will become the most central by chance and collect all the samples.
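For context, here is a minimal sketch of what a cluster feature stores and where the distance-based assignment comes in. It assumes the usual additive summary of count, per-dimension linear sum, and per-dimension squared sum. The names `ClusterFeature` and `nearest` are made up for illustration and are not taken from any particular library.

```python
import math


class ClusterFeature:
    """Additive summary (n, linear sum, squared sum) of the points absorbed so far."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim  # per-dimension sum of values
        self.ss = [0.0] * dim  # per-dimension sum of squared values

    def add(self, x):
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def mean(self):
        return [s / self.n for s in self.ls]

    def variance(self):
        # Population variance per dimension: E[x^2] - E[x]^2,
        # clamped at 0 to absorb floating-point round-off.
        return [
            max(ss / self.n - (ls / self.n) ** 2, 0.0)
            for ls, ss in zip(self.ls, self.ss)
        ]


def nearest(clusters, x):
    """Index of the non-empty microcluster whose centroid is closest to x (Euclidean)."""
    return min(
        (i for i, c in enumerate(clusters) if c.n > 0),
        key=lambda i: math.dist(clusters[i].mean(), x),
    )


cf = ClusterFeature(dim=2)
for point in [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]:
    cf.add(point)
print(cf.mean(), cf.variance())
```

The mean and variance come straight from the three running sums, so the summary itself is cheap to maintain on a stream. The weak point is `nearest`: it decides which microcluster absorbs a point purely by Euclidean distance to the centroid.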
Thank you for your answer. – I Sui Dec 27 '19 at 10:32
Traditional clustering algorithms that rely on Euclidean distance fail to yield good results on high-dimensional data because of the curse of dimensionality.
As the number of dimensions grows, the distances between data points concentrate: the nearest and the farthest points end up almost equally far away, so the Euclidean distance, the most common distance used for clustering, loses its meaning.
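A quick way to see this effect is to measure how much the farthest distance exceeds the nearest one as the dimension grows. The sketch below is only an illustration on uniform random data, using the standard library. The helper `relative_contrast` and its parameters are my own names, not part of any clustering package.

```python
import math
import random


def relative_contrast(dim, n_points=100, seed=0):
    """(max - min) / min over Euclidean distances from one query point
    to n_points points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 20, 200, 2000):
    print(dim, round(relative_contrast(dim), 3))
```

On data like this the printed contrast should shrink steadily with the dimension, which is the sense in which "nearest" stops meaning much. Real data with strong low-dimensional structure may behave better, so it is worth running the same check on a sample of your own stream.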
So if you are using any Euclidean-based clustering algorithm, I would strongly advise against it.
But if your clustering algorithm is not affected by the high-dimensionality problem, such as hierarchical DBSCAN, you can do what you are suggesting.