
I have a large dataset with a feature y that depends in part on features x1 and x2. All features are noisy, and y also depends on other parameters not captured in the dataset. I would like to detect when y is taking an abnormal value given x1 and x2. I am implementing my analysis in Python, mostly using sklearn. Is there an easy way to use unsupervised anomaly detection (Isolation Forest, for example) to do this, or is there another relatively simple, obvious approach?

There is a similar previous question here: "How to determine the abnormality of a specific variable by taking into account all the other variables in the data?", but it has no clear answer, and I'm not certain from the wording that the author was asking about exactly the same situation.

Thanks in advance!

1 Answer


Multivariate outlier detection can be done with several algorithms: Isolation Forest (as mentioned in the link you've shared), DBSCAN, z-score methods, and PCA.
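
As a starting point, here is a minimal sketch of the Isolation Forest route with sklearn. The toy data and the contamination value are assumptions to keep the example self-contained; substitute your own arrays. Fitting on the joint matrix of (x1, x2, y) means a point is flagged when the combination of values is unusual, which covers the case where y is abnormal given x1 and x2:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data standing in for your dataset: y depends on x1 and x2 plus noise.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 1000))
y = 2.0 * x1 - x2 + rng.normal(scale=0.5, size=1000)

# Stack the features so the forest sees the joint distribution of (x1, x2, y).
X = np.column_stack([x1, x2, y])

# contamination is the expected fraction of anomalies; tune it to your data.
iso = IsolationForest(contamination=0.01, random_state=0)
labels = iso.fit_predict(X)  # -1 = anomaly, 1 = normal

print(np.where(labels == -1)[0], "indices flagged as jointly abnormal")
```

If you prefer continuous anomaly scores over hard labels, IsolationForest also exposes a decision_function method.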

The most universal one is PCA, because it lets you detect and understand correlations between features.
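
To make that concrete, here is a sketch of one common PCA-based recipe, scoring points by reconstruction error; the recipe itself is my suggestion, and the toy data are the same placeholder as above. Points that are poorly reconstructed from the leading components don't follow the dominant correlation structure, which is exactly what it means for y to be inconsistent with x1 and x2:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same toy setup as above: y depends linearly on x1 and x2 plus noise.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 1000))
y = 2.0 * x1 - x2 + rng.normal(scale=0.5, size=1000)
X = np.column_stack([x1, x2, y])

# Standardize so no single feature dominates the components by scale alone.
X_scaled = StandardScaler().fit_transform(X)

# Keep fewer components than features: the leading components capture the
# correlation structure, and deviations from it become reconstruction error.
pca = PCA(n_components=2)
X_recon = pca.inverse_transform(pca.fit_transform(X_scaled))

# Per-point squared reconstruction error; large values break the pattern.
errors = np.sum((X_scaled - X_recon) ** 2, axis=1)

# Flag the worst 1% as candidate anomalies (the threshold is a tunable choice).
outliers = np.where(errors > np.quantile(errors, 0.99))[0]
print(len(outliers), "candidate anomalies")
print(pca.explained_variance_ratio_)  # variance explained by each component
```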

Nevertheless, UMAP, being non-linear, can detect outliers more accurately.

I recommend starting with PCA to understand the dependencies between features, and then using UMAP for better results.
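
Here is a sketch of the UMAP step, assuming the third-party umap-learn package. UMAP only produces the low-dimensional embedding, so scoring that embedding with sklearn's LocalOutlierFactor is my own choice of final step:

```python
import numpy as np
import umap  # third-party package: pip install umap-learn
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

# Same toy [x1, x2, y] matrix as in the earlier sketches.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 1000))
y = 2.0 * x1 - x2 + rng.normal(scale=0.5, size=1000)
X = StandardScaler().fit_transform(np.column_stack([x1, x2, y]))

# Non-linear embedding: points that violate the relationship between y and
# (x1, x2) should land away from the bulk of the data in the embedding.
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

# Score the embedding with LocalOutlierFactor (my choice; UMAP itself
# only produces the embedding, not anomaly labels).
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(embedding)  # -1 = anomaly, 1 = normal
print(np.where(labels == -1)[0])
```

Note that n_neighbors controls how local the outlier notion is; larger values make the detector more global.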

Nicolas Martin
  • Thanks for the response. To clarify, I understand how these and various other algorithms can be used to detect outliers; what I'm not sure about is whether they can detect that a specified variable is taking an abnormal value given the values of the other variables on which it depends. Thanks again, and my apologies if I'm being obtuse. – user18236139 Dec 08 '22 at 16:51
  • No problem at all. Those algorithms are also useful for detecting how correlated the features are. I understand you want a simple approach with one feature dependent on two others, but dependencies (or correlations) tend to be multi-directional. Consequently, even if those algorithms don't specifically extract the dependency of one feature on two others, they do capture it, because they are more general. This is also recommended because you might see new correlations that weren't visible before. – Nicolas Martin Dec 08 '22 at 18:41
  • Does this answer your question? If not, please let me know and I can provide additional information. – Nicolas Martin Dec 13 '22 at 13:13
  • Partially, I think - I'm still not entirely clear on how to use these algorithms in the specified application, but I'm thinking about it. – user18236139 Dec 15 '22 at 22:43
  • I think the main issue here is the "dependency". It seems that you have dependencies in one direction. In general, dependencies are multi-directional (at least bi-directional), so if you can predict one feature, you can predict the others, and vice versa. – Nicolas Martin Dec 16 '22 at 08:32