
I am working on detecting anomalies within a large time-series data set. It is updated on a regular basis and consists of more than 30 parameters. I am using R as my reference language.

This is my first project of this kind, and I am unfamiliar with most of the techniques. I have 6 weeks to implement a good analytical toolbox to enhance the quality of the control checks on the production line.

I have found a couple of potential methods to analyze it, including statistical machine learning, deep learning with autoencoder neural networks, and clustering approaches. The chosen method should detect the anomalies/outliers by itself. It doesn't need to be a real-time analysis. Which approach would you recommend for the scope of the project, given the structure of the data?

    Try Dilini Talagala's packages: https://github.com/pridiltal/oddstream and https://github.com/pridiltal/stray – Rob Hyndman Jun 20 '18 at 19:24

1 Answer


Following J. Tukey, you should plot, draw graphs, and visualize the data until you have a solid set of example anomalies.
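
A minimal base-R sketch of this first visual pass, assuming the parameters sit in a hypothetical data frame `df` with one numeric column per parameter and rows in time order:

```r
## Quick visual pass over every parameter; `df` is a hypothetical
## data frame with one numeric column per parameter, rows in time order.
for (p in names(df)) {
  plot(df[[p]], type = "l", main = p, xlab = "time index", ylab = p)
}

## Side-by-side boxplots of the standardised parameters expose gross
## outliers at a glance.
boxplot(scale(df), las = 2, main = "Standardised parameters")
```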

Then build Tukey's fences on each of the 30 parameters. Let $q_1$ and $q_3$ be the first and third quartiles, $d = q_3 - q_1$ the inter-quartile distance, and define as an outlier any observation $x$ falling outside the interval $q_1 - k\cdot d < x < q_3 + k\cdot d$, where $k$ is a constant. Traditionally, $k = 1.5$ flags an outlier and $k = 3$ flags a "far out" point. However, the right value of $k$ should be tested against your examples.
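
A short base-R sketch of these fences, again assuming the hypothetical data frame `df`:

```r
## Flag observations outside Tukey's fences for one numeric vector.
## k = 1.5 is the traditional "outlier" fence, k = 3 the "far out" fence;
## tune k against your labelled examples.
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  d <- q[2] - q[1]                       # inter-quartile distance
  x < q[1] - k * d | x > q[2] + k * d    # logical flag per observation
}

## Apply to every parameter; the result is a logical matrix of flags.
flags <- sapply(df, tukey_outliers, k = 1.5)
colSums(flags)  # number of flagged points per parameter
```

Because the flags are computed per parameter, a point that trips the fences on several parameters at once is a particularly strong outlier candidate.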

Then run a cluster analysis (for instance with a $k$-nearest-neighbour approach) and define as an outlier any point that ends up isolated in its own cluster. Again, use your examples to test various values of $k$.
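
One way to realise this idea in base R, as a sketch rather than the answer's exact method, is to score each point by the distance to its $k$-th nearest neighbour and flag the most isolated points (the full distance matrix keeps this suitable for moderate sample sizes only):

```r
## Isolation score via k-nearest-neighbour distance (base-R sketch).
## `df` is the same hypothetical data frame; standardise first so that
## no single parameter dominates the Euclidean distance.
X <- scale(df)
D <- as.matrix(dist(X))   # full pairwise distance matrix, O(n^2) memory
k <- 5                    # number of neighbours; tune against your examples

## For each point, distance to its k-th nearest neighbour
## (sort(row)[1] is the self-distance 0, so take position k + 1).
knn_dist <- apply(D, 1, function(row) sort(row)[k + 1])

## Flag, say, the 1% most isolated points as candidate outliers.
outliers <- knn_dist > quantile(knn_dist, 0.99)
which(outliers)
```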

AlainD