Questions tagged [outlier]

For questions regarding outliers or unusual points in the data.

An outlier is an observation that appears to be unusual or not well described relative to a simple characterization of a dataset. A discomfiting possibility is that these data come from a different population than the one intended to be studied.

Outliers are not necessarily bad or wrong, nor do they necessarily need to be removed from data before further analysis. They do indicate, however, that some data (there can be more than one outlier in any dataset) at least appear to differ from the bulk of the dataset, suggesting they should be individually examined and understood. Also, some statistical procedures are sensitive to outliers: removing one or more outliers could substantially change the conclusions of those procedures.
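
The sensitivity mentioned above can be illustrated with a minimal sketch. The sample values below are hypothetical and chosen only for illustration; the point is that the mean shifts considerably when one extreme value is dropped, while the median barely moves.

```python
from statistics import mean, median


def summarize(values):
    """Return (mean, median) for a list of numbers."""
    return mean(values), median(values)


# Hypothetical sample: six similar readings plus one extreme value.
sample = [10, 12, 11, 13, 12, 11, 100]

# Drop the extreme value by hand, purely to compare the two summaries.
trimmed = [v for v in sample if v != 100]

mean_all, median_all = summarize(sample)
mean_trim, median_trim = summarize(trimmed)

print(f"with outlier:    mean={mean_all:.2f}, median={median_all}")
print(f"without outlier: mean={mean_trim:.2f}, median={median_trim}")
```

In this made-up sample the mean falls from about 24.1 to 11.5 once the extreme value is removed, while the median only moves from 12 to 11.5. This does not by itself justify removing the point; it only shows why the choice of procedure matters.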

216 questions
13 votes, 3 answers

How to remove outliers using box-plot?

I have data for a metric grouped by date. I have plotted the data; now, how do I remove the values outside the range of the boxplot (outliers)? All the ['AVG'] data is in a single column, and I need it for time series modelling.
Uday T
13 votes, 4 answers

What is the difference between outlier detection and anomaly detection?

I would like to know the difference in terms of applications (e.g. which one is credit card fraud detection?) and in terms of the techniques used. Example papers which define the task would be welcome.
Martin Thoma
11 votes, 2 answers

Tools for automatic anomaly detection on a SQL table?

I have a large SQL table that is essentially a log. The data is pretty complex and I'm trying to find some way to identify anomalies without me understanding all the data. I've found lots of tools for Anomaly Detection but most of them require a…
THE JOATMON
10 votes, 4 answers

Gas consumption outliers detection - Neural network project. Bad results

I tried to detect outliers in the gas consumption of some Dutch buildings by building a neural network model. I have very bad results, but I can't find the reason. I am not an expert, so I would like to ask you what I can improve and what I'm…
marcodena
10 votes, 2 answers

Scalable Outlier/Anomaly Detection

I am trying to set up a big data infrastructure using Hadoop, Hive, Elastic Search (amongst others), and I would like to run some algorithms over certain datasets. I would like the algorithms themselves to be scalable, so this excludes using tools…
doublebyte
10 votes, 1 answer

Difference: Replicator Neural Network vs. Autoencoder

I'm currently studying papers about outlier detection using RNNs (Replicator Neural Networks) and wonder what exactly the difference from autoencoders is. RNNs seem to be treated by many as the holy grail of outlier/anomaly detection, however…
Nex
9 votes, 2 answers

In elbow curve how to find the point from where the curve starts to rise?

I am computing a distance metric on my data. The results are then sorted in ascending order. The samples with a distance greater than a specific threshold are to be marked as outliers and discarded. Below is a plot of all distance…
Faiz Kidwai
8 votes, 3 answers

Which algorithms or methods can be used to detect an outlier from this data set?

Suppose I have a data set: Amount of money (100, 50, 150, 200, 35, 60, 50, 20, 500). I have searched the web for techniques that can be used to find a possible outlier in this data set, but I ended up confused. My question is: Which…
CN1002
8 votes, 3 answers

Isolation forest sklearn contamination param

I am working on an unsupervised anomaly detection task on time series data using an isolation forest algorithm. I am developing it in Python, specifically with scikit-learn. I found a lot of examples on this, but what is not very clear is how to…
7 votes, 1 answer

How to decide how many n_neighbors to consider while implementing LocalOutlierFactor?

I have a data set with 134,000 rows and 200 columns. I am trying to identify the outliers in the data set using LocalOutlierFactor from scikit-learn. Although I understand how the algorithm works, I am unable to decide n_neighbors for my data…
7 votes, 3 answers

Which outlier detection can detect these outliers?

I have a vector and want to detect outliers in it. The following figure shows the distribution of the vector. Red points are outliers. Blue points are normal points. Yellow points are also normal. I need an outlier detection method (a…
6 votes, 2 answers

How to scale outputs from AutoEncoder from multiple models?

I have a problem for which I have not been able to find any answers in my search so far. BACKGROUND I am working on an anomaly detection problem on machines utilising an auto-encoder. I am building a model file per machine because the machines'…
6 votes, 4 answers

Handling outliers and Null values in Decision tree

Outliers: As I understand it, decision trees are robust to outliers. Can anybody please confirm with an example whether my hypothesis is right? (What if I have a feature ranging from 0 to 9 but there is an outlier whose value is 10000?) Whether it…
deepguy
6 votes, 2 answers

Anomaly detection in time series

The use case: Every day, we compute metrics to check that our systems are doing fine. From time to time, bugs occur in the workflow building these metrics, and I have to build an algorithm that will alert us when it seems…
Den
6 votes, 2 answers

Remove or not to remove outliers

Are there any known academic sources that support not removing outliers? Let's say the outlier is a natural occurrence or it has a relationship to the value of the target variable.
Kusisi Karem