6

Are there any known academic sources that support not removing outliers? For example, when the outlier is a natural occurrence or is related to the value of the target variable.

Kusisi Karem
    It's probably worth browsing the Statistics Q&A site for information on this topic https://stats.stackexchange.com/questions/tagged/outliers?tab=Votes – shadowtalker Nov 09 '22 at 02:57

2 Answers

9

This older study, "Outlier detection and treatment in I/O psychology: A survey of researcher beliefs and an empirical illustration" by Orr et al., surveyed a group of psychology researchers about how they treated outliers. It found that 67% of them would exclude outliers only if there was evidence the outliers were invalid. Most of the rest never excluded outliers. Only 4% stated they would always remove outliers.

In many applications such as sensor fault detection, fraud detection, and disaster risk warning systems it's the outliers or anomalies (assuming they are valid) that are of most interest, as they often indicate the unusual situation we are trying to detect.

Lynn
4

Personally, I have not seen any academic sources on this. However, model performance tends to decrease as the number of outliers increases (an inverse relationship). In conclusion:

  1. If the outliers do not affect the model dramatically, then keeping them is reasonable because they are natural.
  2. If keeping outliers affects the model performance, then there is no point in keeping them.

I think the best thing to do is to test different scenarios. I faced the same issue, so I removed 1%, 2%, and 3% of my outliers and checked how the model performed each time until I reached a balance between the outliers kept and performance.
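The scenario testing described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact procedure used here: the function name `trim_sensitivity`, the straight-line model, the rule of ranking training rows by absolute residual from an initial fit, and the synthetic data are all hypothetical choices. The hold-out set is deliberately left untrimmed, since trimming it as well would make any model look better.

```python
import numpy as np

def trim_sensitivity(x, y, fractions=(0.0, 0.01, 0.02, 0.03),
                     test_size=0.3, seed=0):
    """Refit a line after trimming a fraction of the most extreme training
    rows and report mean absolute error on an untrimmed hold-out set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(len(x) * test_size)
    test, train = idx[:n_test], idx[n_test:]
    x_tr, y_tr = x[train], y[train]
    x_te, y_te = x[test], y[test]

    # Assumed ranking rule: absolute residual from a fit on all training rows.
    slope0, intercept0 = np.polyfit(x_tr, y_tr, 1)
    resid = np.abs(y_tr - (slope0 * x_tr + intercept0))
    order = np.argsort(resid)  # ascending, so the most extreme rows come last

    results = {}
    for f in fractions:
        n_drop = int(round(f * len(y_tr)))
        keep = order[:len(y_tr) - n_drop]
        slope, intercept = np.polyfit(x_tr[keep], y_tr[keep], 1)
        pred = slope * x_te + intercept
        results[f] = float(np.mean(np.abs(y_te - pred)))
    return results

# Synthetic data (made up for this sketch): a linear trend plus a few
# artificially inflated target values.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 500)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 500)
y[:10] += 40

for frac, mae in trim_sensitivity(x, y).items():
    print(f"trimmed {frac:.0%} of training rows -> hold-out MAE {mae:.3f}")
```

A lower hold-out error after trimming only shows that the model is sensitive to those rows. It does not by itself show that the rows are invalid, so the decision to drop them still needs a reason from the data or the domain.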

Niyaz
    This is useful under the assumption that the test sample is correct. But if we suspect there are "outliers" in the training sample (which might affect the model, so we think about removing them), then we should also be open to there being "outliers" in the test sample - and whether we remove "test outliers" or not will have a huge impact on assessed model performance. This is a chicken-and-egg problem, and there is no simple solution that does not require domain knowledge. – Stephan Kolassa Nov 09 '22 at 08:18
  • Yes, you are correct. That is why completely deleting outliers is a bad idea: it gives you perfect results during training and severely hurts performance on test samples. I think the best thing to do is to strike a balance between outliers and performance and accept the reality that your dataset contains outliers :) – Niyaz Nov 09 '22 at 08:45
  • If you apply this blindly, without considering why you have outliers or what they mean, this is mathematically terrible advice. If you remove the data points that don't fit your model, the model will look better on the remaining data points. This is a truism, but it in no way means that the model is actually better. The model should help you understand the data. Whether to exclude outliers should depend on whether they are likely mistakes in the data, not on whether they happen to fit your model assumptions. – quarague Nov 09 '22 at 17:57
  • @quarague, thanks for your comment. First, no one said to remove outliers blindly; of course, suggesting something like that would be terrible advice. Second, removing outliers is usually the last step in model adjustment; if you are not happy with the current result, one option is to check for outliers (after intensive analysis). Third, there are natural outliers and outliers caused by humans. You can't do much about natural outliers, because removing them would falsify reality. My point is about how to eliminate outliers that are caused by human error, such as data entry mistakes; removing those greatly improves model performance. – Niyaz Nov 09 '22 at 18:35
  • 2
    @Niyaz That is sort of my point. Whether or not to remove an outlier should depend on what you think the underlying reason for the outlier is. It has nothing to do with whether your model gets better without the outliers or not. – quarague Nov 09 '22 at 18:38
  • @quarague I totally agree; there is no doubt about that :) – Niyaz Nov 09 '22 at 18:43