
I have a data set with 134,000 rows and 200 columns. I am trying to identify the outliers in the data set using LocalOutlierFactor from scikit-learn. Although I understand how the algorithm works, I am unable to decide on n_neighbors for my data set.

Kindly suggest.

iacob
Neha Bhushan
    Use grid search to find the optimal number of neighbors – Ethan Mar 10 '19 at 22:18
  • This paper may be of interest: [Automatic Hyperparameter Tuning Method for Local Outlier Factor, with Applications to Anomaly Detection](https://arxiv.org/pdf/1902.00567.pdf) (Feb 5, 2019) – iacob Mar 11 '19 at 22:36
  • This answer might be helpful: https://stats.stackexchange.com/questions/138675/choosing-a-k-value-for-local-outlier-factor-lof-detection-analysis It describes how to choose n_neighbors based on the linked paper. – Tasos Aug 10 '19 at 10:26

1 Answer


One normally uses grid search to find the optimal parameters in situations like this:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

n = 30  # max number of neighbours you want to consider
param_grid = {'n_neighbors': np.arange(1, n + 1)}  # n_neighbors must be >= 1, so start at 1
grid = GridSearchCV(KNeighborsClassifier(), param_grid)

Then fit this grid to your data to find the best values among those you provided. Note that these are only the best of the candidates you supplied; they may not be global optima, especially if the best value lies at an extreme of your input range:

grid.fit(X_train, y_train)

You can view the optimum parameters from your input by calling:

grid.best_params_
>>> {'n_neighbors': ?}

You can automatically select an estimator with said optimum parameters by calling:

model = grid.best_estimator_
y_pred = model.fit(X_train, y_train).predict(X_test)

Note: you can find the optimum values of other parameters by adding them to the input dictionary param_grid.
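For instance, a sketch of a grid that also tunes the `weights` parameter (both parameter names are documented options of KNeighborsClassifier; the ranges here are only illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Tune n_neighbors and the neighbour-weighting scheme together.
param_grid = {
    'n_neighbors': np.arange(1, 31),     # n_neighbors must be >= 1
    'weights': ['uniform', 'distance'],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid)
```

After `grid.fit(X_train, y_train)`, `grid.best_params_` will contain one entry per key in the dictionary.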

E. Zeytinci
iacob
  • I think the question asks about how many neighbours to choose in the LocalOutlierFactor data pre-processor, not in applying the KNearestNeighbors Classifier. It is a more difficult problem in that Outlier Detection is in general an unsupervised task. – Attack68 Mar 11 '19 at 20:37
  • Thank you for your response; I finally tried this on my data set. Using the entire training set for grid search took too long, so I used n = 300 samples. I get the best_estimator as 2. What does best_estimator indicate? I am clueless. I looked up grid search online and it's for hyperparameter tuning. – Neha Bhushan Mar 15 '19 at 03:39
  • @NehaBhushan `best_estimator_` returns the estimator you input, automatically loaded with the optimum parameters. You can use it the same way you would use the estimator explicitly, e.g. `best_estimator_` is the same as `KNeighborsClassifier(n_neighbors=best_params_['n_neighbors'])` – iacob Mar 15 '19 at 13:07
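As the comment above notes, LocalOutlierFactor is normally used unsupervised, so there is no label to grid-search against. One common heuristic (a sketch only, on synthetic stand-in data, not a definitive recipe) is to run LOF for several n_neighbors values and see where the set of flagged points stabilises:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))   # stand-in for the real 134000 x 200 data
X[:10] += 6                     # inject ten obvious outliers

# Run LOF across a range of n_neighbors and count the flagged points;
# a region where the flagged set stops changing much is a reasonable pick.
counts = {}
for k in (5, 10, 20, 35, 50):
    labels = LocalOutlierFactor(n_neighbors=k).fit_predict(X)  # -1 = outlier
    counts[k] = int((labels == -1).sum())
print(counts)
```

Note that with a very small k the ten injected outliers can look like a dense mini-cluster of each other and escape detection, which is exactly why scanning a range of k values is informative.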