4

I've searched quite a bit and haven't landed on any useful results.

The problem statement is: given a set of vectors, I wish to perform approximate k-nearest-neighbor searches over it. The caveat is that each of my dimensions represents a different kind of entity, so we cannot use the same weight for every dimension when computing the distance. As a result, solutions like a k-d tree don't work as is.

Is there any data structure or alternative algorithm I can use to find such approximate weighted k-nearest neighbors?

Note: Multiplying the initial input data by the weights up front, so that every dimension ends up with a uniform weight, is not an option.

  • 2
    Assuming you do have a set of weights for each component, why not use a metric like $d(x,y)=\sqrt{\sum_{i=1}^nw_i(x_i-y_i)^2}$ to figure out the closest neighbors? – Alex R. Aug 13 '15 at 20:27
  • 3
    Have you considered scaling your data before applying K-nearest neighbours? – image_doctor Aug 14 '15 at 00:40
  • @AlexR. Will using a custom metric as you suggested still work for knn search using kd-trees? – sushant-hiray Aug 14 '15 at 06:57
  • 5
    If you **want** to weight one dimension higher than others then I suggest you standardize all of your data so that the mean is zero and the standard deviation is one. Then you can multiply the less important dimensions by a factor (2-10) so that they appear farther away to the KNN distance metric and leave the most important dimension un-scaled. Note that both standardizing and scaling are completely reversible processes, so there is very little reason not to use this simple solution. – AN6U5 Aug 14 '15 at 23:29
  • 1
    @AN6U5 Thanks. That certainly makes sense. However, my k-d tree is not constant. It needs to support both adding nodes (which is less frequent) and the k-neighbor search query (which is very frequent). In that case, standardizing the data won't be a good option, correct? – sushant-hiray Aug 17 '15 at 13:05
  • 1
    The definition of "nearest" becomes meaningless in multiple dimensions with un-standardized data. If Alice has 2 dogs and 10 apples and Bobby has 4 dogs and 5 apples, the distance between them without standardization is measured as some fractional power law of dog-apples, which changes as the distance vector changes orientation. It's absolute garbage! Once you standardize, the distance metric is measured in units of standard deviation of the population. I understand that you have some sort of online learning algo, but the math only makes sense if you can define a mean and std. – AN6U5 Aug 17 '15 at 15:35
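
For concreteness, here is a small numpy sketch (all values purely illustrative) of the weighted metric suggested in the comments, and of the algebraic fact behind the standardize/rescale advice: multiplying every coordinate by $\sqrt{w_i}$ once turns the weighted distance into a plain Euclidean distance, which is exactly what a standard k-d tree computes.

```
import numpy as np

w = np.array([2.0, 0.5, 1.0])   # illustrative per-dimension weights
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# weighted Euclidean distance d(x, y) = sqrt(sum_i w_i * (x_i - y_i)^2)
d_weighted = np.sqrt(np.sum(w * (x - y) ** 2))

# equivalent: pre-scale each coordinate by sqrt(w_i) once, then use the plain
# Euclidean distance -- the form a standard kd-tree / ball tree works with
d_scaled = np.linalg.norm(x * np.sqrt(w) - y * np.sqrt(w))

print(d_weighted, d_scaled)     # identical up to floating-point error
```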

2 Answers

2

I strongly recommend scaling the data as described above, because it is much faster than evaluating a custom Python metric for every pair of points. If scaling/preprocessing is not an option, you can pass a custom weighting function through the `metric` parameter of `KNeighborsClassifier`. See the example below.

```
import numpy as np
from sklearn.neighbors import KNeighborsClassifier as KNN

arr = np.random.randn(500, 10)           # train X data: 500 samples, 10 features
y = np.random.randint(2, size=(500,))    # train y data

# per-dimension weights: one weight per feature
weights = np.abs(np.random.randn(10))

# custom weighted Euclidean distance: sqrt(sum_i w_i * (x_i - y_i)^2)
def weighted_distance(sample_x, sample_y):
    return np.sqrt(np.sum(weights * (sample_x - sample_y) ** 2))

knn = KNN(n_neighbors=3, metric=weighted_distance)
knn.fit(arr, y)

test = np.random.randn(5, 10)            # validation or test data
knn.predict(test)                        # predict
```
kate-melnykova
1

As per @AN6U5's comment:

If you want to weight one dimension higher than others then I suggest you standardize all of your data so that the mean is zero and the standard deviation is one. Then you can multiply the less important dimensions by a factor (2-10) so that they appear farther away to the KNN distance metric and leave the most important dimension un-scaled. Note that both standardizing and scaling are completely reversible processes, so there is very little reason not to use this simple solution

tomglynch