4

I've searched quite a bit and haven't landed on any useful results.

The problem statement is: given a set of vectors, I wish to perform approximate k-nearest-neighbor searches over it. The caveat is that each of my dimensions represents a different kind of entity, so we cannot use the same weight for every dimension when computing the distance. As a result, solutions like a k-d tree don't work as is.

Is there any data structure or alternative algorithm I can use to find such approximate weighted k-nearest neighbors?

Note: Multiplying the initial input data by the weights up front, so that every dimension ends up with a uniform weight, is not an option.

  • 2
    Assuming you do have a set of weights for each component, why not use a metric like $d(x,y)=\sqrt{\sum_{i=1}^nw_i(x_i-y_i)^2}$ to figure out the closest neighbors? – Alex R. Aug 13 '15 at 20:27
  • 3
    Have you considered scaling your data before applying K-nearest neighbours? – image_doctor Aug 14 '15 at 00:40
  • @AlexR. Will using a custom metric as you suggested still work for knn search using kd-trees? – sushant-hiray Aug 14 '15 at 06:57
  • 5
    If you **want** to weight one dimension higher than others then I suggest you standardize all of your data so that the mean is zero and the standard deviation is one. Then you can multiply the less important dimensions by a factor (2-10) so that they appear farther away to the KNN distance metric and leave the most important dimension un-scaled. Note that both standardizing and scaling are completely reversible processes, so there is very little reason not to use this simple solution. – AN6U5 Aug 14 '15 at 23:29
  • 1
    @AN6U5 Thanks. That certainly makes sense. However, my k-d tree is not constant. It needs to support both adding nodes (which is less frequent) and the k-neighbor search query (which is very frequent). In that case, standardizing the data won't be a good option, correct? – sushant-hiray Aug 17 '15 at 13:05
  • 1
    The definition of "nearest" becomes meaningless in multiple dimensions with un-standardized data. If Alice has 2 dogs and 10 apples and Bobby has 4 dogs and 5 apples, the distance between them without standardization is measured as some fractional power law of dog-apples, which changes as the distance vector changes orientation. It's absolute garbage! Once you standardize, the distance metric is measured in units of standard deviation of the population. I understand that you have some sort of online learning algo, but the math only makes sense if you can define a mean and std. – AN6U5 Aug 17 '15 at 15:35
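
For concreteness, here is a small numpy sketch (all values purely illustrative) of the weighted metric suggested in the comments, and of the algebraic fact behind the standardize/rescale advice: multiplying every coordinate by $\sqrt{w_i}$ once turns the weighted distance into a plain Euclidean distance, which is exactly what a standard k-d tree computes.

```
import numpy as np

w = np.array([2.0, 0.5, 1.0])   # illustrative per-dimension weights
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# weighted Euclidean distance d(x, y) = sqrt(sum_i w_i * (x_i - y_i)^2)
d_weighted = np.sqrt(np.sum(w * (x - y) ** 2))

# equivalent: pre-scale each coordinate by sqrt(w_i) once, then use the plain
# Euclidean distance -- the form a standard kd-tree / ball tree works with
d_scaled = np.linalg.norm(x * np.sqrt(w) - y * np.sqrt(w))

print(d_weighted, d_scaled)     # identical up to floating-point error
```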

2 Answers

2

I strongly recommend scaling the data as described above, because it is much faster than evaluating a custom Python metric for every pair of points. If scaling/preprocessing is not an option, you can pass a custom weighting function through the `metric` parameter of `KNeighborsClassifier`. See the example below.

```
import numpy as np
from sklearn.neighbors import KNeighborsClassifier as KNN

arr = np.random.randn(500, 10)           # train X data: 500 samples, 10 features
y = np.random.randint(2, size=(500,))    # train y data

# per-dimension weights: one weight per feature
weights = np.abs(np.random.randn(10))

# custom weighted Euclidean distance: sqrt(sum_i w_i * (x_i - y_i)^2)
def weighted_distance(sample_x, sample_y):
    return np.sqrt(np.sum(weights * (sample_x - sample_y) ** 2))

knn = KNN(n_neighbors=3, metric=weighted_distance)
knn.fit(arr, y)

test = np.random.randn(5, 10)            # validation or test data
knn.predict(test)                        # predict
```
kate-melnykova
1

As per @AN6U5's comment:

If you want to weight one dimension higher than others then I suggest you standardize all of your data so that the mean is zero and the standard deviation is one. Then you can multiply the less important dimensions by a factor (2-10) so that they appear farther away to the KNN distance metric and leave the most important dimension un-scaled. Note that both standardizing and scaling are completely reversible processes, so there is very little reason not to use this simple solution

tomglynch