
I have a big sparse matrix of users and the items they like (on the order of 1M users and 100K items, with a very low density of nonzero entries). I'm exploring ways to perform kNN search on it. Given the size of my dataset and some initial tests, I expect that the method I use will need to be either parallel or distributed. So I'm considering two classes of possible solutions: one that is available (or implementable in a reasonably easy way) on a single multicore machine, the other on a Spark cluster, i.e. as a MapReduce program. Here are three broad ideas that I have considered:

  • Assuming a cosine similarity metric, perform the full multiplication of the normalized matrix by its transpose, implemented as a sum of outer products (see the first sketch below)
  • Using locality-sensitive hashing (LSH) (see the second sketch below)
  • First reducing the dimensionality of the problem with PCA
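
To make the first idea concrete, here is a rough single-machine sketch of what I have in mind, using SciPy sparse matrices (the block size is an arbitrary illustrative choice; on Spark, the same product decomposes into a sum over item columns of outer products of each item's user vector):

    import numpy as np
    from sklearn.preprocessing import normalize

    def cosine_knn_blocked(X, k=10, block=1000):
        """Brute-force cosine kNN over the rows of a sparse user-item matrix X.

        Rows are L2-normalized once so that dot products equal cosine
        similarities; the product with the transpose is computed block by
        block and kept sparse, since with very sparse data most user pairs
        share no items and each result row stays short.
        """
        Xn = normalize(X, norm='l2', axis=1).tocsr()
        XnT = Xn.T.tocsr()
        neighbors = []
        for start in range(0, Xn.shape[0], block):
            sims = (Xn[start:start + block] @ XnT).tocsr()
            for r in range(sims.shape[0]):
                lo, hi = sims.indptr[r], sims.indptr[r + 1]
                cols, vals = sims.indices[lo:hi], sims.data[lo:hi]
                keep = cols != start + r              # drop self-similarity
                cols, vals = cols[keep], vals[keep]
                top = np.argsort(-vals)[:k]           # top-k by similarity
                neighbors.append(cols[top])
        return neighbors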

I'd appreciate any thoughts or advice about other possible ways to tackle this problem.
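
Regarding the LSH option, the variant I have in mind is signed random projection, the standard LSH family for cosine similarity. A minimal sketch (the bit count and single-table bucketing are illustrative simplifications; a real setup would use several tables to boost recall):

    import numpy as np
    from collections import defaultdict

    def srp_buckets(X, n_bits=16, seed=0):
        """Hash the rows of a (sparse) matrix with signed random projections.

        Rows with high cosine similarity agree on most projection signs and
        so tend to land in the same bucket; exact similarities are computed
        only within each bucket afterwards.
        """
        rng = np.random.default_rng(seed)
        R = rng.standard_normal((X.shape[1], n_bits))   # random directions
        signs = (X @ R) > 0                             # (n_rows, n_bits)
        codes = signs.astype(np.int64) @ (1 << np.arange(n_bits))
        buckets = defaultdict(list)
        for row, code in enumerate(codes):
            buckets[int(code)].append(row)
        return buckets

Queries would be hashed the same way, with exact cosine similarities computed only within the matching bucket; on Spark, the hash code could serve directly as a shuffle key for grouping candidate rows.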

cjauvin
    I've just been investigating this area and wrote a blog post about what I found. I used an LSH, but I think my sparsity level was higher than you are looking for. http://tttv-engineering.tumblr.com/post/109569205836/scaling-similarity – Philip Pearl Jan 30 '15 at 11:04

3 Answers


I hope the following resources give you some additional ideas toward solving the problem:

1) Research paper "Efficient K-Nearest Neighbor Join Algorithms for High Dimensional Sparse Data": http://arxiv.org/abs/1011.2807

2) Class project paper "Recommendation System Based on Collaborative Filtering" (Stanford University): http://cs229.stanford.edu/proj2008/Wen-RecommendationSystemBasedOnCollaborativeFiltering.pdf

3) Project for the Netflix Prize Competition (k-NN-based): http://cs.carleton.edu/cs_comps/0910/netflixprize/final_results/knn/index.html

4) Research paper "Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data" on the curse of dimensionality phenomenon and its relation to machine learning, in general, and k-NN algorithm, in particular: http://jmlr.org/papers/volume11/radovanovic10a/radovanovic10a.pdf

5) Software for sparse k-NN classification (free, but it appears not to be open source; you might want to clarify this with the authors): http://www.autonlab.org/autonweb/10408.html

6) Several discussion threads on StackOverflow.

7) Take a look at GraphLab, an open source parallel framework for machine learning (http://select.cs.cmu.edu/code/graphlab), which supports parallel clustering via a MapReduce model: http://select.cs.cmu.edu/code/graphlab/clustering.html

You might also check my answer here on Data Science StackExchange on sparse regression for links to relevant R packages and CRAN Task View pages: https://datascience.stackexchange.com/a/918/2452.

Aleksandr Blekh

If you are working on collaborative filtering, you should pose the problem as a low-rank matrix approximation, wherein both the users and the items are co-embedded into the same low-dimensional space. Similarity search will be much simpler then. I recommend using LSH, as you suggested. Another fruitful avenue for dimensionality reduction that has not yet been mentioned is random projection.
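
As a minimal sketch of that suggestion (the 300-dimension target and the toy matrix are placeholder assumptions), scikit-learn's SparseRandomProjection provides a Johnson-Lindenstrauss-style embedding that approximately preserves pairwise distances, after which any standard kNN routine applies:

    import scipy.sparse as sp
    from sklearn.random_projection import SparseRandomProjection
    from sklearn.neighbors import NearestNeighbors

    # Toy stand-in for the real 1M x 100K user-item matrix.
    X = sp.random(10000, 100000, density=0.001, format='csr', random_state=0)

    # Johnson-Lindenstrauss-style projection down to 300 dimensions.
    X_low = SparseRandomProjection(n_components=300, random_state=0).fit_transform(X)

    # Exact kNN in the reduced space; brute-force cosine handles sparse input.
    nn = NearestNeighbors(n_neighbors=10, metric='cosine', algorithm='brute').fit(X_low)
    distances, indices = nn.kneighbors(X_low[:5])

The projection matrix itself is small enough to broadcast to Spark workers, so the embedding step parallelizes trivially.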

Emre

You should use PySparNN, a recent implementation from Facebook in Python, which is bloody fast. It is also easy to use.
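
A minimal usage sketch, based on the project's README (the index class and search signature are taken from that documentation and may differ between versions, so verify against your installed copy):

    import numpy as np
    from scipy.sparse import csr_matrix
    import pysparnn.cluster_index as ci  # https://github.com/facebookresearch/pysparnn

    # Toy stand-in for a real user-item matrix.
    features = csr_matrix(np.random.binomial(1, 0.01, size=(1000, 20000)))
    labels = np.arange(features.shape[0])

    # Build a cluster-pruning index over the sparse rows (API per the README).
    index = ci.MultiClusterIndex(features, labels)

    # Approximate nearest-neighbor search for the first five rows.
    print(index.search(features[:5], k=3, return_distance=False))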

Syzygyyy