
I have a big sparse matrix of users and the items they like (on the order of 1M users and 100K items, with a very low density of nonzero entries). I'm exploring ways to perform kNN search on it. Given the size of my dataset and some initial tests, I expect that the method I use will need to be either parallel or distributed. So I'm considering two classes of possible solutions: one that is available (or implementable in a reasonably easy way) on a single multicore machine, the other on a Spark cluster, i.e. as a MapReduce program. Here are three broad ideas that I have considered:

  • Assuming a cosine similarity metric, perform the full multiplication of the normalized matrix by its transpose, implemented as a sum of outer products (see the first sketch below)
  • Using locality-sensitive hashing (LSH) (see the second sketch below)
  • First reducing the dimensionality of the problem with PCA
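
To make the first idea concrete, here is a rough single-machine sketch of what I have in mind, using SciPy sparse matrices (the block size is an arbitrary illustrative choice; on Spark, the same product decomposes into a sum over item columns of outer products of each item's user vector):

    import numpy as np
    from sklearn.preprocessing import normalize

    def cosine_knn_blocked(X, k=10, block=1000):
        """Brute-force cosine kNN over the rows of a sparse user-item matrix X.

        Rows are L2-normalized once so that dot products equal cosine
        similarities; the product with the transpose is computed block by
        block and kept sparse, since with very sparse data most user pairs
        share no items and each result row stays short.
        """
        Xn = normalize(X, norm='l2', axis=1).tocsr()
        XnT = Xn.T.tocsr()
        neighbors = []
        for start in range(0, Xn.shape[0], block):
            sims = (Xn[start:start + block] @ XnT).tocsr()
            for r in range(sims.shape[0]):
                lo, hi = sims.indptr[r], sims.indptr[r + 1]
                cols, vals = sims.indices[lo:hi], sims.data[lo:hi]
                keep = cols != start + r              # drop self-similarity
                cols, vals = cols[keep], vals[keep]
                top = np.argsort(-vals)[:k]           # top-k by similarity
                neighbors.append(cols[top])
        return neighbors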

I'd appreciate any thoughts or advice about other possible ways to tackle this problem.
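
Regarding the LSH option, the variant I have in mind is signed random projection, the standard LSH family for cosine similarity. A minimal sketch (the bit count and single-table bucketing are illustrative simplifications; a real setup would use several tables to boost recall):

    import numpy as np
    from collections import defaultdict

    def srp_buckets(X, n_bits=16, seed=0):
        """Hash the rows of a (sparse) matrix with signed random projections.

        Rows with high cosine similarity agree on most projection signs and
        so tend to land in the same bucket; exact similarities are computed
        only within each bucket afterwards.
        """
        rng = np.random.default_rng(seed)
        R = rng.standard_normal((X.shape[1], n_bits))   # random directions
        signs = (X @ R) > 0                             # (n_rows, n_bits)
        codes = signs.astype(np.int64) @ (1 << np.arange(n_bits))
        buckets = defaultdict(list)
        for row, code in enumerate(codes):
            buckets[int(code)].append(row)
        return buckets

Queries would be hashed the same way, with exact cosine similarities computed only within the matching bucket; on Spark, the hash code could serve directly as a shuffle key for grouping candidate rows.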

cjauvin
    I've just been investigating this area and wrote a blog post about what I found. I used an LSH, but I think my sparsity level was higher than you are looking for. http://tttv-engineering.tumblr.com/post/109569205836/scaling-similarity – Philip Pearl Jan 30 '15 at 11:04

3 Answers


I hope the following resources give you some additional ideas toward solving the problem:

1) Research paper "Efficient K-Nearest Neighbor Join Algorithms for High Dimensional Sparse Data": http://arxiv.org/abs/1011.2807

2) Class project paper "Recommendation System Based on Collaborative Filtering" (Stanford University): http://cs229.stanford.edu/proj2008/Wen-RecommendationSystemBasedOnCollaborativeFiltering.pdf

3) Project for the Netflix Prize Competition (k-NN-based): http://cs.carleton.edu/cs_comps/0910/netflixprize/final_results/knn/index.html

4) Research paper "Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data" on the curse of dimensionality phenomenon and its relation to machine learning, in general, and k-NN algorithm, in particular: http://jmlr.org/papers/volume11/radovanovic10a/radovanovic10a.pdf

5) Software for sparse k-NN classification (free, but it appears not to be open source; you might want to clarify this with the authors): http://www.autonlab.org/autonweb/10408.html

6) Several discussion threads on StackOverflow.

7) Take a look at GraphLab, an open source parallel framework for machine learning (http://select.cs.cmu.edu/code/graphlab), which supports parallel clustering via a MapReduce model: http://select.cs.cmu.edu/code/graphlab/clustering.html

You might also check my answer here on Data Science StackExchange on sparse regression for links to relevant R packages and CRAN Task View pages: https://datascience.stackexchange.com/a/918/2452.

Aleksandr Blekh

If you are working on collaborative filtering, you should pose the problem as a low-rank matrix approximation, wherein both the users and the items are co-embedded into the same low-dimensional space. Similarity search will be much simpler then. I recommend using LSH, as you suggested. Another fruitful avenue for dimensionality reduction that has not yet been mentioned is random projection.
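
As a minimal sketch of that suggestion (the 300-dimension target and the toy matrix are placeholder assumptions), scikit-learn's SparseRandomProjection provides a Johnson-Lindenstrauss-style embedding that approximately preserves pairwise distances, after which any standard kNN routine applies:

    import scipy.sparse as sp
    from sklearn.random_projection import SparseRandomProjection
    from sklearn.neighbors import NearestNeighbors

    # Toy stand-in for the real 1M x 100K user-item matrix.
    X = sp.random(10000, 100000, density=0.001, format='csr', random_state=0)

    # Johnson-Lindenstrauss-style projection down to 300 dimensions.
    X_low = SparseRandomProjection(n_components=300, random_state=0).fit_transform(X)

    # Exact kNN in the reduced space; brute-force cosine handles sparse input.
    nn = NearestNeighbors(n_neighbors=10, metric='cosine', algorithm='brute').fit(X_low)
    distances, indices = nn.kneighbors(X_low[:5])

The projection matrix itself is small enough to broadcast to Spark workers, so the embedding step parallelizes trivially.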

Emre

You should use PySparNN, a recent implementation from Facebook in Python, which is bloody fast. It is also easy to use.
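
A minimal usage sketch, based on the project's README (the index class and search signature are taken from that documentation and may differ between versions, so verify against your installed copy):

    import numpy as np
    from scipy.sparse import csr_matrix
    import pysparnn.cluster_index as ci  # https://github.com/facebookresearch/pysparnn

    # Toy stand-in for a real user-item matrix.
    features = csr_matrix(np.random.binomial(1, 0.01, size=(1000, 20000)))
    labels = np.arange(features.shape[0])

    # Build a cluster-pruning index over the sparse rows (API per the README).
    index = ci.MultiClusterIndex(features, labels)

    # Approximate nearest-neighbor search for the first five rows.
    print(index.search(features[:5], k=3, return_distance=False))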

Syzygyyy