
I read about https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html, but couldn't find a Spark library implementing it.

I have a columnar string dataset.

The dataset covers around 15-20 million users, with columns such as show_watched, times_watched, genre, and channel. I need to calculate lookalikes for a user (or for 100k users).

How do I find lookalikes for them in a reasonable amount of time?

I have tried indexing the data in Solr and then using Solr MLT (MoreLikeThis) to find similar users, but that takes a lot of time. MLT also ranks matches by TF-IDF, whereas I need users whose times_show_watched is close to the query user's times_show_watched.

Can anyone recommend a better approach, perhaps using another framework, for faster processing?

I also tried clustering with Spark MLlib and then searching only within the cluster a user belongs to, so that the search space is smaller, but I couldn't get this approach to work.
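For reference, here is a rough sketch of the clustering idea I was attempting, assuming the string columns have already been turned into numeric feature vectors (the vectors and numbers below are made up):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical feature vectors: categorical columns one-hot encoded,
// times_watched kept numeric. `sc` is the Spark shell's SparkContext.
val features = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 12.0),
  Vectors.dense(0.0, 1.0, 3.0),
  Vectors.dense(1.0, 0.0, 10.0)
)).cache()

// k = 2 only because this toy set has 3 points; a much larger k
// (and more iterations) would be needed for 15-20 million users.
val model = KMeans.train(features, 2, 20)

// Restrict the lookalike search to the query user's cluster.
val queryUser = Vectors.dense(1.0, 0.0, 11.0)
val clusterId = model.predict(queryUser)
```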

I am open to any approach that would be efficient.

Thanks!

Nikhil Verma
    [Efficient similarity algorithm now in Apache Spark, thanks to Twitter](https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html) – Emre Jul 04 '17 at 19:47
  • How do I use it if I have string data in columns? It requires them to be doubles. – Nikhil Verma Jul 05 '17 at 05:04
  • Use an embedding for the strings, or a categorical encoding, as appropriate. For example, genre is a categorical variable, whereas a free-form sentence should be embedded (see the sketch after these comments). – Emre Jul 05 '17 at 08:05
  • What that algorithm is doing is 'efficiently' computing the square matrix, which is an important step for CCO. I would be shocked to find it does so better than Mahout running on Spark, as they are doing effectively the same thing. For a better understanding of CCO, read Pat Ferrel's blog: http://actionml.com/blog/cco You still need to take the LLR of the squared matrices to do anything useful. Basically, Twitter was doing CCO internally and needed Spark to build square matrices better, so they came up with that algorithm. (Mahout didn't release CCO for Spark until 2015, IIRC.) – rawkintrevo Jul 16 '17 at 04:43
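To make the comments above concrete, here is a minimal, hypothetical sketch combining Emre's suggestion (categorical encoding) with the Twitter DIMSUM algorithm, which Spark exposes as RowMatrix.columnSimilarities. Note that columnSimilarities compares columns, so users are laid out as columns here; the feature layout is an assumption:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Each ROW is one feature, each COLUMN is one user (3 users here):
// row 0: genre == "drama" (one-hot), row 1: genre == "sports",
// row 2: times_watched as a plain numeric feature.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 1.0),
  Vectors.dense(0.0, 1.0, 0.0),
  Vectors.dense(12.0, 3.0, 10.0)
))

// DIMSUM: the threshold trades accuracy for speed on large matrices.
val userSims = new RowMatrix(rows).columnSimilarities(0.1)
userSims.entries.take(5).foreach(println)  // (i, j, cosine similarity)
```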

1 Answer


PMC from Mahout here. We're in the middle of a site re-org at the moment, and things are... well, they're a mess.

Here's a link to something I think is more useful: a tutorial on co-occurrence (CCO) in Spark.

http://mahout.apache.org/docs/latest/tutorials/cco-lastfm/

Re "a Spark library": well, Mahout IS the Spark library.

To use Mahout (Scala only; sorry if you're a Python-phile, but the syntax, especially for Mahout, is very pleasant), you either need to download Mahout and run ./mahout spark-shell from the bin/ directory, or, if you like GUI notebooks and Apache Zeppelin, check out this tutorial for setting up Mahout + Spark on Zeppelin:

http://mahout.apache.org/docs/latest/tutorials/misc/mahout-in-zeppelin/

(If you are compiling a JAR, you just add Mahout as a dependency.)
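Once the shell is up, the core CCO call from the tutorial linked above looks roughly like this. This is a sketch: the pair data is made up, and argument names may differ slightly between Mahout versions, so check against the tutorial:

```scala
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark

// Hypothetical (user_id, show_watched) string pairs; in practice,
// read these from your files instead.
val userShowPairs = sc.parallelize(Seq(
  ("u1", "showA"), ("u1", "showB"), ("u2", "showA"), ("u2", "showC")
))

// Wrap the pairs as an IndexedDataset; Mahout preserves the string IDs.
val userShowIDS = IndexedDatasetSpark.apply(userShowPairs)(sc)

// LLR-filtered co-occurrence (CCO); returns one result per input dataset.
val similarities = SimilarityAnalysis.cooccurrencesIDSs(
  Array(userShowIDS),
  randomSeed = 1234,
  maxInterestingItemsPerThing = 20,
  maxNumInteractions = 500)
```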

rawkintrevo