
I have a DataFrame with the IDF of certain words computed. For example:

(10,[0,1,2,3,4,5],[0.413734499590671,0.4244680552337798,0.4761400657781007,1.4004620708967006,0.37876590175292424,0.48374466516332])

... and so on

Now given a query Q, I can calculate the TF-IDF of this query. How do I calculate the cosine similarity of the query with all documents in the DataFrame (there are close to a million documents)?

I could do it manually in a map-reduce job using vector multiplication:

Cosine Similarity(Q, document) = Dot product(Q, document) / (||Q|| * ||document||)
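For reference, here is a minimal sketch of that manual approach, assuming a DataFrame `docs` with the TF-IDF vectors in a `features` column and a `queryVec` produced by the same pipeline (all names are hypothetical):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Assumed: `docs` is a DataFrame with a TF-IDF column "features", and
// `queryVec: Vector` was computed by the same HashingTF/IDF pipeline.
val queryNorm = Vectors.norm(queryVec, 2.0)

val cosineSim = udf { v: Vector =>
  val sv = v.toSparse
  // dot product over the document's non-zero entries only
  var dot = 0.0
  var k = 0
  while (k < sv.indices.length) {
    dot += sv.values(k) * queryVec(sv.indices(k))
    k += 1
  }
  dot / (Vectors.norm(v, 2.0) * queryNorm)
}

val ranked = docs
  .withColumn("similarity", cosineSim(col("features")))
  .orderBy(col("similarity").desc)
```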

But surely Spark ML must natively support calculating the cosine similarity of a text?

In other words, given a search query, how do I find the documents whose TF-IDF vectors in the DataFrame have the highest cosine similarity with it?

Ganesh Krishnan
  • You can make use of Spark's [Normalizer](https://spark.apache.org/docs/2.0.0/mllib-feature-extraction.html#normalizer) and, if you are interested in "all-pairs similarity", [DIMSUM](https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html). – Emre Aug 10 '16 at 06:44
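Following up on that comment, a minimal sketch of the `Normalizer` idea: after L2-normalization, the cosine similarity between two vectors reduces to a plain dot product. Column names are hypothetical:

```scala
import org.apache.spark.ml.feature.Normalizer

// L2-normalize the TF-IDF vectors so that the cosine similarity between
// two documents becomes a plain dot product.
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(2.0)

val normalized = normalizer.transform(docs)
```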

1 Answer


There's an example related to your problem in the Spark repo here. The strategy is to represent the documents as a RowMatrix and then use its columnSimilarities() method; note that columnSimilarities() compares the matrix's columns, so the documents must be the columns (each row holds one term's weights across the documents). That will get you a matrix of all the pairwise cosine similarities. Extract the entries that correspond to your query document and sort them; that will give you the indices of the most-similar documents.
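A minimal sketch of that strategy, assuming an RDD `termRows` that is already oriented with documents as columns and the query vector appended as the last column (all names are illustrative):

```scala
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

// Assumed: `termRows` is an RDD[Vector] laid out so that documents are
// COLUMNS (each row holds one term's TF-IDF weight across all documents),
// with the query's vector appended as the last column.
val mat = new RowMatrix(termRows)

// Pairwise cosine similarities between columns, as an upper-triangular
// CoordinateMatrix. Passing a threshold switches to the approximate
// DIMSUM algorithm, which is much cheaper at this scale.
val sims = mat.columnSimilarities()          // exact
// val sims = mat.columnSimilarities(0.1)    // approximate (DIMSUM)

val queryIdx = mat.numCols() - 1             // index of the query column
val ranked = sims.entries
  .filter { case MatrixEntry(i, j, _) => i == queryIdx || j == queryIdx }
  .map { case MatrixEntry(i, j, v) => (if (i == queryIdx) j else i, v) }
  .sortBy(-_._2)                             // most-similar documents first
```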

Depending on your application, all of this work can be done pre-query.

Pete