Spark: How can I get item pairs after computing similarities with RowMatrix?

Question

In my recommendation system, I ran into the "all-pairs" similarity problem. According to this Databricks blog post, RowMatrix looks like it can help.

However, RowMatrix is a matrix type without meaningful row indices, so I don't know how to look up the similarity of two specific items i and j after calling columnSimilarities(threshold).

Below is some information about what I am doing:

1) My data file comes from MovieLens and has this format:

 user::item::rating

2) I create a RowMatrix in which each sparse row vector i holds the ratings that all users gave to item i:

 val dataPath = ...
 val ratings: RDD[Rating] = sc.textFile(dataPath).map(_.split("::") match {
   case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
 })
 val rows = ratings.map(rating => (rating.product, (rating.user, rating.rating)))
   .groupByKey()
   .map(p => Vectors.sparse(userAmount, p._2.map(r => (r._1 - 1, r._2)).toSeq))
 val mat = new RowMatrix(rows)
 val similarities = mat.columnSimilarities(0.5)

Now I have a CoordinateMatrix of similarities. How can I get the similarity of two specific items i and j? I can extract an RDD[MatrixEntry] from it, but I'm not sure whether row i and column j of that result correspond to items i and j.

+6

apache-spark apache-spark-mllib

Eric zheng Apr 25 '15 at 2:55

3 answers

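RowMatrix computes similarities between columns, not rows. So you need to change how you build the matrix: group by user instead of by product, i.e. ratings.map(rating=>(rating.user, (rating.product, rating.rating))).groupByKey() (and adjust the following lines accordingly, so that each row is one user and the vector size is the number of products).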

The product identifiers then become the column indices, and you can call columnSimilarities().entries to get the result as (product i, product j, score) entries.

If your product identifiers are large or sparse, you may have to map them to artificial dense index values first. For instance, if you have 3 products with identifiers 1, 900000 and 9000000, map them to 0, 1, 2 and then calculate the similarities. Without this mapping, the matrix has to span column indices 0 to 9000000, even though you only have a few products.
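
A rough sketch of the steps above, in case it helps. The function name productSimilarities and the choice to collect the distinct product ids on the driver are my own assumptions (they only make sense if the id list fits in driver memory), and it assumes each user rates a given product at most once:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Hypothetical helper: returns (productA, productB, cosine similarity) triples.
def productSimilarities(ratings: RDD[Rating], threshold: Double): RDD[(Int, Int, Double)] = {
  // Map each distinct product id to a dense column index 0..n-1.
  // Assumption: the distinct ids fit in driver memory.
  val productIds: Array[Int] = ratings.map(_.product).distinct().collect().sorted
  val indexOf: Map[Int, Int] = productIds.zipWithIndex.toMap
  val numProducts = productIds.length

  // One sparse row per user; the columns are the dense product indices.
  val rows = ratings
    .map(r => (r.user, (indexOf(r.product), r.rating)))
    .groupByKey()
    .map { case (_, cols) => Vectors.sparse(numProducts, cols.toSeq) }

  val sims = new RowMatrix(rows).columnSimilarities(threshold)

  // Translate the column indices back to the original product ids.
  sims.entries.map(e => (productIds(e.i.toInt), productIds(e.j.toInt), e.value))
}
```

Pairs of products that no user rated together produce no entry at all, and each pair shows up only once (the result is upper-triangular). For a large catalogue it would probably be better to broadcast the id map rather than capture it in the closure.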

+1

wind May 08 '15 at 7:27

If you don't need the threshold, you can call columnSimilarities() on an IndexedRowMatrix instead. It works very well for me, and it gives you better control over the row indices.
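
For illustration, a minimal sketch of that variant. The helper name exactItemSimilarities and the numItems parameter are made up for the example, and it assumes item ids are 1-based and no larger than numItems. As far as I know, IndexedRowMatrix.columnSimilarities() was added in Spark 1.6, so check your version:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Hypothetical helper: exact (no threshold) cosine similarities between items.
// Assumption: item ids are 1-based and at most numItems.
def exactItemSimilarities(ratings: RDD[Rating], numItems: Int): CoordinateMatrix = {
  // One indexed row per user; the row index is the user id,
  // the column index is the 0-based item id.
  val rows = ratings
    .map(r => (r.user, (r.product - 1, r.rating)))
    .groupByKey()
    .map { case (user, cols) =>
      IndexedRow(user.toLong, Vectors.sparse(numItems, cols.toSeq))
    }
  new IndexedRowMatrix(rows).columnSimilarities()
}
```

An entry (i, j, value) of the result is then the similarity between items i + 1 and j + 1 in the original numbering.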

+1

Dung Mar 30 '17 at 7:31

Echo · Accepted Answer · 2015-05-11T14:38:32+0000

I ran into the same problem as you and solved it as follows.

Note that columnSimilarities() computes the similarity between column vectors, whereas our "rows" RDD is made up of row vectors. So you need the transpose of "rows"; call it "tran_rows". Then compute tran_rows.columnSimilarities().
After that it is simple: in the result of columnSimilarities(), indices i and j correspond exactly to item i and item j.
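
One possible way to do the transposition, as a sketch rather than the exact code I used: build a CoordinateMatrix in the item-by-user orientation from the question, transpose it, and convert it to a RowMatrix. The helper names itemSimilarities and similarityOf are made up here, and 1-based user and item ids are assumed:

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Hypothetical helper. Assumption: user and item ids are 1-based.
def itemSimilarities(ratings: RDD[Rating], threshold: Double): CoordinateMatrix = {
  // Item-by-user entries (the orientation used in the question), 0-based.
  val itemByUser = new CoordinateMatrix(
    ratings.map(r => MatrixEntry(r.product - 1L, r.user - 1L, r.rating)))
  // After transposing, users are rows and items are columns.
  itemByUser.transpose().toRowMatrix().columnSimilarities(threshold)
}

// Looks up one pair. The result is upper-triangular (i < j),
// so the two indices are ordered before filtering.
def similarityOf(sims: CoordinateMatrix, a: Long, b: Long): Option[Double] = {
  val (i, j) = (math.min(a, b), math.max(a, b))
  sims.entries.filter(e => e.i == i && e.j == j).map(_.value).collect().headOption
}
```

similarityOf returns None when the pair has no entry, which happens when no user rated both items, when the pair was dropped by the threshold sampling, or when a == b.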
