Spark: How can I get item pairs after calculating similarities using RowMatrix?

In my recommendation system, I ran into the "all-pairs" similarity problem. Thanks to this Databricks blog post, it looks like RowMatrix can come to the rescue.

However, RowMatrix is a matrix type without meaningful row indices, so I don't know how to look up the similarity of specific items i and j after calling columnSimilarities(threshold).

Below is some information about what I am doing:

1) My data file comes from Movielens with this format:

 user::item::rating 

2) I create a RowMatrix in which each sparse vector i holds the ratings that all users gave to item i

    val dataPath = ...
    val ratings: RDD[Rating] = sc.textFile(dataPath).map(_.split("::") match {
      case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
    })
    val rows = ratings.map(rating => (rating.product, (rating.user, rating.rating)))
      .groupByKey()
      .map(p => Vectors.sparse(userAmount, p._2.map(r => (r._1 - 1, r._2)).toSeq))
    val mat = new RowMatrix(rows)
    val similarities = mat.columnSimilarities(0.5)

Now I have the similarities as a CoordinateMatrix. How can I get the similarity of specific items i and j? Although I can extract an RDD[MatrixEntry] from it, I'm not sure whether row i and column j correspond to items i and j.

+6
3 answers

I ran into the same problem as you and solved it as follows.

  • Note that columnSimilarities() computes similarities between column vectors. However, your "rows" consist of row vectors, one per item. So you need the transpose of "rows"; call it "tran_rows". Then compute tran_rows.columnSimilarities().

  • After that, things are simple. In the result of columnSimilarities(), indices i and j correspond exactly to item i and item j.
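The transposition step could be sketched roughly as follows. This is only an illustration, assuming the ratings are already an RDD[Rating] as in the question; the helper name itemSimilarities is made up here, and going through CoordinateMatrix is one possible way to obtain "tran_rows", not necessarily what the answer's author did.

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Hypothetical helper (not from the original answer).
def itemSimilarities(ratings: RDD[Rating], threshold: Double): CoordinateMatrix = {
  // Item-by-user layout: the same orientation as "rows" in the question.
  val itemByUser = new CoordinateMatrix(
    ratings.map(r => MatrixEntry(r.product.toLong, r.user.toLong, r.rating)))

  // Transpose so that every column holds one item ("tran_rows").
  val tranRows = itemByUser.transpose().toRowMatrix()

  // Column index == item id, so entry (i, j) is the similarity of items i and j.
  tranRows.columnSimilarities(threshold)
}
```

Because the item id is used directly as the column index, no extra bookkeeping is needed to read the result, at the cost of an unused column 0 when ids start at 1.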

+9

RowMatrix computes similarities between columns, so you need to reverse the way you build the matrix: ratings.map(rating => (rating.user, (rating.product, rating.rating))).groupByKey() (and adjust the following lines accordingly).

The product identifiers then end up in the columns, and you can call columnSimilarities().entries to retrieve the results as (product-from, product-to, score) entries.
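As a rough sketch of that approach, assuming each user rated a product at most once and that product ids are used directly as column indices: the names itemSimilaritiesByUser, similarityOf and numItemColumns are invented for this example. The lookup relies on columnSimilarities returning an upper-triangular matrix, so only entries with i < j are stored.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, RowMatrix}
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Hypothetical helper: one sparse row per user, one column per product id.
// numItemColumns is assumed to be (largest product id + 1).
def itemSimilaritiesByUser(ratings: RDD[Rating], numItemColumns: Int): CoordinateMatrix = {
  val rows = ratings
    .map(r => (r.user, (r.product, r.rating)))
    .groupByKey()
    .map { case (_, itemRatings) => Vectors.sparse(numItemColumns, itemRatings.toSeq) }
  new RowMatrix(rows).columnSimilarities()
}

// Hypothetical helper: similarity of products a and b, or None if no entry was produced.
def similarityOf(sims: CoordinateMatrix, a: Long, b: Long): Option[Double] = {
  // The result is upper triangular, so order the pair as (smaller, larger).
  val (lo, hi) = (math.min(a, b), math.max(a, b))
  sims.entries
    .filter(e => e.i == lo && e.j == hi)
    .map(_.value)
    .collect()
    .headOption
}
```

A missing entry (None) means the pair had no computed similarity, for example because the two products share no users or the pair fell below the threshold.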

If your product identifiers are large, you may have to map them to artificial index values. For instance, if you have 3 products with identifiers 1, 900000 and 9000000, map them to 0, 1, 2 and then calculate the similarities. Without this mapping, the matrix spans columns 0 through 9000000, even though you only have three products.
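The remapping can be illustrated with plain Scala collections. This is a small sketch with an invented helper name (denseIndex); on an RDD, something like distinct() followed by zipWithIndex() would play the same role.

```scala
// Hypothetical helper: map arbitrary product ids to dense indices 0..n-1
// and keep the reverse lookup so results can be translated back.
def denseIndex(productIds: Seq[Int]): (Map[Int, Int], Vector[Int]) = {
  val indexToId = productIds.distinct.sorted.toVector // column index -> product id
  val idToIndex = indexToId.zipWithIndex.toMap        // product id -> column index
  (idToIndex, indexToId)
}

val (idToIndex, indexToId) = denseIndex(Seq(9000000, 1, 900000, 1))
// idToIndex maps 1 -> 0, 900000 -> 1, 9000000 -> 2
// indexToId(2) is 9000000, which turns a result index back into a product id
```

Use idToIndex when building the sparse vectors and indexToId when reading the MatrixEntry indices from the result.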

+1

If you don't need a threshold, you can use columnSimilarities() on an IndexedRowMatrix. It works very well for me, and it gives you a better way to keep track of row indices.
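A possible shape for this, assuming the same RDD[Rating] input as in the question and product ids used as column indices: the helper name and numItemColumns are made up for the example. Note that the similarities are still computed between columns; the row index only keeps the user id attached to each row.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Hypothetical helper: rows are indexed by user id, columns by product id.
// numItemColumns is assumed to be (largest product id + 1).
def similaritiesViaIndexedRows(ratings: RDD[Rating], numItemColumns: Int): CoordinateMatrix = {
  val indexedRows = ratings
    .map(r => (r.user, (r.product, r.rating)))
    .groupByKey()
    .map { case (user, itemRatings) =>
      IndexedRow(user.toLong, Vectors.sparse(numItemColumns, itemRatings.toSeq))
    }
  // No threshold parameter here: similarities are computed without sampling.
  new IndexedRowMatrix(indexedRows).columnSimilarities()
}
```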

+1

Source: https://habr.com/ru/post/985926/