Spark Jaccard similarity calculation slow slow compared to trivial approach

Question

Spark Jaccard similarity calculation slow slow compared to trivial approach

Given 2 huge lists of values, I am trying to calculate the jaccard similarity between them in Spark using Scala.

Suppose colHashed1 contains the first list of values, and colHashed2 contains the second list.

Approach 1 (trivial approach):

 val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)

Approach 2 (using minHashing):

I used the approach described here .

 import java.util.zip.CRC32 def getCRC32 (s : String) : Int = { val crc=new CRC32 crc.update(s.getBytes) return crc.getValue.toInt & 0xffffffff } val maxShingleID = Math.pow(2,32)-1 def pickRandomCoeffs(kIn : Int) : Array[Int] = { var k = kIn val randList = Array.fill(k){0} while(k > 0) { // Get a random shingle ID. var randIndex = (Math.random()*maxShingleID).toInt // Ensure that each random number is unique. while(randList.contains(randIndex)) { randIndex = (Math.random()*maxShingleID).toInt } // Add the random number to the list. k = k - 1 randList(k) = randIndex } return randList } val colHashed1 = list1Values.map(a => getCRC32(a)) val colHashed2 = list2Values.map(a => getCRC32(a)) val nextPrime = 4294967311L val numHashes = 10 val coeffA = pickRandomCoeffs(numHashes) val coeffB = pickRandomCoeffs(numHashes) var signature1 = Array.fill(numHashes){0} for (i <- 0 to numHashes-1) { // Evaluate the hash function. val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime)) // Track the lowest hash code seen. signature1(i) = hashCodeRDD.min.toInt } var signature2 = Array.fill(numHashes){0} for (i <- 0 to numHashes-1) { // Evaluate the hash function. val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime)) // Track the lowest hash code seen. signature2(i) = hashCodeRDD.min.toInt } var count = 0 // Count the number of positions in the minhash signature which are equal. for(k <- 0 to numHashes-1) { if(signature1(k) == signature2(k)) count = count + 1 } val jSimilarity = count/numHashes.toDouble

Approach 1 seems to be superior to approach 2 in terms of time. When I analyzed the code, calling min() on RDD in approach 2 takes a considerable amount of time, and this function is called many times depending on how many hash functions are used.

The intersection and union operations used in approach 1 seem to be faster compared to repeated calls to the min () function.

I do not understand why minHashing does not help here. I expected minHashing to work faster compared to the trivial approach. Am I doing something wrong here?

Sample data can be viewed here.

+5

scala apache-spark apache-spark-mllib

CRM Apr 01 '16 at 13:43

source share

3 answers

tuxdna · Answer 1 · 2016-04-04T07:09:52+0000

JaccardSimilarity with MinHash does not give consistent results:

 import java.util.zip.CRC32 object Jaccard { def getCRC32(s: String): Int = { val crc = new CRC32 crc.update(s.getBytes) return crc.getValue.toInt & 0xffffffff } def pickRandomCoeffs(kIn: Int, maxShingleID: Double): Array[Int] = { var k = kIn val randList = Array.ofDim[Int](k) while (k > 0) { // Get a random shingle ID. var randIndex = (Math.random() * maxShingleID).toInt // Ensure that each random number is unique. while (randList.contains(randIndex)) { randIndex = (Math.random() * maxShingleID).toInt } // Add the random number to the list. k = k - 1 randList(k) = randIndex } return randList } def approach2(list1Values: List[String], list2Values: List[String]) = { val maxShingleID = Math.pow(2, 32) - 1 val colHashed1 = list1Values.map(a => getCRC32(a)) val colHashed2 = list2Values.map(a => getCRC32(a)) val nextPrime = 4294967311L val numHashes = 10 val coeffA = pickRandomCoeffs(numHashes, maxShingleID) val coeffB = pickRandomCoeffs(numHashes, maxShingleID) val signature1 = for (i <- 0 until numHashes) yield { val hashCodeRDD = colHashed1.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime) hashCodeRDD.min.toInt // Track the lowest hash code seen. } val signature2 = for (i <- 0 until numHashes) yield { val hashCodeRDD = colHashed2.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime) hashCodeRDD.min.toInt // Track the lowest hash code seen } val count = (0 until numHashes) .map(k => if (signature1(k) == signature2(k)) 1 else 0) .fold(0)(_ + _) val jSimilarity = count / numHashes.toDouble jSimilarity } // def approach1(list1Values: List[String], list2Values: List[String]) = { // val colHashed1 = list1Values.toSet // val colHashed2 = list2Values.toSet // // val jSimilarity = colHashed1.intersection(colHashed2).distinct.count / (colHashed1.union(colHashed2).distinct.count.toDouble) // jSimilarity // } def approach1(list1Values: List[String], list2Values: List[String]) = { val colHashed1 = list1Values.toSet val colHashed2 = list2Values.toSet val jSimilarity = (colHashed1 & colHashed2).size / (colHashed1 ++ colHashed2).size.toDouble jSimilarity } def main(args: Array[String]) { val list1Values = List("a", "b", "c") val list2Values = List("a", "b", "d") for (i <- 0 until 5) { println(s"Iteration ${i}") println(s" - Approach 1: ${approach1(list1Values, list2Values)}") println(s" - Approach 2: ${approach2(list1Values, list2Values)}") } } }

OUTPUT

 Iteration 0 - Approach 1: 0.5 - Approach 2: 0.5 Iteration 1 - Approach 1: 0.5 - Approach 2: 0.5 Iteration 2 - Approach 1: 0.5 - Approach 2: 0.8 Iteration 3 - Approach 1: 0.5 - Approach 2: 0.8 Iteration 4 - Approach 1: 0.5 - Approach 2: 0.4

Why are you using it?

jtitusj · Answer 2 · 2016-04-04T09:13:24+0000

It seems to me that the overhead for the minHashing approach simply outweighs its functionality in Spark. Moreover, numHashes increasing. Here are some observations I found in your code:

Firstly, while (randList.contains(randIndex)) this part will certainly slow down your process, since numHashes (which, by the way, is equal to the size of randList) increases.

Secondly, you can save some time by rewriting this code:

 var signature1 = Array.fill(numHashes){0} for (i <- 0 to numHashes-1) { // Evaluate the hash function. val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime)) // Track the lowest hash code seen. signature1(i) = hashCodeRDD.min.toInt } var signature2 = Array.fill(numHashes){0} for (i <- 0 to numHashes-1) { // Evaluate the hash function. val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime)) // Track the lowest hash code seen. signature2(i) = hashCodeRDD.min.toInt } var count = 0 // Count the number of positions in the minhash signature which are equal. for(k <- 0 to numHashes-1) { if(signature1(k) == signature2(k)) count = count + 1 }

in

 var count = 0 for (i <- 0 to numHashes - 1) { val hashCodeRDD1 = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime)) val hashCodeRDD2 = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime)) val sig1 = hashCodeRDD1.min.toInt val sig2 = hashCodeRDD2.min.toInt if (sig1 == sig2) { count = count + 1 } }

This method simplifies three loops into one. However, I am not sure that this will give a huge impetus to the computational time.

Another assumption, assuming that the first approach is still much faster, is to use the sets property to modify the first approach:

 val colHashed1_dist = colHashed1.distinct val colHashed2_dist = colHashed2.distinct val intersect_cnt = colHashed1_dist.intersection(colHashed2_dist).distinct.count val jSimilarity = intersect_cnt/(colHashed1_dist.count + colHashed2_dist.count - intersect_cnt).toDouble

with this, instead of getting the union, you can simply reuse the intersection value.

Dmitri · Answer 3 · 2017-08-03T10:32:50+0000

Actually, in the LSH apporach, you calculate minHash only once for each of your documents, and then compare two minHases for each possible pair of documents. And in the case of a trivial approach, you must perform a complete comparison of documents for each possible pair of documents. Which is approximately N ^ 2/2 number of comparisons. Therefore, the additional cost of computing minHash is negligible for a sufficiently large number of documents.

In fact, you should compare the performance of the trivial approach:

 val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)

and Jaccard distance calculation performance (last lines in your code):

 var count = 0 // Count the number of positions in the minhash signature which are equal. for(k <- 0 to numHashes-1) { if(signature1(k) == signature2(k)) count = count + 1 } val jSimilarity = count/numHashes.toDouble

Spark Jaccard similarity calculation slow slow compared to trivial approach

More articles: