It seems to me that the overhead of the minHashing approach simply outweighs its benefit in Spark, especially as numHashes increases. Here are some observations on your code:
Firstly, the while (randList.contains(randIndex)) loop will certainly slow things down as numHashes (which, by the way, equals the size of randList) increases, since every contains call has to scan the values drawn so far.
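One way to avoid the linear scan is to keep the values drawn so far in a hash set, where a membership check takes constant time on average. Here is a minimal sketch, assuming the coefficients are drawn uniformly from 0 to maxVal inclusive. The names MinHashCoeffs and pickRandomCoeffs and their parameters are made up for illustration and are not taken from your code:

```scala
import scala.collection.mutable
import scala.util.Random

object MinHashCoeffs {
  // Hypothetical helper: draw k distinct random coefficients in [0, maxVal].
  // A LinkedHashSet gives average O(1) duplicate checks and keeps insertion
  // order, so no scan over previously drawn values is needed.
  def pickRandomCoeffs(k: Int, maxVal: Int, rng: Random = new Random()): Array[Int] = {
    require(maxVal >= 0 && maxVal < Int.MaxValue, "maxVal out of supported range")
    require(k >= 0 && k.toLong <= maxVal.toLong + 1, "not enough distinct values to draw from")
    val seen = mutable.LinkedHashSet.empty[Int]
    while (seen.size < k) {
      // Adding a value that is already present leaves the set unchanged,
      // so the loop simply retries until k distinct values exist.
      seen += rng.nextInt(maxVal + 1)
    }
    seen.toArray
  }
}
```

Calling MinHashCoeffs.pickRandomCoeffs(numHashes, maxVal) once for coeffA and once for coeffB would replace the contains-based loop, assuming both are plain Int arrays.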
Secondly, you can save some time by rewriting this code:
    var signature1 = Array.fill(numHashes){0}
    for (i <- 0 to numHashes-1) {
      // Evaluate the hash function.
      val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
      // Track the lowest hash code seen.
      signature1(i) = hashCodeRDD.min.toInt
    }

    var signature2 = Array.fill(numHashes){0}
    for (i <- 0 to numHashes-1) {
      // Evaluate the hash function.
      val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
      // Track the lowest hash code seen.
      signature2(i) = hashCodeRDD.min.toInt
    }

    var count = 0
    // Count the number of positions in the minhash signature which are equal.
    for (k <- 0 to numHashes-1) {
      if (signature1(k) == signature2(k))
        count = count + 1
    }
as:
    var count = 0
    for (i <- 0 to numHashes - 1) {
      val hashCodeRDD1 = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
      val hashCodeRDD2 = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
      val sig1 = hashCodeRDD1.min.toInt
      val sig2 = hashCodeRDD2.min.toInt
      if (sig1 == sig2) {
        count = count + 1
      }
    }
This merges the three loops into one. However, I am not sure it will improve the running time much, since each iteration still runs two separate min computations over the data.
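If the repeated min calls are what dominates, one option is to compute all numHashes minima in a single pass over each column. The sketch below shows the idea on plain Scala collections so that it runs without a cluster. SinglePassMinHash, signature and similarity are illustrative names, and I am assuming the hashed elements and coefficients fit in Long without overflow. On an RDD the same fold step could in principle be handed to aggregate together with an element-wise minimum as the combine function, but I have not benchmarked that:

```scala
object SinglePassMinHash {
  // Compute the whole minhash signature in one traversal of `col`:
  // for every element, update the running minimum of each hash function.
  // An empty column leaves every slot at Long.MaxValue.
  def signature(col: Seq[Long], coeffA: Array[Long], coeffB: Array[Long], prime: Long): Array[Long] = {
    require(coeffA.length == coeffB.length, "coefficient arrays must have equal length")
    val init = Array.fill(coeffA.length)(Long.MaxValue)
    col.foldLeft(init) { (mins, ele) =>
      var i = 0
      while (i < mins.length) {
        val h = (coeffA(i) * ele + coeffB(i)) % prime
        if (h < mins(i)) mins(i) = h
        i += 1
      }
      mins
    }
  }

  // Fraction of signature positions on which the two columns agree.
  def similarity(sig1: Array[Long], sig2: Array[Long]): Double = {
    require(sig1.length == sig2.length && sig1.nonEmpty, "signatures must be non-empty and equal in length")
    val matches = sig1.indices.count(i => sig1(i) == sig2(i))
    matches.toDouble / sig1.length
  }
}
```

This trades numHashes passes over the data for one pass that does numHashes hash evaluations per element, so the arithmetic is the same and only the number of traversals changes.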
Another suggestion, assuming that the first approach is still much faster, is to use a property of sets to simplify the first approach:
    val colHashed1_dist = colHashed1.distinct
    val colHashed2_dist = colHashed2.distinct
    val intersect_cnt = colHashed1_dist.intersection(colHashed2_dist).distinct.count
    val jSimilarity = intersect_cnt/(colHashed1_dist.count + colHashed2_dist.count - intersect_cnt).toDouble
With this, instead of computing the union you simply reuse the intersection count: the size of the union is the sum of the two distinct counts minus the size of the intersection.
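To sanity-check that formula on something small, here is a sketch on plain in-memory Scala sets rather than RDDs. JaccardIdentity and jaccard are illustrative names, and returning 0.0 for two empty sets is my own convention, not something your code defines:

```scala
object JaccardIdentity {
  // Jaccard similarity via |A ∪ B| = |A| + |B| - |A ∩ B|,
  // so the union itself is never built.
  def jaccard[T](a: Set[T], b: Set[T]): Double = {
    val inter = a.intersect(b).size
    val unionSize = a.size + b.size - inter
    // Assumed convention: two empty sets have similarity 0.0.
    if (unionSize == 0) 0.0 else inter.toDouble / unionSize
  }
}
```

For example, with A = {1, 2, 3, 4} and B = {3, 4, 5, 6} the intersection has 2 elements and the union has 4 + 4 - 2 = 6, so the similarity is 2/6.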