Combining huge sets (HashSets) in Scala

I have two huge (as in millions of records) sets (HashSet) that have (<10%) overlap between them. I need to combine them into one set (I don't care about saving the original sets).

I am currently adding all the elements of one set to another:

setOne ++= setTwo 

This will take several minutes (after several attempts to configure the hash code () for members).

Any ideas on how to speed things up?

+6
source share
3 answers

You can get slightly better performance with the parallel collection API in Scala 2.9.0 +:

 setOne.par ++ setTwo 

or

 (setOne.par /: setTwo)(_ + _) 
+5
source

There are a few things you can try:

  • Use the sizeHint method to keep your sets at the expected size.
  • Call useSizeMap(true) on it to resize the hash table table.

It seems to me that the latter option gives better results, although both show improvements in the tests here.

+2
source

Can you tell me a little more about the data inside the sets? The reason I ask is because for this kind of thing, you usually want something a little specialized. Here are a few things you can do:

  • If the data is sorted (or may be), you can use pointers to merge, similar to how it is done using merge sort. This operation is pretty trivially parallelizable, since you can split one data set and then split the second data set using binary search to find the correct border.
  • If the data is in a specific numerical range, you can use bit set instead and just set the bit when you come across this number.
  • If one of the data sets is smaller than the other, you can quickly set it to a hash set and loop over the other data set by checking to see if it is containment.

I used the first strategy to create a giant set of about 8 million integers with about 40 thousand smaller sets in about a second (on hardware, in Scala).

0
source

Source: https://habr.com/ru/post/894182/


All Articles