Combining huge sets (HashSets) in Scala

Question

Combining huge sets (HashSets) in Scala

I have two huge (as in millions of records) sets (HashSet) that have (<10%) overlap between them. I need to combine them into one set (I don't care about saving the original sets).

I am currently adding all the elements of one set to another:

setOne ++= setTwo

This will take several minutes (after several attempts to configure the hash code () for members).

Any ideas on how to speed things up?

+6

collections scala scala-collections

Alexandros Aug 3 '11 at 11:17

source share

3 answers

There are a few things you can try:

Use the sizeHint method to keep your sets at the expected size.
Call useSizeMap(true) on it to resize the hash table table.

It seems to me that the latter option gives better results, although both show improvements in the tests here.

+2

Daniel C. Sobral Aug 3 '11 at 13:58

source share

Can you tell me a little more about the data inside the sets? The reason I ask is because for this kind of thing, you usually want something a little specialized. Here are a few things you can do:

If the data is sorted (or may be), you can use pointers to merge, similar to how it is done using merge sort. This operation is pretty trivially parallelizable, since you can split one data set and then split the second data set using binary search to find the correct border.
If the data is in a specific numerical range, you can use bit set instead and just set the bit when you come across this number.
If one of the data sets is smaller than the other, you can quickly set it to a hash set and loop over the other data set by checking to see if it is containment.

I used the first strategy to create a giant set of about 8 million integers with about 40 thousand smaller sets in about a second (on hardware, in Scala).

0

Joshua hartman Aug 3 '11 at 16:09

source share

Vasil Remeniuk · Accepted Answer · 2011-08-03T13:13:24+0000

You can get slightly better performance with the parallel collection API in Scala 2.9.0 +:

 setOne.par ++ setTwo

or

 (setOne.par /: setTwo)(_ + _)

Combining huge sets (HashSets) in Scala

More articles: