Can you tell me a little more about the data in the sets? I ask because for this kind of thing you usually want something a little specialized. Here are a few things you can do:
- If the data is sorted (or can be), you can merge with two pointers, the same way the merge step of merge sort works. This is fairly easy to parallelize: split one data set into chunks, then use binary search to find the matching boundaries in the second data set.
- If the data falls within a known numerical range, you can use a bit set instead and set the corresponding bit for each number you come across.
- If one of the data sets is much smaller than the other, you can put it in a hash set and loop over the other data set, checking each element for containment.
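
To make the first option concrete, here is a minimal sketch in Python. It assumes both inputs are sorted lists without duplicates and that the goal is a union; the names `merge_union` and `split_for_parallel` are made up for this illustration. The chunk pairs are merged one after another here, but each pair is independent and could be handed to a separate worker.

```python
from bisect import bisect_left


def merge_union(a, b):
    """Union of two sorted, duplicate-free lists using two pointers."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            out.append(b[j])
            j += 1
        else:
            # Same value in both inputs: keep one copy, advance both.
            out.append(a[i])
            i += 1
            j += 1
    # At most one of these tails is non-empty.
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def split_for_parallel(a, b, parts):
    """Cut a into roughly equal chunks and cut b at the same border values."""
    if not a:
        return [(a, b)]
    cuts_a = [len(a) * k // parts for k in range(parts + 1)]
    cuts_b = [0]
    for k in range(1, parts):
        border = a[cuts_a[k]]
        # Binary search: first position in b whose value is >= border.
        cuts_b.append(bisect_left(b, border))
    cuts_b.append(len(b))
    return [
        (a[cuts_a[k]:cuts_a[k + 1]], b[cuts_b[k]:cuts_b[k + 1]])
        for k in range(parts)
    ]


a = [1, 3, 5, 7, 9, 11]
b = [2, 3, 4, 10, 11, 12]

# Every value in one chunk pair is smaller than every value in the next,
# so the per-chunk results can simply be concatenated.
chunked = []
for part_a, part_b in split_for_parallel(a, b, 3):
    chunked.extend(merge_union(part_a, part_b))
```

Merging chunk by chunk gives the same list as a single `merge_union(a, b)` call, which is what makes the split safe.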
I used the first strategy to build a giant set of about 8 million integers from about 40 thousand smaller sets in about a second (on hardware, in Scala).
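
The bit set and hash set options from the list can be sketched roughly like this, again in Python. Both functions are hypothetical illustrations: `union_bitset` assumes integers in a known half-open range `[lo, hi)`, and `intersect_small_large` assumes the operation you want is an intersection (for a union you would keep the elements that are *not* contained instead).

```python
def union_bitset(sets, lo, hi):
    """Union of integer collections whose values all lie in [lo, hi)."""
    bits = bytearray((hi - lo + 7) // 8)  # one bit per possible value
    for s in sets:
        for x in s:
            k = x - lo
            bits[k >> 3] |= 1 << (k & 7)  # mark value x as seen
    # Reading the bits back in order yields a sorted, duplicate-free result.
    return [lo + k for k in range(hi - lo) if bits[k >> 3] & (1 << (k & 7))]


def intersect_small_large(small, large):
    """Hash the smaller collection, then scan the larger one once."""
    lookup = set(small)
    return [x for x in large if x in lookup]


merged = union_bitset([[3, 1], [7, 3], [0]], 0, 8)
common = intersect_small_large([5, 2], [1, 2, 3, 4, 5, 6])
```

The bit set costs memory proportional to the size of the range rather than the number of elements, so it only pays off when the range is reasonably dense. The hash set version does one lookup per element of the larger data set.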