For parallel execution of 2 100k arrays, about 200 thousand comparisons are required. You are currently completing it in 350 microseconds = 350 thousand nanoseconds. Thus, your comparison time is less than 2 nanoseconds. If your processor is around 4 GHz, then this is 8 clock cycles.
It's good. You can try to be complicated, detect runs, etc., but you will probably damage your conveyor racks more than you save.
There are only 2 ways to speed it up. Do less work or add more workers.
You indicated that doing less work is possible, so Tamas Hegedus suggested this. Instead of creating an intersection, create an Iterator that will return the next thing in the intersection. This will require you to rewrite the logic using the specified iterator, but you will do less than 10% of the current calculation. Which will be closer to 10 times faster.
As for adding workers, you want to split the work between workflows and prevent them from stepping with each other. For k small (no more than your number of processors!), With a logarithmic amount of work in the size of your arrays, you can quickly select the k-1 values that break the combined arrays into k exactly ( oops Adapt http://www.geeksforgeeks.org/ median-of-two-sorted-arrays / instead of performing quickselect ...) and the indices of these values in each array. This creates problems k with even difficulties, each of which can be indicated as 4 numbers. Spin up k threads and let everyone get a piece of the answer. It will be about k times faster than what you are doing now.
Through much greater efforts, these approaches can be combined. What you do is that the iterator creates, say, 4 workers and distributes the blocks to each. When you call iter.next() , the iterator will pass you the next value that it has, if any. If he doesn’t have it, he will wait until the worker completing his next block completes, take this block, transfer this block to another worker if he is ready, and then pass the first value in this block. You can play with block size. You want it to be large enough so that the processor understands well that it must be streamed from RAM to the processor caches, and does not think that there is a synchronization conflict between the threads.
My guess, given the size and timing constraints, the hybrid approach will not be a big win, if any, over the iterator. But if you are really desperate, you can try.