How to cross two sorted arrays as quickly as possible?

I have two huge sorted arrays (~ 100 thousand elements each). I need to cross them very quickly . Now I do it in the standard way:

  • if a [i] b [j] then i ++
  • if a [i]> b [j], then j ++
  • else: add [i] to the intersection, i ++, j ++

But it takes too much time to complete (~ 350 microseconds), which leads to a rather low overall performance. Is there any way to do this faster?

PS The size of the intersection is not more than 1000 objects (on average), and I need only 25 to 100 of them.

+5
source share
3 answers

For parallel execution of 2 100k arrays, about 200 thousand comparisons are required. You are currently completing it in 350 microseconds = 350 thousand nanoseconds. Thus, your comparison time is less than 2 nanoseconds. If your processor is around 4 GHz, then this is 8 clock cycles.

It's good. You can try to be complicated, detect runs, etc., but you will probably damage your conveyor racks more than you save.

There are only 2 ways to speed it up. Do less work or add more workers.

You indicated that doing less work is possible, so Tamas Hegedus suggested this. Instead of creating an intersection, create an Iterator that will return the next thing in the intersection. This will require you to rewrite the logic using the specified iterator, but you will do less than 10% of the current calculation. Which will be closer to 10 times faster.

As for adding workers, you want to split the work between workflows and prevent them from stepping with each other. For k small (no more than your number of processors!), With a logarithmic amount of work in the size of your arrays, you can quickly select the k-1 values ​​that break the combined arrays into k exactly ( oops Adapt http://www.geeksforgeeks.org/ median-of-two-sorted-arrays / instead of performing quickselect ...) and the indices of these values ​​in each array. This creates problems k with even difficulties, each of which can be indicated as 4 numbers. Spin up k threads and let everyone get a piece of the answer. It will be about k times faster than what you are doing now.

Through much greater efforts, these approaches can be combined. What you do is that the iterator creates, say, 4 workers and distributes the blocks to each. When you call iter.next() , the iterator will pass you the next value that it has, if any. If he doesn’t have it, he will wait until the worker completing his next block completes, take this block, transfer this block to another worker if he is ready, and then pass the first value in this block. You can play with block size. You want it to be large enough so that the processor understands well that it must be streamed from RAM to the processor caches, and does not think that there is a synchronization conflict between the threads.

My guess, given the size and timing constraints, the hybrid approach will not be a big win, if any, over the iterator. But if you are really desperate, you can try.

+2
source

I am posting a naive implementation of the problem / solution: 2 arrays filled with random ints. If a threshold of 100 overlapping values ​​is reached, the loops break.

One loop using OP logic. The other starts two threads, each of which processes one half of the array.

It seems that the problem with threads may be the problem. Or, fine tuning may be required.

This is a sample of 20 copies. Worst case scenario: there is no intersection that makes you run to the end of the arrays. Time is in microseconds.

 Workers: 2806 Workers: 4197 Workers: 4235 Workers: 818 Workers: 729 Workers: 3376 Workers: 740 Workers: 688 Workers: 2245 Workers: 732 Workers: 330 Workers: 945 Workers: 605 Workers: 630 Workers: 630 Workers: 334 Workers: 643 Workers: 309 Workers: 290 Workers: 761 done Sorted: 1525 Sorted: 405 Sorted: 550 Sorted: 880 Sorted: 265 Sorted: 267 Sorted: 252 Sorted: 310 Sorted: 253 Sorted: 272 Sorted: 285 Sorted: 270 Sorted: 270 Sorted: 315 Sorted: 267 Sorted: 269 Sorted: 265 Sorted: 258 Sorted: 269 Sorted: 289 done package so; import java.util.Arrays; import java.util.HashSet; import java.util.Random; import java.util.Set; import java.util.concurrent.TimeUnit; public final class CrazyClass { static class Feeder implements Runnable{ final int b, e; int[] k1001; int[] k1002; final Set<Integer> setThis; Feeder(int[] ia, int[] ia1, int be, int en, Set<Integer> s){ k1001 = ia; k1002= ia1; b = be; e = en; setThis = s; } public void run() { int i2 = b; for(int i1 = b; i1 < e; i1++){ if (k1001[i1] == k1002[i2]){ synchronized(setThis){ setThis.add(k1001[i1]); if (setThis.size() == 25){ System.out.println("bye!!!"); break; } } } else if (k1001[i1] < k1002[i2]) i1++; else if (k1001[i1] > k1002[i2]) i2++; } } } static void sorted(){ int i1 = 0, i2 = 0; Set<Integer> result = new HashSet<Integer>(); Random r = new Random(); int[] k1001 = new int[100000]; int[] k1002 = new int[100000]; for(int i = 0; i< k1001.length; i++){ k1001[i] = r.nextInt(); k1002[i] = r.nextInt(); } Arrays.sort(k1001); Arrays.sort(k1002); long l = System.nanoTime(); for(; i1 < k1001.length; i1++){ if (k1001[i1] == k1002[i2]){ result.add(k1001[i1]); if (result.size() == 100){ System.out.println("bye!!!"); break; } } else if (k1001[i1] < k1002[i2]) i1++; else if (k1001[i1] > k1002[i2]) i2++; } l = System.nanoTime() - l; System.out.println("Sorted: " + TimeUnit.MICROSECONDS.convert(l, TimeUnit.NANOSECONDS)); } static void workers(){ Thread t1, t2; Set<Integer> setThis = new HashSet<Integer>(); Random r = new Random(); int[] k1001 = new int[100000]; int[] k1002 = new int[100000]; for(int i = 0; i< k1001.length; i++){ k1001[i] = r.nextInt(); k1002[i] = r.nextInt(); } t1 = new Thread(new Feeder(k1001, k1002, 0, 49999, setThis)); t2 = new Thread(new Feeder(k1001, k1002, 50000, 99999, setThis)); try{ long l = System.nanoTime(); t1.start(); t2.start(); t1.join(); t2.join(); System.out.println("Workers: " + TimeUnit.MICROSECONDS.convert(System.nanoTime() - l, TimeUnit.NANOSECONDS)); }catch(Exception x){ } } static public void main(String[] args){ int run = 20; for(int i = 0; i < run; i++) workers(); System.out.println("done"); for(int i = 0; i < run; i++) sorted(); System.out.println("done"); } } 
+1
source

Below code runs about 10 milliseconds for me. Therefore, I think you are either processing strings, or in a scripting language.

 package com.example.so.algorithms; import java.util.Arrays; import java.util.Random; /** * <p> http://stackoverflow.com/questions/42538902/how-to-intersect-two-sorted-arrays-the-fastest-possible-way#comment72213844_42538902 </p> * <p> Given two sorted sub-lists of 100k each determine the first 10 intersecting (common) entries within 350 millis </p> * @author Ravindra * @since 03March2017 * */ public class TestMergeIntersection { /** * <pre> Time (millis):9 Result :[442958664, 932132404, 988442487, 1356502780, 1614742980, 1923995812, 1985016181, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] </pre> * @param args */ public static void main(String[] args) { handleTest(); } private static void handleTest() { int size = 1024*128; int intersectionCount = 100; int[] arrayOne = generateSortedSublist(size); int[] arrayTwo = generateSortedSublist(size); int[] result = new int[intersectionCount]; int count = 0; int i=0; int j=0; long start = System.currentTimeMillis(); while(count < 100 && i < size && j < size ) { if( arrayOne[i] < arrayTwo[j]) { i++; } else if( arrayOne[i] > arrayTwo[j] ) { j++; } else { result[count] =arrayOne[i]; i++; j++; count++; } } long end = System.currentTimeMillis(); System.out.println("Time (millis):"+(end-start)); System.out.println("Result :"+Arrays.toString(result)); } private static int[] generateSortedSublist(int size) { Random random = new Random(); int[] result = new int[size]; for(int i=0;i<result.length;i++) { result[i] = random.nextInt(Integer.MAX_VALUE); } Arrays.sort(result); return result; } } 
0
source

Source: https://habr.com/ru/post/1264897/


All Articles