Is there a more efficient way to reconcile large data sets?

I was tasked with reconciling two large data sets (two large lists of transactions). Basically, I extract the relevant fields from the two data sources into two files of the same format, then compare the files to find records that are in A but not in B, or vice versa, and report them. I wrote a blog entry about my best attempts at achieving this (click through if interested).

The essence of it is to load both sets of data into one big hash table, with the record strings as keys; the value is incremented (+1) each time a record appears in file A and decremented (-1) each time it appears in file B. Then, at the end, I look for any key/value pairs where the value != 0.
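
In essence it looks something like this (a stripped-down Python sketch of the idea rather than my actual code; the file names are placeholders):

```python
from collections import defaultdict

def reconcile(path_a, path_b):
    # +1 for every appearance in A, -1 for every appearance in B.
    counts = defaultdict(int)
    with open(path_a) as f:
        for line in f:
            counts[line.rstrip("\n")] += 1
    with open(path_b) as f:
        for line in f:
            counts[line.rstrip("\n")] -= 1
    # Non-zero counts are records present in one file but not the other
    # (or present a different number of times in each).
    return {rec: n for rec, n in counts.items() if n != 0}

for rec, n in reconcile("a.txt", "b.txt").items():
    print("A only:" if n > 0 else "B only:", rec)
```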

My algorithm seems fast enough (10 seconds for 2 x 100 MB files), but it's a bit memory-intensive: 280 MB to compare two 100 MB files. I'd like to get that down to 100 MB of memory, and possibly lower if the two data sets are sorted in roughly the same order.

Any ideas?

Also, let me know if this is too open-ended for SO.

+3
4 answers

I've done something similar to this, though only in unix scripts with shell and perl; the theory carries over all the same.

1. Sort both files the same way, on the same key. Unix sort is good for this (it's fast, and it can sort files much larger than available memory by spilling to disk). Once sorted, matching records end up in the same relative order in both files.

2. Walk the two sorted files in parallel, one record at a time from each (a simple merge, sketched below).

If the two current records are equal, the record exists in both files, so advance both. If the record from A sorts before the record from B, it exists only in A, so report it and advance A. If the record from B sorts first, it exists only in B, so report it and advance B.

That way you only ever hold one record from each file in memory, so memory use stays essentially constant no matter how large the files get.
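
Roughly like this for the merge step (Python rather than perl, just to show the idea; the file names stand in for the sorted output of step 1):

```python
def diff_sorted(path_a, path_b):
    # Walk two files that were sorted on the same key, holding only
    # one line from each file in memory at any moment.
    with open(path_a) as a, open(path_b) as b:
        line_a, line_b = next(a, None), next(b, None)
        while line_a is not None or line_b is not None:
            if line_b is None or (line_a is not None and line_a < line_b):
                print("A only:", line_a.rstrip("\n"))
                line_a = next(a, None)
            elif line_a is None or line_b < line_a:
                print("B only:", line_b.rstrip("\n"))
                line_b = next(b, None)
            else:  # equal keys: record is in both files, advance both
                line_a, line_b = next(a, None), next(b, None)

diff_sorted("a.sorted", "b.sorted")
```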

+2

If the two data sets really do come out in roughly the same order, you can compare them as streams and keep only a small window of records in memory at once; at that point the job is bound by IO rather than by memory, and reading two files of this size sequentially is quick.

+1

Out of curiosity: where does the 280 MB actually go when the two input files are only 100 MB each? Is most of it per-entry overhead of the hash table (object headers, pointers, copies of the key strings) rather than the data itself?

If so, a more compact representation of the keys would get you a long way toward your 100 MB target without changing the algorithm at all.
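
For instance, one option is to key the table on a short digest of each record instead of the full string. A rough sketch (placeholder file names; with 64-bit digests there is a tiny collision risk, and you'd need a second pass over the raw files to print the actual mismatched records):

```python
import hashlib
from collections import defaultdict

def digest(record):
    # 8-byte digest instead of the full record string.
    return hashlib.blake2b(record.encode(), digest_size=8).digest()

counts = defaultdict(int)
with open("a.txt") as f:
    for line in f:
        counts[digest(line.rstrip("\n"))] += 1
with open("b.txt") as f:
    for line in f:
        counts[digest(line.rstrip("\n"))] -= 1

mismatched = {k for k, n in counts.items() if n != 0}
print(len(mismatched), "records differ between the files")
```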

+1


Source: https://habr.com/ru/post/1712170/

