How to intersect two sets of Longs that do not fit into memory?

I have an interesting problem. I want to intersect two Sets of Longs in Java, each set having 1B members - that is 4 GB per set. This will not fit in memory on the server where I need to run it.

I am wondering what interesting methods exist to solve this problem.

What I have come up with so far is reading subsets of each source set from disk that are small enough to fit into memory, processing each subset, and temporarily writing the results back to disk. Finally, I could go through and intersect these subsets. I have a feeling this may turn into a map-reduce job.

Perhaps you will have better ideas :) I doubt that I am the first person to come across this problem.

+4
9 answers
  • Sort both sets A and B separately.

  • Take and remove the first element from set A and the first element from set B

  • If they are equal, add to the result set.

  • If one element is larger, take the next element from the set whose current element is smaller.

  • Repeat from step 2 until you reach the end of either set.

The advantage of this method is that you never hold more than two longs in memory (apart from the sorting). The sorting itself can be performed efficiently on disk (merge sort).
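In Java, the merge step above can be sketched as a streaming pass over two sorted files of raw 8-byte longs (a hypothetical on-disk format written with `DataOutputStream.writeLong`); only the two current values are ever held in memory:

```java
import java.io.*;

// Streaming intersection of two streams of sorted raw 8-byte longs.
// Only the two current values are ever held in memory.
public class SortedIntersect {
    public static void intersect(DataInputStream a, DataInputStream b,
                                 DataOutputStream out) throws IOException {
        try {
            long x = a.readLong();
            long y = b.readLong();
            while (true) {
                if (x == y) {            // common element: emit, advance both
                    out.writeLong(x);
                    x = a.readLong();
                    y = b.readLong();
                } else if (x < y) {      // advance the side with the smaller value
                    x = a.readLong();
                } else {
                    y = b.readLong();
                }
            }
        } catch (EOFException e) {
            // either input exhausted: no further matches are possible
        }
    }
}
```

Wrapping the file streams in `BufferedInputStream`/`BufferedOutputStream` keeps the disk access sequential.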

+6

Do a disk-based sort on both sets.

After that, you can simply scan both files sequentially and write the intersection to a new file.

+2

Here is what I think can be done. The sets obviously have to live on disk, and you must sort them.

  • Sort them using external sorting
  • Compare step by step:

     loop:
         if a.length < 1 or b.length < 1
             exit
         else if a[0] == b[0]
             addToIntersectionSet(a[0])
             remove a[0] from a
             remove b[0] from b
         else if a[0] < b[0]
             remove a[0]
         else
             remove b[0]
+2

Initialize two large bitmaps, bitmap1 and bitmap2, to zero, in memory if available, otherwise on disk. For each value in set1, set the value-th bit of bitmap1 to 1. For each value in set2, read the value-th bit of bitmap1; if it is 1, set the value-th bit of bitmap2 to 1. Finally, for each bit set in bitmap2, output the value of its position.

Edit: as Jessop points out in the answers below, there is a flaw: these are Java 64-bit (8-byte) longs, not the 4-byte ints of a 32-bit C compiler. This solution is not practical for 64-bit longs.
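For values that do fit a non-negative int range, the bitmap idea can be sketched with `java.util.BitSet`. That `BitSet` is indexed by int underlines the flaw: a bitmap over the full 64-bit long range is not practical.

```java
import java.util.BitSet;

// Bitmap intersection for values that fit a non-negative int range.
public class BitmapIntersect {
    public static BitSet intersect(int[] set1, int[] set2) {
        BitSet bitmap1 = new BitSet();
        for (int v : set1) bitmap1.set(v);          // mark every member of set1
        BitSet bitmap2 = new BitSet();
        for (int v : set2)
            if (bitmap1.get(v)) bitmap2.set(v);     // present in both sets
        return bitmap2;                             // set bits are the intersection
    }
}
```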

+2

This looks like a map-reduce job, but you have to be very careful when choosing the subsets. If you want your intersection to work, the subsets must be cut from the original sets at the same points. For example, suppose you have the sets

 A = {1 3 7 9}
 B = {2 7 8 9}

And you cut them into two subsets:

 A1 = {1 3}
 A2 = {7 9}
 B1 = {2 7}
 B2 = {8 9}

Then you intersect:

 C1 = A1 inter B1 = {}
 C2 = A2 inter B2 = {9}

Then you assume that:

 C = A inter B = C1 union C2 = {9} 

This is obviously wrong. For your map-reduce to work, you must cut the sets at a common value; for example, A1 and B1 would hold the values < 5, and A2 and B2 the values >= 5.
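A tiny sketch of the "cut at a common value" rule, using a hypothetical partitionOf helper: both sets use the SAME sorted pivot array, so equal elements always land in the same partition pair.

```java
// With pivots {5}, A = {1 3 7 9} splits into {1 3} | {7 9} and
// B = {2 7 8 9} into {2} | {7 8 9}; the per-partition intersections
// then correctly union to {7 9}.
public class RangePartition {
    public static int partitionOf(long value, long[] pivots) {
        int i = 0;
        while (i < pivots.length && value >= pivots[i]) i++;
        return i;   // partition index in [0, pivots.length]
    }
}
```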

You can also fetch equal-sized parts of sets A and B from disk and intersect them in a smart way: scan the sorted elements in step, and stop when you reach the end of one of the two subsets; at that point, fetch the next part of that subset.

+1

One obvious approach is to start with both sets sorted on disk. Then you can read from both files at the same time, find the matches, and write them out. Say you read 204 from one file; you then read from the other until you reach the first number >= 204. At that point you know whether 204 belongs to the intersection.

+1

The first step is to sort each set: sort blocks that fit in memory, then merge the sorted runs to produce fully sorted files.

After sorting, you can go through both sets.
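A minimal external merge sort along these lines, assuming the set is stored as raw 8-byte longs; run-file names and the run size are illustrative:

```java
import java.io.*;
import java.util.*;

// External merge sort over a stream of raw 8-byte longs:
// sort runs that fit in memory, write each run to a temporary file,
// then k-way merge the runs with a priority queue.
public class ExternalSort {

    // Phase 1: cut the input into sorted runs of at most runSize longs.
    public static List<File> makeRuns(DataInputStream in, int runSize) throws IOException {
        List<File> runs = new ArrayList<>();
        long[] buf = new long[runSize];
        boolean eof = false;
        while (!eof) {
            int n = 0;
            try {
                while (n < runSize) { long v = in.readLong(); buf[n] = v; n++; }
            } catch (EOFException e) { eof = true; }
            if (n == 0) break;
            long[] run = Arrays.copyOf(buf, n);
            Arrays.sort(run);                         // in-memory sort of one run
            File f = File.createTempFile("run", ".bin");
            f.deleteOnExit();
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(f)))) {
                for (long v : run) out.writeLong(v);
            }
            runs.add(f);
        }
        return runs;
    }

    // Phase 2: merge all runs; the queue holds {value, runIndex} pairs.
    public static void merge(List<File> runs, DataOutputStream out) throws IOException {
        PriorityQueue<long[]> pq = new PriorityQueue<>(Comparator.comparingLong(e -> e[0]));
        List<DataInputStream> ins = new ArrayList<>();
        for (File f : runs)
            ins.add(new DataInputStream(new BufferedInputStream(new FileInputStream(f))));
        for (int i = 0; i < ins.size(); i++) pushNext(ins.get(i), i, pq);
        while (!pq.isEmpty()) {
            long[] head = pq.poll();
            out.writeLong(head[0]);                   // smallest remaining value
            pushNext(ins.get((int) head[1]), (int) head[1], pq);
        }
        for (DataInputStream in : ins) in.close();
    }

    private static void pushNext(DataInputStream in, int idx, PriorityQueue<long[]> pq)
            throws IOException {
        try { pq.offer(new long[]{in.readLong(), idx}); } catch (EOFException e) { }
    }
}
```

With both files sorted this way, the parallel scan from the other answers completes the intersection.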

+1

You can easily have 512 files open, so you can partition both sets into 256 chunks each on disk, with at most 16M items per chunk and at most 64 MB per file. You can split on the most significant byte of each Long, giving files set.A.00 through set.B.ff.

Then you load each corresponding pair of chunks (set.A.42, containing the Longs starting with 0x42, corresponds to set.B.42) and use them to fill a 16M-entry byte array: you initialize it to all 0s, and when you read the value i from the first chunk, you increment the i-th entry. Then you read the second chunk, but this time increment by 2.

When done, you scan the array: 0 means the value was present in neither chunk, 1 that it was present only in the first set, 2 only in the second, and 3 in both. Write the values marked 3 to the result file.

You need no sorting at all, and yet the result file comes out sorted (since you process the chunk pairs in order and the final scan is in index order).

All of this runs in O(n) (every step is linear), and it requires no more than 16 MB of RAM.
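A sketch of the per-chunk marking pass. The sizes in this answer (16M items, 64 MB chunks) imply 4-byte values, so the sketch assumes 32-bit values whose low 24 bits index the 16M-entry array; it marks with bitwise OR (1 for set A, 2 for set B), which is equivalent to the increments described above since sets contain no duplicates.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkMark {
    static final int CHUNK_SPACE = 1 << 24;  // 16M possible low-24-bit values per chunk

    // One pair of corresponding chunks: all values share the same top byte,
    // so the low 24 bits index the 16M-entry byte array.
    public static List<Integer> intersectChunk(int[] chunkA, int[] chunkB) {
        List<Integer> result = new ArrayList<>();
        if (chunkA.length == 0 || chunkB.length == 0) return result;
        byte[] marks = new byte[CHUNK_SPACE];
        for (int v : chunkA) marks[v & 0xFFFFFF] |= 1;   // present in set A
        for (int v : chunkB) marks[v & 0xFFFFFF] |= 2;   // present in set B
        int topByte = chunkA[0] & 0xFF000000;            // shared most significant byte
        for (int i = 0; i < CHUNK_SPACE; i++)
            if (marks[i] == 3) result.add(topByte | i);  // 3 = in both sets
        return result;
    }
}
```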

If 512 open files are too many, you can split on the 7 most significant bits, using 256 open files and 32 MB of RAM. Or 128 open files and 64 MB of RAM, and so on.

It is also possible (and perhaps more efficient, if the file system cache is not very good) to keep an array of 256 "buckets", each holding 16384 Longs (so again 16 MB). When a bucket fills up, you open its chunk file on disk, append the 16384 Longs found so far, and close the file. Then do the same for set B. You end up with 512 files containing from 0 (unlikely) to 16M Longs each, never having more than two files open at a time.

+1

Perhaps more efficient than sorting is to use hashing: split the data into multiple buckets, then intersect each bucket. The idea is to break the problem into subproblems that fit in memory, so that each intersection can be performed efficiently in RAM.

Say you want to find the intersection of R and S:

 for each element in R:
     write element in bucket_R[hash(element) % NUM_BUCKETS]

 for each element in S:
     write element in bucket_S[hash(element) % NUM_BUCKETS]

 // assuming each bucket from bucket_S or bucket_R now fits in memory - proceed.
 // if it doesn't, you can repeat the previous step with a new hash function.

 for each i from 0 to NUM_BUCKETS:
     find bucket_R[i] INTERSECTION bucket_S[i]

IMPORTANT:
bucket_S and bucket_R are on DISK, not in RAM.
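A compact Java sketch of the bucket scheme; here in-memory lists stand in for the on-disk bucket files, and NUM_BUCKETS is illustrative (in practice it is sized so each bucket fits in RAM).

```java
import java.util.*;

// Hash-partition intersection: distribute both sets into buckets by the same
// hash, then intersect each bucket pair entirely in memory.
public class HashIntersect {
    static final int NUM_BUCKETS = 4;   // illustrative; size so buckets fit RAM

    public static Set<Long> intersect(long[] r, long[] s) {
        List<List<Long>> bucketR = new ArrayList<>(), bucketS = new ArrayList<>();
        for (int i = 0; i < NUM_BUCKETS; i++) {
            bucketR.add(new ArrayList<>());
            bucketS.add(new ArrayList<>());
        }
        for (long e : r) bucketR.get(bucket(e)).add(e);   // "write to bucket file"
        for (long e : s) bucketS.get(bucket(e)).add(e);
        Set<Long> result = new HashSet<>();
        for (int i = 0; i < NUM_BUCKETS; i++) {           // each pair fits in RAM
            Set<Long> inRam = new HashSet<>(bucketR.get(i));
            for (long e : bucketS.get(i))
                if (inRam.contains(e)) result.add(e);
        }
        return result;
    }

    static int bucket(long e) {
        return Math.floorMod(Long.hashCode(e), NUM_BUCKETS);
    }
}
```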

Number of disk accesses:

The total number of disk element accesses with this approach is 3 * (|R|+|S|):

  • reading each element of R and S in the first two loops
  • writing each element to its bucket
  • reading all the buckets back

Any sort-based algorithm, by contrast, will most likely need more than one read and one write per element (plus extra passes over the data), which comes to much MORE than 3 * (|R|+|S|).


PS I am currently studying for a "File Systems" exam (this Monday), and the lectures say that this is the preferred solution in most database systems when you have a single disk.

+1

Source: https://habr.com/ru/post/1440995/

