Determine the largest suffix of some file, which is the prefix of another file

Question

Determine the largest suffix of some file, which is the prefix of another file

I have two files - call them file0 and file1.

What I would like to get is a fast algorithm for the next task (it’s clear to me how to write a rather slow algorithm that solves it):

Determine the largest file suffix file0, which is the prefix file1, which means memory block B (or, more precisely, the number of bytes of such a memory block) of maximum length so that

file0 consists of some block of memory A followed by B
file1 constist of memory block B followed by some memory block C

Note that blocks A, B, and C can also have lengths of zero bytes.

Edit (to answer a dry remark): the obvious rather slow algorithm that I think of (Pseudocode): let the file length be limited to m, n with wlog m <= n.

for each length from m to 0 compare the m last bytes of file0 with the first m bytes of file1 if they are equal return length

This is obviously an O (m * min (m, n)) algorithm. If the files are the same size, this is O (n ^ 2).

The files that I have to process at the moment are between 10 and several megabytes of hundread. But as a last resort, they can also be several gigabytes in size - simple enough to no longer fit into the x86 32-bit address space.

+4

algorithm prefix

Nubok Mar 31 '11 at 11:23

source share

3 answers

Consider handling bytes as 0..255 numbers stored as mod p integers, where p is a prime, not necessarily much larger than 255. Here are two ways to calculate b0 * x ^ 2 + b1 * x + b2:

(b0 * x + b1) * x + b2

b0 * x ^ 2 + (b1 * x + b2).

Therefore, I can effectively calculate this value, working from left to right - multiply by x and add b2 or work from right to left - adding b0 * x ^ 2.

Select random x and calculate it, working from right to left in AB and left to right in BC. If the calculated values match, you mark the location. Later, perform a slow check on all matches, starting with the longest, to see if B is really identical in both cases.

What is the probability of a random choice? If you have a false match, then (a0 - c0) * x ^ 2 + (a1 - c1) * x + (a2 - c2) = 0. A polynomial of degree d has at most d roots, so if x is random, the probability of a false match is no more than d / p, and you can make it small by working mod p for a sufficiently large p. (If I remember correctly, there is a message authentication scheme that has this idea at its core).

+3

mcdowella Mar 31 '11 at 19:35

source share

If this is not an academic assignment, then it makes sense to implement the simplest solution and see how it behaves in your data.

For example, a theoretically more efficient solution based on the Knut-Morris-Pratt algorithm may work worse than a solution based on IndexOf (see Overlap Detection ).

For large files, your program can spend all the time waiting for input / output.

+1

jfs Mar 31 '11 at 20:11

source share

templatetypedef · Accepted Answer · 2011-03-31T11:29:43+0000

Depending on how much memory you have available, you might want to create a suffix tree for the first file. After that, you can request the prefix of the second file, which overlaps the suffix of the second file as much as possible, simply by walking the suffix tree down from the root along the edges corresponding to the letters of the prefix of the second file. Since suffix trees can be constructed in linear time, the execution time for this algorithm is O (| A | + | B |), using your terminology, because building a suffix tree takes time O (| A | + | B |); O (| B |) to go through the suffix tree to find block B.

Determine the largest suffix of some file, which is the prefix of another file

More articles: