One approach that I used for a different feature, measuring how much of the data in a modified file was new, might work for you.
I have a C# diff/patch implementation that takes two files, presumably an old and a new version of the same file, and calculates the "difference", though not in the usual sense of the word. Basically, I calculate a set of operations that I can perform on the old version to update it so that it has the same content as the new version.
To use this for the feature described above, to see how much of the data was new, I simply ran through the operations. Each operation that copied text verbatim from the old file got a factor of 0, and each operation that inserted new text (distributed as part of the patch, since it did not occur in the old file) got a factor of 1. Every character was given the factor of its operation, which left me with basically a long list of 0s and 1s.
All I had to do then was tally up the 0s and 1s. In your case, with my implementation, a low number of 1s compared to 0s would mean that the files are very similar.
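To make the 0/1 tally concrete, here is a minimal Python sketch. It is not the C# implementation mentioned above: it assumes Python's standard `difflib.SequenceMatcher` as a stand-in diff engine, and the function names `novelty_factors` and `new_fraction` are made up for illustration. Unlike the implementation described here, `difflib` only finds matches in order, so moved blocks would count as new text.

```python
from difflib import SequenceMatcher


def novelty_factors(old: str, new: str) -> list[int]:
    """Return one factor per character of `new`: 0 if copied from `old`, 1 if inserted."""
    factors: list[int] = []
    # autojunk=False disables difflib's "popular element" heuristic,
    # which could otherwise skew results on longer inputs.
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Text copied verbatim from the old version.
            factors.extend([0] * (j2 - j1))
        elif tag in ("insert", "replace"):
            # Text that has to be shipped with the patch.
            factors.extend([1] * (j2 - j1))
        # "delete" removes old text and adds nothing to the new version.
    return factors


def new_fraction(old: str, new: str) -> float:
    """Share of characters in `new` that did not come from `old` (0.0 = nothing new)."""
    factors = novelty_factors(old, new)
    return sum(factors) / len(factors) if factors else 0.0
```

A value close to 0 from `new_fraction` would then suggest that the two files are very similar, and a value close to 1 that most of the new file had to be inserted.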
This implementation also handles cases where the modified file contains copies from the old file out of order, or even duplicated (i.e. you copy a part from the start of the file and paste it near the bottom), since both would be copies of the same original part of the old file.
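One way to get this behaviour is to let every copy operation refer to an arbitrary offset in the old file. The Python sketch below is an assumption about how that could look, not the actual C# code: it uses a naive greedy longest-match search, and the names `copy_insert_ops`, `count_factors` and the `min_copy` threshold are hypothetical.

```python
def copy_insert_ops(old: str, new: str, min_copy: int = 4) -> list[tuple]:
    """Greedily encode `new` as ("copy", old_offset, length) and ("insert", text) ops."""
    ops: list[tuple] = []
    pending: list[str] = []  # literal characters waiting to be emitted as one insert
    pos = 0
    while pos < len(new):
        best_off, best_len = -1, 0
        # Naive search: try every start in `old` and keep the longest match.
        for start in range(len(old)):
            length = 0
            while (start + length < len(old)
                   and pos + length < len(new)
                   and old[start + length] == new[pos + length]):
                length += 1
            if length > best_len:
                best_off, best_len = start, length
        if best_len >= min_copy:
            if pending:
                ops.append(("insert", "".join(pending)))
                pending = []
            ops.append(("copy", best_off, best_len))
            pos += best_len
        else:
            # Match too short to be worth a copy: treat the character as new text.
            pending.append(new[pos])
            pos += 1
    if pending:
        ops.append(("insert", "".join(pending)))
    return ops


def count_factors(ops: list[tuple]) -> tuple[int, int]:
    """Return (characters with factor 0, characters with factor 1)."""
    copied = sum(op[2] for op in ops if op[0] == "copy")
    inserted = sum(len(op[1]) for op in ops if op[0] == "insert")
    return copied, inserted
```

Because each copy is matched against the whole old file, a block that was moved or pasted twice still counts as copied. The search is quadratic, so a real implementation would presumably index the old file (for example by hashing fixed-size chunks) instead of scanning it for every position.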
I experimented with weighting copies, so that the first copy counted as 0 and subsequent copies of the same characters got progressively higher factors, in order to give a copy/paste operation some "new-factor", but I never finished it because the project was scrapped.
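Since that weighting was never finished, the following Python sketch is purely a guess at one possible scheme. It assumes operations in the form `("copy", old_offset, length)` and `("insert", text)`, and a hypothetical `step` parameter that controls how quickly repeated copies of the same old characters approach the factor of inserted text.

```python
def weighted_novelty(ops: list[tuple], old_len: int, step: float = 0.25) -> float:
    """Average per-character factor, where repeated copies count as partly new.

    `ops` holds ("copy", old_offset, length) and ("insert", text) tuples.
    The first copy of an old character costs 0, the second `step`,
    the third 2 * step, and so on, capped at 1.0 (the cost of inserted text).
    """
    times_copied = [0] * old_len  # how often each old character was copied so far
    total = 0.0
    chars = 0
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            for i in range(offset, offset + length):
                total += min(1.0, step * times_copied[i])
                times_copied[i] += 1
            chars += length
        else:
            # Inserted text is entirely new: factor 1 per character.
            total += len(op[1])
            chars += len(op[1])
    return total / chars if chars else 0.0
```

For example, copying the same four old characters twice and then inserting two new characters gives a total of 0 + 4 × 0.25 + 2 = 3 over 10 characters, so an average factor of 0.3, compared to 0.2 if every copy counted as 0.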
If you're interested, my diff/patch code is available from my Subversion repository.
Lasse Vågsæther Karlsen, Dec 21 '09 at 1:39