Text difference algorithm

I need an algorithm that can compare two text files, highlight their differences, and (even better!) compute how different they are in a meaningful way (for example, two similar files should get a higher similarity rating than two dissimilar files, with "similar" defined in the ordinary, everyday sense). It sounds easy to implement, but it is not.

The implementation may be in C# or Python.

Thanks.

+42
python c# diff
Sep 28 '08 at 10:12
11 answers

Python has difflib, as others have mentioned.

difflib offers a SequenceMatcher class that can give you a similarity ratio. An example function:

    import difflib

    def text_compare(text1, text2, isjunk=None):
        return difflib.SequenceMatcher(isjunk, text1, text2).ratio()
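For example (ratio() returns a float in [0, 1]; more similar inputs score higher):

    similar = text_compare("the quick brown fox", "the quick brown dog")
    different = text_compare("the quick brown fox", "lorem ipsum dolor sit")
    assert similar > different
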
+25
Sep 28 '08 at 23:02

I can recommend taking a look at the code and articles by Neil Fraser:

google-diff-match-patch

It is currently available in Java, JavaScript, C++ and Python. Regardless of the language, each library has the same API and the same functionality, and all versions also have comprehensive test harnesses.

Neil Fraser: Diff Strategies - for theory and implementation comments
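If it helps, here is a minimal sketch of using the Python port (assuming it is installed from the diff-match-patch distribution on PyPI; the class and method names below are from its documented API):

    from diff_match_patch import diff_match_patch

    dmp = diff_match_patch()
    diffs = dmp.diff_main("I need an algorithm", "I want an algorithm")
    dmp.diff_cleanupSemantic(diffs)      # merge tiny edits into readable chunks
    print(diffs)                         # list of (op, text): 0 = equal, -1 = delete, 1 = insert
    print(dmp.diff_levenshtein(diffs))   # a rough edit-distance number for "how different"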

+30
Sep 28 '08 at 11:04

Take a look at difflib (Python).

It can produce diffs in several formats; you could then use the size of a context diff as a measure of how different the two documents are.
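A quick sketch of that idea (the file names are placeholders), using difflib's unified diff and counting changed lines as a crude difference score:

    import difflib

    with open("old.txt") as f:
        old = f.readlines()
    with open("new.txt") as f:
        new = f.readlines()

    diff = list(difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt"))
    print("".join(diff))  # the human-readable diff
    changed = sum(1 for line in diff
                  if line.startswith(("+", "-")) and not line.startswith(("+++", "---")))
    print("changed lines:", changed)  # bigger number = more different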

+23
Sep 28 '08 at 10:14

Bazaar contains an alternative diff algorithm, called patience diff (there is more information in the comments on that page), which is claimed to be better than the traditional diff algorithm. The file 'patiencediff.py' in the Bazaar distribution is a simple command-line front end.
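I have not verified the exact packaging, but the patience implementation has since been split out as a standalone patiencediff package on PyPI; assuming it keeps its difflib-compatible interface, usage would look roughly like this:

    import patiencediff  # assumed package name; interface mirrors difflib.SequenceMatcher

    a = "one\ntwo\nthree\n".splitlines(keepends=True)
    b = "one\n2\nthree\n".splitlines(keepends=True)

    matcher = patiencediff.PatienceSequenceMatcher(None, a, b)
    print(matcher.ratio())          # similarity rating, as with difflib
    print(matcher.get_opcodes())    # the edit operations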

+10
Sep 28 '08 at 10:35

My current understanding is that the best solution to the Shortest Edit Script (SES) problem is Myers' "middle snake" method with the Hirschberg linear-space refinement.

Myers' algorithm is described in:

E. Myers, "An O(ND) Difference Algorithm and Its Variations",
Algorithmica 1, 2 (1986), 251-266.

The GNU diff utility uses the Myers algorithm.

The "similarity" you are talking about is called the "edit distance" in the literature: the number of insertions or deletions needed to transform one sequence into the other.

Note that a number of people have cited the Levenshtein distance algorithm, but, while easy to implement, it is not the optimal solution: it is inefficient (it requires a possibly huge n*m matrix), and it does not provide an "edit script", i.e. the sequence of edits that can be used to transform one sequence into the other and vice versa.
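To illustrate the difference, here is a minimal sketch (not GNU diff's code) of the greedy forward pass of Myers' algorithm; it computes only the length of the shortest edit script, i.e. the insert/delete edit distance, and never builds an n*m matrix:

    def myers_ses_length(a, b):
        # Length of the shortest edit script (insertions + deletions) that
        # turns sequence a into sequence b, via the greedy forward pass of
        # Myers' O(ND) algorithm. v[k] is the furthest x reached on diagonal
        # k = x - y after d edit operations.
        n, m = len(a), len(b)
        v = {1: 0}
        for d in range(n + m + 1):
            for k in range(-d, d + 1, 2):
                if k == -d or (k != d and v.get(k - 1, 0) < v.get(k + 1, 0)):
                    x = v.get(k + 1, 0)          # step down: insert b[y]
                else:
                    x = v.get(k - 1, 0) + 1      # step right: delete a[x]
                y = x - k
                while x < n and y < m and a[x] == b[y]:
                    x, y = x + 1, y + 1          # follow the free "snake" of matches
                v[k] = x
                if x >= n and y >= m:
                    return d
        return n + m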

For a good Myers/Hirschberg implementation:

http://www.ioplex.com/~miallen/libmba/dl/src/diff.c

The library it is contained in is no longer maintained, but as far as I know the diff.c module itself is still correct.

Mike

+9
Jan 26 '09 at 0:44

If you need finer granularity than lines, you can use the Levenshtein distance. Levenshtein distance is a straightforward measure of how similar two texts are. You can also use it to extract the edit operations, which gives you a very fine-grained diff, similar to the one on the SO revision-history pages. Be warned, though, that computing the Levenshtein distance can be quite CPU- and memory-intensive, so using difflib, as Douglas Leder suggested, is likely to be faster.
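As a point of reference, here is a compact two-row version of the classic dynamic-programming computation (only a sketch; specialised libraries will be faster):

    def levenshtein(a, b):
        # Classic DP Levenshtein distance, kept to two rows of memory.
        # 0 means identical; larger values mean more single-character edits.
        if len(a) < len(b):
            a, b = b, a
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution (or match)
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # 3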

Cf. also this answer.

+5
Sep 28 '08 at 10:27

There are a number of distance metrics; as mentioned in another answer, there is the Levenshtein distance, but there are also NYSIIS and Soundex. In terms of Python implementations, I have used py-editdist and ADVAS. Both are nice in the sense that you get a single number back as a score. Check out ADVAS first; it implements a bunch of algorithms.
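To give a feel for the phonetic side, here is a compact Soundex sketch (simplified: it ignores the h/w separator rule of the full NARA specification); names that sound alike map to the same four-character code:

    def soundex(word):
        # Map consonants to digit groups; vowels, h, w, y are dropped.
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        word = "".join(c for c in word.lower() if c.isalpha())
        if not word:
            return ""
        first, prev = word[0].upper(), codes.get(word[0], "")
        digits = []
        for c in word[1:]:
            code = codes.get(c, "")
            if code and code != prev:   # collapse adjacent letters with the same code
                digits.append(code)
            prev = code
        return (first + "".join(digits) + "000")[:4]

    print(soundex("Robert"), soundex("Rupert"))  # both R163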

+3
Sep 28 '08 at 19:21

As others have indicated, use difflib. Once you have the diff output, you can compute the Levenshtein distance of the differing lines to get a "value" for how different they are.
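A self-contained sketch of that combination (SequenceMatcher.ratio() stands in for a real Levenshtein function here, to avoid an extra dependency):

    import difflib

    def score_changed_regions(old_lines, new_lines):
        # Diff at line level, then score only the changed regions character
        # by character; bigger totals mean more different documents.
        sm = difflib.SequenceMatcher(None, old_lines, new_lines)
        total = 0.0
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "equal":
                continue
            old_chunk = "".join(old_lines[i1:i2])
            new_chunk = "".join(new_lines[j1:j2])
            # 0.0 = chunks identical, 1.0 = nothing in common
            dissimilarity = 1.0 - difflib.SequenceMatcher(None, old_chunk, new_chunk).ratio()
            total += dissimilarity * max(len(old_chunk), len(new_chunk), 1)
        return total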

+2
Sep 28 '08 at 10:33

You can use the solution to the longest common subsequence (LCS) problem. See also the discussion of possible ways to optimize this solution.
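For reference, a short sketch of the standard LCS dynamic programme and a similarity rating derived from it:

    def lcs_length(a, b):
        # Classic O(n*m) dynamic programme, kept to two rows of memory.
        prev = [0] * (len(b) + 1)
        for x in a:
            curr = [0]
            for j, y in enumerate(b, 1):
                curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
            prev = curr
        return prev[-1]

    def similarity(a, b):
        # In [0, 1]: 1.0 means identical sequences, 0.0 means nothing in common.
        return 2.0 * lcs_length(a, b) / (len(a) + len(b)) if (a or b) else 1.0

    print(similarity("banana", "bandana"))  # ~0.92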

+1
Dec 21 '09 at 1:31

One method I used for a different feature, to calculate how much data was new in a modified file, might work for you.

I have a C# diff/patch implementation that lets me take two files, presumably an old and a new version of the same file, and calculate the "difference", but not in the usual sense of the word. Basically, I calculate a set of operations that I can perform on the old version to bring it up to the same content as the new version.

To use this for the feature originally described, i.e. to see how much data was new, I walked through the operations: every operation that copied verbatim from the old file was given a factor of 0, and every operation that inserted new text (shipped as part of the patch, since it did not appear in the old file) was given a factor of 1. Every character received the factor of the operation that produced it, which gave me basically a long list of 0s and 1s.

All I had to do then was tally the 0s and 1s. In your case, with my implementation, a small number of 1s compared to 0s would mean that the files are very similar.
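This is not the same as my C# code, but the same 0/1 idea can be sketched in Python with difflib opcodes standing in for the copy/insert operations (this simple sketch does not handle the out-of-order and duplicated copies mentioned below):

    import difflib

    def newness_ratio(old_text, new_text):
        # Characters copied from the old version count as 0, freshly inserted
        # characters count as 1; the result is the fraction of 1s.
        # 0.0 = nothing new, 1.0 = everything new.
        sm = difflib.SequenceMatcher(None, old_text, new_text)
        zeros = ones = 0
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "equal":
                zeros += j2 - j1   # copied verbatim from the old file
            elif tag in ("insert", "replace"):
                ones += j2 - j1    # new text shipped as part of the patch
            # "delete" contributes no characters to the new file
        total = zeros + ones
        return ones / total if total else 0.0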

This implementation also handles cases where the modified file copies content from the old file out of order, or even duplicates it (i.e. copies a part from the beginning of the file and pastes it near the bottom), since both are copies of the same original part of the old file.

I experimented with weighting copies, so that the first copy counted as 0 and subsequent copies of the same characters got progressively higher factors, to give copy/paste operations some "newness factor", but I never finished it, as the project was cancelled.

If you're interested, my diff / patch code is available from my Subversion repository.

0
Dec 21 '09 at 1:39

Take a look at the Fuzzy module. It has fast (written in C) implementations of Soundex, NYSIIS and double metaphone.

A good introduction can be found at: http://www.informit.com/articles/article.aspx?p=1848528

0
Apr 03 '12 at 12:11


