Java library for diff free text

I need to match two almost identical long lines to freetext; that is, match indexes to index where possible.

Since this is freetext, the comparison should not be linear, as in the case of code.

Any suggestions for Java libraries?

A simple example (in real life, of course, there would be no extra spaces to align things, and there might be more complex problems, such as whole sentences moved around.)

The quick brown fox jumped over the lazy dog. |||||||||| ||||||||||||||||||||| ||||| The quick yellow fox jumped over the well-bred dog. 
+14
java string comparison diff text
Jan 26 '09 at 12:35
source share
4 answers

This one might be a good diff match patch . Others?

+16
Jan 26 '09 at 13:01
source share

Depending on your exact requirements, the StringUtils class of the component Levenshtein distance between two lines

+6
Jan 26 '09 at 13:03
source share

Here's a (slightly verified) version of the code that does what you requested. You can easily go through the result in parallel with the inputs to find the insert and delete.

 public class StringDiff { private static int length(String s) { return s == null ? 0 : s.length(); } private static char[] chars(String s) { return s == null ? new char[0] : s.toCharArray(); } private final String left; private final String right; private final char[] lccs; private final String lcs; public StringDiff(String left, String right) { this.left = left; this.right = right; lccs = init(); lcs = new String(lccs); } public String getLcs() { return lcs; } public char[] getLccs() { return lccs.clone(); } private char[] init() { int lLength = length(left); int rLength = length(right); char[] lChars = chars(left); char[] rChars = chars(right); int [][] t = new int [lLength + 1][rLength + 1]; for (int i = lLength - 1; i >= 0; --i) { for (int j = rLength - 1; j >= 0; --j) { if (lChars[i] == rChars[j]) { t[i][j] = t[i + 1][j + 1] + 1; } else { t[i][j] = Math.max(t[i + 1][j], t[i][j + 1]); } } } char[] result = new char[t[0][0]]; int l = 0, r = 0, p = 0; while (l < lLength && r < rLength) { if (lChars[l] == rChars[r]) { result[p++] = lChars[l++]; r++; } else { if (t[l + 1][r] > t[l][r + 1]) { ++l; } else { ++r; } } } return result; } } 

Accordingly, the longest subsequence of your source inputs:

 The quick brown fox jumped over the lazy dog. The quick yellow fox jumped over the well-bred dog. 

is an:

 The quick ow fox jumped over the l dog. 

(because "brown" and "yellow" have a "common", etc.)

It is relatively simple to change the above separation into spaces (instead of char arrays) and replace String # equals for == to get a version that finds the longest common subsequence of words instead of characters. For your example above, this change will lead to an obvious result:

 found 7 words 'The' 'quick' 'fox' 'jumped' 'over' 'the' 'dog.' 

(Your question involves comparing characters as you match spaces between words.)

0
Jan 26 '09 at 13:17
source share

If you are an example, this is what you want to do, that is, the subsequences will only match if they start with the same index (which differs from how the differences usually work) - that’s all you need to do:

 import java.util.*; class StringDiff { public static List<int[]> from(String s1, String s2) { int start = -1; int pos = 0; LinkedList<int[]> list = new LinkedList<int[]>(); for(; pos < s1.length() && pos < s2.length(); ++pos) { if(s1.charAt(pos) == s2.charAt(pos)) { if(start < 0) start = pos; } else { if(start >= 0) list.add(new int[] { start, pos }); start = -1; } } if(start >= 0) list.add(new int[] { start, pos }); return list; } public static void main(String[] args) { for(int[] idx : from(args[0], args[1])) System.out.println(args[0].substring(idx[0], idx[1])); } } 

The actual implementation of diff will be much more complicated.

0
Jan 26 '09 at 15:09
source share



All Articles