Character Recognition / Annotation Algorithm

I have a set of lines representing the history of a document. Each line - this is the entire document - so far there has been no diff analysis.

I need a relatively efficient algorithm that allows me to annotate document substrings with the version from which they came.

For example, if the document’s history was like this:

Rev1: The quiet fox Rev2: The quiet brown fox Rev3: The quick brown fox 

The algorithm will give:

 The quick brown fox 1111111331222222111 

i.e. "Qui" was added to revision 1, "ck" was added to version 3, "was added to revision 1," brown "was added to revision 2, and finally," fox "was added in version 1.

+4
source share
3 answers

I have a class library that can do this easily, although I don’t know how well it works with large or many such versions.

The library is here: DiffLib on CodePlex (you can also install it through NuGet.)

the script for your example in the question here (you can run this in LINQPad if you add a link to DiffLib):

 void Main() { var revs = new string[] { "The quiet fox", "The quiet brown fox", "The quick brown fox", "The quick brown fox.", "The quick brown fox jumped over the lazy dog.", "The quick brown fox jumped over the lazy cat.", "The Quick Brown Fox jumped over the Lazy Cat.", }; string current = revs[0]; List<int> owner = new List<int>(); foreach (char c in current) owner.Add(1); // owner 1 owns entire string Action<int> dumpRev = delegate(int rev) { Debug.WriteLine("rev " + rev); Debug.WriteLine(current); Debug.WriteLine(new string(owner.Select(i => (char)(48 + i)).ToArray())); Debug.WriteLine(""); }; dumpRev(0); for (int index = 1; index < revs.Length; index++) { int ownerId = index + 1; var diff = new DiffLib.Diff<char>(current, revs[index]).ToArray(); int position = 0; foreach (var part in diff) { if (part.Equal) position += part.Length1; else { // get rid of old owner for the part that was // removed or replaced for (int index2 = 0; index2 < part.Length1; index2++) owner.RemoveAt(position); // insert new owner for the part that was // added or did replace the old text for (int index2 = 0; index2 < part.Length2; index2++) owner.Insert(position, ownerId); position += part.Length2; } } current = revs[index]; dumpRev(index); } } 

Output:

  rev 0
 The quiet fox
 1111111111111

 rev 1
 The quiet brown fox
 1111111111222222111

 rev 2
 The quick brown fox
 1111111331222222111

 rev 3
 The quick brown fox.
 11111113312222221114

 rev 4
 The quick brown fox jumped over the lazy dog.
 1111111331222222111555555555555555555555555555554

 rev 5
 The quick brown fox jumped over the lazy cat.
 1111111331222222111555555555555555555555555666664

 rev 6
 The Quick Brown Fox jumped over the Lazy Cat.
 11117113317222227115555555555555555555755557664 
+3
source

You want to use the Myers diff algorithm implemented by Google . It is quite fast and has implementations in many languages, and you can provide timeout values ​​so that it does not spend too much time looking for complex differences.

The result should be fairly trivially converted into the account you need (assigning a patch by assigning a fix).

+1
source

Doesn't your story format already provide this information? If so, then it is simply a matter of displaying it. The most effective method will depend on the format in which your story is stored, of course, so no one here can really provide this for you without knowing this format.

It should be noted that if you send the output to some display device (for example, a screen), then usually your algorithm should be really stupid to slow down much more than the display device will already slow down.

0
source

Source: https://habr.com/ru/post/1340804/


All Articles