How to ignore some characters when doing diff in google-diff-match-patch?

I use google-diff-match-patch to compare plain text in natural languages.

How can I make google-diff-match-patch ignore certain characters? (These are minor differences that I don't care about.)

For example, this text1:

give me a cup of bean-milk. Thanks. 

and text2:

 please give mom a cup of bean milk! Thank you. 

(Note that there are two spaces before “Thank you”.)

google-diff-match-patch outputs something like this:

 [please] give m(e)[om] a cup of bean(-)[ ]milk(.)[!] Thank(s)[ you]. 

It seems that google-diff-match-patch ignores only differences in spacing.

How can I tell google-diff-match-patch to also ignore characters like [-.!]?

Expected Result:

 [please] give m(e)[om] a cup of bean-milk. Thank(s)[ you]. 

Thanks.

1 answer

google-diff-match-patch returns the diff as a list of tuples.

The first element of each tuple indicates whether it is an insertion (1), a deletion (-1), or an equality (0). The second element is the affected text.

For instance:

 diff_main("Good dog", "Bad dog") => [(-1, "Goo"), (1, "Ba"), (0, "d dog")] 
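To make the tuple format concrete, here is a minimal sketch that renders such a list in the notation used in the question, with deletions in parentheses and insertions in square brackets. The helper name `render` is hypothetical and not part of google-diff-match-patch; the input list is the example above, written out by hand.

```python
def render(diffs):
    """Render (op, text) tuples: (deleted), [inserted], equal text as-is."""
    parts = []
    for op, text in diffs:
        if op == -1:      # deletion: present only in text1
            parts.append("(" + text + ")")
        elif op == 1:     # insertion: present only in text2
            parts.append("[" + text + "]")
        else:             # equality: present in both texts
            parts.append(text)
    return "".join(parts)

print(render([(-1, "Goo"), (1, "Ba"), (0, "d dog")]))  # (Goo)[Ba]d dog
```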

So we just need to filter this list.

Python sample code:

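    import re

    # A diff chunk made up only of these characters counts as a minor difference.
    Ignored_marks = re.compile(r'[ ,\.;:!\'"?-]+$')

    def unmark_minor_diffs(diffs):
        # diffs is the list of (op, data) tuples produced by google-diff-match-patch
        cooked_diffs = []
        for (op, data) in diffs:
            if not Ignored_marks.match(data):
                # A real difference (or ordinary equal text): keep it unchanged.
                cooked_diffs.append((op, data))
            elif op in (0, -1):
                # Minor chunk from text1: keep it as an equality.
                # Minor insertions (op == 1) are dropped.
                cooked_diffs.append((0, data))
        return cooked_diffs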
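One side effect of filtering this way is that deletions turned into equalities leave several equality tuples in a row. If that matters, neighbouring tuples with the same operation can be joined afterwards. The following is a sketch only: `merge_adjacent` is a hypothetical helper, not a library function, and the sample list is written by hand to resemble filtered output for the question's text rather than taken from a real run.

```python
from itertools import groupby

def merge_adjacent(diffs):
    """Join consecutive tuples that share the same op into a single tuple."""
    merged = []
    # groupby yields runs of consecutive tuples with an identical op value.
    for op, group in groupby(diffs, key=lambda d: d[0]):
        merged.append((op, "".join(data for _, data in group)))
    return merged

# Hand-written sample: what a filtered list might look like (an assumption).
filtered = [
    (0, "bean"), (0, "-"), (0, "milk"), (0, "."),
    (0, " Thank"), (-1, "s"), (1, " you"), (0, "."),
]
print(merge_adjacent(filtered))
```

The order of the tuples is left unchanged, so the result describes the same text1 with fewer, longer chunks.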

Source: https://habr.com/ru/post/1436089/

