Text similarity function for strongly similar documents

I am writing a piece of Java software that has to make a final decision on whether two UTF-8 encoded documents are similar.

The two documents are likely to be the same or only slightly different, since they already share many features such as date, location, creator, etc., but their text is what decides whether they really are the same.

I expect the texts of the two documents to be either very similar or not similar at all, so I can be fairly strict about the similarity threshold. For example, I could say that two documents are similar only if they have 90% of their words in common, but I would like something more reliable that works for both short and long texts.
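The 90% word rule mentioned above can be sketched as follows. This is only one possible reading of it: the class name `WordOverlap`, the choice of the Jaccard index (shared words divided by all distinct words), and the tokenization on non-letter, non-digit characters are assumptions, not something fixed by the question.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: word-set similarity with a strict threshold.
final class WordOverlap {

    // Lowercase the text and split on anything that is not a Unicode letter or digit.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard index: size of the intersection divided by size of the union, in [0, 1].
    static double jaccard(String a, String b) {
        Set<String> wa = words(a);
        Set<String> wb = words(b);
        if (wa.isEmpty() && wb.isEmpty()) {
            return 1.0; // assumption: two empty documents count as identical
        }
        Set<String> intersection = new HashSet<>(wa);
        intersection.retainAll(wb);
        Set<String> union = new HashSet<>(wa);
        union.addAll(wb);
        return (double) intersection.size() / union.size();
    }

    // True when the two texts reach the given threshold, e.g. 0.9.
    static boolean similar(String a, String b, double threshold) {
        return jaccard(a, b) >= threshold;
    }
}
```

Because this works on sets of distinct words, word order and repetition are ignored, and in a text of only a few sentences a single changed word can already push the score below 0.9. That is one reason it may be less reliable for short texts.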

To summarize, I have:

  • two documents that are either very similar or not similar at all, but
  • it is more likely that the two documents are similar than not
  • documents can be long (several paragraphs) or short (a few sentences)

I have experimented with simmetrics, which has a large array of string matching functions, but what I am most interested in is suggestions for which algorithms to use.

Possible candidates:

  • Levenshtein: its result is more meaningful for short texts
  • overlap coefficient: maybe, but will it discriminate well between documents of different lengths?
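To make the two candidates concrete, here is a minimal sketch of both, written by hand rather than taken from simmetrics. The class and method names are hypothetical, and the normalization of the edit distance by the longer string's length is an assumption; other normalizations exist.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper holding the two candidate measures.
final class CandidateMeasures {

    // Edit distance by dynamic programming, keeping only two rows.
    // Operates on UTF-16 code units, which is fine for most text.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1 minus the distance divided by the longer length.
    static double normalizedLevenshtein(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }

    // Overlap coefficient: shared elements divided by the size of the smaller set.
    static double overlapCoefficient(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return a.isEmpty() && b.isEmpty() ? 1.0 : 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / Math.min(a.size(), b.size());
    }
}
```

The sketch also illustrates the concern about different lengths: the overlap coefficient is 1.0 whenever every word of the shorter document appears in the longer one, however much extra text the longer one contains. The Levenshtein computation takes time proportional to the product of the two lengths, which is acceptable for a few paragraphs but grows quickly for long documents.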

Also, only checking whether the two texts are exactly identical would not work well, because I would like documents that differ in only a few words to pass the similarity test.


Diff algorithms: http://c2.com/cgi/wiki?DiffAlgorithm
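A diff-style comparison can be approximated with a longest common subsequence (LCS) over word tokens, which is the core idea behind many diff tools. The sketch below is an illustration under assumptions: the class name `WordDiffSimilarity`, the whitespace tokenization, and the ratio 2 * LCS / (n + m) are choices made here, not something taken from the linked page.

```java
// Hypothetical helper: word-level similarity based on the longest common subsequence.
final class WordDiffSimilarity {

    // Split on runs of whitespace; an empty or blank text yields no tokens.
    static String[] tokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Length of the longest common subsequence of two token arrays,
    // computed with two rolling rows of the usual dynamic-programming table.
    static int lcsLength(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                curr[j] = a[i - 1].equals(b[j - 1])
                        ? prev[j - 1] + 1
                        : Math.max(prev[j], curr[j - 1]);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // Ratio in [0, 1]: twice the LCS length divided by the total number of tokens.
    static double ratio(String textA, String textB) {
        String[] a = tokens(textA);
        String[] b = tokens(textB);
        if (a.length + b.length == 0) {
            return 1.0; // assumption: two empty documents count as identical
        }
        return 2.0 * lcsLength(a, b) / (a.length + b.length);
    }
}
```

Unlike a word-set measure, this respects word order, so two documents that differ in only a few words keep a high ratio while documents that merely share vocabulary do not. For example, "the cat sat on the mat" and "the cat sat on a mat" share a subsequence of 5 out of 6 words each, giving a ratio of 10/12.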


Source: https://habr.com/ru/post/1734411/

