Getting the percentage of similarity of two texts

I need to measure how similar two texts are when one of them is contained in the other.

For instance:

Text1: aaa bbb ccc ddd eee
Text2: bbb ccc

I need to be able to say that Text2 is 100% inside Text1. Is there any way to do this?

+4
3 answers

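Depending on what you want, you could try:

  • the length of the longest common subsequence of both texts, divided by the length of text2
  • or the length of the longest common substring (the longest contiguous run shared by both texts), also divided by the length of text2

Both give 1 if text2 is completely inside text1 and 0 if the texts have no character in common. The subsequence variant also gives 1 when the characters of text2 appear in text1 in the same order but with gaps between them, so use the substring variant if text2 has to appear as one unbroken piece.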
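The two ratios can be sketched in Python as below. This is a minimal illustration rather than a tuned implementation: the function names (`lcs_length`, `longest_common_substring_length`, `containment`) and the `contiguous` flag are made up for this example, the comparison is character by character, and both routines use plain dynamic programming, so the running time grows with len(text1) * len(text2).

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (order kept, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous run that occurs in both strings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def containment(text1: str, text2: str, contiguous: bool = False) -> float:
    """Share of text2 that is found inside text1, from 0.0 to 1.0."""
    if not text2:
        return 0.0  # assumption: an empty text2 counts as "nothing matched"
    measure = longest_common_substring_length if contiguous else lcs_length
    return measure(text1, text2) / len(text2)


text1 = "aaa bbb ccc ddd eee"
text2 = "bbb ccc"
print(containment(text1, text2))                   # 1.0
print(containment(text1, text2, contiguous=True))  # 1.0
```

Multiply the result by 100 to get a percentage. For long texts it is probably worth splitting into words first and running the same routines on the word lists, which shrinks the table considerably.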

+1

You do not need Lucene to measure similarity between texts. Several measures are available, and which one works best depends on the length of the texts, the kind of strings involved, and so on, so you will need to experiment to see which gives the best results for your data.

A fairly good and complete collection of algorithms is available in SimMetrics, a F/OSS library that offers an extensive set of similarity algorithms and their associated cost functions.

0

See the book Mining of Massive Datasets and Dekang Lin's definition of similarity (PDF). Neither requires Lucene.
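As a rough idea of what a set-based measure looks like, here is a small sketch built on word shingles (runs of k consecutive words), the kind of representation the book uses for document similarity. It is not code from either reference: the function names and the default k = 2 are assumptions for this example, and the one-sided containment ratio is an adaptation for the "is text2 inside text1" question, shown next to the usual symmetric Jaccard similarity.

```python
from __future__ import annotations


def word_shingles(text: str, k: int = 2) -> set[tuple[str, ...]]:
    """Set of all runs of k consecutive lower-cased words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def shingle_containment(text1: str, text2: str, k: int = 2) -> float:
    """Fraction of text2's shingles that also occur in text1."""
    needle = word_shingles(text2, k)
    if not needle:
        return 0.0  # assumption: text2 shorter than k words scores 0
    return len(needle & word_shingles(text1, k)) / len(needle)


def jaccard(text1: str, text2: str, k: int = 2) -> float:
    """Symmetric Jaccard similarity of the two shingle sets."""
    a, b = word_shingles(text1, k), word_shingles(text2, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


text1 = "aaa bbb ccc ddd eee"
text2 = "bbb ccc"
print(shingle_containment(text1, text2))  # 1.0
print(jaccard(text1, text2))              # 0.25
```

The example shows why a symmetric measure is a poor fit for this question: Jaccard penalises Text1 for being longer, while the containment ratio only asks how much of Text2 is covered.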

0

Source: https://habr.com/ru/post/1342698/

