Getting the percentage of similarity of two texts

Question


I need to measure the similarity between two texts when one is contained in the other.

For instance:

Text1: aaa bbb ccc ddd eee
Text2: bbb ccc

I need to be able to say something like "Text2 is 100% inside Text1". Is there any way to do this?


java lucene

Trepik Mar 07 '11 at 20:13


3 answers

Howard · Answer 1 · 2011-03-07T20:19:31+0000

Depending on what you want, you may try either of the following:

the length of the longest common subsequence of both texts divided by the length of Text2,
or the length of the longest common substring (the longest contiguous common subsequence) of both texts, also divided by the length of Text2.

Both will give you 1 if Text2 is completely inside Text1 and 0 if the two texts have no character in common.
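
The two ratios above could be sketched in Java roughly as follows. This is only an illustration under some assumptions: the class and method names (TextContainment, lcsRatio, substringRatio) are made up for this example, comparison is character by character and case-sensitive, and an empty Text2 is treated as a score of 0.

```java
public final class TextContainment {

    private TextContainment() {}

    // Length of the longest common subsequence (matching characters need not be adjacent).
    public static int longestCommonSubsequence(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1;
                } else {
                    curr[j] = Math.max(prev[j], curr[j - 1]);
                }
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    // Length of the longest common substring (matching characters must be adjacent).
    public static int longestCommonSubstring(String a, String b) {
        int best = 0;
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1;
                    best = Math.max(best, curr[j]);
                } else {
                    curr[j] = 0; // a mismatch ends the current run
                }
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return best;
    }

    // Assumption: an empty text2 scores 0 rather than dividing by zero.
    public static double lcsRatio(String text1, String text2) {
        if (text2.isEmpty()) return 0.0;
        return (double) longestCommonSubsequence(text1, text2) / text2.length();
    }

    public static double substringRatio(String text1, String text2) {
        if (text2.isEmpty()) return 0.0;
        return (double) longestCommonSubstring(text1, text2) / text2.length();
    }

    public static void main(String[] args) {
        String text1 = "aaa bbb ccc ddd eee";
        // Text2 from the question: fully contained, so both ratios are 1.0.
        System.out.println(lcsRatio(text1, "bbb ccc"));
        System.out.println(substringRatio(text1, "bbb ccc"));
        // Words present but not adjacent: the two measures now disagree.
        System.out.println(lcsRatio(text1, "bbb ddd"));
        System.out.println(substringRatio(text1, "bbb ddd"));
    }
}
```

The subsequence ratio tolerates gaps (it still scores "bbb ddd" as fully contained), while the substring ratio only rewards a contiguous match, so which one to use depends on whether gaps should count against the score. Both run in time proportional to the product of the two text lengths.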

Mikos · Answer 2 · 2011-03-08T09:45:15+0000

You do not need Lucene to compute similarity between texts. There are several measures available, depending on the length of the text, the type of strings, etc., and you will need to experiment to see which gives you the best results.

A fairly good and complete collection of algorithms is available in SimMetrics, an F/OSS library that offers an extensive collection of similarity algorithms and their associated cost functions.

Yuval F · Answer 3 · 2011-03-08T12:32:30+0000

See the book Mining of Massive Datasets and Dekang Lin's definition of similarity (PDF). Neither requires Lucene.
