This is at least not trivial, because you have a lot of data on one side and even more on the other.
The simplest approach would be to build a Lucene index over the 7 million phrases and let the Hadoop job query that index. I am not quite sure whether you need a Solr server for that, or whether there are comparable implementations in Python.
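To make the idea concrete, here is a minimal, hypothetical sketch of what such an index lookup does: a tiny in-memory inverted index with TF-IDF scoring, standing in for a real Lucene index. The `PhraseIndex` class and its scoring are my own toy construction, not Lucene's API.

```python
import math
from collections import Counter, defaultdict

class PhraseIndex:
    """Toy in-memory inverted index with TF-IDF-style scoring.
    A stand-in for what a Lucene index would provide at scale."""

    def __init__(self, phrases):
        self.phrases = phrases
        self.n = len(phrases)
        self.postings = defaultdict(set)  # term -> ids of phrases containing it
        for pid, phrase in enumerate(phrases):
            for term in phrase.lower().split():
                self.postings[term].add(pid)

    def search(self, text, top_k=5):
        """Return (phrase_id, score) pairs for phrases sharing terms with text."""
        scores = Counter()
        for term in set(text.lower().split()):
            ids = self.postings.get(term)
            if not ids:
                continue
            idf = math.log(self.n / len(ids))  # rarer terms weigh more
            for pid in ids:
                scores[pid] += idf
        return scores.most_common(top_k)

index = PhraseIndex(["big data processing", "phrase matching", "data pipelines"])
print(index.search("processing big data"))
```

In the real setup you would build the index once over the 7 million phrases and have each map task query it, rather than rebuilding it per task.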
The mapper should write out the phrase id or line number (depending on what you need to identify it), or at least the phrase itself, together with the matching phrase. In the reduce step, you can then reduce on the phrase key and write out all related phrases with their score (or whatever you want).
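The map and reduce steps described above could be sketched roughly as follows. The `lookup` function here is a hypothetical placeholder for the actual index query (Lucene or Solr), with a trivial shared-word count as the score; the emitted key/value shape is the point, not the scoring.

```python
from collections import defaultdict

# Toy phrase table; in practice the lookup would hit the Lucene index.
PHRASES = {0: "big data", 1: "phrase matching", 2: "big data tools"}

def lookup(phrase):
    """Placeholder index query: score = number of shared words."""
    words = set(phrase.split())
    hits = [(pid, len(words & set(p.split()))) for pid, p in PHRASES.items()]
    return [(pid, s) for pid, s in hits if s > 0]

def mapper(record):
    """Map step: emit (input phrase, (matched id, score)) for every hit."""
    line_number, phrase = record
    for pid, score in lookup(phrase):
        yield phrase, (pid, score)

def reducer(pairs):
    """Reduce step: group on the phrase key, collect matches sorted by score."""
    grouped = defaultdict(list)
    for key, match in pairs:
        grouped[key].append(match)
    return {k: sorted(v, key=lambda m: -m[1]) for k, v in grouped.items()}

records = [(1, "big data"), (2, "matching phrase tools")]
pairs = [p for r in records for p in mapper(r)]
print(reducer(pairs))
```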
For similarity scoring, you can read here:
Apache Lucene Similarity
Apache Lucene