I am trying to write a free-text search algorithm to find specific posts on a wall (similar to the Facebook wall). The user should be able to type several words into the search field and get hits on messages containing those words, with the best match at the top and the remaining messages in descending order of relevance.
I use the Levenshtein edit distance e = e(x, y) to compute a score for each message word y against each query word x: score(x, y) = 2^(2 - e) * (1 - min(e, |x|) / |x|), where |x| is the number of letters in the query word.
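For reference, here is a minimal sketch of that scoring function (the question does not name a language, so Python and the function names are just for illustration; the edit distance is a standard dynamic-programming implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                   # deletion
                            curr[j - 1] + 1,               # insertion
                            prev[j - 1] + (ca != cb)))     # substitution
        prev = curr
    return prev[-1]

def score(x: str, y: str) -> float:
    """score(x, y) = 2^(2 - e) * (1 - min(e, |x|) / |x|)."""
    e = levenshtein(x, y)
    return 2 ** (2 - e) * (1 - min(e, len(x)) / len(x))
```

An exact match (e = 0) scores 4, and any word at distance |x| or more scores 0, so the score falls off quickly with distance.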
Each word in a message contributes to the overall score for that post (see the sketch below). This approach seems to work well when messages are roughly the same length, but large posts sometimes accumulate a high score simply because they contain many words, even though in practice they are unrelated to the query.
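This is how I aggregate the per-word scores, assuming a plain sum over all message-word/query-word pairs (which is what produces the length bias):

```python
def message_score(query_words: list[str], message: str) -> float:
    """Sum score(x, y) over every query word x and message word y.

    Longer messages contribute more terms to the sum, so they can
    outscore shorter but more relevant posts.
    """
    total = 0.0
    for y in message.split():
        for x in query_words:
            total += score(x, y)
    return total
```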
Am I approaching this problem incorrectly, or is there some way to normalize the score that I have not thought of?