I am trying to write a free-text search algorithm to find specific posts on a wall (similar to the Facebook wall). The user should be able to type several words into the search field and get hits on messages containing those words, with the best match at the top and the remaining messages in descending order of relevance.
I use the Levenshtein edit distance e = e(x, y) to compute a score for each message word y against each query word x: score(x, y) = 2^(2 - e) * (1 - min(e, |x|) / |x|), where |x| is the number of letters in the query word.
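For reference, here is a minimal sketch of that scoring function (the question does not name a language, so Python and the function names are just for illustration; the edit distance is a standard dynamic-programming implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                   # deletion
                            curr[j - 1] + 1,               # insertion
                            prev[j - 1] + (ca != cb)))     # substitution
        prev = curr
    return prev[-1]

def score(x: str, y: str) -> float:
    """score(x, y) = 2^(2 - e) * (1 - min(e, |x|) / |x|)."""
    e = levenshtein(x, y)
    return 2 ** (2 - e) * (1 - min(e, len(x)) / len(x))
```

An exact match (e = 0) scores 4, and any word at distance |x| or more scores 0, so the score falls off quickly with distance.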
Each word in a message contributes to the overall score for that post (see the sketch below). This approach seems to work well when messages are roughly the same length, but large posts sometimes accumulate a high score simply because they contain many words, even though in practice they are unrelated to the query.
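This is how I aggregate the per-word scores, assuming a plain sum over all message-word/query-word pairs (which is what produces the length bias):

```python
def message_score(query_words: list[str], message: str) -> float:
    """Sum score(x, y) over every query word x and message word y.

    Longer messages contribute more terms to the sum, so they can
    outscore shorter but more relevant posts.
    """
    total = 0.0
    for y in message.split():
        for x in query_words:
            total += score(x, y)
    return total
```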
Am I approaching this problem incorrectly, or is there some way to normalize the score that I have not thought of?