How to efficiently implement a document similarity search engine?

How do you implement a "similar items" system for items described by a set of tags?

There are three tables in my database: Article, ArticleTag and Tag. Each Article is linked to multiple tags through a many-to-many relationship. For each article, I want to find the five most similar articles to implement "if you like this article, you will like these too."

I am familiar with cosine similarity, and the algorithm gives very good results. But it is way too slow. For each article, I need to iterate over all the other articles, calculate the cosine similarity for each pair of articles, and then select the five articles with the highest similarity score.
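For reference, the brute-force approach described above can be sketched roughly as follows. This is only an illustration, not the asker's actual code: it assumes each article's tags are held as a Python set (so the vectors are binary), and the names `cosine`, `most_similar_brute_force` and the sample `article_tags` data are hypothetical.

```python
import heapq
from math import sqrt


def cosine(tags_a, tags_b):
    """Cosine similarity of two tag sets treated as binary vectors."""
    if not tags_a or not tags_b:
        return 0.0
    return len(tags_a & tags_b) / sqrt(len(tags_a) * len(tags_b))


def most_similar_brute_force(article_id, article_tags, k=5):
    """Score every other article against article_id: O(N) work per lookup."""
    target = article_tags[article_id]
    scored = (
        (cosine(target, tags), other_id)
        for other_id, tags in article_tags.items()
        if other_id != article_id
    )
    # nlargest keeps only the k best (score, id) pairs
    return [other_id for _, other_id in heapq.nlargest(k, scored)]


# Hypothetical sample data: article id -> set of tags
article_tags = {
    1: {"python", "search", "nlp"},
    2: {"python", "search"},
    3: {"cooking"},
    4: {"nlp", "search", "python", "ml"},
}

print(most_similar_brute_force(1, article_tags, k=2))  # -> [4, 2]
```

Every lookup touches every article, which is where the cost comes from once the collection grows large.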

With 200 thousand articles and 30 thousand tags, it takes half a minute to calculate the similar articles for a single article. So I need another algorithm that gives roughly the same quality of results as cosine similarity, but which can run in real time and does not require me to iterate over the entire document corpus every time.

Can someone suggest a ready-made solution for this? Most of the search engines I have looked at do not support document similarity search.

+4
2 answers

Some questions:

  • How is ArticleTag different from Tag? Or is it the M2M mapping table?
  • Can you sketch how you implemented the cosine matching algorithm?
  • Why don't you store the document tags in an in-memory data structure and use it only to look up document identifiers? That way, you hit the database only when retrieving the results.
  • Depending on how often documents are added, this structure can be designed for fast or slow updates.
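One way to read the in-memory suggestion above is an inverted index from tag to article identifiers, so that a lookup only touches articles sharing at least one tag with the target. The sketch below is a minimal illustration under that assumption; the `TagIndex` class and its methods are hypothetical names, and it does not handle removing or re-tagging an article.

```python
from collections import Counter, defaultdict
from math import sqrt


class TagIndex:
    """In-memory inverted index: tag -> ids of articles carrying that tag."""

    def __init__(self):
        self.tag_to_articles = defaultdict(set)
        self.article_to_tags = {}

    def add(self, article_id, tags):
        """Register a new article (updates and deletions are not handled)."""
        tags = set(tags)
        self.article_to_tags[article_id] = tags
        for tag in tags:
            self.tag_to_articles[tag].add(article_id)

    def similar(self, article_id, k=5):
        """Return up to k article ids ranked by cosine similarity of tag sets."""
        target = self.article_to_tags[article_id]
        # Count shared tags, visiting only articles that share at least one tag
        overlap = Counter()
        for tag in target:
            for other_id in self.tag_to_articles[tag]:
                if other_id != article_id:
                    overlap[other_id] += 1
        scored = [
            (shared / sqrt(len(target) * len(self.article_to_tags[other_id])), other_id)
            for other_id, shared in overlap.items()
        ]
        # Highest score first; ties broken by article id for stable output
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [other_id for _, other_id in scored[:k]]


index = TagIndex()
index.add(1, ["python", "search", "nlp"])
index.add(2, ["python", "search"])
index.add(3, ["cooking"])
index.add(4, ["nlp", "search", "python", "ml"])
print(index.similar(1, k=5))
```

For binary tag vectors this yields the same scores as full cosine similarity, since articles with no shared tag would score zero anyway. Very common tags still produce long posting lists, so how much time this saves depends on the tag distribution.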

My initial intuition for an answer would be an online clustering algorithm (perhaps Principal Component Analysis on the co-occurrence matrix, which would approximate K-means clustering?). I can refine this once you answer some of the questions above.

Cheers.

+1

You can do this with the Lemur toolkit. With KeyfileIncIndex you must re-extract the document from your source; IndriIndex supports extracting a document from an index.

But in either case, you index your documents, and then build a query from the document for which you want to find similar documents. You can then run a search with this query, and it will score the other documents for similarity. This is pretty fast in my experience. It treats both source documents and basic queries as documents, so finding similarities is really what it does (unless you are using the Indri parser stuff, which is a bit different, and I'm not sure how it works).
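The general "use the document as the query" idea can be illustrated without any particular toolkit. The sketch below is not the Lemur or Indri API; it is a plain-Python approximation that assumes a simple inverted index and an IDF-style weight per shared term, and the names `build_index` and `query_by_document` are hypothetical.

```python
from collections import defaultdict
from math import log


def build_index(docs):
    """docs: doc id -> list of terms. Returns term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def query_by_document(doc_id, docs, index, k=5):
    """Use the terms of doc_id as the query; rarer shared terms weigh more."""
    n_docs = len(docs)
    scores = defaultdict(float)
    for term in set(docs[doc_id]):
        postings = index[term]
        idf = log(1 + n_docs / len(postings))  # assumed weighting, not Indri's
        for other_id in postings:
            if other_id != doc_id:
                scores[other_id] += idf
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [other_id for other_id, _ in ranked[:k]]


# Hypothetical sample corpus
docs = {
    "a": ["lemur", "index", "query"],
    "b": ["lemur", "index"],
    "c": ["query"],
    "d": ["cooking"],
}
index = build_index(docs)
print(query_by_document("a", docs, index, k=5))
```

A real retrieval toolkit uses its own ranking model and index format, so treat this only as a picture of the workflow: index once, then issue one query per source document.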

0

Source: https://habr.com/ru/post/1300163/

