How to find similar posts in a large database

I have a database with 2,000,000 messages. When a user receives a message, I need to find related messages in my database based on the words they have in common.

I tried running a batch process to summarize my database:
1 - Save all the words (except stop words such as a, an, of, for ...) of all messages.
2 - Create a link between each message and the words it contains (I also save the frequency of each word in the message).

Then, when I get a new message:
1 - I parse its words (the same way as in the first step of my batch process).
2 - I run a query against the database to retrieve messages sorted by the number of matching words.
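For illustration, the two steps described above amount to an inverted index. The following is only a minimal in-memory sketch, not the asker's actual code: the stop-word list, the `InvertedIndex` class, and the sample messages are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Assumed stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "for"}

def tokenize(text):
    """Lowercase the text and return its words, minus stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

class InvertedIndex:
    """Hypothetical helper: maps each word to the messages containing it."""

    def __init__(self):
        # word -> {message_id: frequency of the word in that message}
        self.postings = defaultdict(dict)

    def add(self, message_id, text):
        for word, freq in Counter(tokenize(text)).items():
            self.postings[word][message_id] = freq

    def similar(self, text, limit=10):
        """Return (message_id, matching_word_count) pairs, best match first."""
        matches = Counter()
        for word in set(tokenize(text)):
            for message_id in self.postings.get(word, ()):
                matches[message_id] += 1
        return matches.most_common(limit)

index = InvertedIndex()
index.add(1, "The price of oil rose for a third day")
index.add(2, "Oil price falls as supply grows")
index.add(3, "A recipe for apple pie")
result = index.similar("Why did the oil price rise?")
```

Only messages that share at least one word with the query are ever touched, which is the property a database-backed version would also need in order to stay fast.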

However, both updating the word index and querying for similar messages are very slow. Updating the word index takes ~1.2111 seconds for a 3000-byte message. A query for similar messages takes ~9.8 seconds for a message of the same size.

Database setup has already been completed, and the code is working fine.

I need a better algorithm for this.

Any ideas?

2 answers

I would recommend using Apache Solr (http://lucene.apache.org/solr/). It is very easy to configure and can index millions of documents. Solr handles all the necessary optimization (and since it is open source, you can modify it if you think that is necessary).

You can then query it through any of the available APIs; I prefer the SolrJ Java API (http://wiki.apache.org/solr/Solrj). I usually see results in about one second.
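As a rough sketch of what such a query could look like over plain HTTP rather than SolrJ, the snippet below builds a request that enables Solr's MoreLikeThis component on the standard `/select` handler. The core name `messages`, the text field `body`, the unique key field `id`, and the helper `more_like_this_url` are assumptions for the example, not details from this answer.

```python
from urllib.parse import urlencode

def more_like_this_url(base_url, core, doc_id, field, count=10):
    """Build a Solr /select URL asking for documents similar to doc_id.

    Hypothetical helper: assumes the unique key field is called "id"
    and that `field` is an indexed text field.
    """
    params = {
        "q": f"id:{doc_id}",   # the message to find neighbours for
        "mlt": "true",         # enable the MoreLikeThis component
        "mlt.fl": field,       # field(s) used to measure similarity
        "mlt.count": count,    # number of similar documents to return
        "mlt.mintf": 1,        # minimum term frequency in the source doc
        "mlt.mindf": 1,        # minimum document frequency of a term
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"

url = more_like_this_url("http://localhost:8983/solr", "messages", 42, "body")
```

The resulting URL can be fetched with any HTTP client; the similar documents come back in a separate section of the response next to the normal results.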

Solr usually outperforms MySQL for indexing text.


Similarity matching is still a particularly complex field, but you could take a look at full-text search in the MySQL Reference Manual, especially some of the more complex examples.

It should be possible to run a one-time task that builds a similarity matrix for all your current messages, and then run a nightly batch that adds new messages to the similarity matrix.
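One way such a precomputed similarity table might look is sketched below, under the assumption that similarity means cosine similarity between TF-IDF word vectors; the answer does not name a specific measure. The function names and sample messages are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def tfidf_vectors(docs):
    """Map each message id to a {word: tf-idf weight} dict."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter(word for c in counts.values() for word in c)
    n = len(docs)
    return {
        doc_id: {
            # Smoothed idf so that no weight is zero or negative.
            w: tf * (math.log((1 + n) / (1 + doc_freq[w])) + 1)
            for w, tf in c.items()
        }
        for doc_id, c in counts.items()
    }

def cosine(u, v):
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_neighbours(vectors, k=5):
    """For each message keep only its k most similar messages."""
    neighbours = {}
    for a, vec_a in vectors.items():
        scored = [(cosine(vec_a, vec_b), b) for b, vec_b in vectors.items() if b != a]
        best = sorted(scored, reverse=True)[:k]
        neighbours[a] = [b for score, b in best if score > 0]
    return neighbours

docs = {1: "oil price rises again", 2: "oil price falls sharply", 3: "apple pie recipe"}
neighbours = top_neighbours(tfidf_vectors(docs))
```

The pairwise loop here is quadratic in the number of messages, so for 2,000,000 messages a real batch would have to restrict the candidates (for example, to messages sharing at least one rare word) and store only the top few neighbours per message rather than a full matrix.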


Source: https://habr.com/ru/post/1338489/

