Massive Search Application Implementation

We run an email service that hosts about 10,000 domains, and we store the message headers in a SQL Server database.

I need to implement an application that searches message bodies for keywords. The messages are stored as files on a NAS storage system.

As a proof of concept, I implemented a search engine on top of SQL Server: I parsed each message and saved all of its words in a database table along with the memberid and messageid. That database was on a separate server from the headers database.

The problem with this system was that I ended up with a table of 600 million rows after processing the messages of only one domain. Obviously, this is not a very scalable solution.
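To make the growth concrete, here is a minimal sketch of the row-per-word layout described above. The tokenizer, the function name `word_rows`, and the figure of 100 distinct words per message are all assumptions for illustration, not the actual implementation or measured data; only the 20 million messages per day comes from the question.

```python
import re
from typing import Iterator, Tuple

# Hypothetical tokenizer: lowercase runs of letters and digits.
WORD_RE = re.compile(r"[a-z0-9]+")

def word_rows(member_id: int, message_id: int, body: str) -> Iterator[Tuple[str, int, int]]:
    """Yield one (word, memberid, messageid) row per distinct word in a message."""
    for word in sorted(set(WORD_RE.findall(body.lower()))):
        yield (word, member_id, message_id)

rows = list(word_rows(1, 100, "Meeting moved to Friday, the meeting room changed"))

# Rough daily row count. The per-message figure is an assumed value,
# chosen only to show the order of magnitude.
MESSAGES_PER_DAY = 20_000_000
ASSUMED_DISTINCT_WORDS_PER_MESSAGE = 100
rows_per_day = MESSAGES_PER_DAY * ASSUMED_DISTINCT_WORDS_PER_MESSAGE
```

Every message adds as many rows as it has distinct words, so the table grows with total text volume rather than with vocabulary size.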

Since the headers are stored in the SQL Server table, I will need to join the message identifiers from the search application to the header table in order to display the messages containing the searched keywords.

Any suggestions for a better architecture? Is there a better alternative to SQL Server? We receive over 20 million messages per day. We are a small company with limited resources in terms of servers, maintenance, etc.

thanks

+3
8 answers

Hadoop. "map-reduce" , Google. ( ) Rackspace .

+4

lucene.net , , , .

+3

SQL . .

GREP .

+2

java lucene, . Katta, lucene Solr, rsync , . , , , . , .

+1

600 , . . , . , , , . , , , TLD (.com,.net,.org ..).

SQL Server vs Lucene.NET vs cLucene vs MySQL vs PostgreSQL. . , . , Linux.

http://incubator.apache.org/lucene.net/

http://sourceforge.net/projects/clucene/

+1

/ SQL Server. , , Qaru .

0

:

  • ( lucene, )
    • SQL ( ).
    • Do not create a new row for each word occurrence; just append a new value to a large field in the existing row for that word. Even better, if you are not using SQL for this table, use a key-value store where the key is the word and the value is the list of entries. Check the Inverted Index bibliography for inspiration.

But to be honest, I think the only reasonable approach is #1.
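For reference, the key-value idea from the last bullet (word as key, list of entries as value) could look roughly like the sketch below. It is an in-memory illustration only; the class name `InvertedIndex`, the tokenizer, and the sample messages are hypothetical, and a real deployment would need a persistent store.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Hypothetical tokenizer: lowercase runs of letters and digits.
WORD_RE = re.compile(r"[a-z0-9]+")

class InvertedIndex:
    """Maps each word to the set of message ids that contain it."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[int]] = defaultdict(set)

    def add(self, message_id: int, body: str) -> None:
        # One key per distinct word; the message id joins that word's posting set.
        for word in set(WORD_RE.findall(body.lower())):
            self.postings[word].add(message_id)

    def search(self, *keywords: str) -> Set[int]:
        # Return the ids of messages that contain every keyword.
        sets = [self.postings.get(k.lower(), set()) for k in keywords]
        if not sets:
            return set()
        return set.intersection(*sets)

index = InvertedIndex()
index.add(1, "Invoice for March attached")
index.add(2, "March meeting agenda")
index.add(3, "Invoice overdue")

march_hits = index.search("march")
both_hits = index.search("invoice", "march")
```

The number of keys is bounded by the vocabulary rather than by the number of messages, and the message ids returned by `search` are what would then be joined against the headers table.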

0

Source: https://habr.com/ru/post/1715200/

