How to search a text for thousands of possible keywords

I have a database of about 10,000 keywords. When a user publishes a blog post on my site, I would like to automatically search the text for keywords and tag the post with any direct matches.

So far, all I can think of is to pull out the ENTIRE keyword list, loop over it and check the post for each keyword ... which seems very inefficient (that's 10,000 iterations per post).

Is there a more common way to do this? Should I use a MySQL query to narrow it down?

I guess this is not a rare task.

+6
4 answers

No, don't do that.

Instead of iterating over 10,000 keywords, it is better to extract the words from the text and put them into an SQL query, which returns exactly the keywords that occur in the post. This is certainly more efficient than the solution you propose.

You can do this as follows using PHP:

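$possible_keywords = preg_split('/\b/', $your_text, -1, PREG_SPLIT_NO_EMPTY);

The above splits the text at word boundaries and drops empty elements. The third argument of preg_split is the limit (-1 means no limit), so the flag has to be passed as the fourth argument. The resulting array also contains the fragments between words (spaces and punctuation), so filter those out before querying.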

Then you can simply create the SQL query as shown below:

 SELECT * FROM `keywords` WHERE `keywords`.`keyword` IN (...) 

(just put the list of extracted words in the parentheses, escaped or bound as query parameters)

You should probably filter the $possible_keywords array before running the query (keep only words of a plausible length and remove duplicates), and also put an index on the keyword column.
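The filtering and query-building steps described above could be sketched roughly as follows. The function names `extractCandidateWords` and `buildKeywordQuery` are made up for this illustration, the minimum length of 3 is an arbitrary assumption, and the table and column names are taken from the query above. Note that this approach only finds single-word keywords; multi-word keywords would need extra handling.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: reduce raw post text to unique candidate words.
 * Assumes UTF-8 input; lengths are measured in bytes, not characters.
 */
function extractCandidateWords(string $text, int $minLength = 3): array
{
    // Split on runs of characters that are neither letters nor digits.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return []; // e.g. the input was not valid UTF-8
    }
    // Drop words that are too short to be keywords, then remove duplicates.
    $words = array_filter($words, fn(string $w): bool => strlen($w) >= $minLength);
    return array_values(array_unique($words));
}

/**
 * Hypothetical helper: build a parameterised IN (...) query.
 * Returns [sql, params], or null when there is nothing to look up.
 */
function buildKeywordQuery(array $words): ?array
{
    if ($words === []) {
        return null;
    }
    // One "?" placeholder per candidate word.
    $placeholders = implode(', ', array_fill(0, count($words), '?'));
    $sql = "SELECT `keyword` FROM `keywords` WHERE `keyword` IN ($placeholders)";
    return [$sql, $words];
}
```

With PDO, the returned pair can be passed to `$pdo->prepare($sql)` and `$stmt->execute($params)`, so the words never have to be escaped by hand.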

+7

I don't know which language you intend to use, but a standard trie (prefix tree) would solve this problem, if you feel up to implementing one.

+3

I think you could dynamically build a regex that allows you to match keywords within a specific string. You can pack it all in a class that does the grunt work.

 class KeywordTagger
 {
     private static $regex;

     static function getTags($body)
     {
         if (preg_match_all(self::getRegex(), $body, $keywords)) {
             return $keywords[0];
         } else {
             return null;
         }
     }

     private static function getRegex()
     {
         if (self::$regex === null) {
             // Load keywords from the DB here
             $keywords = KeywordsTable::getAllKeywords();
             // Escape regex metacharacters in every keyword
             $keywords = array_map('KeywordTagger::pregQuoteWords', $keywords);
             // Base regex: whole words only, Unicode-aware, case-insensitive
             $regex = '/\b(?:%s)\b/ui';
             // Build the final pattern
             self::$regex = sprintf($regex, implode('|', $keywords));
         }
         return self::$regex;
     }

     private static function pregQuoteWords($word)
     {
         return preg_quote($word, '/');
     }
 }

Then, when the user submits a message, all you have to do is run it through the class:

 $tags = KeywordTagger::getTags($_POST['messageBody']); 

For speed, you can cache the built regular expression with memcached, APC, or a good old file-based cache.
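As an illustration of the file-based variant, here is a minimal sketch. The function `getCachedRegex`, its parameters and the one-hour lifetime are assumptions made for this example and are not part of the class above; the keyword loader is passed in as a callable so the sketch does not depend on any particular database layer.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: return the cached pattern, rebuilding it when the
 * cache file is missing, empty, or older than $ttl seconds.
 */
function getCachedRegex(string $cacheFile, callable $loadKeywords, int $ttl = 3600): string
{
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false && $cached !== '') {
            return $cached; // cache hit: no keyword loading needed
        }
    }

    // Cache miss: escape each keyword and join them into one alternation.
    $quoted = array_map(fn(string $w): string => preg_quote($w, '/'), $loadKeywords());
    $regex = sprintf('/\b(?:%s)\b/ui', implode('|', $quoted));

    // LOCK_EX avoids interleaved writes from concurrent requests.
    file_put_contents($cacheFile, $regex, LOCK_EX);
    return $regex;
}
```

The cache file would also need to be deleted (or the lifetime kept short) whenever the keyword table changes, otherwise new keywords are not picked up until the file expires.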

+3

Well, I think PHP's stripos is already pretty well optimized. If you want to speed the search up further, you will need to exploit the similarity between your keywords (for example, instead of searching for "foobar" and then for "foobaz", search for "fooba" once and then check whether each occurrence is followed by "r" or "z"). This requires some kind of tree representation of your keywords, for example:

     root (empty string)
       |
     fooba
     /   \
 foobar  foobaz

Yes, this is a trie.
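The prefix-tree idea can be sketched in PHP with nested arrays. This is only an illustration under a few assumptions: the function names `buildTrie` and `findKeywords` and the `'END'` marker key are invented here, matching is case-insensitive, and the text is processed byte by byte (so it assumes ASCII keywords; multibyte text would need `mb_str_split` and a Unicode-aware word-boundary check).

```php
<?php
declare(strict_types=1);

/**
 * Build a trie as nested arrays keyed by single characters.
 * The multi-character key 'END' marks a complete keyword, so it can
 * never collide with a one-character edge.
 */
function buildTrie(array $keywords): array
{
    $root = [];
    foreach ($keywords as $keyword) {
        if ($keyword === '') {
            continue; // nothing to insert
        }
        $node = &$root;
        foreach (str_split(strtolower($keyword)) as $char) {
            if (!isset($node[$char])) {
                $node[$char] = [];
            }
            $node = &$node[$char];
        }
        $node['END'] = true;
        unset($node); // break the reference before the next keyword
    }
    return $root;
}

/**
 * Scan the text once per word start and walk the trie from there,
 * so shared prefixes such as "fooba" are only compared once.
 */
function findKeywords(array $trie, string $text): array
{
    $text = strtolower($text);
    $length = strlen($text);
    $found = [];
    for ($start = 0; $start < $length; $start++) {
        // Only start matching at the beginning of a word.
        if ($start > 0 && ctype_alnum($text[$start - 1])) {
            continue;
        }
        $node = $trie;
        for ($i = $start; $i < $length && isset($node[$text[$i]]); $i++) {
            $node = $node[$text[$i]];
            // Accept a keyword only if the word in the text ends here too.
            $atWordEnd = $i + 1 === $length || !ctype_alnum($text[$i + 1]);
            if (isset($node['END']) && $atWordEnd) {
                $found[] = substr($text, $start, $i - $start + 1);
            }
        }
    }
    return array_values(array_unique($found));
}
```

A typical use would be to call `buildTrie()` once with the keyword list and then `findKeywords($trie, $postBody)` for each post; the cost of a scan then depends on the length of the post rather than on the number of keywords.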

+2

Source: https://habr.com/ru/post/889145/

