Find 3-8 word common phrases in text using PHP

I am looking for a way to find common phrases in text using PHP. If this is not possible in PHP, I would be interested in other web languages that would help me accomplish this.

Memory or speed is not a problem.

Right now I can easily find keywords, but I don't know how to go about finding phrases.

+4
6 answers

I wrote a PHP script that does just that, right here. It first breaks the source text into an array of words and their occurrences. It then counts the common sequences of those words according to the specified parameters. This is old code and not commented, but maybe you will find it useful.

+3

Using only PHP? The simplest thing I can come up with:

  • Add each phrase to an array
  • Get the first phrase from the array and delete it
  • Find the number of phrases that match it and delete them, keeping the number of matches
  • Push the phrase and its match count onto a new array
  • Repeat until the source array is empty

I'm rubbish at formal CS, but I believe this is O(n^2) complexity, specifically n(n-1)/2 comparisons in the worst case. I have no doubt there is a better way to do this, but you mentioned that efficiency is not a concern, so this will do.

Here is the code (I used array_keys, a function that was new to me, which accepts a search parameter):

    // assign the source text to $text
    $text = file_get_contents('mytext.txt');

    // there are other ways to do this, like preg_match_all,
    // but this is computationally the simplest
    $phrases = explode('.', $text);

    // filter the phrases
    // if you're in PHP5, you can use a foreach loop here
    $num_phrases = count($phrases);
    for($i = 0; $i < $num_phrases; $i++) {
      $phrases[$i] = trim($phrases[$i]);
    }

    $counts = array();
    while(count($phrases) > 0) {
      $p = array_shift($phrases);
      $keys = array_keys($phrases, $p);
      $c = count($keys);
      $counts[$p] = $c + 1;
      if($c > 0) {
        foreach($keys as $key) {
          unset($phrases[$key]);
        }
      }
    }

    print_r($counts);

See it in action: http://ideone.com/htDSC

+1

I think you should go for

str_word_count

    $str = "Hello friend, you're looking good today!";
    print_r(str_word_count($str, 1));

will give

    Array
    (
        [0] => Hello
        [1] => friend
        [2] => you're
        [3] => looking
        [4] => good
        [5] => today
    )

Then you can use array_count_values()

    $array = array(1, "hello", 1, "world", "hello");
    print_r(array_count_values($array));

which will give you

    Array
    (
        [1] => 2
        [hello] => 2
        [world] => 1
    )
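On their own, these two functions only count single words. One possible way to extend them to phrases is to slide a window over the word list, join each window into a string, and pass the result to array_count_values(). The following is a minimal sketch, not a tested solution: count_phrases() is a hypothetical helper name, the text is lower-cased, and phrases are allowed to cross sentence boundaries. Note also that str_word_count() is mainly suited to plain ASCII text.

```php
<?php
// Hypothetical helper: counts every run of $min..$max consecutive words
// and returns the phrases that occur more than once.
function count_phrases(string $text, int $min = 3, int $max = 8): array
{
    // str_word_count(..., 1) returns the words as a list.
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);
    $phrases = [];

    for ($n = $min; $n <= $max; $n++) {
        // Slide a window of $n words across the text.
        for ($i = 0; $i + $n <= $total; $i++) {
            $phrases[] = implode(' ', array_slice($words, $i, $n));
        }
    }

    // Count identical phrases and keep only the repeated ones.
    $counts = array_filter(array_count_values($phrases), function ($c) {
        return $c > 1;
    });
    arsort($counts); // most frequent first

    return $counts;
}

$text = 'The quick brown fox jumps. Later the quick brown fox sleeps.';
print_r(count_phrases($text));
```

Since memory and speed are not a concern here, building the full list of windows before counting keeps the code short; for very large texts the counts could be incremented inside the loop instead.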
+1

The ugly solution, since you said ugly is fine, would be to search for the first word of any of your phrases. Once that word is found, check whether the next word in the text matches the next expected word in the phrase. This would be a loop that keeps going as long as the words match, stopping when a word does not match or the phrase is complete.

Simple but extremely ugly and probably very, very slow.
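The loop described above might look something like the following sketch. It assumes the phrases to look for are already known and that matching is case-insensitive; count_known_phrases() is a hypothetical helper name used only for illustration.

```php
<?php
// Hypothetical helper: counts occurrences of each known phrase in $text.
function count_known_phrases(string $text, array $phrases): array
{
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);
    $counts = [];

    foreach ($phrases as $phrase) {
        $target = str_word_count(strtolower($phrase), 1);
        $len = count($target);
        $counts[$phrase] = 0;
        if ($len === 0) {
            continue; // nothing to match
        }

        for ($i = 0; $i + $len <= $total; $i++) {
            // Look for the first word of the phrase...
            if ($words[$i] !== $target[0]) {
                continue;
            }
            // ...then keep comparing until a word differs or the phrase ends.
            $j = 1;
            while ($j < $len && $words[$i + $j] === $target[$j]) {
                $j++;
            }
            if ($j === $len) {
                $counts[$phrase]++;
            }
        }
    }

    return $counts;
}

print_r(count_known_phrases(
    'Good morning to you. I said good morning to everyone.',
    ['good morning to', 'morning to you']
));
```

This only helps when the candidate phrases are known in advance; it does not discover new phrases on its own.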

0

Arriving here late, but since I came across this question while trying to do something similar, I figured I would share where I landed in 2019:

https://packagist.org/packages/yooper/php-text-analysis

This library made my task completely trivial. In my case, I had an array of search phrases, which I split into individual terms, normalized, and then turned into two-word and three-word n-grams. Looping over the resulting n-grams, I could easily tally the frequency of specific phrases.

    $words = tokenize($searchPhraseText);
    $words = normalize_tokens($words);
    $ngram2 = array_unique(ngrams($words, 2));
    $ngram3 = array_unique(ngrams($words, 3));

Really cool library with a lot to offer.

0

If you want full-text search in HTML files, use Sphinx, a powerful search engine. Documentation here

-2

Source: https://habr.com/ru/post/1337091/

