Find 3-8 word common phrases in text using PHP

Question

I am looking for a way to find common phrases within a body of text using PHP. If this is not possible in PHP, I would be interested in other web languages that would help me do this.

Memory or speed is not a problem.

Right now I can easily find keywords, but I don't know how to go about finding phrases.

+4

php data-mining text-mining

owilde1900 Jan 26 '11 at 4:37

6 answers

Using only PHP? The simplest thing I can come up with:

Add each phrase to an array
Get the first phrase from the array and delete it
Find the number of phrases that match it and delete them, keeping the number of matches
Push the phrase and the number of matches onto a new array
Repeat until the source array is empty

I am rubbish at formal CS, but I believe this is of n^2 complexity, specifically n(n-1)/2 comparisons in the worst case. I have no doubt there is a better way to do this, but you mentioned that efficiency is not a problem, so this will do.

The code follows (I used array_keys, a function that was new to me, with its optional search parameter):

    // assign the source text to $text
    $text = file_get_contents('mytext.txt');

    // there are other ways to do this, like preg_match_all,
    // but this is computationally the simplest
    $phrases = explode('.', $text);

    // filter the phrases
    // if you're in PHP5, you can use a foreach loop here
    $num_phrases = count($phrases);
    for($i = 0; $i < $num_phrases; $i++) {
        $phrases[$i] = trim($phrases[$i]);
    }

    $counts = array();

    while(count($phrases) > 0) {
        $p = array_shift($phrases);
        $keys = array_keys($phrases, $p);
        $c = count($keys);
        $counts[$p] = $c + 1;

        if($c > 0) {
            foreach($keys as $key) {
                unset($phrases[$key]);
            }
        }
    }

    print_r($counts);

See it in action: http://ideone.com/htDSC
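
As an aside, the same tally can probably be done in a single pass with array_count_values, which avoids the pairwise comparisons. This is a minimal sketch, assuming the same split on '.' as in the code above. The name countSentences is made up for illustration, and unlike the loop above it drops empty fragments.

```php
<?php
// Hypothetical helper: tally identical sentence-like phrases in one pass.
function countSentences(string $text): array
{
    // Split on periods, as in the answer above, and trim whitespace.
    $phrases = array_map('trim', explode('.', $text));

    // Drop empty fragments, such as the one after the final period.
    $phrases = array_filter($phrases, function ($p) {
        return $p !== '';
    });

    // array_count_values builds a phrase => count map in a single pass.
    return array_count_values($phrases);
}

print_r(countSentences('I like tea. You like tea. I like tea.'));
```

Because PHP arrays are hash maps, each phrase is looked up once instead of being compared against every remaining phrase.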

+1

Steven xu Jan 26 '11 at 6:36

I think you should go for

str_word_count

    $str = "Hello friend, you're looking good today!";
    print_r(str_word_count($str, 1));

will give

    Array
    (
        [0] => Hello
        [1] => friend
        [2] => you're
        [3] => looking
        [4] => good
        [5] => today
    )

Then you can use array_count_values()

    $array = array(1, "hello", 1, "world", "hello");
    print_r(array_count_values($array));

which will give you

    Array
    (
        [1] => 2
        [hello] => 2
        [world] => 1
    )
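
Word counts alone do not give phrases, but the two functions can be combined: slide a window of n words over the output of str_word_count and pass the joined windows to array_count_values. This is a rough sketch rather than part of the answer. The name countPhrases and the lowercasing step are assumptions, and the window deliberately ignores sentence boundaries.

```php
<?php
// Hypothetical helper: count every run of $n consecutive words in $text.
function countPhrases(string $text, int $n): array
{
    // Lowercase so "The cat" and "the cat" count as the same phrase.
    $words = str_word_count(strtolower($text), 1);

    $phrases = [];
    // Slide a window of $n words across the word list.
    for ($i = 0; $i + $n <= count($words); $i++) {
        $phrases[] = implode(' ', array_slice($words, $i, $n));
    }

    $counts = array_count_values($phrases);
    arsort($counts); // most frequent phrases first
    return $counts;
}

$text = 'The cat sat on the mat. The cat sat on the sofa.';
print_r(countPhrases($text, 3));
```

Calling it once for each n from 3 to 8 would cover the phrase lengths asked about in the question.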

+1

Harish Jan 26 '11 at 7:04

The ugly solution, since you said ugly is OK, would be to search for the first word of any of your phrases. Once that word is found, check whether the word after it matches the next expected word in the phrase. This would be a loop that keeps going as long as the matches are positive, until a word fails to match or the phrase is complete.

Simple but extremely ugly and probably very, very slow.
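
One way to read that description, as a sketch only: it assumes the phrase is already known, and countKnownPhrase is a made-up name. Words are taken from str_word_count on lowercased text, which is also an assumption.

```php
<?php
// Hypothetical helper: count how often a known phrase occurs in a text.
function countKnownPhrase(string $text, string $phrase): int
{
    $words  = str_word_count(strtolower($text), 1);
    $target = str_word_count(strtolower($phrase), 1);
    $length = count($target);

    if ($length === 0) {
        return 0;
    }

    $hits = 0;
    foreach ($words as $i => $word) {
        // Only start comparing when the first word of the phrase is found.
        if ($word !== $target[0]) {
            continue;
        }
        // Keep checking following words until one fails or the phrase ends.
        $j = 1;
        while ($j < $length && isset($words[$i + $j]) && $words[$i + $j] === $target[$j]) {
            $j++;
        }
        if ($j === $length) {
            $hits++;
        }
    }
    return $hits;
}

echo countKnownPhrase('To be or not to be, that is the question.', 'to be'); // prints 2
```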

0

Drew Jan 26 '11 at 6:09

Arriving here late, but since I came across this while trying to do something similar, I thought I would share where I landed in 2019:

https://packagist.org/packages/yooper/php-text-analysis

This library made my task completely trivial. In my case, I had an array of search phrases, which I split into individual terms, normalized, and then turned into n-grams of two and three words. Looping through the resulting n-grams, I was able to easily summarize the frequency of specific phrases.

    $words = tokenize($searchPhraseText);
    $words = normalize_tokens($words);
    $ngram2 = array_unique(ngrams($words, 2));
    $ngram3 = array_unique(ngrams($words, 3));

Really cool library with a lot to offer.
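
For readers without the library, the same idea can be sketched in plain PHP: build the n-grams of each search phrase, de-duplicate them within that phrase, and tally across all phrases. The names makeNgrams and phraseFrequency are illustrative and are not functions from yooper/php-text-analysis. Lowercasing plus str_word_count stands in, as an assumption, for the library's tokenizing and normalizing.

```php
<?php
// Illustrative stand-in for an n-gram helper: runs of $n consecutive tokens.
function makeNgrams(array $tokens, int $n): array
{
    $grams = [];
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $grams[] = implode(' ', array_slice($tokens, $i, $n));
    }
    return $grams;
}

// Count in how many search phrases each n-gram appears.
function phraseFrequency(array $searchPhrases, int $n): array
{
    $totals = [];
    foreach ($searchPhrases as $phrase) {
        $tokens = str_word_count(strtolower($phrase), 1);
        // array_unique: a gram repeated inside one phrase counts only once.
        foreach (array_unique(makeNgrams($tokens, $n)) as $gram) {
            $totals[$gram] = ($totals[$gram] ?? 0) + 1;
        }
    }
    return $totals;
}

$searches = ['cheap red shoes', 'red shoes for men', 'Red Shoes red shoes'];
print_r(phraseFrequency($searches, 2));
```

Here "red shoes" is counted once per search phrase, so the third phrase contributes one to its total even though it contains the pair twice.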

0

Chords Jan 9 '19 at 16:05

If you want full-text search over HTML files, use Sphinx, a powerful search engine. Documentation here

-2

MDI Jan 26 '11 at 6:30

Core xii · Accepted Answer · 2011-01-26T06:51:02+0000

I wrote a PHP script that does just that, right here. It first breaks the source text into an array of words and their occurrences. Then it counts common sequences of those words with the specified parameters. It is old code and not commented, but maybe you will find it useful.
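
The linked script is not reproduced here, so the following is only a guess at the general shape, not the author's code: count every run of 3 to 8 consecutive words and keep the ones seen at least a given number of times. The names commonPhrases, $minWords, $maxWords, and $minCount are hypothetical.

```php
<?php
// Hypothetical sketch: phrases of $minWords..$maxWords words seen >= $minCount times.
function commonPhrases(string $text, int $minWords = 3, int $maxWords = 8, int $minCount = 2): array
{
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);

    $counts = [];
    for ($n = $minWords; $n <= $maxWords; $n++) {
        // Count every run of $n consecutive words.
        for ($i = 0; $i + $n <= $total; $i++) {
            $phrase = implode(' ', array_slice($words, $i, $n));
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }

    // Keep only phrases that repeat often enough.
    $counts = array_filter($counts, function ($c) use ($minCount) {
        return $c >= $minCount;
    });

    arsort($counts); // most frequent first
    return $counts;
}

$text = 'We went to the old mill. Later we went to the old barn.';
print_r(commonPhrases($text));
```

Shorter phrases contained in a longer repeated phrase are reported as well; filtering those out is left open here.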
