PHP: counting word occurrences in an array, with UTF-8 support

I am creating a jQuery tag cloud on a PHP site. In my MySQL database I have a “tags” field that holds a comma-separated list of words. I want to build an array of the words together with the frequency with which each one appears. To complicate matters, the text will be written in Hebrew (UTF-8 encoded).

For English text this solution works fine:

$words = array_count_values(str_word_count($str, 1)); print_r($words); 

(taken from "PHP: sorting and counting instances of words in a given string")

With Hebrew text, the array is not populated.

I found the question "The str_word_count() function does not display the Arabic language correctly", and although the solution there works, it only gives the total number of words; it does not create an array of results like the function above does.

I would like the results to look something like this:

 Array ( [happy] => 4 [beautiful] => 1 [lines] => 3 [pear] => 2 [gin] => 1 [rock] => 1 ) 

Any suggestions?

2 answers

You can create a UTF-8 (only!) version of str_word_count using the Unicode mode of PHP's PCRE functions.

    function utf8_str_word_count($string, $format = 0, $charlist = null) {
        if ($charlist === null) {
            // A letter, followed by any number of letters, combining marks, ' or -
            $regex = '/\\pL[\\pL\\p{Mn}\'-]*/u';
        } else {
            // Split $charlist into single UTF-8 characters and escape each one
            $split = array_map('preg_quote',
                               preg_split('//u', $charlist, -1, PREG_SPLIT_NO_EMPTY));
            $regex = sprintf('/(\\pL|%1$s)([\\pL\\p{Mn}\'-]|%1$s)*/u',
                             implode('|', $split));
        }

        switch ($format) {
            default:
            case 0:
                // For PHP >= 5.4.0 this is fine:
                return preg_match_all($regex, $string);

                // For PHP < 5.4 it is necessary to do this:
                // $results = null;
                // return preg_match_all($regex, $string, $results);
            case 1:
                $results = null;
                preg_match_all($regex, $string, $results);
                return $results[0];
            case 2:
                $results = null;
                preg_match_all($regex, $string, $results, PREG_OFFSET_CAPTURE);
                return empty($results[0])
                    ? array()
                    : array_combine(
                          array_map('end', $results[0]),
                          array_map('reset', $results[0]));
        }
    }

This function should match the semantics of str_word_count as closely as possible; in particular, if you replace "locale-dependent" with "UTF-8" in the following note from the str_word_count documentation, it holds for this function as well:

For this function, a "word" is defined as a locale-dependent string containing alphabetic characters, which may also contain, but not start with, the characters "'" and "-".

In other words, the characters ' and - are considered part of a word but cannot start one; however, any character specified in the $charlist parameter can start a word, which means that passing ' and/or - in $charlist slightly changes how the function works. This behavior also matches the original str_word_count.

It is also worth noting that you can make the function recognize only certain Unicode scripts by replacing \pL with a script property such as \p{Greek}; see the PCRE Unicode reference.
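To tie this back to the question, here is a minimal sketch of building the frequency array. It inlines the default pattern from the function above instead of calling it, so that the snippet runs on its own. The helper name count_utf8_words and the sample tag string are made up for illustration.

```php
<?php
// Hypothetical helper: counts how often each word occurs in a UTF-8 string.
// It uses the same default pattern as utf8_str_word_count() with $format = 1.
function count_utf8_words(string $text): array
{
    $matches = [];
    // \pL = any Unicode letter, \p{Mn} = non-spacing marks (e.g. Hebrew vowel points)
    if (preg_match_all('/\pL[\pL\p{Mn}\'-]*/u', $text, $matches) === false) {
        return []; // e.g. the input was not valid UTF-8
    }
    return array_count_values($matches[0]);
}

// Assumed sample data: a comma-separated list as stored in the "tags" field.
$tags  = 'שמח, יפה, שמח, אגס, שמח, אגס';
$words = count_utf8_words($tags);
arsort($words); // most frequent first, which is handy for a tag cloud
print_r($words);
```

Note that the pattern splits on anything that is not a letter, so a multi-word tag would be counted word by word. If tags should be counted whole, splitting on the comma with explode() and trimming each piece may be the better fit.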
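As a rough illustration of that last point, the sketch below swaps \pL for \p{Hebrew} in the default pattern so that only Hebrew-script words are matched. The function name hebrew_words and the sample sentence are assumptions, not part of the original answer.

```php
<?php
// Hypothetical variant: only words written in the Hebrew script are returned.
// \p{Hebrew} matches any character of the Hebrew script, so it also covers
// Hebrew points and punctuation, not just letters.
function hebrew_words(string $text): array
{
    $matches = [];
    if (preg_match_all('/\p{Hebrew}[\p{Hebrew}\p{Mn}\'-]*/u', $text, $matches) === false) {
        return []; // e.g. the input was not valid UTF-8
    }
    return $matches[0];
}

// Mixed Hebrew/Latin input: the Latin words are skipped.
print_r(hebrew_words('שלום hello עולם world'));
```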


Although this is not exactly the answer you were hoping for, I would recommend that you first review your database design. Storing multiple comma-separated tags in a single field is not a good idea. Instead, create a separate table for tags with only two columns:

  • tag
  • the ID of the corresponding object (a post, a message, or whatever your application is about)

There are many advantages:

  • It is easier to remove or add tags.
  • You can get the array you are looking for with a single SQL query and without any crappy PHP code, for example "SELECT tag, COUNT(id) FROM tags GROUP BY tag".
  • It is easier and MUCH faster when you have a lot of tags.
  • Last but not least, I would bet (without being sure) that MySQL will not have the problems with different alphabets that you are evidently running into in PHP.
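
A rough sketch of this design follows. To keep it self-contained it uses an in-memory SQLite database through PDO as a stand-in for MySQL (this assumes the pdo_sqlite extension is available), and the table and column names post_tags, post_id and tag are made up for illustration.

```php
<?php
// Sketch only: SQLite in memory stands in for MySQL so the example runs on its own.
// The names post_tags, post_id and tag are assumptions.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// One row per (post, tag) pair instead of a comma-separated field.
$pdo->exec('CREATE TABLE post_tags (post_id INTEGER NOT NULL, tag TEXT NOT NULL)');

$insert = $pdo->prepare('INSERT INTO post_tags (post_id, tag) VALUES (?, ?)');
$rows = [[1, 'שמח'], [1, 'אגס'], [2, 'שמח'], [3, 'שמח'], [3, 'יפה'], [4, 'אגס']];
foreach ($rows as $row) {
    $insert->execute($row);
}

// The database does the counting; FETCH_KEY_PAIR yields [tag => count].
$counts = $pdo->query(
    'SELECT tag, COUNT(*) AS uses FROM post_tags GROUP BY tag ORDER BY uses DESC, tag'
)->fetchAll(PDO::FETCH_KEY_PAIR);

print_r($counts);
```

With MySQL the query itself would look the same; only the connection string and the column types would differ, and the table and connection should use a UTF-8 character set such as utf8mb4.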

Source: https://habr.com/ru/post/1486832/

