You can write a UTF-8 (only!) version of str_word_count using the Unicode mode of PHP's PCRE functions:
    function utf8_str_word_count($string, $format = 0, $charlist = null) {
        if ($charlist === null || $charlist === '') {
            // A word starts with a letter and may continue with letters,
            // combining marks, apostrophes and hyphens.
            $regex = '/\\pL[\\pL\\p{Mn}\'-]*/u';
        } else {
            // Split $charlist into UTF-8 characters and escape each one,
            // including the "/" delimiter, so that it is matched literally.
            $split = preg_split('//u', $charlist, -1, PREG_SPLIT_NO_EMPTY);
            foreach ($split as $i => $char) {
                $split[$i] = preg_quote($char, '/');
            }
            $regex = sprintf('/(\\pL|%1$s)([\\pL\\p{Mn}\'-]|%1$s)*/u',
                             implode('|', $split));
        }

        switch ($format) {
            default:
            case 0:
                // For PHP >= 5.4.0 this is fine:
                return preg_match_all($regex, $string);

                // For PHP < 5.4 it is necessary to do this instead:
                // $results = null;
                // return preg_match_all($regex, $string, $results);
            case 1:
                $results = null;
                preg_match_all($regex, $string, $results);
                return $results[0];
            case 2:
                $results = null;
                preg_match_all($regex, $string, $results, PREG_OFFSET_CAPTURE);
                // Map each word's byte offset to the word itself.
                $words = array();
                foreach ($results[0] as $match) {
                    $words[$match[1]] = $match[0];
                }
                return $words;
        }
    }
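One detail of format 2 is worth illustrating: the keys come from PREG_OFFSET_CAPTURE, which reports byte offsets, not character offsets, even with the /u modifier. The following is a minimal sketch that runs the default pattern directly instead of calling the function; the sample text and the $byOffset variable are illustrative assumptions, and the file is assumed to be saved as UTF-8.

```php
<?php
// Default pattern used when no $charlist is given.
$regex = '/\\pL[\\pL\\p{Mn}\'-]*/u';
$text  = 'été à Paris'; // "é" and "à" take two bytes each in UTF-8

preg_match_all($regex, $text, $results, PREG_OFFSET_CAPTURE);

// Build the offset => word map the same way format 2 does.
$byOffset = array();
foreach ($results[0] as $match) {
    $byOffset[$match[1]] = $match[0];
}

print_r($byOffset);
// Keys are byte offsets: 0 => été, 6 => à, 9 => Paris
// (the character offsets would be 0, 4 and 6).
```

If you need character positions instead, one option is to convert each key with mb_strlen(substr($text, 0, $offset), 'UTF-8').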
utf8_str_word_count should follow the semantics of str_word_count as closely as possible. In particular, if you replace "locale dependent" with "UTF-8" in the following note from the str_word_count documentation, the statement holds for this function:
For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters.
In other words, the characters ' and - are considered part of a word but cannot start one. However, any character given in the $charlist parameter can start a word, which means that passing ' and/or - in $charlist slightly changes how the function behaves. This also matches the behavior of the original str_word_count.
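To make the difference concrete, here is a small sketch that compares the default pattern with the pattern the function builds when $charlist is "'". The patterns are written out by hand and the sample text and variable names are assumptions for illustration; the file is assumed to be saved as UTF-8.

```php
<?php
// Default pattern: a word must start with a letter.
$default = '/\\pL[\\pL\\p{Mn}\'-]*/u';
// Pattern built for $charlist = "'": the apostrophe may now start a word too.
$withQuote = '/(\\pL|\')([\\pL\\p{Mn}\'-]|\')*/u';

$text = "'tis l'été";

preg_match_all($default, $text, $m);
$plain = $m[0];   // leading apostrophe is skipped: tis, l'été

preg_match_all($withQuote, $text, $m);
$quoted = $m[0];  // leading apostrophe is kept: 'tis, l'été

print_r($plain);
print_r($quoted);
```

The word count is the same in both cases here, but the words themselves differ, so the effect shows up mainly with formats 1 and 2.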
It is also worth noting that you can make the function recognize only certain Unicode scripts by replacing \pL with a script property such as \p{Greek}; see the PCRE Unicode reference.
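As a rough sketch of that idea, the pattern below swaps \pL for \p{Greek} so that only runs of Greek-script characters count as words. The pattern, sample text and variable names are assumptions for illustration, not part of the function above; the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical variation of the default pattern: Greek script only.
// \p{Mn} is kept so that combining marks stay attached to their letters.
$greekOnly = '/\\p{Greek}[\\p{Greek}\\p{Mn}\'-]*/u';

$text = 'The word λόγος means "word"; so does λέξη.';

preg_match_all($greekOnly, $text, $matches);
$greekWords = $matches[0];

print_r($greekWords); // λόγος, λέξη (the Latin words are ignored)
```

Note that \p{Greek} selects by script, not by letter category, so it is slightly broader than "Greek letters only".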