Hey Dave, you'll never see this coming.
    php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
    php > $not_kanji = 'aaabcccbbc';
    php > $pattern = '/(.)\1+/u';
    php > echo preg_replace($pattern, '$1', $kanji);
    漢字漢字私字私字漢字私漢字漢字私
    php > echo preg_replace($pattern, '$1', $not_kanji);
    abcbc
What, you thought I was using mb_substr again?
In regex-speak, the pattern looks for any single character followed by one or more repeats of that same character. The whole matched run is then replaced with the one character that was captured.
The u modifier enables UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is already UTF-8 and PCRE has been compiled with Unicode support, this should work for you.
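One thing worth guarding against: with the u modifier, preg_replace returns null when the subject is not valid UTF-8. Here is a minimal sketch of a wrapper that checks for that. The name collapse_repeats is made up for illustration, and throwing an exception is just one way to handle the failure.

```php
<?php
// Hypothetical helper: collapse each run of a repeated character down to one.
function collapse_repeats(string $input): string
{
    // In UTF-8 mode (u modifier), preg_replace() returns null on invalid UTF-8.
    $result = preg_replace('/(.)\1+/u', '$1', $input);
    if ($result === null) {
        throw new InvalidArgumentException(
            'preg_replace failed, preg_last_error() = ' . preg_last_error()
        );
    }
    return $result;
}

echo collapse_repeats('漢漢漢字漢字私私'), "\n"; // 漢字漢字私
echo collapse_repeats('aaabcccbbc'), "\n";       // abcbc
```

Note that . does not match a newline by default, so runs of blank lines are left alone unless you also add the s modifier.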
Hey guess what!
    $not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
    $l = mb_strlen($not_kanji);
    $unique = array();
    for($i = 0; $i < $l; $i++) {
        $char = mb_substr($not_kanji, $i, 1);
        if(!array_key_exists($char, $unique))
            $unique[$char] = 0;
        $unique[$char]++;
    }
    echo join('', array_keys($unique));
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to pull it apart one character at a time. Each character then becomes a key in an array. We're taking advantage of PHP's ordered arrays: keys stay in the order in which they were first defined. Once we've walked the whole string and seen every character, we grab the keys and join 'em back together in the order in which they first appeared in the string. As a bonus, this technique also gives you a count of how many times each character occurs.
This would be much easier if there were such a thing as mb_str_split to go along with str_split.
(There's no kanji example here because I was hitting a copy/paste error.)
Here, try this on for size:
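For what it's worth, PHP 7.4 and later do ship an mb_str_split function in the mbstring extension. Assuming you are on one of those versions, the loop above can be sketched in two lines, since array_count_values also keeps keys in first-appearance order:

```php
<?php
// Assumes PHP 7.4+ with the mbstring extension loaded.
$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';

// Split into characters, then count occurrences of each one.
$counts = array_count_values(mb_str_split($not_kanji, 1, 'UTF-8'));

// Keys come back in the order each character was first seen.
echo join('', array_keys($counts)), "\n"; // abcdgef
```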
    function mb_count_chars_kinda($input) {
        $l = mb_strlen($input);
        $unique = array();
        for($i = 0; $i < $l; $i++) {
            $char = mb_substr($input, $i, 1);
            if(!array_key_exists($char, $unique))
                $unique[$char] = 0;
            $unique[$char]++;
        }
        return $unique;
    }

    function mb_string_chars_diff($one, $two) {
        $left = array_keys(mb_count_chars_kinda($one));
        $right = array_keys(mb_count_chars_kinda($two));
        return array_diff($left, $right);
    }

    print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
You'll want to call this twice, the second time with the left string on the right and the right string on the left. The output will be different: array_diff only gives you the items on the left side that are missing from the right, so you have to do it twice to get the whole story.
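If you'd rather not remember to call it twice, here is one way to sketch both directions in a single function. The names unique_chars and chars_symmetric_diff are made up for illustration, the splitting uses the preg_split('//u') idiom instead of the mb_substr loop, and both inputs are assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: distinct characters of a UTF-8 string, in first-seen order.
function unique_chars(string $input): array
{
    // An empty pattern in UTF-8 mode splits between every character.
    $chars = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($chars));
}

// Hypothetical helper: run array_diff in both directions.
function chars_symmetric_diff(string $one, string $two): array
{
    $left = unique_chars($one);
    $right = unique_chars($two);
    return [
        // array_diff preserves keys, so reindex for tidy output.
        'only_in_first'  => array_values(array_diff($left, $right)),
        'only_in_second' => array_values(array_diff($right, $left)),
    ];
}

print_r(chars_symmetric_diff('aabbccddeeffgg', 'abcdexyz'));
```

For the sample call, only_in_first holds f and g, and only_in_second holds x, y and z.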