PHP method for removing duplicate characters from a multibyte string?

Question


Arrrgh. Does anyone know how to write a multibyte-character equivalent of the PHP function count_chars($string, 3)?

That is, it should return a list containing ONLY ONE INSTANCE of each unique character. If the input were English, and we had

"aaabggxxyxzxxgggghq xcccxxxzxxyx"

it would return "abcgh qxyz" (note that the space IS counted).

(The order is not important in this case; it can be anything.)

If it were Japanese kanji (not sure all browsers will render this):

漢漢漢字漢字私私字私字漢字私漢字漢字私

it would return only the 3 kanji:

漢字私

It should work with any UTF-8 encoded string.

+4

php

Dave Mar 24 '11 at 1:18


3 answers

Try looking at the iconv_strlen function in the standard PHP library. I can't speak for East Asian encodings, but it works well for Western and Eastern European languages. In any case, it gives you some flexibility!

0

Igor Mar 24 '11 at 1:29


    $name = "Benny boy";
    // Note: str_split() splits by byte, so this is only safe for single-byte text.
    $name_array = str_split($name);
    // array_unique() keeps the first occurrence of each character.
    $name_array_uniqued = array_unique($name_array);
    print_r($name_array_uniqued);

Much easier. str_split turns the phrase into an array with each character as an element. Then use array_unique to remove the duplicates. Pretty simple. Nothing complicated. I like it.
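One caveat: str_split works on bytes, so on UTF-8 input it cuts multibyte characters apart. Below is a minimal sketch of the same split-then-array_unique idea that should be multibyte-safe, assuming the input is valid UTF-8 and PCRE was built with Unicode support. The function name unique_chars_utf8 is made up for illustration.

```php
<?php
// Hypothetical helper, not part of PHP: returns each distinct character once,
// in order of first appearance. Assumes $input is valid UTF-8.
function unique_chars_utf8(string $input): string
{
    // An empty pattern with the u modifier splits between UTF-8 characters;
    // PREG_SPLIT_NO_EMPTY drops the empty strings at both ends.
    $chars = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // preg_split fails on invalid UTF-8 when the u modifier is used.
        return '';
    }
    // array_unique keeps the first occurrence of each value.
    return implode('', array_unique($chars));
}

echo unique_chars_utf8('漢漢漢字漢字私私字私字漢字私漢字漢字私'), "\n"; // 漢字私
echo unique_chars_utf8('Benny boy'), "\n"; // Beny bo
```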

0

HoldOffHunger Aug 11 '13 at 0:33


Charles · Accepted Answer · 2011-03-24T04:24:10+0000

Hey Dave, you're never going to see this one coming.

    php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
    php > $not_kanji = 'aaabcccbbc';
    php > $pattern = '/(.)\1+/u';
    php > echo preg_replace($pattern, '$1', $kanji);
    漢字漢字私字私字漢字私漢字漢字私
    php > echo preg_replace($pattern, '$1', $not_kanji);
    abcbc

What, you thought I was using mb_substr again?

In regex-speak, it looks for any single character followed by one or more repeats of that same character. The matched region is then replaced with the single character that was captured.

The u modifier enables UTF-8 mode in PCRE, in which it works with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is already UTF-8 and PCRE has been compiled with Unicode support, this should work for you.
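As the output above shows, this pattern only collapses runs of adjacent duplicates. If the goal is one instance of every character regardless of position, one possible regex-only variant (an assumption layered on top of this answer, not something it proposes) is to delete any character that shows up again later in the string. A sketch follows, with the caveats that it keeps the last occurrence of each character and does quadratic work on long strings:

```php
<?php
// Delete every character that is followed, somewhere later, by a copy of itself.
// u = UTF-8 mode, s = let . match newlines too.
$pattern = '/(.)(?=.*\1)/us';

echo preg_replace($pattern, '', '漢漢漢字漢字私私字私字漢字私漢字漢字私'), "\n"; // 漢字私
echo preg_replace($pattern, '', 'aaabcccbbc'), "\n"; // abc
```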

Hey guess what!

    $not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
    $l = mb_strlen($not_kanji);
    $unique = array();
    for ($i = 0; $i < $l; $i++) {
        $char = mb_substr($not_kanji, $i, 1);
        if (!array_key_exists($char, $unique))
            $unique[$char] = 0;
        $unique[$char]++;
    }
    echo join('', array_keys($unique));

It uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to pull it apart one character at a time. Then we use each character as a key in an array. We're taking advantage of PHP's ordered arrays: keys stay in the order in which they were first defined. Once we've gone through the string and identified all the characters, we grab the keys and join them back together in the order in which they first appeared in the string. You also get a count of each character out of this technique.

It would be much easier if there were such a thing as mb_str_split to go along with str_split.

(There's no kanji example here because I'm running into a copy/paste error.)
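For what it's worth, mb_str_split has since been added to PHP (7.4 and later). Assuming that version or newer, the loop above can be sketched more compactly, kanji included. array_count_values gives the same character => count map, and the function name here is hypothetical.

```php
<?php
// Requires PHP 7.4+ for mb_str_split. Hypothetical name, for illustration only.
function mb_count_chars_split(string $input): array
{
    // Split into UTF-8 characters, then count occurrences of each one.
    // Keys come out in order of first appearance.
    return array_count_values(mb_str_split($input, 1, 'UTF-8'));
}

$counts = mb_count_chars_split('漢漢漢字漢字私私字私字漢字私漢字漢字私');
echo implode('', array_keys($counts)), "\n"; // 漢字私
```

The counts are still available in $counts if you need them, just as with the mb_substr loop.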

Here, try this on for size:

    function mb_count_chars_kinda($input) {
        $l = mb_strlen($input);
        $unique = array();
        for ($i = 0; $i < $l; $i++) {
            $char = mb_substr($input, $i, 1);
            if (!array_key_exists($char, $unique))
                $unique[$char] = 0;
            $unique[$char]++;
        }
        return $unique;
    }

    function mb_string_chars_diff($one, $two) {
        $left = array_keys(mb_count_chars_kinda($one));
        $right = array_keys(mb_count_chars_kinda($two));
        return array_diff($left, $right);
    }

    print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
    /* => Array ( [5] => f [6] => g ) */

You'll want to call this twice, the second time with the left string on the right and the right string on the left. The output will differ: array_diff only gives you the items on the left that are missing from the right, so you need to do it twice to get the whole story.
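The two-call pattern described above can be wrapped up in one helper. Here is a rough sketch, assuming valid UTF-8 input. The names chars_of and mb_string_chars_both_ways are made up, and preg_split with the u modifier stands in for the mb_substr loop so that the block is self-contained.

```php
<?php
// Hypothetical helper: distinct UTF-8 characters of a string, first-seen order.
function chars_of(string $s): array
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($chars === false ? [] : $chars));
}

// Hypothetical helper: run array_diff in both directions.
function mb_string_chars_both_ways(string $one, string $two): array
{
    $left = chars_of($one);
    $right = chars_of($two);
    return [
        'only_in_first'  => array_values(array_diff($left, $right)),
        'only_in_second' => array_values(array_diff($right, $left)),
    ];
}

print_r(mb_string_chars_both_ways('aabbccddeeffgg', 'abcdexyz'));
// only_in_first: f, g; only_in_second: x, y, z
```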
