Split utf8 string into character array

Question


I am trying to split a UTF-8 string into an array of characters. The function I am using used to work, but for some reason it no longer does. What could be the reason? And better yet, how can I fix it?

This is my input string:

Zelf heb ik maar één vraag: wie ben jij?

This is my function:

    function utf8Split($str, $len = 1) {
        $arr = array();
        $strLen = mb_strlen($str);
        for ($i = 0; $i < $strLen; $i++) {
            $arr[] = mb_substr($str, $i, $len);
        }
        return $arr;
    }

This is the result:

 Array ( [0] => Z [1] => e [2] => l [3] => f [4] => [5] => h [6] => e [7] => b [8] => [9] => i [10] => k [11] => [12] => m [13] => a [14] => a [15] => r [16] => [17] => e [18] => ́ [19] => e [20] => ́ [21] => n [22] => [23] => v [24] => r [25] => a [26] => a [27] => g [28] => : [29] => [30] => w [31] => i [32] => e [33] => [34] => b [35] => e [36] => n [37] => [38] => j [39] => i [40] => j [41] => ? )

+3

php utf-8

tersmitten Feb 24 '12 at 21:15


6 answers

For the mb_... functions, you should specify the character encoding explicitly. Otherwise they fall back to the internal encoding, which may not be UTF-8.

In your code example, this concerns the following two lines in particular:

    $strLen = mb_strlen($str, 'UTF-8');
    $arr[] = mb_substr($str, $i, $len, 'UTF-8');

The complete function:

    function utf8Split($str, $len = 1) {
        $arr = array();
        $strLen = mb_strlen($str, 'UTF-8');
        for ($i = 0; $i < $strLen; $i++) {
            $arr[] = mb_substr($str, $i, $len, 'UTF-8');
        }
        return $arr;
    }

This is because you are working with UTF-8 here. However, if the input is not valid UTF-8, the function will still not work, simply because it is not meant for any other encoding.

Alternatively, you can process UTF-8 encoded strings with PCRE regular expressions. For example, this returns what you are looking for in less code:

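    $str = 'Zelf heb ik maar één vraag: wie ben jij?';
    // u: treat the subject as UTF-8; s: let . match newlines too, so line breaks are split off as well
    $chars = preg_split('/(?!^)(?=.)/us', $str);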

Besides preg_split there is also mb_split.
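The remark about incorrectly encoded input applies to the PCRE variant as well: with the u modifier, preg_split returns false when the subject is not valid UTF-8. The following is a minimal sketch of how that case could be detected. The helper name splitCodePoints is made up for illustration and is not part of the answer above.

```php
<?php
// Hypothetical helper: returns the code points of $str, or throws on invalid UTF-8.
function splitCodePoints(string $str): array
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // With the u modifier, PCRE refuses subjects that are not well-formed UTF-8.
        throw new InvalidArgumentException(
            'Input is not valid UTF-8 (PCRE error code ' . preg_last_error() . ')'
        );
    }
    return $chars;
}

// Valid UTF-8: one element per code point.
print_r(splitCodePoints("\u{00E9}n"));

// "\xE9" is é in ISO-8859-1, which is not a valid UTF-8 sequence.
try {
    splitCodePoints("\xE9n");
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

Failing loudly here makes an encoding problem visible instead of silently producing an unexpected array.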

+10

hakre Feb 24 '12 at 21:26


This is the best solution!

I found this nice solution in the PHP manual pages.

    preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

It is very fast: on PHP 5.6.18 it split a large 6 MB text file in seconds.

Best of all, it does not need multibyte string (mb_) support!

A similar answer is also here.
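Note that an empty pattern with the u modifier splits by code point, so a letter followed by a combining accent ends up as two elements, which matches the output shown in the question. As a sketch of one possible alternative (not something this answer proposes), PCRE's \X escape matches a whole grapheme cluster. The helper name splitGraphemes is hypothetical.

```php
<?php
// Hypothetical helper: one array element per grapheme cluster (user-perceived character).
function splitGraphemes(string $str): array
{
    // \X matches an extended grapheme cluster; u makes PCRE read the subject as UTF-8.
    if (preg_match_all('/\X/u', $str, $matches) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $matches[0];
}

$decomposed = "e\u{0301}e\u{0301}n"; // "één" spelled with combining acute accents

// Code-point split: each accent becomes its own element.
echo count(preg_split('//u', $decomposed, -1, PREG_SPLIT_NO_EMPTY)), "\n"; // 5

// Grapheme split: each accent stays attached to its base letter.
echo count(splitGraphemes($decomposed)), "\n"; // 3
```

Like the original one-liner, this needs only PCRE and no mb_ functions.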

+10

Yani2000 May 12, '16 at 16:02


If you are not sure whether the mbstring extension is available, use:

Version 1:

    function utf8_str_split($str = '', $len = 1) {
        // u: treat the subject as UTF-8; s: let . match newlines as well
        preg_match_all("/./us", $str, $arr);
        $arr = array_chunk($arr[0], $len);
        $arr = array_map('implode', $arr);
        return $arr;
    }

Version 2:

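    function utf8_str_split($str = '', $len = 1) {
        // Split after every run of $len characters, counted from the previous split point (\G)
        return preg_split('/(?<=\G.{' . $len . '})/us', $str, -1, PREG_SPLIT_NO_EMPTY);
    }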

Both functions were tested in PHP 5.

+4

Igor Mar 23 '12 at 15:04


In PHP, there is a multi-byte split function, mb_split.

+2

bfavaretto Feb 24 '12 at 21:22


 mb_internal_encoding("UTF-8");

46 arrays - off 41 array
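To illustrate what this call does: mb_internal_encoding sets the default encoding for later mb_ calls, so the function from the question can stay unchanged. This is a sketch that assumes the mbstring extension is loaded. It does not join a letter with a following combining accent, because mb_substr still works per code point.

```php
<?php
// Assumes the mbstring extension is loaded.
mb_internal_encoding('UTF-8');

// Same loop as in the question; no per-call encoding argument is needed now.
function utf8Split($str, $len = 1) {
    $arr = array();
    $strLen = mb_strlen($str);
    for ($i = 0; $i < $strLen; $i++) {
        $arr[] = mb_substr($str, $i, $len);
    }
    return $arr;
}

$chars = utf8Split("\u{00E9}\u{00E9}n"); // "één" with precomposed é
echo count($chars), "\n"; // 3
```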

0

user956584 Feb 24 '12 at 21:51


tersmitten · Accepted Answer · 2012-03-06T08:56:21+0000

I found out that é was not the character I was expecting. There seems to be a difference between né and ńe. I got it working by normalizing the string first.
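The answer does not show the code that was used. The following is a sketch of what normalizing first could look like, assuming the intl extension (which provides the Normalizer class) is available. The helper name utf8SplitNormalized is hypothetical. The hex dumps show why the two spellings of é behave differently.

```php
<?php
$composed   = "\u{00E9}";  // U+00E9, a single code point
$decomposed = "e\u{0301}"; // U+0065 followed by U+0301 (combining acute accent)

echo bin2hex($composed), "\n";   // c3a9
echo bin2hex($decomposed), "\n"; // 65cc81

// Hypothetical helper: normalize to NFC, then split by code point.
// Requires the intl extension, which provides the Normalizer class.
function utf8SplitNormalized(string $str): array
{
    $nfc = Normalizer::normalize($str, Normalizer::FORM_C);
    if ($nfc === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return preg_split('//u', $nfc, -1, PREG_SPLIT_NO_EMPTY);
}

if (class_exists('Normalizer')) {
    // "één" in decomposed form: 5 code points before NFC, 3 after.
    echo count(utf8SplitNormalized("e\u{0301}e\u{0301}n")), "\n"; // 3
}
```

NFC only merges sequences that have a precomposed equivalent, so some combining sequences can still span several code points after normalization.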
