Php cannot find a way to split utf-8 strings

Question

I just started with PHP, and I'm afraid I need some help figuring out how to manipulate UTF-8 strings.

I'm working on Ubuntu 11.10 x86 with PHP 5.3.6-13ubuntu3.2. I have a UTF-8 encoded file (vim's :set encoding confirms this), which I then read with

    $file = fopen("file.txt", "r");
    while (!feof($file)) {
        $line = fgets($file);
        //...
    }
    fclose($file);

mb_detect_encoding($line) reports UTF-8.
If I echo $line, I can see the line correctly (no garbled characters) in the browser, so I think everything is fine with the browser and Apache. I did search the Apache configuration for AddDefaultCharset and tried adding HTTP meta tags for the character encoding (just in case).

When I try to split the string using $arr = mb_split(';',$line), however, the fields of the resulting array contain garbled UTF-8 characters (mb_detect_encoding($arr[0]) also reports UTF-8).

So, echo $arr[0] will lead to something like this: ï»¿Î'Î˜Î—ÎÎ .

I tried setting mb_detect_order('utf-8') and mb_internal_encoding('utf-8'), but nothing changed. I also tried detecting UTF-8 manually using this w3 perl regex, because I read somewhere that mb_detect_encoding can sometimes fail (myth?), but the results were the same.

So my question is: how can I split the string correctly? Am I using mb_split incorrectly? What am I missing?

Thank you for your help!

UPDATE I am adding examples of strings and base64 equivalents (thanks @chris for his suggestion)

    1. original string: "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"
    2. base64 encoded: "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5"
    3. first part (the equivalent of "ΑΘΗΝΑ") base64 encoded before splitting: "zpHOmM6Xzp3OkQ=="
    4. first part ($arr[0] after splitting): "ï»¿Î'Î˜Î—ÎÎ'"
    5. first part after splitting base64 encoded: "77u/zpHOmM6Xzp3OkQ=="

OK, so there is a 77u/ difference between 3. and 5., which according to this is the UTF-8 byte order mark (BOM). So how can I avoid it?
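The 77u/ prefix can be checked directly. The following is a minimal sketch, assuming only the base64 strings quoted in the list above: it encodes the three BOM bytes (EF BB BF) on their own and confirms that item 5 decodes to those bytes followed by the ten bytes of "ΑΘΗΝΑ". The variable names are illustrative.

```php
<?php
// The UTF-8 byte order mark (BOM) is the three bytes EF BB BF.
$bom = "\xEF\xBB\xBF";

// Base64 encodes 3-byte groups, so the BOM alone maps to exactly 4 characters.
$bomBase64 = base64_encode($bom);

// Item 5 from the question: the first field after splitting.
$field = base64_decode("77u/zpHOmM6Xzp3OkQ==");

// Binary-safe comparison of the first three bytes against the BOM.
$hasBom = strncmp($field, $bom, 3) === 0;

// Whatever follows the BOM should equal item 3 from the question.
$rest = base64_encode(substr($field, 3));

echo $bomBase64, "\n";
var_dump($hasBom);
echo $rest, "\n";
```

Because the BOM is exactly three bytes long, it occupies its own base64 group, which is why the rest of the encoded string is unchanged between items 3 and 5.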

UPDATE 2. Today I woke up refreshed and, based on your advice, tried again. It seems that $line = fgets($file) reads the first line correctly (no garbled characters) and fails for every subsequent line. So I base64-encoded the first and second lines, and the BOM appeared only in the base64 of the first line. I then opened the offending file in vim and entered :set nobomb and :w to save the file without the BOM. Running the PHP script again showed that the first line was now garbled as well. Based on @hakre's remove_utf8_bom, I added an extra function

    function add_utf8_bom($str) {
        $bom = "\xEF\xBB\xBF";
        return substr($str, 0, 3) === $bom ? $str : $bom . $str;
    }

and voila, each line is read correctly.

I don't really like this solution, because it seems very hacky (I can't believe that a whole framework/language provides no way of dealing with strings that lack a BOM). Do you know of an alternative approach? Otherwise I'll go with the above.

Thanks to @chris, @hakre and @jacob for their time!

UPDATE 3 (solution). It turns out it was a browser thing: it was not enough to add header('Content-type: text/html; charset=UTF-8') and meta tags like <meta http-equiv="Content-type" value="text/html; charset=UTF-8" />. The output must also be properly enclosed in an <html><body> section, or the browser will not interpret the encoding correctly. Thanks to @jake for his suggestion.

Moral of the story: I should learn more about HTML before trying to code for the browser in the first place. Thank you all for your help and patience.

+4

php utf-8 multibyte mbstring

bottlenecked Dec 03 '11 at 17:39

4 answers

UTF-8 has the very nice feature of being ASCII-compatible. By this I mean that:

ASCII characters stay the same when encoded as UTF-8
no other character is encoded using ASCII bytes

This means that when you split a UTF-8 string on a semicolon ; , which is an ASCII character, you can simply use the standard single-byte string functions.

In your example, you can simply use explode(';',$utf8encodedText) and everything should work as expected.
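As a minimal sketch of that suggestion, the sample line from the question can be split with plain explode(). The only assumption is that this source file is itself saved as UTF-8, so that the string literal holds UTF-8 bytes.

```php
<?php
// Sample line from the question (this source file is assumed to be saved as UTF-8).
$line = "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889";

// explode() compares raw bytes. ';' is the single byte 0x3B, and bytes below 0x80
// never occur inside a multi-byte UTF-8 sequence, so no character gets cut in half.
$fields = explode(';', $line);

print_r($fields);

// Each Greek letter takes two bytes in UTF-8, so strlen() counts bytes, not characters.
echo strlen($fields[0]), " bytes in the first field\n";
```

Note that explode() does nothing about a leading BOM: if the line starts with one, it ends up at the start of the first field.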

PS: Since UTF-8 is a prefix-free encoding, you can actually use explode() with any UTF-8 encoded delimiter.

PPS: It seems you are trying to parse a CSV file. Have a look at the fgetcsv() function. It should work fine with UTF-8 encoded files as long as you use ASCII characters for the delimiter, quotes, etc.
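A rough sketch of the fgetcsv() route, under a few assumptions: the data is written to an in-memory stream (php://memory) as a stand-in for file.txt, the delimiter is ';', and a leading BOM is skipped by hand because fgetcsv() would otherwise leave it at the start of the first field. The PHP manual notes that fgetcsv() takes the locale into account, so results on other setups may differ.

```php
<?php
// In-memory stand-in for file.txt: a UTF-8 BOM followed by the line from the question.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "\xEF\xBB\xBF" . "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889\n");
rewind($handle);

// Skip a leading BOM if there is one; otherwise go back to the start of the stream.
if (fread($handle, 3) !== "\xEF\xBB\xBF") {
    rewind($handle);
}

$rows = array();
// Delimiter, enclosure and escape character are all single ASCII bytes.
while (($row = fgetcsv($handle, 0, ';', '"', '\\')) !== false) {
    $rows[] = $row;
}
fclose($handle);

print_r($rows);
```

Passing the enclosure and escape characters explicitly avoids relying on defaults, which newer PHP versions warn about.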

+4

Jakob egger Dec 03 '11 at 10:32

The mb_split function should be fine, but you have to set the encoding it uses as well, with mb_regex_encoding:

 mb_regex_encoding('UTF-8');

About mb_detect_encoding: it can fail, but only because an encoding can never be detected with certainty. Either you know it, or you can try to guess, but that's all. Encoding detection is mostly a gamble; you can, however, pass the strict parameter to that function and specify the encoding(s) you are looking for.

How to remove the BOM:

You can filter the string input and remove the UTF-8 BOM with a small helper function:
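To illustrate why detection is a guess while validation is not, here is a small sketch that assumes the mbstring extension is loaded. It checks a valid and a deliberately broken byte sequence against UTF-8, and shows that the valid UTF-8 bytes also pass as ISO-8859-1, which is why a detector cannot be sure. The sample byte strings are made up for the illustration.

```php
<?php
// Requires the mbstring extension.
$valid   = "\xCE\x91\xCE\x98"; // the Greek letters "ΑΘ" as UTF-8 bytes
$invalid = "\xCE";             // a lead byte without its continuation byte

// Validation answers a yes/no question for one specific encoding.
$validIsUtf8   = mb_check_encoding($valid, 'UTF-8');
$invalidIsUtf8 = mb_check_encoding($invalid, 'UTF-8');

// The same bytes are also acceptable ISO-8859-1, so "detecting" UTF-8 is only a guess.
$validIsLatin1 = mb_check_encoding($valid, 'ISO-8859-1');

// Strict detection with a single candidate returns the encoding name or false.
$strictValid   = mb_detect_encoding($valid, 'UTF-8', true);
$strictInvalid = mb_detect_encoding($invalid, 'UTF-8', true);

var_dump($validIsUtf8, $invalidIsUtf8, $validIsLatin1, $strictValid, $strictInvalid);
```

If the encoding of the input is known in advance, mb_check_encoding() is probably the more direct tool, since it never has to choose between candidates.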

    /**
     * Remove the UTF-8 BOM if the string has one at the beginning.
     *
     * @param string $str
     * @return string
     */
    function remove_utf8_bom($str)
    {
        // Compare the first three bytes against the BOM (EF BB BF).
        if (substr($str, 0, 3) === "\xEF\xBB\xBF") {
            $str = substr($str, 3);
        }
        return $str;
    }

Usage:

 $line = remove_utf8_bom($line);

There are probably better ways to do this, but it should work.
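One possible refinement, sketched here under stated assumptions: since a BOM can only appear at the very start of a file, the check can be limited to the first line of the read loop. The function name read_lines_without_bom is hypothetical, and php://memory stands in for the real file.

```php
<?php
/**
 * Hypothetical helper: read all lines from an open stream and drop a
 * UTF-8 BOM from the first line only.
 *
 * @param resource $handle
 * @return array
 */
function read_lines_without_bom($handle)
{
    $lines = array();
    $first = true;
    // fgets() returns false at end of file, so no separate feof() check is needed.
    while (($line = fgets($handle)) !== false) {
        if ($first) {
            if (strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
                $line = substr($line, 3);
            }
            $first = false;
        }
        // Strip the trailing line break so the last field stays clean.
        $lines[] = rtrim($line, "\r\n");
    }
    return $lines;
}

// In-memory stand-in for file.txt: a BOM followed by two made-up lines.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "\xEF\xBB\xBFfirst;line\nsecond;line\n");
rewind($handle);

$lines = read_lines_without_bom($handle);
fclose($handle);

print_r($lines);
```

Testing the return value of fgets() instead of looping on feof() also avoids processing a trailing false when the file ends with a line break.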

+1

hakre Dec 03 '11 at 17:43

Edit: I just read your post more closely. You're suggesting that this should output false, because you're suggesting a BOM gets introduced by mb_split().

    header('content-type: text/plain;charset=utf-8');
    $s = "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5";
    $str = base64_decode($s);
    $peices = mb_split(';', $str);
    var_dump(substr($str, 0, 10) === $peices[0]);
    var_dump($peices);

Does it? It works as expected for me (bool(true), and the strings in the array are correct).

+1

goat Dec 03 '11 at 20:04

Jakob egger · Accepted Answer · 2011-12-04T17:35:21+0000

When you write debugging/testing scripts in PHP, make sure you output a more or less valid HTML page.

I like to use a PHP file similar to the following:

    <!DOCTYPE html>
    <html>
    <head>
    <meta charset=utf-8>
    <title>Test page for project XY</title>
    </head>
    <body>
    <h1>Test Page</h1>
    <pre><?php echo print_r($_GET,1); ?></pre>
    </body>
    </html>

If you do not include the HTML tags, the browser may interpret the file as a plain text file, and all kinds of weird things can happen. In your case, I assume the browser interpreted the file as a Latin1 encoded text file. I assume it worked with the BOM because, whenever the BOM was present, the browser recognized the file as UTF-8.
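The effect of such a misinterpretation can be reproduced in PHP. This is only a sketch, assuming the mbstring extension and using ISO-8859-1 as an approximation of what the browser did (browsers usually treat Latin1 as Windows-1252, so the exact glyphs can differ). It takes the BOM plus one Greek letter as raw UTF-8 bytes and reads every byte as a separate Latin1 character.

```php
<?php
// Requires the mbstring extension.
// A BOM followed by "Α" (Greek capital alpha) as raw UTF-8 bytes: 5 bytes, 2 code points.
$utf8 = "\xEF\xBB\xBF\xCE\x91";

// Treat every byte as one ISO-8859-1 character and re-encode the result as UTF-8.
// This mimics a reader that assumes Latin1 for data that is really UTF-8.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Each original byte has become its own character.
$charCount = mb_strlen($misread, 'UTF-8');

echo $misread, "\n";
echo strlen($utf8), ' bytes were read as ', $charCount, " characters\n";
```

The output starts with the same ï»¿Î sequence seen in the question, which is consistent with the bytes having been correct all along and only the declared encoding being wrong.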
