I've just started with PHP, and I'm afraid I need help figuring out how to handle UTF-8.
I'm on Ubuntu 11.10 x86 with PHP 5.3.6-13ubuntu3.2. I have a UTF-8 encoded file (vim's :set encoding confirms this), which I then read with

    $file = fopen("file.txt", "r");
    while (!feof($file)) {
        $line = fgets($file);
        //...
    }
    fclose($file);
- mb_detect_encoding($line) reports UTF-8.
- If I do echo $line, I can see the line correctly (no garbled characters) in the browser, so I think everything is fine with the browser and Apache. I did also search the Apache configuration for AddDefaultCharset and tried adding HTTP meta tags for the character encoding (just in case).
When I try to split the string using $arr = mb_split(';', $line), the fields of the resulting array contain garbled UTF-8 characters (mb_detect_encoding($arr[0]) also reports UTF-8). So echo $arr[0] results in something like this: Î'ΘΗÎÎ.
I tried setting mb_detect_order('utf-8') and mb_internal_encoding('utf-8'), but nothing changed. I also tried detecting UTF-8 manually using this W3C Perl regex, because I read somewhere that mb_detect_encoding can sometimes fail (myth?), but the results were the same.
So my question is: how do I split the string correctly? Am I going about this the wrong way with the mb_ functions? What am I missing?
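One way to narrow this down is to look at the raw bytes of each field instead of trusting what the browser renders. The sketch below is only an illustration, not a diagnosis: it assumes the mbstring extension is loaded, swaps mb_split for a plain explode(), and takes the sample line from its base64 form given in the first update further down, so the encoding of the script file itself does not matter.

```php
<?php
// Sample line from the post, decoded from base64 so this script's own
// file encoding cannot affect the bytes.
$line = base64_decode('zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5');

// ';' is a single ASCII byte (0x3B) and never occurs inside a UTF-8
// multibyte sequence, so a byte-wise split cannot cut a character in half.
$fields = explode(';', $line);

// Print each field as hex together with a UTF-8 validity check.
foreach ($fields as $i => $field) {
    printf(
        "%d: %s (valid UTF-8: %s)\n",
        $i,
        bin2hex($field),
        mb_check_encoding($field, 'UTF-8') ? 'yes' : 'no'
    );
}
```

If every field comes out as valid UTF-8 here, the split itself is not what garbles the text, and the cause is more likely in the input bytes or in how the output is interpreted.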
Thank you for your help!
UPDATE: I am adding example strings and their base64 equivalents (thanks to @chris for the suggestion).
1. original string: "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"
2. base64 encoded: "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5"
3. first part (the equivalent of "ΑΘΗΝΑ") base64 encoded before splitting: "zpHOmM6Xzp3OkQ=="
4. first part ($arr[0] after splitting): "Î'ΘΗÎÎ'"
5. first part after splitting, base64 encoded: "77u/zpHOmM6Xzp3OkQ=="
OK, so there is a 77u/ difference between 3. and 5., which according to this is the UTF-8 byte order mark (BOM). So how can I avoid it?
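The 77u/ prefix decodes to the three bytes EF BB BF. A minimal sketch of dropping those bytes from the start of a string follows; strip_utf8_bom is a hypothetical helper name made up for this illustration, not a built-in PHP function.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM (bytes EF BB BF) if present.
function strip_utf8_bom($str)
{
    $bom = "\xEF\xBB\xBF";
    // The cast keeps the result a string even when nothing follows the BOM
    // (older PHP versions return false from substr() in that case).
    return strncmp($str, $bom, 3) === 0 ? (string) substr($str, 3) : $str;
}

// Item 5 from the list above: the first field with the BOM still attached.
$withBom = base64_decode('77u/zpHOmM6Xzp3OkQ==');

$clean = strip_utf8_bom($withBom);
echo base64_encode($clean), "\n"; // prints zpHOmM6Xzp3OkQ== (same as item 3)
```

Since a BOM can only sit at the very start of the file, it would be enough to apply this to the first line returned by fgets().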
UPDATE 2: I woke up refreshed today and, following your advice, tried again. It seems that $line = fgets($file) reads the first line correctly (no garbled characters) and fails for every subsequent line. So I base64-encoded the first and second lines, and the BOM appeared only in the base64 of the first line. Then I opened the offending file in vim and entered :set nobomb and :w to save the file without the BOM. Running the PHP script again showed that the first line was now garbled as well. Based on @hakre's remove_utf8_bom, I added an additional function
    function add_utf8_bom($str) {
        $bom = "\xEF\xBB\xBF";
        return substr($str, 0, 3) === $bom ? $str : $bom . $str;
    }
and voila, each line is read correctly.
I don't really like this solution, because it seems very hacky (I can't believe that a whole framework/language does not provide a way to deal with such strings). Do you know of an alternative approach? Otherwise, I will stick with the above.
Thanks to @chris, @hakre and @jacob for their time!
UPDATE 3 (solution): It turns out that it was a browser thing: it was not enough to add header('Content-type: text/html; charset=UTF-8') and meta tags like <meta http-equiv="Content-type" value="text/html; charset=UTF-8" />. The output must also be properly enclosed in an <html><body> section, or the browser will not interpret the encoding correctly. Thanks to @jake for the suggestion.
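For reference, a minimal sketch of what a complete page could look like. The render_page helper and the page title are hypothetical, invented for this illustration. The sketch also writes the meta tag with the content attribute, which is the attribute HTML defines for http-equiv declarations.

```php
<?php
// Send the charset in the HTTP header, before any output is produced.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Hypothetical helper: wrap body markup in a complete HTML document.
function render_page($bodyHtml)
{
    return "<!DOCTYPE html>\n"
        . "<html>\n<head>\n"
        . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />\n"
        . "<title>UTF-8 test</title>\n"
        . "</head>\n<body>\n"
        . $bodyHtml
        . "\n</body>\n</html>\n";
}

// Item 3 from the first update: the first field, without a BOM.
$city = base64_decode('zpHOmM6Xzp3OkQ==');

// Escape the field for HTML, stating its encoding explicitly.
echo render_page('<p>' . htmlspecialchars($city, ENT_QUOTES, 'UTF-8') . '</p>');
```

With the document structure in place, the BOM workaround from the previous update should no longer be needed to get the lines displayed correctly, going by the conclusion above.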
The moral of the story: I should learn more about HTML before trying to write code for the browser. Thank you all for your help and patience.