First of all, I have read another post about PHP's mb_detect_encoding, "Strange behavior of mb_detect_order() in PHP", which confirmed what I had already learned through trial and error. However, there are a few more things that confuse me.
I write HTML scrapers for mostly English sites; they collect data and store it as UTF-8 XML. I ran into a problem when a page declares its encoding as ISO-8859-1 but contains characters unique to Windows-1252, in particular the right single quote (’), byte 0x92. As I understand it, Windows-1252 is a superset of ISO-8859-1 as far as printable characters go, which makes me wonder: why use utf8_encode() at all? Why not just use iconv('Windows-1252', 'UTF-8', $str) instead of utf8_encode(), since everything representable in ISO-8859-1 would still be converted, as well as the characters unique to Windows-1252 (e.g. €, ƒ, ’, “, ”)?
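To make that concrete, here is a minimal sketch of the conversion I have in mind. The sample bytes in `$raw` are made up for illustration, and I use mb_convert_encoding() with 'ISO-8859-1' as a stand-in for utf8_encode(), on the assumption that both treat the input as ISO-8859-1.

```php
<?php
// Hypothetical scraped bytes: the page declares ISO-8859-1, but 0x92 and 0x80
// only make sense as Windows-1252 (right single quote and euro sign).
$raw = "It\x92s 5\x80 caf\xE9";

// Treat the input as Windows-1252 regardless of the declared charset.
$utf8 = iconv('Windows-1252', 'UTF-8', $raw);

// Treating the same bytes as ISO-8859-1 maps 0x92 and 0x80 to the
// C1 control characters U+0092 and U+0080 instead of printable characters.
$viaLatin1 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

echo $utf8, PHP_EOL;
var_dump($viaLatin1 === "It\u{92}s 5\u{80} caf\u{E9}");
```

The é (0xE9) comes out the same either way; only the bytes in the 0x80 to 0x9F range differ between the two conversions.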
In addition, consider this:
    $ansi = "€"; // euro sign; the code file itself is saved as ANSI
    $detected = mb_detect_encoding($ansi, "WINDOWS-1252");           // $detected == "Windows-1252"
    $detected = mb_detect_encoding('a'.$ansi, "WINDOWS-1252");       // $detected == FALSE
    $detected = mb_detect_encoding($ansi.'a', "WINDOWS-1252");       // $detected == "Windows-1252"
    $detected = mb_detect_encoding($ansi.'a', "WINDOWS-1252", TRUE); // $detected == FALSE
Why is this happening? If the first character of the string is not specific to Windows-1252, detection fails even though the rest of the string is. Doesn't that make it pretty worthless? And how does it distinguish between ISO-8859-1 and Windows-1252?
The other thing that baffles me: say I want to detect the encoding among ASCII, ISO-8859-1, Windows-1252 and UTF-8. Is it possible to detect strings in such a way that I get the lowest (most restrictive) encoding that fits? That is:
    $ascii = "123";                               // desired detect result == 'ASCII'
    $iso   = "é".$ascii;                          // desired detect result == 'ISO-8859-1'
    $ansi  = "€".$iso;                            // desired detect result == 'Windows-1252'
    $utf8  = file_get_contents('utf8.txt', true); // $utf8 == '你好123é€', desired detect result == 'UTF-8'
Shouldn't my array then be $detect_order = array('ASCII', 'ISO-8859-1', 'Windows-1252', 'UTF-8')? I know this is not right, because it gave me the following results:
    $ascii == 'ASCII'
    $iso   == 'ISO-8859-1'
    $ansi  == 'ISO-8859-1'
    $utf8  == 'ISO-8859-1'
Why is my detection order ("ASCII", "ISO-8859-1", "Windows-1252", "UTF-8") incorrect for what I want to get?
The closest I got to the desired results was:
    $ascii == 'ASCII'
    $iso   == 'ISO-8859-1'
    $ansi  == 'ISO-8859-1'
    $utf8  == 'UTF-8'
Both of the following mb_detect_order arrays gave me the above values:
    $detect_order = array('ASCII', 'UTF-8', 'Windows-1252', 'ISO-8859-1');
    $detect_order = array('ASCII', 'UTF-8', 'ISO-8859-1', 'Windows-1252');
It baffles me!
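To make the result I am after concrete, this is roughly the hand-rolled check I would otherwise fall back on. It is only a sketch: `lowestEncoding` is a name I made up, and it assumes that any string which happens to be valid UTF-8 really is UTF-8, which a short Windows-1252 string could satisfy by coincidence.

```php
<?php
// Hypothetical helper: return the most restrictive of the four encodings
// that is consistent with the bytes in $s.
function lowestEncoding(string $s): string
{
    // No byte above 0x7F: plain ASCII.
    if (!preg_match('/[\x80-\xFF]/', $s)) {
        return 'ASCII';
    }
    // High bytes that form valid UTF-8 sequences: assume UTF-8.
    if (mb_check_encoding($s, 'UTF-8')) {
        return 'UTF-8';
    }
    // Bytes 0x80-0x9F are control codes in ISO-8859-1 but printable
    // characters (such as the euro sign, 0x80) in Windows-1252.
    if (preg_match('/[\x80-\x9F]/', $s)) {
        return 'Windows-1252';
    }
    return 'ISO-8859-1';
}

echo lowestEncoding("123"), PHP_EOL;          // plain digits
echo lowestEncoding("\xE9123"), PHP_EOL;      // é as a single byte
echo lowestEncoding("\x80\xE9123"), PHP_EOL;  // euro sign as byte 0x80
```

Note that the UTF-8 check has to come before the single-byte checks here, because any byte sequence at all is acceptable as ISO-8859-1.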
Can someone shed some light on this? Thanks, I appreciate it!