PHP mb_detect_encoding()

First of all, I have already read another post about mb_detect_encoding, "Strange behavior of mb_detect_order() in PHP", which confirms what I had learned through trial and error. However, a few things still confuse me.

I write HTML scrapers for mostly English-language sites; they collect data and store it as UTF-8 XML. I ran into a problem when a page declares ISO-8859-1 as its encoding but contains characters unique to Windows-1252, in particular the right single quotation mark (’, 0x92). As I understand it, Windows-1252 is a superset of ISO-8859-1, which makes me wonder: why use utf8_encode() at all? Why not just use iconv('Windows-1252', 'UTF-8', $str) instead, since everything representable in ISO-8859-1 will be converted, as well as the characters unique to Windows-1252 (e.g. €, ƒ, and the curly quotes)?

Also:

    $ansi = "€"; // euro sign; the source file itself is saved as ANSI
    $detected = mb_detect_encoding($ansi, "WINDOWS-1252");           // $detected == "Windows-1252"
    $detected = mb_detect_encoding('a'.$ansi, "WINDOWS-1252");       // $detected == FALSE
    $detected = mb_detect_encoding($ansi.'a', "WINDOWS-1252");       // $detected == "Windows-1252"
    $detected = mb_detect_encoding($ansi.'a', "WINDOWS-1252", TRUE); // $detected == FALSE

Why is this happening? If the first character of the string is not a Windows-1252 character, detection fails even though the rest of the string is? Doesn't that make it pretty worthless? And how do I tell ISO-8859-1 and Windows-1252 apart?

The other thing that baffles me: say I want to detect the encoding among ASCII, ISO-8859-1, Windows-1252, and UTF-8. Is it possible to detect strings in a way that gives me the lowest common denominator? That is:

    $ascii = "123";      // desired detect result == 'ASCII'
    $iso = "é".$ascii;   // desired detect result == 'ISO-8859-1'
    $ansi = "€".$iso;    // desired detect result == 'Windows-1252'
    $utf8 = file_get_contents('utf8.txt', true); // $utf8 == '你好123é€', desired detect result == 'UTF-8'

Shouldn't my array be $detect_order = array('ASCII', 'ISO-8859-1', 'Windows-1252', 'UTF-8')? I know this is not right, because it gave me the following results:

    $ascii == 'ASCII'
    $iso   == 'ISO-8859-1'
    $ansi  == 'ISO-8859-1'
    $utf8  == 'ISO-8859-1'

Why is my detection order ("ASCII", "ISO-8859-1", "Windows-1252", "UTF-8") incorrect for what I want to get?

The closest I got to the desired return values was:

    $ascii == 'ASCII'
    $iso   == 'ISO-8859-1'
    $ansi  == 'ISO-8859-1'
    $utf8  == 'UTF-8'

Both of the following mb_detect_order arrays gave me the values above:

    $detect_order = array('ASCII', 'UTF-8', 'Windows-1252', 'ISO-8859-1');
    $detect_order = array('ASCII', 'UTF-8', 'ISO-8859-1', 'Windows-1252');

It baffles me!

Can someone shed some light on this? Thanks, I appreciate it!
+4
3 answers

This is a known bug.

Detection of Windows-1251 and Windows-1252 succeeds only if the entire string consists of high-byte characters in a certain range. This means you will never get the correct conversion by relying on detection, because the text will be reported as ISO-8859-1 even when it is actually Windows-1252.

I ran into this problem when migrating from LATIN1 to UTF-8. I had a lot of content pasted from Microsoft Word and stored in VARCHAR fields of a MySQL table using the LATIN1 charset. As you probably know, Word converts apostrophes and quotes into smart apostrophes and curly quotes. None of them displayed on screen, because these characters were converted incorrectly: the text was always detected as ISO-8859-1. To solve the problem, I forced the conversion from Windows-1252 to UTF-8, and the apostrophes, quotation marks, and other characters were all converted correctly.
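The forced conversion described above could look roughly like the following. This is only a sketch: the sample string and variable names are made up for illustration, and it assumes the mbstring extension is available (the answer does not say which conversion function was actually used).

```php
<?php
// Bytes as they might come out of a LATIN1 column holding text pasted from Word.
// In Windows-1252, 0x93/0x94 are the curly double quotes and 0x92 is the curly apostrophe.
$raw = "\x93It\x92s fine\x94";

// Treating the bytes as ISO-8859-1 maps 0x80-0x9F to invisible C1 control characters.
$asLatin1 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

// Treating the same bytes as Windows-1252 yields the intended punctuation.
$asCp1252 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // hex dump, because the control characters do not print
echo $asCp1252, "\n";
```

The hex dump of the first result contains `c293`, `c292`, and `c294`, which are the UTF-8 forms of control characters U+0093, U+0092, and U+0094. That matches the symptom of quotes that never show up on screen.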

+2
source

Not sure if I will answer all your questions, but here we go:

As I understand it, Windows-1252 is a superset of ISO-8859-1, which makes me wonder: why use utf8_encode() at all? Why not just use iconv('Windows-1252', 'UTF-8', $str) instead, since everything representable in ISO-8859-1 will be converted, as well as the characters unique to Windows-1252

You don't need to bother with utf8_encode(). Use iconv() or mb_convert_encoding() instead. utf8_encode() only converts ISO-8859-1 to UTF-8; if you need to convert from other encodings, you have to use other functions.
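To see how much the two source encodings actually differ, one can convert every single byte both ways and compare. The sketch below assumes the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
// Collect the bytes for which ISO-8859-1 and Windows-1252 give different UTF-8 output.
$differing = [];
for ($byte = 0x00; $byte <= 0xFF; $byte++) {
    $char = chr($byte);
    $fromLatin1 = mb_convert_encoding($char, 'UTF-8', 'ISO-8859-1');
    $fromCp1252 = mb_convert_encoding($char, 'UTF-8', 'Windows-1252');
    if ($fromLatin1 !== $fromCp1252) {
        $differing[] = sprintf('0x%02X', $byte);
    }
}

// Every reported byte lies in the 0x80-0x9F block;
// 0x00-0x7F and 0xA0-0xFF convert identically under both encodings.
echo implode(' ', $differing), "\n";
```

Because the two encodings agree everywhere outside 0x80-0x9F, converting from Windows-1252 loses nothing for ordinary ISO-8859-1 text, which supports the approach suggested in the question. A few bytes in that block (such as 0x81) are undefined in Windows-1252, and how they are handled depends on the conversion library.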

As for the euro sign: I'm not sure whether it was at some point added (formally or informally) to ISO-8859-1, but detection succeeds for both encodings below:

    $ansi = "€"; // euro sign; the source file itself is saved as ANSI
    $detected = mb_detect_encoding($ansi, "WINDOWS-1252", TRUE); // $detected == "Windows-1252"
    echo $detected."<br/>-<br/>";
    $detected = mb_detect_encoding($ansi, "ISO-8859-1", TRUE);   // $detected == "ISO-8859-1"
    echo $detected."<br/>-<br/>";
    $detected = mb_detect_encoding($ansi, "WINDOWS-1252");       // $detected == "Windows-1252"
    echo $detected."<br/>-<br/>";
    $detected = mb_detect_encoding($ansi, "ISO-8859-1");         // $detected == "ISO-8859-1"
    echo $detected."<br/>-<br/>";

Note that the result is the same whether strict is TRUE or FALSE. This may explain why

Shouldn't my array be $detect_order = array('ASCII', 'ISO-8859-1', 'Windows-1252', 'UTF-8')? I know this is not right, because it gave me the following results.

gives you ISO-8859-1. I noticed that you moved UTF-8 ahead of ISO-8859-1 in the last order, which is why it gave you UTF-8 in the end.

Why is my detection order ("ASCII", "ISO-8859-1", "Windows-1252", "UTF-8") incorrect for what I want to get?

According to the PHP manual, http://us3.php.net/manual/en/function.mb-detect-order.php , putting ISO-8859-1 before UTF-8 will always return ISO-8859-1. See the "useless detect order" example there.
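The reason is that every possible byte is a valid ISO-8859-1 character, so that candidate can never be ruled out. A small sketch with mb_check_encoding() illustrates this; it assumes the mbstring extension, and the sample strings are written as raw byte escapes so the result does not depend on how the file is saved.

```php
<?php
$samples = [
    'utf8'   => "\xE4\xBD\xA0\xE5\xA5\xBD123", // "你好123" encoded as UTF-8
    'latin1' => "\xE9123",                     // "é123" encoded as ISO-8859-1
    'cp1252' => "\x80\xE9123",                 // "€é123" encoded as Windows-1252
];

$results = [];
foreach ($samples as $label => $bytes) {
    $results[$label] = [
        'ISO-8859-1' => mb_check_encoding($bytes, 'ISO-8859-1'),
        'UTF-8'      => mb_check_encoding($bytes, 'UTF-8'),
    ];
}

// ISO-8859-1 validates all three samples; UTF-8 validates only the first one.
var_export($results);
```

Since UTF-8 rejects anything that is not well-formed UTF-8 while ISO-8859-1 accepts everything, the stricter encoding has to come first in the detection order to have any chance of being chosen.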

From what I have seen, if the list contains both ISO-8859-1 and Windows-1252, you get ISO-8859-1 back. If you remove one of them, you get whichever one remains. So the relative position of the last two entries below does not seem to matter:

    $detect_order = array('ASCII', 'UTF-8', 'Windows-1252', 'ISO-8859-1');
    $detect_order = array('ASCII', 'UTF-8', 'ISO-8859-1', 'Windows-1252');
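If detection order alone cannot separate ISO-8859-1 from Windows-1252, one possible workaround is to classify the bytes by hand. The helper below, guessNarrowestEncoding(), is hypothetical (it is not part of mbstring and not something proposed in this thread), and it rests on the assumption that bytes 0x80-0x9F in single-byte text are Windows-1252 printable characters rather than ISO-8859-1 control codes.

```php
<?php
// Hypothetical helper: returns the narrowest of the four candidate
// encodings that is consistent with the raw bytes in $bytes.
function guessNarrowestEncoding(string $bytes): string
{
    // Only 7-bit bytes: plain ASCII.
    if (preg_match('/^[\x00-\x7F]*\z/', $bytes) === 1) {
        return 'ASCII';
    }
    // Well-formed multi-byte sequences rarely occur by accident in single-byte text.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // 0x80-0x9F: control codes in ISO-8859-1, printable characters in Windows-1252.
    if (preg_match('/[\x80-\x9F]/', $bytes) === 1) {
        return 'Windows-1252';
    }
    return 'ISO-8859-1';
}

echo guessNarrowestEncoding("123"), "\n";                          // ASCII
echo guessNarrowestEncoding("\xE9123"), "\n";                      // ISO-8859-1
echo guessNarrowestEncoding("\x80\xE9123"), "\n";                  // Windows-1252
echo guessNarrowestEncoding("\u{4F60}\u{597D}123\u{E9}\u{20AC}"), "\n"; // UTF-8
```

This is a heuristic, not real detection: it cannot tell Windows-1252 from other single-byte code pages such as Windows-1251, so it only makes sense when the candidates are limited to these four.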

+1

The € character is not part of the ISO-8859-1 encoding!

You must write it as &euro;!

Or encode the text as Windows-1252 or ISO-8859-15 (nearly the same as ISO-8859-1, but it includes the € symbol).
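As a quick check of where the euro sign (U+20AC) sits in each encoding mentioned in this thread, the following sketch prints its byte representation. It assumes the mbstring extension and PHP 7 or later for the "\u{...}" escape.

```php
<?php
$euro = "\u{20AC}"; // the euro sign as a UTF-8 string

echo bin2hex($euro), "\n";                                               // e282ac
echo bin2hex(mb_convert_encoding($euro, 'Windows-1252', 'UTF-8')), "\n"; // 80
echo bin2hex(mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8')), "\n";  // a4
echo htmlentities($euro, ENT_QUOTES, 'UTF-8'), "\n";                     // &euro;
```

UTF-8 encodes the character in three bytes, Windows-1252 and ISO-8859-15 each give it a single byte, and the HTML entity is only needed when the target encoding has no slot for it.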

-3

Source: https://habr.com/ru/post/1381707/

