Purpose of the utf8_encode function

Suppose I encode my files as UTF-8.

In a PHP script, a string is then compared:

 $string = "ぁ";
 $string = utf8_encode($string); // Do I need this step?
 if (preg_match('/ぁ/u', $string)) {
     // Do something if it matches...
 }

Is this string really UTF-8 without the utf8_encode() function? If I encode my files as UTF-8, is this function unnecessary?

+6
4 answers

If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a terrible misnomer, as it suggests some kind of automatic encoding step that is always necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string contains the character "あ" encoded in UTF-8. No further action is required. In fact, trying to convert a UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
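The garbling is easy to observe at the byte level. The following is a minimal sketch, not a recommendation: it writes "あ" as explicit UTF-8 bytes so the result does not depend on how the script file is saved, and it silences the deprecation notice that utf8_encode() raises as of PHP 8.2.

```php
<?php
// "あ" (U+3042) as explicit UTF-8 bytes.
$string = "\xE3\x81\x82";

// utf8_encode() treats every input byte as one ISO-8859-1 character,
// so each of the three bytes is re-encoded as its own two-byte sequence.
// The function is deprecated as of PHP 8.2; @ only hides that notice here.
$garbled = @utf8_encode($string);

echo bin2hex($string), "\n";   // e38182
echo bin2hex($garbled), "\n";  // c3a3c281c282
```

The second line is no longer the UTF-8 encoding of "あ"; it is the UTF-8 encoding of three unrelated Latin-1 characters.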

To elaborate a little, your source code is read as a sequence of bytes. PHP interprets the parts that matter to it (keywords, operators and so on) as ASCII. UTF-8 is backward compatible with ASCII, which means that all "regular" ASCII characters are represented by the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether the file is supposed to be ASCII or UTF-8. Anything between the quotation marks PHP simply takes as a literal sequence of bits. Therefore, PHP sees your "あ" as "11100011 10000001 10000010". It does not care what exactly is between the quotation marks; it just uses it as is.
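A small sketch that prints those bits, assuming the literal holds the UTF-8 bytes of "あ" (written here as escapes so the file encoding does not matter):

```php
<?php
// The three UTF-8 bytes of "あ" (U+3042).
$string = "\xE3\x81\x82";

// Format each byte of the string as eight binary digits.
$bits = [];
foreach (str_split($string) as $byte) {
    $bits[] = sprintf('%08b', ord($byte));
}

echo implode(' ', $bits), "\n"; // 11100011 10000001 10000010
echo strlen($string), "\n";     // 3, because strlen() counts bytes
```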

+10

PHP doesn't care about string encoding at all; strings are binary data in PHP. So you have to know the encoding of the data inside a string whenever the encoding matters. The question is, does the encoding matter in your case?

If you set the contents of string variables to something like:

 $string="ぁ"; 

It will not contain UTF-8. Instead, it contains a binary sequence that is not a valid UTF-8 character, which is why the browser or editor displays a question mark or similar. So before you even start, you can already see that something may not be as intended. (It turned out that the question mark was caused by a missing font on my end.)

It also shows that your editor supports UTF-8 or some other Unicode encoding. Just remember the following: one file, one encoding. If you store a string inside a file, it is stored in that file's encoding. Check in your editor which encoding the file is saved in; then you know the encoding of the string.

Suppose this really is valid UTF-8 (and something my font supports):

 $string="ä"; 

Then you can do a binary string comparison later:

 if ('ä' === $string) {
     // do your stuff
 }

Since both literals live in the same file and PHP strings are binary, this works with every encoding. So you normally do not need to re-encode (change the encoding of) your data as long as you use functions that are binary-safe, meaning they do not alter or interpret the data's encoding.
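To illustrate what "binary comparison" means here, the sketch below spells out "ä" in two encodings as explicit bytes. The variable names are only for illustration.

```php
<?php
$utf8   = "\xC3\xA4"; // "ä" encoded as UTF-8 (two bytes)
$latin1 = "\xE4";     // "ä" encoded as ISO-8859-1 (one byte)

// === compares the raw bytes, so the same encoding on both sides matches...
$sameEncoding = ($utf8 === "\xC3\xA4");

// ...while the same character in two different encodings does not.
$mixedEncoding = ($utf8 === $latin1);

var_dump($sameEncoding);  // bool(true)
var_dump($mixedEncoding); // bool(false)
var_dump(strlen($utf8), strlen($latin1)); // int(2) int(1)
```

This is why the comparison only behaves as expected when both strings come from sources that use the same encoding.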

For regular expressions, the encoding does play a role. That is why there is the u modifier, which tells PCRE to treat both the pattern and the subject string as UTF-8. However, if the data is already UTF-8, you do not need to convert it before calling preg_match. And in your code example a regular expression is not needed at all; a simple string comparison does the job.
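A rough sketch of what the u modifier changes, using "ぁ" written as explicit UTF-8 bytes (the variable names are made up for the example):

```php
<?php
// "ぁ" (U+3041) as explicit UTF-8 bytes.
$string = "\xE3\x81\x81";

// Without u, "." matches a single byte, so a 3-byte string is not "one character".
$byteMode = preg_match('/^.$/', $string);

// With u, "." matches one whole UTF-8 encoded character.
$unicodeMode = preg_match('/^.$/u', $string);

// A literal UTF-8 pattern matches a UTF-8 subject without any conversion.
$literal = preg_match("/\xE3\x81\x81/u", $string);

// With u, a subject that is not valid UTF-8 makes preg_match() return false.
$invalid = preg_match('/./u', "\xE3\x81");

var_dump($byteMode, $unicodeMode, $literal, $invalid);
// int(0) int(1) int(1) bool(false)
```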

Summary:

 $string = "ä";
 if ('ä' === $string) {
     // do your stuff
 }
+3

Your string is not a UTF-8 character, so the pattern cannot match it, which is why you need utf8_encode. Try saving the PHP file as UTF-8 (with something like Notepad++) and it may work without it.

+1

Summary:

The utf8_encode() function encodes every byte of the given string as UTF-8, no matter what encoding the file was originally saved in. Its purpose is to encode strings¹ that aren't UTF-8 yet.

¹ Correct use of this function means passing it an ISO-8859-1 string. Why? Because the first 256 Unicode code points are exactly the ISO-8859-1 characters, at the same positions.

 Encoding       Char  Byte  utf8_encode() output  Is that the UTF-8 encoding of the char?
 Windows-1252   €     80    C2 80                 No
 ISO-8859-1     ¢     A2    C2 A2                 Yes

The function may appear to work with other encodings: it does, as long as the input string contains only characters whose byte values are the same as in ISO-8859-1 (for example, in Windows-1252, the ranges 00-7F and A0-FF).
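The byte-by-byte behavior can be sketched in plain PHP. The helper below, latin1_to_utf8(), is hypothetical and written only for illustration; it is not part of PHP and only approximates what the answer describes: each byte is taken as the code point with the same number and then encoded as UTF-8.

```php
<?php
// Hypothetical helper for illustration only (not a PHP built-in).
function latin1_to_utf8(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $code = ord($byte);
        if ($code < 0x80) {
            // ASCII bytes are identical in ISO-8859-1 and UTF-8.
            $out .= $byte;
        } else {
            // Code points 0x80-0xFF become a two-byte UTF-8 sequence.
            $out .= chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
        }
    }
    return $out;
}

echo bin2hex(latin1_to_utf8("\xA2")), "\n"; // c2a2, the UTF-8 encoding of "¢"
echo bin2hex(latin1_to_utf8("\x80")), "\n"; // c280, not the UTF-8 encoding of "€" (e282ac)
```

Byte A2 comes out right because "¢" sits at position A2 in both ISO-8859-1 and Unicode. Byte 80 comes out wrong for Windows-1252 input because "€" is U+20AC in Unicode, not U+0080.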

Bear in mind that if the function receives a string that is already UTF-8 (for example, a literal from a file saved as UTF-8), it will encode that UTF-8 string a second time and produce garbage.

0

Source: https://habr.com/ru/post/892795/

