Parse XML with special characters (UTF-8)

I start with some XML that looks like this (simplified):

<?xml version="1.0" encoding="UTF-8"?>
<alldata>
  <data name="Forsetì" />
</alldata>

But after I parse it with simplexml_load_string, the special character (the ì) becomes Ã¬, which is obviously garbled.

Is there any way to prevent this?

I know the XML itself is fine: when I save it as a .txt file and view it in a browser, the characters display correctly. It is only when I run simplexml_load_string on the XML and then save the values to a text file or a database that they get garbled.

+4
5 answers

It looks like SimpleXML is producing a UTF-8 string, which is then being rendered as ISO-8859-1 (Latin-1) or something close to it, such as CP-1252.

When you save the result to a file and serve this file through a web server, the browser will use the encoding declared in the file.
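One way to confirm this diagnosis is to look at the raw bytes SimpleXML hands back rather than at how they render. A minimal sketch, assuming the question's sample XML with the stray closing `</xml>` dropped so that it parses (the variable names are illustrative):

```php
<?php
// The question's sample document; "\xC3\xAC" is the UTF-8 encoding of ì (U+00EC).
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<alldata><data name="Forset' . "\xC3\xAC" . '" /></alldata>';

$doc  = simplexml_load_string($xml);
$name = (string) $doc->data['name'];

// Dump the last two bytes of the attribute value as hex.
$tailHex = bin2hex(substr($name, -2));
echo $tailHex, "\n"; // c3ac
```

If this prints c3ac, the string is intact UTF-8, and only the display or storage layer is misreading it.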

Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, for example ISO-8859-1 (Latin-1).

This is easy to do with iconv():

  $xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout); 

Saving to a database
Your database column apparently does not use a UTF-8 character set and collation, so you should use iconv to convert the string to the encoding your database uses.

Assuming your database encoding is the same as the encoding you render in, you don't need to do anything when reading from the database.

Explanation
In UTF-8, the lead bytes 0xC2 and 0xC3 are used to reach the Latin-1 Supplement block (U+0080 to U+00FF). 0xC2 covers the lower half, which includes currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark signs, and the non-breaking space; 0xC3 covers the upper half, which holds the accented letters.

However, in ISO-8859-1 the byte 0xC2 represents Â and 0xC3 represents Ã. Therefore, when your UTF-8 string is misinterpreted as ISO-8859-1, you get Â or Ã followed by some other nonsensical character. Your ì is the byte pair 0xC3 0xAC, which shows up as Ã¬.
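The misreading can be reproduced deliberately, which makes it easier to recognize. A small sketch using only iconv, with the question's "Forsetì" as sample data (the variable names are illustrative):

```php
<?php
$original = "Forset\xC3\xAC"; // "Forsetì" as UTF-8 bytes

// Treat the UTF-8 bytes as ISO-8859-1 and re-encode them as UTF-8.
// This is effectively what a Latin-1 viewer does to the string.
$garbled = iconv('ISO-8859-1', 'UTF-8', $original);
echo $garbled, "\n"; // ForsetÃ¬

// The damage is reversible as long as no bytes were dropped along the way.
$restored = iconv('UTF-8', 'ISO-8859-1', $garbled);
var_dump($restored === $original); // bool(true)
```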

+6

Your XML is most likely fine; the character gets garbled when it is saved or output.

If you output the data in an HTML page, make sure the page is encoded in UTF-8 as well. If your HTML page is in ISO-8859-1, you can use utf8_decode as a quick fix (note that it is deprecated as of PHP 8.2); using UTF-8 throughout is the better option in the long run.
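As a rough sketch of the UTF-8 route, the page can declare its encoding both in the HTTP header and in the markup. The renderPage helper below is hypothetical and only serves the illustration:

```php
<?php
// Hypothetical helper: wraps a UTF-8 value in a page that declares UTF-8.
function renderPage(string $name): string
{
    // Escape markup characters; valid UTF-8 such as ì passes through unchanged.
    $safe = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');
    return "<!DOCTYPE html>\n<html><head><meta charset=\"UTF-8\"></head>"
         . "<body><p>{$safe}</p></body></html>\n";
}

// Declare the encoding in the HTTP header too; it takes precedence over <meta>.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo renderPage("Forset\xC3\xAC");
```

With the header and the meta tag both saying UTF-8, the string from SimpleXML can be echoed as-is, with no conversion step.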

If you store the data in MySQL, UTF-8 needs to be selected as the encoding throughout: as the connection encoding, and in the tables and columns you insert the data into.
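A sketch of what "throughout" can look like with PDO. The utf8Dsn helper, the data_names table, and the host and database names are all assumptions made up for this example; the connection lines are left as comments because they need a running server:

```php
<?php
// Hypothetical helper: builds a PDO DSN that sets the connection encoding.
function utf8Dsn(string $host, string $dbName): string
{
    // utf8mb4 is MySQL's full UTF-8; the legacy "utf8" stores at most 3 bytes per character.
    return "mysql:host={$host};dbname={$dbName};charset=utf8mb4";
}

// Table and column encoding, declared when the table is created.
$createTable = 'CREATE TABLE data_names (
    name VARCHAR(255) NOT NULL
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci';

echo utf8Dsn('localhost', 'example'), "\n";
// With a real server:
//   $pdo = new PDO(utf8Dsn('localhost', 'example'), $user, $password);
//   $pdo->exec($createTable);
```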

0

I had similar problems, and they came from the PHP script itself. Make sure the script file is saved as UTF-8. If that still doesn't help, try printing the variable through utf8_encode or utf8_decode.

0

XML is strict when it comes to entities: for example, & should be &amp; and ì should be &igrave;

So, you will need a translation table.

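 function xml_entity_decode($_string) {
     // Set up XML translation table: numeric character reference => character
     $_xml = array();
     $_xl8 = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'UTF-8');
     // each() was removed in PHP 8, so iterate with foreach instead.
     foreach (array_keys($_xl8) as $_key) {
         // mb_ord() (PHP 7.2+, mbstring) returns the full code point;
         // ord() would only see the first byte of a multi-byte UTF-8 key.
         $_xml['&#' . mb_ord($_key, 'UTF-8') . ';'] = $_key;
     }
     return strtr($_string, $_xml);
 }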
0

Late to the party, but I ran into this and solved it as shown below.

You have declared the encoding in the XML, so if you load the XML file using DOMDocument, this will not cause any problems.
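To illustrate why the declaration matters, here is a small sketch in which the document is deliberately declared and encoded as ISO-8859-1. DOMDocument honors the declaration and still returns the attribute as UTF-8. The sample document is an assumption modeled on the question's XML:

```php
<?php
// Same structure as the question's XML, but declared and encoded as ISO-8859-1:
// "\xEC" is ì in Latin-1.
$latin1Xml = '<?xml version="1.0" encoding="ISO-8859-1"?>'
           . '<alldata><data name="Forset' . "\xEC" . '" /></alldata>';

$dom = new DOMDocument();
$dom->loadXML($latin1Xml);

// DOM methods return UTF-8 regardless of the source encoding.
$name = $dom->getElementsByTagName('data')->item(0)->getAttribute('name');
echo bin2hex(substr($name, -2)), "\n"; // c3ac
```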

If it does happen in some other case, you can use html_entity_decode, as shown below:

 html_entity_decode($xml->saveXML()); 
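html_entity_decode decodes into a target encoding, so it can help to pass the flags and the encoding explicitly rather than rely on defaults. A minimal sketch on a literal string instead of saveXML() output (the sample string is made up for illustration):

```php
<?php
// One named and one numeric reference to the same character, ì (U+00EC).
$encoded = 'Forset&igrave; and Forset&#236;';

// Decode both kinds of reference into UTF-8 bytes.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
echo $decoded, "\n"; // Forsetì and Forsetì
```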
0

Source: https://habr.com/ru/post/1302637/

