XML Parsing - PHP Coding

I have a large XML file (> 15 MB) that I need to read and parse, saving some of its values in the database. My problem is that the XML arrives in different encodings (UTF-8, ISO-8859-1).

There are no problems with UTF-8, but ISO-8859-1 gives me huge problems! Some tag names contain special characters that are not processed correctly by XMLReader and readOuterXML().

I have already tried this, with no luck:

    $xml = new XMLReader;
    $xml->open($import_file, 'ISO-8859-1');

I also tried:

  • utf8_encode
  • mb_convert_encoding($stringXML, 'UTF-8');
  • iconv("ISO-8859-1", "UTF-8//TRANSLIT", $stringXML);

XML (simplified):

  • tag (id) → no problem
  • tag (baños) → problem

XML:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <data>
        <id><![CDATA[5531]]></id>
        <baños><![CDATA[0]]></baños>
    </data>

None of the approaches above helped.

+5
6 answers

What is your internal encoding in PHP? You can check it with echo mb_internal_encoding();

If it is UTF-8, then mb_convert_encoding($data, "UTF-8") does nothing, because the third parameter, $from_encoding, defaults to the internal encoding, which is already "UTF-8".

You must provide the source encoding as the third parameter of the function.

So maybe this will do the trick:

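    // Detect which encoding the data uses. The candidates are listed explicitly,
    // because the default detection order may not include ISO-8859-1.
    $encoding = mb_detect_encoding($data, ['UTF-8', 'ISO-8859-1'], true);
    if ($encoding !== false && $encoding !== 'UTF-8') {
        // Specify the encoding to convert from.
        $data = mb_convert_encoding($data, 'UTF-8', $encoding);
    }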
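When the source encoding is already known, detection can be skipped entirely. The following is only a minimal sketch under that assumption; toUtf8 is a hypothetical helper name, not part of mbstring, and the sample string stands in for real file contents.

```php
<?php
// Hypothetical helper: convert a string from a known source encoding to UTF-8.
function toUtf8(string $data, string $sourceEncoding): string
{
    // Passing the source encoding explicitly avoids the fallback
    // to the internal encoding described above.
    return mb_convert_encoding($data, 'UTF-8', $sourceEncoding);
}

// "ba\xF1os" is "baños" in ISO-8859-1 (0xF1 is ñ).
$utf8 = toUtf8("ba\xF1os", 'ISO-8859-1');
echo bin2hex($utf8), PHP_EOL; // 6261c3b16f73 (ñ became the two bytes c3 b1)
```

Printing the result with bin2hex makes the conversion visible regardless of the terminal's own encoding.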
0

As @Evert noted, the byte in question is 0x96, which means the encoding of your XML file is actually MacRoman (see table here).

If you want to convert your data to UTF-8 format, here is what you need to do:

    $stringXML = file_get_contents('yourFile.xml');
    $data = iconv('MACINTOSH', 'UTF-8', $stringXML);

Another option is to use iconv on the command line:

 iconv -f MACINTOSH -t UTF-8 file.xml > outputUTF8.xml 

(Here is a link to the lib for Linux: http://www.gnu.org/software/libiconv/ )
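Before choosing a source encoding for iconv, it can help to look at the raw bytes that actually appear in the file. The following is a rough sketch of that check, not part of the answer above; nonAsciiBytes is a hypothetical helper and the sample string is made up for illustration.

```php
<?php
// Hypothetical helper: list the distinct non-ASCII bytes of a string as hex codes.
function nonAsciiBytes(string $raw): array
{
    // No /u flag on purpose: the pattern has to match raw bytes,
    // not UTF-8 characters.
    preg_match_all('/[\x80-\xFF]/', $raw, $matches);
    $hex = array_map('bin2hex', array_unique($matches[0]));
    sort($hex, SORT_STRING);
    return $hex;
}

// A tag name whose ñ is stored as the single byte 0x96.
print_r(nonAsciiBytes("<ba\x96os>0</ba\x96os>")); // lists the single code "96"
```

A byte such as 96 can then be looked up in the code tables of the candidate encodings. In ISO-8859-1, ñ would be f1, and in UTF-8 it would be the pair c3 b1.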

0

I was able to successfully decode this XML using the Symfony XmlEncoder class ( https://github.com/symfony/Serializer ). I saved the XML in the file test.xml to guarantee the correct encoding (since my PHP files are encoded in UTF-8 by default).

    $encoder = new Symfony\Component\Serializer\Encoder\XmlEncoder();
    $data = $encoder->decode(file_get_contents('test.xml'), 'xml');
    // $data = ['id' => 5531, 'baños' => 0]
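The same check can be tried with XMLReader alone. This is a minimal sketch, assuming the bytes really are ISO-8859-1 as the declaration says; the inline string stands in for the file contents, and $values is just an illustrative variable name.

```php
<?php
// The XML from the question, with ñ stored as the single ISO-8859-1 byte 0xF1.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
     . "<data><id><![CDATA[5531]]></id><ba\xF1os><![CDATA[0]]></ba\xF1os></data>";

$reader = new XMLReader();
$reader->XML($xml);

$values = [];
while ($reader->read()) {
    // Collect only the direct children of the root element.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->depth === 1) {
        // Element names and text are handed back as UTF-8.
        $values[$reader->name] = $reader->readString();
    }
}
$reader->close();

var_dump($values);
```

If the declared encoding matches the actual bytes, the tag name comes back as UTF-8 and no manual conversion is needed. A parse error at this point suggests that the declaration and the real encoding disagree.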
0

If you have problems with special characters in XML tag names, you can quickly sanitize the tags before parsing:

    $xml = <<<END
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <data>
    <id><![CDATA[5531]]></id>
    <baños><![CDATA[0]]></baños>
    </data>
    END;

    // Replace every character in a tag name that is not a letter a-z with "_".
    function FilterXML($matches) {
        return $matches[1] . preg_replace('/[^a-z]/ui', '_', $matches[2]) . $matches[3];
    }

    var_dump(preg_replace_callback('#(</?)([^!?]+?)(\\s|>)#', 'FilterXML', $xml));

It will replace <baños> with <ba_os>.

0

First, you can try to read the XML file, then convert the special characters, and then read the resulting XML string using XMLReader.

Here is the code:

    <?php
    header("Content-Type: text/plain; charset=ISO-8859-1");

    function normalizeChars($s) {
        $replace = array(
            '&amp;' => 'and', '@' => 'at', '©' => 'c', '®' => 'r',
            'À' => 'a', 'Á' => 'a', 'Â' => 'a', 'Ä' => 'a', 'Å' => 'a', 'Æ' => 'ae', 'Ç' => 'c',
            'È' => 'e', 'É' => 'e', 'Ë' => 'e', 'Ì' => 'i', 'Í' => 'i', 'Î' => 'i', 'Ï' => 'i',
            'Ò' => 'o', 'Ó' => 'o', 'Ô' => 'o', 'Õ' => 'o', 'Ö' => 'o', 'Ø' => 'o',
            'Ù' => 'u', 'Ú' => 'u', 'Û' => 'u', 'Ü' => 'u', 'Ý' => 'y', 'ß' => 'ss',
            'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a', 'å' => 'a', 'æ' => 'ae', 'ç' => 'c',
            'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
            'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ø' => 'o',
            'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u', 'ý' => 'y', 'þ' => 'p', 'ÿ' => 'y',
            'Ā' => 'a', 'ā' => 'a', 'Ă' => 'a', 'ă' => 'a', 'Ą' => 'a', 'ą' => 'a',
            'Ć' => 'c', 'ć' => 'c', 'Ĉ' => 'c', 'ĉ' => 'c', 'Ċ' => 'c', 'ċ' => 'c', 'Č' => 'c', 'č' => 'c',
            'Ď' => 'd', 'ď' => 'd', 'Đ' => 'd', 'đ' => 'd',
            'Ē' => 'e', 'ē' => 'e', 'Ĕ' => 'e', 'ĕ' => 'e', 'Ė' => 'e', 'ė' => 'e', 'Ę' => 'e', 'ę' => 'e', 'Ě' => 'e', 'ě' => 'e',
            'Ĝ' => 'g', 'ĝ' => 'g', 'Ğ' => 'g', 'ğ' => 'g', 'Ġ' => 'g', 'ġ' => 'g', 'Ģ' => 'g', 'ģ' => 'g',
            'Ĥ' => 'h', 'ĥ' => 'h', 'Ħ' => 'h', 'ħ' => 'h',
            'Ĩ' => 'i', 'ĩ' => 'i', 'Ī' => 'i', 'ī' => 'i', 'Ĭ' => 'i', 'ĭ' => 'i', 'Į' => 'i', 'į' => 'i', 'İ' => 'i', 'ı' => 'i',
            'Ĳ' => 'ij', 'ĳ' => 'ij', 'Ĵ' => 'j', 'ĵ' => 'j', 'Ķ' => 'k', 'ķ' => 'k', 'ĸ' => 'k',
            'Ĺ' => 'l', 'ĺ' => 'l', 'Ļ' => 'l', 'ļ' => 'l', 'Ľ' => 'l', 'ľ' => 'l', 'Ŀ' => 'l', 'ŀ' => 'l', 'Ł' => 'l', 'ł' => 'l',
            'Ń' => 'n', 'ń' => 'n', 'Ņ' => 'n', 'ņ' => 'n', 'Ň' => 'n', 'ň' => 'n', 'ŉ' => 'n', 'Ŋ' => 'n', 'ŋ' => 'n',
            'Ō' => 'o', 'ō' => 'o', 'Ŏ' => 'o', 'ŏ' => 'o', 'Ő' => 'o', 'ő' => 'o', 'Œ' => 'oe', 'œ' => 'oe',
            'Ŕ' => 'r', 'ŕ' => 'r', 'Ŗ' => 'r', 'ŗ' => 'r', 'Ř' => 'r', 'ř' => 'r',
            'Ś' => 's', 'ś' => 's', 'Ŝ' => 's', 'ŝ' => 's', 'Ş' => 's', 'ş' => 's', 'Š' => 's', 'š' => 's',
            'Ţ' => 't', 'ţ' => 't', 'Ť' => 't', 'ť' => 't', 'Ŧ' => 't', 'ŧ' => 't',
            'Ũ' => 'u', 'ũ' => 'u', 'Ū' => 'u', 'ū' => 'u', 'Ŭ' => 'u', 'ŭ' => 'u', 'Ů' => 'u', 'ů' => 'u', 'Ű' => 'u', 'ű' => 'u', 'Ų' => 'u', 'ų' => 'u',
            'Ŵ' => 'w', 'ŵ' => 'w', 'Ŷ' => 'y', 'ŷ' => 'y', 'Ÿ' => 'y',
            'Ź' => 'z', 'ź' => 'z', 'Ż' => 'z', 'ż' => 'z', 'Ž' => 'z', 'ž' => 'z', 'ſ' => 'z',
            'Ə' => 'e', 'ƒ' => 'f', 'Ơ' => 'o', 'ơ' => 'o', 'Ư' => 'u', 'ư' => 'u',
            'Ǎ' => 'a', 'ǎ' => 'a', 'Ǐ' => 'i', 'ǐ' => 'i', 'Ǒ' => 'o', 'ǒ' => 'o',
            'Ǔ' => 'u', 'ǔ' => 'u', 'Ǖ' => 'u', 'ǖ' => 'u', 'Ǘ' => 'u', 'ǘ' => 'u', 'Ǚ' => 'u', 'ǚ' => 'u', 'Ǜ' => 'u', 'ǜ' => 'u',
            'Ǻ' => 'a', 'ǻ' => 'a', 'Ǽ' => 'ae', 'ǽ' => 'ae', 'Ǿ' => 'o', 'ǿ' => 'o', 'ə' => 'e',
            'Ё' => 'jo', 'Є' => 'e', 'І' => 'i', 'Ї' => 'i',
            'А' => 'a', 'Б' => 'b', 'В' => 'v', 'Г' => 'g', 'Д' => 'd', 'Е' => 'e', 'Ж' => 'zh', 'З' => 'z',
            'И' => 'i', 'Й' => 'j', 'К' => 'k', 'Л' => 'l', 'М' => 'm', 'Н' => 'n', 'О' => 'o', 'П' => 'p',
            'Р' => 'r', 'С' => 's', 'Т' => 't', 'У' => 'u', 'Ф' => 'f', 'Х' => 'h', 'Ц' => 'c', 'Ч' => 'ch',
            'Ш' => 'sh', 'Щ' => 'sch', 'Ъ' => '-', 'Ы' => 'y', 'Ь' => '-', 'Э' => 'je', 'Ю' => 'ju', 'Я' => 'ja',
            'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'е' => 'e', 'ж' => 'zh', 'з' => 'z',
            'и' => 'i', 'й' => 'j', 'к' => 'k', 'л' => 'l', 'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p',
            'р' => 'r', 'с' => 's', 'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'h', 'ц' => 'c', 'ч' => 'ch',
            'ш' => 'sh', 'щ' => 'sch', 'ъ' => '-', 'ы' => 'y', 'ь' => '-', 'э' => 'je', 'ю' => 'ju', 'я' => 'ja',
            'ё' => 'jo', 'є' => 'e', 'і' => 'i', 'ї' => 'i', 'Ґ' => 'g', 'ґ' => 'g',
            'א' => 'a', 'ב' => 'b', 'ג' => 'g', 'ד' => 'd', 'ה' => 'h', 'ו' => 'v', 'ז' => 'z',
            'ח' => 'h', 'ט' => 't', 'י' => 'i', 'ך' => 'k', 'כ' => 'k', 'ל' => 'l', 'ם' => 'm',
            'מ' => 'm', 'ן' => 'n', 'נ' => 'n', 'ס' => 's', 'ע' => 'e', 'ף' => 'p', 'פ' => 'p',
            'ץ' => 'C', 'צ' => 'c', 'ק' => 'q', 'ר' => 'r', 'ש' => 'w', 'ת' => 't',
            '™' => 'tm', 'ñ' => 'n',
        );
        return strtr($s, $replace);
    }

    $path_to_file = '';
    $xml_text = @file_get_contents($path_to_file);
    if (!empty($xml_text)) {
        $xml_text = normalizeChars($xml_text);
        $xml = new XMLReader();
        $xml->XML($xml_text);
    }
    ?>

On another note, if you are looking for performance, you should try SimpleXML and DOMDocument, as indicated in the following StackOverflow question: fooobar.com/questions/86991 / ...

EDIT:

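I added header("Content-Type: text/plain; charset=ISO-8859-1") so that the output is served as ISO-8859-1. Keep in mind that strtr compares raw bytes, so the keys in the array must be stored in the same encoding as the XML being processed. I tried this with the XML string provided by the OP and it works fine. If a character is missing, feel free to add it to the array.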

-1
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->load($import_file, LIBXML_PARSEHUGE);
    $doc->save($import_file);

See the second example in the user notes at http://php.net/manual/en/domdocument.save.php

-1

Source: https://habr.com/ru/post/1200633/