Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string

I get an error message:

parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20

This happens when I try to process an XML response from a third-party source using simplexml_load_string. The raw XML response declares its encoding as UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

However, it looks like the XML is not actually UTF-8. The content is in Spanish and contains words such as Dublín.

I cannot get the third party to fix their XML.

How can I preprocess the XML and fix the encoding mismatch?

Is there a way to determine the correct encoding for an XML file?

+47
xml php encoding character-encoding simplexml
Mar 24 '10 at 12:35
10 answers

Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you, it probably doesn't work for other people either.
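
As a quick check of that reading, the sketch below decodes the four bytes from the error message as ISO-8859-1. It assumes the mbstring extension is installed; the variable names are only illustrative.

```php
<?php
// The four bytes reported by libxml in the error message.
$bytes = "\xED\x6E\x2C\x20";

// They are not valid UTF-8: 0xED announces a three-byte sequence,
// but 0x6E ("n") is not a continuation byte.
$isUtf8 = mb_check_encoding($bytes, 'UTF-8');

// Read as ISO-8859-1, the same bytes are "í", "n", a comma and a space.
$decoded = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

var_dump($isUtf8);
echo $decoded, PHP_EOL;
```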

Now, there are a few ways to work around it, which you should use only if you cannot load the XML normally. One of them is to use utf8_encode(). The downside is that if the XML contains both valid UTF-8 and some ISO-8859-1, the result will contain mojibake. Or you can try converting the string from UTF-8 to UTF-8 using iconv() or mbstring and hope they fix it for you. (They won't, but you can at least ignore the invalid characters so you can load your XML.)

Or you can take the long road and validate and fix the byte sequences yourself. That will take a while, depending on how familiar you are with UTF-8. There may be libraries that do this, but I don't know of any.
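
To make the mojibake risk concrete, here is a small sketch with a hypothetical payload that mixes both encodings. It uses mb_convert_encoding() with ISO-8859-1 as the source encoding, which maps bytes the same way utf8_encode() does; utf8_encode() itself is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical payload: "Dublín" once as valid UTF-8 (í = 0xC3 0xAD)
// and once as ISO-8859-1 (í = 0xED).
$mixed = "Dubl\xC3\xADn Dubl\xEDn";

// Treat every byte as ISO-8859-1 and re-encode it as UTF-8.
$converted = mb_convert_encoding($mixed, 'UTF-8', 'ISO-8859-1');

// The ISO-8859-1 "í" is repaired, but the bytes of the UTF-8 "í" are each
// encoded again, turning it into "Ã" followed by a soft hyphen.
echo $converted, PHP_EOL;
```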

In any case, notify the data provider that they are sending the wrong data so that they can fix it.




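Here's a partial fix. It will definitely not fix everything, but it will fix some of it. Hopefully that is enough to get by until your provider fixes their feed.

    // Re-encode as UTF-8 every high byte (0xA1-0xFF) that is not followed by
    // two or more UTF-8 continuation bytes (0x80-0xBF).
    function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
    {
        return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
    }

    // Callback: $m[0] is the single matched byte, treated as ISO-8859-1.
    function utf8_encode_callback($m)
    {
        return utf8_encode($m[0]);
    }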
+69
Mar 24 '10 at 18:02

I solved it using

    $content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
    $xml = simplexml_load_string($content);
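
A slightly more defensive variant of the same idea, as a sketch: convert only when the payload is not already valid UTF-8. The helper name to_utf8_if_needed() is hypothetical, the sample string stands in for the remote feed, and the code assumes that anything that fails the UTF-8 check is ISO-8859-1.

```php
<?php
/**
 * Hypothetical helper: returns the payload as UTF-8, converting from
 * ISO-8859-1 only when the bytes are not already valid UTF-8.
 */
function to_utf8_if_needed(string $raw): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid, leave untouched
    }
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    return mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
}

// Sample document standing in for the third-party response.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><city>Dubl\xEDn</city>";

$xml = simplexml_load_string(to_utf8_if_needed($raw));
echo (string) $xml, PHP_EOL;
```

This avoids double-encoding feeds that are entirely valid UTF-8, but it does not help with a single document that mixes both encodings.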
+44
Jan 01 '10 at 21:14

If you are sure that your XML is encoded in UTF-8 but contains some bad characters, you can use this call to strip them:

 $content = iconv('UTF-8', 'UTF-8//IGNORE', $content); 
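
The handling of //IGNORE can depend on the iconv library PHP was built against, so here is a sketch of the same idea using mbstring instead. The input string is hypothetical, and the approach assumes that dropping the offending bytes is acceptable.

```php
<?php
// Hypothetical input: UTF-8 text with one stray ISO-8859-1 byte (0xED).
$content = "Dubl\xEDn and M\xC3\xA1laga";

// Tell mbstring to drop bytes it cannot convert instead of emitting "?".
$previous = mb_substitute_character();
mb_substitute_character('none');

// Converting UTF-8 to UTF-8 re-validates the string and discards invalid bytes.
$clean = mb_convert_encoding($content, 'UTF-8', 'UTF-8');

// Restore the previous setting so other code is unaffected.
mb_substitute_character($previous);

var_dump(mb_check_encoding($clean, 'UTF-8'));
```

Note that the invalid character is lost rather than repaired, so this only makes the document loadable.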
+9
Dec 02 '13 at 13:12

Instead of using JavaScript, you can simply put this line of code after the mysql_connect call:

 mysql_set_charset('utf8',$connection); 

Greetings.

+3
Apr 3 '11 at 1:10

We recently ran into a similar problem and could not find any obvious cause. There was a control character in our string, but when we output the string to the browser the character was not visible unless we copied the text into an IDE.

We were able to solve our problem thanks to this post and this line:

preg_replace('/[\x00-\x1F\x7F]/', '', $input);
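
The pattern above also removes tabs and line breaks, which XML 1.0 allows. A narrower sketch that keeps tab (0x09), LF (0x0A) and CR (0x0D) might look like this; the sample input is hypothetical.

```php
<?php
// Hypothetical input containing a vertical tab (0x0B) and a NUL byte,
// plus a tab and a newline that should survive.
$input = "<a>\tfoo\x0Bbar\x00</a>\n";

// Remove C0 control characters except tab, LF and CR; also remove DEL (0x7F),
// as the original pattern does.
$clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);

echo $clean;
```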

+3
Nov 11 '16 at 16:18

If you download the XML file and open it in, for example, Notepad++, you will see that the encoding is set to something other than UTF-8. I had the same problem with an XML file, and it was just the encoding in the editor :)

The line <?xml version="1.0" encoding="UTF-8"?> does not set the encoding of the document; it is only information for the validator or other consumers.
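
To see whether the declaration and the actual bytes agree, something like the following sketch can help. The helper declared_xml_encoding() is hypothetical and uses a deliberately simple regular expression rather than a full parser.

```php
<?php
/**
 * Hypothetical helper: reads the encoding named in the XML declaration,
 * or returns null when there is no declaration or no encoding attribute.
 */
function declared_xml_encoding(string $xml): ?string
{
    if (preg_match('/^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/', $xml, $m)) {
        return $m[1];
    }
    return null;
}

// Sample document that claims UTF-8 but contains an ISO-8859-1 byte.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><city>Dubl\xEDn</city>";

$declared = declared_xml_encoding($raw);        // what the document claims
$matches  = mb_check_encoding($raw, 'UTF-8');   // whether the bytes are valid UTF-8

var_dump($declared, $matches);
```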

+2
Jan 29 '11 at 23:15

Can you open a third-party XML source in Firefox and see what it automatically identifies as encoding? Perhaps they are using plain old ISO-8859-1, UTF-16, or something else.

If they claim to be UTF-8 but serve something else, their feed is clearly broken. Working around such a broken feed feels horrible to me (although sometimes it is unavoidable, I know).

If it's a simple case, such as "UTF-8 versus ISO-8859-1", you can also try your luck with mb_detect_encoding().
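
A minimal sketch of that check, assuming only these two candidate encodings are plausible. Detection is a guess rather than a guarantee: ISO-8859-1 accepts any byte sequence, so here it effectively means "not valid UTF-8".

```php
<?php
// Candidate encodings. ISO-8859-1 accepts any byte sequence, so it only
// works as a last-resort fallback.
$candidates = ['UTF-8', 'ISO-8859-1'];

// "Dublín" as ISO-8859-1 bytes, standing in for the third-party content.
$latin1 = "Dubl\xEDn";

// The third argument enables strict mode, which rejects UTF-8 for this input
// because 0xED is not followed by continuation bytes.
$detected = mb_detect_encoding($latin1, $candidates, true);

echo $detected, PHP_EOL;
```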

+1
Mar 24 '10 at 12:38

After several attempts, I found that the htmlentities function works.

 $value = htmlentities($value);
+1
Jul 22 '16 at 8:34

I ran into the same problem when generating Doctrine mapping files. I fixed it by deleting all the comments that some fields had in the database.

0
Jun 03 '16 at 4:39

I had this problem. It turned out that the XML file itself (not the content) was encoded in ISO-8859-1 rather than UTF-8. You can check this on a Mac with file -I xml_filename.

I used Sublime to change the file's encoding to UTF-8, and lxml then imported it without any problems.

0
Jun 08 '16 at 22:41


