Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you, it probably doesn't work for other people either.
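To check that diagnosis yourself, here is a minimal sketch. It assumes the mbstring extension is available, and the variable names are only for illustration.

```php
<?php
// The four bytes quoted above.
$bytes = "\xED\x6E\x2C\x20";

// 0xED opens a three-byte UTF-8 sequence, but 0x6E ("n") is not a
// continuation byte, so the string is not valid UTF-8.
$isUtf8 = mb_check_encoding($bytes, 'UTF-8');

// Read as ISO-8859-1 instead, every byte maps to exactly one character.
$decoded = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

var_dump($isUtf8);   // bool(false)
var_dump($decoded);  // the text "ín, ", now encoded as UTF-8
```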
Now, there are a few ways to work around it, which you should use only if you cannot load the XML properly otherwise. One of them is to use utf8_encode(). The downside is that if the XML contains both valid UTF-8 and some ISO-8859-1, the valid UTF-8 sequences get encoded a second time and the result will contain mojibake. Or you can try converting the string from UTF-8 to UTF-8 using iconv() or mbstring and hope they fix it for you. (They won't, but you can at least drop the invalid characters so that you can load your XML.)
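The two workarounds above could look roughly like this. It is a sketch, not a definitive implementation: the function names are made up for illustration, it assumes mbstring is available, and it uses mb_convert_encoding() from ISO-8859-1 in place of utf8_encode(), which does the same conversion.

```php
<?php
// Workaround 1: treat the whole string as ISO-8859-1.
// Hypothetical helper name; same conversion that utf8_encode() performs.
function latin1_to_utf8(string $str): string
{
    return mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
}

// Workaround 2: convert UTF-8 to UTF-8 and drop whatever is invalid.
// Hypothetical helper name; nothing is repaired, bad bytes are just removed.
function strip_invalid_utf8(string $str): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character('none');         // drop invalid bytes instead of writing "?"
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the setting
    return $clean;
}

// Made-up sample: "í" sent as the single ISO-8859-1 byte 0xED.
$broken = "<name>Mart\xEDn</name>";

$fixed    = latin1_to_utf8($broken);      // "í" becomes the UTF-8 bytes 0xC3 0xAD
$stripped = strip_invalid_utf8($broken);  // valid UTF-8, but the "í" is lost

// The downside of workaround 1: input that was already valid UTF-8 gets mangled.
$mojibake = latin1_to_utf8("caf\xC3\xA9"); // displays as "cafÃ©"
```

Either result can then be passed to your XML loader. The first one keeps the character but is only safe when the whole document really is ISO-8859-1; the second one always loads but loses data.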
Or you can take the long, long road and validate and correct the byte sequences yourself. This will take some time, depending on how familiar you are with UTF-8. Perhaps there are libraries that would do this, although I don't know of any.
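For what that long road might look like, here is a rough sketch. It walks the string with a byte-mode regex that lists the well-formed UTF-8 sequences and treats any byte that does not fit as a stray ISO-8859-1 character. The function name is invented for illustration, mbstring is assumed, and the approach cannot tell apart ISO-8859-1 text that happens to look like valid UTF-8 (for example the two bytes 0xC3 0xA9 are left alone even if they were meant as "Ã©").

```php
<?php
// Hypothetical helper: keep well-formed UTF-8, re-encode every other byte
// as if it were ISO-8859-1.
function repair_mixed_utf8(string $str): string
{
    // Byte-mode pattern (no /u): each branch is one well-formed UTF-8
    // sequence; the last branch captures any single byte that did not fit.
    $pattern = '/
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
        | (.)
    /xs';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // Group 1 is present only when the fallback branch matched.
        if (isset($m[1]) && $m[1] !== '') {
            return mb_convert_encoding($m[1], 'UTF-8', 'ISO-8859-1');
        }
        return $m[0]; // already well-formed, keep as is
    }, $str);

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $str;
}

// Made-up sample: a stray ISO-8859-1 "í" next to a valid UTF-8 "é".
$mixed    = "Mart\xEDn, caf\xC3\xA9";
$repaired = repair_mixed_utf8($mixed);
```

In this sample the lone 0xED is re-encoded while the valid "é" is left untouched, so the result passes mb_check_encoding($repaired, 'UTF-8').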
In any case, notify the data provider that they are sending the wrong data so that they can fix it.
Here's a partial fix. It definitely won't fix everything, but it will fix some of it. Hopefully that's enough to get you by until your provider fixes their stuff.
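function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    // Re-encode every high byte (0xA1-0xFF) that is not followed by
    // two or more UTF-8 continuation bytes (0x80-0xBF).
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}

function utf8_encode_callback($m)
{
    // utf8_encode() converts ISO-8859-1 to UTF-8. It is deprecated as of PHP 8.2;
    // mb_convert_encoding($m[0], 'UTF-8', 'ISO-8859-1') does the same conversion.
    return utf8_encode($m[0]);
}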
Josh Davis, Mar 24 '10 at 18:02