Illegal character in XML

I have a PHP file that generates an XML sitemap from data imported from several sources. The sitemap is currently malformed because of an illegal character in one line of the imported data, and I am having a hard time removing it.

The character looks like a "square" or a superscript 2 and is rendered as a box. I tried pasting it into a hex editor, but it shows up as ?, and its hex code is also that of ?. I also tried using iconv to convert from every source encoding to every target encoding, and no combination removed the character.

I also have the following function to remove non-ASCII characters:

function stripInvalidXml($value)
{
    $ret = "";
    $current;
    if (empty($value)) 
    {
        return $ret;
    }

    $length = strlen($value);
    for ($i=0; $i < $length; $i++)
    {
        $current = ord($value{$i});
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            if($current != 0x1F)
            {
                $ret .= chr($current);
            }
        }
        else
        {
            $ret .= " ";
        }
    }


    return $ret;
}

. , & # 65535; . , , (, )

251gm-50

, , - , Xml.

, . Eclipses # 65535; ( - , , & # 65535;)


, - HTML, "". URL- , htmlentities :

$content = preg_replace("/&#?[a-z0-9]+;/i","",$content);

. , PHP.

Try iconv:

$cleanText = iconv('UTF-8','ISO-8859-1//TRANSLIT//IGNORE', $srcText);

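This converts from utf-8 to iso-8859; characters that cannot be represented in the target encoding are transliterated or dropped.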

, - utf-8. , , XML.

If you are on linux, you can try to detect the encoding with enca.


This part of the code is the problem:

    $current = ord($value{$i});
    if (($current == 0x9) ||
        ($current == 0xA) ||
        ($current == 0xD) ||
        (($current >= 0x20) && ($current <= 0xD7FF)) ||
        (($current >= 0xE000) && ($current <= 0xFFFD)) ||
        (($current >= 0x10000) && ($current <= 0x10FFFF)))
    {
        if($current != 0x1F)
            $ret .= chr($current);
    }

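ord() returns the value of a single byte, which is never greater than 0xFF, so the checks against the larger ranges can never match.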
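
Because ord() only ever sees one byte, one way to apply the same ranges to whole code points is a UTF-8-aware regular expression. The following is a minimal sketch, not a drop-in fix: the name stripInvalidXmlChars is hypothetical, the input is assumed to be meant as UTF-8, and the fallback for broken byte sequences assumes the mbstring extension is installed.

```php
<?php
// Hypothetical replacement for stripInvalidXml(); it works on code points, not bytes.
function stripInvalidXmlChars(string $value): string
{
    // With the /u flag, preg_match returns 1 only when $value is valid UTF-8.
    if (preg_match('//u', $value) !== 1) {
        // Assumption: mbstring is installed. Converting UTF-8 to UTF-8
        // substitutes byte sequences that are not valid UTF-8.
        $value = mb_convert_encoding($value, 'UTF-8', 'UTF-8');
    }

    // Replace every code point outside the XML 1.0 character ranges with a space,
    // mirroring the ranges used in the original function.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($pattern, ' ', $value);

    // preg_replace returns null on failure; fall back to an empty string.
    return $clean ?? '';
}
```

Like the original function, this substitutes a space for each rejected character, so the length of the text in characters stays the same.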

I assume that your XML is invalid because the file contains an invalid character sequence (indeed, &#65535;, i.e. 0xFFFF, is a noncharacter that XML 1.0 does not allow). This probably comes from copying and pasting pieces of different XML files that use different encodings.

I suggest you use the DOM extension to build your XML mash-up instead; it handles the various encodings automatically, converting them to UTF-8 internally.
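
As a rough illustration of the DOM approach, the sketch below builds a sitemap with DOMDocument rather than by concatenating strings. The helper name buildSitemap and the example URL are hypothetical, and the input strings are assumed to be valid UTF-8 already; the sketch only shows that escaping and the output encoding are handled by the extension.

```php
<?php
// Hypothetical helper: builds a sitemap document from a list of URL strings.
function buildSitemap(array $urls): string
{
    $ns  = 'http://www.sitemaps.org/schemas/sitemap/0.9';
    $doc = new DOMDocument('1.0', 'UTF-8');

    $urlset = $doc->appendChild($doc->createElementNS($ns, 'urlset'));

    foreach ($urls as $u) {
        $url = $urlset->appendChild($doc->createElementNS($ns, 'url'));
        $loc = $url->appendChild($doc->createElementNS($ns, 'loc'));
        // Text nodes are escaped on output, so "&" is written as "&amp;".
        $loc->appendChild($doc->createTextNode($u));
    }

    return $doc->saveXML();
}

$xml = buildSitemap(['http://example.com/?a=1&b=2']);
```

Strings imported from other encodings would still need converting to UTF-8 before they are passed to createTextNode.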


Source: https://habr.com/ru/post/1754591/
