Replace invalid XML character references

I project some data as XML from SQL Server using ADO.NET. Some of my data contains characters that are invalid in XML, such as CHAR(7) (known as BEL).

SELECT 'This is BEL: ' + CHAR(7) AS A FOR XML RAW

SQL Server encodes such invalid characters as numeric character references:

<row A="This is BEL: &#x7;" />

However, even the encoded form is invalid in XML 1.0 and will lead to errors in XML parsing:

var doc = XDocument.Parse("<row A=\"This is BEL: &#x7;\" />");
// XmlException: ' ', hexadecimal value 0x07, is an invalid character. Line 1, position 25.
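
The error arises because XML 1.0 allows only tab, line feed, and carriage return among the C0 control characters, and a character reference must still expand to an allowed character. As a minimal sketch of how to see which characters fall outside that set, the helper below lists them using `XmlConvert.IsXmlChar`; the class and method names (`XmlCharCheck`, `InvalidChars`) are hypothetical and only for illustration.

```csharp
using System.Linq;
using System.Xml;

static class XmlCharCheck   // hypothetical helper, not part of any library
{
    // Returns the distinct characters in text that XmlConvert.IsXmlChar rejects.
    // It checks one UTF-16 code unit at a time, so it is not meant for text
    // that contains surrogate pairs.
    public static char[] InvalidChars(string text) =>
        text.Where(c => !XmlConvert.IsXmlChar(c)).Distinct().ToArray();
}
```

For example, `XmlCharCheck.InvalidChars("This is BEL: \u0007\t\n")` reports only U+0007, because tab and line feed are allowed.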

I would like to replace all of these invalid numeric references to the Unicode replacement character ' '. I know how to do this for unencoded XML:

string str = "<row A=\"This is BEL: \u0007\" />";
if (str.Any(c => !XmlConvert.IsXmlChar(c)))
    str = new string(str.Select(c => XmlConvert.IsXmlChar(c) ? c : ' ').ToArray());
          // <row A="This is BEL:  " />

Is there a simple way to do the same for encoded XML? I would rather avoid having to HtmlDecode and then HtmlEncode the string.

I would prefer to solve this in C# rather than in SQL.


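You can use a regular expression to find the numeric character references and replace each one with the character it encodes. After that, the check for unencoded XML from the question can be applied to the result.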

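// Requires: using System.Globalization; using System.Text.RegularExpressions;
public string ReplaceXMLEncodedCharacters(string input)
{
    // Group 1 is the optional hex marker "x"; group 2 holds the digits.
    const string pattern = @"&#(x?)([A-Fa-f0-9]+);";
    MatchCollection matches = Regex.Matches(input, pattern);
    int offset = 0;   // how many characters shorter the string has become so far
    foreach (Match match in matches)
    {
        int charCode = string.IsNullOrEmpty(match.Groups[1].Value)
            ? int.Parse(match.Groups[2].Value, CultureInfo.InvariantCulture)
            : int.Parse(match.Groups[2].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        // Note: this cast truncates code points above U+FFFF.
        char character = (char)charCode;
        // Swap the reference for the single decoded character.
        input = input.Remove(match.Index - offset, match.Length).Insert(match.Index - offset, character.ToString());
        offset += match.Length - 1;
    }
    return input;
}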

Another option is CDATA. Wrap the value in a CDATA section, for example:

SELECT 'This is BEL: <![CDATA[' + CHAR(7) + ']]>' AS A FOR XML RAW



For reference, this is my solution. I built on Tonkleton's answer, but modified it to be closer to the internal implementation of HtmlDecode. The code below ignores surrogate pairs.

// numeric character references
static readonly Regex ncrRegex = new Regex("&#x?[A-Fa-f0-9]+;");

static string ReplaceInvalidXmlCharacterReferences(string input)
{
    if (input.IndexOf("&#") == -1)   // optimization
        return input;

    return ncrRegex.Replace(input, match =>
    {
        string ncr = match.Value;            
        uint num;
        var frmt = NumberFormatInfo.InvariantInfo;

        bool isParsed =
            ncr[2] == 'x' ?   // the x must be lowercase in XML documents
            uint.TryParse(ncr.Substring(3, ncr.Length - 4), NumberStyles.AllowHexSpecifier, frmt, out num) :
            uint.TryParse(ncr.Substring(2, ncr.Length - 3), NumberStyles.Integer, frmt, out num);

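        // Values above U+FFFF are left untouched; casting them to char would truncate them.
        return isParsed && num <= char.MaxValue && !XmlConvert.IsXmlChar((char)num) ? "\uFFFD" : ncr;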
    });
}
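
Since the code above ignores surrogate pairs, here is a minimal sketch of a variant that validates the full code point against the XML 1.0 `Char` production instead of casting to `char`. The names `NcrSanitizer`, `IsXmlCodePoint`, and `Sanitize` are hypothetical, and replacing unparseable (overflowing) references as well is an assumption of this sketch rather than something taken from the answer.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

static class NcrSanitizer   // hypothetical name, for illustration only
{
    // Numeric character references: decimal (&#7;) or hexadecimal (&#x7;).
    static readonly Regex NcrRegex = new Regex("&#(x[0-9A-Fa-f]+|[0-9]+);");

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static bool IsXmlCodePoint(uint cp) =>
        cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);

    public static string Sanitize(string input)
    {
        return NcrRegex.Replace(input, match =>
        {
            string body = match.Groups[1].Value;   // "x7" or "7"
            uint cp;
            bool parsed = body[0] == 'x'
                ? uint.TryParse(body.Substring(1), NumberStyles.AllowHexSpecifier,
                                CultureInfo.InvariantCulture, out cp)
                : uint.TryParse(body, NumberStyles.None,
                                CultureInfo.InvariantCulture, out cp);
            // Keep valid references as they are; replace everything else with U+FFFD.
            return parsed && IsXmlCodePoint(cp) ? match.Value : "\uFFFD";
        });
    }
}
```

With this sketch, `NcrSanitizer.Sanitize("<row A=\"This is BEL: &#x7;\" />")` replaces `&#x7;` with U+FFFD, while a reference such as `&#x1F600;` is left unchanged, so the result can be passed to `XDocument.Parse`.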

Source: https://habr.com/ru/post/1599550/

