Invulnerable XMLException

Background

I am serializing a very large List<string> with this code:

    public static string SerializeObjectToXML<T>(T item)
    {
        XmlSerializer xs = new XmlSerializer(typeof(T));
        using (StringWriter writer = new StringWriter())
        {
            xs.Serialize(writer, item);
            return writer.ToString();
        }
    }

And deserialize it with this code:

    public static T DeserializeXMLToObject<T>(string xmlText)
    {
        if (string.IsNullOrEmpty(xmlText)) return default(T);
        XmlSerializer xs = new XmlSerializer(typeof(T));
        using (MemoryStream memoryStream = new MemoryStream(new UnicodeEncoding().GetBytes(xmlText.Replace((char)0x1A, ' '))))
        using (XmlTextReader xsText = new XmlTextReader(memoryStream))
        {
            xsText.Normalization = true;
            return (T)xs.Deserialize(xsText);
        }
    }

But I get this exception when I deserialize it:

XMLException: There is an error in the XML document (217388, 15). '[]', the hexadecimal value 0x1A, is an invalid character. Line 217388, position 15.

at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader, String encodingStyle, XmlDeserializationEvents events)

at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader)

Question

Why does the line xmlText.Replace((char)0x1A, ' ') not work? What witchcraft is this?

Some limitations

  • My code is in C#, targeting .NET Framework 4 and built in VS2010 Pro.
  • I cannot view the xmlText value in debug mode because the List<string> is too large, and the watch windows simply display the error message: Unable to evaluate the expression. Not enough storage is available to complete this operation.
+4
3 answers

I think I found the problem. By default, XmlSerializer will allow you to generate invalid XML.

Given the code:

    var input = "\u001a";
    var writer = new StringWriter();
    var serializer = new XmlSerializer(typeof(string));
    serializer.Serialize(writer, input);
    Console.WriteLine(writer.ToString());

Output:

    <?xml version="1.0" encoding="utf-16"?>
    <string>&#x1A;</string>

This is invalid XML. According to the XML specification, all character references must be valid. Valid characters are:

 #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] 

As you can see, U+001A (like every other C0 control character except tab, line feed, and carriage return) is not allowed as a character reference, because it is not a valid character.

The error message given by the deserializer is a bit misleading; it would be clearer if it said that there is an invalid character reference.
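The rule above can be checked directly. The following is a minimal sketch, not part of the original answer: the class name CharReferenceDemo and the helper CanRead are hypothetical, while XmlConvert.IsXmlChar and XmlReader.Create are the standard .NET 4 APIs. It tests whether a character falls inside the XML 1.0 character set and whether a default reader accepts a reference to it.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical demo class; the names are illustrative only.
public static class CharReferenceDemo
{
    // Returns true if a default XmlReader can read the whole document,
    // false if it throws XmlException (for example on an invalid character reference).
    public static bool CanRead(string xml)
    {
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml)))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // U+001A is outside the XML 1.0 character set; tab (U+0009) is inside it.
        Console.WriteLine(XmlConvert.IsXmlChar('\u001a'));
        Console.WriteLine(XmlConvert.IsXmlChar('\t'));

        // The reference form is rejected or accepted in the same way as the raw character.
        Console.WriteLine(CanRead("<string>&#x1A;</string>"));
        Console.WriteLine(CanRead("<string>&#x9;</string>"));
    }
}
```

The point of the sketch is that escaping the character as a reference does not make it legal: the reader validates the referenced code point, which is why the exception names 0x1A even though the raw character never appears in the text.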

There are several options for what you can do.

1) Do not let XmlSerializer create invalid documents in the first place

You can use XmlWriter , which by default will not allow invalid characters:

    var input = "\u001a";
    var writer = new StringWriter();
    var serializer = new XmlSerializer(typeof(string));
    // added following line:
    var xmlWriter = XmlWriter.Create(writer);
    // then, write via the xmlWriter rather than writer:
    serializer.Serialize(xmlWriter, input);
    Console.WriteLine(writer.ToString());

This will throw an exception when serialization happens. You will need to handle it and report a suitable error.

This is probably not useful for you because you already have data with these invalid characters.

2) Remove references to this invalid character

That is, instead of .Replace((char)0x1a, ' '), which does not actually replace anything in your document (the serialized text contains the six-character reference &#x1A;, not the raw character), use .Replace("&#x1A;", " "). (This match is case-sensitive, but it is the form that .NET generates. A more robust solution would be to use a case-insensitive regular expression.)
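A case-insensitive replacement could look roughly like the sketch below. The class ReferenceStripper and the method ReplaceSubReferences are hypothetical names, and the pattern is an assumption about which spellings are worth covering: the hexadecimal reference in either letter case, optional leading zeros, and the decimal form &#26;.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class ReferenceStripper
{
    // Matches references to U+001A: &#x1A; / &#x1a; (optionally zero-padded)
    // and the decimal equivalent &#26;.
    private static readonly Regex SubReference = new Regex(
        @"&#(?:x0*1a|0*26);",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Replaces every matched reference with a single space.
    public static string ReplaceSubReferences(string xmlText)
    {
        return SubReference.Replace(xmlText, " ");
    }

    public static void Main()
    {
        string xml = "<string>a&#x1A;b&#x1a;c&#26;d</string>";
        Console.WriteLine(ReplaceSubReferences(xml));
    }
}
```

The trailing semicolon in the pattern keeps longer references such as &#x1AB; intact, since those name a different (and valid) character.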

As an aside, XML 1.1 does allow references to control characters, as long as they are character references rather than literal characters in the document. That would solve your problem, except that the .NET XmlSerializer does not support version 1.1.

+6

If you have existing data in which you serialized a class containing characters that cannot subsequently be deserialized, you can sanitize the data in the following way:

    public static string SanitiseSerialisedXml(this string serialized)
    {
        if (serialized == null)
        {
            return null;
        }

        const string pattern = @"&#x([0-9A-F]{1,2});";

        var sanitised = Regex.Replace(serialized, pattern, match =>
        {
            var value = match.Groups[1].Value;
            int characterCode;
            if (int.TryParse(value, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out characterCode))
            {
                if (characterCode >= char.MinValue && characterCode <= char.MaxValue)
                {
                    return XmlConvert.IsXmlChar((char)characterCode) ? match.Value : string.Empty;
                }
            }
            return match.Value;
        });

        return sanitised;
    }

The preferred solution is to reject invalid characters at the point of serialization, as in point 1 of Porges's answer. This code covers point 2 of Porges's answer (remove references to this invalid character) and strips every invalid character reference. It was written for a case where the serialized data had already been saved in a database field, so fixing the problem at the point of serialization was not an option for the legacy data.

+8

This problem also plagued us when working with ASCII control characters (SYN, NAK, etc.). There is an easy way to disable the check if you are using XmlWriterSettings: set XmlWriterSettings.CheckCharacters to false, which turns off the writer's validation against the XML 1.0 character rules.

    class Program
    {
        static void Main(string[] args)
        {
            MyCustomType c = new MyCustomType();
            c.Description = string.Format("Something like this {0}", (char)22);
            var output = c.ToXMLString();
            Console.WriteLine(output);
        }
    }

    public class MyCustomType
    {
        public string Description { get; set; }

        static readonly XmlSerializer xmlSerializer = new XmlSerializer(typeof(MyCustomType));

        public string ToXMLString()
        {
            var settings = new XmlWriterSettings()
            {
                Indent = true,
                OmitXmlDeclaration = true,
                CheckCharacters = false
            };
            StringBuilder sb = new StringBuilder();
            using (var writer = XmlWriter.Create(sb, settings))
            {
                xmlSerializer.Serialize(writer, this);
                return sb.ToString();
            }
        }
    }

The output will contain an encoded character like &#x16; instead of throwing an error:

Unhandled exception: System.InvalidOperationException: An error occurred while generating an XML document. ---> System.ArgumentException: '▬', the hexadecimal value 0x16, is an invalid character.
at System.Xml.XmlEncodedRawTextWriter.InvalidXmlChar(Int32 ch, Char* pDst, Boolean entitize)
at System.Xml.XmlEncodedRawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
at System.Xml.XmlEncodedRawTextWriter.WriteString(String text)
at System.Xml.XmlEncodedRawTextWriterIndent.WriteString(String text)
at System.Xml.XmlWellFormedWriter.WriteString(String text)
at System.Xml.XmlWriter.WriteElementString(String localName, String ns, String value)
at System.Xml.Serialization.XmlSerializationWriter.WriteElementString(String localName, String ns, String value, XmlQualifiedName xsiType
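
Output written this way still contains references such as &#x16;, which a default reader rejects, so the reading side needs the matching setting. The following is a rough sketch under that assumption: the class LenientXml and its two methods are hypothetical names, while XmlReaderSettings.CheckCharacters is the standard counterpart of the writer setting used above.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical helper; the names are illustrative only.
public static class LenientXml
{
    // Serializes without validating characters against the XML 1.0 rules,
    // so control characters are emitted as character references.
    public static string Serialize<T>(T item)
    {
        var settings = new XmlWriterSettings { CheckCharacters = false };
        var sb = new StringBuilder();
        using (var writer = XmlWriter.Create(sb, settings))
        {
            new XmlSerializer(typeof(T)).Serialize(writer, item);
        }
        return sb.ToString();
    }

    // Deserializes with character checking turned off for character references.
    public static T Deserialize<T>(string xml)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            return (T)new XmlSerializer(typeof(T)).Deserialize(reader);
        }
    }

    public static void Main()
    {
        string original = "Something like this " + (char)22;
        string xml = Serialize(original);
        string roundTripped = Deserialize<string>(xml);
        Console.WriteLine(roundTripped == original);
    }
}
```

Bear in mind that documents produced this way are not well-formed XML 1.0, so other parsers may still refuse them; the approach only helps when both ends are under your control.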

+2

Source: https://habr.com/ru/post/1402577/