.NET XmlDocument LoadXml and entities

When loading XML into an XmlDocument, i.e.

    XmlDocument document = new XmlDocument();
    document.LoadXml(xmlData);

Is there any way to stop the process from replacing entities? I have a strange problem where a TM symbol (stored as the entity &#8482;) in the XML is being converted into the actual TM character. As far as I know this should not happen, since the XML document is encoded as ISO-8859-1 (which does not have a TM character).

thanks

+4
7 answers

This is a standard misunderstanding of the XML toolchain. The whole business with "&#x" is a syntactic feature designed to cope with character encodings. Your XmlDocument is not a stream of characters; it has been freed of character encoding issues. Instead it holds an abstract model of XML-type data. Words for this include DOM and InfoSet; I'm not sure exactly which one is accurate here.

The "&#x" references won't exist in this model because the whole issue is irrelevant there. They will return, if appropriate, when you convert the Info Set back to a character stream in some specific encoding.

This misunderstanding is common enough to have made it into the academic literature as part of a collection of similar quirks. Take a look at "XML Fever": http://doi.acm.org/10.1145/1364782.1364795
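To make the point concrete, here is a minimal sketch (the class and the `ParseText` helper are hypothetical names, not from the thread) showing that once the document is loaded, the character reference and the literal character are indistinguishable in the model:

```csharp
using System;
using System.Xml;

public static class CharRefDemo
{
    // Hypothetical helper: parse the XML and return the root element's text.
    public static string ParseText(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        string fromReference = ParseText("<xml>&#8482;</xml>");
        string fromLiteral = ParseText("<xml>\u2122</xml>");

        Console.WriteLine(fromReference.Length);          // 1
        Console.WriteLine((int)fromReference[0]);         // 8482
        Console.WriteLine(fromReference == fromLiteral);  // True
    }
}
```

Both inputs end up as the single character U+2122 in the DOM, so there is nothing left to "preserve" at this stage.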

+4

What are you writing it to? A TextWriter? A Stream? What?

The following keeps the entity (well, it replaces it with the hex equivalent), but if you do the same with a StringWriter, it detects Unicode and uses the character instead:

    // Requires: using System; using System.IO; using System.Text; using System.Xml;
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(@"<xml>&#8482;</xml>");
    using (MemoryStream ms = new MemoryStream())
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Encoding = Encoding.GetEncoding("ISO-8859-1");
        XmlWriter xw = XmlWriter.Create(ms, settings);
        doc.Save(xw);
        xw.Close(); // flush everything to the stream before reading it back
        // The output is pure ASCII here, so decoding it as UTF-8 is safe.
        Console.WriteLine(Encoding.UTF8.GetString(ms.ToArray()));
    }

Outputs:

  <?xml version="1.0" encoding="iso-8859-1"?><xml>&#x2122;</xml> 
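For comparison, a small sketch of the StringWriter case mentioned in this answer (the class and the `SaveToString` helper are hypothetical names). A StringWriter holds UTF-16 text, so the writer has no reason to fall back to a character reference:

```csharp
using System;
using System.IO;
using System.Xml;

public static class StringWriterDemo
{
    // Hypothetical helper: load the XML and save it back through a StringWriter.
    public static string SaveToString(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        using (var sw = new StringWriter())
        {
            doc.Save(sw);
            return sw.ToString();
        }
    }

    public static void Main()
    {
        string text = SaveToString("<xml>&#8482;</xml>");
        Console.WriteLine(text.Contains("\u2122"));  // True: the literal character is written
        Console.WriteLine(text.Contains("&#"));      // False: no character reference is needed
    }
}
```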
+4

I admit things get a bit confusing with XML documents and encodings, but I would hope the entity is written out appropriately when you save the document again if you are still using ISO-8859-1, and that if you save with UTF-8 it isn't necessary. In some ways, logically the document really does contain the symbol rather than the entity reference; the latter is just an encoding matter. (I'm thinking out loud here; please don't take this as authoritative information.)

What are you doing with the document after loading it?

+2

I believe if you enclose the entity in a CDATA section, it should leave everything alone, e.g.:

    <root>
      <testnode><![CDATA[some text &#8482;]]></testnode>
    </root>
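One caveat worth checking before relying on this: inside CDATA the text `&#8482;` is no longer a character reference at all, just seven literal characters. A minimal sketch (the class and `ParseText` helper are hypothetical names):

```csharp
using System;
using System.Xml;

public static class CDataDemo
{
    // Hypothetical helper: parse the XML and return the root element's text.
    public static string ParseText(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        string inCData = ParseText("<root><![CDATA[&#8482;]]></root>");
        string plain = ParseText("<root>&#8482;</root>");

        Console.WriteLine(inCData);         // &#8482;
        Console.WriteLine(inCData.Length);  // 7
        Console.WriteLine(plain.Length);    // 1
    }
}
```

So the CDATA approach does keep the text untouched, but any consumer then sees the string "&#8482;" rather than a TM character, which may or may not be what is wanted.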
0

Character references are not encoding specific. According to the W3C XML 1.0 Recommendation:

If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646.
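As a rough illustration, the number in a reference is an ISO/IEC 10646 (Unicode) code point, whatever encoding the document declares. The sketch below (class and helper names are hypothetical) loads a document that declares ISO-8859-1 and uses the hexadecimal form, and compares it with the decimal form:

```csharp
using System;
using System.Xml;

public static class CodePointDemo
{
    // Hypothetical helper: parse the XML and return the root element's text.
    public static string ParseText(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        string hexForm = ParseText(
            "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><xml>&#x2122;</xml>");
        string decimalForm = ParseText("<xml>&#8482;</xml>");

        Console.WriteLine(0x2122);                              // 8482
        Console.WriteLine(hexForm == decimalForm);              // True
        Console.WriteLine(hexForm == char.ConvertFromUtf32(0x2122));  // True
    }
}
```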

0

&#xxxx; entities are considered to be the character they represent. All XML is converted to Unicode on reading, and any such entities are removed in favor of the Unicode character they represent. This includes any occurrences of them in a Unicode source, such as the string passed to LoadXml.

Similarly, on writing, any character that cannot be represented by the stream being written to is converted to a &#xxxx; entity. There is little point in trying to preserve them.

A common mistake is expecting to get a String from the DOM, by some means, that uses a non-Unicode encoding. That just doesn't happen, no matter what.
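The write-side behavior described here can be sketched as follows. The class and the `Serialize` helper are hypothetical names; the XML declaration is omitted only to keep the output short, and a BOM-less UTF-8 encoding is used so the decoded string has no leading byte order mark:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class WriteEncodingDemo
{
    // Hypothetical helper: load the XML, save it to a byte stream in the
    // given encoding, and decode those bytes back for inspection.
    public static string Serialize(string xml, Encoding encoding)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var settings = new XmlWriterSettings
        {
            Encoding = encoding,
            OmitXmlDeclaration = true
        };

        using (var ms = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(ms, settings))
            {
                doc.Save(writer);
            }
            return encoding.GetString(ms.ToArray());
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        Encoding utf8 = new UTF8Encoding(false);

        // ISO-8859-1 cannot represent U+2122, so a character reference is written.
        Console.WriteLine(Serialize("<xml>\u2122</xml>", latin1));  // <xml>&#x2122;</xml>

        // UTF-8 can represent it, so the literal character is written.
        Console.WriteLine(Serialize("<xml>&#8482;</xml>", utf8) == "<xml>\u2122</xml>");  // True
    }
}
```

Note that the form of the input (reference or literal) has no influence on the output; only the target encoding does.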

0

Thanks for the help.

I fixed my problem by writing an HtmlEncode function that actually replaces all the characters before it spits them out onto the web page (instead of relying on the flawed .NET HtmlEncode() function, which apparently only encodes a small subset of the required characters).
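The poster's function is not shown. A minimal sketch of the idea, under the assumption that "all the characters" means the HTML-significant characters plus everything outside ASCII, might look like this (`HtmlEncodeAll` is a hypothetical name, not a .NET API):

```csharp
using System;
using System.Globalization;
using System.Text;

public static class HtmlEncodeSketch
{
    // Hypothetical helper: escape HTML-significant characters and write every
    // non-ASCII character as a decimal character reference.
    public static string HtmlEncodeAll(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            switch (c)
            {
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '&': sb.Append("&amp;"); break;
                case '"': sb.Append("&quot;"); break;
                default:
                    if (c < 128)
                    {
                        sb.Append(c);
                        break;
                    }
                    int codePoint = c;
                    // Combine a surrogate pair into a single code point.
                    if (char.IsHighSurrogate(c) && i + 1 < input.Length
                        && char.IsLowSurrogate(input[i + 1]))
                    {
                        codePoint = char.ConvertToUtf32(c, input[i + 1]);
                        i++;
                    }
                    sb.Append("&#")
                      .Append(codePoint.ToString(CultureInfo.InvariantCulture))
                      .Append(';');
                    break;
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Prints: Acme&#8482; &lt;b&gt;R&amp;D&lt;/b&gt;
        Console.WriteLine(HtmlEncodeAll("Acme\u2122 <b>R&D</b>"));
    }
}
```

The output is pure ASCII, so it survives whatever encoding the page is eventually served in. Unpaired surrogates are passed through as their raw numeric value in this sketch; real code would probably want to reject or replace them.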

0

Source: https://habr.com/ru/post/1277315/
