XDocument.Save() removes my &#xA; entities

I wrote a tool to repair some XML files (i.e. insert some attributes/values that were missing) using C# and LINQ to XML. The tool loads an existing XML file into an XDocument object. Then it walks through the nodes to insert the missing data. After that, it calls XDocument.Save() to save the changes to another directory.

All of this works fine except for one thing: any &#xA; entities in the text of the XML file are replaced with a newline character. The entity does represent a newline, of course, but I need to preserve the entity in the XML because another consumer requires it.

Is there a way to save the modified XDocument without losing the &#xA; entities?

Thanks.

2 answers

The &#xA; entities are technically called "numeric character references" in XML, and they are resolved when the source document is loaded into the XDocument. This makes the problem hard to solve, because once the XDocument has been loaded there is no way to distinguish resolved whitespace references from insignificant whitespace (typically used to format XML documents for plain-text viewers). Therefore, the following applies only if your document does not contain any insignificant whitespace.
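A minimal sketch of that behavior, assuming only System.Xml.Linq (the class and method names are illustrative, not part of any library): after parsing, a &#xA; reference and a literal line break produce the same text, so the writer has nothing left to tell them apart.

```csharp
using System.Xml.Linq;

public static class ReferenceResolutionDemo
{
    // Hypothetical helper: wraps the raw content in an <a> element,
    // parses it, and returns the resulting text value.
    public static string ParseText(string rawContent)
    {
        return XElement.Parse("<a>" + rawContent + "</a>").Value;
    }
}
```

Calling ParseText("x&#xA;y") and ParseText("x\ny") should give identical strings, which is why the information is already lost by the time Save() runs.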

The System.Xml library lets you preserve whitespace entities by setting the NewLineHandling property of the XmlWriterSettings class to Entitize. However, within text nodes this only entitizes \r to &#xD; and does not entitize \n to &#xA;.
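To see this limitation for yourself, here is a small sketch that saves a document through an XmlWriter configured with NewLineHandling.Entitize. The class and method names are made up for illustration, and OmitXmlDeclaration is set only to keep the output short.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class EntitizeDemo
{
    // Hypothetical helper: serializes the document to a string using
    // NewLineHandling.Entitize so the result can be inspected.
    public static string SaveWithEntitize(XDocument document)
    {
        var settings = new XmlWriterSettings
        {
            NewLineHandling = NewLineHandling.Entitize,
            OmitXmlDeclaration = true
        };

        var buffer = new StringWriter();
        using (var writer = XmlWriter.Create(buffer, settings))
        {
            document.Save(writer);
        }
        return buffer.ToString();
    }
}
```

For a document such as new XDocument(new XElement("a", "x\r\ny")), the carriage return in the text node should come out as &#xD;, while the line feed is written as a raw newline rather than &#xA;.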

The simplest solution is to derive from the XmlWriter class and override its WriteString method to manually replace whitespace characters with their numeric character references. The WriteString method is also the place where .NET escapes characters that are not permitted in text nodes, such as the syntax markers &, < and >, which are written as &amp;, &lt; and &gt; respectively.

Since XmlWriter is abstract, we derive from XmlTextWriter instead, to avoid having to implement all the abstract methods of the former class. Here is a quick-and-dirty implementation:

    using System.IO;
    using System.Xml;

    public class EntitizingXmlWriter : XmlTextWriter
    {
        public EntitizingXmlWriter(TextWriter writer) :
            base(writer)
        { }

        public override void WriteString(string text)
        {
            foreach (char c in text)
            {
                switch (c)
                {
                    // Write whitespace characters as numeric character references.
                    case '\r':
                    case '\n':
                    case '\t':
                        base.WriteCharEntity(c);
                        break;
                    // Let the base class handle (and escape) everything else.
                    default:
                        base.WriteString(c.ToString());
                        break;
                }
            }
        }
    }

If this is intended for use in a production environment, you will want to get rid of the c.ToString() call, since it is very inefficient. You can optimize the code by batching up substrings of the source text that contain none of the characters you want to entitize, and passing each one to base.WriteString in a single call.

A word of warning: the following naive implementation will not work, since the base WriteString method replaces any & character with &amp;, so that \n, for example, ends up written as &amp;#xA;.

    public override void WriteString(string text)
    {
        text = text.Replace("\r", "&#xD;");
        text = text.Replace("\n", "&#xA;");
        text = text.Replace("\t", "&#x9;");
        base.WriteString(text);
    }

Finally, to save your XDocument to a destination file or stream, simply use the following snippet:

    using (var textWriter = new StreamWriter(destination))
    using (var xmlWriter = new EntitizingXmlWriter(textWriter))
        document.Save(xmlWriter);

Hope this helps!

Edit: For reference, here is an optimized version of the overridden WriteString method:

    public override void WriteString(string text)
    {
        // The start index of the next substring containing only non-entitized characters.
        int start = 0;

        // The index of the current character being checked.
        for (int curr = 0; curr < text.Length; ++curr)
        {
            // Check whether the current character should be entitized.
            char chr = text[curr];
            if (chr == '\r' || chr == '\n' || chr == '\t')
            {
                // Write the previous substring of non-entitized characters.
                if (start < curr)
                    base.WriteString(text.Substring(start, curr - start));

                // Write current character, entitized.
                base.WriteCharEntity(chr);

                // Next substring of non-entitized characters tentatively starts
                // immediately beyond current character.
                start = curr + 1;
            }
        }

        // Write the trailing substring of non-entitized characters.
        if (start < text.Length)
            base.WriteString(text.Substring(start, text.Length - start));
    }

If your document contains insignificant whitespace that you want to distinguish from your &#xA; entities, you can use the following (much simpler) solution: temporarily convert the &#xA; character references to another character (one that is not already present in your document), perform the XML processing, and then convert that character back in the output. In the example below, we use the private-use character U+E800.

    static string ProcessXml(string input)
    {
        input = input.Replace("&#xA;", "&#xE800;");
        XDocument document = XDocument.Parse(input);
        // TODO: Perform XML processing here.
        string output = document.ToString();
        return output.Replace("\uE800", "&#xA;");
    }

Note that because XDocument resolves numeric character references to their corresponding Unicode characters, the "&#xE800;" references will have been resolved to '\uE800' in the output.

As a rule, you can safely use any code point from the Unicode "Private Use Area" (U+E000 to U+F8FF). If you want to be extra safe, check that the character is not already present in the document; if it is, pick another character from that range. Since you will only use the character temporarily and internally, it does not matter which one you choose. In the very unlikely scenario that all private-use characters are already present in the document, throw an exception; however, I doubt that this will ever happen in practice.
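The check described above could be sketched roughly as follows. The class and method names are hypothetical, and the sketch only looks for literal characters in the raw input; it assumes the document does not already spell a private-use character as a numeric reference such as &#xE800;.

```csharp
using System;

public static class PlaceholderPicker
{
    // Hypothetical helper: returns the first Private Use Area character
    // (U+E000 to U+F8FF) that does not occur literally in the input.
    public static char FindUnusedPrivateUseChar(string input)
    {
        for (int code = 0xE000; code <= 0xF8FF; code++)
        {
            char candidate = (char)code;
            if (input.IndexOf(candidate) < 0)
                return candidate;
        }

        // Every private-use character is already taken.
        throw new InvalidOperationException(
            "No unused private-use character is available.");
    }
}
```

The returned character would then take the place of the hard-coded U+E800 in ProcessXml, both when building the temporary reference and when converting back in the output.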


Source: https://habr.com/ru/post/905668/
