Removing Invalid Characters from an XML String in C#

I am stuck trying to remove invalid characters from an XML file. I found a regex pattern that should remove everything that is not allowed:

    public static string CleanInvalidXmlChars(string text)
    {
        // From xml spec valid chars:
        // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        // any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
        string re = @"[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF]";
        return Regex.Replace(text, re, "");
    }

And here is a snippet of my code that reads the data from SQL Server:

    using (var cmd = new SqlCommand(Context.Command, connection))
    {
        cmd.CommandTimeout = Context.CommandTimeout;
        using (var reader = cmd.ExecuteReader())
        {
            StringBuilder xmlResults = new StringBuilder(string.Empty);
            while (reader.Read())
            {
                xmlResults.Append(reader.GetString(0));
            }
            if (!string.IsNullOrWhiteSpace(xmlResults.ToString()))
            {
                var doc = new XmlDocument();
                XmlReader xmlReader = XmlReader.Create(new StringReader(xmlResults.ToString()));
                doc.Load(xmlReader);
                var nav = doc.CreateNavigator();
                var objs = nav.Select("/index/type");
                foreach (XPathNavigator obj in objs)
                {
                    o.OnNext(obj);
                }
            }
        }
    }

I tried calling CleanInvalidXmlChars in different places, either when appending each row:

    while (reader.Read())
    {
        xmlResults.Append(CleanInvalidXmlChars(reader.GetString(0)));
    }

Or when creating the reader:

    XmlReader xmlReader = XmlReader.Create(new StringReader(CleanInvalidXmlChars(xmlResults.ToString())));

One of the cells I am reading contains a 0x0B character (I could remove it on the SQL Server side, but I want to be safe on the C# side as well).

However, I always end up with this error:

System.Xml.XmlException: '', the hexadecimal value 0x0B, is an invalid character. Line 115, position 33407.

Can someone help me solve this?

2 answers

This is a non-regex method for cleaning the string data. It keeps only the characters that XML 1.0 allows, so the 0x0B that is still getting through in your case is dropped:

    public static string stripNonValidXMLCharacters(string textIn)
    {
        if (String.IsNullOrEmpty(textIn))
            return textIn;

        StringBuilder textOut = new StringBuilder(textIn.Length);
        foreach (Char current in textIn)
        {
            // Keep tab, LF, CR and the two valid BMP ranges; 0x0B is not a valid XML 1.0 character.
            // A Char is a 16-bit code unit, so a check for 0x10000-0x10FFFF could never match here;
            // surrogate halves (0xD800-0xDFFF) are dropped by this loop.
            if (current == 0x9 || current == 0xA || current == 0xD
                || (current >= 0x20 && current <= 0xD7FF)
                || (current >= 0xE000 && current <= 0xFFFD))
                textOut.Append(current);
        }
        return textOut.ToString();
    }
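A side note on the regex from the question: in .NET regular expressions `\x` is followed by exactly two hex digits, so `\xD7FF` is read as `\xD7` followed by the literal text `FF`, and the pattern does not describe the ranges listed in its comment. Four-digit code units are written as `\uXXXX`. Below is a minimal sketch of a regex version under that reading. The class and method names are hypothetical, and the pattern lists the invalid code units instead of negating a whitelist, so that valid surrogate pairs (characters above U+FFFF) are kept.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question or answers.
public static class XmlRegexCleaner
{
    // Matches, in UTF-16 code units:
    //  - control characters other than tab, LF and CR, plus U+FFFE and U+FFFF
    //  - a high surrogate that is not followed by a low surrogate
    //  - a low surrogate that is not preceded by a high surrogate
    private static readonly Regex InvalidXmlChars = new Regex(
        @"[\u0000-\u0008\u000B\u000C\u000E-\u001F\uFFFE\uFFFF]" +
        @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])" +
        @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]",
        RegexOptions.Compiled);

    public static string Clean(string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;
        return InvalidXmlChars.Replace(text, string.Empty);
    }
}
```

It could be called the same way as the method above, for example `xmlResults.Append(XmlRegexCleaner.Clean(reader.GetString(0)))`.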

The same question has been asked before. It has an accepted answer and an alternative answer that I prefer (the code is copied below).

    public static string XmlCharacterWhitelist( string in_string )
    {
        if( in_string == null ) return null;

        StringBuilder sbOutput = new StringBuilder();
        char ch;

        for( int i = 0; i < in_string.Length; i++ )
        {
            ch = in_string[i];
            if( ( ch >= 0x0020 && ch <= 0xD7FF ) ||
                ( ch >= 0xE000 && ch <= 0xFFFD ) ||
                ch == 0x0009 ||
                ch == 0x000A ||
                ch == 0x000D )
            {
                sbOutput.Append( ch );
            }
        }
        return sbOutput.ToString();
    }
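The whitelist above works on single 16-bit `char` values, so it also drops both halves of any surrogate pair. As a hedged alternative, assuming a framework version that provides `XmlConvert.IsXmlChar` (.NET Framework 4.0 or later), the range check can be delegated to the framework and valid pairs kept with `char.IsSurrogatePair`. The class and method names below are hypothetical.

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the original answers.
public static class XmlSanitizer
{
    public static string RemoveInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char ch = text[i];
            if (XmlConvert.IsXmlChar(ch))
            {
                // Valid single-unit character (tab, LF, CR, 0x20-0xD7FF, 0xE000-0xFFFD).
                sb.Append(ch);
            }
            else if (i + 1 < text.Length && char.IsSurrogatePair(ch, text[i + 1]))
            {
                // High + low surrogate: one character in the range 0x10000-0x10FFFF.
                sb.Append(ch).Append(text[i + 1]);
                i++; // skip the low surrogate that was just copied
            }
            // Anything else (0x0B, other control characters, lone surrogates) is dropped.
        }
        return sb.ToString();
    }
}
```

In the question's code this would be applied before parsing, for example `XmlReader.Create(new StringReader(XmlSanitizer.RemoveInvalidXmlChars(xmlResults.ToString())))`. Note that none of these methods touch numeric character references such as `&#xB;`. If the value arrives from the database in that escaped form, it consists only of valid characters and passes through the filter, and the reader will still reject it.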

Source: https://habr.com/ru/post/985528/

