.NET DataSet.GetXml() - what is the default encoding?

An existing application passes XML to a sproc in SQL Server 2000; the input parameter's data type is TEXT. The XML comes from DataSet.GetXml(), but I noticed that it does not declare an encoding.

So when a user sneaks an unacceptable character into the dataset, in particular character 146 (a curly apostrophe in Windows-1252, which is not actually ASCII) instead of ASCII 39 (a straight single quote), the sproc fails.

One approach is to prefix the GetXml() result with

<?xml version="1.0" encoding="ISO-8859-1"?> 

This works in this case, but what would be a more correct approach to ensure that the sproc does not fail if other unexpected characters appear?

PS. I suspect that the user types the text in MS Word or a similar editor, then copies and pastes it into the application's input fields. I want the user to be able to keep working this way; I just need to prevent the failures.

EDIT: I am looking for answers that confirm or deny specific points, for example:
- What is the default encoding if none is specified in the XML header?
- Is ISO-8859-1 the correct encoding to use here?
- Is there a better encoding that covers more of the characters used in the English-speaking world and is therefore less likely to cause an error in the sproc?
- Would you filter input at the user interface level to standard ASCII (0 to 127 only) and disallow extended ASCII?
- Any other relevant information.

2 answers

DataSet.GetXml() returns a string. In .NET, strings are encoded internally as UTF-16, but that is not very important here.

The string has no <?xml encoding=...> declaration because that declaration is only useful, or necessary, when XML is parsed from a byte stream. A .NET string is not a byte stream; it is text with well-defined semantics (Unicode), so the declaration is not needed there.
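To make the point above concrete, here is a minimal sketch showing that GetXml() hands back plain Unicode text with no XML declaration. The dataset, table, and column names (Orders, Order, Note) are made up for illustration; they are not from the original application.

```csharp
using System;
using System.Data;

public static class GetXmlDemo
{
    // Builds a tiny DataSet (hypothetical names) containing a curly
    // apostrophe (U+2019) and returns the result of GetXml().
    public static string BuildXml()
    {
        var ds = new DataSet("Orders");
        DataTable table = ds.Tables.Add("Order");
        table.Columns.Add("Note", typeof(string));
        table.Rows.Add("It\u2019s here");

        // GetXml() returns a .NET string: Unicode text, no byte encoding,
        // and therefore no <?xml ... encoding="..."?> declaration.
        return ds.GetXml();
    }
}
```

The returned string begins with the root element rather than with `<?xml`, and the curly apostrophe is still present as an ordinary Unicode character.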

If there is no XML encoding declaration, an XML parser must assume UTF-8 (or UTF-16 when a byte order mark is present). In your case, however, this is also irrelevant, because the problem has nothing to do with the XML parser (SQL Server does not parse the XML when it is stored in a TEXT column). The problem is that your XML contains Unicode characters, and TEXT is a non-Unicode SQL type.

You can convert a string to bytes in any encoding using the Encoding.GetBytes() method.
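As a rough sketch of how Encoding.GetBytes() can be used to find out up front whether a string fits a given encoding, the helper below asks for a strict encoder that throws instead of silently substituting characters. The class and method names (EncodingCheck, CanEncode) are hypothetical. Note that the curly apostrophe (U+2019) exists in Windows-1252 as byte 146 but has no representation in ISO-8859-1, so a strict ISO-8859-1 encoder should reject it.

```csharp
using System;
using System.Text;

public static class EncodingCheck
{
    // Returns true if every character in 'text' can be represented in the
    // named encoding. The exception fallback replaces the default behaviour
    // of substituting '?' or a best-fit character.
    public static bool CanEncode(string text, string encodingName)
    {
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

For example, `EncodingCheck.CanEncode("It\u2019s", "ISO-8859-1")` returns false while the same call with `"utf-8"` returns true. A check like this could be run before calling the sproc, so that problem characters are reported or replaced in the application rather than causing a failure in SQL Server.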


I believe your approach should be to use WriteXml instead of GetXml. This should allow you to specify the encoding.

However, note that you will have to write through an intermediate stream: if you write directly to a string, the output will always be UTF-16. Since you are using a TEXT column, that would let through characters that are invalid for TEXT.
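A minimal sketch of that suggestion, assuming an XmlWriter over a MemoryStream is an acceptable intermediate stream. The class and method names (WriteXmlDemo, Serialize) are hypothetical, and the choice of encoding is left to the caller.

```csharp
using System.Data;
using System.IO;
using System.Text;
using System.Xml;

public static class WriteXmlDemo
{
    // Serializes the DataSet to bytes in the requested encoding.
    // Because the XmlWriter targets a byte stream, it emits an XML
    // declaration that names the encoding.
    public static byte[] Serialize(DataSet ds, Encoding encoding)
    {
        var settings = new XmlWriterSettings { Encoding = encoding, Indent = true };
        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                ds.WriteXml(writer);
            }
            // The writer is disposed (and flushed) before the bytes are read.
            return stream.ToArray();
        }
    }
}
```

Calling `WriteXmlDemo.Serialize(ds, Encoding.GetEncoding("ISO-8859-1"))` yields bytes whose declaration names ISO-8859-1. As far as I know, XmlWriter writes characters that the target encoding cannot represent as numeric character references (such as `&#x2019;`) rather than dropping them, but that is worth verifying against your own runtime before relying on it.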


Source: https://habr.com/ru/post/912219/

