MemoryStream from a string - confusion about which encoding to use

I have a piece of code that converts a string to a memory stream:

using (MemoryStream stream = new MemoryStream(Encoding.UTF8.GetBytes(applicationForm))) 

However, I am not sure whether this is correct. In general, .NET encodings always confuse me.

Bottom line: am I using the correct encoding object (UTF8) to get the bytes?

I know that .NET stores strings internally as UTF-16, but my applicationForm variable is based on a file whose text was saved as UTF-8.

Thanks, Pawel

EDIT 1: Let me explain how I get the applicationForm variable. I have access to an assembly that exposes a class with a GenerateApplicationForm method. This method returns a string. However, I know that behind the scenes the component uses files stored on disk, and the content of those files is encoded as UTF-8. So I cannot read the file directly, etc. All I have is this string, and I know that a UTF-8 encoded file was originally used. In the client code that consumes the GenerateApplicationForm component, I need to convert the applicationForm variable to a stream, because other components (from another assembly) expect a Stream. This is where the using ... statement mentioned in the question comes in.

+6
5 answers

Assuming applicationForm is a string that you read from some UTF-8 text file, it will be UTF-16 (which .NET calls Unicode), regardless of the encoding of the source file. The conversion happened when the file was loaded into the string.

Your code will encode the applicationForm string into a MemoryStream of UTF8 bytes.

This may or may not be correct depending on what you want to do with it.
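As a rough illustration of the point above, the sketch below encodes a string into a UTF-8 MemoryStream and decodes it again with the same encoding. The class name StreamRoundTrip, the helper names ToUtf8Stream and ReadAll, and the sample text are all hypothetical; the real consumer of the stream may read it differently.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamRoundTrip
{
    // Encode a .NET (UTF-16) string as UTF-8 bytes and wrap them in a stream.
    public static MemoryStream ToUtf8Stream(string text)
    {
        return new MemoryStream(Encoding.UTF8.GetBytes(text));
    }

    // Decode the whole stream with an explicit encoding, as a consumer might.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Hypothetical sample content; \u0142 is a non-ASCII letter.
        string applicationForm = "Pawe\u0142";

        using (MemoryStream stream = ToUtf8Stream(applicationForm))
        {
            string roundTripped = ReadAll(stream, Encoding.UTF8);
            Console.WriteLine(roundTripped == applicationForm); // True
        }
    }
}
```

The round trip only succeeds because both sides agree on UTF-8. Whether UTF-8 is "correct" therefore depends on what the component receiving the Stream expects.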

.NET strings are always UTF-16 (Unicode). When strings are converted to files, streams, or byte[], they can be encoded in different ways. One byte is not enough to store all the different characters used in all languages, so such text must be encoded in a way that lets a single character be represented by more than one byte, sometimes or always, depending on the encoding used.

If you use a simple encoding such as ASCII, one character will always occupy one byte, but the data will be limited to the ASCII character set. Converting to ASCII from any UTF encoding may lose data if any multi-byte characters are used.
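To make the byte counts concrete, here is a small sketch that compares how many bytes the same string needs in each encoding and shows what an ASCII round trip does to a non-ASCII character. The class name EncodingSizes, the helper name ThroughAscii, and the sample text are made up for this example.

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    // Encode to ASCII and decode again; unmappable characters become '?'.
    public static string ThroughAscii(string text)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));
    }

    public static void Main()
    {
        // Two ASCII letters followed by U+0142, which is outside ASCII.
        string text = "ab\u0142";

        Console.WriteLine(Encoding.UTF8.GetByteCount(text));    // 4 (1 + 1 + 2)
        Console.WriteLine(Encoding.Unicode.GetByteCount(text)); // 6 (UTF-16, 2 bytes each)
        Console.WriteLine(Encoding.ASCII.GetByteCount(text));   // 3

        Console.WriteLine(ThroughAscii(text)); // ab?
    }
}
```

Note that Encoding.Unicode is the .NET name for UTF-16 (little-endian), which is why the terms are often used interchangeably.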

For the full picture on Unicode, go here.

EDIT 1: Barring additional information about the GenerateApplicationForm component, sticking with UTF8 is probably the right choice. If that does not work, try ASCII or UTF16. Best of all, consult the source code of the component or the supplier of the component.

EDIT 2: Definitely UTF8, then; you were right all along.

+3

If the data is saved in UTF-8, you need to open it with UTF-8.

0

Just use the same encoding for reading as was used for writing. If it is UTF8, use UTF8. If you write Chinese, then someone must be able to read Chinese in order to understand you...

0

To indicate UTF-8, a byte order mark (BOM) may be added at the beginning of the file. Check whether the file is UTF-8, then use the UTF-8 converter.
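For the in-memory case from the question, it may help to see where a BOM does and does not appear. The sketch below is only illustrative (BomDemo and WriteWith are hypothetical names): Encoding.UTF8.GetBytes returns the encoded characters without a preamble, while a StreamWriter constructed with Encoding.UTF8 writes the three-byte preamble at the start of a fresh stream.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Return the bytes a StreamWriter produces for the text with the given encoding.
    public static byte[] WriteWith(Encoding encoding, string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            }
            // ToArray still works after the MemoryStream has been closed.
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        // The UTF-8 byte order mark itself.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble())); // EF-BB-BF

        // GetBytes never prepends the BOM.
        Console.WriteLine(Encoding.UTF8.GetBytes("A").Length); // 1

        // StreamWriter with Encoding.UTF8 emits the BOM; UTF8Encoding(false) does not.
        Console.WriteLine(WriteWith(Encoding.UTF8, "A").Length);            // 4
        Console.WriteLine(WriteWith(new UTF8Encoding(false), "A").Length);  // 1
    }
}
```

So the MemoryStream built with Encoding.UTF8.GetBytes in the question carries no BOM; a consumer has to know or assume the encoding rather than detect it from the first bytes.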

0

Encoding to UTF8 bytes creates a representation of your data that is backward compatible with the ASCII character set. Since ASCII is the lowest common denominator for data transfer, you can pretty much guarantee that this representation will work on the vast majority of systems.

While you can change it, you would be assuming that every system it passes through understands that you changed it and supports your new representation. That is a rather difficult assumption to verify. The encodings at both ends need to match.

If, as you say, you cannot change the system that generates your string, then yes, you are doing it right. It works, so why do you think you need to make changes? The internals of how .NET represents strings do not come into play here: you are not getting the .NET string, you are getting the encoded representation of the UTF-8 value, so you should use UTF8 to decode it to the original value.

0

Source: https://habr.com/ru/post/889449/

