Notepad++ .NET plugin - get current buffer text - encoding problems

I have a .NET plugin that needs to get the text of the current buffer. I found this page that shows a way to do this:

    public static string GetDocumentText(IntPtr curScintilla)
    {
        int length = (int)Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
        StringBuilder sb = new StringBuilder(length);
        Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, sb);
        return sb.ToString();
    }

That works fine until character encoding comes into play. I have a buffer that is set to "UTF-8 without BOM" in the "Encoding" menu, and I write its text to a file:

 System.IO.File.WriteAllText(@"C:\Users\davet\BBBBBB.txt", sb.ToString()); 

When I open this file in Notepad++, the Encoding menu shows "UTF-8 without BOM", but the ß character is corrupted (it shows up as something like ÃŸ).

I managed to find the encoding information of my current buffer:

    int currentBuffer = (int)Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_GETCURRENTBUFFERID, 0, 0);
    Console.WriteLine("currentBuffer: " + currentBuffer);
    int encoding = (int)Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_GETBUFFERENCODING, currentBuffer, 0);
    Console.WriteLine("encoding = " + encoding);

It returns "4" for "UTF-8 without BOM" and "0" for "ASCII", but I cannot find out what Notepad++ or Scintilla thinks these values represent.
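For reference, the two values observed here are consistent with the UniMode enumeration in the Notepad++ sources. The sketch below is an assumption based on that enumeration, not something the plugin template is confirmed to expose; the C# names and the Describe helper are hypothetical, so check the numbers against the headers of the Notepad++ version you target.

```csharp
using System;

// Assumption: mirrors the UniMode enumeration in Notepad++'s own sources.
// The C# member names are made up for this sketch.
public enum UniMode
{
    Uni8Bit = 0,       // "ANSI" in the Encoding menu
    UniUtf8 = 1,       // UTF-8 with BOM
    Uni16BE = 2,       // UCS-2 big endian with BOM
    Uni16LE = 3,       // UCS-2 little endian with BOM
    UniCookie = 4,     // UTF-8 without BOM
    Uni7Bit = 5,       // plain 7-bit ASCII
    Uni16BENoBom = 6,  // UCS-2 big endian without BOM
    Uni16LENoBom = 7   // UCS-2 little endian without BOM
}

public static class BufferEncodingInfo
{
    // Hypothetical helper: turns the raw NPPM_GETBUFFERENCODING result into a label.
    public static string Describe(int raw)
    {
        return Enum.IsDefined(typeof(UniMode), raw)
            ? ((UniMode)raw).ToString()
            : "unknown (" + raw + ")";
    }
}
```

Passing the `encoding` value from the snippet above to `BufferEncodingInfo.Describe` would label 4 as UniCookie and 0 as Uni8Bit under these assumptions.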

So I'm a bit lost as to where to go next (Windows is not my natural habitat). Does anyone know what I'm doing wrong, or how to debug this further?

Thanks.

2 answers

Removing StringBuilder fixes this problem.

    public static string GetDocumentTextBytes(IntPtr curScintilla)
    {
        int length = (int)Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
        byte[] sb = new byte[length];
        unsafe
        {
            fixed (byte* p = sb)
            {
                IntPtr ptr = (IntPtr)p;
                Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, ptr);
            }
            return System.Text.Encoding.UTF8.GetString(sb).TrimEnd('\0');
        }
    }
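The decoding step of this answer can be isolated from the Win32 call. The sketch below is a minimal, hypothetical helper (the name DecodeUtf8Buffer is not from the plugin template) that decodes the filled byte array up to the first NUL terminator instead of trimming afterwards. It assumes the Scintilla buffer holds UTF-8, which matches the "UTF-8 without BOM" case in the question; an ANSI document would presumably need a different decoder.

```csharp
using System;
using System.Text;

public static class ScintillaText
{
    // Hypothetical helper: decodes the bytes filled in by SCI_GETTEXT,
    // stopping at the first NUL terminator.
    // Assumption: the buffer content is UTF-8.
    public static string DecodeUtf8Buffer(byte[] buffer)
    {
        int end = Array.IndexOf(buffer, (byte)0);
        if (end < 0)
        {
            end = buffer.Length; // no terminator found: decode everything
        }
        return Encoding.UTF8.GetString(buffer, 0, end);
    }
}
```

Because the bytes never pass through an ANSI-marshalled string, the two-byte sequence for ß reaches the UTF-8 decoder intact.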

Alternative approach:

The reason for broken UTF-8 characters is that this line ..

 Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, sb); 

.. reads the string using [MarshalAs(UnmanagedType.LPStr)], which decodes it with your computer's default ANSI encoding (see MSDN). This means you get a string with one character per byte, so each multi-byte UTF-8 character is split into several characters.
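The effect can be reproduced without Notepad++ at all. The sketch below decodes the UTF-8 bytes of ß with a single-byte encoding; ISO-8859-1 stands in for the machine's ANSI code page here, which is an assumption made for portability (the real default depends on the Windows locale). MojibakeDemo and DecodeAsSingleByte are hypothetical names.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Assumption: ISO-8859-1 stands in for the default ANSI code page.
    static readonly Encoding SingleByte = Encoding.GetEncoding("iso-8859-1");

    // Decodes raw bytes one byte per character, as ANSI marshalling would.
    public static string DecodeAsSingleByte(byte[] utf8Bytes)
    {
        return SingleByte.GetString(utf8Bytes);
    }

    public static void Main()
    {
        byte[] raw = Encoding.UTF8.GetBytes("\u00DF"); // ß is 0xC3 0x9F in UTF-8
        string wrong = DecodeAsSingleByte(raw);
        Console.WriteLine(raw.Length);   // number of bytes
        Console.WriteLine(wrong.Length); // number of characters after mis-decoding
    }
}
```

Both counts come out as 2: the single ß has become two unrelated characters, and writing that string out as UTF-8 is what produces the corrupted file.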

Now, to save the original UTF-8 bytes to disk, you just need to use the same default ANSI encoding when writing the file:

 File.WriteAllText(@"C:\Users\davet\BBBBBB.txt", sb.ToString(), Encoding.Default); 
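Why this works can be seen from the round trip in isolation. The sketch below re-encodes a mis-decoded string with the same single-byte encoding and then decodes the recovered bytes as UTF-8. As an assumption for portability, ISO-8859-1 again stands in for the default ANSI code page; RoundTripDemo and Repair are hypothetical names.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Assumption: ISO-8859-1 stands in for the default ANSI code page.
    static readonly Encoding SingleByte = Encoding.GetEncoding("iso-8859-1");

    // Re-encodes a mis-decoded string with the same single-byte encoding,
    // then decodes the recovered bytes as UTF-8.
    public static string Repair(string misDecoded)
    {
        byte[] original = SingleByte.GetBytes(misDecoded);
        return Encoding.UTF8.GetString(original);
    }
}
```

The trick depends on the encoding mapping every byte to a distinct character and back. That holds for ISO-8859-1; I would not rely on it for multi-byte ANSI code pages, where reading the bytes directly is the safer route.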

Source: https://habr.com/ru/post/1489749/

