Notepad++ .NET plugin - get current buffer text - encoding problems

Question


I have a .NET plugin that needs to get the text of the current buffer. I found this page that shows a way to do this:

    public static string GetDocumentText(IntPtr curScintilla)
    {
        int length = (int)Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
        StringBuilder sb = new StringBuilder(length);
        Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, sb);
        return sb.ToString();
    }

That works fine until character encoding comes into play. I have a buffer that is set to "UTF-8 without BOM" in the "Encoding" menu, and I write its text to a file:

    System.IO.File.WriteAllText(@"C:\Users\davet\BBBBBB.txt", sb.ToString());

When I open this file (in Notepad++), the Encoding menu shows "UTF-8 without BOM", but the ß character is broken (ÃŸ).

I managed to find the encoding information of my current buffer:

    int currentBuffer = (int)Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_GETCURRENTBUFFERID, 0, 0);
    Console.WriteLine("currentBuffer: " + currentBuffer);
    int encoding = (int)Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_GETBUFFERENCODING, currentBuffer, 0);
    Console.WriteLine("encoding = " + encoding);

It shows “4” for “UTF-8 without BOM” and “0” for “ASCII”, but I cannot find out what Notepad++ or Scintilla thinks these values are supposed to represent.

So I'm a bit lost as to where to go next (Windows is not my natural habitat). Does anyone know what I'm doing wrong, or how to debug this further?

Thanks.

+4

c# plugins .net notepad++ scintilla

woddle Jul 04 '13 at 14:54


2 answers

Alternative approach:

The reason for broken UTF-8 characters is that this line ..

    Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, sb);

.. reads the string using [MarshalAs(UnmanagedType.LPStr)], which uses your computer's default ANSI encoding when decoding strings (MSDN). This means you get a string with one character per byte, which splits multi-byte UTF-8 characters apart.
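To make the effect concrete, here is a minimal sketch of what that decoding does to "ß". It assumes ISO-8859-1 as a stand-in for the machine's ANSI code page (the real code page depends on your Windows settings), and the class and method names are made up for illustration:

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Assumption: ISO-8859-1 stands in for the machine's ANSI code page,
    // because it is available on every .NET runtime without extra setup.
    private static readonly Encoding SingleByte = Encoding.GetEncoding("iso-8859-1");

    // Decodes the UTF-8 bytes of 'text' one byte per character,
    // the way an ANSI string marshaller would.
    public static string DecodeAsSingleByte(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return SingleByte.GetString(utf8);
    }

    // Size in bytes once the mis-decoded string is saved as UTF-8, which is
    // what File.WriteAllText does when no encoding argument is passed.
    public static int Utf8SizeAfterMisdecoding(string text)
    {
        return Encoding.UTF8.GetByteCount(DecodeAsSingleByte(text));
    }
}
```

The two UTF-8 bytes of "ß" (0xC3 0x9F) come back as two separate characters, and saving those as UTF-8 turns them into four bytes, which would explain the broken output in the question.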

Now, to save the original UTF-8 bytes to disk, you just need to use the same default ANSI encoding when writing the file:

    File.WriteAllText(@"C:\Users\davet\BBBBBB.txt", sb.ToString(), Encoding.Default);
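A rough sketch of why re-encoding with the same single-byte encoding works: each character maps back to the byte it came from, so the original UTF-8 bytes are restored. ISO-8859-1 is again only an assumed stand-in for Encoding.Default, and the names are hypothetical:

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Assumption: stand-in for the ANSI code page behind Encoding.Default.
    private static readonly Encoding SingleByte = Encoding.GetEncoding("iso-8859-1");

    // Simulates the marshaller: UTF-8 bytes decoded one byte per character.
    public static string Misdecode(string text)
    {
        return SingleByte.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Undoes the damage: turn each character back into its original byte,
    // then decode the resulting byte sequence as UTF-8.
    public static string Recover(string misdecoded)
    {
        byte[] originalBytes = SingleByte.GetBytes(misdecoded);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

This presumably only holds when the ANSI code page is single-byte and maps every byte value to some character; on a multi-byte code page the round trip may lose data.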

0

Sphinxxx Jun 13 '16 at 0:53


woddle · Accepted Answer · 2013-08-06T13:27:20+0000

Removing StringBuilder fixes this problem.

    public static string GetDocumentTextBytes(IntPtr curScintilla)
    {
        int length = (int)Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
        byte[] sb = new byte[length];
        unsafe
        {
            fixed (byte* p = sb)
            {
                IntPtr ptr = (IntPtr)p;
                Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, ptr);
            }
            return System.Text.Encoding.UTF8.GetString(sb).TrimEnd('\0');
        }
    }
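If you would rather not compile with unsafe code enabled, the same idea can probably be expressed with Marshal. This is only a sketch: ScintillaText, DecodeUtf8 and ReadUtf8 are made-up names, and the fill callback is assumed to wrap the same IntPtr overload of Win32.SendMessage used above.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class ScintillaText
{
    // Copies 'length' bytes out of an unmanaged buffer and decodes them as
    // UTF-8, stopping at the first NUL terminator if one is present.
    public static string DecodeUtf8(IntPtr buffer, int length)
    {
        byte[] bytes = new byte[length];
        Marshal.Copy(buffer, bytes, 0, length);
        int end = Array.IndexOf(bytes, (byte)0);
        if (end < 0) end = length;
        return Encoding.UTF8.GetString(bytes, 0, end);
    }

    // Lets 'fill' write into a temporary unmanaged buffer, then decodes it.
    // Assumption: in a plugin, 'fill' would issue the SCI_GETTEXT call.
    public static string ReadUtf8(int length, Action<IntPtr> fill)
    {
        IntPtr buffer = Marshal.AllocHGlobal(length);
        try
        {
            fill(buffer);
            return DecodeUtf8(buffer, length);
        }
        finally
        {
            Marshal.FreeHGlobal(buffer);
        }
    }
}
```

In the plugin this would be called roughly as ScintillaText.ReadUtf8(length, p => Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, p)). Like the version above, it decodes as UTF-8 unconditionally, so it is only correct while the buffer really is UTF-8.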
