Remove 4-byte UTF-8 characters

I would like to remove the 4-byte UTF-8 characters that start with \xF0 (the char with ASCII code 0xF0) from a string, and tried:

sText = Regex.Replace (sText, "\xF0...", "");

This does not work. Using two backslashes did not work either.

The exact input is the content of https://de.wikipedia.org/w/index.php?title=Spezial:Exportieren&action=submit&pages=Unicode and the 4-byte character is the one after the text [[Violinschlüssel]]. In hexadecimal notation: .. 0x65 0x6c 0x5d 0x5d 0x20 0xf0 0x9d 0x84 0x9e 0x20 .. Expected output: .. 0x65 0x6c 0x5d 0x5d 0x20 0x20 ..

What's wrong?

+4
2 answers

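In .NET, strings are UTF-16. In UTF-16, characters that need four bytes in UTF-8 are represented as surrogate pairs, and each half of a pair is a char.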

To remove them, try (with using System.Linq;):

sText = string.Concat(sText.Where(x => !char.IsSurrogate(x)));

(This overload of Concat requires .NET 4.0 (Visual Studio 2010).)


Addition: this may run faster:

sText = new string(sText.Where(x => !char.IsSurrogate(x)).ToArray());

(This also works in .NET 3.5 (Visual Studio 2008).)
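As a quick check against the bytes quoted in the question, the following sketch decodes them as UTF-8, applies the filter, and re-encodes the result. The class and method names are made up for illustration. Note that char.IsSurrogate drops every surrogate, so this removes all characters outside the Basic Multilingual Plane, not only those whose UTF-8 form starts with F0.

```csharp
using System;
using System.Linq;
using System.Text;

static class SurrogateFilterDemo
{
    // Hypothetical helper name; the body is the LINQ filter shown above.
    public static string RemoveSurrogates(string text)
    {
        return new string(text.Where(c => !char.IsSurrogate(c)).ToArray());
    }

    static void Main()
    {
        // UTF-8 bytes from the question: "el]] ", then U+1D11E, then " ".
        byte[] input = { 0x65, 0x6C, 0x5D, 0x5D, 0x20, 0xF0, 0x9D, 0x84, 0x9E, 0x20 };

        string decoded = Encoding.UTF8.GetString(input);   // 8 chars: the clef is 2 chars
        string cleaned = RemoveSurrogates(decoded);        // 6 chars
        byte[] output = Encoding.UTF8.GetBytes(cleaned);

        // Prints 65-6C-5D-5D-20-20, the expected output from the question.
        Console.WriteLine(BitConverter.ToString(output));
    }
}
```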

+5

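You are searching for byte values, but C# strings are made of char values. The C# language specification, section 2.4.4.4 "Character literals", says: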

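A character literal represents a single character, and usually consists of a character in quotes, as in 'a'.
...
A hexadecimal escape sequence represents a single Unicode character, with the value formed by the hexadecimal number following \x.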

So the search for "\xF0..." looks for the character U+F0, which UTF-8 encodes as the bytes C3 B0.

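If you want the Unicode characters whose UTF-8 encoding starts with the byte 0xF0, you need to work out which range of code points is encoded with 0xF0 as the first byte.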

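The character U+10000 is encoded as F0 90 80 80 (the preceding character, U+FFFF, is EF BF BF). The first character whose encoding starts with F1 is U+40000, encoded as F1 80 80 80, and the character before it, U+3FFFF, is F0 BF BF BF.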
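The byte sequences quoted above can be reproduced with a small sketch like the one below. The helper name Utf8Hex is hypothetical; it only combines char.ConvertFromUtf32 with Encoding.UTF8.GetBytes.

```csharp
using System;
using System.Text;

static class Utf8BoundaryDemo
{
    // Hypothetical helper: the UTF-8 bytes of one code point as a hex string.
    public static string Utf8Hex(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s)).Replace("-", " ");
    }

    static void Main()
    {
        // U+00F0 is what "\xF0" means in a C# string; the rest are the
        // boundaries of the range whose UTF-8 form starts with the byte F0.
        foreach (int cp in new[] { 0xF0, 0xFFFF, 0x10000, 0x3FFFF, 0x40000 })
        {
            Console.WriteLine("U+{0:X4} -> {1}", cp, Utf8Hex(cp));
        }
    }
}
```

Only U+10000 and U+3FFFF should come out with F0 as the first byte, which is what bounds the range.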

So the characters to remove are in the range U+10000 to U+3FFFF. That suggests a regular expression such as

sText = Regex.Replace (sText, "[\\x10000-\\x3FFFF]", "");

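However, in a .NET string these characters are stored as surrogate pairs, so a pattern written in terms of code points above U+FFFF does not match them (and in a .NET regex \x takes exactly two hex digits, so \x10000 is not read as U+10000). The following code shows how the input is actually stored.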

static void Main(string[] args)
{
    string input = "] 𝄞 (";
    Console.Write("Input length  {0} : '{1}'  : ", input.Length, input);
    foreach (char cc in input)
    {
        Console.Write("  {0,2:X02}", (int)cc);
    }
    Console.WriteLine();
}

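The output shows that the character is stored as a surrogate pair (D834 DD1E), which suggests that the answer from @Jeppe is correct.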

Input length  6 : '] ?? ('  :   5D  20  D834  DD1E  20  28 
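
If only the characters whose UTF-8 form starts with F0 should go (U+10000 to U+3FFFF), one possible sketch is to express that range in UTF-16 terms: those code points have a high surrogate between D800 and D8BF, followed by any low surrogate. This is an illustration rather than part of either answer, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class LeadByteF0Demo
{
    // Surrogate pairs for U+10000..U+3FFFF:
    // high surrogate D800..D8BF, then any low surrogate DC00..DFFF.
    static readonly Regex F0Range = new Regex(@"[\uD800-\uD8BF][\uDC00-\uDFFF]");

    public static string RemoveF0Characters(string text)
    {
        return F0Range.Replace(text, "");
    }

    static void Main()
    {
        string input = "] \U0001D11E (";            // U+1D11E is stored as D834 DD1E
        string output = RemoveF0Characters(input);

        // Prints "6 -> 4": the surrogate pair is removed as a unit.
        Console.WriteLine("{0} -> {1}", input.Length, output.Length);
    }
}
```

Unlike the char.IsSurrogate filter, this leaves characters from U+40000 upward (UTF-8 lead bytes F1 to F4) and any unpaired surrogates in place.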
+2

Source: https://habr.com/ru/post/1649866/

