How to remove all utf8mb4 characters from a string

Using C#, how can I remove utf8mb4 characters (emoji, etc.) from a string so that the result is fully compatible with utf8?

Most of the solutions I have found involve changing the database configuration, but unfortunately that is not an option for me.

1 answer

This should replace surrogate characters with replacementCharacter (which could even be string.Empty).

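This is a MySql problem, as the mention of utf8mb4 suggests. There is a difference between utf8 and utf8mb4 in MySql: utf8 does not support 4-byte UTF-8 sequences. According to the wiki, 4-byte UTF-8 sequences encode code points > 0xFFFF, which in UTF-16 require two char values (a surrogate pair). This method removes surrogate characters. When they are found "coupled" (a high + a low surrogate), the pair is replaced by a single replacementCharacter; otherwise each orphan (invalid) high or low surrogate is replaced by a replacementCharacter.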
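The link between 4-byte UTF-8 sequences and UTF-16 surrogates can be checked with a small sketch. The class and method names here (SurrogateDemo, ContainsSurrogates, Utf8Length) are hypothetical helpers for illustration, not part of the answer's method.

```csharp
using System.Text;

public static class SurrogateDemo
{
    // True if s contains any UTF-16 surrogate code unit. In a well-formed
    // string, surrogates appear only for code points above 0xFFFF, which are
    // exactly the ones that need 4 bytes in UTF-8.
    public static bool ContainsSurrogates(string s)
    {
        foreach (char ch in s)
        {
            if (char.IsSurrogate(ch))
            {
                return true;
            }
        }
        return false;
    }

    // Number of bytes s occupies when encoded as UTF-8.
    public static int Utf8Length(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }
}
```

For the euro sign "\u20AC" (inside the BMP), Length is 1, Utf8Length is 3 and ContainsSurrogates is false. For the emoji "\U0001F600", Length is 2, Utf8Length is 4 and ContainsSurrogates is true.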

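// Requires: using System.Text; (for StringBuilder)
// Replaces every UTF-16 surrogate (paired or orphaned) in str with replacementCharacter.
public static string RemoveSurrogatePairs(string str, string replacementCharacter = "?")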
{
    if (str == null)
    {
        return null;
    }

    StringBuilder sb = null;

    for (int i = 0; i < str.Length; i++)
    {
        char ch = str[i];

        if (char.IsSurrogate(ch))
        {
            if (sb == null)
            {
                sb = new StringBuilder(str, 0, i, str.Length);
            }

            sb.Append(replacementCharacter);

            // If there is a high+low surrogate, skip the low surrogate
            if (i + 1 < str.Length && char.IsHighSurrogate(ch) && char.IsLowSurrogate(str[i + 1]))
            {
                i++;
            }
        }
        else if (sb != null)
        {
            sb.Append(ch);
        }
    }

    return sb == null ? str : sb.ToString();
}
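For comparison, here is a more compact sketch of the same idea using a regular expression instead of the hand-written loop. This is a different technique from the method above, and the SurrogateRegex class name is hypothetical. It relies on .NET regular expressions matching UTF-16 code units, so the surrogate ranges can be written as \uXXXX escapes.

```csharp
using System.Text.RegularExpressions;

public static class SurrogateRegex
{
    // Tries a high+low surrogate pair first, then any leftover lone surrogate,
    // so a valid pair yields one match and an orphan yields one match.
    private static readonly Regex Surrogates =
        new Regex(@"[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]");

    public static string Replace(string str, string replacement = "?")
    {
        if (str == null)
        {
            return null;
        }

        // A MatchEvaluator is used so that a '$' inside replacement is not
        // interpreted as a regex substitution pattern.
        return Surrogates.Replace(str, m => replacement);
    }
}
```

With this sketch, SurrogateRegex.Replace("a\U0001F600b") gives "a?b", and an orphaned high surrogate such as "a\uD83Db" also gives "a?b". Passing string.Empty as the replacement simply drops the characters.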

Source: https://habr.com/ru/post/1589393/

