Is there a more elegant way to convert Unicode to ASCII?

I have seen a lot of problems where you have some strange Unicode character that looks similar to a certain ASCII character, and for some reason it needs to be converted at runtime.

In this case, I am trying to export to CSV. I am already using a nasty fix for the dash, em dash, en dash and horizontal bar, and I just got a new request for `` and ''. Besides another nasty fix, is there a better way to do this?

Here is what I have at the moment:

        formattedString = formattedString.Replace(char.ConvertFromUtf32(8211), "-");
        formattedString = formattedString.Replace(char.ConvertFromUtf32(8212), "-");
        formattedString = formattedString.Replace(char.ConvertFromUtf32(8213), "-");

Any ideas?

+3
4 answers

This is a fairly tricky task, so no approach will be truly elegant.

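First, in your .Replace calls, char.ConvertFromUtf32(8211) can be written as the literal "\u2013". Code charts identify the character as U+2013, so the hexadecimal form is easier to check against them (char.ConvertFromUtf32(0x2013) would also work, but it returns a string rather than a char). (You could also type '–' directly, but the various dashes are hard to tell apart in source code.)

(Since each replacement here swaps one char for another, the Replace(char, char) overload can be used.)

So, with char literals: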

formattedString = formattedString.Replace('\u2013', '-');
formattedString = formattedString.Replace('\u2014', '-');
formattedString = formattedString.Replace('\u2015', '-');

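With 3 calls to Replace, the string is scanned three times, and each call may allocate a new string for formattedString. A single pass with a StringBuilder avoids that: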

StringBuilder sb = new StringBuilder(formattedString.Length); // we know this is the capacity, so we initialise with it
foreach(char c in formattedString)
  switch(c)
  {
    case '\u2013': case '\u2014': case '\u2015':
      sb.Append('-');
      break; // C# does not allow falling through to the next case
    default:
      sb.Append(c);
      break;
  }
formattedString = sb.ToString();

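(Another option is a range test, (int)c >= 0x2013 && (int)c <= 0x2015, but that only works while the characters to replace are contiguous.)

This approach also lets you replace a single char with a string of several characters. For example: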

case 'ß':
  sb.Append("ss");
  break;
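
The two switch fragments above can be combined into one method. The following is only a sketch: the class and method names (DashCleaner, Clean) are made up for illustration, and the set of mapped characters is just the one discussed in this thread.

```csharp
using System.Text;

// Hypothetical helper; the name and the mapped characters are illustrative only.
public static class DashCleaner
{
    public static string Clean(string input)
    {
        // The input length is only a starting capacity: "ss" can make the output longer.
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            switch (c)
            {
                case '\u2013': // en dash
                case '\u2014': // em dash
                case '\u2015': // horizontal bar
                    sb.Append('-');
                    break;
                case '\u00DF': // sharp s, written as an escape to avoid source-encoding issues
                    sb.Append("ss");
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }
}
```

It would be called as formattedString = DashCleaner.Clean(formattedString);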

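Another approach is a lookup table. If the input were limited to, say, US-ASCII, there would be only 128 possible characters, so an array indexed by character code would be enough: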

char[] replacements = {/*list of replacement characters*/};
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach(char c in formattedString)
  sb.Append(replacements[(int)c]); // the array must have an entry for every possible input char
formattedString = sb.ToString();

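With Unicode, there are about 109,000 assigned characters, with code points ranging from 0 to 1114111. A single flat table covering all of them would be large and almost entirely unused.

One way around that is to split the table into blocks of 128 entries and to share a single block among all the ranges you do not handle: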

char[] unchanged = new char[128];
for(int i = 0; i != 128; ++i)
  unchanged[i] = (char)i;
char[] error = new string('\uFFFD', 128).ToCharArray();
// Block 64 starts at U+2000, so U+2013..U+2015 sit at offsets 19..21 (0x13 hex is 19 decimal).
char[] block0 = (new string('\uFFFD', 19) + "---" + new string('\uFFFD', 106)).ToCharArray();

char[][] blocks = new char[8704][];
for(int i = 1; i != 8704; ++i)
  blocks[i] = error;
blocks[0] = unchanged;
blocks[64] = block0;

/* the above need only happen once, so it could be done with static members of a helper class that are initialised in a static constructor*/

StringBuilder sb = new StringBuilder(formattedString.Length);
foreach(char c in formattedString)
{
  int cAsI = (int)c;
  sb.Append(blocks[cAsI / 128][cAsI % 128]);
}
string ret = sb.ToString();
if(ret.IndexOf('\uFFFD') != -1)
    throw new ArgumentException("Unconvertable character");
formattedString = ret;

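This way only 384 chars of table data are stored (three blocks of 128 each), instead of an entry for each of the 109,000 or so characters. Supporting another range only means adding a block for it.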

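Finally, what about characters outside the Basic Multilingual Plane, which in UTF-16, as used by .NET, take two chars each (a surrogate pair)?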
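
A per-char loop sees such a character as two separate surrogates rather than as one unit. One possible way to handle this is to iterate by code point instead. The sketch below assumes a map keyed by code point; the CodePointReplacer class and its Replace method are hypothetical names, not part of .NET.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: replaces by code point, so characters above U+FFFF are treated as one unit.
public static class CodePointReplacer
{
    public static string Replace(string input, IDictionary<int, string> map)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            int codePoint;
            if (char.IsHighSurrogate(input[i]) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                codePoint = char.ConvertToUtf32(input[i], input[i + 1]);
                i++; // the low surrogate has been consumed as well
            }
            else
            {
                codePoint = input[i]; // BMP character, or an unpaired surrogate
            }

            string replacement;
            if (map.TryGetValue(codePoint, out replacement))
                sb.Append(replacement);
            else if (codePoint > 0xFFFF)
                sb.Append(char.ConvertFromUtf32(codePoint)); // unmapped: copy the pair through
            else
                sb.Append((char)codePoint);
        }
        return sb.ToString();
    }
}
```

Whether unmapped characters should be copied through, as here, or rejected, as in the block-table code above, depends on what the consumer of the CSV can accept.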

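If the target really is US-ASCII, another option is to use System.Text.Encoding with a custom EncoderFallback and EncoderFallbackBuffer that supply the replacements. The encoder then calls your fallback only for characters it cannot encode, and everything else is handled by the framework.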
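
A minimal sketch of that EncoderFallback approach follows. EncoderFallback, EncoderFallbackBuffer and Encoding.GetEncoding are real .NET types and methods; the class names AsciiReplacementFallback and AsciiConverter, and the contents of the replacement table, are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical fallback that substitutes ASCII text for a few known characters.
public sealed class AsciiReplacementFallback : EncoderFallback
{
    // Illustrative table only; extend it as new requests arrive.
    internal static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { '\u2013', "-" }, { '\u2014', "-" }, { '\u2015', "-" },
        { '\u2018', "'" }, { '\u2019', "'" },
        { '\u201C', "\"" }, { '\u201D', "\"" },
        { '\u00DF', "ss" },
    };

    // Length of the longest replacement in the table.
    public override int MaxCharCount { get { return 2; } }

    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new ReplacementBuffer();
    }

    private sealed class ReplacementBuffer : EncoderFallbackBuffer
    {
        private string current = "";
        private int position = 0;

        public override int Remaining { get { return current.Length - position; } }

        public override bool Fallback(char charUnknown, int index)
        {
            string replacement;
            if (!Map.TryGetValue(charUnknown, out replacement))
                throw new ArgumentException("Unconvertible character at index " + index);
            current = replacement;
            position = 0;
            return true;
        }

        public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
        {
            // Surrogate pairs are not mapped in this sketch.
            throw new ArgumentException("Unconvertible character at index " + index);
        }

        public override char GetNextChar()
        {
            return position < current.Length ? current[position++] : '\0';
        }

        public override bool MovePrevious()
        {
            if (position == 0) return false;
            position--;
            return true;
        }

        public override void Reset() { current = ""; position = 0; }
    }
}

public static class AsciiConverter
{
    public static string ToAscii(string input)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii", new AsciiReplacementFallback(), DecoderFallback.ExceptionFallback);
        return ascii.GetString(ascii.GetBytes(input));
    }
}
```

With this in place the call site reduces to formattedString = AsciiConverter.ToAscii(formattedString); and a character missing from the table raises an ArgumentException instead of being silently dropped.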

+7

You could use a lookup instead of repeated calls to string.Replace.

For example:

var lookup = new Dictionary<char, char>
{
    { '`',  '-' },
    { 'இ', '-' },
    //next pair, etc, etc
};

var input = "blah இ blah ` blah";

char r; // receives the replacement from TryGetValue; "var" cannot be used without an initializer

var result = input.Select(c => lookup.TryGetValue(c, out r) ? r : c);

string output = new string(result.ToArray());

Or, to replace everything that is not ASCII:

string output = new string(input.Select(c => c <= 127 ? c : '-').ToArray());
+4

A few suggestions:

  • Build the result with a StringBuilder rather than with repeated string operations.
  • You can load the "from" and "to" characters at run time from a configuration file instead of hard-coding each conversion. Later, when more requests like this come in, you will not need to change the code; the new mapping can be added through configuration.
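
As a sketch of the configuration idea, assume a plain text file with one mapping per line in the form HEX=replacement (for example 2013=-). That file format, the ConfiguredReplacer class and its method names are all assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper: reads "HEX=replacement" lines and applies them in one pass.
public static class ConfiguredReplacer
{
    public static Dictionary<char, string> ParseMappings(IEnumerable<string> lines)
    {
        var map = new Dictionary<char, string>();
        foreach (string line in lines)
        {
            if (string.IsNullOrWhiteSpace(line) || line.StartsWith("#", StringComparison.Ordinal))
                continue; // skip blank lines and comments

            int eq = line.IndexOf('=');
            if (eq <= 0)
                throw new FormatException("Expected HEX=replacement: " + line);

            int code = int.Parse(line.Substring(0, eq).Trim(),
                                 NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            if (code > 0xFFFF)
                throw new FormatException("Only characters up to U+FFFF are supported: " + line);

            map[(char)code] = line.Substring(eq + 1);
        }
        return map;
    }

    public static string Apply(string input, Dictionary<char, string> map)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            string replacement;
            if (map.TryGetValue(c, out replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

In practice the lines could come from something like File.ReadAllLines("replacements.txt"), where the file name is again hypothetical, and the parsed map would be built once and reused for every export.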
+3

If they are all replaced with the same string:

formattedString = string.Join("-", formattedString.Split('\u2013', '\u2014', '\u2015'));

or

foreach (char c in "\u2013\u2014\u2015") 
    formattedString = formattedString.Replace(c, '-');
0

Source: https://habr.com/ru/post/1786958/