Using C# to determine whether a file name character is international

Question

I wrote a small console application (source below) to find and optionally rename files containing international characters, as they are a constant source of pain in most version control systems (some background on this is given below). The code I use has a simple dictionary of characters to look for and replace (and it strips every other character that takes more than one byte in UTF-8), but it feels very hacky. What is the correct way (a) to find out whether a character is international, and (b) to pick the best ASCII replacement for it?

Let me provide some background on why this is necessary. The Danish character Å has two different representations in UTF-8, both standing for the same character. They are known as the NFC and NFD normalization forms. Windows and Linux create NFC by default but respect whatever form they are given. A Mac converts all names (when saving to an HFS+ partition) to NFD and therefore returns a different byte stream for the name of a file created on Windows. This effectively breaks Subversion, Git, and many other utilities that do not bother to handle this scenario correctly.
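To make the NFC/NFD difference concrete, here is a minimal sketch using the standard `string.Normalize` and `Encoding.UTF8` APIs. The class and method names (`NormalizationDemo`, `Utf8BytesIn`, `SameNameIgnoringNormalization`) are made up for illustration and are not part of the original program.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Returns the UTF-8 bytes of `text` after normalizing it to the given form.
    public static byte[] Utf8BytesIn(string text, NormalizationForm form)
    {
        return Encoding.UTF8.GetBytes(text.Normalize(form));
    }

    // True when two names differ at most in their normalization form.
    // Both sides are brought to NFC before an ordinal comparison.
    public static bool SameNameIgnoringNormalization(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }
}
```

For "\u00C5" (Å), `Utf8BytesIn` with `FormC` yields the two bytes C3 85, while `FormD` yields the three bytes 41 CC 8A (a plain A followed by a combining ring above). A tool that compares file names byte by byte sees these as two different names.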

I am currently evaluating Mercurial, which turns out to be even worse at handling international characters. Being tired enough of these problems, either the source control or the international characters would have to go, and so here we are.

My current implementation:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public class Checker
{
    private Dictionary<char, string> internationals = new Dictionary<char, string>();
    private List<char> keep = new List<char>();
    private List<char> seen = new List<char>();

    public Checker()
    {
        internationals.Add( 'æ', "ae" );
        internationals.Add( 'ø', "oe" );
        internationals.Add( 'å', "aa" );
        internationals.Add( 'Æ', "Ae" );
        internationals.Add( 'Ø', "Oe" );
        internationals.Add( 'Å', "Aa" );

        internationals.Add( 'ö', "o" );
        internationals.Add( 'ü', "u" );
        internationals.Add( 'ä', "a" );
        internationals.Add( 'é', "e" );
        internationals.Add( 'è', "e" );
        internationals.Add( 'ê', "e" );

        internationals.Add( '¦', "" );
        internationals.Add( 'Ã', "" );
        internationals.Add( '©', "" );
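        internationals.Add( '\u00A0', "" ); // assumed to be a non-breaking space; a plain space here would strip every space from the name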
        internationals.Add( '§', "" );
        internationals.Add( '¡', "" );
        internationals.Add( '³', "" );
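        // (one entry omitted: its character, probably an invisible one, did not survive in this listing, and an empty char literal does not compile)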
        internationals.Add( 'º', "" );

        internationals.Add( '«', "-" );
        internationals.Add( '»', "-" );
        internationals.Add( '´', "'" );
        internationals.Add( '`', "'" );
        internationals.Add( '"', "'" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 147 } )[ 0 ], "-" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 148 } )[ 0 ], "-" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 153 } )[ 0 ], "'" );
        internationals.Add( Encoding.UTF8.GetString( new byte[] { 226, 128, 166 } )[ 0 ], "." );

        keep.Add( '-' );
        keep.Add( '=' );
        keep.Add( '\'' );
        keep.Add( '.' );
    }

    public bool IsInternationalCharacter( char c )
    {
        var s = c.ToString();
        byte[] bytes = Encoding.UTF8.GetBytes( s );
        if( bytes.Length > 1 && ! internationals.ContainsKey( c ) && ! seen.Contains( c ) )
        {
            Console.WriteLine( "X '{0}' ({1})", c, string.Join( ",", bytes ) );
            seen.Add( c );
            if( ! keep.Contains( c ) )
            {
                internationals[ c ] = "";
            }
        }
        return internationals.ContainsKey( c );
    }

    public bool HasInternationalCharactersInName( string name, out string safeName )
    {
        StringBuilder sb = new StringBuilder();
        Array.ForEach( name.ToCharArray(), c => sb.Append( IsInternationalCharacter( c ) ? internationals[ c ] : c.ToString() ) );
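        // Collapse runs of spaces; repeat until a pass no longer shortens the text.
        // (The earlier version never updated 'length' and could loop forever.)
        int length;
        do
        {
            length = sb.Length;
            sb.Replace( "  ", " " );
        }
        while( sb.Length != length );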
        safeName = sb.ToString().Trim();
        string namePart = Path.GetFileNameWithoutExtension( safeName );
        if( namePart.EndsWith( "." ) )
            safeName = namePart.Substring( 0, namePart.Length - 1 ) + Path.GetExtension( safeName );
        return name != safeName;
    }
}

And it will be called as follows:

Checker checker = new Checker();
FileInfo file = new FileInfo( "Århus.txt" );
string safeName;
if( checker.HasInternationalCharactersInName( file.Name, out safeName ) )
{
    // rename file 
}

Morten Mertner Mar 20 '10 at 6:00

3 answers

(a) Check whether the character's code is greater than 127.

(b) NFKD normalization / uni2ascii.
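As a rough illustration of point (a), the check can be a single comparison per character. The names `AsciiCheck`, `IsNonAscii`, and `HasNonAscii` are hypothetical and chosen only for this sketch.

```csharp
using System.Linq;

public static class AsciiCheck
{
    // A char is outside 7-bit ASCII when its UTF-16 code unit is above 127.
    public static bool IsNonAscii(char c)
    {
        return c > 127;
    }

    // True if any character of the name falls outside 7-bit ASCII.
    public static bool HasNonAscii(string name)
    {
        return name.Any(c => c > 127);
    }
}
```

This flags everything beyond ASCII, including characters such as dashes and typographic quotes, which matches the "more than one byte in UTF-8" test used in the question.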

dan04 Mar 20 '10 at 6:24

One possibility is something like this:

string name = "Århus.txt";
string kd = name.Normalize(NormalizationForm.FormKD);
byte[] kd_bytes = Encoding.Unicode.GetBytes(kd);
byte[] ascii_bytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, kd_bytes);
string flattened = Encoding.ASCII.GetString(ascii_bytes);

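This turns Århus.txt into A?rhus.txt, because form KD splits Å into an A followed by a combining mark, and the combining mark has no 7-bit ASCII equivalent, so it is converted to ?. You can then remove the ? characters.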

EDIT:

I just tried æÆØ and they all turned into ?, so this may be too crude for your needs. However, it may give you some clues that lead to an answer.
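A variation on the same idea, sketched under the assumption that unmappable characters should simply be dropped rather than turned into ?: ask for an ASCII encoding whose replacement fallback is the empty string. `AsciiFlattener` and `Flatten` are illustrative names; the `Encoding.GetEncoding` overload and the fallback classes are standard .NET.

```csharp
using System.Text;

public static class AsciiFlattener
{
    // ASCII encoding that silently drops anything it cannot represent,
    // instead of the default behaviour of substituting '?'.
    private static readonly Encoding DroppingAscii = Encoding.GetEncoding(
        "us-ascii",
        new EncoderReplacementFallback(string.Empty),
        new DecoderReplacementFallback(string.Empty));

    public static string Flatten(string name)
    {
        // Form KD splits accented letters into a base letter plus combining marks.
        string decomposed = name.Normalize(NormalizationForm.FormKD);
        // Encoding to ASCII then discards the combining marks (and every other non-ASCII char).
        byte[] asciiBytes = DroppingAscii.GetBytes(decomposed);
        return DroppingAscii.GetString(asciiBytes);
    }
}
```

With this, Århus.txt becomes Arhus.txt. The limitation noted in the edit still applies: æ, Æ, and Ø have no decomposition, so they disappear entirely instead of becoming ae, Ae, or Oe.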

Jim Flood Mar 20 '10 at 7:43

Hans Passant · Accepted Answer · 2010-03-20T11:48:23+0000

A sad problem to have in this day and age. Clearly, it is the NFD form used by the Mac that is causing you this headache. One thing you might consider is removing the diacritics from the glyphs that make NFD different from NFC.

This will not be 100% reliable, but something along these lines:

public static string RemoveDiacriticals(string txt) {
  string nfd = txt.Normalize(NormalizationForm.FormD);
  StringBuilder retval = new StringBuilder(nfd.Length);
  foreach (char ch in nfd) {
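    // Skip combining marks left behind by the decomposition.
    if (ch >= '\u0300' && ch <= '\u036f') continue;  // Combining Diacritical Marks
    if (ch >= '\u1dc0' && ch <= '\u1de6') continue;  // Combining Diacritical Marks Supplement
    if (ch >= '\ufe20' && ch <= '\ufe26') continue;  // Combining Half Marks
    if (ch >= '\u20d0' && ch <= '\u20f0') continue;  // Combining Diacritical Marks for Symbols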
    retval.Append(ch);
  }
  return retval.ToString();
}
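Stripping diacritics alone turns Å into A, whereas the question's dictionary maps it to Aa, and letters such as æ and ø have no decomposition at all. One way to combine the two ideas, offered only as a sketch: apply a small explicit table first, then decompose and keep only ASCII. The names `SafeNames`, `Explicit`, and `ToAscii` are hypothetical, and the table contents are copied from the question's dictionary.

```csharp
using System.Collections.Generic;
using System.Text;

public static class SafeNames
{
    // Letters that need a hand-picked replacement (taken from the question's dictionary).
    private static readonly Dictionary<char, string> Explicit = new Dictionary<char, string>
    {
        { '\u00E6', "ae" }, { '\u00F8', "oe" }, { '\u00E5', "aa" },  // æ ø å
        { '\u00C6', "Ae" }, { '\u00D8', "Oe" }, { '\u00C5', "Aa" },  // Æ Ø Å
    };

    public static string ToAscii(string name)
    {
        // Compose first, so an NFD name (A + combining ring) still hits the table.
        string composed = name.Normalize(NormalizationForm.FormC);

        var mapped = new StringBuilder(composed.Length);
        foreach (char c in composed)
        {
            string replacement;
            if (Explicit.TryGetValue(c, out replacement)) mapped.Append(replacement);
            else mapped.Append(c);
        }

        // Decompose what is left and keep only 7-bit ASCII; this drops the
        // combining marks along with any other character that has no ASCII form.
        string decomposed = mapped.ToString().Normalize(NormalizationForm.FormD);
        var result = new StringBuilder(decomposed.Length);
        foreach (char d in decomposed)
        {
            if (d <= 127) result.Append(d);
        }
        return result.ToString();
    }
}
```

Both the NFC and the NFD spelling of Århus.txt come out as Aarhus.txt, so the result no longer depends on which platform created the file. Characters with neither a table entry nor a decomposition are silently removed, which may or may not be acceptable for a given repository.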
