How to replace Unicode characters with ASCII

I have the following command to replace Unicode characters with ASCII.

sed -i 's/Ã/A/g' 

The problem: sed in my Unix environment does not recognize Ã, so I assume I need to replace it with a hex value. What would the syntax look like if I used C3 instead?

I use this command as a template for other characters that I would like to replace with nothing, for example:

sed -i 's/©//g'

4 answers

You can use hexadecimal values in sed.

 echo "Ã" | hexdump -C
 00000000  c3 83 0a                                          |...|
 00000003

This character is a sequence of two bytes, "c3 83" (the trailing 0a is the newline added by echo). Let's replace it with the single byte "A":

 echo "Ã" | sed 's/\xc3\x83/A/g'
 A

Explanation: \x tells sed that a hex code follows. This escape is a GNU sed extension.
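As a minimal sketch of how the same pattern extends to the © case from the question, the snippet below applies both substitutions in one pass. It assumes GNU sed (for the \xHH escape); the sample string and the variable names are made up for illustration, and the input is built from octal byte escapes so the result does not depend on how this text was saved.

```shell
# Build the sample from raw bytes:
#   \303\203 = c3 83 = UTF-8 "Ã",  \302\251 = c2 a9 = UTF-8 "©"
sample=$(printf 'caf\303\203 \302\251 2015')

# Assumption: GNU sed, where \xHH matches the byte 0xHH.
# LC_ALL=C makes sed work on bytes rather than multibyte characters.
result=$(printf '%s\n' "$sample" | LC_ALL=C sed -e 's/\xc3\x83/A/g' -e 's/\xc2\xa9//g')

printf '%s\n' "$result"   # prints: cafA  2015
```

To find the bytes for any other character, pipe it through hexdump -C as shown above and write each byte as \xHH in the pattern.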


Try setting LANG=C, and then match on the range of non-ASCII bytes:
echo "hi ☠ there ☠" | LANG=C sed "s/[\x80-\xFF]//g"
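A small sketch of the same idea, assuming GNU sed. It uses LC_ALL=C instead of LANG=C because LC_ALL takes precedence over LANG, so an LC_ALL or LC_CTYPE value already set in the environment would otherwise keep sed in a multibyte locale. The variable names are illustrative only.

```shell
# "☠" is U+2620, encoded in UTF-8 as e2 98 a0 (octal 342 230 240).
input=$(printf 'hi \342\230\240 there \342\230\240')

# In the C locale every byte is one character, so the bracket range
# covers all bytes 0x80-0xFF, i.e. every byte of every non-ASCII
# UTF-8 sequence.
stripped=$(printf '%s\n' "$input" | LC_ALL=C sed 's/[\x80-\xFF]//g')

# Brackets make the leftover spaces visible.
printf '[%s]\n' "$stripped"   # prints: [hi  there ]
```

Note that this deletes the characters outright, so the spaces that surrounded them remain in the output.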


There is also uconv, from ICU.

Examples:

  • uconv -x "::NFD; [:Nonspacing Mark:] > ; ::NFC;" : remove accents
  • uconv -x "::Latin; ::Latin-ASCII;" : transliterate to Latin, then to ASCII
  • uconv -x "::Latin; ::Latin-ASCII; ([^\x00-\x7F]) > ;" : transliterate to Latin, then to ASCII, and remove any remaining code points > 0x7F
  • ...

 echo "À l'école ☠" | uconv -x "::Latin; ::Latin-ASCII; ([^\x00-\x7F]) > ;"

gives:

 A l'ecole


You can use iconv:

 iconv -f utf-8 -t ascii//translit 
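For illustration, here is one way the command might be used on a string. This is a sketch, not a guaranteed result: the substitutions chosen by //TRANSLIT depend on the iconv implementation and on the current locale, and characters with no approximation are commonly replaced by "?", so no expected output is shown. The variable names are hypothetical.

```shell
# Sample built from raw bytes: "À l'école" in UTF-8
# (\303\200 = À, \303\251 = é).
input=$(printf "\303\200 l'\303\251cole")

# //TRANSLIT asks iconv to approximate characters that do not exist in
# the target charset. Some implementations report a non-zero exit status
# when a character could not be converted exactly, so the failure is
# reported instead of being treated as fatal.
ascii=$(printf '%s\n' "$input" | iconv -f UTF-8 -t ASCII//TRANSLIT) \
  || echo "iconv reported a conversion problem" >&2

printf '%s\n' "$ascii"
```

To convert a file, pass its name as the last argument and redirect the output to a new file; iconv does not edit in place the way sed -i does.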

Source: https://habr.com/ru/post/1207353/