How to replace Unicode characters with ASCII

Question

I have the following command to replace Unicode characters with ASCII.

sed -i 's/Ã/A/g'

The problem is that Ã is not recognized by the sed command in my Unix environment, so I assume I need to replace it with a hex value. What would the syntax look like if I used C3 instead?

I want to use this command as a template for other characters that I would like to replace with nothing (that is, delete), for example:

sed -i 's/©//g'

unix bash shell unicode sed

Sandeep johal Nov 21 '14 at 0:25

4 answers

ajaaskel · Answer 1 · 2014-11-21T07:41:25+0000

You can use hexadecimal values in sed.

 echo "Ã" | hexdump -C
 00000000  c3 83 0a                                          |...|
 00000003

So in UTF-8 this character is encoded as the two bytes "c3 83" (the trailing "0a" is the newline added by echo). Let's replace those two bytes with the single byte "A":

 echo "Ã" | sed 's/\xc3\x83/A/g'
 A

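Explanation: \x tells sed that a two-digit hexadecimal byte value follows. Note that this escape is a GNU sed extension, so other sed implementations may not support it.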
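
To build the escape sequence for any other character, one option is to dump its bytes and prefix each one with \x. The following is only a sketch: it assumes GNU sed (for the \x escape) and a script or terminal that uses UTF-8, and hex_escape is a hypothetical helper name, not a standard tool.

```shell
# hex_escape is a hypothetical helper: it prints the bytes of its argument
# as a sed-style escape sequence, e.g. \xc2\xa9 for the copyright sign.
hex_escape() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n' | sed 's/../\\x&/g'
}

pattern=$(hex_escape '©')

# LC_ALL=C makes sed treat its input as raw bytes rather than characters.
result=$(printf 'Copyright © 2014\n' | LC_ALL=C sed "s/${pattern}//g")
printf '%s\n' "$result"
```

The same pattern could then be passed to sed -i together with a file name to edit a file in place.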

user4401178 · Answer 2 · 2015-11-12T15:27:22+0000

Try setting LANG=C so that sed works on bytes, and then delete everything in the non-ASCII byte range (0x80-0xFF):
echo "hi ☠ there ☠" | LANG=C sed "s/[\x80-\xFF]//g"
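
One caveat: if LC_ALL is set in the environment it takes precedence over LANG, so setting LC_ALL=C is the more robust way to force byte-wise matching. As a rough alternative that does not rely on GNU sed's \x escape, the same byte range can be deleted with tr using octal escapes. This is a sketch of the same idea, not part of the original answer:

```shell
# Delete every byte from 0x80 to 0xFF (octal 200-377), leaving plain ASCII.
stripped=$(printf 'hi ☠ there ☠\n' | LC_ALL=C tr -d '\200-\377')
printf '%s\n' "$stripped"
```

Both variants remove the non-ASCII characters entirely, so the spaces that surrounded them remain in the output.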

julp · Answer 3 · 2015-11-12T18:08:19+0000

There is also uconv, from ICU.

Examples:

uconv -x "::NFD; [:Nonspacing Mark:] > ; ::NFC;" : removes accents
uconv -x "::Latin; ::Latin-ASCII;" : transliterates to Latin, then to ASCII
uconv -x "::Latin; ::Latin-ASCII; ([^\x00-\x7F]) > ;" : transliterates to Latin, then to ASCII, and removes any remaining code points > 0x7F
...

echo "À l'école ☠" | uconv -x "::Latin; ::Latin-ASCII; ([^\x00-\x7F]) > ;" gives: A l'ecole

tinySandy · Answer 4 · 2014-11-21T00:36:57+0000

You can use iconv:

 iconv -f utf-8 -t ascii//translit
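
iconv reads from standard input (or from files named on the command line) and writes the converted text to standard output, so it fits into a pipeline like the sed examples. A minimal sketch follows; to_ascii is a hypothetical wrapper name, and the exact replacement text produced by //TRANSLIT differs between iconv implementations and locales, so no output is shown.

```shell
# to_ascii is a hypothetical wrapper around the iconv call above.
# It converts UTF-8 text on stdin to ASCII, transliterating where it can.
to_ascii() {
    iconv -f UTF-8 -t ASCII//TRANSLIT
}

# Usage (file names are placeholders):
#   to_ascii < input.txt > output.txt
```

Unlike sed -i, iconv does not edit a file in place, so the result should be written to a separate file.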
