Sed: matching Unicode blocks with

Question


I am desperately trying to replace some Unicode characters (graphemes) in a file with sed. However, I keep failing for some of them, namely those from the following Unicode blocks:

\p{InHigh_Surrogates}: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF

I tried (in the sed script file loaded via the -f switch):

 s/\p{InHigh_Surrogates}/###/ --> no effect at all
 s/\\p\{InHigh_Surrogates\}/###_D-NON-UTF8_###/ --> error message 'Invalid content of \{\}'

Does anyone have a suggestion? Also, I am not necessarily set on using blocks, but I also did not manage to specify a character range of the form \xd800-\xdfff.

Thank you, Thomas


unicode sed utf-8 unicode-escapes

Drth Mar 17 '14 at 9:21


1 answer

fedorqui · Answer 1 · 2014-03-17T09:27:14+0000

Try using the -r flag for sed:

 $ sed -r 's/\\p\{InHigh_Surrogates\}/###/g' file
 ###: U+D800–U+DB7F \p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF \p{InLow_Surrogates}: U+DC00–U+DFFF
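It is worth noting what this expression matches. With -r, the sequences `\\`, `\{` and `\}` stand for a literal backslash and literal braces, so the pattern finds the literal text \p{InHigh_Surrogates} in the input, not characters from that Unicode block. As far as I know, sed's POSIX-style regular expressions have no `\p{...}` property classes. A minimal sketch to check this (the function name `replace_block_name` is made up for illustration):

```shell
# Hypothetical helper wrapping the command from the answer; it reads stdin.
replace_block_name() {
  sed -r 's/\\p\{InHigh_Surrogates\}/###/g'
}

# The input is the literal block name as text, not an actual surrogate character.
printf '%s\n' '\p{InHigh_Surrogates}: U+D800' | replace_block_name
# prints: ###: U+D800
```

A line that does not contain the literal text \p{InHigh_Surrogates} passes through unchanged.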

From man sed:

-r, --regexp-extended
    use extended regular expressions in the script.
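If the goal is to replace the surrogate code points themselves rather than the literal block name, one possible approach is to match their byte encoding. This is only a sketch under assumptions that are not from the answer: GNU sed (for the `\xHH` escapes), and an input where the code points U+D800–U+DFFF appear in their three-byte UTF-8-style form, that is, byte ED followed by a byte in A0–BF and a byte in 80–BF. Strictly valid UTF-8 does not contain these sequences. The function name `replace_surrogates` is made up for illustration.

```shell
# Hypothetical helper: replace each encoded surrogate (U+D800–U+DFFF) with ###.
# Assumes GNU sed; LC_ALL=C makes sed treat the input as raw bytes, so the
# bracket ranges below are plain byte ranges.
replace_surrogates() {
  LC_ALL=C sed 's/\xED[\xA0-\xBF][\x80-\xBF]/###/g'
}

# Usage: octal escapes \355\240\200 are the bytes ED A0 80 (an encoded U+D800).
printf 'a\355\240\200b\n' | replace_surrogates
# prints: a###b
```

Because the second byte is restricted to A0–BF, neighbouring code points such as U+D7FF (ED 9F BF) are left alone, as are other multi-byte characters.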
