Using sed, how can a regex match Chinese characters?

I decided to post this question after spending a lot of time on the problem without figuring it out. I also read a bunch of seemingly related posts, but none of them matched my simple (?) problem.

So, I have a possibly large text file (> 1000 lines) containing Mandarin Chinese characters, with lines like this example:

"ref#2-5-1.jpg#2#一些 <variable> 内容#pic##" (the Chinese just means "some content"). 

The only change needed is that a space must be inserted between adjacent Chinese characters, unless one is already there:

"ref#2-5-1.jpg#2#一 些 <variable> 内 容#pic##".

I started naively with simple things like the following, but there is no match:

sed -e 's/\([\u4E00-\u9fff]\)/\1 /g' <test_utf_sed.txt > test_out.txt

where 4E00-9FFF is supposed to be the code point range for Chinese characters. Not surprisingly, this did not work, so I also wanted to try

sed -e 's/\([一-龻]\)/hello/g' <test_utf_sed.txt > test_out.txt

but the bash terminal (?) would not let me type a character such as "一".

Trying something simpler did not help either:

sed -e 's/\(\u4E00\)/hello/g' <test_utf_sed.txt > test_out.txt   # 一
sed -e 's/\(\u4E9B\)/hello/g' <test_utf_sed.txt > test_out.txt   # 些

Trying to tell sed that the characters are UTF (as seen elsewhere on Stack Overflow) did not work either:

sed -e 's/\(\u'U+4E00\)/hello/g' <test_utf_sed.txt > test_out.txt

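1) If there is a quick solution, does anyone have an idea?

2) If sed is not made for Unicode replacement, what would be an alternative?

3) I could imagine a two-step solution: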

step 1: insert a space after each character
  (like 's/\(.\)/\1 /g')
step 2: remove the space after each character that is not a Chinese character
  (like 's/\([a-zA-Z0-9]\) /\1/g')

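But of course this leaves a lot of special cases untreated. I assume there is a more elegant way to do a UTF-8 regex search with sed.

4) I am using bash-3.2 on MacOS 10.6.8 (in case that matters).

5) Non-regex one-liners would also be fine, of course.

Thanks!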

+4
2

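Perl has excellent Unicode support. If you are not tied to sed, it is a good fit here. The equivalent of your sed command: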

perl -CIOED -p -e 's/\p{Script_Extensions=Han}/$& /g' filename

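-CIOED tells Perl to do its I/O in UTF-8. -p loops over the input and prints each line after processing it. -e passes a Perl expression on the command line. That is all.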

For more detail, see the Perl Unicode documentation.

+5

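sed does not understand \u escape sequences (as far as I know). Newer versions of bash do, but not bash-3.2, which is what you have; otherwise you could simply write

sed $'s/\u4E9B/hello/g'

and let bash expand the escape before sed ever sees it.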

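However, since the file is UTF-8, you can match the encoded bytes directly. With a little arithmetic you can work out that the UTF-8 encodings of U+4E00...U+9FFF are matched by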

(\xe4[\xb8-\xbf][\x80-\xbf]|[\xe5-\xe9][\x80-\xbf][\x80-\xbf])

(That assumes sed treats its input as plain bytes, which it will if the locale is C.)

With GNU sed, extended regular expressions are enabled with -r. On Mac OS X, the flag is -E. So:

LANG=C sed -E \
       $'s/(\xe4[\xb8-\xbf][\x80-\xbf]|[\xe5-\xe9][\x80-\xbf][\x80-\xbf])/\\1 /g' \
       <test_utf_sed.txt >test_out.txt

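(Here it is bash that interprets the \x escapes. Because of the $'...' quoting, sed never sees a \x escape, and \\1 is needed so that sed receives \1. This should work with the bash on a Mac; even if its sed does not interpret escape sequences itself, bash has already expanded them.)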
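
sed has no lookahead, so the command above also adds a space before "#pic" and doubles the space before "<variable>". A rough sketch of one workaround under the same byte-matching assumption (C locale, bash $'...' quoting): substitute one adjacent pair at a time and loop until nothing changes. The variable han and the function name space_between_han_sed are made up for illustration.

```shell
# Assumes bash (for $'...') and a sed that matches raw bytes in the C locale.
# han is a hypothetical variable holding the byte pattern for U+4E00...U+9FFF.
han=$'(\xe4[\xb8-\xbf][\x80-\xbf]|[\xe5-\xe9][\x80-\xbf][\x80-\xbf])'

# Hypothetical helper: insert a space between two adjacent characters in range.
# ":a" defines a label; "ta" jumps back to it whenever the s command replaced
# something, so every pair inside a longer run of characters gets handled.
# LC_ALL is set rather than LANG so an inherited LC_ALL cannot override it.
space_between_han_sed() {
    LC_ALL=C sed -E -e ':a' -e "s/${han}${han}/\\1 \\2/" -e 'ta'
}

printf '%s\n' 'ref#2-5-1.jpg#2#一些 <variable> 内容#pic##' | space_between_han_sed
```

The loop stops on each line as soon as no two in-range characters are directly adjacent, so existing spaces are left alone.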


If you want to check how characters are encoded in UTF-8, you can dump the bytes. For example:

$ hd <<<"一些"
00000000  e4 b8 80 e4 ba 9b 0a                              |.......|

So, within the range (U+4E00...U+9FFF), 一 is encoded as E4 B8 80 and 些 as E4 BA 9B. (The 0A is the trailing newline, which is not part of either character.)

+2

Source: https://habr.com/ru/post/1537303/

