Perl regular expression to match a character from an arbitrary set

Question

I have a file with Korean and Chinese characters. I want to find pairs where the hanja for a Korean word is given in parentheses, for example: 한문 (漢文)

The search will look something like this: /[korean characters] \([chinese characters]\)/

How do I specify Chinese or Korean characters, or any other set such as Cyrillic or Thai?

regex perl cjk

Nate Glenn Jan 24 '12 at 0:00

1 answer

ikegami · Accepted Answer · 2012-01-24T00:24:25+0000

Unicode defines a property that records which script each character belongs to. Characters can be matched by script using \p{Script=...}.

I don’t know much about the languages you mentioned, but I think you want

\p{Script=Han} aka \p{Han} for Chinese.
\p{Script=Hangul} aka \p{Hangul} for Korean.
\p{Script=Cyrillic} aka \p{Cyrl} for Cyrillic.
\p{Script=Thai} aka \p{Thai} for Thai.
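Putting the script properties into the pattern from the question could look like the sketch below. The sample string and variable names are made up for illustration, and it assumes one or more Hangul characters followed by a single space and a parenthesized run of Han characters.

```perl
use strict;
use warnings;
use utf8;                               # this source file contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');    # encode output so the characters print correctly

# Hypothetical sample text; real input would come from the file in question.
my $text = "한문 (漢文) and 한국 (韓國), but not Москва (Moscow).";

# One or more Hangul characters, a space, then Han characters in parentheses.
my @pairs;
while ($text =~ /(\p{Hangul}+) \((\p{Han}+)\)/g) {
    push @pairs, [$1, $2];              # [Korean word, hanja]
}

print "$_->[0] => $_->[1]\n" for @pairs;
```

When reading from a file, the data has to be decoded first (for example by opening it with '<:encoding(UTF-8)'), otherwise \p{...} is applied to raw bytes rather than characters.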

You can consult perluniprops to find the property you need, or use uniprops* to see which properties match a particular character.

 $ uniprops D55C
 U+D55C ‹한› \N{HANGUL SYLLABLE HAN}
     \w \pL \p{L_} \p{Lo}
     All Any Alnum Alpha Alphabetic Assigned InHangulSyllables L Lo Gr_Base Grapheme_Base Graph GrBase Hang Hangul Hangul_Syllables ID_Continue IDC ID_Start IDS Letter L_ Other_Letter Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
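If uniprops is not installed, the script of a single character can also be looked up from Perl itself. The following is a minimal sketch using charscript() from the core Unicode::UCD module, which is a different tool from the one shown above; the code points are sample choices (one each of Hangul, Han, Cyrillic, and Thai).

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# charscript() takes a code point and returns the name of its script.
# Sample code points chosen for illustration: 한, 漢, Ж, ก
my %script_of = map { $_ => charscript($_) } (0xD55C, 0x6F22, 0x0416, 0x0E01);

# Print each code point with its script name, in code point order.
printf "U+%04X %s\n", $_, $script_of{$_}
    for sort { $a <=> $b } keys %script_of;
```

The returned name can then be used in \p{Script=...}.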

To find out what characters are in a given property, you can use unichars *. (This has limited usefulness, as most CJK characters are not named.)

 $ unichars -au '\p{Han}'
 ⺀ U+2E80 CJK RADICAL REPEAT
 ⺁ U+2E81 CJK RADICAL CLIFF
 ⺂ U+2E82 CJK RADICAL SECOND ONE
 ⺃ U+2E83 CJK RADICAL SECOND TWO
 ⺄ U+2E84 CJK RADICAL SECOND THREE
 ...

 $ unichars -au '\p{Hangul}'
 ᄀ U+01100 HANGUL CHOSEONG KIYEOK
 ᄁ U+01101 HANGUL CHOSEONG SSANGKIYEOK
 ᄂ U+01102 HANGUL CHOSEONG NIEUN
 ᄃ U+01103 HANGUL CHOSEONG TIKEUT
 ᄄ U+01104 HANGUL CHOSEONG SSANGTIKEUT
 ...

* uniprops and unichars are available from the Unicode::Tussle distribution.
