Perl regular expression to match a character from an arbitrary set

I have a file with Korean and Chinese characters. I want to find pairs where the hanja for a Korean word is given in parentheses, for example: 한문 (漢文)

The search will look something like this: /[korean characters] \([chinese characters]\)/

How do I specify Chinese or Korean characters, or any other set such as Cyrillic or Thai?

1 answer

Unicode assigns each character a Script property that identifies the script it belongs to. You can match characters by script using \p{Script=...}.

I don't know much about the languages you mentioned, but I think you want

  • \p{Script=Han} aka \p{Han} for Chinese.
  • \p{Script=Hangul} aka \p{Hangul} for Korean.
  • \p{Script=Cyrillic} aka \p{Cyrl} for Cyrillic.
  • \p{Script=Thai} aka \p{Thai} for Thai.
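
Putting the Hangul and Han properties together, the search from the question could be sketched as follows. This is a minimal example rather than a definitive solution: the helper name find_hanja_pairs and the sample text are made up for illustration, and it assumes exactly one space between the Korean word and the opening parenthesis, as in the question's pattern.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal UTF-8 text

# Hypothetical helper: return [hangul, han] pairs such as ["한문", "漢文"].
# Expects an already decoded (character) string.
sub find_hanja_pairs {
    my ($text) = @_;
    my @pairs;
    # One or more Hangul characters, a space, then Han characters in parentheses.
    while ($text =~ /(\p{Hangul}+) \((\p{Han}+)\)/g) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

binmode(STDOUT, ':encoding(UTF-8)');
my @pairs = find_hanja_pairs('한문 (漢文) and 한국 (韓國), but not abc (def)');
print "$_->[0] => $_->[1]\n" for @pairs;
```

When the text comes from a file, the handle needs a decoding layer such as open(my $fh, '<:encoding(UTF-8)', $path), otherwise \p{...} is applied to raw bytes instead of characters.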

You can look through perluniprops to find the property you need, or you can use uniprops* to see which properties match a particular character.

 $ uniprops D55C
 U+D55C ‹한› \N{HANGUL SYLLABLE HAN}
     \w \pL \p{L_} \p{Lo}
     All Any Alnum Alpha Alphabetic Assigned InHangulSyllables L Lo Gr_Base Grapheme_Base Graph GrBase Hang Hangul Hangul_Syllables ID_Continue IDC ID_Start IDS Letter L_ Other_Letter Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
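
If only the script of a character is needed, the core module Unicode::UCD can answer that narrower question without installing anything. The sketch below is an illustration, not part of the original answer; scripts_of is a hypothetical helper name, and charscript reports only the Script property, not the full property list that uniprops prints.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);    # core module, ships with Perl

# Hypothetical helper: map each character of a decoded string to the
# name of the script Unicode assigns to it.
sub scripts_of {
    my ($string) = @_;
    return map { charscript(ord $_) } split //, $string;
}

# U+D55C HANGUL SYLLABLE HAN and U+6F22 (a Han ideograph), written as
# escapes so the example does not depend on the file's encoding.
my @scripts = scripts_of("\x{D55C}\x{6F22}");
print join(', ', @scripts), "\n";
```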

To find out which characters have a given property, you can use unichars*. (Its usefulness is limited, since most CJK characters are not named.)

 $ unichars -au '\p{Han}'
 ⺀ U+2E80 CJK RADICAL REPEAT
 ⺁ U+2E81 CJK RADICAL CLIFF
 ⺂ U+2E82 CJK RADICAL SECOND ONE
 ⺃ U+2E83 CJK RADICAL SECOND TWO
 ⺄ U+2E84 CJK RADICAL SECOND THREE
 ...

 $ unichars -au '\p{Hangul}'
 ᄀ U+01100 HANGUL CHOSEONG KIYEOK
 ᄁ U+01101 HANGUL CHOSEONG SSANGKIYEOK
 ᄂ U+01102 HANGUL CHOSEONG NIEUN
 ᄃ U+01103 HANGUL CHOSEONG TIKEUT
 ᄄ U+01104 HANGUL CHOSEONG SSANGTIKEUT
 ...
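
A rough equivalent can be put together with core Perl alone by testing every code point against the property. The following is a sketch under stated assumptions, not a replacement for unichars: bmp_code_points_matching is a hypothetical helper, it scans only the Basic Multilingual Plane, and it uses \p{Thai} purely as an example property.

```perl
use strict;
use warnings;
use charnames ();    # core; provides charnames::viacode for name lookup

# Hypothetical helper: code points in the Basic Multilingual Plane whose
# character matches the compiled pattern $re. Surrogates (U+D800..U+DFFF)
# are skipped because they are not characters.
sub bmp_code_points_matching {
    my ($re) = @_;
    return grep { !($_ >= 0xD800 && $_ <= 0xDFFF) && chr($_) =~ $re }
           0x0000 .. 0xFFFF;
}

binmode(STDOUT, ':encoding(UTF-8)');
my @thai = bmp_code_points_matching(qr/\p{Thai}/);

# Print the first few matches in a layout similar to the unichars output.
for my $cp (@thai[0 .. 4]) {
    printf "%s U+%04X %s\n", chr($cp), $cp,
        charnames::viacode($cp) // '(unnamed)';
}
```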

* uniprops and unichars are available from the Unicode::Tussle distribution.


Source: https://habr.com/ru/post/906711/

