Unicode assigns every character a Script property that records which script the character belongs to. You can match characters by script in a regex with \p{Script=...}.
I don't know much about the languages you mentioned, but I think you want
\p{Script=Han}, aka \p{Han}, for Chinese.
\p{Script=Hangul}, aka \p{Hangul}, for Korean.
\p{Script=Cyrillic}, aka \p{Cyrl}, for Cyrillic.
\p{Script=Thai}, aka \p{Thai}, for Thai.
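As a rough illustration of how these properties might be used, here is a minimal Perl sketch that reports the first of the four scripts found in a string. The helper name first_script and the sample strings are made up for this example. It uses the explicit \p{Script=...} form throughout so there is no doubt about which property is being tested.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper (not from any module): returns the name of the
# first script in this fixed list that occurs anywhere in $text.
sub first_script {
    my ($text) = @_;
    return 'Han'      if $text =~ /\p{Script=Han}/;
    return 'Hangul'   if $text =~ /\p{Script=Hangul}/;
    return 'Cyrillic' if $text =~ /\p{Script=Cyrillic}/;
    return 'Thai'     if $text =~ /\p{Script=Thai}/;
    return 'other';
}

# Sample strings written as escapes so the file itself stays ASCII.
my @samples = (
    "\x{4E2D}\x{6587}",         # two Han ideographs
    "\x{D55C}\x{AE00}",         # two Hangul syllables
    "\x{41F}\x{440}\x{438}",    # three Cyrillic letters
    "\x{E44}\x{E17}\x{E22}",    # three Thai letters
    "abc",                      # Latin only
);

printf "%s => %s\n", $_, first_script($_) for @samples;
```

Because the checks run in a fixed order, a string that mixes scripts is reported under whichever listed script is tested first, not under the script of its first character.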
You can look at perluniprops to find the one you are looking for, or you can use uniprops * to find which properties match a particular character.
$ uniprops D55C
U+D55C ‹한› \N{HANGUL SYLLABLE HAN}
    \w \pL \p{L_} \p{Lo}
    All Any Alnum Alpha Alphabetic Assigned InHangulSyllables L Lo Gr_Base Grapheme_Base Graph GrBase Hang Hangul Hangul_Syllables ID_Continue IDC ID_Start IDS Letter L_ Other_Letter Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
To find out what characters are in a given property, you can use unichars *. (This has limited usefulness, as most CJK characters are not named.)
$ unichars -au '\p{Han}'
⺀ U+2E80 CJK RADICAL REPEAT
⺁ U+2E81 CJK RADICAL CLIFF
⺂ U+2E82 CJK RADICAL SECOND ONE
⺃ U+2E83 CJK RADICAL SECOND TWO
⺄ U+2E84 CJK RADICAL SECOND THREE
...
$ unichars -au '\p{Hangul}'
ᄀ U+01100 HANGUL CHOSEONG KIYEOK
ᄁ U+01101 HANGUL CHOSEONG SSANGKIYEOK
ᄂ U+01102 HANGUL CHOSEONG NIEUN
ᄃ U+01103 HANGUL CHOSEONG TIKEUT
ᄄ U+01104 HANGUL CHOSEONG SSANGTIKEUT
...
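If unichars is not installed, a similar listing can be approximated in plain Perl by walking the code points and testing each one against the property. This is only a sketch, not how unichars itself is implemented. The helper name first_matching is invented for this example, and the explicit Script= form is used so the match is on the Script property.

```perl
use strict;
use warnings;
use charnames ();    # for charnames::viacode

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: returns the first $limit code points whose
# character matches the compiled regex $re, scanning from U+0000 upward.
sub first_matching {
    my ($re, $limit) = @_;
    my @found;
    for my $cp (0 .. 0x10FFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        push @found, $cp if chr($cp) =~ $re;
        last if @found >= $limit;
    }
    return @found;
}

for my $cp (first_matching(qr/\p{Script=Han}/, 5)) {
    # viacode returns undef for characters without a name
    my $name = charnames::viacode($cp) // '(unnamed)';
    printf "%s U+%04X %s\n", chr($cp), $cp, $name;
}
```

The scan stops as soon as enough matches are collected, so asking for a handful of characters is quick. Listing an entire script means visiting every code point.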
* uniprops and unichars are available from the Unicode::Tussle distribution.