PHP Regex expression involving Japanese

My goal is to filter a Microsoft Word document (.docx), capturing all kanji and kana. The code I'm currently working with is the following:

preg_match_all('~[\x{4e00}-\x{9faf}]([\x{3040}-\x{309f}]) \= ([a-z]) \=+~u', $data, $matches);

After some research, I found the Unicode ranges for Japanese text here: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

An example of the data I'm working with looks like this:

時(とき) = toki = time; hour; occasion; moment
を = wo = particle denoting the direct object of the sentence (時 = time)
超えて(こえて) = koete = cross

My ultimate goal is to run preg_match_all over data that follows a template like "超えて(こえて) = koete", capturing the text before the parentheses, the text inside the parentheses, and the romanization after the "=".

The result I'm looking for will be a returned array that looks like this:

 array( 0 => array('時', 'とき', 'toki'), 1 => array('超えて', 'こえて', 'koete') );

The first result in each array can include kanji, hiragana, and possibly katakana; the second result is hiragana only; and the third result is just ordinary Latin letters. I'm not too good with regex, and adding Unicode Japanese on top of that leaves me stuck. Any help would really be appreciated! Thanks!

1 answer

If you use the /u modifier, you can use Unicode script properties such as \p{Han} instead of numeric ranges:
    preg_match_all('/
        ([\p{Han}\p{Katakana}\p{Hiragana}]+)   # Kanji
        (?: [(]                                # optional part: paren (
            ([\p{Hiragana}]+)                  # Hiragana
        [)] )?                                 # closing paren )
        \s*=\s*                                # spaces and =
        ([\w\s;=]+)                            # English letters
        /ux',
        $source, $matches, PREG_SET_ORDER
    );
    print_r($matches);

I noticed that the hiragana in parentheses is optional (the を entry has none), so I made your regex a bit more complicated with (?: ... )?, a non-capturing group that matches this part only when it is present.
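
To show the optional group on its own, here is a minimal sketch. The `$pattern` variable and the two sample strings are my own illustration, modeled on the data in the question, not taken from it.

```php
<?php
// Illustrative pattern: base text, then an optional "(hiragana)" reading, then "=".
$pattern = '/^([\p{Han}\p{Hiragana}\p{Katakana}]+)(?:[(]([\p{Hiragana}]+)[)])?\s*=/u';

// Entry with a reading in parentheses: group 2 is filled.
preg_match($pattern, '時(とき) = toki', $withReading);

// Entry without a reading: the optional group is skipped, yet the match succeeds.
$matched = preg_match($pattern, 'を = wo', $withoutReading);

// Group 2 may be absent from the result, so read it defensively.
$reading = $withoutReading[2] ?? '';

var_dump($withReading[2], $matched, $reading);
```

Because the group is non-capturing, it does not shift the numbering: the reading stays in group 2 whether or not the parentheses are present.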

Note that the ordering of the result is slightly different from what you asked for, because preg_match_all stores the full match string at index [0] of each set:

    [0] => Array
        (
            [0] => 時(とき) = toki = time; hour; occasion; moment
            [1] => 時
            [2] => とき
            [3] => toki = time; hour; occasion; moment
        )
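
If you want exactly the array layout from the question, one option is to capture only the romaji and then drop the full match from each set. The sketch below rests on several assumptions of mine: the sample `$source` has one entry per line, romaji is lowercase ASCII, and entries without a reading in parentheses (such as を) should be skipped. I also narrowed the last group to [a-z], because as far as I know \w under the /u modifier can match Japanese letters as well.

```php
<?php
// Sample data modeled on the question (assumed layout: one entry per line).
$source = "時(とき) = toki = time; hour; occasion; moment\n"
        . "を = wo = particle\n"
        . "超えて(こえて) = koete = cross\n";

preg_match_all('/
    ([\p{Han}\p{Hiragana}\p{Katakana}]+)   # kanji and kana before the parentheses
    [(] ([\p{Hiragana}]+) [)]              # hiragana reading, required in this variant
    \s*=\s*                                # spaces and =
    ([a-z]+)                               # romaji only, assumed lowercase ASCII
    /ux', $source, $matches, PREG_SET_ORDER);

// Drop the full match at index 0 so each row is [kanji, reading, romaji].
$rows = array_map(function (array $m) {
    return array_slice($m, 1);
}, $matches);

// $rows is now [['時', 'とき', 'toki'], ['超えて', 'こえて', 'koete']]
print_r($rows);
```

array_slice reindexes integer keys, so each row starts at index 0 again, which matches the structure asked for in the question.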

Source: https://habr.com/ru/post/886748/

