PHP Regex expression involving Japanese

Question


My goal is to filter through a Microsoft Word document (.docx), capturing all Japanese characters (kanji and kana). The current code I'm working with is the following:

preg_match_all('~[\x{4e00}-\x{9faf}]([\x{3040}-\x{309f}]) \= ([a-z]) \=+~u', $data, $matches);

After some research, I found the Unicode ranges for Japanese text here: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

An example of the data I'm working with looks like this:

時 (とき) = toki = time; hour; occasion; moment を = wo = particle denoting the direct object of the proposal (時 = time) 超えて (こえて) = koete = cross

My ultimate goal is to run preg_match_all over data samples that follow a template like "超えて (こえて) = koete", capturing the text before the "(", the text inside the "()", and the romanization after the "=".

The result I'm looking for will be a returned array that looks like this:

 array( 0 => array('時', 'とき', 'toki'), 1 => array('超えて', 'こえて', 'koete') );

The first element of each array may contain Kanji, Hiragana, and possibly Katakana; the second element is Hiragana only; and the third is just ordinary Latin letters. I'm not very good with regex, and adding Japanese Unicode on top of that leaves me lost, so any help would really be appreciated! Thanks!

+6

php regex unicode preg-match-all

Bryse meijer Apr 26 '11 at 23:06


1 answer

mario · Accepted Answer · 2011-04-26T23:24:14+0000

If you use the /u modifier, you can use Unicode property escapes such as \p{Han} instead of numeric ranges:

 preg_match_all('/
     ([\p{Han}\p{Katakana}\p{Hiragana}]+)   # Kanji
     (?: [(]                                # optional part: paren (
     ([\p{Hiragana}]+)                      # Hiragana
     [)] )?                                 # closing paren )
     \s*=\s*                                # spaces and =
     ([\w\s;=]+)                            # English letters
 /ux',
     $source, $matches, PREG_SET_ORDER
 );
 print_r($matches);

I noticed that the Hiragana in parentheses is optional, so I made your regex a bit more complicated with (?: ... )?, a non-capturing group that makes this part optional.
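To illustrate what the optional group does, here is a minimal sketch. The two input strings are hypothetical entries modelled on the question's data, and the last group is narrowed to [a-z]+ purely to keep the demo small; neither is part of the original answer.

```php
<?php
// Compact form of the pattern. Narrowing the last group to [a-z]+ is an
// assumption made for this demo only.
$pattern = '/([\p{Han}\p{Katakana}\p{Hiragana}]+)(?:[(]([\p{Hiragana}]+)[)])?\s*=\s*([a-z]+)/u';

// Hypothetical entries: one with a reading in parentheses, one without.
preg_match($pattern, '時(とき) = toki', $with);
preg_match($pattern, 'を = wo', $without);

print_r($with);    // group 2 holds the reading from the parentheses
print_r($without); // group 2 is an empty string: the optional part did not match
```

Because the skipped group sits between two groups that did match, PHP still fills index 2 with an empty string, so the romanization stays at index 3 in both cases.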

Note that the ordering of the result is slightly different from what you asked for, because preg_match_all stores the full match string at index [0]:

 [0] => Array
     (
         [0] => 時(とき) = toki = time; hour; occasion; moment
         [1] => 時
         [2] => とき
         [3] => toki = time; hour; occasion; moment
     )
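If you want exactly the array shape from the question, one option is to drop index [0] from each match and capture only the romanization. The following is a rough sketch rather than a definitive solution. It assumes a shortened sample string, allows optional whitespace before the opening parenthesis, and narrows the last group to [a-z]+; all three are my assumptions, not part of the pattern above.

```php
<?php
// Shortened sample in the same template as the question's data (assumed input).
$source = '時 (とき) = toki = time; hour; occasion; moment 超えて (こえて) = koete = cross';

$pattern = '/
    ([\p{Han}\p{Katakana}\p{Hiragana}]+)   # word: kanji and kana
    \s*                                    # optional space before the paren
    (?: [(] ([\p{Hiragana}]+) [)] )?       # optional reading in parentheses
    \s*=\s*                                # spaces and =
    ([a-z]+)                               # romanization only
/ux';

preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);

// Keep only the three captured groups and skip the full match at index 0.
$result = array();
foreach ($matches as $m) {
    $result[] = array($m[1], $m[2], $m[3]);
}

print_r($result);
```

With this sample, $result should contain two entries, one for 時 and one for 超えて. Entries without a reading in parentheses would get an empty string as their second element.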
