My goal is to parse a Microsoft Word document (.docx) and capture all of the Japanese characters (kanji and kana). The current code I'm working with is the following:
preg_match_all('~[\x{4e00}-\x{9faf}]([\x{3040}-\x{309f}]) \= ([a-z]) \=+~u', $data, $matches);
From some research, I found the Unicode ranges for Japanese text here: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
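For reference, here is a minimal sketch of how such ranges can be written as PCRE character classes. The boundaries below are the standard Unicode block boundaries (Hiragana U+3040 to U+309F, Katakana U+30A0 to U+30FF, CJK Unified Ideographs U+4E00 to U+9FFF) rather than the exact values in my code above, and `classify()` is a hypothetical helper used only for illustration.

```php
<?php
// Assumed ranges: standard Unicode block boundaries, not taken from the linked table.
$classes = [
    'hiragana' => '/^[\x{3040}-\x{309F}]+$/u',
    'katakana' => '/^[\x{30A0}-\x{30FF}]+$/u',
    'kanji'    => '/^[\x{4E00}-\x{9FFF}]+$/u',
];

// Hypothetical helper: returns the name of the first class that the
// whole string belongs to, or 'other' if none match.
function classify(string $text, array $classes): string
{
    foreach ($classes as $name => $pattern) {
        // The /u modifier makes PCRE treat pattern and subject as UTF-8.
        if (preg_match($pattern, $text) === 1) {
            return $name;
        }
    }
    return 'other';
}

foreach (['とき', 'トキ', '時', 'toki'] as $sample) {
    echo $sample, ' => ', classify($sample, $classes), PHP_EOL;
}
```

Without the `u` modifier, the `\x{...}` escapes above 0xFF are rejected, so the modifier is required here.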
An example of the data I'm working with looks like this:
時 (とき) = toki = time; hour; occasion; moment を = wo = particle denoting the direct object of the sentence (時 = time) 超えて (こえて) = koete = cross
My ultimate goal is to be able to run preg_match_all on data that follows a similar pattern, such as "超えて (こえて) = koete", and capture three things: the text before the parentheses, the text inside the parentheses, and the romanization after the = sign.
The result I'm looking for will be a returned array that looks like this:
array( 0 => array('時', 'とき', 'toki'), 1 => array('超えて', 'こえて', 'koete') );
The first element of each array can contain kanji, hiragana, and possibly katakana; the second element is hiragana only; and the third is plain Latin letters. I'm not very good with regex, and adding Japanese Unicode ranges on top of that has me stuck, so any help would really be appreciated! Thanks!
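To make the target concrete, here is a rough sketch of one pattern that produces the array described above from the sample data. It is only one possible approach, under these assumptions: the headword is any run of kana or kanji (standard Unicode blocks U+3040 to U+30FF and U+4E00 to U+9FFF), the reading in parentheses is hiragana only, and the romanization is lowercase a-z. The variable names are illustrative.

```php
<?php
$data = '時 (とき) = toki = time; hour; occasion; moment '
      . 'を = wo = particle denoting the direct object of the sentence (時 = time) '
      . '超えて (こえて) = koete = cross';

// Group 1: headword (kana and/or kanji), group 2: hiragana reading
// inside parentheses, group 3: romanization after the first "=".
$pattern = '~([\x{3040}-\x{30FF}\x{4E00}-\x{9FFF}]+)\s*'
         . '\(([\x{3040}-\x{309F}]+)\)\s*=\s*([a-z]+)~u';

// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

// Drop element 0 (the full match) so each row holds only the three captures.
$rows = array_map(fn(array $m): array => array_slice($m, 1), $matches);

print_r($rows);
// $rows is [['時', 'とき', 'toki'], ['超えて', 'こえて', 'koete']]
```

Entries without a parenthesized reading, such as "を = wo", are skipped by this pattern because the parentheses are required.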