Regex for Pinyin Matching

I'm looking for a regular expression that correctly matches valid pinyin (for example, "sheng" or "sou") while rejecting invalid pinyin (for example, "shong" or "sei"). Most of the regular expressions in the top Google results also match invalid pinyin in some cases.

Obviously, whatever approach is taken, this is going to be a monster of a regular expression, and I am particularly interested in the different approaches one could take to solve this problem. For example, see the related question on optimizing a regular expression for parsing Chinese pinyin.

The actual pinyin table can be found here: http://pinyin.info/rules/initials_finals.html

2 answers

I went with a regular expression built from smaller regular expressions, grouped by pinyin initial (usually the first letter). The first group covers all syllables starting with “b”, “p”, and “m”, then “f”, then “d” and “t”, and so on.

This approach seems easy to read and should be easy to edit (if it needs corrections or additions). I also put the exceptions at the beginning of each group to improve readability.

([mM]iu|[pmPM]ou|[bpmBPM](o|e(i|ng?)?|a(ng?|i|o)?|i(e|ng?|a[no])?|u))|
([fF](ou?|[ae](ng?|i)?|u))|([dD](e(i|ng?)|i(a[on]?|u))|
[dtDT](a(i|ng?|o)?|e(i|ng)?|i(a[on]?|e|ng|u)?|o(ng?|u)|u(o|i|an?|n)?))|
([nN]eng?|[lnLN](a(i|ng?|o)?|e(i|ng)?|i(ang|a[on]?|e|ng?|u)?|o(ng?|u)|u(o|i|an?|n)?|ve?))|
([ghkGHK](a(i|ng?|o)?|e(i|ng?)?|o(u|ng)|u(a(i|ng?)?|i|n|o)?))|
([zZ]h?ei|[czCZ]h?(e(ng?)?|o(ng?|u)?|ao|u?a(i|ng?)?|u?(o|i|n)?))|
([sS]ong|[sS]hua(i|ng?)?|[sS]hei|[sS][h]?(a(i|ng?|o)?|en?g?|ou|u(a?n|o|i)?|i))|
([rR]([ae]ng?|i|e|ao|ou|ong|u[oin]|ua?n?))|
([jqxJQX](i(a(o|ng?)?|[eu]|ong|ng?)?|u(e|a?n)?))|
(([aA](i|o|ng?)?|[oO]u?|[eE](i|ng?|r)?))|
([wW](a(i|ng?)?|o|e(i|ng?)?|u))|
[yY](a(o|ng?)?|e|in?g?|o(u|ng)?|u(e|a?n)?)

Here is an example Debuggex that I created.
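For anyone who wants to try the pattern outside Debuggex, here is a minimal Python sketch. It assumes the whitespace between the groups in the pattern is only layout, so it compiles with `re.VERBOSE` to ignore it. The helper name `is_pinyin` is my own for illustration and not part of any library.

```python
import re

# Pattern copied from the answer, one initial group per line.
# re.VERBOSE tells the regex engine to ignore the layout whitespace
# (an assumption: the spaces are not meant to match literal spaces).
PINYIN_PATTERN = re.compile(r"""
    ([mM]iu|[pmPM]ou|[bpmBPM](o|e(i|ng?)?|a(ng?|i|o)?|i(e|ng?|a[no])?|u))|
    ([fF](ou?|[ae](ng?|i)?|u))|([dD](e(i|ng?)|i(a[on]?|u))|
    [dtDT](a(i|ng?|o)?|e(i|ng)?|i(a[on]?|e|ng|u)?|o(ng?|u)|u(o|i|an?|n)?))|
    ([nN]eng?|[lnLN](a(i|ng?|o)?|e(i|ng)?|i(ang|a[on]?|e|ng?|u)?|o(ng?|u)|u(o|i|an?|n)?|ve?))|
    ([ghkGHK](a(i|ng?|o)?|e(i|ng?)?|o(u|ng)|u(a(i|ng?)?|i|n|o)?))|
    ([zZ]h?ei|[czCZ]h?(e(ng?)?|o(ng?|u)?|ao|u?a(i|ng?)?|u?(o|i|n)?))|
    ([sS]ong|[sS]hua(i|ng?)?|[sS]hei|[sS][h]?(a(i|ng?|o)?|en?g?|ou|u(a?n|o|i)?|i))|
    ([rR]([ae]ng?|i|e|ao|ou|ong|u[oin]|ua?n?))|
    ([jqxJQX](i(a(o|ng?)?|[eu]|ong|ng?)?|u(e|a?n)?))|
    (([aA](i|o|ng?)?|[oO]u?|[eE](i|ng?|r)?))|
    ([wW](a(i|ng?)?|o|e(i|ng?)?|u))|
    [yY](a(o|ng?)?|e|in?g?|o(u|ng)?|u(e|a?n)?)
""", re.VERBOSE)


def is_pinyin(word: str) -> bool:
    """Hypothetical helper: True only if the WHOLE word matches the pattern."""
    return PINYIN_PATTERN.fullmatch(word) is not None


if __name__ == "__main__":
    for word in ["sheng", "sou", "shong", "sei"]:
        print(word, is_pinyin(word))
```

The sketch uses `fullmatch` rather than `search` on purpose: an unanchored search would find the valid syllable "hong" inside "shong". With `fullmatch`, the pattern accepts "sheng" and "sou" and rejects "shong" and "sei". It is still not fully strict, though: sub-patterns such as `en?g?` also accept strings like "seg", so treat it as a starting point rather than a complete validator.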


I would use a combined approach rather than a single regex.

To check whether a word is valid pinyin:

  • grab the word

  • capture letters from the beginning of the word for as long as they are consonants. This separates the initial sound from the final sound.

  • check that the initial and the final are each valid ...

  • ... and if so, check whether their combination is allowed (via a lookup table, for example one whose entries are just 1 and 0).
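The steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names `split_syllable`, `INITIALS`, `FINALS`, `ALLOWED`, and `is_valid_pinyin` are hypothetical, and the table is a tiny hand-picked subset (two initials, four finals), not the full pinyin table linked in the question.

```python
# Letters treated as vowels; 'v' is assumed to stand in for "ü".
VOWELS = set("aeiouv")

# Illustrative subset only; a real table would list every initial and final.
INITIALS = ["s", "sh"]
FINALS = ["ei", "eng", "ong", "ou"]

# ALLOWED[row][col] is 1 if INITIALS[row] + FINALS[col] is a real syllable.
ALLOWED = [
    # ei eng ong ou
    [0, 1, 1, 1],  # s:  seng, song, sou
    [1, 1, 0, 1],  # sh: shei, sheng, shou
]


def split_syllable(word: str) -> tuple[str, str]:
    """Split a word into (initial, final) at the first vowel letter."""
    word = word.lower()
    i = 0
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    return word[:i], word[i:]


def is_valid_pinyin(word: str) -> bool:
    """Check both parts, then look up whether the combination is allowed."""
    initial, final = split_syllable(word)
    if initial not in INITIALS or final not in FINALS:
        return False
    return ALLOWED[INITIALS.index(initial)][FINALS.index(final)] == 1


if __name__ == "__main__":
    for word in ["sheng", "sou", "shong", "sei"]:
        print(word, split_syllable(word), is_valid_pinyin(word))
```

With this subset, "sheng" and "sou" pass, while "shong" and "sei" are rejected even though "ong" and "ei" are valid finals on their own, which is exactly what the combination table is for. Syllables with no initial (such as "an") would need an empty-string row in a full table.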


Source: https://habr.com/ru/post/947507/