preg_split vs mb_split

According to the PHP manual, the PCRE regular expression modifier u enables UTF-8 support for both the pattern and the subject string.

Given this, is there any difference between using PCRE expressions with the u modifier and the corresponding multibyte string functions? (Assuming all strings are UTF-8 encoded.)


As an example, consider preg_split vs mb_split. Both

 preg_split('/' . $pattern . '/u', $string); 

and

 mb_split($pattern, $string); 

seem to return identical results. So which one is preferable? Does it even matter?

+5
2 answers

The main difference is that the preg_ functions use the PCRE library, whereas the mb_ereg_ functions (including mb_split) use the Oniguruma library (used in Ruby before version 2.0).

The main advantage of Oniguruma is that it can work with many encodings (ASCII, UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, EUC-JP, EUC-TW, EUC-KR, EUC-CN, Shift_JIS, Big5, GB18030, KOI8-R, CP1251, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16), whereas the preg_ functions handle only single-byte strings or, with the u modifier, UTF-8.

Note, however, that this list lacks many of the encodings available to other mb_ functions such as mb_detect_encoding (UTF-7, ArmSCII-8, CP866, for example), which limits the usefulness of the mb_ereg_ functions: for those encodings you must convert the string to a supported encoding before working on it, and convert it back afterwards.
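
To make the encoding point concrete, here is a minimal sketch, assuming the mbstring extension is built with regex support. The sample text and the choice of Shift_JIS are illustrative assumptions, not taken from the question. It splits a Shift_JIS string directly with mb_split, then does the same job with preg_split after converting to UTF-8.

```php
<?php
// Hypothetical sample data: a list separated by the ideographic comma,
// stored as Shift_JIS instead of UTF-8.
$utf8 = "東京、大阪、京都";
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');
$sep  = mb_convert_encoding('、', 'SJIS', 'UTF-8');

// Option A: tell the mb_ereg_ family which encoding the pattern and subject
// use, split, then restore the previous setting.
$previous = mb_regex_encoding();
mb_regex_encoding('SJIS');
$partsSjis = mb_split($sep, $sjis);
mb_regex_encoding($previous);

// Option B: convert to UTF-8 first and split with PCRE and the u modifier.
$partsUtf8 = preg_split('/、/u', mb_convert_encoding($sjis, 'UTF-8', 'SJIS'));

// Convert the Shift_JIS pieces back so both results can be compared.
$partsSjisAsUtf8 = array_map(
    fn (string $p): string => mb_convert_encoding($p, 'UTF-8', 'SJIS'),
    $partsSjis
);
var_dump($partsSjisAsUtf8 === $partsUtf8);
```

Option A avoids the round trip through UTF-8, which is the practical benefit described above; Option B is the only route with preg_ functions when the data is in a multibyte encoding other than UTF-8.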

The two regex engines have more or less the same features, but there are some differences (the lists below are not exhaustive):

Oniguruma does not support:

  • Shorthand Unicode character classes with a one-letter name written without curly braces. Example: \pN is read as pN; you need to write \p{N}
  • the Unicode character classes Xan, Xps, Xsp, Xwd
  • unescaped square brackets in a character class: Oniguruma sees [][] as two empty character classes, whereas PCRE sees one character class containing ] and [
  • the \K feature
  • the alias \R for a newline sequence
  • named groups that use the Python syntax (?P<name>...) . Only (?<name>...) or (?'name'...) are allowed.
  • references to groups with anything other than the Oniguruma syntax \g<name> (the Perl syntaxes (?&name) , (?1) and (?R) are not allowed).
  • backtracking control verbs

PCRE does not support:

  • Duplicate named groups (by default). To enable this feature, you need to use the modifier (?J) .
  • numbered backreferences with the syntax \k<...> . You can write \k<name> , but not \k<1> or \k<-1> .
  • backreferences to a specific nesting level. Oniguruma can do this using \k<name+n> , where n is the nesting level.
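
The first two points can be checked quickly with preg_match. This is a minimal sketch with made-up patterns and subjects; the @ operator is used only to silence the compile warning that the first call is expected to raise.

```php
<?php
// Two groups named "n" without (?J): the pattern does not compile,
// so preg_match emits a warning and returns false.
$withoutJ = @preg_match('/(?<n>a)|(?<n>b)/', 'b');

// With (?J) the same pattern compiles and matches.
$withJ = preg_match('/(?J)(?<n>a)|(?<n>b)/', 'b');

// A named backreference with \k<name> is fine in PCRE:
// the closing quote must be the same character as the opening one.
preg_match('/(?<q>["\']).*?\k<q>/', 'say "hi"', $quoted);

var_dump($withoutJ, $withJ, $quoted[0]);
```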


To match newlines with the dot, Oniguruma uses the m modifier, whereas PCRE uses the s modifier. In the mb_ereg_ functions the dot matches newlines by default (so the m modifier is enabled by default).

PCRE uses the s modifier to match newlines with the dot. The m modifier behaves differently in PCRE: it changes the meaning of the ^ and $ anchors from "start" and "end" of the string to "start" and "end" of the line.

With Oniguruma, the meaning of these anchors does not change; they always match the start and end of the line. To match the limits of the string, it uses \A and \z , which are also available in PCRE.
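
The PCRE half of this description can be verified with a short sketch like the following. The two-line subject is an arbitrary example, and only preg_ functions are shown because their behaviour here is not configurable through mb_regex_set_options.

```php
<?php
$text = "foo\nbar";

// Without s, the dot does not match a newline; with s, it does.
$dotDefault = preg_match('/foo.bar/', $text);
$dotWithS   = preg_match('/foo.bar/s', $text);

// Without m, ^ and $ refer to the whole string, so no single "word line" matches.
$anchorsDefault = preg_match_all('/^\w+$/', $text);
// With m, ^ and $ match at every line boundary: "foo" and "bar".
$anchorsWithM   = preg_match_all('/^\w+$/m', $text);

// \A and \z keep meaning "start/end of string" even when m is set.
$wholeString = preg_match('/\Afoo\nbar\z/m', $text);

var_dump($dotDefault, $dotWithS, $anchorsDefault, $anchorsWithM, $wholeString);
```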

Note that Oniguruma was forked to create Onigmo (used in current versions of Ruby), which implements more Perl features and syntax elements and is therefore closer to PCRE.

+6

As long as you work strictly with UTF-8, you will be fine with either. If you use another charset, then mb_split() is the recommended choice, since the u modifier with PCRE does not let you specify a charset; it always treats the strings as UTF-8.
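
For the UTF-8 case, a quick check along these lines shows the two calls agreeing. It is only a sketch: the sample string and the \s+ pattern are assumptions chosen for illustration, and it requires mbstring with regex support.

```php
<?php
// Make sure the mb_ereg_ family interprets pattern and subject as UTF-8.
mb_regex_encoding('UTF-8');

$string  = "naïve  café\t東京 tower";
$pattern = '\s+';

$viaPcre = preg_split('/' . $pattern . '/u', $string);
$viaMb   = mb_split($pattern, $string);

var_dump($viaPcre === $viaMb);
```

One practical difference remains even here: mb_split takes the bare pattern, while preg_split needs delimiters, so a pattern containing / has to be escaped (or given a different delimiter) only on the PCRE side.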

Regarding scaling and long-term viability, I would recommend using mb_split() from the very beginning, so that you are covered whether or not UTF-8 is used down the road.

+2

Source: https://habr.com/ru/post/1245424/

