Match a PHP regex word boundary in UTF-8

Question


I have the following PHP code in a UTF-8 encoded PHP file:

    var_dump(setlocale(LC_CTYPE, 'de_DE.utf8', 'German_Germany.utf-8', 'de_DE', 'german'));
    var_dump(mb_internal_encoding());
    var_dump(mb_internal_encoding('utf-8'));
    var_dump(mb_internal_encoding());
    var_dump(mb_regex_encoding());
    var_dump(mb_regex_encoding('utf-8'));
    var_dump(mb_regex_encoding());
    var_dump(preg_replace('/\bweiß\b/iu', 'weiss', 'weißbier'));

I want the last regular expression to replace only complete words, not parts of words.

On my Windows computer, it returns:

    string 'German_Germany.1252' (length=19)
    string 'ISO-8859-1' (length=10)
    boolean true
    string 'UTF-8' (length=5)
    string 'EUC-JP' (length=6)
    boolean true
    string 'UTF-8' (length=5)
    string 'weißbier' (length=9)

On a web server (Linux) I get:

    string(10) "de_DE.utf8"
    string(10) "ISO-8859-1"
    bool(true)
    string(5) "UTF-8"
    string(10) "ISO-8859-1"
    bool(true)
    string(5) "UTF-8"
    string(9) "weissbier"

Thus, the regex works as I expected on Windows (the partial word is left alone), but not on Linux (the "weiß" inside "weißbier" is replaced).

So, the main question: how should I write my regular expression so that it matches only at word boundaries?

A secondary question: how can I tell Windows that I want to use UTF-8 in my PHP application?

+11

php regex pcre utf-8 word-boundary

tomsv Mar 12 '10 at 13:08


4 answers

I guess this was related to bug #52971:

PCRE meta-characters such as \b and \w do not work with Unicode strings.

It was fixed in PHP 5.3.4:

PCRE extension: Fixed bug #52971 (PCRE metacharacters not working with utf-8).
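A quick way to see which behavior a given installation has is to run the pattern from the question and inspect the result. The following is only a sketch: the variable names are illustrative, the file is assumed to be saved as UTF-8, and the version threshold is taken from the changelog entry quoted above.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literal "ß" is valid for the /u flag.

// True when the running PHP is at least the release named in the changelog entry.
$atLeastFixedRelease = version_compare(PHP_VERSION, '5.3.4', '>=');

// If \b is Unicode-aware, "ß" counts as a word character, so there is no
// boundary between "ß" and "b" and the pattern must not match "weißbier".
$boundaryIsUnicodeAware = preg_match('/\bweiß\b/u', 'weißbier') === 0;

var_dump($atLeastFixedRelease, $boundaryIsUnicodeAware);
```

If the second value is false, \b is still treating "ß" as a non-word character, and a pattern based on \p{L} is the safer choice.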

+4

bobble bubble Dec 10 '16 at 10:32


This is what I have found so far. By rewriting the search and replace patterns as follows:

    $before = '(^|[^\p{L}])';
    $after = '([^\p{L}]|$)';
    var_dump(preg_replace('/'.$before.'weiß'.$after.'/iu', '$1weiss$2', 'weißbier'));
    // Test some other cases:
    var_dump(preg_replace('/'.$before.'weiß'.$after.'/iu', '$1weiss$2', 'weiß'));
    var_dump(preg_replace('/'.$before.'weiß'.$after.'/iu', '$1weiss$2', 'weiß bier'));
    var_dump(preg_replace('/'.$before.'weiß'.$after.'/iu', '$1weiss$2', ' weiß'));

I get the desired result:

    string 'weißbier' (length=9)
    string 'weiss' (length=5)
    string 'weiss bier' (length=10)
    string ' weiss' (length=6)

on both my Windows machine running Apache and a hosted Linux web server running Apache.

I suspect there is a better way to do this.

Also, I would still like to set up my Windows computer to use UTF-8.

+3

tomsv Mar 12 '10 at 14:37


According to this comment, this is a bug in PHP. Is there any benefit to using \W instead of \b?

0

ntd Mar 14 '10 at 14:25


Alan Moore · Accepted Answer · 2010-03-15 17:12

Even in UTF-8 mode, the standard shorthands such as \w and \b are not Unicode-aware. You just have to use the Unicode shorthands, as you worked out, but you can make it a little less ugly by using lookarounds instead of alternations:

 /(?<!\pL)weiß(?!\pL)/u

Notice also that I left the curly braces off the Unicode class shorthands; you can do this when the class name consists of a single letter.
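To show how the lookaround pattern might be used with preg_replace, here is a minimal sketch. The helper name replace_whole_word is hypothetical and not part of PHP or of the answer above, and the file is assumed to be saved as UTF-8. Because lookarounds do not consume the neighboring characters, the replacement string no longer needs $1 and $2.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literal "ß" is valid for the /u flag.

// Hypothetical helper: replace $word only where no letter touches it on either side.
// Note that "$" and "\" in $replacement keep their usual preg_replace meaning.
function replace_whole_word(string $word, string $replacement, string $subject): ?string
{
    // preg_quote() escapes regex metacharacters in $word, including the "/" delimiter.
    $pattern = '/(?<!\pL)' . preg_quote($word, '/') . '(?!\pL)/iu';

    // preg_replace() returns null on error (for example, invalid UTF-8 input).
    return preg_replace($pattern, $replacement, $subject);
}

$partOfWord = replace_whole_word('weiß', 'weiss', 'weißbier');  // "weißbier"
$wholeWord  = replace_whole_word('weiß', 'weiss', 'weiß bier'); // "weiss bier"
$adjacent   = replace_whole_word('weiß', 'weiss', 'weiß weiß'); // "weiss weiss"

// For comparison, the alternation-based pattern consumes the separating space,
// so the second of two adjacent words is not matched.
$alternation = preg_replace('/(^|[^\p{L}])weiß([^\p{L}]|$)/iu', '$1weiss$2', 'weiß weiß'); // "weiss weiß"

var_dump($partOfWord, $wholeWord, $adjacent, $alternation);
```

The last case is one practical reason to prefer lookarounds: with the alternation version, the space after the first match has already been consumed and cannot serve as the leading separator for the next word.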
