Why is this end of line (\\b) not recognized as a word boundary in stringr/ICU and Perl?

Question

Using stringr, I tried to detect a € sign at the end of a string as follows:

 str_detect("my text €", "€\\b") # FALSE

Why is this not working? It works in the following cases:

 str_detect("my text a", "a\\b") # TRUE - letter instead of €
 grepl("€\\b", "2009in €") # TRUE - base R solution

But it also fails in perl mode:

 grepl("€\\b", "2009in €", perl=TRUE) # FALSE

So what is wrong with the regex €\\b? The regex €$ works in all cases.

regex r pcre stringr

Rentrop Dec 15 '16 at 23:23

2 answers

\b is equivalent to

 (?:(?<!\w)(?=\w)|(?<=\w)(?!\w))

That is, it matches

- between a word char and a non-word char,
- between a word char and the start of the string, and
- between a word char and the end of the string.
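As a quick check of that equivalence in R, here is a minimal sketch using base R's PCRE mode (perl = TRUE). The variable names are made up for illustration, and "\u20ac" is simply an encoding-safe way of writing €.

```r
# Illustrative names only; `wb` is the lookaround form of \b quoted above.
wb <- "(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))"
euro <- "\u20ac"  # the euro sign, written as an escape so the file encoding does not matter

# A letter at the end of the string: both forms find a boundary after it.
letter_b  <- grepl("a\\b", "my text a", perl = TRUE)
letter_wb <- grepl(paste0("a", wb), "my text a", perl = TRUE)

# The euro sign at the end of the string: neither form finds a boundary,
# because the euro sign is not a word character.
euro_b  <- grepl(paste0(euro, "\\b"), paste("my text", euro), perl = TRUE)
euro_wb <- grepl(paste0(euro, wb), paste("my text", euro), perl = TRUE)

print(c(letter_b, letter_wb, euro_b, euro_wb))
```

If PCRE behaves as described here, the first two results should be TRUE and the last two FALSE, matching what the question reports for perl = TRUE.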

€ is a symbol, and symbols are not word characters.

 $ uniprops €
 U+20AC <€> \N{EURO SIGN}
     \pS \p{Sc}
     All Any Assigned Common Zyyy Currency_Symbol Sc Currency_Symbols S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Print X_POSIX_Print Symbol Unicode

If your language supports lookbehinds and lookaheads, you can use the following to find the boundary between whitespace and non-whitespace (treating the start and end of the string as whitespace).

 (?:(?<!\S)(?=\S)|(?<=\S)(?!\S))
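Applied to the question, that pattern could be used in R roughly as follows. This is a sketch with base R's perl = TRUE mode; the variable names and the extra sample strings are assumptions added for illustration.

```r
# Illustrative names only; `sb` is the whitespace-boundary pattern quoted above.
sb <- "(?:(?<!\\S)(?=\\S)|(?<=\\S)(?!\\S))"
euro <- "\u20ac"  # the euro sign, written as an escape so the file encoding does not matter

# Euro sign followed by the end of the string: treated as a boundary.
at_end <- grepl(paste0(euro, sb), paste("my text", euro), perl = TRUE)

# Euro sign followed by a space: also a boundary.
before_space <- grepl(paste0(euro, sb), paste0("5", euro, " each"), perl = TRUE)

# Euro sign followed directly by a letter: no boundary, so no match.
mid_token <- grepl(paste0(euro, sb), paste0("2009", euro, "in"), perl = TRUE)

print(c(at_end, before_space, mid_token))
```

Unlike \b, this boundary depends only on whitespace, so it does not matter whether the neighbouring character is a letter, a digit, or a symbol such as €.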

ikegami Dec 15 '16 at 23:47

Wiktor Stribiżew · Accepted Answer · 2016-12-15T23:47:55+0000

When you use the base R regex functions without perl=TRUE, the TRE regex flavor is used.

It looks like the TRE word boundary:

- When used after a non-word character, matches the end-of-string position, and
- When used before a non-word character, matches the start-of-string position.

See R tests:

 > gsub("\\b\\)", "HERE", ") 2009in )")
 [1] "HERE 2009in )"
 > gsub("\\)\\b", "HERE", ") 2009in )")
 [1] ") 2009in HERE"

This is not the usual behavior of a word boundary in the PCRE and ICU regex flavors. There, a word boundary before a non-word character matches only when that character is preceded by a word character, which excludes the start-of-string position. Likewise, a word boundary after a non-word character requires a word character to appear immediately after it:

There are three different positions that qualify as word boundaries:

- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
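The three positions, and the two cases that do not qualify, can be checked in base R with perl = TRUE. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```r
# Each call uses PCRE (perl = TRUE); the sample strings are illustrative only.
start_word <- grepl("\\ba", "a)", perl = TRUE)     # before a leading word character
end_word   <- grepl("a\\b", ")a", perl = TRUE)     # after a trailing word character
between    <- grepl("a\\b\\)", "a)", perl = TRUE)  # between a word and a non-word character

# A non-word character at the very start or end of the string has no
# word boundary on its outer side, so neither of these matches.
start_nonword <- grepl("\\b\\)", ")a", perl = TRUE)
end_nonword   <- grepl("\\)\\b", "a)", perl = TRUE)

print(c(start_word, end_word, between, start_nonword, end_nonword))
```

The last case is the same situation as €\\b at the end of the string in the question: the final character is not a word character, so PCRE finds no boundary after it.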
