Why is this end of line (\\b) not recognized as a word boundary in stringr/ICU and Perl?

Using stringr, I tried to detect a € sign at the end of a string as follows:

 str_detect("my text €", "€\\b") # FALSE 

Why is this not working? It works in the following cases:

 str_detect("my text a", "a\\b") # TRUE - letter instead of €
 grepl("€\\b", "2009in €") # TRUE - base R solution

But it also fails in perl mode:

 grepl("€\\b", "2009in €", perl=TRUE) # FALSE 

So what is wrong with the €\\b regex? The regex €$ works in all cases...

2 answers

When you use the base R regex functions without perl=TRUE, the TRE regex flavor is used.

It looks like the TRE word boundary:

  • when used after a non-word character, matches the end-of-string position, and
  • when used before a non-word character, matches the start-of-string position.

See R tests:

 > gsub("\\b\\)", "HERE", ") 2009in )")
 [1] "HERE 2009in )"
 > gsub("\\)\\b", "HERE", ") 2009in )")
 [1] ") 2009in HERE"

This is not the usual behavior of a word boundary in the PCRE and ICU regex flavors. There, a word boundary before a non-word character matches only when that character is preceded by a word character, which excludes the start-of-string position. Likewise, a word boundary after a non-word character requires a word character to appear right after the boundary:

There are three different positions that qualify as word boundaries:

- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
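
These rules can be checked in base R with perl=TRUE (PCRE). The following is a minimal sketch; the test strings "a)", ")a" and "a)b" are made up for illustration and do not come from the answer above.

```r
# PCRE word boundary next to the non-word character ")".
# A boundary exists only where ")" touches a word character.
after_word  <- grepl("\\b\\)", "a)",  perl = TRUE)  # "a" precedes ")"
at_start    <- grepl("\\b\\)", ")a",  perl = TRUE)  # nothing precedes ")"
before_word <- grepl("\\)\\b", "a)b", perl = TRUE)  # "b" follows ")"
at_end      <- grepl("\\)\\b", "a)",  perl = TRUE)  # nothing follows ")"

print(c(after_word = after_word, at_start = at_start,
        before_word = before_word, at_end = at_end))
```

Only the two cases where ")" is adjacent to a word character should report a match, which is why €\\b fails when € is the last character of the string.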

 \b 

is equivalent to

 (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) 

That is, it matches

  • between a word char and a non-word char,
  • between a word char and the start of the string, and
  • between a word char and the end of the string.
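
One way to convince yourself of this equivalence in R is to mark every matched position with "|" and compare the results for both patterns. This is only a sketch, assuming base R with perl=TRUE; the sample strings are borrowed from the question and the first answer, and the variable names are arbitrary.

```r
# Mark each zero-width match with "|" and compare the two patterns.
word_b       <- "\\b"
lookaround_b <- "(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))"

# "\u20ac" is the euro sign, written as an escape to avoid encoding issues.
samples <- c("my text a", "2009in \u20ac", ") 2009in )")

marked_b          <- gsub(word_b, "|", samples, perl = TRUE)
marked_lookaround <- gsub(lookaround_b, "|", samples, perl = TRUE)

print(marked_b)
print(identical(marked_b, marked_lookaround))
```

If the two patterns really match at the same positions, the marked strings are identical for every sample.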

€ is a symbol, and symbols are not word characters.

 $ uniprops €
 U+20AC <€> \N{EURO SIGN}
     \pS \p{Sc}
     All Any Assigned Common Zyyy Currency_Symbol Sc Currency_Symbols S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Print X_POSIX_Print Symbol Unicode
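
The same point can be checked from R without uniprops. A small sketch, assuming base R with perl=TRUE; the character set tested here is chosen only for illustration.

```r
# Which single characters does \w accept under PCRE?
chars <- c(letter = "a", digit = "5", underscore = "_", euro = "\u20ac")

is_word_char <- grepl("^\\w$", chars, perl = TRUE)
names(is_word_char) <- names(chars)  # label the results for readability

print(is_word_char)
```

The letter, digit and underscore count as word characters, while the euro sign does not.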

If your language supports lookbehinds and lookaheads, you can use the following to find the boundary between a space and a non-space (treating the start and end of the string as a space).

 (?:(?<!\S)(?=\S)|(?<=\S)(?!\S)) 
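
Applied to the question's strings, this could look as follows. It is a sketch that assumes base R with perl=TRUE; the names space_b and euro and the extra test string "€5" are introduced here for illustration only.

```r
# Whitespace-based boundary: start and end of string count as a space.
space_b <- "(?:(?<!\\S)(?=\\S)|(?<=\\S)(?!\\S))"
euro    <- "\u20ac"  # the euro sign, written as an escape

x <- c("my text \u20ac", "2009in \u20ac", "\u20ac5")

# \b needs a word character next to the euro sign.
with_word_b  <- grepl(paste0(euro, "\\b"), x, perl = TRUE)
# The whitespace boundary needs a space or the string end after it.
with_space_b <- grepl(paste0(euro, space_b), x, perl = TRUE)

print(data.frame(with_word_b, with_space_b))
```

With \\b the euro sign is found only when a word character follows it, whereas the whitespace-based boundary finds it at the end of the string, which is the behavior the question was after.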

Source: https://habr.com/ru/post/1013205/

