R regex: problems with character vectors containing NA

I tried to collapse runs of two or more whitespace characters in the elements of a vector into a single space using gsub(), for example:

 x1 <- c(" abc", "abc ", "abc")
 gsub("\\s{2,}", " ", x1)
 [1] " abc" "abc " "abc"

But as soon as the vector contains NA, the substitution fails:

 x2 <- c(NA, " abc", "abc ", "abc")
 gsub("\\s{2,}", " ", x2)
 [1] NA " " " " " "

However, it works fine with Perl-compatible regular expressions:

 gsub("\\s{2,}", " ", x2, perl = TRUE)
 [1] NA " abc" "abc " "abc"

Does anyone have any suggestions as to why R's own regular expressions behave this way? I am using R 3.1.1 on Linux x86-64 if this helps.

2 answers

I did not dig into the source code, but it also works if you use the useBytes = TRUE argument (without perl = TRUE). From the help: "if TRUE the matching is done byte-by-byte rather than character-by-character." This may be part of why the default call to gsub fails.
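As a minimal sketch of that workaround: the call below passes useBytes = TRUE to gsub and leaves perl at its default. The sample strings are illustrative (chosen so that they actually contain runs of two or more spaces), and the variable name res is an assumption, not something from the thread.

```r
# Illustrative input: one NA plus strings containing runs of 2+ spaces.
x2 <- c(NA, "  abc", "abc  ", "a  bc")

# Byte-wise matching with the default regex engine (no perl = TRUE).
res <- gsub("\\s{2,}", " ", x2, useBytes = TRUE)
print(res)
```

On a current R release the plain call without useBytes should give the same result, since the behavior described here was observed on R 3.1.1.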

However, regexpr, regexec and gregexpr each find all the correct positions (I replaced \\s with [[:space:]] for readability and show only the output from regexpr):

 regexpr("[[:space:]]{2,}", x2)
 ## [1] NA 1 1 1
 ## attr(,"match.length")
 ## [1] NA 5 9 6

So the regular expression itself is fine.

Update: A quick look at do_gsub in grep.c of R 3.1.1 did not yield much insight (it is a twisty maze of if/else), but I am tempted to call this a bug.


Just to wrap up this question: as others suggested, this behavior is indeed a bug. It was reported and confirmed here:

https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16009
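For anyone stuck on an affected version, one defensive option is to run gsub only on the non-NA elements. This is a sketch, not something proposed in the thread: collapse_ws is a hypothetical helper name, the sample strings are illustrative, and the assumption that it sidesteps the bug rests only on the observation above that the NA-free vector x1 was handled correctly.

```r
# Hypothetical helper (not from the original thread): collapse runs of
# two or more whitespace characters, leaving NA elements untouched.
collapse_ws <- function(x) {
  out <- x
  ok <- !is.na(x)                          # positions that hold real strings
  out[ok] <- gsub("\\s{2,}", " ", x[ok])   # gsub never sees an NA
  out
}

x2 <- c(NA, "  abc", "abc  ", "a  bc")
cleaned <- collapse_ws(x2)
print(cleaned)
```

Because the NA positions are copied through unchanged, the result has the same length and order as the input.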


Source: https://habr.com/ru/post/976191/

