R regex: problems with character vectors containing NA

Question

I tried to collapse every run of two or more whitespace characters in the elements of a vector into a single space using gsub(), for example:

 x1 <- c(" abc", "abc ", "abc")
 gsub("\\s{2,}", " ", x1)
 [1] " abc" "abc " "abc"

But as soon as the vector contains NA, the substitution fails:

 x2 <- c(NA, " abc", "abc ", "abc")
 gsub("\\s{2,}", " ", x2)
 [1] NA " " " " " "

However, it works fine with Perl-compatible regular expressions:

 gsub("\\s{2,}", " ", x2, perl = TRUE)
 [1] NA " abc" "abc " "abc"

Does anyone have an idea why R's own regular expression engine behaves this way? I am using R 3.1.1 on Linux x86-64, if that helps.

regex r

rseubert Oct 3 '14 at 6:41

2 answers

hrbrmstr · Answer 1 · 2014-10-03T11:55:42+0000

I have not dug into the source code, but it also works if you use useBytes = TRUE (without perl = TRUE). From the help: "if useBytes is TRUE the matching is done byte-by-byte, not character-by-character." This may be part of why it doesn't work in gsub.

However, regexpr, regexec and gregexpr each find all the correct positions (I replaced \\s with [[:space:]] for readability and show only the output from regexpr):

 regexpr("[[:space:]]{2,}", x2)
 ## [1] NA 1 1 1
 ## attr(,"match.length")
 ## [1] NA 5 9 6

So the regular expression itself is fine.

Update: A quick look at do_gsub in grep.c from R 3.1.1 did not yield much insight (it is a twisty maze of if/else :-), but I am inclined to call this a bug.
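
For reference, the two workarounds mentioned so far (useBytes = TRUE from this answer, perl = TRUE from the question) could be compared with something like the sketch below. The input vector x is a made-up example with unambiguous runs of spaces, not the asker's original data.

```r
# Hypothetical input: an NA plus strings with runs of two or more spaces
x <- c(NA, "  abc", "a   b  c", "abc")

# Workaround 1: byte-wise matching with the default regex engine
res_bytes <- gsub("\\s{2,}", " ", x, useBytes = TRUE)

# Workaround 2: PCRE matching, as in the question
res_perl <- gsub("\\s{2,}", " ", x, perl = TRUE)

print(res_bytes)
print(identical(res_bytes, res_perl))
```

Both calls should leave the NA untouched and collapse each run of whitespace to a single space. Note that useBytes = TRUE matches bytes rather than characters, so perl = TRUE is probably the safer choice when the input may contain non-ASCII text.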

rseubert · Answer 2 · 2014-10-05T12:54:17+0000

Just to wrap up this question: as others suggested, this behavior is in fact a bug. It has been reported and confirmed here:

https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16009
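
If you are stuck on an affected R version and would rather not change regex engines, one option is to keep the NA elements away from gsub() altogether. A minimal sketch follows; collapse_ws is a hypothetical helper name, not part of base R, and the input vector is invented for illustration.

```r
# Hypothetical helper: collapse runs of 2+ whitespace characters,
# calling gsub() only on the non-NA elements
collapse_ws <- function(x) {
  out <- as.character(x)
  ok <- !is.na(out)                         # positions that hold real strings
  out[ok] <- gsub("\\s{2,}", " ", out[ok])  # gsub() never sees an NA
  out
}

x <- c(NA, "  abc", "a   b  c", "abc")
res <- collapse_ws(x)
print(res)
```

This relies only on the observation from the question that gsub() behaves correctly on vectors without NA, so it does not depend on perl = TRUE or useBytes = TRUE.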
