Parsing (possibly) non-existent numbers with a regular expression in R

I am trying to extract numbers from strings in R with the stringr package. Sometimes a number is missing. Here are some sample lines:

str <- c(
"cash dividends per share $ - $ - $ - $ 0.08 $ 0.16 cash",
"cash dividends per share $ 0.01 $ 12.10 $ 0.01 $ 0.08 $ 0.16 hello",
"cash dividends per share $ - $ - $ 0.91 $ - $ 0.16 world",
"cash dividends per share - - 0.12 - 0.16 hsac",
"cash dividends per share $ - $ - $ - $ - $ 0.16 afterwards",
"cash dividends per share $0.12 $ - $0.1 $ - $ - comes",
"cash dividends per share 0.12 - 0.12 - - text",
"cash dividends per share... 0.12 - 0.12 - - random",
"cash dividends per share...0.123 0.321 - - 0.12 blu",
"cash dividends per share ..... $ 0.12 $ - $ 0.12  $ - $ - foo",
"cash dividends per share ..... $0.42 $0.42 $-  $- $- bar")

I constructed the following regular expression, which I believe should cover all of these cases, but it does not. I have tried several variations and cannot find one that works (I do not even see what is wrong with this one):

library("stringr")
rgxp <- "cash dividends [declared]* per share[ \\.]+[\\$]?[ ]?([-0.9\\.]+)[ ]?[\\$]?[ ]?([-0.9\\.]+)[ ]?[\\$]?[ ]?([-0.9\\.]+)[ ]?[\\$]?[ ]?([-0.9\\.]+)[ ]?[\\$]?[ ]?([-0.9\\.]+).*"
str_match_all(str, rgxp)

Do you see any problem that causes the above regular expression to fail?

Edit: I should have said that my desired result is a vector of five elements per string, each being either a number or a hyphen where there is no number. Thanks!

+4
2 answers

To extract the numbers and keep a hyphen wherever a value is missing:

rgxp <- "([0-9]+\\.?[0-9]*)|(-)"
str_extract_all(str, rgxp)

[[1]]
[1] "-"    "-"    "-"    "0.08" "0.16"

[[2]]
[1] "0.01"  "12.10" "0.01"  "0.08"  "0.16" 

[[3]]
[1] "-"    "-"    "0.91" "-"    "0.16"

[[4]]
[1] "-"    "-"    "0.12" "-"    "0.16"

[[5]]
[1] "-"    "-"    "-"    "-"    "0.16"

[[6]]
[1] "0.12" "-"    "0.1"  "-"    "-"   

[[7]]
[1] "0.12" "-"    "0.12" "-"    "-"   

[[8]]
[1] "0.12" "-"    "0.12" "-"    "-"   

[[9]]
[1] "0.123" "0.321" "-"     "-"     "0.12" 

[[10]]
[1] "0.12" "-"    "0.12" "-"    "-"   

[[11]]
[1] "0.42" "0.42" "-"    "-"    "-"   
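
Since the desired result is five elements per string, it may be worth checking that count and stacking the pieces into a matrix. The following is only a sketch: it uses base R's regmatches() and gregexpr() in place of str_extract_all() (for this pattern they should return the same pieces), works on a three-line subset of the sample data, and the names vals, m and num are made up for illustration.

```r
# Three of the sample lines from the question (subset, for brevity)
str <- c(
  "cash dividends per share $ - $ - $ - $ 0.08 $ 0.16 cash",
  "cash dividends per share $0.12 $ - $0.1 $ - $ - comes",
  "cash dividends per share...0.123 0.321 - - 0.12 blu")

# Same pattern as above: a number, or a lone hyphen
rgxp <- "([0-9]+\\.?[0-9]*)|(-)"

# Base R counterpart of str_extract_all(): a list with one character vector per line
vals <- regmatches(str, gregexpr(rgxp, str))

# Every line is expected to yield exactly five fields
stopifnot(all(lengths(vals) == 5))

# Stack into a character matrix, one row per input line
m <- do.call(rbind, vals)

# Numeric version, with hyphens mapped to NA
num <- matrix(as.numeric(ifelse(m == "-", NA, m)), nrow = nrow(m))
print(num)
```

Note that the hyphen branch of the pattern would also pick up hyphens inside words, so this relies on the label text containing none, as in the sample lines.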

To extract only the numbers:

rgxp <- "[0-9]+\\.?[0-9]*"
str_extract_all(str, rgxp)

[[1]]
[1] "0.08" "0.16"

[[2]]
[1] "0.01"  "12.10" "0.01"  "0.08"  "0.16" 

[[3]]
[1] "0.91" "0.16"

[[4]]
[1] "0.12" "0.16"

[[5]]
[1] "0.16"

[[6]]
[1] "0.12" "0.1" 

[[7]]
[1] "0.12" "0.12"

[[8]]
[1] "0.12" "0.12"

[[9]]
[1] "0.123" "0.321" "0.12" 

[[10]]
[1] "0.12" "0.12"

[[11]]
[1] "0.42" "0.42"
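
As for why the pattern in the question matches nothing, here is one reading of the pattern as posted (an assumption on my part, since the original may have been altered in transit). Two details look responsible. First, [declared]* is a character class, not an optional word, and the literal spaces on both sides of it mean the pattern needs two spaces between "dividends" and "per" when "declared" is absent. Second, [-0.9\\.] contains the three literal characters 0, . and 9 rather than the range 0-9, so a value such as 0.08 cannot match. A small base R check of both points:

```r
x <- "cash dividends per share $ - $ - $ - $ 0.08 $ 0.16 cash"

# Character class plus a space on each side: requires two spaces here, so no match
prefix_as_posted <- grepl("cash dividends [declared]* per share", x)

# Optional group that carries its own leading space
prefix_grouped <- grepl("cash dividends( declared)? per share", x)

# "0.9" inside brackets is three literal characters, not a range
class_as_posted <- grepl("^[-0.9.]+$", "0.08")

# "0-9" is the digit range
class_with_range <- grepl("^[-0-9.]+$", "0.08")

print(c(prefix_as_posted, prefix_grouped, class_as_posted, class_with_range))
```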
+2

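Use gsub to replace everything that is not a hyphen, digit or dot (as well as any run of two or more dots) with a space, then read the result with read.table. With na.strings = "-", the hyphens are read as NA. No packages are needed.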

DF <- read.table(text = gsub("[^-0-9.]+|\\.{2,}", " ", str), fill = TRUE, na.strings = "-")

This gives the following data frame:

> DF
      V1     V2   V3   V4   V5
1     NA     NA   NA 0.08 0.16
2  0.010 12.100 0.01 0.08 0.16
3     NA     NA 0.91   NA 0.16
4     NA     NA 0.12   NA 0.16
5     NA     NA   NA   NA 0.16
6  0.120     NA 0.10   NA   NA
7  0.120     NA 0.12   NA   NA
8  0.120     NA 0.12   NA   NA
9  0.123  0.321   NA   NA 0.12
10 0.120     NA 0.12   NA   NA
11 0.420  0.420   NA   NA   NA

If you would rather have 0 than NA, add this: DF[is.na(DF)] <- 0.
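
To see what read.table actually receives, the substitution can be inspected on a single line. This is a minimal sketch using one sample string from the question; the names line, cleaned, fields and parsed are only for illustration.

```r
line <- "cash dividends per share...0.123 0.321 - - 0.12 blu"

# Replace each run of characters that are not '-', a digit or '.',
# and each run of two or more dots, with a single space
cleaned <- gsub("[^-0-9.]+|\\.{2,}", " ", line)

# Split on spaces to see the fields that read.table will parse
fields <- strsplit(trimws(cleaned), " +")[[1]]
print(fields)

# With na.strings = "-", the hyphen fields come back as NA
parsed <- read.table(text = cleaned, na.strings = "-")
print(parsed)
```

The \\.{2,} alternative matters for lines such as "per share...0.123": without it the leader dots would stay attached to the first number.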

+3

Source: https://habr.com/ru/post/1619707/

