Using regex and tidyr in R to split a column variable at the first instance of a match

I am trying to split a column in an R data frame whose values contain more than one space, but I want to split only at the first space. Example data frame:

df <- data.frame(game = c(1, 2, 3, 4, 5, 6), date = c("Monday Apr 3", "Tuesday Apr 4", "Wednesday Apr 5", "Thursday Apr 6", "Friday Apr 7", "Saturday Apr 8"))

I am trying to use tidyr to split the 'date' column of df only at the first space, so that the day ends up in its own column:

  game       day date
1    1    Monday  Apr 3
2    2   Tuesday  Apr 4
3    3 Wednesday  Apr 5
4    4  Thursday  Apr 6
5    5    Friday  Apr 7
6    6  Saturday  Apr 8

That is the problem. Below is what I tried and what goes wrong.

In the tidyr documentation, the default value of “sep” is “a regular expression that matches any sequence of non-alphanumeric values”. Therefore, if I just do:

df %>% separate(date, c("day", "date"))

then it splits at both spaces (dropping the trailing day number, for example "3"). Result:

  game       day date
1    1    Monday  Apr
2    2   Tuesday  Apr
3    3 Wednesday  Apr
4    4  Thursday  Apr
5    5    Friday  Apr
6    6  Saturday  Apr
Warning message:
Too many values at 6 locations: 1, 2, 3, 4, 5, 6 

So I tried my own regex (which worked when I tested it in Sublime Text):

df %>% separate(date, c("day", "date"), sep='^[^\\s]*\\K\\s')

But that gives:

  game             day date
1    1    Monday Apr 3 <NA>
2    2   Tuesday Apr 4 <NA>
3    3 Wednesday Apr 5 <NA>
4    4  Thursday Apr 6 <NA>
5    5    Friday Apr 7 <NA>
6    6  Saturday Apr 8 <NA>
Warning message:
Too few values at 6 locations: 1, 2, 3, 4, 5, 6 

So, what is going wrong? Is it my regex? Is it tidyr?


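You can use the extra argument with "merge", so that only as many splits are made as there are new columns and the remainder stays together: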

library(tidyr)
df %>% separate(date, c("day", "date"), extra = "merge")

#  game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8
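
To see why the default separator drops the day number, the splitting can be mimicked in base R. This is only a sketch: "[^[:alnum:]]+" is the documented default for sep, the variable names are made up for illustration, and rejoining the leftover pieces with a space is an approximation of extra = "merge" that happens to be exact for this data.

```r
x <- c("Monday Apr 3", "Tuesday Apr 4")

# Default sep: every run of non-alphanumeric characters is a split point,
# so each value yields three pieces, one more than the two target columns.
pieces <- strsplit(x, "[^[:alnum:]]+")
n_pieces <- lengths(pieces)

# Rough imitation of extra = "merge": keep the first piece as the day
# and glue everything after it back together.
day  <- vapply(pieces, function(p) p[1], character(1))
rest <- vapply(pieces, function(p) paste(p[-1], collapse = " "), character(1))
```

With only two target columns and three pieces per row, the third piece has nowhere to go, which is what the "Too many values" warning in the question reports.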

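See Psidom's answer for the solution. As for why your regex fails: it appears that \\K is not supported by stringi, which separate uses for splitting. You can check this by running stringi::stri_split_regex(df$date, '^[^\\s]*\\K\\s').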
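
For comparison, the same pattern does behave as intended under base R's PCRE engine (perl = TRUE), which suggests that the pattern itself is sound and the regex engine is the limiting factor. A minimal sketch, with variable names chosen for illustration:

```r
x <- c("Monday Apr 3", "Wednesday Apr 5")

# \\K resets the start of the reported match, so only the first space
# is returned as the match, not the word in front of it.
m <- regexpr("^[^\\s]*\\K\\s", x, perl = TRUE)
pos <- as.integer(m)

# invert = TRUE returns the text on either side of that single match.
parts <- regmatches(x, m, invert = TRUE)
```

Here parts is a list with one two-element character vector per input value: the day, then the rest of the string.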

You can also specify sep so that it matches only the first space:

# a space not followed by a digit
df %>% separate(date, c("day", "date"), sep = "\\s(?!\\d)")
#  game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8
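
Note that "a space not followed by a digit" depends on the layout of the data. The sketch below uses base R's strsplit with perl = TRUE to show the pattern on the original layout and on a hypothetical day-first layout ("Monday 3 Apr", not part of the question's data), where it would split in the wrong place:

```r
pattern <- "\\s(?!\\d)"

# Original layout: only the first space is followed by a non-digit.
ok <- strsplit("Monday Apr 3", pattern, perl = TRUE)[[1]]

# Hypothetical layout with the day number first: the first space is
# followed by a digit, so the split lands on the second space instead.
shifted <- strsplit("Monday 3 Apr", pattern, perl = TRUE)[[1]]
```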

Some other patterns that can stand in for \\K, using lookbehind or lookahead instead:

# a space preceded by 3 - 6 characters and "day". 
# 3 - 6 characters allows "Monday" and "Wednesday"
"(?<=.{3,6}day)\\s"
# same idea
"(?<=\\S{3,6}day)\\s"
# same idea
"(?<=.?.?.?...day)\\s"
# same idea, but using ^ to anchor and not using "day"
"(?<=^\\S{0,9})\\s"
# space followed by some other characters, a space, digit(s) and the end of the line
"\\s(?=.+\\s\\d+$)"

Alternatively, using base R:

cbind(df[1], read.csv(text=sub("\\s+", ",", df$date),
             header=FALSE, col.names = c("day", "date")))
#  game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8
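
The base R approach relies on sub replacing only the first match, whereas gsub replaces every match. A small sketch of the difference (variable names are illustrative); it also assumes the values contain no commas or quotes of their own, since those would confuse read.csv:

```r
x <- c("Monday Apr 3", "Tuesday Apr 4")

# sub: only the first run of whitespace becomes a comma.
first_only <- sub("\\s+", ",", x)

# gsub: every run of whitespace becomes a comma, giving three fields.
every_one <- gsub("\\s+", ",", x)
```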

Or with extract from tidyr, which uses capture groups instead of a separator:

library(tidyr)
extract(df, date, into = c("day", "date"), "(\\S+)\\s+(.*)")
#   game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8
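
The two capture groups can be inspected without tidyr by using regexec and regmatches from base R. This is a sketch for checking the pattern, not a description of how extract works internally:

```r
x <- c("Monday Apr 3", "Wednesday Apr 5")

# Each list element holds the full match followed by the two groups:
# (\\S+) is the first word, (.*) is everything after the whitespace.
m <- regmatches(x, regexec("(\\S+)\\s+(.*)", x))

# One row per input value; column 1 is the full match.
parts <- do.call(rbind, m)
day  <- parts[, 2]
rest <- parts[, 3]
```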

Source: https://habr.com/ru/post/1665803/

