Splitting a line of text into columns of a data frame

I have a dataframe with lines of text that look like this:

         ANTALYA (GB) ch. 1960
    SHOOTIN WAR (USA) ch. 1998
    LORD AT WAR (ARG) ch. 1980

The name is in all caps, followed by the location in parentheses, a color abbreviation, and the year. Names can be several words long. I want to split this single block of text into its components: name, location, color, year. I have struggled with this for several days, and the best working solution I have is to put each word in a separate column, but that only works if every name has the same number of words. I can use the data in this form, but it just doesn't look nice.

sepdf <- df %>%
  separate(pedigree, into = c("Name1", "Name2", "Loc", "Col", "Year"),
           sep = " ", extra = "merge")

I also tried to keep just the name by using "(" as the separator between two columns, but R doesn't seem to accept a parenthesis as a separator.

Any suggestions would be very valuable.

1 answer

For more complex pattern matching like this, you can use the tidyr function extract, which lets you define regex capture groups. Each group is enclosed in a pair of parentheses, ( ):

library(tidyr)
extract(df, pedigree, into = c("Name", "Loc", "Col", "Year"), 
           regex = "^([A-Z ]+) \\((.*)\\) ([a-z]+\\.) (\\d+)$")
         Name Loc Col Year
1     ANTALYA  GB ch. 1960
2 SHOOTIN WAR USA ch. 1998
3 LORD AT WAR ARG ch. 1980

The regex that I used here:

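  • ^ beginning of the line
  • ([A-Z ]+) the first group: one or more uppercase letters and spaces
  • \\( preceded by a space: a literal opening parenthesis (escaped with \\)
  • (.*) the second group: any characters
  • \\) a literal closing parenthesis (escaped with \\), followed by a space
  • ([a-z]+\\.) the third group: one or more lowercase letters followed by a literal dot (escaped with \\)
  • (\\d+) preceded by a space: the fourth group, one or more digits
  • $ end of the line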
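
To see how the capture groups map onto columns, here is a minimal sketch in base R that applies the same pattern without tidyr. The vector name pedigree and the leading spaces on the first line are assumptions based on the sample in the question; trimws() is added so that any leading spaces do not end up in the name, and the year is converted to an integer.

```r
# Sample lines from the question; the first one has leading spaces on purpose.
pedigree <- c("     ANTALYA (GB) ch. 1960",
              "SHOOTIN WAR (USA) ch. 1998",
              "LORD AT WAR (ARG) ch. 1980")

# Same pattern as in the extract() call.
pattern <- "^([A-Z ]+) \\((.*)\\) ([a-z]+\\.) (\\d+)$"

# Strip surrounding whitespace so it is not captured by [A-Z ]+.
clean <- trimws(pedigree)

# regexec() returns the full match plus one entry per capture group.
matches <- regmatches(clean, regexec(pattern, clean))

# Drop the full match (first element) and stack the four groups row by row.
# Note: this assumes every line matches the pattern.
parts <- do.call(rbind, lapply(matches, function(m) m[-1]))

result <- data.frame(Name = parts[, 1],
                     Loc  = parts[, 2],
                     Col  = parts[, 3],
                     Year = as.integer(parts[, 4]),
                     stringsAsFactors = FALSE)
print(result)
```

With tidyr itself, passing convert = TRUE to extract should give a similar integer Year column.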

Source: https://habr.com/ru/post/1628742/

