Splitting a line of text into columns of a data frame

Question

I have a dataframe with lines of text that look like this:

    ANTALYA (GB) ch. 1960
    SHOOTIN WAR (USA) ch. 1998
    LORD AT WAR (ARG) ch. 1980

The name is in all caps, followed by the location in parentheses, a color abbreviation, and the year. Names can be several words long. I want to split this single block of text into its components: name, location, color, year. I have struggled with this for several days, and the best working solution I have is to put each word into its own column, but that only works if every name has the same number of words. I can use the data in that form, but it just isn't pretty.

sepdf <- df %>% 
           separate(pedigree, into=c("Name1", "Name2", "Loc", "Col", "Year"), 
                    sep=" ", extra="merge")

I also tried to keep the name in one piece by using "(" as the separator between two columns, but R does not seem to accept a parenthesis as a separator.

Any suggestions would be very valuable.

r parsing dataframe tidyr

Kelli humbird Feb 14 '16 at 20:49

1 answer

docendo discimus · Accepted Answer · 2016-02-14T20:58:59+0000

For more complex pattern matching like yours, you can use the tidyr function extract, which lets you define regex capture groups. Each group is enclosed in a pair of parentheses, ( ):

library(tidyr)
extract(df, pedigree, into = c("Name", "Loc", "Col", "Year"), 
           regex = "^([A-Z ]+) \\((.*)\\) ([a-z]+\\.) (\\d+)$")
         Name Loc Col Year
1     ANTALYA  GB ch. 1960
2 SHOOTIN WAR USA ch. 1998
3 LORD AT WAR ARG ch. 1980
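
To try the call above end to end, here is a minimal sketch. The data frame is not shown in the question, so the `df` built here is an assumption (three rows in a character column named `pedigree`). The `convert = TRUE` argument is an optional addition that asks tidyr to run type conversion on the new columns, so that Year should come back as an integer instead of a string. The function is called as `tidyr::extract` in case another loaded package masks `extract`.

```r
library(tidyr)

# Assumed sample data; the column name `pedigree` follows the question.
df <- data.frame(
  pedigree = c("ANTALYA (GB) ch. 1960",
               "SHOOTIN WAR (USA) ch. 1998",
               "LORD AT WAR (ARG) ch. 1980"),
  stringsAsFactors = FALSE
)

# Same regex as above; convert = TRUE applies type conversion
# to the extracted columns (Year becomes an integer column).
out <- tidyr::extract(df, pedigree,
                      into = c("Name", "Loc", "Col", "Year"),
                      regex = "^([A-Z ]+) \\((.*)\\) ([a-z]+\\.) (\\d+)$",
                      convert = TRUE)
print(out)
```

Rows that do not match the pattern are returned as NA in all four columns, so it is worth checking for NA values after running this on the full data.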

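The regex that I used here:

^ is the beginning of the line
([A-Z ]+) is the first group: one or more uppercase letters and spaces
 \\( then there is a space and an opening parenthesis (escaped with \\)
(.*) is the second group: any characters between the parentheses
\\) then a closing parenthesis (escaped with \\), followed by a space
([a-z]+\\.) is the third group: one or more lowercase letters followed by a literal dot (escaped with \\), then another space
(\\d+) is the fourth group: one or more digits
$ is the end of the line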
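
To see what each capture group picks up without involving a data frame, the same pattern can be tried on a single string in base R. This is only a sketch for checking the regex; the variable names `pattern`, `x`, `m` and `parts` are made up for illustration.

```r
# Same pattern as in the extract() call, written as an R string
# (each regex backslash is doubled inside the string literal).
pattern <- "^([A-Z ]+) \\((.*)\\) ([a-z]+\\.) (\\d+)$"

x <- "SHOOTIN WAR (USA) ch. 1998"

# regexec() locates the full match and every capture group;
# regmatches() then pulls out the matching substrings.
m <- regmatches(x, regexec(pattern, x))[[1]]

# m[1] is the whole match, m[2] to m[5] are the four groups.
parts <- setNames(m[-1], c("Name", "Loc", "Col", "Year"))
print(parts)
```

The doubled backslashes are also why a bare "(" fails as a separator: the separator is treated as a regular expression, where an unescaped parenthesis opens a group, so it has to be written as "\\(".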
