R: split a semicolon-delimited column into rows

Question

I am using RStudio 2.15.0 and have created an object from Excel using XLConnect with 3000+ rows and 12 columns. I am trying to split a semicolon-delimited column into rows, but I don't know whether this is possible or how to do it. The example data below shows the three columns concerned. Any help on this would be great.

Below is the code that works for 2 columns.

    v1 <- with(df, tapply(PolId, Description, FUN = function(x) {
        x1 <- paste(x, collapse = ";")
        gsub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*);', '', x1, perl = TRUE)
    }))
    library(stringr)
    Description <- rep(names(v1), str_count(v1, '\\w+'))
    PolId <- scan(text = gsub(';+', ' ', v1), what = '', quiet = TRUE)
    data.frame(PolId, Description)

Example data

    PolId                  Description  Document.Type
    ABC123;ABC456;ABC789;  TEST1        Pol1
    ABC123;ABC456;ABC789;  TEST1        Pol1
    ABC123;ABC456;ABC789;  TEST1        Pol1
    AAA123;                TEST1        End1
    AAA123;                TEST2        End2
    ABB123;ABC123;         TEST3        End1
    ABB123;ABC123;         TEST3        End1

I want the result to look like this (with duplicate PolId rows removed):

    PolId   Description  Document.Type
    ABC123  TEST1        Pol1
    ABC456  TEST1        Pol1
    ABC789  TEST1        Pol1
    AAA123  TEST1        End1
    AAA123  TEST2        End2
    ABB123  TEST3        End1
    ABC123  TEST3        End1

+4

split r delimiter

New2Programming Feb 25 '15 at 12:30

3 answers

Here is a base R solution. Split the PolId field using strsplit, and bind each resulting vector of pieces to the corresponding Description. This gives a list of matrices that we rbind together. Finally, set the column names.

    out <- do.call(rbind, Map(cbind, strsplit(DF$PolId, ";"), DF$Description))
    colnames(out) <- colnames(DF)

giving:

    > out
          PolId    Description
     [1,] "ABC123" "TEST1"
     [2,] "ABC456" "TEST1"
     [3,] "ABC789" "TEST1"
     [4,] "ABC123" "TEST1"
     [5,] "ABC456" "TEST1"
     [6,] "ABC789" "TEST1"
     [7,] "ABC123" "TEST1"
     [8,] "ABC456" "TEST1"
     [9,] "ABC789" "TEST1"
    [10,] "AAA123" "TEST1"
    [11,] "AAA123" "TEST2"
    [12,] "ABB123" "TEST3"
    [13,] "ABC123" "TEST3"
    [14,] "ABB123" "TEST3"
    [15,] "ABC123" "TEST3"

Note: We used this as input:

    DF <- structure(list(
        PolId = c("ABC123;ABC456;ABC789;", "ABC123;ABC456;ABC789;",
                  "ABC123;ABC456;ABC789;", "AAA123;", "AAA123;",
                  "ABB123;ABC123;", "ABB123;ABC123;"),
        Description = c("TEST1", "TEST1", "TEST1", "TEST1", "TEST2",
                        "TEST3", "TEST3")),
        .Names = c("PolId", "Description"),
        class = "data.frame", row.names = c(NA, -7L))
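The output above still contains the repeated rows and covers only two columns. As a minimal sketch of how the same strsplit idea might extend to the three-column data in the question with duplicates dropped, the following uses only base R. The names `DF3`, `parts`, `long`, and `res` are made up for this illustration, and the input is retyped from the question's example rather than read from the Excel file.

```r
# Example input retyped from the question (three columns, character data).
DF3 <- data.frame(
  PolId = c("ABC123;ABC456;ABC789;", "ABC123;ABC456;ABC789;",
            "ABC123;ABC456;ABC789;", "AAA123;", "AAA123;",
            "ABB123;ABC123;", "ABB123;ABC123;"),
  Description = c("TEST1", "TEST1", "TEST1", "TEST1", "TEST2", "TEST3", "TEST3"),
  Document.Type = c("Pol1", "Pol1", "Pol1", "End1", "End2", "End1", "End1"),
  stringsAsFactors = FALSE
)

# Split each PolId string on ";". strsplit() does not return an empty
# piece for the trailing separator.
parts <- strsplit(DF3$PolId, ";", fixed = TRUE)

# Repeat every original row once per piece, then overwrite PolId
# with the individual pieces.
long <- DF3[rep(seq_len(nrow(DF3)), sapply(parts, length)), ]
long$PolId <- unlist(parts)

# Drop duplicated rows and reset the row names.
res <- unique(long)
rownames(res) <- NULL
res
```

Because every other column is carried along by row index, this sketch should not need changing when more columns are present; only the column being split is named explicitly.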

+7

G. grothendieck Feb 25 '15 at 12:46

Here's a possible quick data.table solution:

    library(data.table)
    unique(setDT(df)[, .(PolId = unlist(strsplit(as.character(PolId), ";"))), by = Description])
    #    Description  PolId
    # 1:       TEST1 ABC123
    # 2:       TEST1 ABC456
    # 3:       TEST1 ABC789
    # 4:       TEST1 AAA123
    # 5:       TEST2 AAA123
    # 6:       TEST3 ABB123
    # 7:       TEST3 ABC123

Per your edit, here is another option (if you have more than two columns):

    library(splitstackshape)
    unique(cSplit(df, "PolId", ";", "long"))
    #     PolId Description Document.Type
    # 1: ABC123       TEST1          Pol1
    # 2: ABC456       TEST1          Pol1
    # 3: ABC789       TEST1          Pol1
    # 4: AAA123       TEST1          End1
    # 5: AAA123       TEST2          End2
    # 6: ABB123       TEST3          End1
    # 7: ABC123       TEST3          End1

+5

David Arenburg Feb 25 '15 at 12:34

akrun · Accepted Answer · 2015-02-25T12:37:06+0000

You can try unnest from tidyr after splitting the "PolId" column, then keep the unique rows:

    library(dplyr)
    library(tidyr)
    unnest(setNames(strsplit(df$PolId, ';'), df$Description), Description) %>%
        unique()

Or use base R with stack/strsplit/duplicated. Split "PolId" on the separator (;) with strsplit, name the elements of the resulting list with the Description column, stack the list to get a data.frame, and use duplicated to remove the duplicate rows.

    df1 <- stack(setNames(strsplit(df$PolId, ';'), df$Description))
    setNames(df1[!duplicated(df1),], names(df))
    #    PolId Description
    #1  ABC123       TEST1
    #2  ABC456       TEST1
    #3  ABC789       TEST1
    #10 AAA123       TEST1
    #11 AAA123       TEST2
    #12 ABB123       TEST3
    #13 ABC123       TEST3
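To make the intermediate steps of the stack approach easier to follow, here is a small sketch on two toy values. The inputs and the names `pieces`, `named`, and `stacked` are illustrative only and not part of the original answer.

```r
# Two toy PolId strings, each ending in a trailing ";".
pieces <- strsplit(c("AAA123;", "ABB123;ABC123;"), ";")
# pieces is a list of character vectors; the trailing ";" does not
# produce an empty final element.

# Name each list element with its (illustrative) Description value.
named <- setNames(pieces, c("TEST2", "TEST3"))

# stack() turns the named list into a data.frame with the columns
# "values" (the pieces) and "ind" (the list names, repeated per piece).
stacked <- stack(named)
stacked
```

The columns come back as `values` and `ind`, which is why the answer renames them with `setNames(..., names(df))` afterwards.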

Or another option without using strsplit:

    v1 <- with(df, tapply(PolId, Description, FUN = function(x) {
        x1 <- paste(x, collapse = ";")
        gsub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*);', '', x1, perl = TRUE)
    }))
    library(stringr)
    Description <- rep(names(v1), str_count(v1, '\\w+'))
    PolId <- scan(text = gsub(';+', ' ', v1), what = '', quiet = TRUE)
    data.frame(PolId, Description)
    #   PolId Description
    #1 ABC123       TEST1
    #2 ABC456       TEST1
    #3 ABC789       TEST1
    #4 AAA123       TEST1
    #5 AAA123       TEST2
    #6 ABB123       TEST3
    #7 ABC123       TEST3
