Splitting delimited data in a data frame column into separate columns in R

I have a data frame that looks like this:

    A B C
    1 3 X1=7;X2=8;X3=9
    2 4 X1=10;X2=11;X3=12
    5 6 X1=13;X2=14

I would like to parse column C into individual columns, like this:

    A B X1 X2 X3
    1 3  7  8  9
    2 4 10 11 12
    5 6 13 14 NA

How can this be done in R?

4 answers

First, here is the sample data as a data.frame:

    dd <- data.frame(
      A = c(1L, 2L, 5L),
      B = c(3L, 4L, 6L),
      C = c("X1=7;X2=8;X3=9", "X1=10;X2=11;X3=12", "X1=13;X2=14"),
      stringsAsFactors = FALSE
    )

Next, I define a small helper function that takes vectors such as c("A=1","B=2") and turns them into named vectors such as c(A="1", B="2").

    namev <- function(x) {
      a <- strsplit(x, "=")
      setNames(sapply(a, '[', 2), sapply(a, '[', 1))
    }
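To see what the helper produces, here is a minimal sketch that applies it to the shortest row of the sample data. The variable name `kv` is only illustrative. The last line shows the behaviour the later steps depend on: indexing a named vector by a name it does not contain yields NA.

```r
# Helper from the answer: split "name=value" strings into a named character vector
namev <- function(x) {
  a <- strsplit(x, "=")
  setNames(sapply(a, `[`, 2), sapply(a, `[`, 1))
}

# `kv` is an illustrative name, not part of the original answer
kv <- namev(c("X1=13", "X2=14"))
kv

# Asking for a name that is absent (X3) returns NA in that position
kv[c("X1", "X2", "X3")]
```

The values are still character strings at this point, which is why the answer converts the final matrix to numeric as a separate step.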

Now I perform the transformations:

    # turn each row into a named vector
    vv <- lapply(strsplit(dd$C, ";"), namev)
    # find list of all column names
    nm <- unique(unlist(sapply(vv, names)))
    # extract data from all rows for every column
    nv <- do.call(rbind, lapply(vv, '[', nm))
    # convert everything to numeric (optional)
    class(nv) <- "numeric"
    # rejoin with original data
    cbind(dd[, -3], nv)

which gives you

      A B X1 X2 X3
    1 1 3  7  8  9
    2 2 4 10 11 12
    3 5 6 13 14 NA

My cSplit function makes problems like this fun. Here it is in action:

    ## Load some packages
    library(data.table)
    library(devtools) ## Just for source_gist, really
    library(reshape2)
    ## Load `cSplit`
    source_gist("https://gist.github.com/mrdwab/11380733")

First, split the values and create a "long" data set:

    ddL <- cSplit(cSplit(dd, "C", ";", "long"), "C", "=")
    ddL
    #    A B C_1 C_2
    # 1: 1 3  X1   7
    # 2: 1 3  X2   8
    # 3: 1 3  X3   9
    # 4: 2 4  X1  10
    # 5: 2 4  X2  11
    # 6: 2 4  X3  12
    # 7: 5 6  X1  13
    # 8: 5 6  X2  14

Then use dcast.data.table (or just dcast) to go from "long" to "wide":

    dcast.data.table(ddL, A + B ~ C_1, value.var = "C_2")
    #    A B X1 X2 X3
    # 1: 1 3  7  8  9
    # 2: 2 4 10 11 12
    # 3: 5 6 13 14 NA
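For readers who cannot source the gist, the same long-then-wide idea can be sketched in base R. This is not the cSplit implementation, only an approximation of its two steps. The names `pairs`, `row_id`, `long`, `wide_vals` and `wide` are hypothetical, and the sketch assumes each key appears at most once per row.

```r
dd <- data.frame(
  A = c(1L, 2L, 5L),
  B = c(3L, 4L, 6L),
  C = c("X1=7;X2=8;X3=9", "X1=10;X2=11;X3=12", "X1=13;X2=14"),
  stringsAsFactors = FALSE
)

# Long form: one row per "name=value" pair, remembering the source row
pairs <- strsplit(dd$C, ";", fixed = TRUE)
row_id <- rep(seq_along(pairs), lengths(pairs))
kv <- do.call(rbind, strsplit(unlist(pairs), "=", fixed = TRUE))
long <- data.frame(dd[row_id, c("A", "B")],
                   key = kv[, 1],
                   value = as.numeric(kv[, 2]),
                   row.names = NULL)

# Wide form: tapply() leaves NA where a row/key combination does not occur
# (sum is safe here only because each key occurs at most once per row)
wide_vals <- tapply(long$value, list(row_id, long$key), sum)
wide <- cbind(dd[c("A", "B")], wide_vals)
wide
```

Grouping by `row_id` rather than by A and B keeps rows apart even if two rows happen to share the same A and B values.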

Here is one possible approach:

    dat <- read.table(text="A B C
    1 3 X1=7;X2=8;X3=9
    2 4 X1=10;X2=11;X3=12
    5 6 X1=13;X2=14", header=TRUE, stringsAsFactors = FALSE)

    library(qdapTools)

    dat_C <- strsplit(dat$C, ";")
    dat_C2 <- sapply(dat_C, function(x) {
        y <- strsplit(x, "=")
        rep(sapply(y, "[", 1), as.numeric(sapply(y, "[", 2)))
    })
    data.frame(dat[, -3], mtabulate(dat_C2))

    ##   A B X1 X2 X3
    ## 1 1 3  7  8  9
    ## 2 2 4 10 11 12
    ## 3 5 6 13 14  0

EDIT: To get NA values instead of 0:

    m <- mtabulate(dat_C2)
    m[m==0] <- NA
    data.frame(dat[, -3], m)
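The trick in this answer is to repeat each key as many times as its value and then count occurrences, so that the counts reproduce the values. The sketch below illustrates that idea for the third row, using base R's table() as a stand-in for mtabulate (an assumption made purely for illustration). The names `keys`, `vals`, `repeated` and `counts` are hypothetical.

```r
# Third row of the sample data, split into keys and numeric values
pairs <- strsplit("X1=13;X2=14", ";", fixed = TRUE)[[1]]
kv <- strsplit(pairs, "=", fixed = TRUE)
keys <- sapply(kv, `[`, 1)
vals <- as.numeric(sapply(kv, `[`, 2))

# Repeat each key `value` times, then count how often each key occurs
repeated <- rep(keys, vals)
counts <- table(factor(repeated, levels = c("X1", "X2", "X3")))
counts
```

A key that never occurs is counted as 0, which is why the first result shows 0 rather than NA. It also suggests the approach only suits non-negative whole-number values, and that a genuine 0 in the data cannot be told apart from a missing key once zeros are replaced with NA.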

Here's a good, somewhat hacky way to get you there.

    ## read your data
    dat <- read.table(header = TRUE, text = "A B C
    1 3 X1=7;X2=8;X3=9
    2 4 X1=10;X2=11;X3=12
    5 6 X1=13;X2=14", stringsAsFactors = FALSE)
    ## ---
    s <- strsplit(dat$C, ";|=")
    xx <- unique(unlist(s)[grepl('[A-Z]', unlist(s))])
    sap <- t(sapply(seq(s), function(i) {
        wh <- which(!xx %in% s[[i]])
        n <- suppressWarnings(as.numeric(s[[i]]))
        nn <- n[!is.na(n)]
        if (length(wh)) {
            append(nn, NA, wh - 1)
        } else {
            nn
        }
    }))  ## see below for explanation
    data.frame(dat[1:2], sap)
    #   A B X1 X2 X3
    # 1 1 3  7  8  9
    # 2 2 4 10 11 12
    # 3 5 6 13 14 NA

Basically, this is what happens in sap:

  1. check which keys are missing from the row
  2. convert each element of list s to numeric
  3. remove the NA values produced in (2)
  4. insert NA at the correct position using append
  5. transpose the result
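The steps above can be traced for a single row. This is a minimal sketch using the third row of the sample data. `s_i` and `filled` are illustrative names, and `xx` is written out by hand rather than derived from the data.

```r
xx <- c("X1", "X2", "X3")            # all keys seen in the data
s_i <- c("X1", "13", "X2", "14")     # third row after splitting on ";" and "="

wh <- which(!xx %in% s_i)                # position of the missing key (X3 is 3rd)
n <- suppressWarnings(as.numeric(s_i))   # keys become NA, values become numbers
nn <- n[!is.na(n)]                       # keep only the values
filled <- append(nn, NA, after = wh - 1) # put NA where the missing key belongs
filled
```

As far as I can tell, this relies on at most one key being missing per row, since append() takes a single position in its `after` argument. Rows with several missing keys would need a different fill strategy, such as indexing a named vector by the full set of keys.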

Source: https://habr.com/ru/post/970650/

