Turning the unique values in a column into variable names and dummy coding them in R

I am having a problem dummy coding the following dataset.

Sample data, for example a data frame called mydata:

ID | NAMES
-- | --------------
1  | 4444, 333, 456
2  | 333
3  | 456, 765
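For anyone who wants to reproduce this, the sample data can be rebuilt roughly as follows. This is only a sketch: the column types (integer ID, character NAMES) are assumptions, since only the printed table is given.

```r
# Rebuild the sample data shown above.
# Assumption: NAMES is stored as character rather than factor.
mydata <- data.frame(
  ID = 1:3,
  NAMES = c("4444, 333, 456", "333", "456, 765"),
  stringsAsFactors = FALSE
)
```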

I would like to turn each unique value in NAMES into its own column and code whether each row contains that value, i.e. 1 or 0.

Required output:

ID | NAMES          | 4444 | 333 | 456 | 765
-- | -------------- | ---- | --- | --- | ---
1  | 4444, 333, 456 | 1    | 1   | 1   | 0
2  | 333            | 0    | 1   | 0   | 0
3  | 456, 765       | 0    | 0   | 1   | 1

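What I have done so far is create a vector of the unique values:

    library(stringr)   # str_split(), str_trim()
    library(reshape2)  # dcast() (assumed source; not stated in the original)

    # Split each NAMES entry on commas, then trim and de-duplicate the pieces
    split <- str_split(string = mydata$NAMES, pattern = ",")
    vec <- unique(str_trim(unlist(split)))

    # Drop empty strings
    remove <- ""
    vec <- as.data.frame(vec[!vec %in% remove])
    colnames(vec) <- "var"
    vecRef <- as.vector(vec$var)

    # Cast to a single row with one column per unique value
    namesCast <- dcast(data = vec, formula = . ~ var)
    namesCast <- namesCast[, 2:ncol(namesCast)]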

This gives a vector of the unique NAMES values with surrounding spaces and empty entries removed. From there, I have no idea how to do the mapping / dummy coding, so any help would be greatly appreciated!

1 answer

You can use cSplit_e from my splitstackshape package, for example:

    library(splitstackshape)
    cSplit_e(mydata, "NAMES", sep = ",", type = "character", fill = 0)
    #   ID          NAMES NAMES_333 NAMES_4444 NAMES_456 NAMES_765
    # 1  1 4444, 333, 456         1          1         1         0
    # 2  2            333         1          0         0         0
    # 3  3       456, 765         0          0         1         1

If you want to see the underlying function that is called when these arguments are used, look at splitstackshape:::charMat, which takes the list generated by strsplit and creates a matrix from it.

Calling the function directly will give you something like this:

    # Split on commas, strip leading/trailing whitespace, then build the matrix
    splitstackshape:::charMat(
      lapply(strsplit(as.character(mydata$NAMES), ","),
             function(x) gsub("^\\s+|\\s+$", "", x)))
    #      333 4444 456 765
    # [1,]   1    1   1  NA
    # [2,]   1   NA  NA  NA
    # [3,]  NA   NA   1   1
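If you would rather not depend on a package, the same idea can be sketched in base R. This is not how charMat is implemented internally, just a minimal illustration of the split-then-match approach. The names parts, lev, dummies and result are made up for this example, and mydata is rebuilt here on the assumption that NAMES is a character column.

```r
# Sample data from the question, rebuilt so the sketch is self-contained
# (assumption: NAMES is character, not factor)
mydata <- data.frame(
  ID = 1:3,
  NAMES = c("4444, 333, 456", "333", "456, 765"),
  stringsAsFactors = FALSE
)

# Split each row on commas and trim surrounding whitespace
parts <- lapply(strsplit(as.character(mydata$NAMES), ","), trimws)

# Unique non-empty values, in order of first appearance
lev <- unique(unlist(parts))
lev <- lev[lev != ""]

# One row per record, one column per unique value: 1 if present, 0 if not
dummies <- do.call(rbind, lapply(parts, function(x) as.integer(lev %in% x)))
colnames(dummies) <- lev

# cbind() on a data frame keeps the non-syntactic column names as they are
result <- cbind(mydata, dummies)
result
#   ID          NAMES 4444 333 456 765
# 1  1 4444, 333, 456    1   1   1   0
# 2  2            333    0   1   0   0
# 3  3       456, 765    0   0   1   1
```

Unlike cSplit_e, this keeps the columns in order of first appearance and without the NAMES_ prefix, which matches the layout requested in the question.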

Source: https://habr.com/ru/post/1208258/

