R: Converting column values ​​to their own binary coded columns (Dummy Variables)

I have several CSV files with columns such as gender, age, diagnosis, etc.

They are currently encoded as such:

ID, gender, age, diagnosis 1, male, 42, asthma 1, male, 42, anxiety 2, male, 19, asthma 3, female, 23, diabetes 4, female, 61, diabetes 4, female, 61, copd 

The goal is to convert this data to this target format :

Sidenote: if possible, it would be great to also add the original column names to the new column names, for example. 'age_42' or 'gender_female.'

 ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0 2, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0 3, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0 4, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1 

I tried using the reshape2 dcast() function, but I get combinations that lead to extremely sparse matrices. Here's a simplified example with only age and gender:

 data.train <- dcast(data.raw, formula = id ~ gender + age, fun.aggregate = length) ID, male19, male23, male42, male61, female19, female23, female42, female61 1, 0, 0, 1, 0, 0, 0, 0, 0 2, 1, 0, 0, 0, 0, 0, 0, 0 3, 0, 0, 0, 0, 0, 1, 0, 0 4, 0, 0, 0, 0, 0, 0, 0, 1 

Having seen that this is a fairly common task in preparing machine learning data, I believe that there may be other libraries (which I do not know) that can perform this conversion.

+6
source share
5 answers

A base R will be

  (!!table(cbind(df1[1],stack(df1[-1])[-2])))*1L # values #ID 19 23 42 61 anxiety asthma copd diabetes female male # 1 0 0 1 0 1 1 0 0 0 1 # 2 1 0 0 0 0 1 0 0 0 1 # 3 0 1 0 0 0 0 0 1 1 0 # 4 0 0 0 1 0 0 1 1 1 0 

If you need also the original name

  (!!table(cbind(df1[1],Val=do.call(paste, c(stack(df1[-1])[2:1], sep="_")))))*1L # Val #ID age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma #1 0 0 1 0 1 1 #2 1 0 0 0 0 1 #3 0 1 0 0 0 0 #4 0 0 0 1 0 0 # Val #ID diagnosis_copd diagnosis_diabetes gender_female gender_male #1 0 0 0 1 #2 0 0 0 1 #3 0 1 1 0 #4 1 1 1 0 

data

 df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L), gender = c("male", "male", "male", "female", "female", "female"), age = c(42L, 42L, 19L, 23L, 61L, 61L), diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd")), .Names = c("ID", "gender", "age", "diagnosis"), row.names = c(NA, -6L), class = "data.frame") 
+4
source

You need a melt / dcast (called recast ) to convert all columns to a single column and avoid combinations

 library(reshape2) recast(df, ID ~ value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L) # ID 19 23 42 61 anxiety asthma copd diabetes female male # 1 1 0 0 1 0 1 1 0 0 0 1 # 2 2 1 0 0 0 0 1 0 0 0 1 # 3 3 0 1 0 0 0 0 0 1 1 0 # 4 4 0 0 0 1 0 0 1 1 1 0 

According to your Sidenote, you can add variable here to get the added names too

 recast(df, ID ~ variable + value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L) # ID gender_female gender_male age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma diagnosis_copd # 1 1 0 1 0 0 1 0 1 1 0 # 2 2 0 1 1 0 0 0 0 1 0 # 3 3 1 0 0 1 0 0 0 0 0 # 4 4 1 0 0 0 0 1 0 0 1 # diagnosis_diabetes # 1 0 # 2 0 # 3 1 # 4 1 
+8
source

The caret package has a "dummify" function.

 library(caret) library(dplyr) predict(dummyVars(~ ., data = mutate_each(df, funs(as.factor))), newdata = df) 
+5
source

Using reshape from the R base:

 d <- reshape(df, idvar="ID", timevar="diagnosis", direction="wide", v.names="diagnosis", sep="_") a <- reshape(df, idvar="ID", timevar="age", direction="wide", v.names="age", sep="_") g <- reshape(df, idvar="ID", timevar="gender", direction="wide", v.names="gender", sep="_") new.dat <- cbind(ID=d["ID"], g[,grepl("_", names(g))], a[,grepl("_", names(a))], d[,grepl("_", names(d))]) # convert factors columns to character (if necessary) # taken from @Marek answer here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231 new.dat[sapply(new.dat, is.factor)] <- lapply(new.dat[sapply(new.dat, is.factor)], as.character) new.dat[which(is.na(new.dat), arr.ind=TRUE)] <- 0 new.dat[-1][which(new.dat[-1] != 0, arr.ind=TRUE)] <- 1 # ID gender_male gender_female age_42 age_19 age_23 age_61 diagnosis_asthma #1 1 1 0 1 0 0 0 1 #3 2 1 0 0 1 0 0 1 #4 3 0 1 0 0 1 0 0 #5 4 0 1 0 0 0 1 0 # diagnosis_anxiety diagnosis_diabetes diagnosis_copd #1 1 0 0 #3 0 0 0 #4 0 1 0 #5 0 1 1 
+3
source

Below is a slightly longer path with dcast() and merge() . Since gender and age are not unique by ID, a function is created to turn its length into a dummy variable ( dum() ). On the other hand, the diagnosis is made unambiguously by changing the formula.

 library(reshape2) data.raw <- read.table(header = T, sep = ",", text = " id, gender, age, diagnosis 1, male, 42, asthma 1, male, 42, anxiety 2, male, 19, asthma 3, female, 23, diabetes 4, female, 61, diabetes 4, female, 61, copd") # function to create a dummy variable dum <- function(x) { if(length(x) > 0) 1 else 0 } # length of dignosis by id, gender and age diag <- dcast(data.raw, formula = id + gender + age ~ diagnosis, fun.aggregate = length)[,-c(2,3)] # length of gender by id gen <- dcast(data.raw, formula = id ~ gender, fun.aggregate = dum) # length of age by id age <- dcast(data.raw, formula = id ~ age, fun.aggregate = dum) merge(merge(gen, age, by = "id"), diag, by = "id") # id female male 19 23 42 61 anxiety asthma copd diabetes #1 1 0 1 0 0 1 0 1 1 0 0 #2 2 0 1 1 0 0 0 0 1 0 0 #3 3 1 0 0 1 0 0 0 0 0 1 #4 4 1 0 0 0 0 1 0 0 1 1 

Actually, I don’t know your model very well, but your setup may be too big, because R handles factors on the formula object. For example, if gender is the answer, the following matrix will be generated within R. Therefore, while you are not going to fit yourself, it is enough to establish the appropriate data types and formula.

 data.raw$age <- as.factor(data.raw$age) model.matrix(gender ~ ., data = data.raw[,-1]) #(Intercept) age23 age42 age61 diagnosis asthma diagnosis copd diagnosis diabetes #1 1 0 1 0 1 0 0 #2 1 0 1 0 0 0 0 #3 1 0 0 0 1 0 0 #4 1 1 0 0 0 0 1 #5 1 0 0 1 0 0 1 #6 1 0 0 1 0 1 0 

If you need all levels of each variable, you can do this by suppressing the capture in model.matrix and using the trick from all-levels-of-a-factor-in-a-model-matrix-in-r

 # Using Akrun df1, first change all variables, except ID, to factor df1[-1] <- lapply(df1[-1], factor) # Use model.matrix to derive dummy coding m <- data.frame(model.matrix( ~ 0 + . , data=df1, contrasts.arg = lapply(df1[-1], contrasts, contrasts=FALSE))) # Collapse to give final solution aggregate(. ~ ID, data=m, max) 
+1
source

Source: https://habr.com/ru/post/987351/


All Articles