Changing column data types in a data frame based on a template with matching columns in R

I have 2 data frames.

  • Template - I will use data types from this data frame.
  • df - I want to change the data types of this data frame based on a template.

I want to change the data types of the second data frame based on the first. Suppose I use the data frame below as a template.

    # template
    id <- c(1,2,3,4)
    a <- c(1,4,5,6)
    b <- as.character(c(0,1,1,4))
    c <- as.character(c(0,1,1,0))
    d <- c(0,1,1,0)
    template <- data.frame(id,a,b,c,d, stringsAsFactors = FALSE)

    > str(template)
    'data.frame': 4 obs. of 5 variables:
     $ id: num 1 2 3 4
     $ a : num 1 4 5 6
     $ b : chr "0" "1" "1" "4"
     $ c : chr "0" "1" "1" "0"
     $ d : num 0 1 1 0

I am looking for the following:

  • df should get the same data types as the template.
  • It should have the same columns that are in the template.

Note: columns that are in the template but not available in df should be added, filled with NA.

    # df
    id <- c(6,7,12,14,1,3,4,4)
    a <- c(0,1,13,1,3,4,5,6)
    b <- c(1,4,12,3,4,5,6,7)
    c <- c(0,0,13,3,4,45,6,7)
    e <- c(0,0,13,3,4,45,6,7)
    df <- data.frame(id,a,b,c,e)

    > str(df)
    'data.frame': 8 obs. of 5 variables:
     $ id: num 6 7 12 14 1 3 4 4
     $ a : num 0 1 13 1 3 4 5 6
     $ b : num 1 4 12 3 4 5 6 7
     $ c : num 0 0 13 3 4 45 6 7
     $ e : num 0 0 13 3 4 45 6 7

The required output is

    > output
      id  a  b  c  d
    1  6  0  1  0 NA
    2  7  1  4  0 NA
    3 12 13 12 13 NA
    4 14  1  3  3 NA
    5  1  3  4  4 NA
    6  3  4  5 45 NA
    7  4  5  6  6 NA
    8  4  6  7  7 NA

    > str(output)
    'data.frame': 8 obs. of 5 variables:
     $ id: num 6 7 12 14 1 3 4 4
     $ a : num 0 1 13 1 3 4 5 6
     $ b : chr "1" "4" "12" "3" ...
     $ c : chr "0" "0" "13" "3" ...
     $ d : logi NA NA NA NA NA NA ...

My attempt so far:

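    library(data.table)

    # Read the template as a data.table
    template <- fread("template.csv", header = TRUE, stringsAsFactors = FALSE)

    # Strip unwanted characters; this turns every column into character
    n <- names(template)
    template[, (n) := lapply(.SD, function(x) gsub("[^A-Za-z0-90 _/.-]", "", as.character(x)))]

    # Same cleanup for df (df must also be a data.table for := to work)
    n <- names(df)
    df[, (n) := lapply(.SD, function(x) gsub("[^A-Za-z0-90 _/.-]", "", as.character(x)))]

    # Stack template and df, filling columns missing on either side with NA
    output <- rbindlist(list(template, df), use.names = TRUE, fill = TRUE, idcol = "template")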

After that, I write the output data frame to disk with write.csv and read it back in to get the data types. However, the data types end up wrong. Please suggest a suitable way to handle this.
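To see why a CSV round trip loses the types, here is a minimal sketch using the template from the question. It assumes base R's write.csv and read.csv with their default type detection; the temporary file and the names reread, before and after are only for illustration.

```r
# Template from the question: b and c are character columns.
template <- data.frame(
  id = c(1, 2, 3, 4),
  a  = c(1, 4, 5, 6),
  b  = as.character(c(0, 1, 1, 4)),
  c  = as.character(c(0, 1, 1, 0)),
  d  = c(0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Round trip through a temporary CSV file (illustrative path).
tmp <- tempfile(fileext = ".csv")
write.csv(template, tmp, row.names = FALSE)
reread <- read.csv(tmp, stringsAsFactors = FALSE)

# Column classes before and after the round trip.
before <- sapply(template, class)
after  <- sapply(reread, class)
print(rbind(before, after))
```

After re-reading, b and c are no longer character, because read.csv infers column types from the cell contents unless colClasses is supplied.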

2 answers

I would do

    res = data.frame(
      lapply(setNames(,names(template)), function(x)
        if (x %in% names(df)) as(df[[x]], class(template[[x]]))
        else template[[x]][NA_integer_]
      ), stringsAsFactors = FALSE)

or using magrittr

    library(magrittr)
    setNames(, names(template)) %>%
      lapply(. %>% {
        if (. %in% names(df)) as(df[[.]], class(template[[.]]))
        else template[[.]][NA_integer_]
      }) %>%
      data.frame(stringsAsFactors = FALSE)

Checking the structure of the result:

    'data.frame': 8 obs. of 5 variables:
     $ id: num 6 7 12 14 1 3 4 4
     $ a : num 0 1 13 1 3 4 5 6
     $ b : chr "1" "4" "12" "3" ...
     $ c : chr "0" "0" "13" "3" ...
     $ d : num NA NA NA NA NA NA NA NA

I would suggest looking at the vetr package if you are going to do a lot of things like this. It has a good approach to templates for data frames and their columns.
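If this is needed more than once, the same idea can be wrapped in a small helper. The sketch below is an assumption-laden variant of the code above, not a vetted implementation: the function name match_template is hypothetical, only the first class of each template column is used, and the example data frames are shortened versions of the ones in the question.

```r
library(methods)  # provides as(); attached by default in most R sessions

# Hypothetical helper: coerce `df` to the column set and column
# classes of `template`.
match_template <- function(df, template) {
  cols <- setNames(nm = names(template))
  out <- lapply(cols, function(x) {
    if (x %in% names(df)) {
      # Column exists in df: coerce it to the template's class.
      as(df[[x]], class(template[[x]])[1])
    } else {
      # Column missing from df: all-NA vector of the template's type.
      rep(template[[x]][NA_integer_], nrow(df))
    }
  })
  data.frame(out, stringsAsFactors = FALSE)
}

# Shortened example data (illustrative only).
template <- data.frame(id = c(1, 2), b = c("0", "1"), d = c(0, 1),
                       stringsAsFactors = FALSE)
df <- data.frame(id = c(6, 7, 12), b = c(1, 4, 12), e = c(0, 0, 13))

res <- match_template(df, template)
str(res)
```

Here the extra column e is dropped, b becomes character, and d is added as an all-NA numeric column, mirroring the behaviour shown above.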


Here is the code that does what you want.

    require(tidyverse)

    new_types <- map_df(template, class) %>%
      t %>%
      as.data.frame(stringsAsFactors = FALSE) %>%
      rownames_to_column %>%
      setNames(c('col', 'type'))

    new_data <- df %>%
      gather(col, value) %>%
      right_join(new_types, by='col') %>%
      group_by(col) %>%
      mutate(rownum = row_number()) %>%
      ungroup %>%
      complete(col, rownum=1:max(rownum)) %>%
      group_by(col) %>%
      summarize(val = list(value), type=first(type)) %>%
      mutate(new_val = map2(val, type, ~as(.x, .y, strict = TRUE))) %>%
      select(col, new_val) %>%
      spread(col, new_val) %>%
      unnest

The main idea here is to use map2() from the purrr package to apply the as() function from base R. as() takes an object (for example, a vector or a column of a data frame) and a character string naming the new type, and returns the coerced object. This is the core functionality you need.
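As a small illustration of that idea, here is a sketch that uses base R's Map() as a stand-in for purrr::map2(), so that it runs without extra packages. The names vals, types and converted are made up for this example.

```r
library(methods)  # provides as(); attached by default in most R sessions

# as() takes an object and the target class name as a character string.
as(c(1, 4, 12), "character")
as(c("0", "1", "1"), "numeric")

# Base R analogue of purrr::map2(): apply as() pairwise over a list of
# columns and a vector of target class names.
vals  <- list(a = c(0, 1, 13), b = c(1, 4, 12))
types <- c(a = "numeric", b = "character")
converted <- Map(function(x, type) as(x, type), vals, types)
str(converted)
```

With purrr loaded, the Map() call would correspond to map2(vals, types, ~as(.x, .y)).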

My new_types data frame simply lists the column names of the template alongside their types (as character strings).

Apart from the map2() line, everything else is just tedious data wrangling that could be improved.

Some key points:

  • right_join is important here: it keeps only the columns you need.
  • The lines from mutate(rownum = row_number()) to complete(col, rownum=1:max(rownum)) are needed only when the template has columns that are missing from df (such as d here). They guarantee that such a column gets the same number of NA values as the other columns have rows.
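
The column bookkeeping behind these two points can be checked in base R with set operations on the column names. This is only a sketch of the idea, not part of the tidyverse pipeline above; the column names are taken from the question, and the variable names are made up.

```r
template_cols <- c("id", "a", "b", "c", "d")
df_cols       <- c("id", "a", "b", "c", "e")

# Columns present in both data frames: their values are carried over.
kept <- intersect(template_cols, df_cols)

# Columns of df that the template lacks: these are dropped.
dropped <- setdiff(df_cols, template_cols)

# Template columns that df lacks: these must be filled with NA.
to_fill <- setdiff(template_cols, df_cols)

print(list(kept = kept, dropped = dropped, to_fill = to_fill))
```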

Source: https://habr.com/ru/post/1275022/

