How can I stack data in R?

I have 20 different CSV files, and I need to somehow stack the data in R so that I can get a general picture of it. I am currently copying and pasting columns in Excel to make one large dataset. However, I am sure that there is a faster and more efficient way to do this in R, as the manual approach will eventually take a lot of time.

In addition, to make matters worse, some variable names do not match across the datasets. For example, VARIABLE1 is written as variable1 in some datasets. How would I fix this in R, since I understand that R is case sensitive?

Any help would be greatly appreciated. Thanks!

+4
3 answers

If you are (or want to become) familiar with the data.table package, the easiest and fastest way to do this is as follows (not tested):

    require(data.table)
    # directory where the CSV files are located, not the files themselves
    in_pth <- "path_to_csv_files"
    files <- list.files(in_pth, full.names=TRUE, recursive=FALSE, pattern="\\.csv$")
    out <- rbindlist(lapply(files, fread))

list.files options:

  • full.names = TRUE will return the full path to your file. Suppose your in_pth <- "c:\\my_csv_folder" and there are two files inside it: 01.csv and 02.csv . Then full.names=TRUE will return c:\\my_csv_folder\\01.csv and c:\\my_csv_folder\\02.csv ( full path ).

  • recursive = FALSE will not search inside subdirectories of your in_pth folder. Suppose you have two more CSV files in c:\\my_csv_folder\\another_folder . If you want to load those files as well, you can set recursive=TRUE , which descends into every subdirectory and returns the matching files found there.

  • pattern="\\.csv$" : This is a regular expression that selects which files are returned. If your folder contains text files (.txt) in addition to CSV files, specifying this pattern ensures that only the CSV files are loaded. If your folder has only CSV files, then it is not necessary.
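The three options can be tried out on a throwaway folder. The sketch below is only an illustration: the folder layout and file names are made up to mirror the example above, and everything is created under tempdir().

```r
# Hypothetical throwaway directory, created only for this demonstration
in_pth <- file.path(tempdir(), "my_csv_folder")
dir.create(file.path(in_pth, "another_folder"), recursive = TRUE, showWarnings = FALSE)

# Two CSV files and one text file at the top level, one CSV in a subfolder
write.csv(data.frame(x = 1), file.path(in_pth, "01.csv"), row.names = FALSE)
write.csv(data.frame(x = 2), file.path(in_pth, "02.csv"), row.names = FALSE)
writeLines("not a csv", file.path(in_pth, "notes.txt"))
write.csv(data.frame(x = 3), file.path(in_pth, "another_folder", "03.csv"), row.names = FALSE)

# File names only, top level only, CSV files only
top_names <- list.files(in_pth, pattern = "\\.csv$")

# Full paths, still top level only
top_paths <- list.files(in_pth, full.names = TRUE, pattern = "\\.csv$")

# Full paths, also descending into subdirectories
all_paths <- list.files(in_pth, full.names = TRUE, recursive = TRUE, pattern = "\\.csv$")

print(top_names)
print(top_paths)
print(all_paths)
```

Without full.names=TRUE the returned names are relative to in_pth, so passing them straight to a reader function would only work if in_pth happened to be the working directory.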


data.table functions:

  • rbindlist takes a list of tables and, when it binds columns by position, avoids conflicts in column names by keeping the names of the first data.table. That is, if you have two data.tables dt1 and dt2 with column names x,y and a,b respectively, then rbindlist(list(dt1, dt2)) will take care of changing a,b to x,y and rbindlist(list(dt2, dt1)) will take care of changing x,y to a,b .

  • fread usually takes care of columns, headers, separators, etc. automatically and is very fast (although it is still experimental, so you may want to check your result to make sure that everything is fine, even if it is stable).
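To make the naming behaviour concrete for the VARIABLE1 / variable1 case from the question, here is a minimal sketch with made-up tables. It shows binding by position (requested explicitly with use.names = FALSE), and, as one possible alternative that is not part of the answer above, lowercasing all column names first and then binding by name.

```r
library(data.table)

# Two made-up tables whose column names differ only in case
dt1 <- data.table(VARIABLE1 = 1:2, VARIABLE2 = c("a", "b"))
dt2 <- data.table(variable1 = 3:4, variable2 = c("c", "d"))

# Bind by position: the names of the first table are kept
by_position <- rbindlist(list(dt1, dt2), use.names = FALSE)

# Alternative: normalise the names on copies, then match columns by name
tables <- lapply(list(dt1, dt2), function(dt) setnames(copy(dt), tolower(names(dt))))
by_name <- rbindlist(tables, use.names = TRUE)

print(names(by_position))
print(names(by_name))
```

Binding by position is only safe when every file has its columns in the same order; normalising the names and binding by name does not rely on that.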

+3

@Denis: It's also worth looking into the plyr package for this. rbind.fill(...) allows you to combine data.frames row by row.

    install.packages("plyr")
    library(plyr)

See help(rbind.fill) for more details; an excerpt follows:

Rbinds a list of data frames, filling in the missing columns with NA.

Usage

    rbind.fill(...)

Arguments

... input data frames to row bind together. The first argument may be a list of data frames, in which case all other arguments are ignored.

Details

This is an enhancement to rbind that adds in columns that are not present in all inputs, accepts a list of data frames, and is much faster.

The column names and types in the output appear in the order in which they were encountered. No checking is performed to ensure that each column has a consistent type across the inputs.
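A short sketch of what that help text describes, using two made-up data frames that share only one column:

```r
library(plyr)

# Two made-up data frames that share only the column "id"
df1 <- data.frame(id = 1:2, height = c(1.7, 1.8))
df2 <- data.frame(id = 3:4, weight = c(60, 75))

# The first argument may be a list of data frames
stacked <- rbind.fill(list(df1, df2))

# Columns missing from one input are filled with NA in its rows
print(stacked)
```

Because R names are case sensitive, rbind.fill would treat VARIABLE1 and variable1 as two different columns, so the names still need to be harmonised (for example with tolower(names(df))) before stacking.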

As far as I know, there is no built-in cbind.fill ; however, there is a user-written cbind.fill function that allows you to combine data.frames column by column. More details here .

Two solutions are given there: one that depends on rbind.fill from the plyr package, and one that is independent of rbind.fill .

0

Another way that does not use external packages is the cbind() command, which does column binding. So, if you need to combine different tables, you can simply pass them as arguments to cbind(), and they will be attached to each other.
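For comparison, a small base-R sketch with made-up data. It shows what cbind() does next to its row-wise counterpart rbind(); the latter is not mentioned in the answer above and is included here only to illustrate the difference between binding columns and stacking rows.

```r
# Two made-up tables with the same columns
a <- data.frame(id = 1:2, value = c(10, 20))
b <- data.frame(id = 3:4, value = c(30, 40))

# cbind() places the tables side by side: still 2 rows, now 4 columns
side_by_side <- cbind(a, b)

# rbind() stacks them: still 2 columns, now 4 rows
stacked <- rbind(a, b)

# For many tables held in a list, do.call() passes them all to rbind()
stacked_all <- do.call(rbind, list(a, b))

print(dim(side_by_side))
print(dim(stacked))
```

Note that base rbind() on data frames requires the column names to match exactly, so a case mismatch such as VARIABLE1 versus variable1 has to be fixed before stacking.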

-1

Source: https://habr.com/ru/post/1487851/

