Fast way to read xlsx files in R

Question

This is a follow-up to an earlier question. What is the fastest way to read .xlsx files in R?

I am using library(xlsx) to read 36 .xlsx files. It works, but it is very time-consuming (more than 30 minutes), especially considering how little data each file holds (a 3 * 3652 matrix per file). Is there a better way to handle this? Is there a faster way to read .xlsx files in R? Or can I quickly combine the 36 files into one csv file and then read that into R?

Also, I just realized that readxl cannot write xlsx files. Is there a counterpart package for writing rather than reading?

Response to those who voted to close this question:

This question is about fact, not so-called "opinion-based answers and spam", because speed is measured in time, and time is a fact, not an opinion.

Further update:

Perhaps someone could explain in plain language why some methods are so much faster than others. I am certainly confused by this.

+12

r xlsx

LaTeXFan Jun 14 '17 at 7:27

1 answer

Mark heckmann · Accepted Answer · 2018-02-09T12:53:42+0000

Here is a small benchmark. Result: readxl::read_xlsx is on average about twice as fast as openxlsx::read.xlsx across different numbers of rows (n) and columns (p), using default settings.

    options(scipen = 999)  # no scientific number format

    nn <- c(1, 10, 100, 1000, 5000, 10000, 20000, 30000)  # numbers of rows
    pp <- c(1, 5, 10, 20, 30, 40, 50)                     # numbers of columns

    # create some excel files and time how long each package needs to read them
    l <- list()  # save results
    tmp_dir <- tempdir()

    for (n in nn) {
      for (p in pp) {
        cat("\n\tn:", n, "p:", p)  # progress message
        flush.console()
        m <- matrix(rnorm(n * p), n, p)
        file <- paste0(tmp_dir, "/n", n, "_p", p, ".xlsx")

        # write (matrix converted to a data frame before writing)
        openxlsx::write.xlsx(as.data.frame(m), file)

        # read with openxlsx
        elapsed <- system.time(x <- openxlsx::read.xlsx(file))["elapsed"]
        df <- data.frame(fun = "openxlsx::read.xlsx", n = n, p = p,
                         elapsed = elapsed, stringsAsFactors = FALSE,
                         row.names = NULL)
        l <- append(l, list(df))

        # read with readxl
        elapsed <- system.time(x <- readxl::read_xlsx(file))["elapsed"]
        df <- data.frame(fun = "readxl::read_xlsx", n = n, p = p,
                         elapsed = elapsed, stringsAsFactors = FALSE,
                         row.names = NULL)
        l <- append(l, list(df))
      }
    }

    # results
    d <- do.call(rbind, l)

    library(ggplot2)
    ggplot(d, aes(n, elapsed, color = fun)) +
      geom_line() +
      geom_point() +
      facet_wrap(~ paste("columns:", p)) +
      xlab("Number of rows") +
      ylab("Seconds")
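Applied to the situation in the question (many small .xlsx files), the benchmark suggests reading each file with readxl and stacking the results. The following is only a minimal sketch under assumptions that are not in the original post: all files sit in one folder, each has the same columns on its first sheet, and `read_all_xlsx`, `demo_dir` and `source_file` are hypothetical names chosen for illustration. The demo files are written with openxlsx::write.xlsx, which is one option for writing since readxl only reads.

```r
# Sketch: read every .xlsx file in a folder with readxl and stack the results.
# Assumption: all files share the same column layout on their first sheet.
library(readxl)

# Hypothetical helper, not part of readxl or openxlsx.
read_all_xlsx <- function(xlsx_dir) {
  files <- list.files(xlsx_dir, pattern = "\\.xlsx$", full.names = TRUE)
  # Read each file into a plain data frame and record which file it came from.
  sheets <- lapply(files, function(f) {
    df <- as.data.frame(readxl::read_xlsx(f))
    df$source_file <- basename(f)
    df
  })
  do.call(rbind, sheets)
}

# Usage: write three small demo files with openxlsx, then read them back.
demo_dir <- file.path(tempdir(), "xlsx_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  m <- as.data.frame(matrix(rnorm(10 * 3), 10, 3))
  openxlsx::write.xlsx(m, file.path(demo_dir, paste0("file", i, ".xlsx")))
}

combined <- read_all_xlsx(demo_dir)
dim(combined)
```

If a single csv file is preferred afterwards, base R's write.csv(combined, "combined.csv", row.names = FALSE) would store the stacked data so that later runs can skip the .xlsx step.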
