What is the fastest way to write a large data frame as JSON in R?

I need to write a large data frame to a file as JSON in R. I am using the rjson package. The approach below is slow ...

 for (i in 1:nrow(df)) {
     write.table(toJSON(df[i, ]), "[FILENAME]",
                 row.names = FALSE, col.names = FALSE, quote = FALSE, append = TRUE)
 }

So, I tried this:

 write.table(toJSON(df), "[FILENAME]",
             row.names = FALSE, col.names = FALSE, quote = FALSE, append = TRUE)

This chokes because toJSON() cannot handle a very long string. So I would like to write out the data frame in pieces at a time. What is the recommended approach for this? If it involves split(), could you provide some pseudocode?

+6
2 answers

Here's a big(ger) dataset:

 big = iris[rep(seq_len(nrow(iris)), 1000),] 

The for loop with toJSON(df[i,]) creates a flat file of key-value pairs representing each row, whereas toJSON(df) produces column vectors; these are very different. We aim for the equivalent of toJSON(df[i,]), but formatted as a single JSON string.
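
To make the difference concrete, here is a minimal sketch with a made-up two-row data frame (the toy object is hypothetical, not from the question):

 library(rjson)
 toy <- data.frame(x = c(1, 2), y = c("a", "b"), stringsAsFactors = FALSE)

 ## row-wise: one JSON object of scalars per row
 toJSON(toy[1, ])   # roughly {"x":1,"y":"a"}
 toJSON(toy[2, ])   # roughly {"x":2,"y":"b"}

 ## whole data frame: one JSON object of column vectors
 toJSON(toy)        # roughly {"x":[1,2],"y":["a","b"]}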

Start by munging big into a list-of-lists lol, with each inner element named (and the factor turned into a character, so as not to confuse the JSON further), so that lol looks like list(big[1,], big[2,], ...) but with names on each element.

 big1 <- Map(function(x, nm) setNames(x, rep(nm, length(x))), big, names(big))
 big1$Species <- as.character(big1$Species)
 lol <- unname(do.call(Map, c(list, big1)))
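
As a quick sanity check on the shape of lol (my aside, not part of the original answer; exact printed output not shown):

 length(lol)    # should equal nrow(big), i.e. 150000
 str(lol[[1]])  # one row as a named list: Sepal.Length, Sepal.Width, ..., Species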

Then we turn this into a vector of JSON strings, using rjson's toJSON() and splitIndices() from the parallel package (there are probably other ways to generate the split):

 chunks <- 10
 json <- sapply(splitIndices(length(lol), chunks), function(idx) toJSON(lol[idx]))
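
As a small aside (not in the original answer), splitIndices(n, k) just partitions 1:n into k roughly equal, contiguous blocks:

 library(parallel)
 str(splitIndices(10, 3))   # a list of 3 integer vectors covering 1:10,
                            # e.g. 1:3, 4:7, 8:10 (boundaries may vary)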

We could write the json chunks to a file as they are, but together they would not be quite legal JSON: all but the last chunk should end with "," yet they end with "]", and all but the first should start with nothing yet they start with "[". Since substring replacement cannot delete a character, the stray "[" is overwritten with a space below, which is harmless whitespace in JSON.

 substring(json[-length(json)], nchar(json)[-length(json)]) <- ","  # trailing "]" becomes ","
 substring(json[-1], 1, 1) <- " "  # leading "[" becomes a space (assigning "" would replace nothing)
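
A toy illustration of this boundary surgery (the two hand-written chunks below are made up for the example, not output of the code above):

 demo <- c('[{"x":1},{"x":2}]', '[{"x":3},{"x":4}]')
 substring(demo[-length(demo)], nchar(demo)[-length(demo)]) <- ","
 substring(demo[-1], 1, 1) <- " "
 demo
 ## roughly: [1] "[{\"x\":1},{\"x\":2},"  " {\"x\":3},{\"x\":4}]"
 ## pasted together, the two lines now form one valid JSON array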

These are then ready to be written to a file:

 fl <- tempfile()
 writeLines(json, fl)

Putting this all together (and, of course, not handling the many special cases of column type coercion):

 library(parallel)  ## just for splitIndices; no parallel processing here...
 library(rjson)

 fastJson <- function(df, fl, chunks = 10) {
     ## name each column's elements after the column, then pivot to a list of rows
     df1 <- Map(function(x, nm) setNames(x, rep(nm, length(x))), df, names(df))
     df1 <- lapply(df1, function(x) {
         if (is(x, "factor")) as.character(x) else x
     })
     lol <- unname(do.call(Map, c(list, df1)))

     ## convert each chunk of rows to a JSON string
     idx <- splitIndices(length(lol), chunks)
     json <- sapply(idx, function(i) toJSON(lol[i]))

     ## fix up chunk boundaries: trailing "]" -> ",", leading "[" -> " "
     ## (a space rather than "", since substring()<- cannot delete characters)
     substring(json[-length(json)], nchar(json)[-length(json)]) <- ","
     substring(json[-1], 1, 1) <- " "

     writeLines(json, fl)
 }

With:

 > fastJson(big, tempfile())
 > system.time(fastJson(big, fl <- tempfile()))
    user  system elapsed
   2.340   0.008   2.352
 > system(sprintf("wc %s", fl))
   10    10 14458011 /tmp/RtmpjLEh5h/file3fa75d00a57c

In contrast, just subsetting big row by row (without any JSON conversion or writing to a file) takes a long time:

 > system.time(for (i in seq_len(nrow(big))) big[i,])
    user  system elapsed
  57.632   0.088  57.835

Opening this file to append, once for each row, does not take much time compared to the subsetting:

 > system.time(for (i in seq_len(nrow(big))) { con <- file(fl, "a"); close(con) })
    user  system elapsed
   2.320   0.580   2.919
+8

What makes your first approach extremely slow is that every time you call write.table, the file is opened, the handle is moved to the end of the file, the data are written, and then the file is closed again. It will be much faster if you open the file only once and use a file connection, like this:

 fh <- file("[FILENAME]", "w")
 for (i in 1:nrow(df)) {
     write.table(toJSON(df[i, ]), fh,
                 row.names = FALSE, col.names = FALSE, quote = FALSE)
 }
 close(fh)

I also removed append = TRUE, as it is implied (and therefore not necessary) when using a file connection.
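
A small variation on the same idea (my addition, not part of this answer): since toJSON() returns a plain character string, writeLines() on the open connection can be used instead of write.table(), making the quoting and row-name options unnecessary.

 ## sketch: same single-connection approach, writing the JSON strings directly
 fh <- file("[FILENAME]", "w")
 for (i in 1:nrow(df)) {
     writeLines(toJSON(df[i, ]), fh)
 }
 close(fh)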

+1

Source: https://habr.com/ru/post/954209/

