I have an R script that reads multiple text files from a directory and saves the data in a compressed .rda file. It looks like this:
#!/usr/bin/Rscript --vanilla

args <- commandArgs(TRUE)
## args[1] is the folder name
outname <- paste(args[1], ".rda", sep = "")
files <- list.files(path = args[1], pattern = "\\.txt$", full.names = TRUE)

tmp <- list()
if (file.exists(outname)) {
  message("found ", outname)
  load(outname)
  tmp <- get(args[1])                  # previously read data, stored under the folder name
  files <- setdiff(files, names(tmp))  # keep only files not read before
}

if (length(files) == 0) {  # setdiff() returns character(0), not NULL
  message("no new files")
} else {
  ## read the new files into a list of data frames
  results <- plyr::llply(files, read.table, .progress = "text")
  names(results) <- files
  assign(args[1], c(tmp, results))
  message("now saving... ", args[1])
  save(list = args[1], file = outname)
}
message("all done!")
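For reference, I run it from the shell like this (the script name here, read_data.R, is just whatever I saved the file as; mydata is an example folder):

chmod +x read_data.R
./read_data.R mydata    # reads mydata/*.txt, writes mydata.rda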
The files are fairly large (about 15 MB each, and there are usually around 50 of them), so a run of this script takes several minutes, a significant part of which is spent writing the .rda file.
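I know most of that write time is compression: save() accepts a compress argument, so a quick experiment like the following (object names are illustrative, not from my script) shows the speed/size trade-off:

## Sketch: compare save() timings under different compression settings.
## 'big_list' stands in for the list of data frames the script builds.
big_list <- replicate(5, matrix(rnorm(1e6), ncol = 10), simplify = FALSE)

system.time(save(big_list, file = "out_default.rda"))                 # gzip (default)
system.time(save(big_list, file = "out_none.rda", compress = FALSE))  # no compression, fastest save
system.time(save(big_list, file = "out_xz.rda", compress = "xz"))     # smallest file, slowest save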
I often add new data files to the directory, and I'd like to append them to the previously saved and compressed data. That is what the script above does: it checks whether an output file with that name already exists and only reads the files it hasn't seen. But the final step, re-saving the whole .rda file, is still quite slow.
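One thing I've considered (a sketch only; the function and its name are mine, not something I actually use) is storing each file's parsed data as its own .rds, so adding new .txt files never rewrites data that was already saved:

## Sketch: one .rds per input file; only new files trigger any writing.
## 'dir' plays the role of args[1] in the script above.
update_cache <- function(dir) {
  txt <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  rds <- file.path(dir, paste0(basename(txt), ".rds"))
  new <- txt[!file.exists(rds)]
  for (f in new) {
    dat <- read.table(f)
    saveRDS(dat, file.path(dir, paste0(basename(f), ".rds")))
  }
  ## load everything back as a named list when needed
  all_rds <- list.files(dir, pattern = "\\.rds$", full.names = TRUE)
  setNames(lapply(all_rds, readRDS), all_rds)
}

That amortizes the compression cost across runs, but it leaves me doing the bookkeeping by hand, which is why I'm asking about a package.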
Is there a smarter way to do this, perhaps with a package that keeps track of which files have already been read and saves the results faster?
I saw that knitr uses tools:::makeLazyLoadDB to save its cached computations, but that function is undocumented, so I'm not sure how advisable it is to use it.
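As far as I can tell from reading knitr's source, the usage looks roughly like this (a sketch only, inferred from that code; the function is an undocumented internal, so its signature may change between R versions):

## Sketch: tools:::makeLazyLoadDB is internal and undocumented,
## so this usage is inferred, not a supported API.
e <- new.env()
e$x <- matrix(rnorm(1e5), ncol = 10)

## writes cache.rdb / cache.rdx in the working directory
tools:::makeLazyLoadDB(e, "cache")

## later: lazyLoad() (also flagged "for internal use") creates promises,
## so objects are only read from disk when first accessed
lazyLoad("cache")
str(x)  # forcing 'x' loads it from the database now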