How to conditionally remove quotes in write.csv?

When using write.csv, you can reduce the file size significantly (about 25% for large data sets) by removing quotes with quote=FALSE. However, this can cause read.csv to misread the file if your data contains commas. For instance:

    x <- data.frame(a=1:2, b=c("hello,","world"))
    dim(x)
    ## [1] 2 2
    f <- tempfile()
    write.csv(x, f, row.names=FALSE, quote=FALSE)
    dim(read.csv(f))
    ## [1] 2 2
    read.csv(f)
    ##       a  b
    ## 1 hello NA
    ## 2 world NA

Note the misaligned column names, the lost data, and the spurious NA values that appear.
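To see why, it helps to inspect the raw file. The following self-contained sketch repeats the example above and shows the text actually written: the embedded comma gives row 1 three fields while the header has two, so read.csv falls back to treating the first field as row names.

```r
# Reproduce the example and look at the raw lines write.csv produced.
x <- data.frame(a = 1:2, b = c("hello,", "world"))
f <- tempfile()
write.csv(x, f, row.names = FALSE, quote = FALSE)
readLines(f)
## [1] "a,b"      "1,hello," "2,world"
```

With quote=FALSE the embedded comma in "hello," is indistinguishable from a field separator, which is exactly the ambiguity quoting exists to resolve.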

Is it possible to omit quotes in general but keep them for fields that contain commas?

+5
4 answers

The solution I went with was a combination of the @TimPietzcker and @BenBolker comments.

quote can be a numeric vector indicating which columns should be quoted. Although I would prefer to quote only where necessary, this gave me nearly the full file-size reduction in my case (along with na=""):

    commas <- which(sapply(x, function(y) any(grepl(",", y))))
    write.csv(x, f, row.names=FALSE, quote=commas)
    read.csv(f)
    ##   a      b
    ## 1 1 hello,
    ## 2 2  world
+6

In case others are looking for a similar solution: I just wrote a complete replacement for write.csv (write.csv.minimal.quote) that quotes only when necessary:

    quote.if.required <- function(x, qmethod=c("double", "escape"),
                                  sep=",", eol="\n") {
      qmethod <- match.arg(qmethod)
      x <- as.character(x)
      mask.quote.sub <- grepl('"', x, fixed=TRUE)
      mask.quote.sep <- grepl(sep, x, fixed=TRUE) | grepl(eol, x, fixed=TRUE)
      qstring <- switch(qmethod,
                        escape="\\\\\"",
                        double="\"\"")
      x[mask.quote.sub] <- paste0('"', gsub('"', qstring, x[mask.quote.sub]), '"')
      x[mask.quote.sep & !mask.quote.sub] <-
        paste0('"', x[mask.quote.sep & !mask.quote.sub], '"')
      x
    }

    write.csv.minimal.quote <- function(x, file="", ...,
                                        qmethod=c("double", "escape"),
                                        row.names=FALSE,
                                        sep=",", eol="\n",
                                        quote) {
      qmethod <- match.arg(qmethod)
      if (!is.data.frame(x)) {
        cn <- colnames(x)
        x <- as.data.frame(x)
        colnames(x) <- cn
      } else {
        cn <- colnames(x)
      }
      cn <- quote.if.required(cn, qmethod=qmethod, sep=sep, eol=eol)
      x <- as.data.frame(lapply(x, quote.if.required,
                                qmethod=qmethod, sep=sep, eol=eol))
      if (is.logical(row.names) && row.names) {
        row.names <- quote.if.required(base::row.names(x),
                                       qmethod=qmethod, sep=sep, eol=eol)
      } else if (is.character(row.names)) {
        row.names <- quote.if.required(row.names,
                                       qmethod=qmethod, sep=sep, eol=eol)
      }
      write.table(x, file=file, append=FALSE, sep=",", dec=".", eol="\n",
                  col.names=cn, row.names=row.names, quote=FALSE)
    }

    ## Example:
    #tmp <- data.frame('"abc'=1:3, "def,hij"=c("1,2", "3", '4"5'), klm=6:8)
    #names(tmp) <- c('"abc', "def,hij", "klm")
    #write.csv.minimal.quote(tmp, file="test.csv")
+2

This is my implementation of the idea @Bill Denney suggested. I prefer it partly because it is cruder and easier for me to understand, but mainly because I wrote it :)

    ##' Write CSV files with quoting that matches MS Excel 2013 or newer
    ##'
    ##' MS Excel CSV export no longer inserts quotation marks on character
    ##' variables, except when the cells include commas or quotation marks.
    ##' This function generates CSV files that are, so far as we know,
    ##' in exactly the same style as MS Excel CSV export files.
    ##'
    ##' This works by manually inserting quotation marks where necessary and
    ##' turning off (quote = FALSE) R's own method of inserting quotation marks.
    ##' @param x a data frame
    ##' @param file character string for file name
    ##' @param row.names Default FALSE for row.names
    ##' @return the return value from write.table.
    ##' @author Paul Johnson
    ##' @examples
    ##' set.seed(234)
    ##' x1 <- data.frame(x1 = c("a", "b,c", "b", "The \"Washington, DC\""),
    ##'                  x2 = rnorm(4), stringsAsFactors = FALSE)
    ##' x1
    ##' dn <- tempdir()
    ##' fn <- tempfile(pattern = "testcsv", fileext = ".csv")
    ##' writeCSV(x1, file = fn)
    ##' readLines(fn)
    ##' x2 <- read.table(fn, sep = ",", header = TRUE, stringsAsFactors = FALSE)
    ##' all.equal(x1, x2)
    writeCSV <- function(x, file, row.names = FALSE){
        xischar <- colnames(x)[sapply(x, is.character)]
        for(jj in xischar){
            x[ , jj] <- gsub('"', '""', x[ , jj], fixed = TRUE)
            needsquotes <- grep('[",]', x[ , jj])
            x[needsquotes, jj] <- paste0("\"", x[needsquotes, jj], "\"")
        }
        write.table(x, file = file, sep = ",", quote = FALSE,
                    row.names = row.names)
    }

Output from the example:

    > set.seed(234)
    > x1 <- data.frame(x1 = c("a", "b,c", "b", "The \"Washington, DC\""),
    +                  x2 = rnorm(4), stringsAsFactors = FALSE)
    > x1
                        x1         x2
    1                    a  0.6607697
    2                  b,c -2.0529830
    3                    b -1.4992061
    4 The "Washington, DC"  1.4712331
    > dn <- tempdir()
    > fn <- tempfile(pattern = "testcsv", fileext = ".csv")
    > writeCSV(x1, file = fn)
    > readLines(fn)
    [1] "x1,x2"
    [2] "a,0.660769736644892"
    [3] "\"b,c\",-2.052983003941"
    [4] "b,-1.49920605110092"
    [5] "\"The \"\"Washington, DC\"\"\",1.4712331168047"
    > x2 <- read.table(fn, sep = ",", header = TRUE, stringsAsFactors = FALSE)
    > all.equal(x1, x2)
    [1] TRUE
+2

If a value contains a comma, wrap it in quotation marks; then call write.csv with quote = FALSE.

    library(stringr)
    options(useFancyQuotes = FALSE)
    d <- data.frame(x = c("no comma", "has,comma"))
    d$x <- with(d, ifelse(str_detect(x, ","), dQuote(x), as.character(x)))
    filename <- "test.csv"
    write.csv(d, file = filename, quote = FALSE, row.names = FALSE)
    noquote(readLines(filename))
    ## [1] x           no comma    "has,comma"
    read.csv(filename)
    ##           x
    ## 1  no comma
    ## 2 has,comma

(You can replace str_detect with grepl and dQuote with paste if you want to avoid the stringr dependency.)


That said, I doubt most datasets see anything like a 25% file-size saving from dropping quotes. If small files matter to you, you're better off compressing the file (see zip and tar in the utils package), saving it in a binary format (see save, or the rhdf5 package), or possibly using a database.
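For what it's worth, the compression route needs no extra packages: write.csv accepts any writable connection, so you can write straight through gzfile, and read.csv reads .gz files transparently. A minimal sketch:

```r
# Sketch: gzip-compressed CSV using base R only.
d <- data.frame(a = 1:1000, b = rnorm(1000))

plain <- tempfile(fileext = ".csv")
write.csv(d, plain, row.names = FALSE)

gz <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz, "w")          # write.csv accepts an open connection
write.csv(d, con, row.names = FALSE)
close(con)

file.size(gz) < file.size(plain)  # compression dwarfs the quote savings
d2 <- read.csv(gz)                # .gz files are decompressed transparently
```

This also sidesteps the quoting question entirely, since the quotes compress away almost for free.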

+1

Source: https://habr.com/ru/post/1202145/
