I have several CSV files that contain numbers in the local German format, i.e. with a comma as the decimal separator and a dot as the thousands separator, for example 10.380,45. Values in the CSV files are separated by ";". The files also contain columns of class character, date, date-time, and logical.
The problem with the read.table family of functions is that you can specify the decimal separator with dec = ",", but not the thousands separator. (If I'm wrong, please correct me.)
I know that preprocessing the files is a workaround, but I want to write my code so that others can use it without my help.
I found a way to read the CSV file the way I want using read.csv2 and setting my own classes, as shown in the following example. It is based on the question about the most elegant way to load a CSV with a dot as a thousands separator in R.
# Create test data: a text column, a German-formatted number column, and a date column
df_test_write <- cbind.data.frame(c("a","b","c","d","e","f","g","h","i","j",rep("k",times=200)),
c("5.200,39","250,36","1.000.258,25","3,58","5,55","10.550,00","10.333,00","80,33","20.500.000,00","10,00",rep("3.133,33",times=200)),
c("25.03.2015","28.04.2015","03.05.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016",rep("08.08.2016",times=200)),
stringsAsFactors=FALSE)
colnames(df_test_write) <- c("col_text","col_num","col_date")
write.csv2(df_test_write,file="Test.csv",quote=FALSE,row.names=FALSE)
# own number class: remove the thousands dots, then turn the decimal comma into a dot
setClass('myNum')
setAs("character","myNum", function(from) as.numeric(gsub(",","\\.",gsub("\\.","",from))))
# own date class
library(lubridate)
setClass('myDate')
setAs("character","myDate",function(from) dmy(from))
# Read the csv file, in colClasses the columns class can be defined
df_test_readcsv <- read.csv2(paste0(getwd(),"/Test.csv"),
stringsAsFactors = FALSE,
colClasses = c(
col_text = "character",
col_num = "myNum",
col_date = "myDate"
)
)
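To show what the custom class does on its own, here is a minimal, self-contained sketch of the same converter applied directly to a few strings through as(), which is, as far as I understand, roughly what read.csv2 does for each column listed in colClasses. The vector x and the variable name converted are only for illustration.

```r
library(methods)  # provides setClass(), setAs() and as()

# Same idea as the converter above: drop the thousands dots,
# then replace the decimal comma with a dot before calling as.numeric()
setClass("myNum")
setAs("character", "myNum",
      function(from) as.numeric(gsub(",", ".", gsub(".", "", from, fixed = TRUE), fixed = TRUE)))

# Illustrative input in German number format
x <- c("5.200,39", "1.000.258,25", "3,58")
converted <- as(x, "myNum")
print(converted)
```

The result is an ordinary numeric vector, so the column can be used for arithmetic right after reading.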
My problem is that the different datasets have up to 200 columns and 350,000 rows. With the solution above, reading one CSV file takes 40 to 60 seconds, and I would like to speed that up.
During my research I found fread() from the data.table package, which is very fast. Reading a CSV file takes 3 to 5 seconds.
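Unfortunately, as far as I can tell, there is no way to specify a thousands separator there either, and custom classes in colClasses do not seem to be supported by fread, see https://github.com/Rdatatable/data.table/issues/491
See my test code below: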
library(data.table)
# Attempt 1: only dec = "," is set, nothing handles the thousands separator
df_test_readfread1 <- fread(paste0(getwd(),"/Test.csv"),
stringsAsFactors = FALSE,
dec = ",",
sep=";",
verbose=TRUE)
str(df_test_readfread1)
# Attempt 2: pass the custom classes defined above via colClasses
df_test_readfread2 <- fread(paste0(getwd(),"/Test.csv"),
stringsAsFactors = FALSE,
colClasses = c(
col_text = "character",
col_num = "myNum",
col_date = "myDate"
),
sep=";",
verbose=TRUE)
str(df_test_readfread2)
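So my question is: is there a way to read CSV files with numbers like 10.380,45 using fread?
(Alternatively: what is the fastest way to read a CSV file with such numbers?)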