I am trying to load a large data file (about 20 million lines) using fread()
from the data.table package. However, some lines cause big problems.
Minimal example:
text.csv contains: id, text 1,"""Oops"",\""The"",""Georgia""" fread("text.csv", sep=",") Error in fread("text.csv", sep = ",") : Not positioned correctly after testing format of header row. ch=',' In addition: Warning message: In fread("text.csv", sep = ",") : Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: id, text
read.table()
works slightly better, but is too slow and too inefficient.
> read.table("text.csv", header = TRUE, sep=",") id text 1 1 "Oops",\\"The","Georgia"
I understand that my text file is not formatted correctly, but it is too large for editing in practice.
Any help is greatly appreciated.
EDIT:
A small selection of actual data records:
sample1.txt, a good record: materiale_id,dk5,description,creator,subject-phrase,title,type 125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music > fread("sample1.txt", sep=",") materiale_id dk5 description creator subject-phrase 1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin NA title type 1: Koncert i Copenhagen Jazz House den 26.1.1995 music sample2.txt, a good and a bad record: materiale_id,dk5,description,creator,subject-phrase,title,type 125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music 150012-leksikon:100019,,"Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...",,"[""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]",it - elementer i databehandling,article > fread("sample2.txt", sep=",") Empty data.table (0 rows) of 11 cols: 150012-leksikon:100019,V2,Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...,V4,[""Informatik","it"...
EDIT 2:
Update to version 3.2.3 and data. Table 1.9.6. helps with this, but creates problems with other entries:
sample3.txt, a good and a bad record: materiale_id,dk5,description,creator,subject-phrase,title,type 125030-katalog:000236595,,,Red Tampa Solist prf,"[""Tom"",""Georgia"",""1929-1930""]","Georgia Tom, 1929-1930",music 125030-katalog:000236596,,,Jane Lucas (Solist),"[""1928-1931"",""Tom,\""The"",""Georgia"",""Accompanist""]","Georgia Tom,""The Accompanist"" (1928-1931)",music > s3 <- fread("sample3.txt", sep=",") Error in fread("sample3.txt", sep = ",") : Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
EDIT 3:
Updating version 1.9.7 of the data tables breaks fread()
as a whole:
> s3 <- fread("sample3.txt", sep=",") Error in fread("sample3.txt", sep = ",") : showProgress is not type integer but type 'logical'. Please report.
EDIT 4:
It seems like the problem in my file is when the entries contain the string \\"
(litteraly, not regexp). Apparently, one backslash is too much, which leads to the fact that fread()
does not correctly interpret the double quote as end of line where it should be taken litteraly.
My best solutions are currently designed for this:
m1 <- readLines("data.csv", encoding="UTF-8") m2 <- gsub("\\\\\"", "\\\"", m1) writeLines(m2, "data_new.csv", useBytes = TRUE) m3 <- fread("data_new.csv", encoding="UTF-8", sep=",")
It seems to work.
I do not understand this 100%, so any clarification is more than welcome.