Using fread to read data with double quotes and invalid escape characters

I am trying to load a large data file (about 20 million lines) using fread() from the data.table package. However, some lines cause big problems.

Minimal example:

 text.csv contains: id, text 1,"""Oops"",\""The"",""Georgia""" fread("text.csv", sep=",") Error in fread("text.csv", sep = ",") : Not positioned correctly after testing format of header row. ch=',' In addition: Warning message: In fread("text.csv", sep = ",") : Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: id, text 

read.table() works slightly better, but is too slow and too inefficient.

 > read.table("text.csv", header = TRUE, sep=",") id text 1 1 "Oops",\\"The","Georgia" 

I understand that my text file is not formatted correctly, but it is too large for editing in practice.

Any help is greatly appreciated.

EDIT:

A small selection of actual data records:

 sample1.txt, a good record: materiale_id,dk5,description,creator,subject-phrase,title,type 125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music > fread("sample1.txt", sep=",") materiale_id dk5 description creator subject-phrase 1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin NA title type 1: Koncert i Copenhagen Jazz House den 26.1.1995 music sample2.txt, a good and a bad record: materiale_id,dk5,description,creator,subject-phrase,title,type 125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music 150012-leksikon:100019,,"Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...",,"[""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]",it - elementer i databehandling,article > fread("sample2.txt", sep=",") Empty data.table (0 rows) of 11 cols: 150012-leksikon:100019,V2,Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...,V4,[""Informatik","it"... 

EDIT 2:

Update to version 3.2.3 and data. Table 1.9.6. helps with this, but creates problems with other entries:

 sample3.txt, a good and a bad record: materiale_id,dk5,description,creator,subject-phrase,title,type 125030-katalog:000236595,,,Red Tampa Solist prf,"[""Tom"",""Georgia"",""1929-1930""]","Georgia Tom, 1929-1930",music 125030-katalog:000236596,,,Jane Lucas (Solist),"[""1928-1931"",""Tom,\""The"",""Georgia"",""Accompanist""]","Georgia Tom,""The Accompanist"" (1928-1931)",music > s3 <- fread("sample3.txt", sep=",") Error in fread("sample3.txt", sep = ",") : Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread. 

EDIT 3:

Updating version 1.9.7 of the data tables breaks fread() as a whole:

 > s3 <- fread("sample3.txt", sep=",") Error in fread("sample3.txt", sep = ",") : showProgress is not type integer but type 'logical'. Please report. 

EDIT 4:

It seems like the problem in my file is when the entries contain the string \\" (litteraly, not regexp). Apparently, one backslash is too much, which leads to the fact that fread() does not correctly interpret the double quote as end of line where it should be taken litteraly.

My best solutions are currently designed for this:

 m1 <- readLines("data.csv", encoding="UTF-8") m2 <- gsub("\\\\\"", "\\\"", m1) writeLines(m2, "data_new.csv", useBytes = TRUE) m3 <- fread("data_new.csv", encoding="UTF-8", sep=",") 

It seems to work.

I do not understand this 100%, so any clarification is more than welcome.

+5
source share
1 answer

Not a data.table solution, but you can try:

 # read the file with 'readLines' tmp <- readLines("trl.txt") # create a column name vector of the first line nms <- trimws(strsplit(tmp[1],',')[[1]]) # convert 'tmp' to a dataframe except the first line tmp <- as.data.frame(tmp[-1]) # use 'separate' from 'tidyr' to split into two columns library(tidyr) df1 <- separate(tmp, "tmp[-1]", nms, sep=",", extra = "merge") 

which gives:

 > df1 id text 1 1 """Oops"",\\""The"",""Georgia""" 

Update for editing 1 . With the new data example, fread seems to usually read the data:

 > s1 <- fread("sample1.txt", sep=",") > s1 materiale_id dk5 description creator subject-phrase title type 1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin NA Koncert i Copenhagen Jazz House den 26.1.1995 music > s2 <- fread("sample2.txt", sep=",") > s2 materiale_id dk5 1: 125030-katalog:000000003 [78.793] 2: 150012-leksikon:100019 description 1: Privatoptagelse. - Liveoptagelse 2: Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med... creator subject-phrase 1: Frederik Lundin 2: [""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""] title type 1: Koncert i Copenhagen Jazz House den 26.1.1995 music 2: it - elementer i databehandling article 

Update for editing 2 and 3:

When viewing an error message:

Error in fread("sample3.txt", sep = ",") : Waiting for 7 columns, but line 3 contains text after processing all the columns. It is very likely that this is due to one or more fields having embedded sep=',' and / or (unescaped) '\ n' in unbalanced quotes. fread cannot handle such ambiguous cases, and these lines may not have been read as expected. Please read the section on quotes in ?fread .

and then when you look at the second row of sample3.txt , you will see that the fourth column also contains a comma. You can solve this in three steps:

1: Read the file with readLines and replace the open and close character of the fourth column with another quotation mark:

 r3 <- readLines("sample3.txt") r3 <- gsub('\"[',"'",r3,fixed=TRUE) r3 <- gsub(']\"',"'",r3,fixed=TRUE) 

2: Write it to a text file:

  writeLines(r3, "sample3-1.txt") 

3: Now you can read it with fread (or read.table / read.csv ). Since the number of column headers does not match the number of columns, you will need to use header = FALSE . Also explicitly set the quotation mark to the new quotation mark as it was inserted in step 2:

 s3 <- fread("sample3-1.txt", quote = "\'", header = FALSE, skip = 1) 

which gives:

 > s3 V1 V2 V3 V4 V5 V6 V7 V8 1: 125030-katalog:000236595 NA NA Red Tampa Solist prf ""Tom"",""Georgia"",""1929-1930"" "Georgia Tom 1929-1930" music 2: 125030-katalog:000236596 NA NA Jane Lucas (Solist) ""1928-1931"",""Tom,\\""The"",""Georgia"",""Accompanist"" "Georgia Tom ""The Accompanist"" (1928-1931)" music 

After that, you can assign the column names as follows:

 names(s3) <- c("character","vector","with","eight","column","names") 

NOTE. I used the fairly recent version (two weeks) of version v.1.9.7 for this

+2
source

Source: https://habr.com/ru/post/1243845/


All Articles