Sparklyr ignores line separator

Question

I am trying to read a CSV file of about 2 GB (roughly 5 million lines) in sparklyr with:

bigcsvspark <- spark_read_csv(
  sc, "bigtxt", "path",
  delimiter = "!",
  infer_schema = FALSE,
  memory = TRUE,
  overwrite = TRUE,
  columns = list( SUPPRESSED COLUMNS AS = 'character')
)

I get the following error:

 Job aborted due to stage failure: Task 9 in stage 15.0 failed 4 times, most recent failure:
 Lost task 9.3 in stage 15.0 (TID 3963, 10.1.4.16):
 com.univocity.parsers.common.TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000).
 Identified line separator characters in the parsed content. This may be the cause of the error.
 The line separator in your parser settings is set to '\n'.
 Parsed content: ---lines of my csv---[\n] ---beginning of a split line ---
 Parser Configuration: CsvParserSettings: ... default settings ...

and

 CsvFormat:
   Comment character=\0
   Field delimiter=!
   Line separator (normalized)=\n
   Line separator sequence=\n
   Quote character="
   Quote escape character=\
   Quote escape escape character=null
 Internal state when error was thrown: line=10599, column=6, record=8221, charIndex=4430464, headers=[---SUPPRESSED HEADER---], content parsed=---more lines without the delimiter.---

As shown above, at some point the line separator starts being ignored. In plain R the file reads without problems: read.csv with just the path and the delimiter is enough.

r csv sparklyr

Jader martins Oct 13 '17 at 19:01

1 answer

edgararuiz · Accepted Answer · 2017-10-20T01:56:04+0000

It looks like the file is not really a CSV, so I wonder whether spark_read_text() would work better in this situation. You should be able to load all the lines into Spark and then split the lines into fields in memory. That last part will be the trickiest.
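
To make the "read as text, then split" idea concrete, here is a minimal sketch in base R on a tiny made-up file. The sample data, the column names and the variable names are all assumptions for illustration, not taken from the question. It only shows the shape of the approach: read raw lines with no CSV or quote handling, split on the literal "!" delimiter, and flag lines whose field count does not match the header.

```r
# Tiny stand-in for the real file (hypothetical data): "!"-delimited,
# with one malformed line that has too few fields.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id!name!city",
             "1!Ana!Recife",
             "2!Bruno",
             "3!Carla!Natal"), tmp)

# Read every line as plain text: no CSV parsing, no quote handling.
raw_lines <- readLines(tmp)

# Split each line on the literal delimiter (fixed = TRUE disables regex).
# Note: strsplit() drops a trailing empty field, e.g. "a!b!" gives 2 fields.
parts  <- strsplit(raw_lines, "!", fixed = TRUE)
header <- parts[[1]]
rows   <- parts[-1]

# File line numbers (header is line 1) whose field count is unexpected.
n_fields  <- lengths(rows)
bad_lines <- which(n_fields != length(header)) + 1L

# Keep only well-formed rows and assemble an all-character data frame.
good_rows <- rows[n_fields == length(header)]
df <- as.data.frame(do.call(rbind, good_rows), stringsAsFactors = FALSE)
names(df) <- header

print(bad_lines)
print(df)

# Assumed sparklyr analogue (not run here; requires a Spark connection `sc`,
# and assumes spark_read_text() yields a single string column named `line`):
#   raw   <- spark_read_text(sc, "bigtxt_raw", "path")
#   parts <- dplyr::mutate(raw, fields = split(line, "!"))
#   wide  <- sdf_separate_column(parts, "fields", into = header)
```

On the real file, the field-count check is probably the useful part: lines that come back with the wrong number of fields are the likely candidates for whatever is confusing the CSV parser.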
