Reading CSV with line breaks in pyspark

I want to read with pyspark a "legal" CSV (one that follows RFC 4180) which has line breaks (CRLFs) inside some quoted fields. The screenshot below shows what such a record looks like when the file is opened in Notepad++:

[Screenshot: the CSV opened in Notepad++, showing a record whose quoted field contains a CRLF in the middle]
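Since the screenshot is not reproduced here, a minimal example of such a file (hypothetical column names and values) would look like this:

id,name,comment
1,Alice,"first line
second line of the same field"
2,Bob,"a single line"

Physically this is four lines, but per RFC 4180 it is a header plus two records, because the CRLF after "first line" sits inside a quoted field.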

I am trying to read it using sqlCtx.read.load with format='com.databricks.spark.csv', and the resulting DataFrame shows two rows instead of one for these records. I am using Spark version 2.1.0.2.

Is there a command or an alternative way to read the CSV so that these two physical lines are parsed as a single row?

1 answer

You can use "csv" as the format instead of the Databricks CSV reader; in Spark 2.x the latter just redirects to the built-in reader anyway. But that is only a side note :)
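For illustration, a sketch assuming a SparkSession named spark and a file called file.csv:

# the built-in reader; in Spark 2.x this is what
# .format("com.databricks.spark.csv") resolves to as well
df = spark.read.format("csv").option("header", "true").load("file.csv")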

In Spark 2.2 a new option was added for this: multiLine (it was proposed as wholeFile during development). If you write this:

spark.read.option("multiLine", "true").csv("file.csv")

it will parse records that span multiple lines, so quoted fields containing CRLFs end up in a single row.
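A slightly fuller sketch, assuming the file has a header row and escapes embedded quotes by doubling them, as RFC 4180 prescribes:

df = (spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .option("quote", '"')
      .option("escape", '"')   # treat "" inside a quoted field as an escaped quote
      .csv("file.csv"))
df.show(truncate=False)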

In Spark 2.1 there is no such option. You can read the file with sparkContext.wholeTextFiles and parse it yourself, or just upgrade to the newer version.
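A possible workaround sketch for Spark 2.1, assuming Python 3, a SparkContext named sc, that each file's contents fit in executor memory, and the hypothetical column names from the example above:

import csv
import io

def parse_csv(path_and_content):
    # csv.reader correctly handles quoted fields that contain newlines
    _, content = path_and_content
    return list(csv.reader(io.StringIO(content)))

rows = sc.wholeTextFiles("file.csv").flatMap(parse_csv)
header = rows.first()
df = rows.filter(lambda r: r != header).toDF(header)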


Source: https://habr.com/ru/post/1685674/

