Reading CSV with line breaks in pyspark

I want to read with pyspark a "legal" CSV (one that follows RFC 4180) which has line breaks (CRLFs) inside some quoted fields. The screenshot below shows what such a record looks like when the file is opened in Notepad++:

[Screenshot: the CSV opened in Notepad++, showing a record whose quoted field contains a CRLF in the middle]
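Since the screenshot is not reproduced here, a minimal example of such a file (hypothetical column names and values) would look like this:

id,name,comment
1,Alice,"first line
second line of the same field"
2,Bob,"a single line"

Physically this is four lines, but per RFC 4180 it is a header plus two records, because the CRLF after "first line" sits inside a quoted field.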

I am trying to read it using sqlCtx.read.load with format='com.databricks.spark.csv', and the resulting DataFrame shows two rows instead of one for these records. I am using Spark version 2.1.0.2.

Is there a command or an alternative way to read the CSV so that these two physical lines are parsed as a single row?

1 answer

You can use "csv" as the format instead of the Databricks CSV reader; in Spark 2.x the latter just redirects to the built-in reader anyway. But that is only a side note :)
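For illustration, a sketch assuming a SparkSession named spark and a file called file.csv:

# the built-in reader; in Spark 2.x this is what
# .format("com.databricks.spark.csv") resolves to as well
df = spark.read.format("csv").option("header", "true").load("file.csv")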

In Spark 2.2 a new option was added for this: multiLine (it was proposed as wholeFile during development). If you write this:

spark.read.option("multiLine", "true").csv("file.csv")

it will parse records that span multiple lines, so quoted fields containing CRLFs end up in a single row.
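A slightly fuller sketch, assuming the file has a header row and escapes embedded quotes by doubling them, as RFC 4180 prescribes:

df = (spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .option("quote", '"')
      .option("escape", '"')   # treat "" inside a quoted field as an escaped quote
      .csv("file.csv"))
df.show(truncate=False)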

In Spark 2.1 there is no such option. You can read the file with sparkContext.wholeTextFiles and parse it yourself, or just upgrade to the newer version.
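A possible workaround sketch for Spark 2.1, assuming Python 3, a SparkContext named sc, that each file's contents fit in executor memory, and the hypothetical column names from the example above:

import csv
import io

def parse_csv(path_and_content):
    # csv.reader correctly handles quoted fields that contain newlines
    _, content = path_and_content
    return list(csv.reader(io.StringIO(content)))

rows = sc.wholeTextFiles("file.csv").flatMap(parse_csv)
header = rows.first()
df = rows.filter(lambda r: r != header).toDF(header)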


Source: https://habr.com/ru/post/1685674/

