Reading csv files with quoted fields containing embedded commas

I read the CSV file in PySpark as follows:

df_raw=spark.read.option("header","true").csv(csv_path)

However, the data file contains quoted fields with embedded commas, which should not be treated as delimiters. How can I handle this in PySpark? I know pandas can handle this, but can Spark? The version I am using is Spark 2.0.0.

Here is an example that works in pandas but does not work with Spark:

In [1]: import pandas as pd

In [2]: pdf = pd.read_csv('malformed_data.csv')

In [3]: sdf=spark.read.format("org.apache.spark.csv").csv('malformed_data.csv',header=True)

In [4]: pdf[['col12','col13','col14']]
Out[4]:
                    col12                                             col13  \
0  32 XIY "W"   JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE
1                     NaN                     OUTKAST#THROOTS~WUTANG#RUNDMC

   col14
0   23.0
1    0.0

In [5]: sdf.select("col12","col13",'col14').show()
+------------------+--------------------+--------------------+
|             col12|               col13|               col14|
+------------------+--------------------+--------------------+
|"32 XIY ""W""   JK|              RE LK"|SOMETHINGLIKEAPHE...|
|              null|OUTKAST#THROOTS~W...|                 0.0|
+------------------+--------------------+--------------------+

File contents:

    col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W""   JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE
+16
3 answers

I noticed that your problematic line has escaping that uses double quotes themselves:

"32 XIY" "W" "JK, RE LK"

,

32 XIY "W" JK, RE LK

As described in RFC-4180, page 2:

  7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

That is what Excel does by default, for example.
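
To see this escaping rule in isolation, here is a minimal sketch using Python's standard csv module, which follows the doubled-quote convention by default (the sample value is taken from the file above):

    import csv
    import io

    # RFC-4180: a doubled quote inside a quoted field is a literal quote,
    # and the comma inside the quotes is not a delimiter.
    raw = '"32 XIY ""W""   JK, RE LK",23.0'
    row = next(csv.reader(io.StringIO(raw)))
    print(row)  # ['32 XIY "W"   JK, RE LK', '23.0']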

Spark, however (as of Spark 2.1), does escaping in a non-RFC way by default, using the backslash (\). To fix this, you have to explicitly tell Spark to use the double quote as the escape character:

.option('quote', '"')
.option('escape', '"')

This may explain why the comma character was not interpreted correctly: it sat inside a quoted column.
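
Applied to the file from the question, a minimal PySpark sketch (it assumes a SparkSession named spark is available, as in the question; df_fixed is just an illustrative name):

    # Tell Spark to treat a doubled quote as the escape character
    # (RFC-4180 style) instead of the default backslash.
    df_fixed = (spark.read
        .option("header", "true")
        .option("quote", '"')
        .option("escape", '"')
        .csv("malformed_data.csv"))

    # col12 should now come back as a single field: 32 XIY "W"   JK, RE LK
    df_fixed.select("col12", "col13", "col14").show(truncate=False)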

The options for Spark's CSV format are not documented well on the Apache Spark site, but here is a bit older documentation that I still find useful quite often:

https://github.com/databricks/spark-csv

Update, August 2018: Spark 3.0 might change this behavior to be RFC-compliant. See SPARK-22236 for details.

+25

For anyone doing this in Scala: Tagar's answer above nearly worked for me (thank you!); all I had to do was escape the double quote when setting my option param:

.option("quote", "\"")
.option("escape", "\"")

I am using Spark 2.3, so I can confirm that Tagar's solution still works the same under the new release.

+17

A delimiter (comma) that appears inside quotes is ignored by default. Spark SQL has a built-in CSV reader as of Spark 2.0:

df = session.read \
    .option("header", "true") \
    .csv("csv/file/path")

More about the CSV reader here.

+2