I read the CSV file in PySpark as follows:
df_raw=spark.read.option("header","true").csv(csv_path)
However, the data file contains quoted fields with embedded commas, which should not be treated as field delimiters. How can I handle this in PySpark? I know pandas can handle this, but can Spark? I'm using Spark 2.0.0.
Here is an example that works in pandas but does not work with Spark:
In [1]: import pandas as pd
In [2]: pdf = pd.read_csv('malformed_data.csv')
In [3]: sdf=spark.read.format("org.apache.spark.csv").csv('malformed_data.csv',header=True)
In [4]: pdf[['col12','col13','col14']]
Out[4]:
                  col12                     col13  \
0  32 XIY "W" JK, RE LK  SOMETHINGLIKEAPHENOMENON
1                   NaN                   OUTKAST
   col14
0   23.0
1    0.0
In [5]: sdf.select("col12","col13",'col14').show()
+------------------+--------------------+--------------------+
| col12| col13| col14|
+------------------+--------------------+--------------------+
|"32 XIY ""W"" JK| RE LK"|SOMETHINGLIKEAPHE...|
| null|OUTKAST
+------------------+--------------------+--------------------+
File contents:
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W"" JK, RE LK",SOMETHINGLIKEAPHENOMENON
61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST
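One thing that may be worth trying, assuming the file escapes quotes by doubling them (as in `""W""` above): Spark's CSV reader uses a backslash as its default escape character, so it probably does not recognize the doubled quotes and ends the quoted field early, which would explain the split at the embedded comma. Setting both `quote` and `escape` to `"` should make the reader treat `""` as a literal quote and keep the comma inside the field. The sketch below is not verified against Spark 2.0.0 specifically, and the helper name `read_quoted_csv` is hypothetical; `quote` and `escape` are keyword arguments of `DataFrameReader.csv`.

```python
def read_quoted_csv(spark, path):
    """Hypothetical helper: read a CSV whose quoted fields escape quotes by doubling them."""
    return spark.read.csv(
        path,
        header=True,
        quote='"',   # fields containing commas are wrapped in double quotes
        escape='"',  # a doubled quote ("") inside a quoted field is a literal quote
    )

# Usage, assuming an existing SparkSession named `spark`:
# sdf = read_quoted_csv(spark, "malformed_data.csv")
# sdf.select("col12", "col13", "col14").show()
```

The same settings can be passed in the chained style used above, as `.option("quote", "\"").option("escape", "\"")` before `.csv(csv_path)`.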