Reading csv files with quoted fields containing embedded commas

I read the CSV file in PySpark as follows:

df_raw=spark.read.option("header","true").csv(csv_path)

However, the data file contains quoted fields with embedded commas, which should not be treated as delimiters. How can I handle this in PySpark? I know pandas can handle this, but can Spark? The version I am using is Spark 2.0.0.

Here is an example that works in pandas but does not work with Spark:

In [1]: import pandas as pd

In [2]: pdf = pd.read_csv('malformed_data.csv')

In [3]: sdf=spark.read.format("org.apache.spark.csv").csv('malformed_data.csv',header=True)

In [4]: pdf[['col12','col13','col14']]
Out[4]:
                    col12                                             col13  \
0  32 XIY "W"   JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE
1                     NaN                     OUTKAST#THROOTS~WUTANG#RUNDMC

   col14
0   23.0
1    0.0

In [5]: sdf.select("col12","col13",'col14').show()
+------------------+--------------------+--------------------+
|             col12|               col13|               col14|
+------------------+--------------------+--------------------+
|"32 XIY ""W""   JK|              RE LK"|SOMETHINGLIKEAPHE...|
|              null|OUTKAST#THROOTS~W...|                 0.0|
+------------------+--------------------+--------------------+

File contents:

    col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W""   JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE
+16
3 answers

I noticed that your problematic line has escaping that uses double quotes themselves:

"32 XIY" "W" "JK, RE LK"

,

32 XIY "W" JK, RE LK

As described in RFC-4180, page 2:

  7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

That is what Excel does by default, for example.
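
To see this escaping rule in isolation, here is a minimal sketch using Python's standard csv module, which follows the doubled-quote convention by default (the sample value is taken from the file above):

    import csv
    import io

    # RFC-4180: a doubled quote inside a quoted field is a literal quote,
    # and the comma inside the quotes is not a delimiter.
    raw = '"32 XIY ""W""   JK, RE LK",23.0'
    row = next(csv.reader(io.StringIO(raw)))
    print(row)  # ['32 XIY "W"   JK, RE LK', '23.0']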

Spark, however (as of Spark 2.1), does escaping in a non-RFC way by default, using the backslash (\). To fix this, you have to explicitly tell Spark to use the double quote as the escape character:

.option('quote', '"')
.option('escape', '"')

This may explain why the comma character was not interpreted correctly: it sat inside a quoted column.
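
Applied to the file from the question, a minimal PySpark sketch (it assumes a SparkSession named spark is available, as in the question; df_fixed is just an illustrative name):

    # Tell Spark to treat a doubled quote as the escape character
    # (RFC-4180 style) instead of the default backslash.
    df_fixed = (spark.read
        .option("header", "true")
        .option("quote", '"')
        .option("escape", '"')
        .csv("malformed_data.csv"))

    # col12 should now come back as a single field: 32 XIY "W"   JK, RE LK
    df_fixed.select("col12", "col13", "col14").show(truncate=False)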

The options for Spark's CSV format are not documented well on the Apache Spark site, but here is a bit older documentation that I still find useful quite often:

https://github.com/databricks/spark-csv

Update, August 2018: Spark 3.0 might change this behavior to be RFC-compliant. See SPARK-22236 for details.

+25

For anyone doing this in Scala: Tagar's answer above nearly worked for me (thank you!); all I had to do was escape the double quote when setting my option param:

.option("quote", "\"")
.option("escape", "\"")

I am using Spark 2.3, so I can confirm that Tagar's solution still works the same under the new release.

+17

A delimiter (comma) that appears inside quotes is ignored by default. Spark SQL has a built-in CSV reader as of Spark 2.0:

df = session.read \
    .option("header", "true") \
    .csv("csv/file/path")

More about the CSV reader here.

+2