How to parse a CSV that uses ^A (i.e. \001) as a delimiter with spark-csv?

I'm awfully new to Spark, Hive, big data, Scala, and all of it. I'm trying to write a simple function that takes a sqlContext, loads a CSV file from S3, and returns a DataFrame. The problem is that this particular CSV uses the ^A character (i.e. \001) as a delimiter, and the dataset is huge, so I can't just run "s/\001/,/g" over it. Besides, the fields may themselves contain commas or other characters I might otherwise use as a separator.

I know that the spark-csv package I'm using has a delimiter parameter, but I don't know how to set it so that it reads \001 as a single character rather than as the escaped characters 0, 0, and 1. Should I maybe use hiveContext or something else?

2 answers

If you check the GitHub page, spark-csv has a delimiter parameter (as you also noted). Use it like this:

 val df = sqlContext.read
   .format("com.databricks.spark.csv")
   .option("header", "true")      // Use first line of all files as header
   .option("inferSchema", "true") // Automatically infer data types
   .option("delimiter", "\u0001")
   .load("cars.csv")
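As a plain-Scala sanity check (no Spark needed), you can confirm that the string literal "\u0001" really is the single ^A control character, not the four-character sequence backslash-0-0-1 that the question worries about:

```scala
object DelimiterCheck extends App {
  // "\u0001" is the Unicode escape for U+0001 (SOH, i.e. ^A):
  // a single character, so it is valid as a one-character CSV delimiter.
  val delim = "\u0001"
  println(delim.length)          // prints 1
  println(delim.charAt(0).toInt) // prints 1
}
```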

In Spark 2.x, with the built-in CSV reader, use the sep option:

 val df = spark.read
   .option("sep", "\u0001")
   .csv("path_to_csv_files")

Source: https://habr.com/ru/post/1245073/
