How to parse a CSV that uses ^A (i.e. \001) as a delimiter with spark-csv?

I'm awfully new to Spark, Hive, big data, Scala, and all of it. I'm trying to write a simple function that takes a sqlContext, loads a CSV file from S3, and returns a DataFrame. The problem is that this particular CSV uses the ^A character (i.e. \001) as a delimiter, and the dataset is huge, so I can't just run "s/\001/,/g" over it. Besides, the fields may themselves contain commas or other characters I might otherwise use as a separator.

I know that the spark-csv package I'm using has a delimiter parameter, but I don't know how to set it so that it reads \001 as a single character rather than as the escaped characters 0, 0, and 1. Should I maybe use hiveContext or something else?

2 answers

If you check the GitHub page, spark-csv has a delimiter parameter (as you also noted). Use it like this:

 val df = sqlContext.read
   .format("com.databricks.spark.csv")
   .option("header", "true")      // Use first line of all files as header
   .option("inferSchema", "true") // Automatically infer data types
   .option("delimiter", "\u0001")
   .load("cars.csv")
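As a plain-Scala sanity check (no Spark needed), you can confirm that the string literal "\u0001" really is the single ^A control character, not the four-character sequence backslash-0-0-1 that the question worries about:

```scala
object DelimiterCheck extends App {
  // "\u0001" is the Unicode escape for U+0001 (SOH, i.e. ^A):
  // a single character, so it is valid as a one-character CSV delimiter.
  val delim = "\u0001"
  println(delim.length)          // prints 1
  println(delim.charAt(0).toInt) // prints 1
}
```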

In Spark 2.x, with the built-in CSV reader, use the sep option:

 val df = spark.read
   .option("sep", "\u0001")
   .csv("path_to_csv_files")

Source: https://habr.com/ru/post/1245073/
