How to use a double pipe as a delimiter in CSV?

Question

Spark 1.5 and Scala 2.10.6

I have a data file that uses "¦¦" as the delimiter, and I am having a hard time parsing it to create a data frame. Can a multi-character delimiter be used to create a data frame? The code works with a single broken pipe (¦), but not with two.

My code is:

    val customSchema_1 = StructType(Array(
        StructField("ID", StringType, true),
        StructField("FILLER", StringType, true),
        StructField("CODE", StringType, true)));

    val df_1 = sqlContext.read
        .format("com.databricks.spark.csv")
        .schema(customSchema_1)
        .option("delimiter", "¦¦")
        .load("example.txt")

Example file:

 12345¦¦ ¦¦10

scala apache-spark

Sfatima Dec 21 '16 at 17:05

1 answer

evan.oman · Accepted Answer · 2016-12-21T21:50:29+0000

So the actual error that is emitted here is:

 java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ¦¦

The docs confirm this limitation. I also checked the Spark 2.0 CSV reader, and it has the same requirement.

Given all this, if your data is simple enough (that is, no entry itself contains ¦¦), I would load your data like this:

    scala> :pa
    // Entering paste mode (ctrl-D to finish)

    val customSchema_1 = StructType(Array(
        StructField("ID", StringType, true),
        StructField("FILLER", StringType, true),
        StructField("CODE", StringType, true)));

    // Exiting paste mode, now interpreting.

    customSchema_1: org.apache.spark.sql.types.StructType = StructType(StructField(ID,StringType,true), StructField(FILLER,StringType,true), StructField(CODE,StringType,true))

    scala> val rawData = sc.textFile("example.txt")
    rawData: org.apache.spark.rdd.RDD[String] = example.txt MapPartitionsRDD[1] at textFile at <console>:31

    scala> import org.apache.spark.sql.Row
    import org.apache.spark.sql.Row

    scala> val rowRDD = rawData.map(line => Row.fromSeq(line.split("¦¦")))
    rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:34

    scala> val df = sqlContext.createDataFrame(rowRDD, customSchema_1)
    df: org.apache.spark.sql.DataFrame = [ID: string, FILLER: string, CODE: string]

    scala> df.show
    +-----+------+----+
    |   ID|FILLER|CODE|
    +-----+------+----+
    |12345|      |  10|
    +-----+------+----+
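Two details of `String.split` are worth keeping in mind with this approach. Its argument is a regular expression, so an ASCII double pipe ("||") would not be matched literally unless it is quoted. By default it also drops trailing empty fields, which would leave a row with fewer values than the schema has columns. The following is a minimal sketch in plain Scala (no Spark needed) of one way to guard against both. The object `DelimiterSplit` and its methods `splitFields` and `parse` are hypothetical names for illustration, not part of Spark or spark-csv.

```scala
import java.util.regex.Pattern

// Hypothetical helper for illustration; not part of Spark or spark-csv.
object DelimiterSplit {

  // Split a line on a literal, possibly multi-character delimiter.
  // Pattern.quote makes regex metacharacters such as '|' literal, and the
  // limit of -1 keeps trailing empty fields instead of dropping them.
  def splitFields(line: String, delimiter: String): Array[String] =
    line.split(Pattern.quote(delimiter), -1)

  // Return the fields only when the line yields exactly the expected number
  // of columns, so that every accepted row matches the schema.
  def parse(line: String, delimiter: String, expected: Int): Option[Seq[String]] = {
    val fields = splitFields(line, delimiter)
    if (fields.length == expected) Some(fields.toSeq) else None
  }
}
```

Assuming the same `rawData` and schema as above, the row RDD could then presumably be built with something like `rawData.flatMap(line => DelimiterSplit.parse(line, "¦¦", 3)).map(fields => Row.fromSeq(fields))`, which silently skips malformed lines. Whether skipping or failing is the right behavior depends on your data.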
