Spark 2.0 implicit encoder: deal with a missing column when the type is Option[Seq[String]] (Scala)

I am having some trouble encoding data when some columns of type Option[Seq[String]] are missing from our data source. Ideally, I would like the missing column data to be filled with None.

Scenario:

We have several parquet files that we read, which contain column1 but not column2.

We load the data from these parquet files into a Dataset and cast it as MyType.

case class MyType(column1: Option[String], column2: Option[Seq[String]])

sqlContext.read.parquet("dataSource.parquet").as[MyType]

org.apache.spark.sql.AnalysisException: cannot resolve '`column2`' given input columns: [column1];

Is there a way to create a dataset with column2 data like None?

1 answer

In simple cases, you can provide an initial schema that is a superset of the schema actually present in the files. For example, in your case:

val schema = Seq[MyType]().toDF.schema

Seq("a", "b", "c").map(Option(_))
  .toDF("column1")
  .write.parquet("/tmp/column1only")

val df = spark.read.schema(schema).parquet("/tmp/column1only").as[MyType]
df.show
+-------+-------+
|column1|column2|
+-------+-------+
|      a|   null|
|      b|   null|
|      c|   null|
+-------+-------+
df.first
MyType = MyType(Some(a),None)
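
As an aside (an equivalence I am assuming, not something stated in the answer): instead of building an empty Seq and converting it to a DataFrame just to extract the schema, you can derive the same StructType directly from the case class via the product encoder:

import org.apache.spark.sql.Encoders

// Derives the StructType for MyType from its case class fields,
// without constructing an empty DataFrame first.
val schema = Encoders.product[MyType].schema

Both approaches go through the same encoder machinery, so the resulting schema should be identical.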

This approach can be a little fragile, so in general it is better to use SQL literals to fill in the blanks:

import org.apache.spark.sql.functions.lit

spark.read.parquet("/tmp/column1only")
  // or ArrayType(StringType)
  .withColumn("column2", lit(null).cast("array<string>"))
  .as[MyType]
  .first
MyType = MyType(Some(a),None)
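
If you have to do this for many case classes, the withColumn approach generalizes. The sketch below is my own (the helper name asWithDefaults is hypothetical, not part of Spark): it compares the encoder's schema against the columns actually present and adds typed null columns for anything missing before converting to a Dataset.

import org.apache.spark.sql.{DataFrame, Dataset, Encoder}
import org.apache.spark.sql.functions.lit

// Add typed null columns for any fields of T that the DataFrame lacks,
// then convert to Dataset[T]. Option fields decode null as None.
def asWithDefaults[T: Encoder](df: DataFrame): Dataset[T] = {
  val expected = implicitly[Encoder[T]].schema
  val present  = df.columns.toSet
  val patched = expected.fields
    .filterNot(f => present.contains(f.name))
    .foldLeft(df) { (acc, f) =>
      acc.withColumn(f.name, lit(null).cast(f.dataType))
    }
  patched.as[T]
}

val ds = asWithDefaults[MyType](spark.read.parquet("/tmp/column1only"))

Note this only patches top-level columns; nested missing fields inside structs would need a deeper traversal.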

Source: https://habr.com/ru/post/1665668/
