Skipping fields in a record using spark-avro

Update: spark-avro has been updated to support this scenario. https://github.com/databricks/spark-avro/releases/tag/v3.1.0

I have an Avro file, created by a third party outside my control, that I need to process using Spark. The Avro schema is a record in which one of the fields is a mixed union type:

{ "name" : "Properties", "type" : { "type" : "map", "values" : [ "long", "double", "string", "bytes" ] } 

This is not supported by the spark-avro reader:

In addition to the types listed above, it supports reading three kinds of union types: union(int, long), union(float, double), and union(something, null), where something is one of the supported Avro types listed above or is one of the supported union types.

Given Avro schema evolution and resolution, I expected to be able to read the file while skipping the problematic field, by specifying a different reader schema that omits this field. According to the Avro schema resolution documentation, this should work:

if the writer's record contains a field with a name not present in the reader's record, the writer's value for that field is ignored.

So I read the file with:

  val df = sqlContext.read.option("avroSchema", avroSchema).avro(path) 

where avroSchema is the same schema the writer used, but without the problematic field.
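
To illustrate what I mean (the Event record and the Id field below are made up, since I can't share the real schema), if the writer's schema were:

  {
    "type" : "record",
    "name" : "Event",
    "fields" : [
      { "name" : "Id", "type" : "string" },
      { "name" : "Properties", "type" : { "type" : "map", "values" : [ "long", "double", "string", "bytes" ] } }
    ]
  }

then the reader schema I pass is the same record with the Properties field removed:

  {
    "type" : "record",
    "name" : "Event",
    "fields" : [
      { "name" : "Id", "type" : "string" }
    ]
  }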

But I still get the same error about mixed union types.

Is this schema evolution scenario supported by Avro? By spark-avro? Is there another way to achieve my goal?


Update: I tested the same scenario (same file) with Apache Avro 1.8.1, and it works as expected. So the problem must be specific to spark-avro. Any ideas?
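
For reference, this is roughly how I tested it with plain Avro (a minimal sketch; the file names and the reader schema file are placeholders):

  import java.io.File
  import org.apache.avro.Schema
  import org.apache.avro.file.DataFileReader
  import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

  // The reader schema: the writer's record schema minus the Properties field.
  val readerSchema = new Schema.Parser().parse(new File("reader.avsc"))

  // With only a reader schema given, DataFileReader takes the writer's schema
  // from the file header and resolves the two, ignoring fields that are
  // missing from the reader schema.
  val datumReader = new GenericDatumReader[GenericRecord](readerSchema)
  val fileReader = DataFileReader.openReader(new File("data.avro"), datumReader)
  while (fileReader.hasNext) {
    val record = fileReader.next()
    // record contains only the fields of the reader schema
  }
  fileReader.close()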

1 answer

Update: spark-avro has been updated to support this scenario. https://github.com/databricks/spark-avro/releases/tag/v3.1.0

This does not actually answer my question, but here is another solution to the same problem.

Since spark-avro does not currently have this feature (see my comment on the question), I used the official Avro org.apache.avro.mapreduce API with Spark's newAPIHadoopFile instead. Here is a simple example:

  import org.apache.avro.generic.GenericRecord
  import org.apache.avro.mapred.AvroKey
  import org.apache.avro.mapreduce.AvroKeyInputFormat
  import org.apache.hadoop.io.NullWritable
  import org.apache.spark.{SparkConf, SparkContext}

  val path = "..."
  val conf = new SparkConf().setAppName("avro test")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  val sc = new SparkContext(conf)

  // Read the file as an RDD of (AvroKey[GenericRecord], NullWritable) pairs.
  val avroRdd = sc.newAPIHadoopFile(path,
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable])
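
To actually skip the problematic field, the trimmed reader schema can be passed through the Hadoop job configuration, where AvroKeyInputFormat picks it up and applies Avro schema resolution. A sketch, continuing from the snippet above (readerSchemaJson and the Id field are placeholders, not from my real schema):

  import org.apache.avro.Schema
  import org.apache.avro.mapreduce.AvroJob
  import org.apache.hadoop.mapreduce.Job

  // Parse the reader schema (the writer's schema minus the Properties field).
  val readerSchema = new Schema.Parser().parse(readerSchemaJson)

  // Store the reader schema in the job configuration for AvroKeyInputFormat.
  val job = Job.getInstance()
  AvroJob.setInputKeySchema(job, readerSchema)

  val resolvedRdd = sc.newAPIHadoopFile(path,
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    job.getConfiguration)

  // Hadoop input formats reuse record instances, so extract the needed
  // values before collecting or caching the RDD.
  val ids = resolvedRdd.map { case (key, _) => key.datum().get("Id").toString }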

In contrast to spark-avro, the official Avro libraries support mixed union types and schema evolution.

