Mapping Avro files to a Java class with different field names

I have a problem with a simple Spark job that reads an Avro file and then saves it as a Hive Parquet table.

I have two types of files. In general they are the same, but the key structure differs slightly: the field names are different.

Type 1

root
 |-- pk: struct (nullable = true)
 |    |-- term_id: string (nullable = true)

Type 2

root
 |-- pk: struct (nullable = true)
 |    |-- id: string (nullable = true)

I read the Avro files with spark-avro and then map the resulting DataFrame to a bean as follows:

Dataset<SomeClass> df = avroDF.as(Encoders.bean(SomeClass.class));

SomeClass is a simple single-field class with a getter and a setter.

public class SomeClass{
    private String term_id;
    ...
}

Reading Avro type 1 works fine, but reading Avro type 2 fails with an error. The reverse happens if I change the field to private String id;

Is there a universal solution to my problem? I found @AvroName, but it does not allow setting multiple names. Thanks.


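// Needs org.apache.spark.api.java.function.MapFunction and RowFactory/RowEncoder imports.
// Add a top-level "id" column and fill it with the first field of the "pk" struct.
StructType avroExtendedSchema = avroDF.schema().add("id", DataTypes.StringType);
avroDF = avroDF.map(
        (MapFunction<Row, Row>) row -> RowFactory.create(row.getStruct(0), row.getStruct(0).getString(0)),
        RowEncoder.apply(avroExtendedSchema)).toDF();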

After this, the DF has an "id" column. The "pk" column can then be dropped:

avroDF = avroDF.drop("pk");

PS: if the key has a different type, for example:

root
 |-- pk: struct (nullable = true)
 |    |-- id: int (nullable = true)

then the type can be taken from the schema:

// Take the key's data type from the schema instead of hard-coding it.
DataType keyType = avroDF.select("pk.*").schema().fields()[0].dataType();
StructType avroExtendedSchema = avroDF.schema().add("id", keyType);
avroDF = avroDF.map(
        (MapFunction<Row, Row>) row -> RowFactory.create(row.getStruct(0), row.getStruct(0).get(0)),
        RowEncoder.apply(avroExtendedSchema)).drop("pk").toDF();

\String.


Alternatively, you can rename the columns before mapping. For example:

// One new name per DataFrame column, in column order.
String[] newNames = {"id", "x1", "x2", "x3"};
Dataset<SomeClass> df = avroDF.toDF(newNames).as(Encoders.bean(SomeClass.class));

Note that toDF needs exactly one name per dataframe column, and the new names must match the fields of the BeanClass.
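If the set of possible key names is known in advance, the choice of which name to rename can be made from the schema rather than by position. Below is a minimal plain-Java sketch of that lookup. KeyFieldResolver and its resolve method are hypothetical names introduced here for illustration, not part of Spark or spark-avro, and the alias list ("term_id", "id") is simply taken from the two file types in the question.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper (not a Spark API): finds which key name a given file uses. */
final class KeyFieldResolver {

    private KeyFieldResolver() {}

    /**
     * Returns the first accepted alias that occurs among the struct's field names,
     * or an empty Optional when none of the aliases is present.
     */
    static Optional<String> resolve(List<String> structFieldNames, List<String> acceptedAliases) {
        for (String alias : acceptedAliases) {
            if (structFieldNames.contains(alias)) {
                return Optional.of(alias);
            }
        }
        return Optional.empty();
    }
}
```

In a Spark job, the field names could be obtained with avroDF.select("pk.*").columns(), and the resolved name could then be used in something like avroDF.select(functions.col("pk." + name).as("term_id")) before applying Encoders.bean(SomeClass.class). This assumes the key struct is called "pk" as in the question; I have not run it against the real files.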


Source: https://habr.com/ru/post/1692860/

