Spark INT96 timestamp type

When you create a timestamp column in Spark and save it to Parquet, you get a 12-byte integer column (INT96); from what I understand, the data is split into 6 bytes for the Julian day and 6 bytes for nanoseconds within the day.
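A minimal sketch of decoding one of these 12-byte values, assuming the layout used by INT96 writers such as Impala and Hive (first 8 bytes hold the nanoseconds within the day, last 4 bytes the Julian day number, both little-endian); the helper name decodeInt96 is made up for illustration:

    import java.nio.{ByteBuffer, ByteOrder}
    import java.time.{LocalDate, LocalTime}

    // Decode one raw 12-byte Parquet INT96 timestamp value.
    // Bytes 0-7: nanoseconds within the day, bytes 8-11: Julian day number.
    def decodeInt96(raw: Array[Byte]): (LocalDate, LocalTime) = {
      val buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN)
      val nanosOfDay = buf.getLong   // bytes 0-7
      val julianDay  = buf.getInt    // bytes 8-11
      // 2440588 is the Julian day number of the Unix epoch (1970-01-01)
      (LocalDate.ofEpochDay(julianDay - 2440588L), LocalTime.ofNanoOfDay(nanosOfDay))
    }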

INT96 does not correspond to any Parquet logical type, so the schema in the Parquet file does not indicate anything beyond the column being a 12-byte integer.

My question is, how does Spark know to load such a column as a timestamp, rather than a large integer?

1 answer

The semantics are determined from the metadata. First, the required imports:

    import org.apache.parquet.hadoop.ParquetFileReader
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.conf.Configuration

some example data:

    val path = "/tmp/ts"

    Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
      .withColumn("ts", $"ts".cast("timestamp"))
      .write.mode("overwrite").parquet(path)

and the Hadoop configuration:

    val conf = spark.sparkContext.hadoopConfiguration
    val fs = FileSystem.get(conf)

Now we can access the Spark metadata:

    ParquetFileReader
      .readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
      .get(0)
      .getParquetMetadata
      .getFileMetaData
      .getKeyValueMetaData
      .get("org.apache.spark.sql.parquet.row.metadata")

and the result:

    String = {"type":"struct","fields":[
      {"name":"id","type":"integer","nullable":false,"metadata":{}},
      {"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
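For comparison, the Parquet schema itself, read from the same footer, carries only the physical type; a sketch reusing the objects defined above:

    ParquetFileReader
      .readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
      .get(0)
      .getParquetMetadata
      .getFileMetaData
      .getSchema

For the ts column this prints something like optional int96 ts, so nothing in the Parquet schema itself marks the column as a timestamp.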

Equivalent information can also be stored in the metastore.
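A minimal sketch of that path, assuming a scratch table name ts_table: write the same data with saveAsTable and the timestamp type is then served from the metastore schema rather than from the file footer:

    Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
      .withColumn("ts", $"ts".cast("timestamp"))
      .write.mode("overwrite").saveAsTable("ts_table")

    // the ts column is reported with data_type "timestamp"
    spark.sql("DESCRIBE ts_table").show()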

According to the official documentation, this is used for compatibility with Hive and Impala:

Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.

and can be controlled using the spark.sql.parquet.int96AsTimestamp property.
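For illustration, a minimal sketch of setting the flag explicitly (it defaults to true):

    // When enabled, Spark SQL interprets INT96 Parquet columns as timestamps.
    spark.conf.set("spark.sql.parquet.int96AsTimestamp", "true")

    // equivalently, via SQL:
    spark.sql("SET spark.sql.parquet.int96AsTimestamp=true")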



