The semantics are determined from the metadata. We will need the following imports:
    import org.apache.parquet.hadoop.ParquetFileReader
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.conf.Configuration
some example data:
    val path = "/tmp/ts"

    Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
      .withColumn("ts", $"ts".cast("timestamp"))
      .write.mode("overwrite").parquet(path)
and a Hadoop configuration:
    val conf = spark.sparkContext.hadoopConfiguration
    val fs = FileSystem.get(conf)
Now we can access the Spark metadata:
    ParquetFileReader
      .readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
      .get(0)
      .getParquetMetadata
      .getFileMetaData
      .getKeyValueMetaData
      .get("org.apache.spark.sql.parquet.row.metadata")
and the result:
String = {"type":"struct","fields: [ {"name":"id","type":"integer","nullable":false,"metadata":{}}, {"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
Equivalent information can also be stored in the metastore.
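For instance, writing the same data with saveAsTable records the schema, including the timestamp type, in the catalog. A minimal sketch (the table name ts_table is chosen here for illustration):

    // Write the same data as a managed table; the schema ends up in the metastore.
    Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
      .withColumn("ts", $"ts".cast("timestamp"))
      .write.mode("overwrite").saveAsTable("ts_table")

    // The catalog reports the same type information:
    spark.table("ts_table").printSchema()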
According to the official documentation, interpreting INT96 data as timestamps is done to ensure compatibility with Hive and Impala:
Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
and can be controlled using the spark.sql.parquet.int96AsTimestamp property.
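A minimal sketch of setting it explicitly when building the session (the application name is arbitrary; the flag defaults to true):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("int96-as-timestamp") // arbitrary name
      .config("spark.sql.parquet.int96AsTimestamp", "true") // default: true
      .getOrCreate()

    // The current value can be inspected at runtime:
    spark.conf.get("spark.sql.parquet.int96AsTimestamp")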