I have data files (JSON in this example, but they could also be Avro) written in a directory structure, for example:
dataroot
+-- <year>
    +-- <month>
        +-- <day>
            +-- <data files>
Using spark-sql, I create a temporary table:
CREATE TEMPORARY TABLE dataTable
USING org.apache.spark.sql.json
OPTIONS (
path "dataroot/*"
)
Querying the table works well, but I still cannot use the directories to prune what is read.
Is there a way to register the directory structure as partitions (without using Hive) so that the entire tree is not scanned on every query? Say I want to compare the data for the first day of each month and read only the directories for those days.
With Apache Drill, I can use directories as predicates in a query via the dir0, dir1, ... pseudo-columns. Is it possible to do something similar with Spark SQL?