I have data files (JSON in this example, but they could also be Avro) written in a directory structure, for example:
dataroot
+-- <year>
    +-- <month>
        +-- <day>
            +-- <data files>
Using spark-sql, I create a temporary table:
CREATE TEMPORARY TABLE dataTable
USING org.apache.spark.sql.json
OPTIONS (
path "dataroot/*"
)
Querying the table works well, but I still cannot use the directories to prune what is read.
Is there a way to register the directory structure as partitions (without using Hive) so that the entire tree is not scanned on every query? Say I want to compare the data for the first day of each month and read only the directories for those days.
With Apache Drill, I can use directories as predicates in a query via the dir0, dir1, ... pseudo-columns. Is it possible to do something similar with Spark SQL?