How to load a specific Hive partition into a DataFrame in Spark 1.6?

From Spark 1.6 onwards, according to the official docs, we cannot add specific Hive partitions to a DataFrame.

Up to Spark 1.5 the following used to work, and the DataFrame would contain the entity partition column along with the data, as shown below -

DataFrame df = hiveContext.read().format("orc").load("path/to/table/entity=xyz") 

However, this will not work in Spark 1.6.

If I give the base path as follows, the resulting DataFrame does not contain the entity column that I want, as shown below -

 DataFrame df = hiveContext.read().format("orc").load("path/to/table/") 

How can I load a specific Hive partition into a DataFrame? What was the motivation behind removing this feature?

I believe it was efficient. Is there an alternative way to achieve this in Spark 1.6?

As I understand it, Spark 1.6 loads all the partitions, and if I then filter for specific partitions it is inefficient: it hits memory and throws GC (garbage collection) errors, because thousands of partitions get loaded into memory rather than just the one specific partition, as in the sketch below.
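
For comparison, here is a minimal sketch of that load-everything-then-filter approach (Spark 1.6 Java API, assuming an existing hiveContext and the hypothetical paths and entity column from the question):

 // Sketch of the inefficient workaround described above: partition discovery
 // has to scan every entity=* directory under the table root before the
 // filter is applied.
 DataFrame all = hiveContext.read().format("orc").load("path/to/table/");
 DataFrame xyzOnly = all.filter(all.col("entity").equalTo("xyz"));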

Please guide. Thanks in advance.

1 answer

To load a specific partition into a DataFrame using Spark 1.6, we first have to set the basePath option and then give the path of the partition that needs to be loaded, as shown below -

 DataFrame df = hiveContext.read().format("orc")
     .option("basePath", "path/to/table/")
     .load("path/to/table/entity=xyz")

Thus, the above code will load only the specific partition into the DataFrame.
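
As a quick check, a minimal sketch (reusing the hiveContext and the hypothetical path and entity column from the question) that confirms the partition column survives the load:

 // Minimal sketch, assuming an existing HiveContext named hiveContext and
 // the hypothetical table layout path/to/table/entity=xyz from the question.
 DataFrame df = hiveContext.read()
     .format("orc")
     .option("basePath", "path/to/table/")  // table root: keeps partition columns
     .load("path/to/table/entity=xyz");     // only this partition's files are read
 df.printSchema();  // schema should now include the "entity" partition column
 df.show();

Setting basePath tells Spark where partition discovery starts, so the directory name entity=xyz is parsed back into a column instead of being treated as just part of the file path.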

