SparkSQL: reading a Parquet file directly

I am switching from Impala to SparkSQL using the following code to read a table:

my_data = sqlContext.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table') 

How do I invoke SparkSQL here so that it returns the result of a query like:

    select col_A, col_B from my_table
2 answers

After creating the DataFrame from the Parquet file, you need to register it as a temporary table before you can run SQL queries against it:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val df = sqlContext.read.parquet("src/main/resources/peopleTwo.parquet")
    df.printSchema

    // after registering as a table you will be able to run SQL queries
    df.registerTempTable("people")
    sqlContext.sql("select * from people").collect.foreach(println)
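Note that registerTempTable is deprecated since Spark 2.0 in favor of createOrReplaceTempView. As a minimal sketch of the same approach applied to the path from the question (the HDFS path and column names come from the question; the app name is an arbitrary placeholder):

    import org.apache.spark.sql.SparkSession

    // Spark 2.x: build a SparkSession instead of a SQLContext
    // ("parquet-sql" is just a placeholder app name)
    val spark = SparkSession.builder.appName("parquet-sql").getOrCreate()

    // path and column names are the ones from the question
    val df = spark.read.parquet("hdfs://my_hdfs_path/my_db.db/my_table")
    df.createOrReplaceTempView("my_table")
    spark.sql("select col_A, col_B from my_table").show()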

Using simple SQL

JSON, ORC, Parquet, and CSV files can be queried with SQL directly, without first creating a Spark DataFrame or registering a table:

    // This is Spark 2.x code; you can do the same with sqlContext as well
    import org.apache.spark.sql.SparkSession

    val spark: SparkSession = SparkSession.builder.master("set_the_master").getOrCreate()

    // note the backticks around the path (single quotes will not parse)
    spark.sql("select col_A, col_B from parquet.`hdfs://my_hdfs_path/my_db.db/my_table`")
      .show()
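If you don't need SQL at all, the same column projection can be done directly with the DataFrame API. A sketch, assuming the path and column names from the question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("set_the_master").getOrCreate()

    // read the Parquet files and project the two columns; no table or view needed
    spark.read.parquet("hdfs://my_hdfs_path/my_db.db/my_table")
      .select("col_A", "col_B")
      .show()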