SparkSession initialization error - Cannot use spark.read

I am trying to write a standalone PySpark program that reads a CSV file and saves it to a Hive table. I am having trouble setting up the SparkSession, SparkConf, and SparkContext objects. Here is my code:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext, SparkSession
    from pyspark.sql.types import *

    conf = SparkConf().setAppName("test_import")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)
    spark = SparkSession.builder.config(conf=conf)

    dfRaw = spark.read.csv("hdfs:/user/..../test.csv", header=False)
    dfRaw.createOrReplaceTempView('tempTable')
    sqlContext.sql("create table customer.temp as select * from tempTable")

And I get the error:

    dfRaw = spark.read.csv("hdfs:/user/../test.csv", header=False)
    AttributeError: 'Builder' object has no attribute 'read'

How do I properly configure the SparkSession object so that I can use read.csv? Also, can someone explain the difference between the SparkSession, SparkContext, and SparkConf objects?

1 answer

There is no need to use both SparkContext and SparkSession to initialize Spark. SparkSession is the newer, recommended entry point.

To initialize your environment, simply do:

    spark = SparkSession\
        .builder\
        .appName("test_import")\
        .getOrCreate()
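
Applied to the program from the question, a minimal sketch might look like this (the HDFS path is left elided as in the question, and enableHiveSupport() is assumed to be needed since the target is a Hive table):

    from pyspark.sql import SparkSession

    # build a single SparkSession; enableHiveSupport() lets
    # "create table" write through the Hive metastore
    spark = SparkSession\
        .builder\
        .appName("test_import")\
        .enableHiveSupport()\
        .getOrCreate()

    # note: without .getOrCreate(), "spark" would still be a Builder,
    # which is exactly what the AttributeError complains about
    dfRaw = spark.read.csv("hdfs:/user/..../test.csv", header=False)
    dfRaw.createOrReplaceTempView('tempTable')
    spark.sql("create table customer.temp as select * from tempTable")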

You can run SQL commands by doing the following:

 spark.sql(...) 
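
For example, a small sketch assuming the temp view from the question has already been registered:

    # spark.sql returns a DataFrame, so results can be inspected directly
    result = spark.sql("select * from tempTable limit 10")
    result.show()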

Prior to Spark 2.0.0, three separate objects were used: SparkContext, SQLContext and HiveContext. They were used separately depending on what you wanted to do and the data types you were working with.

With the introduction of the Dataset/DataFrame abstractions, the SparkSession object became the main entry point into the Spark environment. You can still reach the other objects by first initializing a SparkSession (say, in a variable named spark) and then accessing spark.sparkContext / spark.sqlContext.
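
For instance, a minimal sketch of reaching the legacy SparkContext through an existing session (the RDD example is purely illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("test_import").getOrCreate()

    # the underlying SparkContext is still available for RDD-level work
    sc = spark.sparkContext
    rdd = sc.parallelize([1, 2, 3])
    print(rdd.count())  # 3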


Source: https://habr.com/ru/post/1272843/

