How do I read a parquet file in PySpark that was written by Spark?

I use two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to parquet:

partitionedDF.select("noStopWords","lowerText","prediction").write.save("swift2d://xxxx.keystone/commentClusters.parquet")

Then I move over to my Python notebook to read the data in:

df = spark.read.load("swift2d://xxxx.keystone/commentClusters.parquet")

and I get the following error:

AnalysisException: u'Unable to infer schema for ParquetFormat at swift2d://RedditTextAnalysis.keystone/commentClusters.parquet. It must be specified manually;'

I looked through the Spark documentation for reading and writing data, and I don't think I should need to specify a schema manually. Has anyone come across something similar? Should I be doing something differently when I save/load? The data lands in Object Storage.

edit: I'm using Spark 2.0 for both reading and writing.

edit2: This was done in the Data Science Experience project.

+18

For example:

from pyspark.sql import SparkSession

# initialise the SparkSession (and its SparkContext)
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5g') \
    .config('spark.cores.max', '6') \
    .getOrCreate()

sc = spark.sparkContext

# using SQLContext to read the parquet file
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# read the parquet file
df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
+25

You can read parquet files with the SparkSession's parquet reader:

df = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")

Strictly speaking, there is little difference between parquet and load; it may just be that load cannot infer the schema of the data in the file (e.g. a data type that load cannot identify, or one that is specific to parquet).

+7

Source: https://habr.com/ru/post/1673080/
