How do I read a parquet file in PySpark that was written by Spark?

I use two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to parquet:

partitionedDF.select("noStopWords","lowerText","prediction").write.save("swift2d://xxxx.keystone/commentClusters.parquet")

Then I move over to my Python notebook to read the data in:

df = spark.read.load("swift2d://xxxx.keystone/commentClusters.parquet")

and I get the following error:

AnalysisException: u'Unable to infer schema for ParquetFormat at swift2d://RedditTextAnalysis.keystone/commentClusters.parquet. It must be specified manually;'

I looked through the Spark documentation for reading and writing data, and I don't think I should need to specify a schema manually. Has anyone come across something similar? Should I be doing something differently when I save/load? The data lands in Object Storage.

edit: I'm using Spark 2.0 for both reading and writing.

edit2: This was done in the Data Science Experience project.

+18

For example:

from pyspark.sql import SparkSession

# initialise the SparkSession (and its SparkContext)
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5g') \
    .config('spark.cores.max', '6') \
    .getOrCreate()

sc = spark.sparkContext

# using SQLContext to read the parquet file
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# read the parquet file
df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
+25

You can read parquet files with the SparkSession's parquet reader:

df = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")

Strictly speaking, there is little difference between parquet and load; it may just be that load cannot infer the schema of the data in the file (e.g. a data type that load cannot identify, or one that is specific to parquet).

+7

Source: https://habr.com/ru/post/1673080/
