How to write a parquet file using Spark (pyspark)?

I am new to Spark, and I am trying to write a DataFrame out as a parquet file, but I have not succeeded yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script, it shows me: AttributeError: 'RDD' object has no attribute 'write'

    from pyspark import SparkContext

    sc = SparkContext("local", "Protob Conversion to Parquet")

    # spark is an existing SparkSession
    df = sc.textFile("/temp/proto_temp.csv")

    # Displays the content of the DataFrame to stdout
    df.write.parquet("/output/proto.parquet")

Do you know how to make this work?

The Spark version I'm using is 2.0.1, built for Hadoop 2.7.3.

1 answer

The error occurred because the textFile method of SparkContext returns an RDD, and what I needed was a DataFrame.
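To see the difference concretely, here is a minimal sketch (reusing the same input path as above; the printed class names are what PySpark 2.x reports):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("TypeCheck").getOrCreate()
    sc = spark.sparkContext

    # textFile gives an RDD of raw lines -- it has no .write attribute
    rdd = sc.textFile("/temp/proto_temp.csv")
    print(type(rdd))  # <class 'pyspark.rdd.RDD'>

    # spark.read gives a DataFrame -- it carries the .write interface
    df = spark.read.csv("/temp/proto_temp.csv")
    print(type(df))   # <class 'pyspark.sql.dataframe.DataFrame'>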

SparkSession has an SQLContext under the hood. Therefore, I needed to use the DataFrameReader to read the CSV file correctly before converting it to a parquet file.

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Protob Conversion to Parquet") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    # read csv
    df = spark.read.csv("/temp/proto_temp.csv")

    # Displays the content of the DataFrame to stdout
    df.show()

    df.write.parquet("output/proto.parquet")
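As a quick sanity check (not part of the original answer), the parquet output can be read back to confirm the data and schema survived the round trip:

    # Read the written parquet directory back and inspect it
    df2 = spark.read.parquet("output/proto.parquet")
    df2.printSchema()
    df2.show()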
