How to automate the creation of a StructType when converting an RDD to a DataFrame

I want to save an RDD as a Parquet file. To do this, I convert the RDD to a DataFrame using an explicit schema, and then save the DataFrame as a Parquet file:

  val aStruct = new StructType(Array(
    StructField("id", StringType, nullable = true),
    StructField("role", StringType, nullable = true)))
  val newDF = sqlContext.createDataFrame(filtered, aStruct)

The question is how to create aStruct automatically for all columns, assuming all of them are StringType. Also, what is the meaning of nullable = true? Does this mean that all empty values will be replaced with null?
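Ideally something along these lines, shown here only as a rough sketch (colNames stands in for whatever list of column names is actually available):

  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  // Build one StringType field per column name instead of spelling each field out by hand.
  val colNames = Seq("id", "role")  // placeholder: the real list of column names
  val aStruct = StructType(colNames.map(name => StructField(name, StringType, nullable = true)))

  val newDF = sqlContext.createDataFrame(filtered, aStruct)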

1 answer

Why not use the built-in toDF?

  scala> val myRDD = sc.parallelize(Seq(("1", "roleA"), ("2", "roleB"), ("3", "roleC")))
  myRDD: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[60] at parallelize at <console>:27

  scala> val colNames = List("id", "role")
  colNames: List[String] = List(id, role)

  scala> val myDF = myRDD.toDF(colNames: _*)
  myDF: org.apache.spark.sql.DataFrame = [id: string, role: string]

  scala> myDF.show
  +---+-----+
  | id| role|
  +---+-----+
  |  1|roleA|
  |  2|roleB|
  |  3|roleC|
  +---+-----+

  scala> myDF.printSchema
  root
   |-- id: string (nullable = true)
   |-- role: string (nullable = true)

  scala> myDF.write.save("myDF.parquet")
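The transcript above runs in the spark-shell, where the needed implicits are already in scope. In standalone code (assuming Spark 1.x, as in the transcript), calling toDF on an RDD of tuples requires importing sqlContext.implicits._ first. A minimal sketch, with an assumed application name:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("rdd-to-parquet"))
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._  // enables .toDF on RDDs of tuples and case classes

  val myRDD = sc.parallelize(Seq(("1", "roleA"), ("2", "roleB"), ("3", "roleC")))
  val colNames = List("id", "role")
  val myDF = myRDD.toDF(colNames: _*)

  // save() uses the default data source (Parquet), so this matches write.parquet("myDF.parquet")
  myDF.write.save("myDF.parquet")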

nullable = true simply means that the specified column may contain null values. (This is mostly worth knowing for Int columns, which normally cannot hold null at all: Int has no NA or null value.)
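It does not mean that empty values are rewritten as null: an empty string stays an empty string, and only an actual null counts as null. A small sketch with assumed data to make the distinction concrete:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  val schema = StructType(Seq(
    StructField("id",   StringType, nullable = true),
    StructField("role", StringType, nullable = true)))

  // An empty string and an explicit null are different values.
  val rows = sc.parallelize(Seq(Row("1", ""), Row("2", null)))
  val df = sqlContext.createDataFrame(rows, schema)

  df.filter("role is null").count()  // 1: only the row with the explicit null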


Source: https://habr.com/ru/post/1012346/

