I need to convert my DataFrame to a Dataset, and I used the following code:
    val final_df = Dataframe.withColumn(
      "features",
      toVec4(
        // casting into Timestamp to parse the string, and then into Int
        $"time_stamp_0".cast(TimestampType).cast(IntegerType),
        $"count",
        $"sender_ip_1",
        $"receiver_ip_2"
      )
    ).withColumn("label", Dataframe("count")).select("features", "label")

    final_df.show()

    val trainingTest = final_df.randomSplit(Array(0.3, 0.7))
    val TrainingDF = trainingTest(0)
    val TestingDF = trainingTest(1)
    TrainingDF.show()
    TestingDF.show()

    // let's create our linear regression
    val lir = new LinearRegression()
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
      .setMaxIter(100)
      .setTol(1E-6)

    case class df_ds(features: Vector, label: Integer)
    org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)
    val Training_ds = TrainingDF.as[df_ds]
My problem is that I got the following error:
    Error:(96, 36) Unable to find encoder for type stored in a Dataset.
    Primitive types (Int, String, etc) and Product types (case classes) are
    supported by importing spark.implicits._ Support for serializing other
    types will be added in future releases.
        val Training_ds = TrainingDF.as[df_ds]
It seems that the number of values in the DataFrame differs from the number of fields in my class. However, I use case class df_ds(features: Vector, label: Integer) for my DataFrame TrainingDF, since it has a features vector and an integer label. Here is the TrainingDF frame:
    +--------------------+-----+
    |            features|label|
    +--------------------+-----+
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,10...|   10|
    +--------------------+-----+
And here is my original final_df DataFrame:
    +------------+-----------+-------------+-----+
    |time_stamp_0|sender_ip_1|receiver_ip_2|count|
    +------------+-----------+-------------+-----+
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.3|     10.0.0.2|   10|
    +------------+-----------+-------------+-----+
However, I still get the error above. Can anybody help me? Thank you in advance.