I need to convert my DataFrame to a Dataset, and I used the following code:
    val final_df = Dataframe.withColumn(
      "features",
      toVec4(
        // casting into Timestamp to parse the string, and then into Int
        $"time_stamp_0".cast(TimestampType).cast(IntegerType),
        $"count",
        $"sender_ip_1",
        $"receiver_ip_2"
      )
    ).withColumn("label", Dataframe("count")).select("features", "label")

    final_df.show()

    val trainingTest = final_df.randomSplit(Array(0.3, 0.7))
    val TrainingDF = trainingTest(0)
    val TestingDF = trainingTest(1)
    TrainingDF.show()
    TestingDF.show()

    // let's create our linear regression
    val lir = new LinearRegression()
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
      .setMaxIter(100)
      .setTol(1E-6)

    case class df_ds(features: Vector, label: Integer)
    org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)
    val Training_ds = TrainingDF.as[df_ds]
My problem is that I got the following error:
    Error:(96, 36) Unable to find encoder for type stored in a Dataset.
    Primitive types (Int, String, etc) and Product types (case classes) are
    supported by importing spark.implicits._ Support for serializing other
    types will be added in future releases.
        val Training_ds = TrainingDF.as[df_ds]
It seems that the number of values in the DataFrame differs from the number of fields in my class. However, I use case class df_ds(features: Vector, label: Integer) for my DataFrame TrainingDF, since it has a features vector and an integer label. Here is the TrainingDF frame:
    +--------------------+-----+
    |            features|label|
    +--------------------+-----+
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,19...|   19|
    |[1.497325796E9,10...|   10|
    +--------------------+-----+
And here is my original final_df DataFrame:
    +------------+-----------+-------------+-----+
    |time_stamp_0|sender_ip_1|receiver_ip_2|count|
    +------------+-----------+-------------+-----+
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.2|     10.0.0.3|   19|
    |    05:49:56|   10.0.0.3|     10.0.0.2|   10|
    +------------+-----------+-------------+-----+
However, I still get the error above. Can anybody help me? Thank you in advance.