How can I split a Dataset non-randomly in Apache Spark?

I know that I can do random splitting with the randomSplit method:

val splittedData: Array[Dataset[Row]] = preparedData.randomSplit(Array(0.5, 0.3, 0.2)) 

Is it possible to split the data into serial parts using some nonRandomSplit method?

I'm using Apache Spark 2.0.1. Thanks in advance.

UPD: the data order is important. I'm going to train my model on the data with "smaller identifiers" and test it on the data with "larger identifiers", so I want to split the data into sequential parts without shuffling.

e.g.

    my dataset        = (0,1,2,3,4,5,6,7,8,9)
    desired splitting = (0.8, 0.2)
    splitting         = (0,1,2,3,4,5,6,7), (8,9)

The only solution I can think of is to use count and limit, but there is probably a better way.
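For what it's worth, here is a minimal sketch of that count/limit idea, assuming the Dataset is already in the desired order (Spark only guarantees an ordering you impose yourself, e.g. by sorting on the identifier column first):

    val total = preparedData.count()
    val trainSize = math.round(total * 0.8).toInt

    // First ~80% of the rows in the current order.
    val training = preparedData.limit(trainSize)

    // There is no "skip the first n rows" counterpart to limit, so the tail
    // has to come from somewhere else, e.g. except(), which however also
    // removes duplicate rows and does not preserve the original order.
    val test = preparedData.except(training)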

1 answer

This is the solution I implemented: Dataset -> RDD -> Dataset.

I'm not sure this is the most efficient way to do it, so I'd be glad to see a better solution.

    val count = allData.count()
    val trainRatio = 0.6
    val trainSize = math.round(count * trainRatio).toInt
    val dataSchema = allData.schema

    // Zip every row with its index and keep only rows with index < trainSize.
    // Could have possibly used .limit(n) here.
    val trainingRdd = allData
      .rdd
      .zipWithIndex()
      .filter { case (_, index) => index < trainSize }
      .map { case (row, _) => row }

    // Can't use .limit() for the tail :(
    val testRdd = allData
      .rdd
      .zipWithIndex()
      .filter { case (_, index) => index >= trainSize }
      .map { case (row, _) => row }

    val training = MySession.createDataFrame(trainingRdd, dataSchema)
    val test = MySession.createDataFrame(testRdd, dataSchema)
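Another option, sketched below under the assumption that the ordering column is literally called "id" (adjust the name to your schema): rank the rows with a window function and filter on the rank. Note that a window without partitionBy pulls all rows into a single partition, so this does not necessarily beat the RDD round-trip above.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, percent_rank}

    // Hypothetical "id" column defines the order; the unpartitioned window
    // means all rows are processed in one partition.
    val byId = Window.orderBy("id")
    val ranked = allData.withColumn("rank", percent_rank().over(byId))

    // Roughly the first 60% of rows by id vs. the remaining 40%.
    val training = ranked.where(col("rank") < 0.6).drop("rank")
    val test     = ranked.where(col("rank") >= 0.6).drop("rank")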

Source: https://habr.com/ru/post/1260717/
