Splitting an RDD for K-fold cross-validation: pyspark

I have a data set and I want to apply naive Bayes to it, validating with the K-fold technique. My data has two classes, and they are ordered: if my data set has 100 rows, the first 50 belong to one class and the next 50 belong to the second class. Therefore, I first want to shuffle the data and then randomly form the K folds. The problem is that when I try randomSplit on the RDD, it creates RDDs of different sizes. My code and a sample dataset are as follows:

    documentDF = sqlContext.createDataFrame([
        (0, "This is a cat".lower().split(" "), ),
        (0, "This is a dog".lower().split(" "), ),
        (0, "This is a pig".lower().split(" "), ),
        (0, "This is a mouse".lower().split(" "), ),
        (0, "This is a donkey".lower().split(" "), ),
        (0, "This is a monkey".lower().split(" "), ),
        (0, "This is a horse".lower().split(" "), ),
        (0, "This is a goat".lower().split(" "), ),
        (0, "This is a tiger".lower().split(" "), ),
        (0, "This is a lion".lower().split(" "), ),
        (1, "A mouse and a pig are friends".lower().split(" "), ),
        (1, "A pig and a dog are friends".lower().split(" "), ),
        (1, "A mouse and a cat are friends".lower().split(" "), ),
        (1, "A lion and a tiger are friends".lower().split(" "), ),
        (1, "A lion and a goat are friends".lower().split(" "), ),
        (1, "A monkey and a goat are friends".lower().split(" "), ),
        (1, "A monkey and a donkey are friends".lower().split(" "), ),
        (1, "A horse and a donkey are friends".lower().split(" "), ),
        (1, "A horse and a tiger are friends".lower().split(" "), ),
        (1, "A cat and a dog are friends".lower().split(" "), )
    ], ["label", "text"])

    from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
    from pyspark.mllib.linalg import Vectors
    from pyspark.ml.feature import CountVectorizer
    from pyspark.mllib.regression import LabeledPoint

    def mapper_vector(x):
        row = x.text
        return LabeledPoint(x.label, row)

    splitSize = [0.2] * 5
    print("splitSize" + str(splitSize))
    print(sum(splitSize))

    vect = documentDF.map(lambda x: mapper_vector(x))
    splits = vect.randomSplit(splitSize, seed=0)

    print("***********SPLITS**************")
    for i in range(len(splits)):
        print("split" + str(i) + ":" + str(len(splits[i].collect())))

This code outputs:

    splitSize[0.2, 0.2, 0.2, 0.2, 0.2]
    1.0
    ***********SPLITS**************
    split0:1
    split1:5
    split2:3
    split3:5
    split4:6

The documentDF DataFrame has 20 rows; I need 5 mutually exclusive samples from this dataset that all have the same size. However, as can be seen, the splits all have different sizes. What am I doing wrong?

Edit: According to zero323, I am not doing anything wrong. In that case, if I want to get the final result (as described) without using the Spark ML CrossValidator, what do I need to change? Also, why are the numbers different? If each split has the same weight, shouldn't they all have the same number of rows? And is there another way to randomize the data?

1 answer

You are not doing anything wrong. randomSplit simply does not provide hard guarantees regarding the data distribution. It uses a BernoulliCellSampler (see How does Spark's RDD.randomSplit actually split the RDD?), and the exact fractions can differ from run to run. This is its normal behavior and should be perfectly acceptable on any real-size data set, where the differences should be statistically insignificant.
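If folds of exactly equal size are required, one possible workaround is sketched below. It is not part of the original answer: it assumes vect is the LabeledPoint RDD built in the question, and the names k, seed, rows and folds are only illustrative. The idea is to shuffle the rows with a reproducible key and assign fold membership by row position modulo k:

    import random

    k = 5
    seed = 0

    # Give every row a stable index, then sort by a reproducible pseudo-random
    # key derived from that index; this shuffles away the label ordering.
    indexed = vect.zipWithIndex()                                   # (LabeledPoint, idx)
    shuffled = indexed.sortBy(lambda p: random.Random(p[1] + seed).random())

    # Row number j after the shuffle goes to fold j % k, so every fold gets
    # floor(n/k) or ceil(n/k) rows and the folds are mutually exclusive.
    rows = shuffled.map(lambda p: p[0]).zipWithIndex()              # (LabeledPoint, position)
    folds = [rows.filter(lambda p, i=i: p[1] % k == i).map(lambda p: p[0])
             for i in range(k)]

    for i, fold in enumerate(folds):
        print("fold" + str(i) + ":" + str(fold.count()))

With 20 rows and k = 5 each fold then holds exactly 4 rows, and the training set for fold i can be built by taking the union of the remaining folds.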

On a side note, Spark ML already provides a CrossValidator that can be used with ML pipelines (see How to cross validate RandomForest model? for example usage).
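For completeness, a minimal sketch of that route is shown below, reusing documentDF from the question; the pipeline stages, the smoothing grid and the evaluation metric are illustrative assumptions, not part of the original answer:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import CountVectorizer
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.sql.functions import col

    # ML estimators expect a numeric (double) label; "text" is already an array
    # of words, so CountVectorizer can consume it directly.
    df = documentDF.withColumn("label", col("label").cast("double"))

    cv = CountVectorizer(inputCol="text", outputCol="features")
    nb = NaiveBayes(labelCol="label", featuresCol="features")
    pipeline = Pipeline(stages=[cv, nb])

    grid = ParamGridBuilder().addGrid(nb.smoothing, [0.5, 1.0]).build()
    evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")

    crossval = CrossValidator(estimator=pipeline,
                              estimatorParamMaps=grid,
                              evaluator=evaluator,
                              numFolds=5)

    cvModel = crossval.fit(df)   # the K-fold splitting is handled internally
    print(evaluator.evaluate(cvModel.transform(df)))

This avoids splitting the RDD by hand altogether: CrossValidator builds the folds internally and selects the best parameter combination according to the evaluator.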


Source: https://habr.com/ru/post/1247395/

