I have a dataset to which I want to apply Naive Bayes, and I will evaluate it with the K-fold technique. My data has two classes and they are ordered: if my dataset has 100 rows, the first 50 belong to one class and the next 50 belong to the second class. Therefore, I first want to shuffle the data and then randomly form the K folds. The problem is that when I try randomSplit on an RDD, it creates RDDs of different sizes. My code and sample dataset are as follows:
```python
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.regression import LabeledPoint

documentDF = sqlContext.createDataFrame([
    (0, "This is a cat".lower().split(" "),),
    (0, "This is a dog".lower().split(" "),),
    (0, "This is a pig".lower().split(" "),),
    (0, "This is a mouse".lower().split(" "),),
    (0, "This is a donkey".lower().split(" "),),
    (0, "This is a monkey".lower().split(" "),),
    (0, "This is a horse".lower().split(" "),),
    (0, "This is a goat".lower().split(" "),),
    (0, "This is a tiger".lower().split(" "),),
    (0, "This is a lion".lower().split(" "),),
    (1, "A mouse and a pig are friends".lower().split(" "),),
    (1, "A pig and a dog are friends".lower().split(" "),),
    (1, "A mouse and a cat are friends".lower().split(" "),),
    (1, "A lion and a tiger are friends".lower().split(" "),),
    (1, "A lion and a goat are friends".lower().split(" "),),
    (1, "A monkey and a goat are friends".lower().split(" "),),
    (1, "A monkey and a donkey are friends".lower().split(" "),),
    (1, "A horse and a donkey are friends".lower().split(" "),),
    (1, "A horse and a tiger are friends".lower().split(" "),),
    (1, "A cat and a dog are friends".lower().split(" "),)
], ["label", "text"])

# Vectorize the token lists so that LabeledPoint gets numeric features.
cv = CountVectorizer(inputCol="text", outputCol="features")
vectorizedDF = cv.fit(documentDF).transform(documentDF)

def mapper_vector(x):
    # Convert a Row into an MLlib LabeledPoint.
    return LabeledPoint(x.label, x.features)

splitSize = [0.2] * 5
print("splitSize" + str(splitSize))
print(sum(splitSize))

vect = vectorizedDF.rdd.map(mapper_vector)
splits = vect.randomSplit(splitSize, seed=0)

print("***********SPLITS**************")
for i in range(len(splits)):
    print("split" + str(i) + ":" + str(splits[i].count()))
```
This code outputs:
```
splitSize[0.2, 0.2, 0.2, 0.2, 0.2]
1.0
***********SPLITS**************
split0:1
split1:5
split2:3
split3:5
split4:6
```
The DataFrame documentDF has 20 rows, and I need 5 separate, mutually exclusive samples of the same size from it. However, as can be seen, all the splits have different sizes. What am I doing wrong?
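For what it is worth, the splits do seem to partition the data: the per-split counts above sum to 20, so nothing is lost or duplicated, only the individual sizes vary. A quick sanity check (a sketch using count(), which avoids collecting the data):

```python
# Sanity check (sketch): the splits together cover all 20 rows;
# only the per-split sizes vary.
total = sum(s.count() for s in splits)
print("total rows across splits: " + str(total))  # prints 20 here
```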
Edit: According to zero323, I am not doing anything wrong. In that case, if I want to get the final result described above without using the ML CrossValidator, what do I need to change? Also, why are the sizes different? If each split has the same weight, shouldn't they have the same number of rows? And is there another way to randomize the data?
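In case it clarifies what I am after, here is a sketch of the kind of manual workaround I am considering (untested; the names k and fold are mine, and sorting by a random key is just one way to shuffle): order the rows randomly, then deal them out round-robin so every fold gets exactly n/k rows.

```python
import random

# Sketch of a manual, exact-size K-fold split (untested workaround idea):
# shuffle with a random sort key, then assign fold ids round-robin.
k = 5

shuffled = vect.sortBy(lambda lp: random.random())        # random order
indexed = shuffled.zipWithIndex()                         # (point, idx) pairs
folds = indexed.map(lambda pair: (pair[1] % k, pair[0]))  # (fold_id, point)

for i in range(k):
    fold = folds.filter(lambda kv, i=i: kv[0] == i).values()
    print("fold" + str(i) + ":" + str(fold.count()))      # 4 rows each for n=20
```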