Result of saving an empty RDD

I have an RDD (combinerRDD) to which I applied the transformations below:

 JavaPairRDD<String, Integer> counts = combinerRDD
     .mapToPair(new PairFunction<Tuple2<LongWritable, Text>, String, Integer>() {
         Message message; // assumed to be initialized elsewhere

         @Override
         public Tuple2<String, Integer> call(Tuple2<LongWritable, Text> tuple) throws Exception {
             int count = 0;
             String filename = "New_File";
             for (JobStep js : message.getJobStep()) {
                 if (js.getStepName().equals(StepName.NEW_STEP)) {
                     count += 1;
                 }
             }
             return new Tuple2<String, Integer>(filename, count);
         }
     })
     .reduceByKey(new Function2<Integer, Integer, Integer>() {
         @Override
         public Integer call(Integer count1, Integer count2) throws Exception {
             return count1 + count2;
         }
     });
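
For context, the mapToPair/reduceByKey pair above boils down to emitting one (filename, count) pair per record and then summing the counts per key. A minimal plain-Java sketch of the same reduce-by-key step (no Spark; the data and class name are hypothetical):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceByKey {
    // Sum values per key, mirroring reduceByKey with (a, b) -> a + b.
    static Map<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Two pairs for the same key collapse into one summed entry.
        System.out.println(reduceByKey(List.of(
                Map.entry("New_File", 2),
                Map.entry("New_File", 3))));
    }
}
```

With no input pairs at all, the result map is simply empty, which is the situation the question is about.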

My question is: when combinerRDD contains data, I get the correct result. But when combinerRDD is empty, the only thing written to HDFS is an empty _SUCCESS file. I expected two files when saving an empty RDD, i.e. _SUCCESS and an empty part-00000 file. Am I right? How many output files should I get?
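
As far as I understand Hadoop's output format, each partition of the saved RDD produces one part-NNNNN file, and _SUCCESS is written once when the job commits; so an RDD with zero partitions yields only _SUCCESS, while an empty RDD that still has one partition also yields an empty part-00000. A plain-Java sketch of that rule (the outputFiles helper is hypothetical, not a Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

public class PartFiles {
    // Each partition yields one part-NNNNN file; _SUCCESS is always
    // written on successful job commit, even for zero partitions.
    static List<String> outputFiles(int numPartitions) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            files.add(String.format("part-%05d", i));
        }
        files.add("_SUCCESS");
        return files;
    }

    public static void main(String[] args) {
        System.out.println(outputFiles(0)); // zero partitions: only _SUCCESS
        System.out.println(outputFiles(1)); // one empty partition: part-00000 + _SUCCESS
    }
}
```

Under this reading, both observed outcomes are plausible; the difference would come from how many partitions the empty RDD ends up with, not from the data itself.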

I am asking because I got different results on two clusters: on cluster 1 the code produced only the _SUCCESS file, while on cluster 2 it produced _SUCCESS and an empty part-00000. I'm confused right now. Does the result depend on some cluster configuration?

Note: I am doing a left join, newRDD.leftOuterJoin(combinerRDD), which gives me no result (when combinerRDD's output contains only _SUCCESS), even though newRDD contains values.
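
For reference, a left outer join is supposed to keep every key from the left side, pairing it with Optional.empty() when the right side (here the empty combinerRDD) has no match, so newRDD's values should still come through. A minimal plain-Java sketch of that semantics (a hypothetical helper, not Spark's implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class LeftJoin {
    // Left outer join over maps: every left key survives, paired with
    // Optional.empty() when the right side has no matching key.
    static <K, V, W> Map<K, Map.Entry<V, Optional<W>>> leftOuterJoin(
            Map<K, V> left, Map<K, W> right) {
        Map<K, Map.Entry<V, Optional<W>>> result = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : left.entrySet()) {
            Optional<W> match = Optional.ofNullable(right.get(e.getKey()));
            result.put(e.getKey(), Map.entry(e.getValue(), match));
        }
        return result;
    }

    public static void main(String[] args) {
        // Joining against an empty right side still keeps the left keys.
        System.out.println(leftOuterJoin(Map.of("a", 1), Map.of()));
    }
}
```

Getting an empty result from such a join, as described in the question, is therefore surprising behavior rather than the expected semantics.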

1 answer

OK, so I found a solution. I am using spark-1.3.0, which is affected by the bug below: a left outer join with an empty RDD gives an empty result.

https://issues.apache.org/jira/browse/SPARK-9236

I created an empty RDD pair as shown below:

 JavaRDD<Tuple2<LongWritable, Text>> emptyRDD = context.emptyRDD();
 myRDD = JavaPairRDD.fromJavaRDD(emptyRDD);

Now I use this instead:

 List<Tuple2<LongWritable, Text>> data = Arrays.asList();
 JavaRDD<Tuple2<LongWritable, Text>> emptyRDD = context.parallelize(data);
 myRDD = JavaPairRDD.fromJavaRDD(emptyRDD);

Now it works, i.e. the join no longer comes back empty. The fix is available in versions 1.3.2, 1.4.2 and 1.5.0 (see the link above).

Source: https://habr.com/ru/post/1268828/

