I have an RDD (combinerRDD) to which I applied the transformations below:
    JavaPairRDD<String, Integer> counts = combinerRDD.mapToPair(
        new PairFunction<Tuple2<LongWritable, Text>, String, Integer>() {
            // `message` is used below, but its initialization is not shown
            // in this snippet; it is assumed to be set elsewhere.
            Message message;

            @Override
            public Tuple2<String, Integer> call(Tuple2<LongWritable, Text> tuple) throws Exception {
                String filename = "New_File";
                int count = 0; // the original initialized an undeclared `xlhrCount` here
                // count the job steps named NEW_STEP
                for (JobStep js : message.getJobStep()) {
                    if (js.getStepName().equals(StepName.NEW_STEP)) {
                        count += 1;
                    }
                }
                return new Tuple2<String, Integer>(filename, count);
            }
        }
    ).reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer count1, Integer count2) throws Exception {
            return count1 + count2;
        }
    });
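The result is then written to HDFS. The save call isn't shown above, so as an assumption it looks something like this (the output path is a placeholder):

    // Hypothetical save step (not shown above): saveAsTextFile() writes one
    // part-NNNNN file per partition of `counts`, plus a _SUCCESS marker file
    // once the job commits successfully.
    counts.saveAsTextFile("hdfs:///user/example/output"); // placeholder path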
My question: when combinerRDD contains data, I get the correct result. But when combinerRDD is empty, the result written to HDFS is only an empty _SUCCESS file. I expected two files when the RDD is empty, i.e. _SUCCESS and an empty part-00000 file. Am I right? How many output files should I get?
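As far as I understand, the number of part files equals the number of partitions of the RDD at save time, so an RDD with zero partitions would produce no part files at all, while an RDD with one empty partition would produce an empty part-00000. A minimal way to check this, assuming the Spark 1.x Java API:

    // Each partition becomes one part-NNNNN file, even if the partition is empty;
    // zero partitions means zero part files, leaving only _SUCCESS.
    System.out.println("partitions: " + counts.partitions().size());

    // Forcing exactly one partition should make the output deterministic:
    // _SUCCESS plus a single (possibly empty) part-00000.
    JavaPairRDD<String, Integer> singlePartition = counts.coalesce(1, true);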
I ask because I got different results on two clusters: on cluster 1 the code produced only the _SUCCESS file, while on cluster 2 it produced _SUCCESS and an empty part-00000. I'm confused right now. Does the result depend on some cluster configuration?
Note: I also perform a left join, newRDD.leftOuterJoin(combinerRDD), which gives me no result when combinerRDD produced only _SUCCESS, even though newRDD contains values.
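For reference, my understanding of leftOuterJoin semantics is sketched below. The types are illustrative, since I haven't shown newRDD's declaration; Optional is Guava's in Spark 1.x and org.apache.spark.api.java.Optional from 2.0:

    // Every key of the left RDD should appear in the result even when the
    // right RDD is empty; unmatched right values come back as an absent Optional.
    JavaPairRDD<String, Tuple2<Integer, Optional<Integer>>> joined =
        newRDD.leftOuterJoin(combinerRDD);
    // Expected for an empty combinerRDD: (key, (value, Optional.absent()))
    // for every key of newRDD, which is not what I observe.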