I have an RDD (combinerRDD) to which I applied the transformations below:
    JavaPairRDD<String, Integer> counts = combinerRDD.mapToPair(
        new PairFunction<Tuple2<LongWritable, Text>, String, Integer>() {
            // `message` is used below, but its initialization is not shown
            // in this snippet; it is assumed to be set elsewhere.
            Message message;

            @Override
            public Tuple2<String, Integer> call(Tuple2<LongWritable, Text> tuple) throws Exception {
                String filename = "New_File";
                int count = 0; // the original initialized an undeclared `xlhrCount` here
                // count the job steps named NEW_STEP
                for (JobStep js : message.getJobStep()) {
                    if (js.getStepName().equals(StepName.NEW_STEP)) {
                        count += 1;
                    }
                }
                return new Tuple2<String, Integer>(filename, count);
            }
        }
    ).reduceByKey(new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer count1, Integer count2) throws Exception {
            return count1 + count2;
        }
    });
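The result is then written to HDFS. The save call isn't shown above, so as an assumption it looks something like this (the output path is a placeholder):

    // Hypothetical save step (not shown above): saveAsTextFile() writes one
    // part-NNNNN file per partition of `counts`, plus a _SUCCESS marker file
    // once the job commits successfully.
    counts.saveAsTextFile("hdfs:///user/example/output"); // placeholder path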
My question: when combinerRDD contains data, I get the correct result. But when combinerRDD is empty, the result written to HDFS is only an empty _SUCCESS file. I expected two files when the RDD is empty, i.e. _SUCCESS and an empty part-00000 file. Am I right? How many output files should I get?
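As far as I understand, the number of part files equals the number of partitions of the RDD at save time, so an RDD with zero partitions would produce no part files at all, while an RDD with one empty partition would produce an empty part-00000. A minimal way to check this, assuming the Spark 1.x Java API:

    // Each partition becomes one part-NNNNN file, even if the partition is empty;
    // zero partitions means zero part files, leaving only _SUCCESS.
    System.out.println("partitions: " + counts.partitions().size());

    // Forcing exactly one partition should make the output deterministic:
    // _SUCCESS plus a single (possibly empty) part-00000.
    JavaPairRDD<String, Integer> singlePartition = counts.coalesce(1, true);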
I ask because I got different results on two clusters: on cluster 1 the code produced only the _SUCCESS file, while on cluster 2 it produced _SUCCESS and an empty part-00000. I'm confused right now. Does the result depend on some cluster configuration?
Note: I also perform a left join, newRDD.leftOuterJoin(combinerRDD), which gives me no result when combinerRDD produced only _SUCCESS, even though newRDD contains values.
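For reference, my understanding of leftOuterJoin semantics is sketched below. The types are illustrative, since I haven't shown newRDD's declaration; Optional is Guava's in Spark 1.x and org.apache.spark.api.java.Optional from 2.0:

    // Every key of the left RDD should appear in the result even when the
    // right RDD is empty; unmatched right values come back as an absent Optional.
    JavaPairRDD<String, Tuple2<Integer, Optional<Integer>>> joined =
        newRDD.leftOuterJoin(combinerRDD);
    // Expected for an empty combinerRDD: (key, (value, Optional.absent()))
    // for every key of newRDD, which is not what I observe.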