I have a bunch of Hadoop SequenceFiles that were written with a Writable subclass I wrote; let's call it FishWritable.
This Writable worked well for a while, until I decided that renaming its package was needed for clarity. So the fully qualified name is now com.vertebrates.fishes.FishWritable instead of com.mammals.fishes.FishWritable. This was a reasonable change, given how the scope of the package in question had evolved.
Then I find that none of my MapReduce jobs will start, because they crash while initializing the SequenceFileRecordReader:
java.lang.RuntimeException: java.io.IOException: WritableName can't load class: com.mammals.fishes.FishWritable
    at org.apache.hadoop.io.SequenceFile$Reader.getKeyClass(SequenceFile.java:1949)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1899)
    ...
A couple of solutions to this problem are immediately apparent. I could simply rerun all of my earlier jobs to regenerate the output under the new name of the key class, then rerun any dependent jobs in sequence. This can obviously be quite time-consuming, and sometimes it isn't even possible.
Another possibility would be to write a simple job that reads each SequenceFile and rewrites every occurrence of the old class name with the new one. This is basically method #1 with a hack to make it less involved. With many large files it is still impractical.
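For reference, such a one-off rewrite could be sketched roughly as below. This is a hypothetical converter, not a tested tool: it assumes the old class name can still be resolved while reading (for example via a temporary shim class left at the old FQCN, or an alias registered with Hadoop), and the paths and class names are just illustrations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Hypothetical one-off converter: copies every record of an existing
 * SequenceFile into a new file whose header records the renamed key class.
 * Assumes the old recorded name still resolves on the classpath.
 */
public class SequenceFileRenamer {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);

        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(in))) {
            // Instantiate reusable key/value holders from the input's header.
            Writable key = (Writable) ReflectionUtils.newInstance(
                reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(
                reader.getValueClass(), conf);

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                     SequenceFile.Writer.file(out),
                     // Record the *new* class name in the copy's header.
                     SequenceFile.Writer.keyClass(
                         com.vertebrates.fishes.FishWritable.class),
                     SequenceFile.Writer.valueClass(value.getClass()))) {
                while (reader.next(key, value)) {
                    writer.append(key, value);
                }
            }
        }
    }
}
```

Once every file has been copied this way, the shim or alias for the old name can be dropped for good, which is the one advantage this approach has over leaving a permanent alias in place.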
Is there a better way to handle refactoring the fully qualified class names used in SequenceFiles? Ideally, I'm looking for a way to specify a fallback class name to use when the recorded one cannot be found, so that both old and new versions of this SequenceFile can be read.
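As far as I know, Hadoop's org.apache.hadoop.io.WritableName utility provides exactly this kind of fallback: SequenceFile.Reader resolves the class names stored in a file's header through WritableName, so registering the old name as an alias of the renamed class should let both old and new files deserialize. A minimal sketch, assuming the rename described above (the driver class name here is made up):

```java
import org.apache.hadoop.io.WritableName;

public class FishJobDriver {
    static {
        // Register the old recorded name as an alias of the renamed class.
        // addName() adds an extra lookup entry without changing the class's
        // current canonical name, so files written under either name load.
        WritableName.addName(com.vertebrates.fishes.FishWritable.class,
                             "com.mammals.fishes.FishWritable");
    }

    // ... normal job configuration and submission follows ...
}
```

The registration just needs to run in the JVM before any SequenceFile with the stale header is opened, so a static initializer in the job driver (or an early call in main) is a natural place for it.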