Handling Writable Fully Qualified Name Changes in Hadoop SequenceFiles

I have a bunch of Hadoop SequenceFiles that were written with a Writable subclass that I wrote. Let's call it FishWritable.

This Writable worked well for a while, until I decided that a package rename was in order for clarity. So now the fully qualified name of FishWritable is com.vertebrates.fishes.FishWritable instead of com.mammals.fishes.FishWritable. It was a reasonable change, given how the scope of the package in question had evolved.

Then I discovered that none of my MapReduce jobs will run, because they crash when trying to initialize the SequenceFileRecordReader:

 java.lang.RuntimeException: java.io.IOException: WritableName can't load class: com.mammals.fishes.FishWritable
     at org.apache.hadoop.io.SequenceFile$Reader.getKeyClass(SequenceFile.java:1949)
     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1899)
     ...

A few options for dealing with this are immediately apparent. I can simply rerun all of my previous jobs to regenerate the output with the up-to-date key class name, running any dependent jobs in sequence. This can obviously be quite time-consuming, and sometimes it is not even possible.

Another possibility would be to write a simple job that reads the SequenceFile as text and replaces every instance of the old class name with the new one. This is basically option #1 with a tweak that makes it less complicated. If I have many large files, it is still impractical.

Is there a better way to deal with refactoring the fully qualified class names used in SequenceFiles? Ideally, I'm looking for a way to specify a fallback class name to use when the recorded one is not found, so that jobs can run against both old and updated versions of this SequenceFile.

2 answers

The org.apache.hadoop.io.WritableName class mentioned in the exception stack trace has some useful methods.

From the documentation:

Utility to permit renaming of Writable implementation classes without invalidating files that contain their class name.

 // Add an alternate name for a class.
 public static void addName(Class writableClass, String name)

In your case, you can call this before reading from SequenceFiles:

 WritableName.addName(com.vertebrates.fishes.FishWritable.class, "com.mammals.fishes.FishWritable"); 

Thus, when trying to read a com.mammals.fishes.FishWritable from the old SequenceFile, the new com.vertebrates.fishes.FishWritable class will be used.
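To make the idea concrete, here is a minimal, self-contained sketch of how an alias table of this kind can work: registered alternate names are checked first, and normal class loading is the fallback. This is a conceptual stand-in, not Hadoop's actual implementation. The AliasRegistry class and its resolve method are hypothetical names, and java.lang.StringBuilder merely stands in for the renamed Writable so that the example runs without Hadoop on the classpath.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Conceptual stand-in for an alias table; NOT Hadoop's WritableName. */
public class AliasRegistry {
    // Maps an alternate (for example, legacy) class name to the class to use.
    private static final Map<String, Class<?>> NAME_TO_CLASS = new ConcurrentHashMap<>();

    /** Registers an alternate name for a class, in the spirit of addName. */
    public static void addName(Class<?> target, String alternateName) {
        NAME_TO_CLASS.put(alternateName, target);
    }

    /** Resolves a stored name: registered aliases first, then class loading. */
    public static Class<?> resolve(String storedName) {
        Class<?> aliased = NAME_TO_CLASS.get(storedName);
        if (aliased != null) {
            return aliased;
        }
        try {
            return Class.forName(storedName);
        } catch (ClassNotFoundException e) {
            // Assumption: an unchecked exception keeps the sketch short.
            throw new IllegalArgumentException("can't load class: " + storedName, e);
        }
    }

    public static void main(String[] args) {
        // StringBuilder is a placeholder for the relocated FishWritable.
        addName(StringBuilder.class, "com.mammals.fishes.FishWritable");
        System.out.println(resolve("com.mammals.fishes.FishWritable").getName());
        System.out.println(resolve("java.lang.String").getName());
    }
}
```

In a real job, the WritableName.addName call shown above would need to run in every JVM that opens the old files, before the first reader is created. Where exactly to put it (driver, mapper setup, a static initializer) depends on the job, so treat that placement as something to verify rather than a given.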

PS: Why was a fish in a mammals package in the first place? ;)

+1
source

Looking at the specification for SequenceFile, it seems clear that the format itself makes no provision for alternative class names.

If I could not rewrite the data, another option would be to keep com.mammals.fishes.FishWritable, have it extend com.vertebrates.fishes.FishWritable, and annotate it as deprecated so that nobody accidentally adds code to the empty shell. After enough time, the data written with the old class will become obsolete and you can safely delete the mammal class.
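The shim approach can be sketched as follows. To keep the example in one runnable file without Hadoop, two nested classes stand in for the two packages: NewFishWritable plays the role of com.vertebrates.fishes.FishWritable and LegacyFishWritable plays the role of the class left behind at com.mammals.fishes.FishWritable. All names here, including the species field and the instantiateByName helper, are hypothetical. In real code the two classes would live in separate files and the new one would implement Writable.

```java
public class ShimSketch {
    /** Stand-in for the relocated class; all behavior lives here. */
    public static class NewFishWritable {
        private String species = "";

        public void setSpecies(String species) { this.species = species; }
        public String getSpecies() { return species; }
    }

    /**
     * Stand-in for the class kept under the old name. It is an empty shell
     * that only exists so the old name still loads.
     */
    @Deprecated
    public static class LegacyFishWritable extends NewFishWritable { }

    /** Creates an instance from a stored class name via its no-arg constructor. */
    public static Object instantiateByName(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("can't instantiate: " + className, e);
        }
    }

    public static void main(String[] args) {
        // Pretend this name was read from the header of an old file.
        String storedName = LegacyFishWritable.class.getName();
        // Code written against the new type can handle the legacy instance.
        NewFishWritable fish = (NewFishWritable) instantiateByName(storedName);
        fish.setSpecies("trout");
        System.out.println(fish.getSpecies());
    }
}
```

Because the shell adds no fields or methods, anything that accepts the new type also accepts instances created under the old name. The shell must remain instantiable the same way the original was, which in this sketch means keeping an accessible no-argument constructor.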


Source: https://habr.com/ru/post/954119/

