What are the connections and differences between Hadoop Writable and java.io.Serializable?

By implementing the Writable interface, an object can be serialized in Hadoop. So what are the connections and differences between Hadoop Writable and java.io.Serializable?

2 answers

Key storage differences:

Java Serializable

Serializable does not assume that the class of the stored values is known, so it tags each instance with its class, i.e. it writes metadata about the object, which includes the class name, field names and types, and its superclass. ObjectOutputStream and ObjectInputStream optimize this somewhat, so that 5-byte handles are written for instances of a class after the first, but sequences of objects with handles cannot then be accessed randomly, since they rely on stream state. This complicates things like sorting.
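To make that concrete, here is a minimal sketch using a hypothetical IntPair value class (not from the original answer): the first writeObject() emits the full class descriptor (class name, field names and types), and only subsequent instances get the short back-references mentioned above.

```java
import java.io.*;

public class JavaSerializationDemo {
    // Hypothetical value class; Serializable tags instances with class metadata.
    static class IntPair implements Serializable {
        private static final long serialVersionUID = 1L;
        int first;
        int second;
        IntPair(int first, int second) { this.first = first; this.second = second; }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new IntPair(1, 2)); // full class descriptor written here
            out.writeObject(new IntPair(3, 4)); // only a short handle to that descriptor
        }
        // Much larger than the 16 bytes of raw field data, because of the metadata.
        System.out.println("Serialized size: " + bytes.size() + " bytes");
    }
}
```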

Hadoop Writable

When defining "Writable", you know the expected class. Therefore, Writables do not save their type in serialized representation, since when deserializing you know what is expected. eg. if the input key is LongWritable, so an empty LongWritable instance is asked to populate itself from the input stream. Since metadata should not be stored (class name, fields, their types, superclasses), this leads to significantly more compact binaries, simple random access and better performance.
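A minimal sketch of a custom Writable, using a hypothetical IntPairWritable type, shows why the representation stays compact: write() emits only the raw field values, and readFields() fills an existing instance, exactly as happens when the framework hands you an empty LongWritable key to populate.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical pair type: only raw field values go into the stream, because
// the reader already knows it is deserializing an IntPairWritable.
public class IntPairWritable implements Writable {
    private int first;
    private int second;

    public IntPairWritable() { }                  // no-arg constructor for the framework
    public IntPairWritable(int first, int second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);                      // 4 bytes
        out.writeInt(second);                     // 4 bytes; no class name, no field names
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }
}
```

Eight bytes per record here, versus the descriptor-laden output of ObjectOutputStream in the earlier sketch.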


Some good reading:

For Java Serializable:

For Hadoop Writable:


In the words of Doug Cutting:

The Writable interface is different from Serializable. Serializable does not assume that the class of stored values is known, so each instance is tagged with its class. ObjectOutputStream and ObjectInputStream optimize this somewhat, so that 5-byte handles are written for instances of a class after the first, but sequences of objects with handles cannot then be accessed randomly, since they rely on stream state. This complicates things like sorting.

Writable, on the other hand, assumes that the application knows the expected class. The application must be able to instantiate it in order to call readFields(). So the class need not be stored with each instance. This results in considerably more compact binary files, straightforward random access, and generally higher performance.

Could Hadoop have used Serializable? Perhaps. One could override writeObject or writeExternal for each class whose serialization is performance critical. (MapReduce is very i/o intensive, so nearly every class is performance critical.) One could implement ObjectOutputStream.writeObjectOverride() and ObjectInputStream.readObjectOverride() to use a more compact representation that, for example, did not need to tag every top-level instance in a file with its class. This would probably require at least as much code as Hadoop has in Writable, ObjectWritable, etc., and that code would be a bit more complicated, as it would be trying to work around a different model. But it might have the advantage of better built-in versioning. Or would it?
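For comparison, here is a rough sketch of the writeExternal route mentioned above, for a hypothetical CompactPair class. The body becomes as compact as a Writable's, but ObjectOutputStream still records a class descriptor for each top-level instance; avoiding that is exactly what would require the writeObjectOverride()/readObjectOverride() hooks.

```java
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Hypothetical class taking the "override writeExternal" route.
public class CompactPair implements Externalizable {
    private int first;
    private int second;

    public CompactPair() { }                      // Externalizable requires a public no-arg constructor
    public CompactPair(int first, int second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(first);                      // compact body, chosen by the class itself
        out.writeInt(second);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }
}
```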

Serializable's versioning mechanism is for classes to define a static field named serialVersionUID. This protects against incompatible changes, but it does not provide backward compatibility by itself. For that, the application must handle versioning explicitly: when reading, it needs to reason about what was written in order to decide what to do. But Serializable's versioning mechanism does not support this any more or less than Writable does.
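A sketch of what handling versioning explicitly can look like with a Writable, using a hypothetical VersionedRecordWritable: the writer records a version byte, and readFields() reasons about it so that older data can still be read. Nothing here comes from Serializable or Writable themselves; it is application code either way.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical record that handles its own backward compatibility.
public class VersionedRecordWritable implements Writable {
    private static final byte CURRENT_VERSION = 2;

    private long id;        // present since version 1
    private String label;   // added in version 2

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeByte(CURRENT_VERSION);
        out.writeLong(id);
        out.writeUTF(label == null ? "" : label);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        id = in.readLong();
        if (version >= 2) {
            label = in.readUTF();   // only version-2 streams carry the label
        } else {
            label = "";             // sensible default when reading version-1 data
        }
    }
}
```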

You must go through this stream once.

