High Performance Serialization: Java vs Google Protocol Buffers vs ...?

For some caching I'm planning to do in an upcoming project, I've been thinking about Java serialization. Namely, should it be used?

Now, I have written custom serialization and deserialization (using Externalizable) for various reasons in past years. Interoperability has become an even bigger issue these days, and I can foresee the need to interact with .NET applications, so I've been thinking about a platform-independent solution.

Does anyone have any experience with high-performance use of Google Protocol Buffers (GPB)? How does it compare in terms of speed and efficiency with native Java serialization? Alternatively, are there any other schemes worth considering?

+44
java caching serialization protocol-buffers
Mar 15 '09 at 13:00
7 answers

I haven't compared Protocol Buffers with native Java serialization in terms of speed, but for interoperability native Java serialization is a serious no-no. It's also not going to be as space-efficient as Protocol Buffers in most cases. Of course, it's somewhat more flexible in terms of what it can store, and in terms of references etc. Protocol Buffers are very good at what they're designed for, and when they meet your needs they're great, but there are obvious restrictions due to interoperability (and other things).

I've recently published a Protocol Buffers benchmarking framework in Java and .NET. The Java version is in the main Google project (in the benchmarks directory); the .NET version is in my C# port project. If you want to compare PB speed with Java serialization speed, you could write similar classes and benchmark them. If you're interested in interop, though, I really wouldn't give native Java serialization (or native .NET binary serialization) a second thought.
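To give concrete shape to "write similar classes and benchmark them", here is a minimal, self-contained timing sketch for the native Java side; the PB side of such a comparison would call toByteArray() / parseFrom(byte[]) on a protoc-generated message instead. The Payload class and iteration count below are invented for illustration.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    // Times native Java serialization round trips; the PB equivalent would
    // use toByteArray()/parseFrom() on a generated message class.
    public class SerializationBench {

        static class Payload implements Serializable {
            private static final long serialVersionUID = 1L;
            int id = 42;
            String name = "example";
        }

        public static void main(String[] args) throws Exception {
            Payload payload = new Payload();
            byte[] sample = toBytes(payload);
            int iterations = 1_000_000;

            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                fromBytes(toBytes(payload));
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            double mbPerSec = sample.length * (double) iterations / (1024 * 1024) / seconds;
            System.out.printf("%d round trips in %.2fs; %.2f MB/s (%d bytes each)%n",
                    iterations, seconds, mbPerSec, sample.length);
        }

        static byte[] toBytes(Object o) throws Exception {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(o);
            }
            return bos.toByteArray();
        }

        static Object fromBytes(byte[] b) throws Exception {
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(b))) {
                return ois.readObject();
            }
        }
    }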

There are other options for platform-portable serialization besides Protocol Buffers, though: Thrift, JSON and YAML spring to mind, and there are doubtless others.

EDIT: Okay, since interop isn't that important after all, it's worth trying to list the different qualities you want out of a serialization framework. One thing you should think about is versioning; this is another thing that PB is designed to handle well, both backwards and forwards (so new software can read old data and vice versa), as long as you stick to the suggested rules, of course :)
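To make the versioning point concrete, here is a hypothetical proto2 sketch of those rules in action: a new field gets a fresh, previously unused tag number and is optional, so old readers simply skip it and new readers treat it as unset in old data.

    // A hypothetical message, originally just "name" and "id" (V1).
    message Person {
      required string name = 1;
      optional int32 id = 2;
      // Added in V2: old readers skip unknown field 3; new readers
      // see it as unset when parsing V1 data.
      optional string email = 3;
    }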

Having tried to be cautious about Java performance versus native serialization, I really wouldn't be surprised if PB turned out to be faster anyway. If you have the chance, use the server VM; my recent benchmarks showed the server VM to be over twice as fast at serializing and deserializing the sample data. I think the PB code suits the server VM's JIT very nicely :)

As sample performance figures, serializing and deserializing two messages (one 228 bytes, one 84,750 bytes), I got these results on my laptop with the server VM:

 Benchmarking benchmarks.GoogleSize$SizeMessage1 with file google_message1.dat
 Serialize to byte string: 2581851 iterations in 30.16s; 18.613789MB/s
 Serialize to byte array: 2583547 iterations in 29.842s; 18.824497MB/s
 Serialize to memory stream: 2210320 iterations in 30.125s; 15.953759MB/s
 Deserialize from byte string: 3356517 iterations in 30.088s; 24.256632MB/s
 Deserialize from byte array: 3356517 iterations in 29.958s; 24.361889MB/s
 Deserialize from memory stream: 2618821 iterations in 29.821s; 19.094952MB/s

 Benchmarking benchmarks.GoogleSpeed$SpeedMessage1 with file google_message1.dat
 Serialize to byte string: 17068518 iterations in 29.978s; 123.802124MB/s
 Serialize to byte array: 17520066 iterations in 30.043s; 126.802376MB/s
 Serialize to memory stream: 7736665 iterations in 30.076s; 55.93307MB/s
 Deserialize from byte string: 16123669 iterations in 30.073s; 116.57947MB/s
 Deserialize from byte array: 16082453 iterations in 30.109s; 116.14243MB/s
 Deserialize from memory stream: 7496968 iterations in 30.03s; 54.283176MB/s

 Benchmarking benchmarks.GoogleSize$SizeMessage2 with file google_message2.dat
 Serialize to byte string: 6266 iterations in 30.034s; 16.826494MB/s
 Serialize to byte array: 6246 iterations in 30.027s; 16.776697MB/s
 Serialize to memory stream: 6042 iterations in 29.916s; 16.288969MB/s
 Deserialize from byte string: 4675 iterations in 29.819s; 12.644595MB/s
 Deserialize from byte array: 4694 iterations in 30.093s; 12.580387MB/s
 Deserialize from memory stream: 4544 iterations in 29.579s; 12.389998MB/s

 Benchmarking benchmarks.GoogleSpeed$SpeedMessage2 with file google_message2.dat
 Serialize to byte string: 39562 iterations in 30.055s; 106.16416MB/s
 Serialize to byte array: 39715 iterations in 30.178s; 106.14035MB/s
 Serialize to memory stream: 34161 iterations in 30.032s; 91.74085MB/s
 Deserialize from byte string: 36934 iterations in 29.794s; 99.98019MB/s
 Deserialize from byte array: 37191 iterations in 29.915s; 100.26867MB/s
 Deserialize from memory stream: 36237 iterations in 29.846s; 97.92251MB/s

"Speed" and "size" indicate whether the generated code is optimized for speed or for code size. (The serialized data is the same in both cases. The "size" version is provided for the case where you have a lot of messages defined and don't want to take up a lot of memory for the code.)

As you can see, for the smaller message it can be very fast: over 500 small messages serialized or deserialized per millisecond. Even with the 87K message it takes less than a millisecond per message.
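For reference, the speed/size choice is made per .proto file via the optimize_for option (proto2); a minimal sketch:

    option optimize_for = SPEED;        // larger, faster generated code (the default)
    // option optimize_for = CODE_SIZE; // smaller, reflection-based, slower code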

+55
Mar 15 '09 at 13:16

Another data point: this project:

http://code.google.com/p/thrift-protobuf-compare/

gives some idea of the expected performance for small objects, covering Java serialization and PB among others.

The results vary depending on your platform, but there are some general trends.

+14
Mar 31 '09 at 18:17

If you're choosing between PB and native Java serialization on speed and efficiency, just go for PB.

  • PB was designed with exactly these factors in mind. See http://code.google.com/apis/protocolbuffers/docs/overview.html
  • PB data is very compact, while Java serialization tends to replicate the whole object, including its signature. Why should I always get my class name, field names and so on in the serialized form, even though I know them inside out at the receiver? (See the sketch after this list.)
  • Think about cross-language development: it becomes difficult if one side uses Java and the other side uses C++...
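A throwaway sketch to see that metadata overhead for yourself (the class and values are invented; exact sizes vary by JVM, but the stream is dominated by the class descriptor rather than the 8 bytes of actual payload):

    import java.io.ByteArrayOutputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    // Shows the fixed metadata cost of native Java serialization: the stream
    // carries the class name, serialVersionUID and field names, not just values.
    public class OverheadDemo {

        static class Point implements Serializable {
            private static final long serialVersionUID = 1L;
            int x = 1, y = 2; // 8 bytes of actual payload
        }

        public static void main(String[] args) throws Exception {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(new Point());
            }
            // Typically prints a size an order of magnitude larger than the
            // 8 bytes of field data; the equivalent PB message is a few bytes.
            System.out.println("Serialized size: " + bos.size() + " bytes");
        }
    }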

Some developers suggest Thrift, but I would use Google PB because "I believe in Google" :-) .. Anyway, it's worth a look: http://stuartsierra.com/2008/07/10/thrift-vs-protocol-buffers

+6
Mar 15 '09 at 17:25

What do you mean by high performance? If you want millisecond-level serialization, I suggest you use the serialization approach that is simplest. If you want sub-millisecond, you will most likely need a binary format. If you want much below 10 microseconds, you will most likely need custom serialization.

I haven't seen many benchmarks for serialization/deserialization, but few frameworks support less than 200 microseconds for a serialize/deserialize round trip.

Platform-independent formats come at a cost (in effort on your part and in latency), so you may have to decide whether you want performance or platform independence. However, there is no reason you cannot have both as a configuration option that you switch between as required (as sketched below).
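A minimal sketch of that configuration switch, with hypothetical names; only the JDK-based codec is implemented here, and a PB-backed codec (wrapping a generated message type) could be registered the same way:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;

    // Hypothetical strategy interface: pick the wire format via configuration.
    interface Codec {
        byte[] encode(Object o) throws Exception;
        Object decode(byte[] b) throws Exception;
    }

    // Native Java serialization as one selectable strategy.
    class JdkCodec implements Codec {
        public byte[] encode(Object o) throws Exception {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(o);
            }
            return bos.toByteArray();
        }
        public Object decode(byte[] b) throws Exception {
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(b))) {
                return ois.readObject();
            }
        }
    }

    class CodecFactory {
        // Selected via e.g. -Dcache.codec=jdk; a "pb" entry would return a
        // codec backed by a protoc-generated message.
        static Codec fromConfig() {
            String name = System.getProperty("cache.codec", "jdk");
            if (name.equals("jdk")) return new JdkCodec();
            throw new IllegalArgumentException("unknown codec: " + name);
        }
    }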

+5
Mar 15 '09 at 13:53

You can also have a look at FST, a drop-in replacement for built-in JDK serialization that should be faster and produce smaller output.

Rough estimates from the frequent benchmarking I have done in recent years:

100% = binary/structural approaches (e.g. SBE, fst-structs)

  • inconvenient
  • post-processing (building "real" objects on the receiver side) can eat up the performance advantage, and is never included in benchmarks

~10%-35% protobuf and derivatives

~10%-30% fast serializers such as FST and KRYO

  • convenient; deserialized objects can mostly be used directly without additional translation code
  • can be tuned for performance (annotations, class registration)
  • preserve references in the object graph (no object is serialized twice)
  • can handle cyclic structures
  • general-purpose solutions; FST is fully compatible with JDK serialization

~2%-15% JDK serialization

~1%-15% fast JSON (e.g. Jackson)

  • cannot handle arbitrary object graphs, only a small subset of Java data structures
  • no reference restoration

0.001%-1% full-graph JSON/XML (e.g. JSON.io)

These numbers are meant to give a very rough impression of order of magnitude. Note that performance depends a LOT on the data structures being serialized/benchmarked, so simple benchmarks with a single simple class are mostly useless (but popular: e.g. ignoring unicode, no collections, ...).
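For a feel of the FST API mentioned above, a minimal round-trip sketch (based on FST's documented usage; the FSTConfiguration object is expensive to create and is meant to be set up once and reused):

    import java.util.Arrays;
    import java.util.List;
    import org.nustaq.serialization.FSTConfiguration;

    public class FstExample {
        // Create once and reuse; creation is expensive.
        static final FSTConfiguration conf = FSTConfiguration.createDefaultConfiguration();

        public static void main(String[] args) {
            List<String> original = Arrays.asList("a", "b", "c");
            byte[] bytes = conf.asByteArray(original);  // serialize
            Object restored = conf.asObject(bytes);     // deserialize
            System.out.println(restored + " (" + bytes.length + " bytes)");
        }
    }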

see also

http://java-is-the-new-c.blogspot.de/2014/12/a-persistent-keyvalue-server-in-40.html

http://java-is-the-new-c.blogspot.de/2013/10/still-using-externalizable-to-get.html

+5
Dec 6 '12 at 22:17

Here's the off-the-wall suggestion of the day :-) (you just triggered something in my head that I now want to try)...

If you can go for a whole caching solution, this might work: Project Darkstar. It is designed as a very high-performance game server, in particular so that reads are fast (good for a cache). It has Java and C APIs, so I believe (it has been a long time since I looked at it, and I wasn't thinking of this then) that you could save objects with Java and read them back in C, and vice versa.

If nothing else, it'll give you something to read today :-)

+1
Mar 15 '09 at 15:36

For wire-friendly serialization, consider using the Externalizable interface. Used cleverly, you'll have intimate knowledge with which to decide how to optimally marshal and unmarshal specific fields. That said, you'll need to manage the versioning of each object correctly: it's easy to unmarshal, but re-marshalling a V2 object when your code supports V1 will either break, lose information, or worse, corrupt data in a way your apps can't process correctly. If you're keen to work out an optimal path, beware that no library will solve your problem without some compromises. Generally, libraries fit most use cases and come with the added benefit that they adapt and improve over time without your input, if you've opted for an active open-source project. And they might add performance problems, introduce bugs, and even fix bugs that haven't affected you yet!
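A sketch of that versioning discipline with an invented class: write an explicit version number first, so a V2 reader can still accept V1 streams instead of breaking.

    import java.io.Externalizable;
    import java.io.IOException;
    import java.io.ObjectInput;
    import java.io.ObjectOutput;

    // Hand-rolled versioned marshalling: the version is written first so the
    // reader can branch on it. Externalizable needs a public no-arg constructor.
    public class CacheEntry implements Externalizable {
        private static final int VERSION = 2;

        private int id;
        private String label; // field added in V2

        public CacheEntry() {} // required by Externalizable

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(VERSION);
            out.writeInt(id);
            out.writeUTF(label == null ? "" : label);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            int version = in.readInt();
            id = in.readInt();
            // V1 streams have no label field; default it rather than break.
            label = version >= 2 ? in.readUTF() : "";
        }
    }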

0
Jun 09 '15 at 21:06


