I have a high-volume service that logs events. Every few minutes we compress the current logs with gzip and upload them to S3. From there we process the logs with Amazon's Elastic MapReduce (Hadoop), via Hive.
Right now the servers see a CPU spike for a few minutes whenever we gzip and rotate the logs. We want to switch from gzip to LZO or Snappy to reduce that spike. We are a CPU-bound service, so we're happy to trade larger log files for less CPU consumed during rotation.
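To make the tradeoff concrete, here is a rough sketch of how we've been measuring the CPU cost of the rotation step (synthetic repetitive log data, standard-library gzip only; the payload and the two compression levels are just illustrative assumptions, not our real workload):

```python
import gzip
import time

# Synthetic ~15 MB "log" payload (repetitive, like real log lines).
line = b"2013-01-01T00:00:00Z INFO request handled path=/api/v1/items status=200\n"
data = line * (15 * 1024 * 1024 // len(line))

def cpu_cost(level):
    """Return (CPU seconds, compressed size in bytes) for gzip at a given level."""
    start = time.process_time()
    out = gzip.compress(data, compresslevel=level)
    return time.process_time() - start, len(out)

fast = cpu_cost(1)   # gzip's cheapest setting
best = cpu_cost(9)   # gzip's most aggressive setting
print(f"level 1: {fast[0]:.2f}s CPU, {fast[1] / 1e6:.2f} MB")
print(f"level 9: {best[0]:.2f}s CPU, {best[1] / 1e6:.2f} MB")
```

The same harness could wrap an LZO or Snappy binding for an apples-to-apples comparison on a real log sample.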
I've read a lot about LZO and Snappy (aka Zippy). One advantage of LZO is that it's splittable in HDFS. However, our files are only ~15 MB gzipped, well under HDFS's default 64 MB block size, so I don't think splittability matters for us. Even if it did, we could just bump the block size to 128 MB.
Right now I'm leaning toward Snappy, as it seems a bit faster / less resource intensive. Neither appears to be in Amazon's yum repository, so we'd probably have to install/build either one ourselves anyway, so there isn't much difference in development time. I've heard some concerns about the LZO license, but since we'd just be installing it on our servers and it wouldn't come near our code, that should be fine, right?
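For the Hive side, my understanding is that switching codecs comes down to settings like the following (a sketch, assuming a stock EMR Hive install with Hadoop 1.x-era property names; the LZO codec class comes from the separate third-party hadoop-lzo package, which is not bundled):

```sql
-- Compress the output Hive writes; the codec for reading our uploaded
-- logs is picked up from the file extension.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- For LZO, the codec class comes from the hadoop-lzo package instead:
-- SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
```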
So which should I choose? Does one work better in Hadoop than the other? Has anyone done this with either implementation and hit problems they could share?