Snappy or LZO for logs that are then consumed by Hadoop

I have a high-volume service and I am logging events. Every few minutes I gzip the logs and upload them to S3. From there we process the logs with Amazon Elastic MapReduce (Hadoop), via Hive.

Right now the servers see a CPU spike for a few minutes whenever we gzip and rotate the logs. We want to switch from gzip to LZO or Snappy to reduce that spike. We are a CPU-bound service, so we are willing to trade larger log files for less CPU consumed during rotation.
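
A quick way to sanity-check that trade-off on your own data is something like the sketch below. It assumes the python-snappy package is installed, and the file name is illustrative, not from the actual setup.

import gzip
import time

import snappy  # pip install python-snappy

with open("events.log", "rb") as f:   # illustrative log chunk
    data = f.read()

t0 = time.process_time()
gz = gzip.compress(data)              # what rotation does today
t1 = time.process_time()
sz = snappy.compress(data)            # candidate replacement
t2 = time.process_time()

print("gzip:   %.2fs CPU, %.1f MB" % (t1 - t0, len(gz) / 1e6))
print("snappy: %.2fs CPU, %.1f MB" % (t2 - t1, len(sz) / 1e6))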

I have read a lot about LZO and Snappy (aka Zippy). One advantage of LZO is that it is splittable in HDFS. However, our files are ~15 MB gzipped, so I don't think we will hit the default 64 MB block size in HDFS, meaning splittability shouldn't matter. Even if it did, we could just raise the default to 128 MB.
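
For reference, raising the default block size is a one-line change in hdfs-site.xml; a hedged sketch (the property is dfs.block.size on older Hadoop releases and dfs.blocksize on newer ones, value in bytes):

<!-- hdfs-site.xml: default HDFS block size of 128 MB -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>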

Right now I am leaning towards trying Snappy, as it seems a bit faster / less resource-intensive. Neither codec appears to be in the Amazon yum repository, so we would probably have to install/build it ourselves either way, so there is not much difference in development effort. I have heard some concerns about the LZO license, but I think that is fine as long as it only gets installed on our servers and is not linked into our code, right?
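
Since neither codec is in that repository, the build is roughly the usual from-source routine; a hedged sketch (the version placeholder is illustrative, and the autotools build applies to snappy releases of that era, newer ones use CMake):

$ tar xzf snappy-<version>.tar.gz && cd snappy-<version>   # snappy C library
$ ./configure && make && sudo make install
$ pip install python-snappy                                # Python bindings from PyPI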

So which should I choose? Does one behave better in Hadoop than the other? Has anyone done this with either implementation and have any problems they could share?

1 answer

It may be too late, but python-snappy provides a command-line tool for Snappy compression / decompression:

Compress and decompress a file:

$ python -m snappy -c uncompressed_file compressed_file.snappy

$ python -m snappy -d compressed_file.snappy uncompressed_file

Stream compression and decompression:

$ cat uncompressed_data | python -m snappy -c > compressed_data.snappy

$ cat compressed_data.snappy | python -m snappy -d > uncompressed_data
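
If you would rather call Snappy from inside the log-rotation code than shell out, the same package exposes an in-process API; a minimal sketch (the file names are illustrative, and note that the raw block format from snappy.compress is not the same as the framed stream the CLI writes):

import snappy  # pip install python-snappy

with open("events.log", "rb") as src:          # illustrative file names
    data = src.read()

compressed = snappy.compress(data)             # raw snappy block format
with open("events.log.snappy", "wb") as dst:
    dst.write(compressed)

assert snappy.decompress(compressed) == data   # round-trips losslessly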

Snappy also decompresses consistently 20%+ faster than LZO, which is a pretty big win if the files get read a lot over Hadoop. Finally, it is used by Google for things like BigTable and MapReduce, which is a pretty important endorsement for me.
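
That figure is easy to re-check on your own logs; a minimal sketch of the Snappy side (the LZO side would be the same loop with the python-lzo package's lzo.decompress, if you have it installed; the file name is illustrative):

import time

import snappy  # pip install python-snappy

with open("events.log", "rb") as f:            # illustrative log chunk
    compressed = snappy.compress(f.read())

t0 = time.process_time()
for _ in range(100):                           # repeat for a stable measurement
    snappy.decompress(compressed)
t1 = time.process_time()
print("snappy decompress: %.3fs CPU for 100 rounds" % (t1 - t0))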

