I have a high-volume service that logs events. Every few minutes we compress the current logs with gzip and upload them to S3. From there we process the logs with Amazon's Elastic MapReduce (Hadoop), via Hive.
Right now the servers see a CPU spike for a few minutes whenever we gzip and rotate the logs. We want to switch from gzip to LZO or Snappy to reduce that spike. We are a CPU-bound service, so we're happy to trade larger log files for less CPU consumed during rotation.
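To make the tradeoff concrete, here is a rough sketch of how we've been measuring the CPU cost of the rotation step (synthetic repetitive log data, standard-library gzip only; the payload and the two compression levels are just illustrative assumptions, not our real workload):

```python
import gzip
import time

# Synthetic ~15 MB "log" payload (repetitive, like real log lines).
line = b"2013-01-01T00:00:00Z INFO request handled path=/api/v1/items status=200\n"
data = line * (15 * 1024 * 1024 // len(line))

def cpu_cost(level):
    """Return (CPU seconds, compressed size in bytes) for gzip at a given level."""
    start = time.process_time()
    out = gzip.compress(data, compresslevel=level)
    return time.process_time() - start, len(out)

fast = cpu_cost(1)   # gzip's cheapest setting
best = cpu_cost(9)   # gzip's most aggressive setting
print(f"level 1: {fast[0]:.2f}s CPU, {fast[1] / 1e6:.2f} MB")
print(f"level 9: {best[0]:.2f}s CPU, {best[1] / 1e6:.2f} MB")
```

The same harness could wrap an LZO or Snappy binding for an apples-to-apples comparison on a real log sample.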
I've read a lot about LZO and Snappy (aka Zippy). One advantage of LZO is that it's splittable in HDFS. However, our files are only ~15 MB gzipped, well under HDFS's default 64 MB block size, so I don't think splittability matters for us. Even if it did, we could just bump the block size to 128 MB.
Right now I'm leaning toward Snappy, as it seems a bit faster / less resource intensive. Neither appears to be in Amazon's yum repository, so we'd probably have to install/build either one ourselves anyway, so there isn't much difference in development time. I've heard some concerns about the LZO license, but since we'd just be installing it on our servers and it wouldn't come near our code, that should be fine, right?
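For the Hive side, my understanding is that switching codecs comes down to settings like the following (a sketch, assuming a stock EMR Hive install with Hadoop 1.x-era property names; the LZO codec class comes from the separate third-party hadoop-lzo package, which is not bundled):

```sql
-- Compress the output Hive writes; the codec for reading our uploaded
-- logs is picked up from the file extension.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- For LZO, the codec class comes from the hadoop-lzo package instead:
-- SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
```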
So which should I choose? Does one work better in Hadoop than the other? Has anyone done this with either implementation and hit problems they could share?