How do I get Pig to work with LZO files?

I've seen a couple of tutorials for this online, but each one seems to say something different. On top of that, none of them makes it clear whether you're meant to be working on the remote cluster itself or interacting with it locally from your own machine, etc.

My goal is to get Pig on my local computer (a Mac) working with LZO-compressed files that live on a Hadoop cluster which is already set up to handle LZO. I have Hadoop installed locally and can pull files from the cluster with hadoop fs -[command].

I also have Pig installed locally, and it connects to the Hadoop cluster fine whether I run scripts or just poke around in the grunt shell. I can load and play with non-LZO files without any trouble; my only problem is figuring out a way to load the LZO files. Maybe I can just run them through ElephantBird somehow? I have no idea, and I've found only minimal information about this online.

So any short tutorial or answer for this would be awesome, and hopefully it will help more people than just me.

1 answer

I recently got this working at my job and wrote up a wiki page on it for my colleagues. Here is an excerpt that details how to get Pig working with LZOs. Hope it helps someone!

NOTE: This is written with a Mac in mind. The steps will be nearly identical on other OSes, and this should give you everything you need to know to get set up on Windows or Linux, but you will have to extrapolate a little (obviously, swap the Mac-centric folders for whatever your OS uses, etc.).

Getting Pig to work with LZOs

This was the most frustrating and time-consuming part for me, not because it is hard, but because there are 50 different tutorials online, none of which are all that helpful. Anyway, here is what I did to get it working:

  • Clone hadoop-lzo from GitHub at https://github.com/kevinweil/hadoop-lzo .

  • Compile it to produce the hadoop-lzo-*.jar and the native *.o libraries. You will need to compile this on a 64-bit machine. (There is a build sketch after this list.)

  • Copy the native libraries to $HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/.

  • Copy the Java jar to $HADOOP_HOME/lib and $PIG_HOME/lib.

  • Then configure hadoop and pig so that the java.library.path property points to the LZO native libraries. You can do this in $HADOOP_HOME/conf/mapred-site.xml with:

     <property>
       <name>mapred.child.env</name>
       <value>JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/</value>
     </property>
  • Now try the grunt shell by running pig again and make sure everything still works. If it does not, you probably messed something up in mapred-site.xml and should double-check it.

  • Excellent! We are almost there. All that is left is to install elephant-bird. You can get it from https://github.com/kevinweil/elephant-bird (clone it).

  • Now, to get elephant-bird to build, you will need quite a few prerequisites. They are listed on the page above and are subject to change, so I will not repeat them here. What I will say is that the versions matter a lot: if you get the wrong version and try running ant, you will get errors. So do not try to grab the pre-reqs from brew or MacPorts, since you will likely end up with something too new. Instead, just download the tarballs and build each one yourself.

  • Run ant in the elephant-bird folder to build the jar (see the sketch after this list).

  • For convenience, move the relevant jars (hadoop-lzo-*.jar and elephant-bird-*.jar), which you will need to register often, somewhere easy to find. /usr/local/lib/hadoop/... works nicely.

  • Try it all out! Play around with loading both normal files and LZOs in the grunt shell. Register the relevant jars mentioned above, try loading a file, limiting the output to a manageable number of rows, and dumping it. This should all work fine whether you are using a plain text file or an LZO. (There is a grunt sketch after this list.)
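
To make the hadoop-lzo steps above more concrete, here is a minimal shell sketch of the clone/build/copy sequence. It assumes the ant-based build hadoop-lzo used at the time (the targets and output paths may differ in your checkout) and that $HADOOP_HOME and $PIG_HOME point at your local installs:

     # Clone and build hadoop-lzo (assumes the ant build; check the repo's README for current targets).
     # The native build needs the lzo headers installed and a 64-bit JVM.
     git clone https://github.com/kevinweil/hadoop-lzo.git
     cd hadoop-lzo
     ant compile-native tar

     # Copy the jar and the native libraries into Hadoop and Pig (build output paths may vary by version).
     cp build/hadoop-lzo-*.jar $HADOOP_HOME/lib/
     cp build/hadoop-lzo-*.jar $PIG_HOME/lib/
     mkdir -p $HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64
     cp build/native/Mac_OS_X-x86_64-64/lib/* $HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/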
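
Similarly, a hedged sketch of the elephant-bird build. It assumes the ant build described in the repo's README; the name and location of the resulting jar depend on which version you check out, so look at what ant actually produces:

     # Clone and build elephant-bird (the prerequisites from its README must already be installed).
     git clone https://github.com/kevinweil/elephant-bird.git
     cd elephant-bird
     ant

     # Park the resulting jar somewhere easy to reference from grunt
     # (I'm assuming it lands under build/; adjust to wherever ant actually puts it).
     mkdir -p /usr/local/lib/hadoop
     cp build/elephant-bird-*.jar /usr/local/lib/hadoop/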
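
Finally, here is roughly what the grunt test could look like. The jar names and the HDFS path are placeholders, and I am assuming elephant-bird's LzoTextLoader for line-oriented LZO text; the loader class you actually want depends on your data format and elephant-bird version:

     -- In the grunt shell: register the jars built above (versions/paths are placeholders).
     REGISTER /usr/local/lib/hadoop/hadoop-lzo-x.x.x.jar;
     REGISTER /usr/local/lib/hadoop/elephant-bird-x.x.x.jar;

     -- Load an LZO-compressed text file, keep a manageable number of rows, and dump it.
     raw = LOAD '/path/on/hdfs/somefile.lzo' USING com.twitter.elephantbird.pig.load.LzoTextLoader();
     small = LIMIT raw 10;
     DUMP small;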



