Run a Pig query over data stored in Hive

I would like to know how to run Pig queries over data stored in Hive format. I configured Hive to store compressed data (using this tutorial: http://wiki.apache.org/hadoop/Hive/CompressedStorage ).

Before that, I used the regular Pig load function with the Hive delimiter (^A). But now Hive stores its data in sequence files with compression. Which load function should I use?
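For reference, this is roughly what my old load statement looked like (the path and schema here are just placeholders for illustration; ^A is \u0001):

 a = LOAD '/user/hive/warehouse/table' USING PigStorage('\u0001') AS (ts: int, user_id: int, url: chararray);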

Note that tight integration is not required, as mentioned here: Using Hive with Pig. I just want to know which load function to use to read the compressed sequence files created by Hive.

Thanks for all the answers.

1 answer

Here's what I found out: using HiveColumnarLoader makes sense if you store the data as an RCFile. To load a table with it, you first need to register several jars:

 register /srv/pigs/piggybank.jar
 register /usr/lib/hive/lib/hive-exec-0.5.0.jar
 register /usr/lib/hive/lib/hive-common-0.5.0.jar

 a = LOAD '/user/hive/warehouse/table' USING org.apache.pig.piggybank.storage.HiveColumnarLoader('ts int, user_id int, url string');
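Once loaded, the relation behaves like any other. A minimal sketch, assuming the field names from the schema string ('ts', 'user_id', 'url') carry through to the loaded relation:

 b = FILTER a BY user_id > 0;
 c = FOREACH b GENERATE ts, url;
 dump c;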

To load data from a sequence file, you also use Piggybank (as in the previous example). The SequenceFileLoader from Piggybank should handle compressed files:

 register /srv/pigs/piggybank.jar
 DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
 a = LOAD '/user/hive/warehouse/table' USING SequenceFileLoader AS (key: int, value: int);
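As a quick smoke test that the loader actually reads the compressed files, you can count the loaded tuples with Pig's built-in COUNT (just a sketch, continuing from the load above):

 b = GROUP a ALL;
 c = FOREACH b GENERATE COUNT(a);
 dump c;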

This does not work with Pig 0.7, however, because it cannot read the BytesWritable type and map it to a Pig type, and you will get this exception:

 2011-07-01 10:30:08,589 WARN org.apache.pig.piggybank.storage.SequenceFileLoader: Unable to translate key class org.apache.hadoop.io.BytesWritable to a Pig datatype
 2011-07-01 10:30:08,625 WARN org.apache.hadoop.mapred.Child: Error running child
 org.apache.pig.backend.BackendException: ERROR 0: Unable to translate class org.apache.hadoop.io.BytesWritable to a Pig datatype
 	at org.apache.pig.piggybank.storage.SequenceFileLoader.setKeyType(SequenceFileLoader.java:78)
 	at org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:132)
 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
 	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
 	at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
 	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
 	at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
 	at java.security.AccessController.doPrivileged(Native Method)
 	at javax.security.auth.Subject.doAs(Subject.java:396)
 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
 	at org.apache.hadoop.mapred.Child.main(Child.java:211)

How to compile Piggybank is described here: Unable to build piggybank -> /home/build/ivy/lib does not exist

