Pig: force one mapper per input line/row

I have a Pig streaming job in which the number of mappers should equal the number of lines/rows in the input file. I know that setting

set mapred.min.split.size 16
set mapred.max.split.size 16
set pig.noSplitCombination true

will ensure that each split is 16 bytes. But how can I guarantee that each map task gets exactly one line as input? Lines vary in length, so using a constant value for mapred.min.split.size and mapred.max.split.size is not the best solution.

Here is the code I intend to use:

 input = load 'hdfs://cluster/tmp/input';
 DEFINE CMD `/usr/bin/python script.py`;
 OP = stream input through CMD;
 dump OP;

SOLVED! Thanks, zsxwing.

And if anyone else runs into this oddity, note the following:

To make Pig create one mapper for each input file, you must set

 set pig.splitCombination false 

but not

 set pig.noSplitCombination true 

Why this is so, I have no idea!

1 answer

Following your hint, I looked at the Pig source code to find the answer.

Setting pig.noSplitCombination in a Pig script does not work. In a Pig script, you need to use pig.splitCombination. Pig will then set pig.noSplitCombination in the JobConf according to the value of pig.splitCombination.

If you want to set pig.noSplitCombination directly, you need to use the command line. For instance,

 pig -Dpig.noSplitCombination=true -f foo.pig 

The difference between the two methods is as follows: if you use the set statement in a Pig script, the value is stored in the Pig properties. If you use -D, it is stored in the Hadoop configuration.

If you use set pig.noSplitCombination true, the pair (pig.noSplitCombination, true) is stored in the Pig properties. But when Pig initializes the JobConf, it looks up pig.splitCombination in the Pig properties, not pig.noSplitCombination. Thus, your setting has no effect, as can be seen in the Pig source code. The correct way is to set pig.splitCombination false, as you mentioned.

If you use -Dpig.noSplitCombination=true, the pair (pig.noSplitCombination, true) is stored in the Hadoop configuration. Because the JobConf is copied from the Configuration, the -D value is passed directly to the JobConf.

Finally, PigInputFormat reads pig.noSplitCombination from the JobConf to decide whether to combine splits, as can also be seen in the Pig source code.
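
To make the three cases easier to compare, here is a small Python model of the lookup chain described above. It is only a sketch of the behavior as explained in this answer, not Pig's actual code: the function names build_job_conf and combines_splits are hypothetical, and the model assumes Pig writes pig.noSplitCombination into the JobConf only when pig.splitCombination is explicitly false, which is consistent with the -D route working.

```python
def build_job_conf(pig_properties, hadoop_conf):
    """Hypothetical model of how the JobConf is derived (not Pig's real code)."""
    # The JobConf starts as a copy of the Hadoop Configuration,
    # so anything passed with -D is already present.
    job_conf = dict(hadoop_conf)
    # Pig consults only pig.splitCombination in its own properties;
    # a pig.noSplitCombination entry there is never read.
    # Assumption: the override happens only for an explicit "false".
    if pig_properties.get("pig.splitCombination", "true") == "false":
        job_conf["pig.noSplitCombination"] = "true"
    return job_conf


def combines_splits(job_conf):
    """Hypothetical model of the check PigInputFormat makes on the JobConf."""
    return job_conf.get("pig.noSplitCombination", "false") != "true"


# Case 1: "set pig.noSplitCombination true" in the script (Pig properties).
ignored = combines_splits(build_job_conf({"pig.noSplitCombination": "true"}, {}))

# Case 2: "set pig.splitCombination false" in the script (Pig properties).
via_set = combines_splits(build_job_conf({"pig.splitCombination": "false"}, {}))

# Case 3: "pig -Dpig.noSplitCombination=true" (Hadoop configuration).
via_d = combines_splits(build_job_conf({}, {"pig.noSplitCombination": "true"}))

print(ignored, via_set, via_d)  # True False False
```

In this model only the first case still combines splits, which matches the behavior reported in the question.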


Source: https://habr.com/ru/post/1485705/