Pig: efficient filtering by a loaded list

Question


In Apache Pig (version 0.16.x), what is the most efficient way to filter a dataset based on an existing list of values for one of the fields in the dataset?

For example (updated following @inquisitive_mind's hint):

Input: my_codes.txt, a line-delimited file with one value per line

    '110'
    '100'
    '000'

sample_data.txt

    '110', 2
    '110', 3
    '001', 3
    '000', 1

Desired output

    '110', 2
    '110', 3
    '000', 1

Script example

    %default my_codes_file 'my_codes.txt'
    %default sample_data_file 'sample_data.txt'
    my_codes = LOAD '$my_codes_file' as (code:chararray);
    sample_data = LOAD '$sample_data_file' as (code: chararray, point: float);
    desired_data = FILTER sample_data BY code IN (my_codes.code);

Error:

 Scalar has more than one row in the output. 1st : ('110'), 2nd :('100') (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )

I also tried FILTER sample_data BY code IN my_codes; but the IN operator seems to require parentheses. I then tried FILTER sample_data BY code IN (my_codes); but got the error: A column must be projected from a relation to use it as a scalar


apache-pig

Quetzalcoatl Jun 13 '17 at 21:48


1 answer

VK_217 · Accepted Answer · 2017-06-13T22:59:20+0000

The file my_codes.txt has the codes in a single row instead of a column. Since you are loading it into one field, the codes should be one per line, as shown below.

    '110'
    '100'
    '000'

Alternatively, you can use JOIN:

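    -- Inner join keeps only the rows whose code appears in my_codes
    joined_data = JOIN sample_data BY code, my_codes BY code;
    -- $0 and $1 are sample_data's code and point
    desired_data = FOREACH joined_data GENERATE $0, $1;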
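If the list of codes is small enough to fit in memory, a replicated (map-side) join may be a cheaper variant of the JOIN above, since it avoids the reduce phase. The following is a minimal sketch under that assumption, not a benchmarked recommendation. The alias `unique_codes` is introduced here for illustration, and the LOAD statements assume the default tab-delimited PigStorage, as in the question.

```pig
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'

-- Default loader splits on tabs; add USING PigStorage(',') if the
-- data file is actually comma-separated (assumption, not verified).
my_codes = LOAD '$my_codes_file' AS (code:chararray);
sample_data = LOAD '$sample_data_file' AS (code:chararray, point:float);

-- De-duplicate the list so a repeated code cannot multiply matching rows.
unique_codes = DISTINCT my_codes;

-- In a replicated join the small relation must be listed last;
-- Pig loads it into memory on the map side.
joined_data = JOIN sample_data BY code, unique_codes BY code USING 'replicated';

-- Keep only the columns that came from sample_data.
desired_data = FOREACH joined_data GENERATE sample_data::code AS code, sample_data::point AS point;

DUMP desired_data;
```

If the list does not fit in memory, the replicated join will fail, and the plain JOIN shown above is the safer choice.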
