Pig: efficient filtering by a loaded list

In Apache Pig (version 0.16.x), what is the most efficient way to filter a dataset by an existing list of values for one of its fields?

For example (updated following a hint from @inquisitive_mind):

Input: my_codes.txt, a line-delimited file with one value per line

    '110'
    '100'
    '000'

sample_data.txt

    '110', 2
    '110', 3
    '001', 3
    '000', 1

Desired output

    '110', 2
    '110', 3
    '000', 1

Script example

    %default my_codes_file 'my_codes.txt'
    %default sample_data_file 'sample_data.txt'

    my_codes = LOAD '$my_codes_file' as (code:chararray)
    sample_data = LOAD '$sample_data_file' as (code: chararray, point: float)
    desired_data = FILTER sample_data BY code IN (my_codes.code);

Error:

 Scalar has more than one row in the output. 1st : ('110'), 2nd :('100') (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" ) 

I also tried FILTER sample_data BY code IN my_codes; but the IN operator seems to require parentheses. I then tried FILTER sample_data BY code IN (my_codes); but got this error: A column must be projected from a relation to use it as a scalar

1 answer

In my_codes.txt the codes are laid out as a single row instead of a column. Since you are loading the file into one field, each code should be on its own line, as shown below.

    '110'
    '100'
    '000'

Alternatively, you can use a JOIN:

    joined_data = JOIN sample_data BY code, my_codes BY code;
    desired_data = FOREACH joined_data GENERATE $0, $1;
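
Since the question asks about efficiency, here is a rough sketch of how the JOIN approach could be written as a replicated (map-side) join, which Pig supports through USING 'replicated' when the relations listed after the first one are small enough to fit in memory. It rests on a few assumptions that are not in the original post: sample_data.txt is taken to be comma-delimited (hence PigStorage(',')), the list of codes is taken to be small, and the relation name unique_codes is made up for illustration.

```pig
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'

-- One code per line, loaded into a single chararray field.
my_codes = LOAD '$my_codes_file' AS (code:chararray);

-- Assumption: fields are separated by commas, so PigStorage(',') is used
-- instead of the default tab delimiter.
sample_data = LOAD '$sample_data_file' USING PigStorage(',')
              AS (code:chararray, point:float);

-- Hypothetical helper relation: drop duplicate codes so the join
-- cannot emit a matching data row more than once.
unique_codes = DISTINCT my_codes;

-- Replicated join: the large relation comes first, and the small
-- relation listed after it is loaded into memory.
joined_data = JOIN sample_data BY code, unique_codes BY code USING 'replicated';

-- Keep only the columns that came from sample_data.
desired_data = FOREACH joined_data
               GENERATE sample_data::code AS code, sample_data::point AS point;

DUMP desired_data;
```

If the list of codes is too large to hold in memory, dropping USING 'replicated' falls back to the default join, at the cost of a reduce phase.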

Source: https://habr.com/ru/post/1271069/

