In Apache Pig (version 0.16.x), what is the most efficient way to filter a dataset based on an existing list of values for one of the fields in the dataset?
For example (updated following @inquisitive_mind's hint):
Input: a line-delimited file with one value per line.

my_codes.txt

    '110'
    '100'
    '000'

sample_data.txt

    '110', 2
    '110', 3
    '001', 3
    '000', 1

Desired output

    '110', 2
    '110', 3
    '000', 1
Example script

    %default my_codes_file 'my_codes.txt'
    %default sample_data_file 'sample_data.txt'

    my_codes = LOAD '$my_codes_file' AS (code:chararray);
    sample_data = LOAD '$sample_data_file' AS (code:chararray, point:float);
    desired_data = FILTER sample_data BY code IN (my_codes.code);
Error:
Scalar has more than one row in the output. 1st : ('110'), 2nd :('100') (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )
I also tried

    FILTER sample_data BY code IN my_codes;

but the IN operator seems to require parentheses. I then tried

    FILTER sample_data BY code IN (my_codes);

but got this error: A column must be projected from a relation to use it as a scalar
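For context, the first error arises because `my_codes.code` is treated as a scalar projection, which Pig accepts only when the relation has exactly one row. One alternative that may be worth trying is an inner join against the code list instead of `IN`. The following is a minimal sketch, not a confirmed answer: the aliases `distinct_codes` and `joined` are made up for illustration, the `'replicated'` hint assumes the code list is small enough to fit in memory, and the LOAD statements are copied from the script in the question, so any delimiter or quoting handling is left unchanged. Whether this is the most efficient option for a given dataset is not established here.

```pig
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'

-- Same LOAD statements as in the question (default loader and delimiter).
my_codes = LOAD '$my_codes_file' AS (code:chararray);
sample_data = LOAD '$sample_data_file' AS (code:chararray, point:float);

-- Assumption: remove duplicate codes so the join cannot multiply rows.
distinct_codes = DISTINCT my_codes;

-- An inner join keeps only the rows whose code appears in the list.
-- 'replicated' requests a map-side join; the small relation is listed last.
joined = JOIN sample_data BY code, distinct_codes BY code USING 'replicated';

-- Keep only the original columns, disambiguated with '::'.
desired_data = FOREACH joined GENERATE sample_data::code AS code, sample_data::point AS point;

DUMP desired_data;
```

If the code list is too large for memory, dropping `USING 'replicated'` falls back to Pig's default join at the cost of a reduce phase.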