Pig FILTER returns an empty bag that I cannot COUNT

I am trying to count how many values in a dataset match a filter condition, but I run into problems when the filter does not match any records.

My data has many columns, but only three matter for this example: key is the data key for the set (not unique), value is the recorded float value, and nominal_value is a float holding the nominal value.

Our use case right now is to find the number of values that are 10% or more below the nominal value.

I am doing something like this:

    filtered_data = FILTER data BY value <= (0.9 * nominal_value);
    filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE COUNT(filtered_data.value);
    DUMP filtered_count;

In most cases there are no values outside the nominal range, so filtered_data is empty (or null; I do not know how to tell which). As a result filtered_count is also empty / null, which is undesirable.

How can I build a statement that returns 0 when filtered_data is empty / null? I tried several options that I found online:

    -- Extra parens in COUNT required to avoid syntax error
    filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE COUNT((filtered_data.value is null ? {} : filtered_data.value));

which leads to:

    Two inputs of BinCond must have compatible schemas. left hand side: #1259:bag{} right hand side: #1261:bag{#1260:tuple(cf#1038:float)}

and

    filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE (filtered_data.value is null ? 0 : COUNT(filtered_data.value));

which leads to empty / null results.

1 answer

The way you have it set up now, you lose all information about any key for which the number of bad values is 0. Instead, I would recommend keeping all the keys, so that you get positive confirmation that the count was 0 rather than having to infer it from the key being absent. To do this, just use an indicator and then SUM it:

    data2 = FOREACH data GENERATE key, ((value <= 0.9*nominal_value) ? 1 : 0) AS bad;
    bad_count = FOREACH (GROUP data2 BY key) GENERATE group, SUM(data2.bad);
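
If you would rather keep COUNT, a nested FOREACH is one possible alternative. This is a sketch, not tested against your data: it assumes the same data relation with key, value and nominal_value fields, and the aliases grouped, bad and n_bad are names made up here for illustration. It groups first and filters inside each group, so every key survives the grouping, and as far as I know COUNT on an empty bag returns 0 rather than null.

```pig
-- Assumes `data` is already loaded with fields key, value, nominal_value.
-- Group BEFORE filtering so that keys with no bad values are not dropped.
grouped = GROUP data BY key;

bad_count = FOREACH grouped {
    -- Rows for this key that are 10% or more below nominal; may be an empty bag.
    bad = FILTER data BY value <= 0.9 * nominal_value;
    -- One output row per key, including keys where `bad` is empty.
    GENERATE group AS key, COUNT(bad) AS n_bad;
};

DUMP bad_count;
```

The indicator-and-SUM version above is probably the simpler of the two; the nested form mainly helps if you also want other per-group operations on the filtered bag.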

Source: https://habr.com/ru/post/1495581/

