Group by multiple fields and output tuple

I have a feed in the following format:

 Hour  Key  ID   Value
 1     K1   001  3
 1     K1   002  2
 2     K1   005  4
 1     K2   002  1
 2     K2   003  5
 2     K2   004  6

and I want to group the feed by (Hour, Key), then sum the Value, but keep the IDs as a tuple:

 ({1, K1}, {001, 002}, 5)
 ({2, K1}, {005}, 4)
 ({1, K2}, {002}, 1)
 ({2, K2}, {003, 004}, 11)

I know how to use FLATTEN to generate the sum of Value, but I don't know how to output ID as a tuple. This is what I have so far:

 A = LOAD 'data' AS (Hour:chararray, Key:chararray, ID:chararray, Value:int);
 B = GROUP A BY (Hour, Key);
 C = FOREACH B GENERATE FLATTEN(group) AS (Hour, Key), SUM(A.Value) AS Value;

Can you explain how to do this? Appreciate it!

1 answer

You just need to use the bag projection operator, the dot (.). It creates a new bag in which the tuples have only the element(s) you specified. In your case, use A.ID. In fact, you are already using this operator to build the input to SUM: that input is a bag of single-element tuples that you create by projecting the Value field.

 A = LOAD 'data' AS (Hour:chararray, Key:chararray, ID:chararray, Value:int);
 B = GROUP A BY (Hour, Key);
 -- A.ID projects the ID field of each group into a bag of single-element tuples
 C = FOREACH B GENERATE FLATTEN(group) AS (Hour, Key), A.ID, SUM(A.Value) AS Value;
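Note that A.ID yields a bag, not a tuple. If a real tuple is needed, as in the output shown in the question, one option is to wrap the projection in the builtin BagToTuple function. The following is only a minimal sketch: it assumes a Pig release that ships BagToTuple (0.11 or later, as far as I know), and the alias IDs is a name introduced here for illustration. Bags are unordered, so the order of the IDs inside each tuple is not guaranteed.

```pig
-- Sketch only: assumes the builtin BagToTuple is available in your Pig release.
A = LOAD 'data' AS (Hour:chararray, Key:chararray, ID:chararray, Value:int);
B = GROUP A BY (Hour, Key);
-- Keep `group` unflattened so the (Hour, Key) pair stays a tuple,
-- and collapse the bag of IDs into a single tuple (IDs is an illustrative alias).
C = FOREACH B GENERATE group, BagToTuple(A.ID) AS IDs, SUM(A.Value) AS Value;
DUMP C;
```

Keeping the bag (plain A.ID) is usually the better choice when later steps need to FLATTEN or count the IDs; the tuple form is mainly useful for final output.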

Source: https://habr.com/ru/post/947629/

