Grouping by a multi-valued field in Pig

I have been stuck on this issue for a while. I have a data file that looks like this:

    2012/01/01 Name1 "Category1,Category2,Category3"
    2012/01/01 Name2 "Category2,Category3"
    2012/01/01 Name3 "Category1,Category5"

Each item is associated with a comma-separated list of categories. I would like to group by category name to get a result like this:

    Category1 Name1, Name3
    Category2 Name1, Name2
    ...
    Category5 Name3

(More precisely, I do not need the item names, only the count of items in each category.)

I ended up writing a UDF that takes the comma-separated category field and converts it to a Pig bag. My data schema now looks something like this:

    {date: chararray, name: chararray, categories: {t: (category:chararray)}}
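
For illustration, the core of such a conversion might look like the following minimal sketch. The original UDF is not shown, so the function name `categories_to_bag` and the choice of Python are assumptions; a list of 1-tuples stands in for the Pig bag of `(category)` tuples.

```python
def categories_to_bag(field):
    """Split a quoted, comma-separated category field into a list of 1-tuples.

    Hypothetical helper: the list of 1-tuples mimics a Pig bag of
    (category) tuples.
    """
    # Strip surrounding whitespace and the enclosing double quotes, if any.
    cleaned = field.strip().strip('"')
    # Skip empty pieces so that an empty field yields an empty bag.
    return [(part.strip(),) for part in cleaned.split(",") if part.strip()]


print(categories_to_bag('"Category1,Category2,Category3"'))
```

Each category ends up in its own tuple, which is the shape the schema above describes.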

I'm stuck on the next step: actually grouping by the values inside the nested bag. I tried variants of the nested FOREACH statement without any luck. For instance:

    x = FOREACH myData {
        categoryNames = FOREACH categories GENERATE category;
        GENERATE myData.Name, categoryNames;
    }

My thought was that this syntax would generate (Name, category) tuples that I could then GROUP. However, the actual result is the whole bag, which puts me back at square one. I'm out of ideas on how to proceed, so any help or feedback would be most appreciated. Thanks!

1 answer

Assuming each name is unique in your data file, you can FLATTEN the bag of categories, which yields one (name, category) tuple per category, then GROUP by category and COUNT the names in each group.

For example:

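    -- One (name, category) row per element of the categories bag
    name_category = FOREACH data GENERATE name, FLATTEN(categories) AS category;
    -- Collect the rows that share a category
    category_group = GROUP name_category BY category;
    -- Count the rows in each group
    category_count = FOREACH category_group GENERATE FLATTEN(group) AS category, COUNT(name_category) AS count;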
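
To make the FLATTEN, GROUP and COUNT steps concrete, here is a rough Python sketch of the same computation on the sample data from the question. It only illustrates the semantics and is not how Pig executes the script; the helper names `flatten_categories` and `count_by_category` are made up for this example.

```python
from collections import Counter

# Sample rows from the question: (date, name, categories), where categories
# is a list of 1-tuples standing in for the Pig bag.
rows = [
    ("2012/01/01", "Name1", [("Category1",), ("Category2",), ("Category3",)]),
    ("2012/01/01", "Name2", [("Category2",), ("Category3",)]),
    ("2012/01/01", "Name3", [("Category1",), ("Category5",)]),
]


def flatten_categories(rows):
    """Emit one (name, category) pair per bag element, like FLATTEN."""
    return [(name, category) for _date, name, bag in rows for (category,) in bag]


def count_by_category(pairs):
    """Count pairs per category, like GROUP ... BY followed by COUNT."""
    return dict(Counter(category for _name, category in pairs))


pairs = flatten_categories(rows)
category_count = count_by_category(pairs)
print(category_count)
# {'Category1': 2, 'Category2': 2, 'Category3': 2, 'Category5': 1}
```

The flattening step is what the nested FOREACH in the question was missing: it turns each bag into separate rows, so the grouping key becomes a plain field.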

Source: https://habr.com/ru/post/1394720/