I have been stuck on this issue for a while. I have a data file that looks like this:
2012/01/01 Name1 "Category1,Category2,Category3"
2012/01/01 Name2 "Category2,Category3"
2012/01/01 Name3 "Category1,Category5"
Each item is associated with a comma-separated list of categories. I would like to be able to group by category name to get the result as follows:
Category1 Name1, Name3
Category2 Name1, Name2
...
Category5 Name3
(More precisely, I do not need the item names, only the number of items in each category.)
I ended up writing a UDF that takes the comma-separated category field and converts it to a Pig bag. My data schema now looks something like this: {date: chararray, name: chararray, categories: {t: (category:chararray)}}
I'm stuck on the next step: actually grouping by the values nested inside the bag. I tried variants of the nested FOREACH statement without any luck. For instance:
x = FOREACH myData {
    categoryNames = FOREACH categories GENERATE category;
    GENERATE myData.Name, categoryNames;
}
My thought was that this syntax would generate (Name, category) tuples that I could then run GROUP on. However, the actual result is the whole bag, which leads me back to square one. I'm out of ideas on how to proceed, so any help or feedback would be much appreciated. Thanks!
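For reference, here is a minimal sketch of one direction that might produce those (name, category) tuples: Pig's FLATTEN operator un-nests a bag, so each element becomes its own row, which can then be grouped and counted. The aliases pairs, grouped, and counts and the field num_items are illustrative names, and the sketch assumes myData has the schema shown above.

```pig
-- Assumption: myData has the schema
--   {date: chararray, name: chararray, categories: {t: (category: chararray)}}

-- FLATTEN un-nests the bag: one output row per (name, category) pair
pairs = FOREACH myData GENERATE name, FLATTEN(categories) AS category;

-- Group the flattened rows by category
grouped = GROUP pairs BY category;

-- Count the items in each category (names are not needed for the count)
counts = FOREACH grouped GENERATE group AS category, COUNT(pairs) AS num_items;

DUMP counts;
```

If the item names are wanted as well, the grouped relation already carries them in the bag named pairs, so the last FOREACH could generate pairs.name instead of the count.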