Grouping by a multi-valued field in Pig

Question

I have been stuck on this issue for a while. I have a data file that looks like this:

    2012/01/01 Name1 "Category1,Category2,Category3"
    2012/01/01 Name2 "Category2,Category3"
    2012/01/01 Name3 "Category1,Category5"

Each item is associated with a comma-separated list of categories. I would like to be able to group by category name to get the result as follows:

    Category1  Name1, Name3
    Category2  Name1, Name2
    ...
    Category5  Name3

(More precisely, I do not need the element names, only the number of elements in each category.)

I ended up writing a UDF that takes the comma-separated category field and converts it to a Pig bag. My data schema now looks something like this: {date: chararray, name: chararray, categories: {t: (category:chararray)}}

I’m stuck on the next step: actually grouping by the values nested in the bag. I tried variants of the nested FOREACH statement without any luck. For instance:

    x = FOREACH myData {
        categoryNames = FOREACH categories GENERATE category;
        GENERATE myData.Name, categoryNames;
    }

My thought was that such a syntax would generate (Name, category) tuples that I could then run GROUP on. However, the actual result is the whole bag, which leads me back to square one. I’m out of ideas on how to proceed, so any help or feedback would be most appreciated. Thanks!

user-defined-functions apache-pig

Inverseofverse Feb 04 '12 at 12:49

1 answer

Romain · Accepted Answer · 2012-02-04T21:42:40+0000

Assuming each name is unique in your data file, you can FLATTEN the bag of categories, which yields one (name, category) row per category, then GROUP by category and COUNT the number of names in each group.

For example:

    name_category = FOREACH data GENERATE name, FLATTEN(categories) AS category;
    category_group = GROUP name_category BY category;
    category_count = FOREACH category_group GENERATE FLATTEN(group) AS category, COUNT(name_category) AS count;
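
To make the effect of FLATTEN concrete, here is a minimal sketch in plain Python (not Pig) that models the same three steps on the sample rows from the question. The list of categories stands in for the bag produced by the asker's UDF, and the helper names flatten_categories and count_by_category are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Sample rows from the question. The category list stands in for the Pig bag
# that the asker's UDF produces (an assumption of this sketch).
data = [
    ("2012/01/01", "Name1", ["Category1", "Category2", "Category3"]),
    ("2012/01/01", "Name2", ["Category2", "Category3"]),
    ("2012/01/01", "Name3", ["Category1", "Category5"]),
]


def flatten_categories(rows):
    """Model of FOREACH ... GENERATE name, FLATTEN(categories):
    emit one (name, category) pair per element of each row's bag."""
    return [
        (name, category)
        for _date, name, categories in rows
        for category in categories
    ]


def count_by_category(pairs):
    """Model of GROUP ... BY category followed by COUNT per group."""
    counts = defaultdict(int)
    for _name, category in pairs:
        counts[category] += 1
    return dict(counts)


name_category = flatten_categories(data)
category_count = count_by_category(name_category)

for category, count in sorted(category_count.items()):
    print(category, count)
```

The key point is the first step: flattening turns three rows with nested bags into seven flat (name, category) pairs, and only then does grouping by category become an ordinary GROUP.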
