Pig 0.11.1 - Grouping groups in a time range

Question

I have a dataset, A, with a timestamp, visitor, and URL:

    (2012-07-21T14:00:00.000Z, joe, hxxp://www.aaa.com)
    (2012-07-21T14:01:00.000Z, mary, hxxp://www.bbb.com)
    (2012-07-21T14:02:00.000Z, joe, hxxp://www.aaa.com)

I want to count the number of visits per user per URL within a time window, say 10 minutes, but as a sliding window that advances one minute at a time. The output would be:

    (2012-07-21T14:00 to 2012-07-21T14:10, joe, hxxp://www.aaa.com, 2)
    (2012-07-21T14:01 to 2012-07-21T14:11, joe, hxxp://www.aaa.com, 1)

To simplify the arithmetic, I convert the timestamp to the minute of the day:

    (840, joe, hxxp://www.aaa.com) /* 840 = 14 hrs x 60 + 00 mins */
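As a rough illustration of this conversion outside Pig, here is a minimal Python sketch. The function name `minute_of_day` is made up for the example, and it assumes every timestamp has exactly the ISO 8601 shape shown in the sample data.

```python
from datetime import datetime

def minute_of_day(ts):
    """Convert a timestamp like '2012-07-21T14:00:00.000Z' to minutes since midnight."""
    # Assumption: timestamps always carry milliseconds and a literal trailing 'Z'.
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.hour * 60 + parsed.minute

print(minute_of_day("2012-07-21T14:00:00.000Z"))  # 840
print(minute_of_day("2012-07-21T14:02:00.000Z"))  # 842
```

Seconds are simply dropped here, so every event is truncated to the minute it falls in.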

To iterate over A with the sliding time window, I create a dataset B containing the minutes of the day:

 (0) (1) (2) . . . . (1440)

Ideally, I want to do something like:

    A = load 'dataset1' AS (ts, visitor, uri);
    B = load 'dataset2' AS (minute);
    foreach B {
        C = filter A by ts > minute AND ts < minute + 10;
        D = GROUP C BY (visitor, uri);
        foreach D GENERATE group, COUNT(C) as mycnt;
    }
    DUMP B;

I know that GROUP is not allowed inside the FOREACH loop, but is there a workaround to achieve the same result?

Thanks!

+6

range mapreduce hadoop apache-pig

Joe nate Aug 1 '13 at 20:38


2 answers

    A = load 'dataSet1' as (ts, visitor, uri);
    houred = FOREACH A GENERATE visitor, org.apache.pig.tutorial.ExtractHour(ts) as hour, uri;
    hour_frequency1 = GROUP houred BY (hour, visitor);

Something like this should help. ExtractHour is a UDF; you can write something similar for the duration you need. Then group by hour and visitor, and use GENERATE with COUNT to do the counting.

http://pig.apache.org/docs/r0.7.0/tutorial.html
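To make clear what this kind of bucketing computes, here is a minimal Python sketch, assuming 10-minute buckets in place of hours and rows already converted to minute of day. The name `bucket_counts` is hypothetical. Note that this produces fixed, non-overlapping buckets, which is not the same as the sliding window described in the question.

```python
from collections import Counter

def bucket_counts(rows, width=10):
    """Count rows per (bucket_start, visitor); rows are (minute, visitor, uri)."""
    counts = Counter()
    for minute, visitor, _uri in rows:
        # Truncate to the start of the bucket; this plays the role of ExtractHour.
        bucket_start = (minute // width) * width
        counts[(bucket_start, visitor)] += 1
    return counts

rows = [
    (840, "joe", "hxxp://www.aaa.com"),
    (841, "mary", "hxxp://www.bbb.com"),
    (842, "joe", "hxxp://www.aaa.com"),
]
counts = bucket_counts(rows)
print(counts[(840, "joe")])   # 2
print(counts[(840, "mary")])  # 1
```

All three sample rows land in the single bucket starting at minute 840, so there is no separate result for a window starting at 841.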

0

Vijay kukkala Aug 1 '13 at 21:06


mr2ert · Accepted Answer · 2013-08-01T21:52:45+0000

Maybe you can do something like this?

NOTE: This depends on the minutes in your logs being integers. If they are not, you can round to the nearest minute.

myudf.py

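    #!/usr/bin/python
    @outputSchema('expanded: {(num:int)}')
    def expand(start, end):
        # One single-field tuple per minute in [start, end).
        # (x,) is a one-element tuple; (x) alone is just x.
        return [(x,) for x in range(start, end)]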

myscript.pig

    register 'myudf.py' using jython as myudf ;

    -- A1 is the minutes. Schema:
    -- A1: {minute: int}
    -- A2 is the logs. Schema:
    -- A2: {minute: int,name: chararray}
    -- These schemas should change to fit your needs.

    B = FOREACH A1 GENERATE minute,
                            FLATTEN(myudf.expand(minute, minute+10)) AS matchto ;
    -- B is in the form:
    -- 1 1
    -- 1 2
    -- ....
    -- 2 2
    -- 2 3
    -- ....
    -- 100 100
    -- 100 101
    -- etc.

    -- Now we join on the minute in the second column of B with the
    -- minute in the log, then it is just grouping by the minute in
    -- the first column and name and counting
    C = JOIN B BY matchto, A2 BY minute ;
    D = FOREACH (GROUP C BY (B::minute, name))
                GENERATE FLATTEN(group), COUNT(C) as count ;

I am a little worried about the speed for large logs, but it should work. Let me know if you need me to explain anything.
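To check the logic without a cluster, the expand, join, and group steps can be mimicked in plain Python. This is only a sketch of the same idea, not the Pig execution: `sliding_counts` is a made-up name, the logs are assumed to be `(minute, name)` pairs as in the A2 schema, and the nested loop stands in for the JOIN.

```python
from collections import Counter

def expand(start, end):
    # Same idea as the Jython UDF: one 1-tuple per minute in [start, end).
    return [(x,) for x in range(start, end)]

def sliding_counts(window_starts, logs, width=10):
    """Count log rows per (window_start, name) for windows [start, start + width)."""
    # B: (minute, matchto) pairs, like FLATTEN(expand(minute, minute + width)).
    b = [(start, m) for start in window_starts
         for (m,) in expand(start, start + width)]
    # C: join B.matchto with the log minute.
    # D: group by (B.minute, name) and count.
    counts = Counter()
    for start, matchto in b:
        for minute, name in logs:
            if matchto == minute:
                counts[(start, name)] += 1
    return counts

logs = [(840, "joe"), (841, "mary"), (842, "joe")]
counts = sliding_counts(range(840, 843), logs)
for key in sorted(counts):
    print(key, counts[key])
```

With the sample rows from the question, the window starting at 840 counts joe twice and the window starting at 841 counts joe once, which matches the output the question asks for. Each log row is matched against up to `width` window starts, which is where the concern about speed on large logs comes from.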
