Pig 0.11.1 - Counting groups in a time range

I have a dataset, A, which has a timestamp, visitor, and URL:

 (2012-07-21T14:00:00.000Z, joe, hxxp:///www.aaa.com)
 (2012-07-21T14:01:00.000Z, mary, hxxp://www.bbb.com)
 (2012-07-21T14:02:00.000Z, joe, hxxp:///www.aaa.com)

I want to count the number of visits per user per URL within a time window of, say, 10 minutes, but as a rolling window that advances one minute at a time. The output would be:

 (2012-07-21T14:00 to 2012-07-21T14:10, joe, hxxp://www.aaa.com, 2)
 (2012-07-21T14:01 to 2012-07-21T14:11, joe, hxxp://www.aaa.com, 1)

To simplify the arithmetic, I convert the timestamp to the minute of the day, like this:

 (840, joe, hxxp://www.aaa.com) /* 840 = 14 hrs x 60 + 00 mins */
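The conversion can be sketched in plain Python, for example as a preprocessing step or as the body of a Jython UDF. The name `minute_of_day` is a hypothetical helper, and the sketch assumes every timestamp has the fixed `YYYY-MM-DDTHH:MM:SS.mmmZ` layout shown in the sample data.

```python
def minute_of_day(ts):
    """Return the minutes since midnight for a timestamp string.

    Assumes the fixed layout 'YYYY-MM-DDTHH:MM:SS.mmmZ' from the sample data.
    """
    time_part = ts.split('T')[1]               # e.g. '14:00:00.000Z'
    hours, minutes = time_part.split(':')[:2]  # seconds are dropped
    return int(hours) * 60 + int(minutes)
```

Keep in mind that minute-of-day values wrap at midnight, so under this scheme a window that starts at 23:55 would not pick up visits made after 00:00.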

To iterate over A with a moving time window, I create a dataset B containing the minutes of the day:

 (0)
 (1)
 (2)
 ...
 (1440)

Ideally, I want to do something like:

 A = load 'dataset1' AS (ts, visitor, uri);
 B = load 'dataset2' as (minute);
 foreach B {
     C = filter A by ts > minute AND ts < minute + 10;
     D = GROUP C BY (visitor, uri);
     foreach D GENERATE group, COUNT(C) as mycnt;
 }
 DUMP B;

I know that GROUP is not allowed inside the FOREACH loop, but is there a workaround to achieve the same result?

Thanks!

2 answers

Maybe you can do something like this?

NOTE: This depends on the minutes you generate for the logs being integers. If they are not, you can round to the nearest minute.

myudf.py

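 #!/usr/bin/python

 @outputSchema('expanded: {(num:int)}')
 def expand(start, end):
     # Each element must be a 1-tuple, (x,), so that Pig sees a bag of tuples;
     # a bare (x) is just the integer x.
     return [ (x,) for x in range(start, end) ]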

myscript.pig

 register 'myudf.py' using jython as myudf ;

 -- A1 is the minutes. Schema:
 -- A1: {minute: int}
 -- A2 is the logs. Schema:
 -- A2: {minute: int,name: chararray}
 -- These schemas should change to fit your needs.

 B = FOREACH A1 GENERATE minute,
                         FLATTEN(myudf.expand(minute, minute+10)) AS matchto ;
 -- B is in the form:
 -- 1 1
 -- 1 2
 -- ....
 -- 2 2
 -- 2 3
 -- ....
 -- 100 100
 -- 100 101
 -- etc.

 -- Now we join on the minute in the second column of B with the
 -- minute in the log, then it is just grouping by the minute in
 -- the first column and name and counting
 C = JOIN B BY matchto, A2 BY minute ;
 D = FOREACH (GROUP C BY (B::minute, name))
             GENERATE FLATTEN(group), COUNT(C) as count ;
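To check the logic without a cluster, the same expand, join, and group steps can be mimicked in plain Python. This is only a sketch of the idea, not Pig code: `rolling_counts` and the sample `logs` are illustrative names built from the question's data, and each window is taken as [start, start + 10), which is what `range(start, end)` in the UDF produces.

```python
from collections import Counter, defaultdict

def expand(start, end):
    # Same idea as the UDF: one 1-tuple per minute in [start, end).
    return [(x,) for x in range(start, end)]

def rolling_counts(window_starts, logs, width=10):
    """Count log rows per (window start, name) for windows [start, start + width)."""
    # Index the logs by minute so the "join" is a dictionary lookup.
    names_by_minute = defaultdict(list)
    for minute, name in logs:
        names_by_minute[minute].append(name)

    counts = Counter()
    for start in window_starts:
        # B: every (start, matchto) pair, like FLATTEN(expand(...)).
        for (matchto,) in expand(start, start + width):
            # C: join on matchto == log minute; D: group and count.
            for name in names_by_minute.get(matchto, []):
                counts[(start, name)] += 1
    return dict(counts)

# Minute-of-day values for the three sample rows in the question.
logs = [(840, 'joe'), (841, 'mary'), (842, 'joe')]
result = rolling_counts(range(840, 843), logs)
```

For the window starting at minute 840, `result` holds a count of 2 for joe, and for the window starting at 841 a count of 1, in line with the output the question asks for. The question's filter uses a strict `ts > minute`, so the range bounds would need adjusting if the start minute itself should be excluded.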

I am a little worried about the speed for large logs, but it should work. Let me know if you need me to explain anything.

 -- ExtractHour comes from the Pig tutorial jar, which must be registered first.
 A = load 'dataSet1' as (ts, visitor, uri);
 houred = FOREACH A GENERATE visitor, org.apache.pig.tutorial.ExtractHour(ts) as hour, uri;
 hour_frequency1 = GROUP houred BY (hour, visitor);

Something like this should help. ExtractHour is a UDF; you can create something similar for the duration you need. Then group by hour and user, and use GENERATE with COUNT to do the counting.
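As a rough illustration of what such a UDF might compute for a 10-minute duration, the function below truncates a minute-of-day value to the start of its fixed-size bucket. `extract_window` is a hypothetical name and is not part of the Pig tutorial. Note that this gives fixed, non-overlapping windows (14:00 to 14:10, 14:10 to 14:20, and so on) rather than a window that slides every minute.

```python
def extract_window(minute, width=10):
    """Return the start minute of the fixed-width bucket containing `minute`."""
    return (minute // width) * width
```

To call it from Pig as a Jython UDF, it would also need an output schema decorator such as `@outputSchema('window:int')`, in the same way as `expand` in the first answer.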

http://pig.apache.org/docs/r0.7.0/tutorial.html


Source: https://habr.com/ru/post/950850/

