I have a dataset, A, with three fields: timestamp, visitor, and URL:
(2012-07-21T14:00:00.000Z, joe, hxxp:///www.aaa.com)
(2012-07-21T14:01:00.000Z, mary, hxxp://www.bbb.com)
(2012-07-21T14:02:00.000Z, joe, hxxp:///www.aaa.com)
I want to count the number of visits per user per URL within a time window of, say, 10 minutes, but as a sliding window that advances one minute at a time. The output would be:
(2012-07-21T14:00 to 2012-07-21T14:10, joe, hxxp:
To simplify the arithmetic, I convert the timestamp to the minute of the day:
(840, joe, hxxp:
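For reference, the conversion can be sketched in Pig roughly as follows. This is only a sketch: it assumes Pig 0.11 or later, where the DateTime builtins ToDate, GetHour and GetMinute are available, and the input path 'raw_visits' is a hypothetical name for the data before conversion.

```pig
-- Sketch only: 'raw_visits' is a hypothetical path holding the ISO-8601 records.
raw = LOAD 'raw_visits' AS (ts:chararray, visitor:chararray, uri:chararray);

-- Minute of the day = hour * 60 + minute, so 14:00 becomes 840.
A = FOREACH raw GENERATE
        GetHour(ToDate(ts)) * 60 + GetMinute(ToDate(ts)) AS ts,
        visitor,
        uri;
```

Note that the minute of the day discards the date, so this only works while all records fall on the same day.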
To iterate over A with a moving time window, I create a dataset, B, containing the minutes of the day:
(0) (1) (2) . . . . (1440)
Ideally, I want to do something like:
A = load 'dataset1' AS (ts, visitor, uri);
B = load 'dataset2' as (minute);
foreach B {
    C = filter A by ts > minute AND ts < minute + 10;
    D = GROUP C BY (visitor, uri);
    foreach D GENERATE group, count(C) as mycnt;
}
DUMP B;
I know that GROUP is not allowed inside the FOREACH loop, but is there a workaround to achieve the same result?
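One possible direction, sketched under assumptions rather than offered as a definitive solution: pair every visit with every window start using CROSS, keep only the pairs where the visit falls inside the window, and then do a single top-level GROUP. The relation names AB, InWindow, ByWindow and Counts are made up for illustration, and the lower bound uses >= (instead of the strict > above) on the assumption that a visit at 14:00 should count in the window starting at 14:00.

```pig
A = LOAD 'dataset1' AS (ts:int, visitor:chararray, uri:chararray);
B = LOAD 'dataset2' AS (minute:int);

-- Pair every visit with every candidate window start.
AB = CROSS A, B;

-- Keep only pairs where the visit falls in [minute, minute + 10).
-- Assumption: '>=' on the lower bound, so the window start is inclusive.
InWindow = FILTER AB BY A::ts >= B::minute AND A::ts < B::minute + 10;

-- Group once at the top level by window start, visitor and URL.
ByWindow = GROUP InWindow BY (B::minute, A::visitor, A::uri);

Counts = FOREACH ByWindow GENERATE
             FLATTEN(group) AS (window_start, visitor, uri),
             COUNT(InWindow) AS mycnt;

DUMP Counts;
```

The CROSS multiplies the size of A by the number of rows in B, so this is likely to be practical only when A is modest or B is kept small.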
Thanks!