Basic counting is done as described in the other answers and in the Pig documentation:
    logs = LOAD 'log';
    all_logs_in_a_bag = GROUP logs ALL;
    log_count = FOREACH all_logs_in_a_bag GENERATE COUNT(logs);
    DUMP log_count;
You are right that counting is inefficient, even with Pig's built-in COUNT, because it uses a single reducer. However, I had a revelation today: one way to speed it up is to reduce the RAM used by the relation we are counting.
In other words, when counting a relation we do not actually care about the data itself, so let's use as little RAM as possible. You were on the right track with the first iteration of your count script.
    logs = LOAD 'log';
    -- Replace every tuple with a constant 1 so the data itself is not carried along.
    ones = FOREACH logs GENERATE 1 AS one:int;
    counter_group = GROUP ones ALL;
    log_count = FOREACH counter_group GENERATE COUNT(ones);
    DUMP log_count;
This will work on much larger relations than the previous script and should be much faster. The main difference between this and your original script is that we do not need to SUM anything.
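For contrast, a SUM-based variant might look like the sketch below. The script from the question is not reproduced here, so its shape is an assumption; the relation and field names are simply reused from the script above. It should produce the same number, but it has to reference the `one` field and add the values up, whereas COUNT only needs the bag.

```pig
logs = LOAD 'log';
-- Same projection as above: keep only a constant 1 per tuple.
ones = FOREACH logs GENERATE 1 AS one:int;
counter_group = GROUP ones ALL;
-- Assumed shape of the SUM approach: add up the ones instead of counting tuples.
log_count = FOREACH counter_group GENERATE SUM(ones.one);
DUMP log_count;
```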
WattsInABox, Jan 13 '16 at 0:24