PIG how to count the number of lines in an alias

Question

I did something like this to count the number of rows in an alias in Pig:

 logs = LOAD 'log';
 logs_w_one = FOREACH logs GENERATE 1 AS one;
 logs_group = GROUP logs_w_one ALL;
 logs_count = FOREACH logs_group GENERATE SUM(logs_w_one.one);
 DUMP logs_count;

This seems too inefficient. Please enlighten me if there is a better way!

+45

hadoop apache-pig

kee Mar 28 '12 at 3:29

7 answers

Be careful: with COUNT, a tuple in the bag is only counted if its first field is not null. Alternatively, you can use the COUNT_STAR function to count all rows, including those with nulls.
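To see the difference, both functions can be applied to the same grouped relation. This is only a minimal sketch, assuming 'log' is a file readable by the default PigStorage loader; the output field names are made up for illustration.

```pig
-- Assumption: 'log' is loadable with the default loader (PigStorage, tab-delimited).
logs = LOAD 'log';
logs_group = GROUP logs ALL;

-- non_null_first and all_rows are hypothetical field names.
counts = FOREACH logs_group GENERATE
    COUNT(logs)      AS non_null_first,  -- skips tuples whose first field is null
    COUNT_STAR(logs) AS all_rows;        -- counts every tuple in the bag

DUMP counts;
```

If the two numbers differ, some rows have a null first field, and COUNT_STAR is the one that matches the real row count.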

+29

Kevin Mar 28 2018-12-12T00:

Arnon Rotem-Gal-Oz already answered this question some time ago, but I thought that some might like this slightly more concise version.

 LOGS = LOAD 'log';
 LOG_COUNT = FOREACH (GROUP LOGS ALL) GENERATE COUNT(LOGS);

+29

Jerome Serrano Mar 27 '14 at 21:21

Basic counting is done as shown in the other answers and in the Pig documentation:

 logs = LOAD 'log';
 all_logs_in_a_bag = GROUP logs ALL;
 log_count = FOREACH all_logs_in_a_bag GENERATE COUNT(logs);
 DUMP log_count;

You are right that counting is inefficient, even when using Pig's built-in COUNT, because it will use a single reducer. However, I had a revelation today that one way to speed it up would be to reduce the RAM usage of the relation we are counting.

In other words, when counting a relation we do not actually care about the data itself, so let's use as little RAM as possible. You were on the right track with the first iteration of your counting script.

 logs = LOAD 'log';
 ones = FOREACH logs GENERATE 1 AS one:int;
 counter_group = GROUP ones ALL;
 log_count = FOREACH counter_group GENERATE COUNT(ones);
 DUMP log_count;

This will work with much larger relations than the previous script, and should be much faster. The main difference between this and your original script is that we do not need to sum anything.

+4

WattsInABox Jan 13 '16 at 0:24

Use COUNT_STAR:

 LOGS = LOAD 'log';
 LOGS_GROUP = GROUP LOGS ALL;
 LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT_STAR(LOGS);

+2

hello_abhishek Feb 27 '16 at 8:19

Here is an optimized version. All of the solutions above require Pig to read and write the full tuple when counting; the script below only writes 1s.

 DEFINE row_count(inBag, name) RETURNS result {
     X = FOREACH $inBag GENERATE 1;
     $result = FOREACH (GROUP X ALL PARALLEL 1) GENERATE '$name', COUNT(X);
 };

Use it like this:

 xxx = row_count(rows, 'rows_count');

0

Igor Katkov Aug 13 '15 at 20:08

What you want is to count all rows in a relation (a dataset in Pig Latin).

It is very easy with the following steps:

 logs = LOAD 'log'; -- relation called logs, using PigStorage with tab as field delimiter
 logs_grouped = GROUP logs ALL; -- gives a relation with one row, with logs as a bag
 number = FOREACH logs_grouped GENERATE COUNT_STAR(logs); -- gives the number of rows

I have to say that Kevin's point is important: if COUNT is used instead of COUNT_STAR, we only get the number of rows whose first field is not null.

I also like Jerome's one-line syntax. It is more concise, but for didactic purposes I prefer to split it into two statements and add comments.

In general, I prefer:

 numerito = FOREACH (GROUP CARGADOS3 ALL) GENERATE COUNT_STAR(CARGADOS3);

over

 name = GROUP CARGADOS3 ALL;
 number = FOREACH name GENERATE COUNT_STAR(CARGADOS3);

0

Javier Bañez Feb 26 '16 at 9:46

Arnon Rotem-Gal-Oz · Accepted Answer · 2012-03-28 05:02

COUNT is part of Pig; see the manual.

 LOGS = LOAD 'log';
 LOGS_GROUP = GROUP LOGS ALL;
 LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
