Hadoop Pig: counting values

I am learning how to use Hadoop Pig now.

If I have an input file:

    a,b,c,true
    s,c,v,false
    a,s,b,true
    ...

The last field is the one I need to count. I want to know how many "true" and "false" values are in this file.

I'm trying to:

    records = LOAD 'test/input.csv' USING PigStorage(',');
    boolean = foreach records generate $3;
    groups = group boolean all;

Now I'm stuck. I want to use:

    count = foreach groups generate count('true');

to get the number of "true" values, but I always get this error:

    2013-08-07 16:32:36,677 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve count using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    Details at logfile: /etc/pig/pig_1375911119028.log

Can someone tell me where the problem is?

1 answer

Two things. First, count should be COUNT. In Pig, function names are case-sensitive, and the built-in functions must be called in all caps.

Second, COUNT counts the number of tuples in a bag; it does not count occurrences of a particular value. Therefore, you should group by the true/false field and then COUNT each group:

    boolean = FOREACH records GENERATE $3 AS trueORfalse ;
    groups = GROUP boolean BY trueORfalse ;
    counts = FOREACH groups GENERATE group AS trueORfalse, COUNT(boolean) ;

So, now the DUMP output for counts will look something like this:

    (true,2)
    (false,1)
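For reference, the group-then-count steps above can be put together as one script, from LOAD to DUMP. This is only a sketch under a few assumptions that are not in the original post: the file has exactly four comma-separated fields, all loaded as chararray, and the field names `f1`, `f2`, `f3`, `flag` and the aliases `flags`, `by_flag` and `n` are made up for illustration. It uses `flags` rather than `boolean` as the alias because `boolean` is also a type name in newer Pig versions and may not be accepted as an alias there.

```pig
-- Assumed schema: four chararray fields; f1..f3 and flag are hypothetical names.
records = LOAD 'test/input.csv' USING PigStorage(',')
          AS (f1:chararray, f2:chararray, f3:chararray, flag:chararray);

-- Keep only the true/false column.
flags = FOREACH records GENERATE flag;

-- One group per distinct value of flag ('true', 'false', ...).
by_flag = GROUP flags BY flag;

-- Each group holds a bag named after the grouped relation (flags);
-- COUNT returns the number of tuples in that bag.
counts = FOREACH by_flag GENERATE group AS flag, COUNT(flags) AS n;

DUMP counts;
```

Declaring the schema at load time lets the later statements refer to `flag` by name instead of by position (`$3`).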

If you need the true and false counts in their own relations, you can FILTER the output of counts. However, it would probably be better to SPLIT boolean and then do two separate counts:

    boolean = FOREACH records GENERATE $3 AS trueORfalse ;
    SPLIT boolean INTO alltrue IF trueORfalse == 'true', allfalse IF trueORfalse == 'false' ;
    tcount = FOREACH (GROUP alltrue ALL) GENERATE COUNT(alltrue) ;
    fcount = FOREACH (GROUP allfalse ALL) GENERATE COUNT(allfalse) ;
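The FILTER alternative mentioned above is not spelled out in the answer, so here is a rough, self-contained sketch of what it might look like. The aliases `flags`, `n`, `true_count` and `false_count` are illustrative names, not part of the original answer.

```pig
records = LOAD 'test/input.csv' USING PigStorage(',');

-- Project the fourth field and name it.
flags = FOREACH records GENERATE $3 AS trueORfalse;

-- Group by value and count each group in one pass.
counts = FOREACH (GROUP flags BY trueORfalse)
         GENERATE group AS trueORfalse, COUNT(flags) AS n;

-- Pull each count into its own single-row relation.
true_count  = FILTER counts BY trueORfalse == 'true';
false_count = FILTER counts BY trueORfalse == 'false';

DUMP true_count;
DUMP false_count;
```

With this variant the filters run over the small `counts` relation rather than over the full input, and a value that never occurs in the file produces an empty relation.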

Source: https://habr.com/ru/post/951263/

