I am having problems SUMming a bag of values because of a data type error.
When I upload a CSV file whose lines look like this:
6 574 false 10.1.72.23 2010-05-16 13:56:19 +0930 fbcdn.net static.ak.fbcdn.net 304 text/css 1 /rsrc.php/zPTJC/hash/50l7x7eg.css http pwong
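For context, raw_logs (used below) is produced with the usual piggybank setup, roughly like this. The jar path and the S3 bucket are just placeholders, and I am assuming EXTRACT refers to the piggybank org.apache.pig.piggybank.evaluation.string.EXTRACT UDF:

REGISTER /home/hadoop/lib/pig/piggybank.jar;  -- jar location is a placeholder
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
-- bucket/path is a placeholder; each record is read as one chararray line
raw_logs = LOAD 's3://my-bucket/logs/' USING TextLoader AS (line: chararray);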
Using the following:
logs_base = FOREACH raw_logs GENERATE
    FLATTEN(
        EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
    ) AS (
        account_id: int, bytes: long, cached: chararray, ip: chararray, time: chararray,
        domain: chararray, host: chararray, status: chararray, mime_type: chararray,
        page_view: chararray, path: chararray, protocol: chararray, username: chararray
    );
All fields appear to be loaded in order and with the correct types, as shown by the DESCRIBE command:
grunt> describe logs_base
logs_base: {account_id: int,bytes: long,cached: chararray,ip: chararray,time: chararray,domain: chararray,host: chararray,status: chararray,mime_type: chararray,page_view: chararray,path: chararray,protocol: chararray,username: chararray}
Whenever I execute SUM using:
bytesCount = FOREACH (GROUP logs_base ALL) GENERATE SUM(logs_base.bytes);
and store or dump the results, the MapReduce job fails with this error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing sum in Initial
    at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:87)
    at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:65)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
    at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:79)
    ... 15 more
The line that catches my attention is:
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
This makes me think that the EXTRACT function does not convert the bytes field to the required data type (long).
Is there a way to force the EXTRACT function to convert the fields to the required data types? How can I cast the values without having to run an extra FOREACH over all the records? (The same thing happens when I convert time to a Unix timestamp and try to find the MIN; I would definitely like to find a solution that does not require unnecessary projections.)
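For reference, the only workaround I can think of is to declare the EXTRACT output as chararray and add explicit casts in a second projection, something like the untested sketch below, which is exactly the extra pass I am hoping to avoid:

-- untested: declare everything as chararray (what EXTRACT seems to return at runtime)
logs_raw = FOREACH raw_logs GENERATE
    FLATTEN(
        EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
    ) AS (
        account_id: chararray, bytes: chararray, cached: chararray, ip: chararray, time: chararray,
        domain: chararray, host: chararray, status: chararray, mime_type: chararray,
        page_view: chararray, path: chararray, protocol: chararray, username: chararray
    );
-- the extra projection I would like to avoid: explicit casts to the intended types
logs_typed = FOREACH logs_raw GENERATE
    (int)account_id AS account_id, (long)bytes AS bytes, cached, ip, time, domain, host,
    status, mime_type, page_view, path, protocol, username;
bytesCount = FOREACH (GROUP logs_typed ALL) GENERATE SUM(logs_typed.bytes);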
Any pointers would be appreciated. Many thanks for your help.
Regards, Jorge S.
PS: I am running this on the Amazon Elastic MapReduce service.