Amazon Elastic MapReduce to analyze S3 logs

I use EMR to analyze nginx web logs. To simplify querying, I need to process the logs so that they fall into rows and columns. So I created two tables, rawlog and processedlog, as follows:

    create table rawlog(line string)
      row format delimited fields terminated by '\t' lines terminated by '\n'
      LOCATION 's3://istreamanalytics/logs/';

    CREATE EXTERNAL TABLE processedlog (
      day string,
      hour int,
      playSessionId string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

and added a Ruby script to Hive that performs the conversion. The script looks like this:

    #!/usr/bin/env ruby

    # Map nginx month abbreviations to two-digit month numbers.
    mon = {"Jan" => '01', "Feb" => '02', "Mar" => '03', "Apr" => '04',
           "May" => '05', "Jun" => '06', "Jul" => '07', "Aug" => '08',
           "Sep" => '09', "Oct" => '10', "Nov" => '11', "Dec" => '12'}

    STDIN.each_line do |line|
      # Capture day, month, year, hour and the playSessionId query parameter.
      if line =~ /(\d+)\/(\w+)\/(\d+):(\d+):\d+:\d+ \+\d+\] "GET \/api\?playSessionId=([^&]*)/
        d = "#{$3}-#{mon[$2]}-#{$1}"
        h = $4
        pid = $5
        puts "#{d}\t#{h}\t#{pid}"
      end
    end

Now, when I start the job with the following Hive command:

    from rawlog
    insert overwrite table processedlog
    select transform (line)
    using 'ruby /mnt/var/lib/hive_081/downloaded_resources/hive_transformer.rb'
    as (day String, hour INT, playSessionId String);

I get the following error:

    Total MapReduce jobs = 2
    Launching Job 1 out of 2
    Number of reduce tasks is set to 0 since there no reduce operator
    Starting Job = job_201206061145_0015, Tracking URL = http://domU-12-31-39-0F-86-07.compute-1.internal:9100/jobdetails.jsp?jobid=job_201206061145_0015
    Kill Command = /home/hadoop/.versions/0.20.205/libexec/../bin/hadoop job -Dmapred.job.tracker=10.193.133.241:9001 -kill job_201206061145_0015
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    2012-06-08 09:47:49,644 Stage-1 map = 0%, reduce = 0%
    2012-06-08 09:48:50,267 Stage-1 map = 0%, reduce = 0%
    2012-06-08 09:48:52,278 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_201206061145_0015 with errors
    Error during job, obtaining debugging information...
    Examining task ID: task_201206061145_0015_m_000002 (and more) from job job_201206061145_0015
    Exception in thread "Thread-41" java.lang.RuntimeException: Error while reading from task log url
        at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
        at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
        at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
        at java.lang.Thread.run(Thread.java:662)
    Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://10.254.139.143:9103/tasklogtaskid=attempt_201206061145_0015_m_000000_2&start=-8193
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
        at java.net.URL.openStream(URL.java:1010)
        at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
        ... 3 more
    Counters:
    FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
    MapReduce Jobs Launched:
    Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
    Total MapReduce CPU Time Spent: 0 msec

Can someone tell me what happened?

3 answers

EMR is a very general-purpose tool for log analysis.

Why not use a more specialized technology, e.g. Sumo?

At least with Sumo, this kind of processing would be a lot easier.


The one thing I would do is make sure the script works correctly before running it on EMR. Using EMR to test a script should be the last step in the process. Also, a failure like this is usually a configuration issue.

A basic search turned up:

http://entxtech.blogspot.com/2010/10/how-to-unit-test-apache-hive-scripts.html
http://jairam.me/2011/09/08/hive-on-amazon-emr/
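
As a rough illustration of checking the transformer locally before involving EMR, the parsing logic can be wrapped in a method and run against one sample line. This is only a sketch: the method name `transform_line`, the constants, and the sample log line (assumed to be in nginx's default "combined" format) are hypothetical and not part of the original script.

```ruby
#!/usr/bin/env ruby
# Hypothetical local harness: same parsing idea as the Hive transformer,
# wrapped in a method so it can be exercised without a cluster.

MONTHS = {
  "Jan" => "01", "Feb" => "02", "Mar" => "03", "Apr" => "04",
  "May" => "05", "Jun" => "06", "Jul" => "07", "Aug" => "08",
  "Sep" => "09", "Oct" => "10", "Nov" => "11", "Dec" => "12"
}.freeze

# Captures: day, month name, year, hour, playSessionId value.
LOG_PATTERN = %r{(\d+)/(\w+)/(\d+):(\d+):\d+:\d+ \+\d+\] "GET /api\?playSessionId=([^&]*)}

# Returns "YYYY-MM-DD\thour\tplaySessionId", or nil if the line does not match.
def transform_line(line)
  m = LOG_PATTERN.match(line)
  return nil unless m

  month = MONTHS[m[2]]
  return nil unless month

  "#{m[3]}-#{month}-#{m[1]}\t#{m[4]}\t#{m[5]}"
end

# Assumed sample line; real logs may use a different log_format.
SAMPLE = '10.0.0.1 - - [08/Jun/2012:09:47:49 +0000] ' \
         '"GET /api?playSessionId=abc123&x=1 HTTP/1.1" 200 512 "-" "curl/7.0"'

puts transform_line(SAMPLE) if __FILE__ == $PROGRAM_NAME
```

If the sample line does not produce a tab-separated row here, the script will not produce rows on the cluster either. Piping a few real log lines through the actual script on a local machine (for example `head access.log | ruby hive_transformer.rb`, where `access.log` stands for any local copy of the logs) is a similar quick check.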


More detailed information about the error can be found in the task log files. In your case, see: http://10.254.139.143:9103/tasklogtaskid=attempt_201206061145_0015_m_000000_2&start=-8193


Source: https://habr.com/ru/post/917612/

