Copying a file from s3:// to the local file system

I am new to AWS. I created a cluster and ssh'ed into the master node. When I try to copy files from s3://my-bucket-name/ to the local folder file://home/hadoop in Pig using:

cp s3://my-bucket-name/path/to/file file://home/hadoop 

I get an error:

2013-06-08 18:59:00,267 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).

I can't even list my S3 bucket. I set the AWS_ACCESS_KEY and AWS_SECRET_KEY environment variables without success. I also could not find Pig's configuration file to set the appropriate fields.

Any help please?

Edit: I tried loading the file into Pig using the full s3n:// URI

 grunt> raw_logs = LOAD 's3://XXXXX/input/access_log_1' USING TextLoader AS (line:chararray);
 grunt> illustrate raw_logs;

and I get the following error:

 2013-06-08 19:28:33,342 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
 2013-06-08 19:28:33,404 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
 2013-06-08 19:28:33,404 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
 2013-06-08 19:28:33,405 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
 2013-06-08 19:28:33,405 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
 2013-06-08 19:28:33,429 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
 2013-06-08 19:28:33,430 [main] ERROR org.apache.pig.pen.ExampleGenerator - Error reading data. Internal error creating job configuration.
 java.lang.RuntimeException: Internal error creating job configuration.
     at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:160)
     at org.apache.pig.PigServer.getExamples(PigServer.java:1244)
     at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:722)
     at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:591)
     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:306)
     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java)
     at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
     at org.apache.pig.Main.run(Main.java:500)
     at org.apache.pig.Main.main(Main.java:114)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:597)
     at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
 2013-06-08 19:28:33,432 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: IOException thrown. Exception: Internal error creating job configuration. Details at logfile: /home/hadoop/pig_1370719069857.log

+4
3 answers

First, you should use the s3n protocol (unless the files were stored on S3 using the s3 protocol): s3 is used for block storage (i.e., it is similar to HDFS, only backed by S3), while s3n is the native S3 filesystem (i.e., you get there what you see).

You can use distcp or load directly from s3n in Pig. You can provide the access key and secret key in hadoop-site.xml, as the exception you got indicates (see here for more information: http://wiki.apache.org/hadoop/AmazonS3), or you can add them to the URI:

 raw_logs = LOAD 's3n://access:secret@XXXXX/input/access_log_1' USING TextLoader AS (line:chararray);

Make sure your secret does not contain backslashes - otherwise it will not work.
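
For the configuration-file route mentioned above, here is a minimal sketch of the entries (assuming the s3n:// scheme; the plain s3:// scheme uses the fs.s3.* property names from your error message, and the key values here are placeholders):

 <!-- in hadoop-site.xml / core-site.xml on the cluster (sketch) -->
 <property>
   <name>fs.s3n.awsAccessKeyId</name>
   <value>YOUR_ACCESS_KEY_ID</value>
 </property>
 <property>
   <name>fs.s3n.awsSecretAccessKey</name>
   <value>YOUR_SECRET_ACCESS_KEY</value>
 </property>

Once the keys are visible to Hadoop, the distcp route is roughly hadoop distcp s3n://my-bucket-name/path/to/file hdfs:///tmp/ into HDFS, followed by a normal hadoop fs -copyToLocal if you really need the file on the local disk.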

+7
source

The cp in

 cp s3://my-bucket-name/path/to/file file://home/hadoop 

does not know about S3.

You can use:

 s3cmd get s3://some-s3-bucket/some-s3-folder/local_file.ext ~/local_dir/ 

I don't know why s3cmd cp ... does not do what is needed, but s3cmd get ... works. And man s3cmd has:

 s3cmd get s3://BUCKET/OBJECT LOCAL_FILE
     Get file from bucket
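
If s3cmd on the master node does not have your keys yet, it can be set up interactively first. A rough sketch using the paths from the question (the tool prompts for the access key and secret key and writes them to ~/.s3cfg):

 s3cmd --configure
 s3cmd ls s3://my-bucket-name/
 s3cmd get s3://my-bucket-name/path/to/file /home/hadoop/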
+3

I experienced this same error and finally arrived at a solution. However, I changed two things at once, so I'm not sure whether both of them are necessary (one of them certainly is).

First, I made sure that my S3 data and my EMR cluster were in the same region. When I had this problem, my data was in US East and EMR was in US West. I standardized on US East (Virginia), aka us-east-1, aka US Standard, aka DEFAULT, aka N. Virginia. This may not have been required, but it did not hurt.

Second, when I was getting the error, I had started pig by following the steps in one of the videos, which gave it the "-x local" option. It turns out that "-x local" seems guaranteed to prevent access to S3 (see below).

The solution is to start pig with no parameters.
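
With pig started in its default (mapreduce) mode, the grunt shell can reach S3, so the copy from the question can be retried there. A rough sketch using the question's paths; grunt's copyToLocal is the command meant for a local destination, and cp with an explicit file:/// URI may also work depending on the Pig version:

 $ pig
 grunt> copyToLocal s3://my-bucket-name/path/to/file /home/hadoop/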

Hope this helps.

Gilles


 hadoop@domU-12-31-39-09-24-66:~$ pig -x local
 2013-07-03 00:27:15,321 [main] INFO  org.apache.pig.Main - Apache Pig version 0.11.1-amzn (rexported) compiled Jun 24 2013, 18:37:44
 2013-07-03 00:27:15,321 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1372811235317.log
 2013-07-03 00:27:15,379 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
 2013-07-03 00:27:15,793 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
 grunt> ls s3://xxxxxx.xx.rawdata
 2013-07-03 00:27:23,463 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively). Details at logfile: /home/hadoop/pig_1372811235317.log
 grunt> quit

 hadoop@domU-12-31-39-09-24-66:~$ pig
 2013-07-03 00:28:04,769 [main] INFO  org.apache.pig.Main - Apache Pig version 0.11.1-amzn (rexported) compiled Jun 24 2013, 18:37:44
 2013-07-03 00:28:04,771 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1372811284764.log
 2013-07-03 00:28:04,873 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
 2013-07-03 00:28:05,639 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.210.43.148:9000
 2013-07-03 00:28:08,765 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.210.43.148:9001
 grunt> ls s3://xxxxxx.xx.rawdata
 s3://xxxxxx.xx.rawdata/rawdata<r 1>     19813
 s3://xxxxxx.xx.rawdata/rawdata.csv<r 1> 19813
 grunt>
+1
