Reading Nutch on EMR with S3

Hi, I am trying to run Apache Nutch 1.2 on Amazon EMR.
To do this, I set the input directory to an S3 location, but I get the following error:

  Fetcher: java.lang.IllegalArgumentException:
     This file system object (hdfs://ip-11-202-55-144.ec2.internal:9000)
     does not support access to the request path
     's3n://crawlResults2/segments/20110823155002/crawl_fetch'
     You possibly called FileSystem.get(conf) when you should have called
     FileSystem.get(uri, conf) to obtain a file system supporting your path.

I understand the difference between FileSystem.get(uri, conf) and FileSystem.get(conf). If I were writing this code myself, I would call FileSystem.get(uri, conf), but I'm trying to use the existing Nutch code as-is.
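For context, the exception comes from Hadoop checking that a requested path's scheme matches the file system it was asked through (the real check lives in Hadoop's FileSystem.checkPath, which also compares the authority). A minimal standalone sketch of that scheme comparison, with hypothetical names (SchemeCheck, supports) and no Hadoop dependency:

```java
import java.net.URI;

public class SchemeCheck {
    // Simplified illustration of the scheme check Hadoop performs:
    // a path with an explicit scheme must match the file system's scheme.
    static boolean supports(URI fsUri, URI path) {
        String pathScheme = path.getScheme();
        // A scheme-less (relative) path resolves against the default FS.
        if (pathScheme == null) return true;
        return pathScheme.equalsIgnoreCase(fsUri.getScheme());
    }

    public static void main(String[] args) {
        URI hdfs = URI.create("hdfs://ip-11-202-55-144.ec2.internal:9000");
        URI s3Path = URI.create("s3n://crawlResults2/segments/20110823155002/crawl_fetch");
        // false: an hdfs:// file system cannot serve an s3n:// path,
        // which is why Hadoop throws IllegalArgumentException here.
        System.out.println(supports(hdfs, s3Path));
    }
}
```

FileSystem.get(conf) always returns the default file system (here, HDFS), while FileSystem.get(uri, conf) returns one matching the path's scheme.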

I asked about this before and was told I need to change hadoop-site.xml to include the following properties: fs.default.name, fs.s3.awsAccessKeyId, and fs.s3.awsSecretAccessKey. I set these properties in core-site.xml (hadoop-site.xml does not exist in my setup), but it made no difference. Does anyone have any other ideas? Thanks for the help.

1 answer

Try specifying the following in

hadoop-site.xml

(note that fs.default.name takes a file system URI, not a class name):

 <property>
   <name>fs.default.name</name>
   <value>s3n://crawlResults2</value>
 </property>

This will indicate to Nutch that S3 should be used by default.

The properties

fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey

are needed only if your S3 objects require authentication, i.e. the bucket is not publicly readable.
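A complete fragment might look like the following (the bucket name is taken from the error above; the key values are placeholders). One detail worth checking: for s3n:// paths, the credential property names in this era of Hadoop use the fs.s3n. prefix rather than fs.s3.:

```xml
<!-- core-site.xml (or hadoop-site.xml on older layouts) -->
<property>
  <name>fs.default.name</name>
  <value>s3n://crawlResults2</value>
</property>
<!-- Credentials for the s3n:// scheme; placeholder values -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```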


Source: https://habr.com/ru/post/896173/
