I need to view and download a subset of the Common Crawl data set. This page mentions where the data is located. How can I view and possibly download the Common Crawl data hosted at s3://aws-publicdatasets/common-crawl/crawl-002/?
As per the update, downloading the Common Crawl corpus has always been free, and you can use HTTP instead of S3. S3 also allows you to access the data with anonymous credentials.
If you want to download via HTTP, get one of the file locations, for example:
common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz
and then prepend https://aws-publicdatasets.s3.amazonaws.com/, resulting in the link:
https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz
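As a quick sketch, joining a crawl file path onto the bucket prefix is plain string concatenation; the prefix and the example path are the ones shown above, and the helper name `to_http_url` is mine:

```python
# Build a downloadable HTTP URL from a Common Crawl file path
# by prepending the public S3 bucket prefix used in this answer.
PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/"

def to_http_url(cc_path: str) -> str:
    """Prepend the bucket prefix to a crawl-data file path."""
    return PREFIX + cc_path.lstrip("/")

url = to_http_url(
    "common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/"
    "warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz"
)
print(url)
```

The resulting URL can then be fetched with any HTTP client (curl, wget, a browser) with no AWS credentials.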
For a list of all such files, refer to warc.paths.gz (or the equivalent for WET or WAT files) in the more recent crawls, or list the files with anonymous credentials using s3cmd or a similar tool.
This link works and lets you download the data without any S3 credentials or tooling.
Access to the Common Crawl public data set is discussed at: http://blog.commoncrawl.org/2015/05/april-2015-crawl-archive-available/
A useful way I've found to get some trial data is to use the index over the archive: http://index.commoncrawl.org/CC-MAIN-2015-18
If you query it for, say, "www.cwi.nl", you will get JSON records describing the segments that contain files from this domain:
{
  "urlkey": "nl,cwi)/",
  "timestamp": "20150505031358",
  "status": "200",
  "url": "http://www.cwi.nl/",
  "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1430455222810.45/warc/CC-MAIN-20150501044022-00044-ip-10-235-10-82.ec2.internal.warc.gz",
  "length": "5881",
  "mime": "text/html",
  "offset": "364108412",
  "digest": "DLQQ4NMJMRRZFGXSXGSFPRO3YJBKVHN5"
}
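Querying that index programmatically can be sketched as below; the "-index" API endpoint suffix and the url/output query parameters are assumptions based on the CDX-style index server, not something the answer states:

```python
import urllib.parse

# Assumed CDX-style API endpoint for the index linked above
# (the "-index" suffix and the parameter names are assumptions).
INDEX_API = "http://index.commoncrawl.org/CC-MAIN-2015-18-index"

def index_query_url(url_pattern: str) -> str:
    """Build a query URL asking the index for JSON records
    matching the given domain or URL pattern."""
    params = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
    return INDEX_API + "?" + params

print(index_query_url("www.cwi.nl"))
```

Fetching that URL returns one JSON record per line, in the shape shown above.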
Prepend the S3 prefix to the filename field, and you get the URL of a data file you can download and use as sample data: https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-18/segments/1430455222810.45/warc/CC-MAIN-20150501044022-00044-ip-10-235-10-82.ec2.internal.warc.gz
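Rather than downloading the whole file, the offset and length fields of an index record let you fetch just that one record with an HTTP Range request. This is a minimal sketch, assuming each WARC record is stored as its own gzip member (which is what makes independent decompression possible); the helper names are mine:

```python
import gzip
import urllib.request

PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/"

def record_range(offset: int, length: int) -> str:
    """HTTP Range header value for one record (byte range is inclusive)."""
    return "bytes={}-{}".format(offset, offset + length - 1)

def fetch_record(index_entry: dict) -> bytes:
    """Download a single WARC record using an index record's
    filename, offset, and length fields."""
    req = urllib.request.Request(
        PREFIX + index_entry["filename"],
        headers={"Range": record_range(int(index_entry["offset"]),
                                       int(index_entry["length"]))},
    )
    with urllib.request.urlopen(req) as resp:
        # Assumption: the record is an independent gzip member,
        # so it can be decompressed on its own.
        return gzip.decompress(resp.read())
```

For the example record above, fetch_record would pull roughly 6 KB instead of the full (near-gigabyte) WARC file.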
Good luck
To access Common Crawl data, you run a MapReduce job against it, and since the corpus lives on S3, you can do this by starting a Hadoop cluster using Amazon's EC2 service. This involves creating a custom job that uses our custom InputFormat class to pull data out of the individual ARC files in our S3 bucket.
Source: http://commoncrawl.org/the-data/
Getting started: http://commoncrawl.org/the-data/get-started/
Source: https://habr.com/ru/post/945413/