How to write a MapReduce streaming job for WARC files in Python

Question

I am trying to write a MapReduce task for WARC files using the warc library for Python. The following code works for me, but I need it to work as a Hadoop streaming job.

import warc
f = warc.open("test.warc.gz")
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

I want this code to read streaming input from warc files, i.e.

zcat test.warc.gz | warc_reader.py

Please tell me how I can change this code to read streaming input. Thanks.

python mapreduce hadoop hadoop-streaming warc

zahid adeel Jan 23 '14 at 6:53

1 answer

CKLu · Answer 1 · 2019-09-05T06:53:11+0000

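warc.open() is shorthand for warc.WARCFile(), and warc.WARCFile() accepts a fileobj argument; sys.stdin is exactly such a file object. So all you need is something like this: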

import sys
import warc

# warc.open() expects a filename, so hand the stream to WARCFile directly
f = warc.WARCFile(fileobj=sys.stdin)
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

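Things get harder under Hadoop streaming when the input file is a .gz, because Hadoop replaces every \r\n in the WARC file with \n, which breaks the WARC format (see the related question: hadoop converting \r\n to \n and breaking ARC format). Since the warc package matches the version line with the regular expression "WARC/(\d+.\d+)\r\n" (exactly \r\n), you will probably get this error: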

IOError: Bad version line: 'WARC/1.0\n'

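So you either have to modify PipeMapper.java, as recommended in the related question, or write your own parsing script that reads the WARC file line by line.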
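The second option, a line-by-line parser, could look roughly like the sketch below. It is not part of the warc library: iter_warc_headers and main are hypothetical names. It assumes that a line consisting only of WARC/<version> starts a new record (which can misfire if a record body happens to contain such a line), it does not handle folded header lines, and it skips bodies by scanning for the next version line instead of trusting Content-Length. Everything is kept as bytes so that binary record bodies do not cause decoding errors.

```python
import re
import sys

# Version line with the line ending already stripped, so it matches
# records terminated by either "\r\n" or "\n".
VERSION_RE = re.compile(br"^WARC/\d+\.\d+$")


def iter_warc_headers(lines):
    """Yield a dict of WARC header fields (bytes -> bytes) for each record."""
    headers = None  # None while we are outside a WARC header block
    for raw in lines:
        line = raw.rstrip(b"\r\n")
        if headers is None:
            # Skipping body lines until the next version line.
            if VERSION_RE.match(line):
                headers = {}
        elif line == b"":
            # A blank line ends the header block.
            yield headers
            headers = None
        else:
            name, sep, value = line.partition(b":")
            if sep:
                headers[name.strip()] = value.strip()
    if headers:
        # Input ended in the middle of a header block.
        yield headers


def main():
    # Use the binary streams where they exist (Python 3).
    stdin = getattr(sys.stdin, "buffer", sys.stdin)
    stdout = getattr(sys.stdout, "buffer", sys.stdout)
    for headers in iter_warc_headers(stdin):
        uri = headers.get(b"WARC-Target-URI", b"")
        length = headers.get(b"Content-Length", b"")
        # Hadoop streaming treats the text before the first tab as the key.
        stdout.write(uri + b"\t" + length + b"\n")


if __name__ == "__main__":
    main()
```

Saved as warc_reader.py, this can be tried locally with zcat test.warc.gz | python warc_reader.py before it is used as the streaming mapper. Note that the Content-Length it prints is the value declared in the header, not the size of the body as Hadoop delivered it.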

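By the way, simply changing warc.py to match \n instead of \r\n in the headers will not work, because it reads exactly Content-Length bytes of content and then expects two empty lines. What Hadoop does changes the length of the content so that it no longer matches Content-Length, which causes another error such as: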

IOError: Expected '\n', found 'abc\n'
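To see why the declared length stops matching, here is a small illustration. The payload is made up and crlf_to_lf is a hypothetical helper that only mimics the \r\n to \n rewrite described above; it is not Hadoop code.

```python
def crlf_to_lf(data):
    """Mimic the line-ending rewrite: every CRLF becomes a bare LF."""
    return data.replace(b"\r\n", b"\n")


# A made-up record body: an HTTP response with CRLF line endings.
body = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"

declared = len(body)             # what Content-Length would say
actual = len(crlf_to_lf(body))   # what the reader actually receives

# Each rewritten CRLF makes the body one byte shorter than declared,
# so a reader that consumes `declared` bytes runs into the next record.
print("declared=%d actual=%d lost=%d" % (declared, actual, declared - actual))
```

A reader that trusts Content-Length therefore consumes bytes that belong to whatever follows the body, and the check for the trailing empty lines fails.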
