How to write a MapReduce streaming job for WARC files in Python

I am trying to write a MapReduce task for WARC files using the warc library for Python. The following code works for me, but I need it to work as a Hadoop map task.

import warc
f = warc.open("test.warc.gz")
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

I want this code to read streaming input from warc files, i.e.

zcat test.warc.gz | warc_reader.py

How can I change this code to read streaming input? Thanks.

1 answer

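warc.open() is shorthand for warc.WARCFile(), and warc.WARCFile() accepts a fileobj argument; sys.stdin is exactly such a file object. So you only need something like this: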

import sys
import warc

# Hand stdin to WARCFile directly, since fileobj is its keyword argument.
f = warc.WARCFile(fileobj=sys.stdin)
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

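Things get harder under Hadoop streaming when the input file is a .gz, because Hadoop replaces every \r\n in the WARC file with \n, which breaks the WARC format (see the related question: hadoop converting \r\n to \n and breaking ARC format). Since the warc package uses the regular expression "WARC/(\d+.\d+)\r\n" to match the version line (requiring exactly \r\n), you will probably get this error: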

IOError: Bad version line: 'WARC/1.0\n'
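
To see why the version line is rejected, the pattern quoted above can be tried in isolation. This is only an illustration: the two sample strings are made up, and how the warc package applies the pattern internally is not shown here.

```python
import re

# Pattern as quoted in the answer; it requires a literal \r\n after the version.
VERSION_RE = re.compile(r"WARC/(\d+.\d+)\r\n")

original = "WARC/1.0\r\n"   # version line as written in a WARC file
rewritten = "WARC/1.0\n"    # the same line once \r\n has been replaced by \n

print(VERSION_RE.match(original) is not None)   # True
print(VERSION_RE.match(rewritten) is not None)  # False
```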

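So you can either modify PipeMapper.java, as recommended in the related question on Hadoop converting \r\n to \n, or write your own parsing script that reads the WARC file line by line.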
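
A minimal sketch of a line-by-line parser, assuming only the record headers are needed. It does not use the warc package; iter_warc_headers and run_mapper are hypothetical names introduced for this example. It accepts both \r\n and \n line endings, and it ignores Content-Length entirely, so a payload line that happens to start with "WARC/" would be mistaken for a new record.

```python
import sys

def iter_warc_headers(lines):
    """Yield one dict of headers per WARC record, scanning line by line."""
    headers = None  # None while we are outside a header block
    for line in lines:
        line = line.rstrip("\r\n")  # tolerate both \r\n and \n endings
        if line.startswith("WARC/"):
            headers = {}  # version line: a new record starts here
        elif headers is not None:
            if line == "":
                yield headers  # a blank line closes the header block
                headers = None
            elif ":" in line:
                name, value = line.split(":", 1)
                headers[name.strip()] = value.strip()

def run_mapper(stream=None, out=None):
    """Write 'URI<TAB>Content-Length' for every record read from stream."""
    stream = sys.stdin if stream is None else stream
    out = sys.stdout if out is None else out
    for h in iter_warc_headers(stream):
        uri = h.get("WARC-Target-URI", "-")
        length = h.get("Content-Length", "-")
        out.write("%s\t%s\n" % (uri, length))
```

To use it as warc_reader.py, add a call to run_mapper() at the bottom of the file; it then reads from standard input, as in the zcat pipeline from the question.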

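Incidentally, simply changing warc.py to match \n instead of \r\n in the headers will not work, because it reads exactly Content-Length bytes of content and then expects two empty lines. What Hadoop does changes the length of the content, so it no longer matches Content-Length, which causes another error like: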

IOError: Expected '\n', found 'abc\n'
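
The length mismatch is easy to reproduce on a small payload. The bytes below are a made-up example, not taken from a real WARC file; the point is only that every \r\n turned into \n shortens the content by one byte relative to the declared Content-Length.

```python
# Hypothetical record payload: an HTTP response with \r\n line endings.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\nabc"

declared_length = len(payload)               # what Content-Length would state
rewritten = payload.replace(b"\r\n", b"\n")  # what arrives after the rewrite
lost = declared_length - len(rewritten)      # one byte per \r\n replaced

print("lost %d bytes" % lost)
```

A reader that trusts Content-Length would therefore read past the end of the payload and into the next record.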

Source: https://habr.com/ru/post/1523270/

