How to read bz2 (bzip2) compressed Wikipedia dumps into a streaming XML record reader for Hadoop MapReduce

I am working on using Hadoop MapReduce to do research on the Wikipedia data dumps (compressed in bz2 format). Since these dumps are so large (around 5 TB), I cannot decompress the XML data into HDFS and just use the StreamXmlRecordReader that Hadoop provides. Hadoop does support bz2 file compression, but it splits the file arbitrarily, breaking pages apart before sending them to the mapper. Since this is XML, we need the splits to fall on tag boundaries. Is there any way to use the built-in bz2 decompression and the StreamXmlRecordReader provided by Hadoop together?
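
For reference, this is roughly the kind of streaming job I mean (a sketch only; the paths and the /bin/cat mapper are placeholders):

    # Sketch: the stock StreamXmlRecordReader wired in through Hadoop
    # Streaming; input/output paths and the identity mapper are placeholders.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -input  /wikipedia/enwiki-pages-articles.xml.bz2 \
      -output /wikipedia/out \
      -inputformat org.apache.hadoop.streaming.StreamInputFormat \
      -inputreader "StreamXmlRecordReader,begin=<page>,end=</page>" \
      -mapper /bin/cat

On an uncompressed dump this yields one <page>...</page> element per record; the question is how to get the same behavior while Hadoop is decompressing and splitting the bz2 input.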

+6
2 answers

The Wikimedia Foundation has just released an InputReader for the Hadoop Streaming interface that can read the bz2-compressed dump files and send them to your mappers. The unit sent to a mapper is not a whole page but two consecutive revisions (so you can actually run a diff on the two revisions). This is the initial release, and I'm sure there will be some bugs, but please give it a spin and help us test it.

This InputReader requires Hadoop 0.21, because Hadoop 0.21 supports streaming bz2 files. Source code is available at: https://github.com/whym/wikihadoop
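
Roughly, you launch it through the streaming jar like this (a sketch only: the paths are placeholders, and the input format class and option values follow the project README at the time of writing, so check the repository for the current details):

    # Sketch of a streaming job using the WikiHadoop input format; paths
    # are placeholders and the split size / timeout values are examples.
    hadoop jar hadoop-streaming.jar \
      -libjars wikihadoop.jar \
      -D mapreduce.input.fileinputformat.split.minsize=300000000 \
      -D mapreduce.task.timeout=6000000 \
      -input  /enwiki-pages-articles.xml.bz2 \
      -output /out \
      -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat \
      -mapper /bin/cat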

+7

Your problem is the same as the one described here, so my answer is the same too: you have to create your own variant of TextInputFormat. In it you create a new RecordReader that skips lines until it sees the beginning of a logical record (for these dumps, the opening <page> tag).
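
A minimal sketch of that idea (illustrative code, not from the linked answer: it assumes the <page> and </page> tags sit on their own lines, and a real implementation still has to deal with records that straddle a split boundary):

    // Sketch of a TextInputFormat variant whose RecordReader skips lines
    // until it sees the opening <page> tag, then returns everything up to
    // </page> as a single record.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class XmlPageInputFormat extends TextInputFormat {
      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new XmlPageRecordReader();
      }

      // Wraps LineRecordReader and re-chunks its lines into page records.
      public static class XmlPageRecordReader
          extends RecordReader<LongWritable, Text> {
        private final LineRecordReader lines = new LineRecordReader();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext ctx)
            throws IOException, InterruptedException {
          lines.initialize(split, ctx);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
          // Skip lines until the start of a logical record.
          while (lines.nextKeyValue()) {
            String line = lines.getCurrentValue().toString();
            if (line.contains("<page>")) {
              StringBuilder page = new StringBuilder(line);
              // Accumulate lines until the record is closed.
              while (!line.contains("</page>") && lines.nextKeyValue()) {
                line = lines.getCurrentValue().toString();
                page.append('\n').append(line);
              }
              value.set(page.toString());
              return true;
            }
          }
          return false;
        }

        @Override public LongWritable getCurrentKey() { return lines.getCurrentKey(); }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() throws IOException { return lines.getProgress(); }
        @Override public void close() throws IOException { lines.close(); }
      }
    }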

0
