Processing a large set of small files with Hadoop

I'm using the Hadoop WordCount example program to process a large set of small files / web pages (roughly 2-3 kB each). Since this is far from the optimal file size for Hadoop, the program runs very slowly. I suspect this is because the cost of setting up and tearing down each task far outweighs the work itself. Such small files also exhaust the namespace of file names.

I've read that in this case I should use Hadoop Archives (HAR), but I'm not sure how to modify the WordCount program to read from such archives. Can the program keep working without changes, or is some modification necessary?

Even if I pack many files into archives, the question remains whether this will improve performance at all. I've read that the files inside a single archive will not be processed by one mapper but by many, which (I suspect) won't improve performance in my case.

If this question is too simple, please understand that I am new to Hadoop and have very little experience with it.

+6
5 answers

Using HDFS doesn't change the fact that you are making Hadoop handle a large quantity of small files. The best option in this case is probably to cat the files into a single (or a few large) file(s). This will reduce the number of mappers you have, which in turn reduces the amount of setup work required for processing.

Using HDFS can improve performance if you are operating on a distributed system. If you are only running in pseudo-distributed mode (a single machine), HDFS isn't going to improve performance; the limitation is the machine.

When you operate on a large number of small files, this requires a large number of mappers and reducers. The setup/teardown can be comparable to the processing time of the file itself, causing a large overhead. cat-ing the files together should reduce the number of mappers Hadoop runs for the job, which should improve performance.
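Not part of the original answer, just a minimal sketch of that concatenation using the Hadoop FileSystem API (Hadoop 2.x assumed; all paths are hypothetical):

```java
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path inputDir = new Path("/user/me/small-pages");     // hypothetical directory of small files
    Path merged = new Path("/user/me/merged/pages.txt");  // hypothetical merged output file

    try (OutputStream out = fs.create(merged)) {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (status.isFile()) {
          try (InputStream in = fs.open(status.getPath())) {
            // Append this small file's bytes to the big file.
            // 'false' = don't close the streams here; try-with-resources handles that.
            IOUtils.copyBytes(in, out, conf, false);
          }
        }
      }
    }
  }
}
```

If the small files are already in HDFS and you want the merged result on local disk, the `hadoop fs -getmerge` shell command does something similar.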

The benefit you could see from using HDFS to store the files would be in distributed mode, with multiple machines. The files would be stored in blocks (64 MB by default) across machines, and each machine would be able to process the blocks of data that reside locally. This reduces network bandwidth use, so it doesn't become a bottleneck in processing.

Archiving the files, if Hadoop is going to unarchive them anyway, will still leave Hadoop with a large number of small files.

Hope this helps your understanding.

+4

From my still limited understanding of Hadoop, I believe the right solution would be to create SequenceFile(s) containing your HTML files as values and, possibly, the URL as the key. If you run an M/R job over the SequenceFile(s), each mapper will process many files (depending on the split size). Each file will be presented to the map function as a single input. You can use SequenceFileAsTextInputFormat as the InputFormat to read these files.
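A minimal sketch of building such a SequenceFile, assuming the Hadoop 2.x API and hypothetical URLs/paths (illustration only, not code from the answer):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackPagesIntoSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path("/user/me/pages.seq");  // hypothetical output path

    // One SequenceFile holding many pages: key = URL, value = page contents.
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      // In practice you would loop over your crawled pages here.
      writer.append(new Text("http://example.org/page1"), new Text("<html>...page 1...</html>"));
      writer.append(new Text("http://example.org/page2"), new Text("<html>...page 2...</html>"));
    }
  }
}
```

On the job side, `job.setInputFormatClass(SequenceFileAsTextInputFormat.class)` (from org.apache.hadoop.mapreduce.lib.input) should then hand each stored page to the map function as a single value.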

See also: Providing multiple non-text files for one map in Hadoop MapReduce

+3

I had recently bookmarked this article to read later and found the same question here :) The post is a bit old, and I'm not entirely sure how relevant it still is; Hadoop is changing at a very fast pace.

http://www.cloudera.com/blog/2009/02/the-small-files-problem/

The blog entry is by Tom White, who is also the author of Hadoop: The Definitive Guide, Second Edition, a recommended read for those getting started with Hadoop.

http://oreilly.com/catalog/0636920010388

+2

Can you merge files before sending them to Hadoop?

+1

CombineFileInputFormat can be used in this case, and it works well for a large number of small files. It packs many of these files into a single split, so each mapper has more to process (1 split = 1 map task). The overall MapReduce processing time also drops, since fewer mappers are running. Since there are no archive-aware InputFormats, using CombineFileInputFormat will improve performance.
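As a hedged illustration of this (assuming Hadoop 2.x, where CombineTextInputFormat is a ready-made concrete subclass of CombineFileInputFormat for plain text), a WordCount-style job wired to it might look like the sketch below; the class name and the 128 MB split cap are illustrative choices, not anything from the answer:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedWordCount {

  // Standard WordCount mapper: emit (word, 1) for every token in the line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Standard WordCount reducer: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount-combined");
    job.setJarByClass(CombinedWordCount.class);

    // Pack many small files into each split; 128 MB is an illustrative cap per split.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Compared with the stock WordCount, the only change is the input-format wiring: one combined split now feeds many small files to a single map task instead of one map task per file.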

0
