Serve static files from Hadoop

My task is to build a distributed system for serving static image and video files. The total data size is on the order of tens of terabytes. Access is almost entirely over HTTP; there is either no data processing at all or only something simple such as resizing, and even that is not essential, since it can be done directly in the application.

To put it more clearly, it is a system that:

  • Must be distributed (horizontally scalable), since the total data size is very large.
  • Serves mostly small static files (e.g. images, thumbnails, short videos) over HTTP.
  • As a rule, needs no data processing (so no MapReduce is required).
  • Should make HTTP access to the data easy to set up.
  • Must have good bandwidth.

I am considering:

  • A plain network file system: this seems impossible, because the data cannot fit on a single machine.

  • The Hadoop file system (HDFS). I have worked with Hadoop MapReduce before, but I have no experience using Hadoop as a static file store serving HTTP requests, so I do not know whether this is possible or recommended.

  • MogileFS. This looks promising, but I suspect that using MySQL to manage the local files (on the same machine) would add too much overhead.

Any suggestions, please?

+4
4 answers

I am the author of Weed-FS. Weed-FS is ideal for your requirements. Hadoop cannot handle lots of small files: in addition to the reasons you give, each file needs an entry in the master, and if the number of files is large, the HDFS namenode cannot scale.

Weed-FS gets faster when compiled with the latest Golang releases.

Weed-FS has recently received many new improvements. You can now easily test and benchmark it with the built-in upload tool, which uploads all files in a directory recursively.

weed upload -dir=/some/directory 

Then you can compare "du -k /some/directory" to see the disk usage of the original data against "ls -l /your/weed/volume/directory" to see the disk usage under Weed-FS.

And I suppose you will need replication across data centers, rack awareness, etc. These features are now supported.
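
Since the question is mainly about HTTP access, here is a minimal sketch of the Weed-FS write/read path, assuming a default local setup; the host names, ports (9333 for the master, 8080 for the volume server) and the file id shown are only example values:

# Ask the master for a file id and a volume server location to write to
curl "http://localhost:9333/dir/assign"
# -> {"fid":"3,01637037d6","url":"127.0.0.1:8080","publicUrl":"localhost:8080","count":1}

# Upload the file to the returned volume server under that file id
curl -F file=@/some/directory/photo.jpg "http://127.0.0.1:8080/3,01637037d6"

# Read it back (or serve it to clients) over plain HTTP
curl -o photo.jpg "http://127.0.0.1:8080/3,01637037d6"

Your application only needs to store the short file id next to each image record and build the URL from it.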

+7

Hadoop is optimized for large files; for example, the default block size is 64 MB. Lots of small files are both wasteful and hard to manage on Hadoop.

You can take a look at other distributed file systems, for example GlusterFS.
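
For example, a minimal GlusterFS setup could look roughly like the sketch below; the host names, brick paths and volume name are placeholders, and the mounted volume can then be exposed over HTTP by any ordinary web server such as nginx:

# Form a trusted pool and create a 2-way replicated volume from two bricks
gluster peer probe server2
gluster volume create staticvol replica 2 server1:/data/brick1 server2:/data/brick1
gluster volume start staticvol

# On the front-end machine: mount the volume and point the web server's document root at it
mount -t glusterfs server1:/staticvol /mnt/static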

+3

Hadoop has an API for accessing files; see this entry in the documentation (and the sketch after the list below). That said, I believe Hadoop is not designed to store a large number of small files:

  • HDFS is not designed for efficient access to small files: it is primarily intended for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
  • Every file, directory and block in HDFS is represented as an object in the namenode's memory, each of which occupies about 150 bytes. With a block size of 64 MB, even a 10 KB file gets a block (and a namenode entry) of its own, which is wasteful.
  • If the files are very small and there are many of them, then each map task processes very little input and there are many more map tasks, each of which adds extra overhead. Compare a 1 GB file split into sixteen 64 MB blocks with roughly 10,000 files of about 100 KB: the 10,000 files use one map task each, and the job can run tens or hundreds of times slower than the equivalent job with a single input file.
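
For completeness, here is roughly what the HTTP side of that file-access API can look like. This is a sketch using WebHDFS, assuming it is enabled (dfs.webhdfs.enabled) and that the namenode's web port is the default 50070; the host name and file path are placeholders:

# Read a file stored in HDFS over plain HTTP; the namenode answers with a redirect to a datanode, which -L follows
curl -i -L "http://namenode.example.com:50070/webhdfs/v1/images/photo.jpg?op=OPEN"

Even so, every such read goes through namenode metadata and datanode seeks, so the small-files caveats above still apply.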

At Hadoop Summit 2011 there was a talk by Karthik Ranganathan about Facebook Messaging in which he shared this bit: Facebook stores its data (profiles, messages, etc.) on HDFS, but it does not use the same infrastructure for images and videos. It has its own system, called Haystack, for images. It is not open source, but Facebook has shared its high-level design details.

Which brings me to weed-fs: an open source project inspired by Haystack's design. It is tailor-made for storing files. I have not used it yet, but it seems worth a look.

+2

If you can batch the files and do not need to update them after adding them to HDFS, you could pack multiple small files into a single, larger binary SequenceFile. This is a more efficient way to store small files in HDFS (as Arnon points out, HDFS is built for large files and becomes very inefficient when working with small ones).

This is the approach I used when processing CT images with Hadoop (details in Image Processing in Hadoop). There, the 225 slices of a CT scan (each an individual image) were packed into a single, much larger binary SequenceFile for long streaming reads into Hadoop for processing.
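
The packing itself is typically done with a short Java or MapReduce job. Once the SequenceFile is on HDFS, you can sanity-check it from the command line; the paths below are only examples:

# Compare directory/file/byte counts before and after packing: the packed version should show one file instead of thousands
hadoop fs -count /data/ct/raw-slices
hadoop fs -count /data/ct/packed

# Peek at the packed file; 'fs -text' understands SequenceFiles and prints key/value pairs
hadoop fs -text /data/ct/packed/scans.seq | head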

Hope this helps!


0

Source: https://habr.com/ru/post/1484011/

