Distributed HDFS Reads Without Map/Reduce

Can I get distributed reads from an HDFS cluster using an HDFS client running on a single machine?

I experimented with a cluster of 3 data nodes (DN1, DN2, DN3). I then started 10 simultaneous reads of 10 independent files from a client program located on DN1, and it appeared to read data only from DN1. The other data nodes (DN2, DN3) showed zero activity (judging by the debug logs).

I verified that the blocks of all files are replicated to all 3 datanodes, so if I shut down DN1, the data is read from DN2 (and only DN2).

Increasing the amount of data read did not help (I tried from 2 to 30 GB).

Since I need to read several large files and extract only a small amount of data from them (a few kilobytes), I would like to avoid Map/Reduce, as it requires setting up additional services and also requires writing the output of each split task to HDFS. Rather, it would be nice to have the results streamed directly back to my client program from the data nodes.

I use a SequenceFile to read/write data this way (JDK 7):

    // Run in a thread pool on multiple files simultaneously
    List<String> result = new ArrayList<>();
    LongWritable key = new LongWritable();
    Text value = new Text();
    try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
            SequenceFile.Reader.file(filePath))) {
        reader.next(key);
        if (key.get() == ID_I_AM_LOOKING_FOR) {
            reader.getCurrentValue(value);
            result.add(value.toString());
        }
    }
    return result; // results from multiple workers are merged later
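For completeness, the thread-pool driver around this snippet could look roughly like the sketch below. This is not from the original post: readFile is a hypothetical method wrapping the code above, filePaths is an assumed list of input paths, and the enclosing method is assumed to declare throws Exception.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.hadoop.fs.Path;

    // Submit one read task per file, then merge the per-file results.
    ExecutorService pool = Executors.newFixedThreadPool(10);
    List<Future<List<String>>> futures = new ArrayList<>();
    for (final Path filePath : filePaths) {              // filePaths: assumed inputs
        futures.add(pool.submit(new Callable<List<String>>() {
            @Override
            public List<String> call() throws Exception {
                return readFile(filePath);               // the snippet above, as a method
            }
        }));
    }
    List<String> merged = new ArrayList<>();
    for (Future<List<String>> f : futures) {
        merged.addAll(f.get());                          // waits for each task to finish
    }
    pool.shutdown();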

Any help appreciated. Thanks!

+6
3 answers

I am afraid that the behavior you are seeing is by design. From the Hadoop documentation:

Replica selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If the HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.

This can be further confirmed by the corresponding Hadoop source code:

    LocatedBlocks getBlockLocations(...) {
        LocatedBlocks blocks = getBlockLocations(src, offset, length, true, true);
        if (blocks != null) {
            // sort the blocks
            DatanodeDescriptor client = host2DataNodeMap.getDatanodeByHost(clientMachine);
            for (LocatedBlock b : blocks.getLocatedBlocks()) {
                clusterMap.pseudoSortByDistance(client, b.getLocations());
                // Move decommissioned datanodes to the bottom
                Arrays.sort(b.getLocations(), DFSUtil.DECOM_COMPARATOR);
            }
        }
        return blocks;
    }

That is, all available replicas are tried one after another if the previous one fails, but the closest replica always comes first.
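You can observe this ordering from the client side with the standard FileSystem.getFileBlockLocations call, which returns the hosts for each block in the order the namenode produced them (in my understanding, closest replica first for the calling machine). A minimal sketch, with a placeholder file path, run from inside a method that may throw IOException:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Print which datanodes host each block of a file, in namenode order.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/myfile.seq");   // placeholder path
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] locations = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : locations) {
        System.out.println("offset " + block.getOffset() + " -> "
                + java.util.Arrays.toString(block.getHosts()));
    }

If DN1 appears first for every block, that is the replica-selection policy at work.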

On the other hand, if you access HDFS files through the HDFS Proxy, it selects datanodes randomly. But I do not think that is what you want.

+7

In addition to what Edward said, note that your current cluster is very small (only 3 nodes), which is why you have the files on every node: the default replication factor in Hadoop is also 3. In a larger cluster your files would not be available on every node, so accessing multiple files would likely hit different nodes and spread the load.
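If you want to reproduce that effect on the 3-node cluster, one option is to lower the replication factor of the test files so their blocks no longer live on every node. A sketch of mine using the standard FileSystem API (the path is a placeholder, and the enclosing method is assumed to throw IOException); the same can be done from the shell with hadoop fs -setrep:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Drop replication of an existing file to 1 so each block lives on a
    // single datanode; reads then cannot all be served locally from DN1.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    fs.setReplication(new Path("/data/myfile.seq"), (short) 1);  // placeholder path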

If you work with smaller data sets, you may want to look at HBase, which lets you work with smaller chunks and spread the load between nodes (through region splitting).
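For the lookup-by-ID use case in the question, that would amount to a point read, roughly like the sketch below (using the current HBase client API; the table name, column family, and qualifier are made up for illustration, and the enclosing method is assumed to throw IOException):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Point lookup by row key; HBase routes the request to the region
    // server owning the region that contains this key.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("records"))) {
        Get get = new Get(Bytes.toBytes(ID_I_AM_LOOKING_FOR));  // row key = the ID
        Result row = table.get(get);
        byte[] value = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("payload"));
    }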

+3

I would say that your case sounds like a good fit for MR. Setting aside the specific MR computing paradigm, the point is that Hadoop is built to bring the code to the data instead of the opposite. Moving code to data is essential for scalable data processing.
Also, setting up MapReduce on top of an existing HDFS is relatively easy, since MR does not preserve any state between jobs.
At the same time, the MR framework will take care of the parallel processing for you, something that takes real effort to get right on your own.
One more point: if the results of the data processing are that small, there will be no significant performance impact from combining them together in a reducer.
In other words, I would suggest reconsidering the use of MapReduce.
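To make that concrete: assuming the keys are LongWritable and the values Text, as in the question's snippet, the whole lookup fits in a small job whose mapper filters by ID and whose single (default, identity) reducer merges the few matching records into one output file. This is a sketch of mine, not code from the answer; the class name and the lookup.id property are invented:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class IdLookup {

        // Emits only the records whose key matches the ID we are looking for.
        public static class LookupMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {
            private long targetId;

            @Override
            protected void setup(Context context) {
                targetId = context.getConfiguration().getLong("lookup.id", -1);
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                if (key.get() == targetId) {
                    context.write(NullWritable.get(), value);
                }
            }
        }

        // Usage: IdLookup <input dir> <output dir> <id>
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setLong("lookup.id", Long.parseLong(args[2]));
            Job job = Job.getInstance(conf, "id-lookup");
            job.setJarByClass(IdLookup.class);
            job.setMapperClass(LookupMapper.class);
            job.setNumReduceTasks(1);  // default identity reducer merges the matches
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The mappers run on the nodes holding the blocks, so only a few kilobytes of matches ever cross the network.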

0

Source: https://habr.com/ru/post/903503/

