Can I get distributed reads from an HDFS cluster when the HDFS client runs on one of the cluster machines?
I experimented with a cluster of 3 data nodes (DN1, DN2, DN3). I started 10 simultaneous reads of 10 independent files from a client program located on DN1, and it appeared to read data only from DN1. The other data nodes (DN2, DN3) showed zero activity (judging by the debug logs).
I verified that the blocks of all files are replicated to all 3 data nodes, so if I disable DN1, the data is read from DN2 (and only DN2).
Increasing the amount of data read did not help (I tried from 2 to 30 GB).
Since I need to read several large files and extract only a small amount of data (a few kilobytes) from each, I would like to avoid using MapReduce, as it requires setting up additional services and also requires writing the output of each map task to HDFS. Instead, it would be nice for the results to be streamed directly from the data nodes to my client program.
I use a SequenceFile to read/write data this way (jdk7):
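The original code snippet was not included; below is a minimal sketch of the kind of sequential SequenceFile read described above, using the standard `SequenceFile.Reader` API. The file path, key type (`Text`), and value type (`BytesWritable`) are placeholders — the actual types depend on how the files were written.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder path; e.g. hdfs://namenode:8020/data/file1.seq
        Path path = new Path(args[0]);
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
            // Key/value types are assumptions; they must match the writer's types.
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                // Extract only the few kilobytes of interest from each record here.
                System.out.println(key + " -> " + value.getLength() + " bytes");
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}
```

Note that when such a client runs on a data node, the HDFS client deliberately prefers the local replica of each block, which would explain why all reads hit DN1.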
Any help appreciated. Thanks!