I have a basic question regarding writing and reading files in HDFS.
For example, if I write a file using the default configuration, Hadoop internally writes each block to 3 DataNodes. My understanding is that, for each block, the client first writes the block to the first DataNode in the pipeline, which then forwards it to the second, and so on. Once the third DataNode has successfully received the block, it sends an acknowledgment back to the second DataNode and, finally, to the client via the first DataNode. Only after the acknowledgment for a block is received is the write considered successful, and the client moves on to the next block.
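To make my mental model concrete, here is a minimal client-side write sketch in Java; the NameNode address, the path, and the explicit replication value are just placeholders I chose for illustration:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/example/data.txt"); // placeholder path
            short replication = 3;                          // default replication factor

            // The client writes the stream once; as I understand it, HDFS
            // pipelines each block to the 3 DataNodes behind this one stream.
            try (FSDataOutputStream out = fs.create(
                    path,
                    true,                            // overwrite if the file exists
                    4096,                            // buffer size
                    replication,
                    fs.getDefaultBlockSize(path))) {
                out.write("some data".getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```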
If so, then isn't the time taken to write each block longer than a traditional file write, because of:
- the replication factor (3 by default), and
- the fact that the write proceeds sequentially, block after block?
Please correct me if my understanding is wrong. In addition, I have the following questions:
- My understanding is that reading/writing files in Hadoop has no parallelism, and the best it can do is match a normal file read or write (i.e. as if replication were set to 1), plus some overhead from the distributed communication mechanism.
- Parallelism comes in only at the data-processing stage via MapReduce, not while the client reads or writes the file (a minimal read sketch of what I mean is below).
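For reference, this is the kind of client-side read I am picturing: a single stream that the client consumes block by block, from one replica at a time. Again just a minimal sketch with a placeholder address and path:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/user/example/data.txt"))) {
            // One input stream: the client pulls the file block after block
            // from a single replica at a time, not from all replicas in parallel.
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                System.out.print(new String(buffer, 0, n, StandardCharsets.UTF_8));
            }
        }
    }
}
```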