Hadoop 2.0 data write confirmation

I have a small question regarding how data is written in Hadoop.

From the Apache documentation

In the common case, when the replication factor is three, HDFS's placement policy is to put one replica on a node in the local rack, another on a node in a different (remote) rack, and the last on a different node in that same remote rack. This policy cuts inter-rack write traffic, which generally improves write performance. The chance of rack failure is far less than that of node failure.

In the image below, at which point is the write acknowledged as successful?

1) After the data is written to the first datanode?

2) After the data is written to the first datanode plus the 2 other datanodes?

[Image: Hadoop data write]

I ask because I have heard two conflicting statements in YouTube videos. One video says the write is successful as soon as the data has been written to one datanode, while the other says the acknowledgment is sent only after the data has been written to all three datanodes.

2 answers


Step 1: The client creates the file by calling the create() method on the DistributedFileSystem.

Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem namespace, with no blocks associated with it.

The namenode performs various checks to make sure the file does not already exist and that the client has permission to create it. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.
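For illustration, here is a minimal client-side sketch of Steps 1-2 in Java, assuming the standard HDFS FileSystem API. The cluster address hdfs://namenode:8020 and the file path are hypothetical; in a real deployment fs.defaultFS would usually come from core-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical cluster address; normally picked up from core-site.xml.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // For an hdfs:// URI this returns a DistributedFileSystem (Step 1).
        FileSystem fs = FileSystem.get(conf);

        // create() triggers the RPC to the namenode described in Step 2;
        // if the namenode's checks fail, an IOException is thrown here.
        FSDataOutputStream out = fs.create(new Path("/user/demo/example.txt"));

        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}
```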

Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and assuming here that the replication level is three, there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline.

Step 4: Similarly, the second datanode saves the packet and forwards it to the third (and last) datanode in the pipeline.

Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.
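To make the two queues concrete, here is a toy Java model of the data queue and ack queue described in Steps 3-5. This is not Hadoop's actual DFSOutputStream code; the three-datanode pipeline is replaced by a stub that always acknowledges.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the client-side queues: packets move from the data queue
// into the ack queue when they are sent down the pipeline, and leave the
// ack queue only once every datanode in the pipeline has acknowledged them.
public class AckQueueModel {
    static final int PIPELINE_SIZE = 3; // replication factor assumed above

    public static void main(String[] args) {
        Queue<String> dataQueue = new ArrayDeque<>();
        Queue<String> ackQueue = new ArrayDeque<>();

        dataQueue.add("packet-1");
        dataQueue.add("packet-2");

        while (!dataQueue.isEmpty()) {
            String packet = dataQueue.poll();
            ackQueue.add(packet);                   // waiting for acks
            int acks = sendThroughPipeline(packet); // datanode1 -> datanode2 -> datanode3
            if (acks == PIPELINE_SIZE) {
                ackQueue.remove(packet);            // acknowledged by all datanodes
            }
        }
        System.out.println("unacknowledged packets: " + ackQueue);
    }

    // Stand-in for the real pipeline; pretend every datanode acknowledges.
    static int sendThroughPipeline(String packet) {
        return PIPELINE_SIZE;
    }
}
```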

Step 6: When the client has finished writing data, it calls close() on the stream.

Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of, so it only has to wait for the blocks to be minimally replicated before returning successfully.
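As a rough client-side sketch of Steps 6-7, assuming the standard FileSystem API, a hypothetical path, and configuration picked up from the classpath: close() blocks until the remaining packets have been acknowledged, and hflush() can be used earlier to push buffered packets into the pipeline.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteClose {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/blocks.bin"))) {
            byte[] chunk = new byte[4096];
            for (int i = 0; i < 1024; i++) {
                out.write(chunk);   // buffered into packets on the client (Step 3)
            }
            // hflush() pushes the buffered packets into the pipeline so they
            // become visible to readers before the stream is closed.
            out.hflush();
        } // close() (Step 6) flushes the remaining packets and waits for acks (Step 7)
        fs.close();
    }
}
```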


A data write operation is considered successful as soon as one replica has been written successfully. This is controlled by the dfs.namenode.replication.min property (defined in hdfs-default.xml). If a datanode fails while a replica is being written during file creation, the data that was written is not treated as a failed write; the block is simply under-replicated, and the missing replicas are created later when the cluster re-replicates its blocks. The ack packet is independent of the state of the data written to the datanodes: even if a data packet is not written, an acknowledgment packet is still sent.
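As a sketch of where this property lives, assuming the HDFS client jars are available and any hdfs-site.xml on the classpath is loaded via HdfsConfiguration; the values printed are just the shipped defaults if nothing overrides them.

```java
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class MinReplicationCheck {
    public static void main(String[] args) {
        // HdfsConfiguration adds hdfs-default.xml / hdfs-site.xml as resources.
        HdfsConfiguration conf = new HdfsConfiguration();

        // Default is 1: a block is considered complete once a single replica
        // has been written; the remaining replicas are filled in later.
        int minReplication = conf.getInt("dfs.namenode.replication.min", 1);
        int replicationFactor = conf.getInt("dfs.replication", 3);

        System.out.println("dfs.namenode.replication.min = " + minReplication);
        System.out.println("dfs.replication = " + replicationFactor);
    }
}
```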


Source: https://habr.com/ru/post/971035/

