
Step 1: The client creates the file by calling the create() method on DistributedFileSystem.
Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it.
The namenode performs various checks to make sure that the file does not already exist and that the client has permission to create it. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client receives an IOException. DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.
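The two checks in Step 2 can be illustrated with a small, self-contained sketch. This is a toy model, not Hadoop code: the class NamespaceSketch, its methods, and the user/path values are all hypothetical, and the permission check is reduced to a simple set lookup.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of the namenode's namespace; not part of Hadoop.
class NamespaceSketch {
    // Maps each file path to the number of blocks allocated to it so far.
    private final Map<String, Integer> files = new HashMap<>();
    // Users allowed to create files (a stand-in for real permission checks).
    private final Set<String> allowedUsers = new HashSet<>();

    void allow(String user) {
        allowedUsers.add(user);
    }

    // Runs both checks, then records the new file with zero blocks.
    void create(String user, String path) throws IOException {
        if (files.containsKey(path)) {
            throw new IOException("File already exists: " + path);
        }
        if (!allowedUsers.contains(user)) {
            throw new IOException("Permission denied for user: " + user);
        }
        files.put(path, 0);
    }

    boolean exists(String path) {
        return files.containsKey(path);
    }

    // Returns the block count, or -1 if the file was never recorded.
    int blockCount(String path) {
        return files.getOrDefault(path, -1);
    }

    public static void main(String[] args) {
        NamespaceSketch namespace = new NamespaceSketch();
        namespace.allow("alice");
        String[][] attempts = {
            {"alice", "/logs/a.txt"},  // passes both checks
            {"alice", "/logs/a.txt"},  // fails: file already exists
            {"bob", "/logs/b.txt"},    // fails: no permission
        };
        for (String[] attempt : attempts) {
            try {
                namespace.create(attempt[0], attempt[1]);
                System.out.println("created " + attempt[1]);
            } catch (IOException e) {
                System.out.println("failed: " + e.getMessage());
            }
        }
    }
}
```

Note that a file that passes the checks is recorded with zero blocks; blocks are only allocated later, once data starts flowing.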
Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline; assuming a replication level of three, there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline.
Step 4: Similarly, the second datanode saves the packet and forwards it to the third (and last) datanode in the pipeline.
Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.
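The interplay between the data queue, the pipeline, and the ack queue can be sketched in a few lines of plain Java. This is a simplified, single-threaded model rather than the real DFSOutputStream: the class WritePipelineSketch, its methods, and the 4-byte packet size are assumptions made for illustration, and each datanode is just an in-memory list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical toy model of the client-side queues; not Hadoop code.
class WritePipelineSketch {
    static final int PACKET_SIZE = 4;  // assumed tiny packet size, in bytes
    static final int REPLICATION = 3;  // replication level from the text

    final Deque<byte[]> dataQueue = new ArrayDeque<>();
    final Deque<byte[]> ackQueue = new ArrayDeque<>();
    // One list of stored packets per datanode, in pipeline order.
    final List<List<byte[]>> datanodes = new ArrayList<>();

    WritePipelineSketch() {
        for (int i = 0; i < REPLICATION; i++) {
            datanodes.add(new ArrayList<>());
        }
    }

    // Splits the data into packets and appends them to the data queue.
    void write(byte[] data) {
        for (int start = 0; start < data.length; start += PACKET_SIZE) {
            int end = Math.min(start + PACKET_SIZE, data.length);
            dataQueue.addLast(Arrays.copyOfRange(data, start, end));
        }
    }

    // Stand-in for the DataStreamer: moves the next packet to the ack queue
    // and passes it along the pipeline. Returns false if nothing was queued.
    boolean sendNext() {
        byte[] packet = dataQueue.pollFirst();
        if (packet == null) {
            return false;
        }
        ackQueue.addLast(packet);
        for (List<byte[]> datanode : datanodes) {
            datanode.add(packet);  // each node stores the packet, then forwards it
        }
        return true;
    }

    // Removes the oldest pending packet only if every datanode acknowledged it.
    boolean acknowledge(int acksReceived) {
        if (ackQueue.isEmpty() || acksReceived < REPLICATION) {
            return false;
        }
        ackQueue.removeFirst();
        return true;
    }

    public static void main(String[] args) {
        WritePipelineSketch pipeline = new WritePipelineSketch();
        pipeline.write(new byte[10]);  // 10 bytes become packets of 4, 4 and 2 bytes
        while (pipeline.sendNext()) {
            pipeline.acknowledge(2);            // one ack missing: packet stays queued
            pipeline.acknowledge(REPLICATION);  // all three acked: packet removed
        }
        System.out.println("pending acks: " + pipeline.ackQueue.size());
        System.out.println("packets on last datanode: " + pipeline.datanodes.get(2).size());
    }
}
```

Keeping unacknowledged packets in a separate queue means the client still holds a copy of every packet that has not yet been confirmed by the whole pipeline.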
Step 6: When the client has finished writing data, it calls close() on the stream.
Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of (because the DataStreamer asked for the block allocations), so it only has to wait for the blocks to be minimally replicated before returning successfully.