Why does a datanode send block location information to namenode?

Question

Why does a datanode send block location information to namenode?

The documentation at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html says:

DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.

But why is this information sent to the NameNode and its standby counterpart? I thought this information was already in the NameNode image. The NameNode must know where it puts the blocks.

+5

hadoop hdfs

serg Dec 11 '15 at 16:46


2 answers

DataNodes are not directly accessible from outside the cluster; they sit on a private network. A Hadoop cluster is prone to node failures, and the NameNode keeps track of all the data on the different DataNodes. Thus, any request to the cluster is handled by the NameNode, which provides the address of the block on a DataNode.

+1

karnation Dec 11 '15 at 17:20


Manjunath ballur · Accepted Answer · 2015-12-11T19:20:43+0000

The NameNode holds the metadata for the entire cluster. It contains information about each folder, file, replication factor, block names, etc. The NameNode also keeps block location information for each file in memory (this information is built from the block reports sent by the DataNodes).

DataNodes store the following information for each block:

- The actual data stored in the block
- Metadata for the data stored in the block, mainly checksums for that data
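As a rough illustration of how per-block checksums let a DataNode detect corruption, here is a minimal sketch. The 512-byte chunk size, the use of `zlib.crc32`, and the function names are assumptions made for this example; they are not taken from the answer and do not reproduce the actual HDFS on-disk metadata format.

```python
import zlib
from typing import List

BYTES_PER_CHECKSUM = 512  # assumed chunk size for this sketch


def compute_checksums(block_data: bytes, chunk_size: int = BYTES_PER_CHECKSUM) -> List[int]:
    """Return one CRC32 value per fixed-size chunk of the block."""
    return [
        zlib.crc32(block_data[offset:offset + chunk_size])
        for offset in range(0, len(block_data), chunk_size)
    ]


def find_corrupt_chunks(block_data: bytes, stored: List[int],
                        chunk_size: int = BYTES_PER_CHECKSUM) -> List[int]:
    """Return the indexes of chunks whose checksum no longer matches the stored one."""
    current = compute_checksums(block_data, chunk_size)
    return [i for i, (now, then) in enumerate(zip(current, stored)) if now != then]
```

Storing the checksums next to the data means a replica can be verified locally, without asking the NameNode for anything.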

They periodically send heartbeats and block reports to the NameNode.

Heartbeat:

The heartbeat interval is determined by the dfs.heartbeat.interval configuration property (in hdfs-site.xml). The default value is 3 seconds.
Some of the information contained in a heartbeat:
- Registration : DataNode registration information
- Capacity : total storage capacity available at the DataNode
- dfsUsed : storage used by HDFS
- remaining : remaining storage available for HDFS
- blockPoolUsed : storage used by the block pool
- xmitsInProgress : number of transfers from this DataNode to others
- xceiverCount : number of active transceiver threads
- cacheCapacity : total cache capacity available at the DataNode
- cacheUsed : amount of cache used
This information is used by the NameNode in the following ways:
- DataNode health . Should this DataNode be marked as dead or alive?
- Registration of new DataNodes . If this is a newly added DataNode, its information is registered
- Updating DataNode metrics . The heartbeat information is used to update the metrics of the DataNode
- Issuing commands to the DataNode . Based on the information received in a heartbeat, the NameNode can issue the following commands to a DataNode: BlockRecoveryCommand (to recover specified blocks), BlockCommand (to transfer blocks to another DataNode, or to invalidate certain blocks), Cache/Uncache (commands to cache / uncache blocks)
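The liveness and registration handling described above can be sketched roughly as follows. The class names, the field subset, and the 30-second timeout used in the usage lines are hypothetical choices for illustration; HDFS applies its own timeout rule and carries more fields than shown here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataNodeInfo:
    capacity: int = 0
    dfs_used: int = 0
    remaining: int = 0
    last_heartbeat: float = 0.0


class HeartbeatMonitor:
    """Toy stand-in for the NameNode side of heartbeat handling."""

    def __init__(self, dead_timeout_s: float):
        # Hypothetical threshold: a node is dead if silent for longer than this.
        self.dead_timeout_s = dead_timeout_s
        self.nodes: Dict[str, DataNodeInfo] = {}

    def handle_heartbeat(self, node_id: str, capacity: int, dfs_used: int,
                         remaining: int, now: float) -> bool:
        """Record a heartbeat; return True if the node was newly registered."""
        is_new = node_id not in self.nodes
        info = self.nodes.setdefault(node_id, DataNodeInfo())
        info.capacity, info.dfs_used, info.remaining = capacity, dfs_used, remaining
        info.last_heartbeat = now
        return is_new

    def is_alive(self, node_id: str, now: float) -> bool:
        info = self.nodes.get(node_id)
        return info is not None and (now - info.last_heartbeat) <= self.dead_timeout_s

    def dead_nodes(self, now: float) -> List[str]:
        return sorted(n for n in self.nodes if not self.is_alive(n, now))


monitor = HeartbeatMonitor(dead_timeout_s=30)
monitor.handle_heartbeat("dn1", capacity=100, dfs_used=10, remaining=90, now=0.0)
print(monitor.dead_nodes(now=60.0))
```

In this toy version the first heartbeat doubles as registration, and every later one refreshes the stored metrics and the liveness timestamp.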

Block reports:

The block report interval is determined by the dfs.blockreport.intervalMsec configuration property (in hdfs-site.xml). The default value is 21600000 milliseconds (6 hours).
Some of the information contained in a block report:
- Registration : DataNode registration information
- blocks : information about the blocks, which contains: block ID, block length, block generation stamp, and the state of the block replica (for example, whether the replica is finalized or awaiting recovery)
This information is used by the NameNode to:
- Process the first block report . If this is the first report from a newly registered DataNode, the NameNode simply adds all valid replicas. It ignores all invalid blocks until the next block report.
- Update information about blocks : the (DataNode → blocks) map in the NameNode is updated. The new block report is compared with the old one, and the information about valid blocks, corrupt blocks, invalid blocks, etc. is updated.
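To make the answer to the original question concrete: since block locations live only in memory and are built from block reports, a NameNode that has received no reports knows no locations. The sketch below is a simplified model of that map and of the old-versus-new comparison; `BlockMap` and its methods are hypothetical names, and replica states and generation stamps are left out.

```python
from typing import Dict, Iterable, List, Set, Tuple


class BlockMap:
    """Toy in-memory map of DataNode -> block IDs, rebuilt from block reports."""

    def __init__(self) -> None:
        self.blocks_by_node: Dict[str, Set[int]] = {}

    def process_block_report(self, node_id: str,
                             reported: Iterable[int]) -> Tuple[Set[int], Set[int]]:
        """Replace the node's block set; return (added, removed) block IDs."""
        new = set(reported)
        old = self.blocks_by_node.get(node_id, set())
        self.blocks_by_node[node_id] = new
        return new - old, old - new

    def locations(self, block_id: int) -> List[str]:
        """Return the DataNodes currently known to hold the block."""
        return sorted(node for node, blocks in self.blocks_by_node.items()
                      if block_id in blocks)


active = BlockMap()
active.process_block_report("dn1", [1, 2, 3])
active.process_block_report("dn2", [2, 3])
print(active.locations(2))   # both DataNodes reported block 2

standby = BlockMap()         # never received a report
print(standby.locations(2))  # knows no locations at all
```

This is why the DataNodes in an HA setup report to both NameNodes: the standby needs the same in-memory map to be able to take over quickly.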
