Restore Hadoop NameNode from Metadata Backup

I am trying to recover NameNode (NN) metadata. I made a backup of the NameNode and JournalNode metadata. It contains edit logs and fsimages.

There are two NNs in my system. I back up the metadata on both NNs (HDFS metadata and QJM metadata) at regular intervals. I want to test the recovery procedure for the worst case: suppose the metadata on both NNs and on the JournalNodes has been completely deleted.

I want to restore NN metadata from backup and start NN. I know that there may be data loss, since the latest changes made after the backup will be absent.

Questions

  • Do you think such a recovery scenario is possible / feasible?
  • I ran into issues where the txid of the restored metadata does not match the committed txid. Let me know if there is a solution for this.

Steps:

  • Take a backup of the NN and QJM metadata. Do some operations on HDFS files (create new files).
  • Stop the NN and the JournalNode on both machines.
  • Delete the metadata from the /data/hdfs and journal directories.
  • Restore the fsimages from a backup (taken some time ago).
  • Start the NN. It crashes with several exceptions.

An alternative approach: restore all the edit logs and fsimages in both the HDFS and QJM directories and start the NN, but it still does not work.

Both NNs are down, and I cannot bring them up. I do not want to format HDFS, because that would change the cluster ID and make the backup unusable.
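Since the cluster ID matters here, one check that may help before starting anything is to compare the `clusterID` recorded in the restored VERSION files of the NN and the JournalNode. The sketch below assumes the usual properties-style VERSION file with a `clusterID=...` line; the function names `cluster_id` and `same_cluster` are hypothetical helpers, not Hadoop tools.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Hadoop.

# Print the clusterID value stored in a VERSION file.
cluster_id() {
  sed -n 's/^clusterID=//p' "$1"
}

# Succeed only if both VERSION files carry the same, non-empty clusterID.
same_cluster() {
  a=$(cluster_id "$1")
  b=$(cluster_id "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example, `same_cluster /data/hdfs/current/VERSION <journal dir>/current/VERSION` (both paths are placeholders for your own directories) returns a non-zero status when the two IDs differ.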

Exceptions

  • There appears to be a gap in the edit log. We expected txid 71453, but got txid 71466
  • Client trying to move committed txid backward from 71599 to 71453
  • recoverUnfinalizedSegments failed for required journal. Decided to synchronize log to startTxId: 71453, but logger 10.204.64.26:8485 had seen txid 71599 committed
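The first exception suggests that the restored fsimage ends at txid 71452 while the oldest available edit segment starts at 71466. A minimal sketch for spotting such a gap before starting the NN is shown below. It assumes the standard file naming inside a `current` directory (`fsimage_<txid>` and `edits_<start>-<end>`, zero-padded); `check_edit_gap` and `strip_zeros` are hypothetical helper names, not Hadoop tools, and in-progress segments are ignored.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Hadoop.

# Strip zero padding so the shell does not read a txid as octal.
strip_zeros() {
  printf '%s\n' "$1" | sed -e 's/^0*//' -e 's/^$/0/'
}

# check_edit_gap DIR
# Reports the first txid after the newest fsimage in DIR that is not
# covered by a finalized edit segment.
check_edit_gap() {
  dir=$1
  img_txid=-1
  # Find the highest txid among fsimage_<txid> files (skips *.md5).
  for f in "$dir"/fsimage_*; do
    [ -e "$f" ] || continue
    name=${f##*/fsimage_}
    case $name in ''|*[!0-9]*) continue ;; esac
    txid=$(strip_zeros "$name")
    if [ "$txid" -gt "$img_txid" ]; then img_txid=$txid; fi
  done
  if [ "$img_txid" -lt 0 ]; then
    echo "no fsimage found"
    return 2
  fi
  next=$(($img_txid + 1))
  # Globs expand in sorted order; zero-padded names sort by txid.
  for f in "$dir"/edits_[0-9]*-[0-9]*; do
    [ -e "$f" ] || continue
    name=${f##*/edits_}
    start=$(strip_zeros "${name%%-*}")
    end=$(strip_zeros "${name##*-}")
    # Segments fully covered by the fsimage are irrelevant.
    if [ "$end" -lt "$next" ]; then continue; fi
    if [ "$start" -gt "$next" ]; then
      echo "gap: expected txid $next, but got txid $start"
      return 1
    fi
    next=$(($end + 1))
  done
  echo "contiguous up to txid $(($next - 1))"
}
```

Running `check_edit_gap` on the restored NN directory and on the restored journal directory should show whether the backup itself is consistent, independent of what the JournalNodes remember as committed.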
3 answers

You can start the NameNode with the recovery flag enabled. NameNode recovery will take care of the corrupt metadata.

./bin/hadoop namenode -recover 
  • Since the latest FsImage and edit logs were lost or damaged, you should try to recover the metadata

    ./bin/hadoop namenode -recover

    Refer: NameNode Recovery Tools for the Hadoop Distributed File System

  • Since the journal is out of sync with the NameNode, you must re-initialize it.

    ./bin/hdfs namenode -initializeSharedEdits

  • Since the recovered FsImage lacks the changes made after the last backup, you should check for and delete the corrupted data

    ./bin/hadoop fsck -delete /

    If you don't run fsck, the NameNode may get stuck in safe mode because too many blocks go unreported.


Start all JournalNodes. Make sure you have copied the fsimage, fsimage.md5 and VERSION files. Then run hdfs namenode -initializeSharedEdits -force; it will format only the JournalNodes. Then start the first NameNode. It should work. Let me know if this doesn't work.
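Before running -initializeSharedEdits, it may be worth confirming that the three files mentioned above are actually present in the restored directory. The following is only a sketch: `check_restored_image` is a hypothetical helper, and it assumes the files sit together in one `current` directory with the usual `fsimage_<txid>` and `fsimage_<txid>.md5` names. It checks presence only and does not verify the checksum.

```shell
#!/bin/sh
# Hypothetical helper, not part of Hadoop.

# check_restored_image DIR
# Verifies that DIR holds a VERSION file and at least one fsimage,
# and that every fsimage has its .md5 companion.
check_restored_image() {
  dir=$1
  if [ ! -f "$dir/VERSION" ]; then
    echo "missing VERSION"
    return 1
  fi
  found=0
  for f in "$dir"/fsimage_*; do
    # Skip the checksum files themselves and unmatched globs.
    case $f in *.md5) continue ;; esac
    [ -f "$f" ] || continue
    found=1
    if [ ! -f "$f.md5" ]; then
      echo "missing ${f##*/}.md5"
      return 1
    fi
  done
  if [ "$found" -ne 1 ]; then
    echo "no fsimage found"
    return 1
  fi
  echo "ok"
}
```

A call such as `check_restored_image /data/hdfs/current` (the path is an assumption based on the question) prints `ok` only when all three kinds of file are in place.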


Source: https://habr.com/ru/post/970566/

