What HDFS operations are atomic?

I am trying to write code to import files into HDFS for use as a Hive external table. I found that using something like:

foo | ssh hostname "hdfs dfs -put - /destination/$FILENAME"

can cause a kind of error in which a temporary file is created and then renamed once the copy is done. This can lead to a race condition for Hive between directory listing and query execution.

One workaround is to copy the file to a temporary directory and then "hdfs dfs -mv" it into the desired location.

Specific and general/academic questions:

  • The hdfs dfs -mv command is atomic, right?
  • What other HDFS commands or operations are atomic?
  • Can two "hdfs dfs -mkdir" commands issued at about the same time both believe that they succeeded?
  • Is there a better way to avoid race conditions with Hive when moving files into place?
1 answer

The atomicity requirements are described in the introduction to the Hadoop FileSystem specification:

The following are the main expectations of a Hadoop compatible file system. Some FileSystems do not meet all these expectations; as a result, some programs may not work as expected.

Atomicity

There are some operations that MUST be atomic. This is because they are often used to implement locking/exclusive access between processes in a cluster.

  • Creating a file. If the overwrite parameter is false, the check and the creation MUST be atomic.
  • Deleting a file.
  • Renaming a file.
  • Renaming a directory.
  • Creating a single directory with mkdir().

...

Most other operations come with no requirements or guarantees of atomicity.

So you should check the underlying file system. But based on these requirements, the answers to your questions, in order, are:

  • Yes.
  • See the list above.
  • No.
  • IMHO, renaming the file into place is a good choice.
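
The stage-then-rename approach could be sketched roughly as follows. This is only a minimal sketch, not a tested import tool: the function name put_then_rename, the staging directory, and the ".tmp" naming scheme are all hypothetical, and it assumes the staging directory lives on the same HDFS file system as the destination, so that the rename is a single metadata operation, and that Hive never lists the staging directory.

```shell
#!/bin/sh
# Sketch only: put_then_rename and the directory layout are hypothetical.
# Usage: some_producer | put_then_rename STAGING_DIR DEST_DIR FILENAME
put_then_rename() {
    staging_dir=$1
    dest_dir=$2
    filename=$3

    # Include the shell PID so concurrent imports of the same name do not collide.
    tmp_path="$staging_dir/$filename.$$.tmp"

    # Step 1: stream stdin into the staging directory. The partially written
    # file is never visible inside the table directory.
    hdfs dfs -put - "$tmp_path" || return 1

    # Step 2: rename the finished file into the table directory.
    hdfs dfs -mv "$tmp_path" "$dest_dir/$filename"
}
```

On a host that has the hdfs client installed, this would be called as, for example, foo | put_then_rename /staging /destination "$FILENAME", where /staging is a placeholder path. If the upload fails, the function returns before the rename, so nothing appears in the destination directory.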

Source: https://habr.com/ru/post/1500242/

