Difference between hadoop fs -put and hadoop distcp

We are about to start the data ingestion phase for the data lake in our project. Throughout my experience as a Hadoop developer I have mostly used hadoop fs -put. How does hadoop distcp differ from it, and how does its usage differ?

+6
4 answers

Distcp is a special tool used to copy data from one cluster to another. You usually copy from one HDFS to another HDFS, not from the local file system. Another important point is that it runs as a MapReduce job with 0 reduce tasks, which makes it faster because the copy is distributed. It expands the list of files and directories into input for the map tasks, each of which copies a partition of the files specified in the source list.

hdfs put copies data from the local file system to HDFS. It uses the HDFS client behind the scenes and does all the work sequentially, talking to the NameNode and the DataNodes. It does not create MapReduce jobs to process the data.
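To make the contrast concrete, here is a minimal sketch of the two commands side by side. The paths and the NameNode addresses (nn1, nn2) are hypothetical placeholders, and the `run` helper is only an illustration device: by default it prints each command instead of executing it, so the snippet is safe to try without a cluster. `-m` is the distcp option that caps the number of map tasks.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a machine with a configured Hadoop client to run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Hypothetical paths and NameNode addresses, for illustration only.
LOCAL_FILE="/tmp/events.csv"
SRC="hdfs://nn1:8020/data/events"
DST="hdfs://nn2:8020/data/events"

# Local -> HDFS: one client process streams the file to the DataNodes.
run hadoop fs -put "$LOCAL_FILE" /data/events/

# HDFS -> HDFS: a map-only MapReduce job; -m caps the number of map tasks.
run hadoop distcp -m 20 "$SRC" "$DST"
```

For a handful of local files put is usually the simpler choice; distcp starts to pay off when the data already sits in a cluster and is large enough to benefit from parallel copying.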

+7

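hdfs or hadoop put is used to ingest data from the local file system into HDFS.

distcp cannot be used to ingest data from the local file system into HDFS, because it works only on HDFS.

We use distcp for backing up (archiving) and restoring HDFS files, something like this:

hadoop distcp $CURRENT_HDFS_PATH $BACKUP_HDFS_PATH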

0

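"distcp cannot be used from Local to HDFS, it works only on HDFS" → it can: use "file" (for example, "file:///tmp/test.txt") as the scheme in the URL (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-Dist/hadoop-common/FileSystemShell.html)

Hint: "hadoop distcp -D dfs.replication=1" writes a single replica of each copied file, which shortens the distcp run.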
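A rough sketch of that hint, assuming a hypothetical target NameNode nn1 and landing directory: copy with a single replica first, then raise the replication factor afterwards with `hadoop fs -setrep`. Note that with a file:// source the path has to be readable on every node where a map task may run, so this mostly suits a shared mount. The `run` helper below is an illustration device that prints the commands unless DRY_RUN=0 is set.

```shell
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

LOCAL_SRC="file:///tmp/test.txt"     # local source, as in the answer above
HDFS_DST="hdfs://nn1:8020/landing/"  # hypothetical target directory

# Copy with one replica per block to keep the distcp job short.
run hadoop distcp -D dfs.replication=1 "$LOCAL_SRC" "$HDFS_DST"

# Afterwards, raise the replication factor; -w waits until it is reached.
run hadoop fs -setrep -w 3 "${HDFS_DST}test.txt"
```

Until the second command finishes, the copied file has only one replica, so this trades durability for a shorter copy window.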

0

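Distcp is a command used to copy data from one cluster's HDFS location to another cluster's HDFS location. It creates a MapReduce job with 0 reducers to do the copy.

hadoop distcp webhdfs://source-ip/directory/filename webhdfs://target-ip/directory/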

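scp is a command used to copy data from one machine's local file system to another machine's local file system.

scp source-ip:/directory/filename target-ip:/directory/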

hdfs put copies data from the local file system to HDFS. It does not create MapReduce jobs to process the data.

hadoop fs -put -f /path/file /hdfspath/file

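hdfs get copies data from HDFS to the local file system.

First go to the directory where you want the file to land, then run the command below: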

hadoop fs -get /hdfsloc/file
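As a small variation, `hadoop fs -get` also accepts a local destination as a second argument, so changing directory first is optional. The local directory below is a hypothetical placeholder, and the `run` helper only prints the command unless DRY_RUN=0 is set.

```shell
# Dry-run by default: the command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

LOCAL_DIR="/tmp/downloads/"  # hypothetical local target directory

# HDFS -> local, with an explicit destination instead of the current directory.
run hadoop fs -get /hdfsloc/file "$LOCAL_DIR"
```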
0

Source: https://habr.com/ru/post/1673585/

