Disable replication for Hadoop job output only.

Is there a way to set the replication factor for the output of a specific MapReduce job to something different from the rest of the cluster (say, 1)? I want my main dataset to keep 3 replicas (as it has now), but the output of some of my jobs is quickly moved off the cluster and eventually discarded, so it needs no replication, and I could use the space.

I could use setrep, but I think I can only do that after the output has already been written.

+4
1 answer

When you upload a file, you can override the default DFS replication factor by passing

-D dfs.replication=1 

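The same option should also work when passed on the command line at job launch, provided the job's driver parses generic options (for example, by running through ToolRunner).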
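As a rough sketch of how the option above might be used in both cases: the jar name, driver class, and all paths below are hypothetical placeholders, and the job example assumes the driver accepts generic options, which must come before the job's own arguments.

```shell
# Upload a file with a single replica instead of the cluster default.
# (local-data.txt and /tmp/scratch are hypothetical paths.)
hadoop fs -D dfs.replication=1 -put local-data.txt /tmp/scratch/local-data.txt

# Launch a job whose output files are written with a single replica.
# (my-job.jar, MyJob, /data/input and /tmp/scratch/output are hypothetical.)
hadoop jar my-job.jar MyJob -D dfs.replication=1 /data/input /tmp/scratch/output

# Check the result: for files, the second column of the listing
# is the replication factor.
hadoop fs -ls /tmp/scratch/output
```

Because the override applies only to the configuration of that one invocation, the existing dataset and the cluster-wide default stay at 3 replicas.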

+6

Source: https://habr.com/ru/post/1380269/
