What are the main reasons for splitting input in MapReduce?

In the MapReduce papers, it is described that input files are divided into M splits. I know that HDFS in Hadoop automatically splits files into 64 MB blocks (by default) and then replicates these blocks to several other nodes in the cluster for fault tolerance. I would like to know whether this file splitting in HDFS is related to the input splitting described in the MapReduce papers mentioned above. Is fault tolerance the only reason for this splitting, or are there more important reasons?

But what if I run MapReduce on a cluster of nodes without a distributed file system (the data sits only on local disks with an ordinary file system)? Do I need to split the input files on the local disk before the map phase?

Thank you for your responses.

+4
2 answers

I would like to add a few missing concepts (the other answer was confusing to me).



HDFS

A file is stored as blocks for fault/node tolerance. The default block size is 64 MB (128 MB in newer versions). The file is therefore divided into blocks, and the blocks are stored on different nodes of the cluster. Each block is replicated according to the replication factor (default = 3).
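
As a side note (my own illustration, not from the original answer), the block size and replication factor of a file can be checked programmatically through the Hadoop FileSystem API. A minimal sketch, assuming a reachable cluster and a hypothetical file /data/input.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/input.txt");    // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("File length : " + status.getLen());
        System.out.println("Block size  : " + status.getBlockSize());   // e.g. 64 MB or 128 MB
        System.out.println("Replication : " + status.getReplication()); // default 3

        // Replication can be changed per file; the block size is fixed when the file is written.
        fs.setReplication(file, (short) 2);
    }
}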

MapReduce

A file already stored in HDFS is additionally divided, logically, into INPUT SPLITS. The split size can be configured by the user:

Property name           Type   Default value
mapred.min.split.size   int    1
mapred.max.split.size   long   Long.MAX_VALUE

The split size is then calculated with the formula:

max(minimumSize, min(maximumSize, blockSize))

NOTE: the split is purely logical; the data on disk is not physically re-partitioned. A small sketch of the calculation follows below.
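
A minimal sketch of this calculation (my own illustration, not from the original post), using the older property names from the table above; newer Hadoop versions also accept mapreduce.input.fileinputformat.split.minsize / .maxsize:

import org.apache.hadoop.conf.Configuration;

public class SplitSizeDemo {
    // The same formula used for input splits: max(minimumSize, min(maximumSize, blockSize))
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("mapred.min.split.size", 1L);
        conf.setLong("mapred.max.split.size", Long.MAX_VALUE);

        long blockSize = 64L * 1024 * 1024;   // 64 MB HDFS block
        long splitSize = computeSplitSize(
                conf.getLong("mapred.min.split.size", 1L),
                conf.getLong("mapred.max.split.size", Long.MAX_VALUE),
                blockSize);

        // With the defaults above, the split size equals the block size (64 MB here).
        System.out.println("Split size: " + splitSize);
    }
}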



Now, to your questions:

  I would like to know whether this file splitting in HDFS is related to the input splitting described in the MapReduce papers mentioned above.

No. HDFS blocks and MapReduce input splits are not the same thing.

  Is fault tolerance the only reason for this splitting, or are there more important reasons?

No. Distributed (parallel) computation is the main reason for input splitting.

  And what if I run MapReduce on a cluster of nodes without a distributed file system (the data sits only on local disks with an ordinary file system)? Do I need to split the input files on the local disk before the map phase?

In your case, I think yes: you would have to split the input files for the map phase, and you would also have to partition the intermediate output (from the mappers) for the reduce phase. You would also have to deal with problems that Hadoop otherwise handles for you: data consistency, fault tolerance, and data loss (which Hadoop keeps very low).

MapReduce is designed for distributed computing, so using MapReduce in a non-distributed environment is rarely practical.
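
To illustrate what splitting the input yourself could look like without HDFS, here is a rough sketch (my own, with hypothetical file names) that cuts a local file into fixed-size chunks; note that it ignores record boundaries, something a real InputFormat handles for you:

import java.io.*;

public class LocalSplitter {
    public static void split(File input, File outDir, long chunkBytes) throws IOException {
        outDir.mkdirs();
        try (InputStream in = new BufferedInputStream(new FileInputStream(input))) {
            byte[] buf = new byte[64 * 1024];
            int part = 0;
            int read;
            long written = 0;
            OutputStream out = new BufferedOutputStream(
                    new FileOutputStream(new File(outDir, "part-" + part)));
            while ((read = in.read(buf)) != -1) {
                if (written + read > chunkBytes && written > 0) {
                    out.close();                       // current chunk is full, start a new one
                    part++;
                    written = 0;
                    out = new BufferedOutputStream(
                            new FileOutputStream(new File(outDir, "part-" + part)));
                }
                out.write(buf, 0, read);
                written += read;
            }
            out.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical paths; 64 MB chunks to mimic the HDFS default block size.
        split(new File("input.txt"), new File("splits"), 64L * 1024 * 1024);
    }
}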

Thanks!

+3
  I would like to know whether this file splitting in HDFS is related to the input splitting described in the MapReduce papers mentioned above.

No. The input splitting in MapReduce exists so that the processing power of several processors can be used, up to and including the reduce phase. The framework takes a large amount of data and splits it into logical partitions (most of the time as influenced by the programmer in the user's mapper-side code). This data then goes to individual nodes, where independent processes, called reducers, crunch the data, and the final result is assembled at the end.
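
As a hedged illustration of how map output is routed to individual reducers: in Hadoop this routing is handled by a Partitioner class (HashPartitioner by default). The class and key/value types below are my own example, not from the original answer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends each key to a reducer chosen by the key's first character.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;
        }
        String k = key.toString();
        char first = k.isEmpty() ? 'a' : k.charAt(0);
        return (first & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// Registered on the job (job being a hypothetical org.apache.hadoop.mapreduce.Job instance):
//   job.setPartitionerClass(FirstCharPartitioner.class);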

  Is fault tolerance the only reason for this splitting, or are there more important reasons?

No, this is not the only reason. You can compare it with block sizes at the file system level: they ensure that data is transferred in manageable pieces, allow compression to be applied per block, and make it easier to allocate I/O buffers.

+1

Source: https://habr.com/ru/post/1439826/

