I would like to add a few missing concepts (the existing answer was confusing for me).
HDFS
A file is stored as blocks (for fault tolerance / node tolerance). The default block size is 64 MB (typically 64-128 MB). The file is thus divided into blocks, and the blocks are stored on different nodes of the cluster. Each block is replicated according to the replication factor (default = 3).
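For illustration, here is a minimal Java sketch (the path and values are made-up examples, not a prescription) that writes a file to HDFS with an explicit block size and replication factor:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 128L * 1024 * 1024;  // 128 MB blocks
        short replication = 3;                // default replication factor

        // FileSystem.create lets you override block size and replication per file.
        try (FSDataOutputStream out = fs.create(
                new Path("/tmp/example.txt"),  // hypothetical path
                true,                          // overwrite if it exists
                4096,                          // I/O buffer size
                replication,
                blockSize)) {
            out.writeBytes("hello hdfs\n");
        }
    }
}
```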
MapReduce
A file that is already stored in HDFS is logically divided into input splits. The split size can be set by the user:
Property name           Type   Default value
mapred.min.split.size   int    1
mapred.max.split.size   long   Long.MAX_VALUE
The split size is then calculated by the formula:
max(minimumSize, min(maximumSize, blockSize))
NOTE: the split is logical only; the data itself is not physically re-divided.
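To make this concrete, here is a small Java sketch (using the newer mapreduce API; the example block size is an assumption) that sets the min/max split size on a job and reproduces the calculation above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    // Mirrors the formula: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");

        // These correspond to the min/max split size properties in the table above.
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, Long.MAX_VALUE);

        long blockSize = 128L * 1024 * 1024; // example: 128 MB HDFS block
        // With the defaults, the split size works out to the block size.
        System.out.println(computeSplitSize(1L, Long.MAX_VALUE, blockSize)); // 134217728
    }
}
```

With the default minimum (1) and maximum (Long.MAX_VALUE), the formula simply returns the block size, so by default one input split corresponds to one HDFS block.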
Looking forward to your questions now!
I'd like to know whether this partitioning of files in HDFS is the same as the input splitting described in the MapReduce papers mentioned.
No. HDFS blocks and MapReduce input splits are not the same thing.
Is fault tolerance the only reason for this splitting, or are there more important reasons?
No. Distributed computing is the main reason: splitting the data is what allows it to be processed in parallel on different nodes.
And what if I run MapReduce over a cluster of nodes without a distributed file system (data only on local disks with a common file system)? Do I need to split the input files on the local disk before the map phase?
In your case, I think yes: you would have to split the input file for the map phase, and you would also have to partition the intermediate output (from the mappers) for the reduce phase. You would also face other problems yourself: data consistency, fault tolerance, and data loss (in Hadoop roughly 1%).
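As a rough illustration of what you would have to do by hand, here is a minimal, Hadoop-free Java sketch (the file name, split size, and reducer count are made-up example values): it chunks a local file into splits for the map phase and hash-partitions keys for the reduce phase, which is the same idea Hadoop's HashPartitioner uses.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class LocalSplitDemo {
    static final int NUM_REDUCERS = 4; // example reducer count

    // Decide which "reducer" a key goes to, mirroring hash partitioning.
    static int partition(String key) {
        return (key.hashCode() & Integer.MAX_VALUE) % NUM_REDUCERS;
    }

    public static void main(String[] args) throws Exception {
        // 1. "Input splits": chop the local file into groups of lines.
        List<String> lines = Files.readAllLines(Path.of("input.txt")); // hypothetical file
        int splitSize = 1000; // lines per split -- arbitrary example value
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += splitSize) {
            splits.add(lines.subList(i, Math.min(i + splitSize, lines.size())));
        }

        // 2. Each split would be handled by one map task; each mapper output
        //    record is then routed to a reducer based on its key's partition.
        for (List<String> split : splits) {
            for (String line : split) {
                String key = line.split("\\s+", 2)[0]; // first token as the key
                System.out.println(key + " -> reducer " + partition(key));
            }
        }
    }
}
```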
MapReduce is designed for distributed computing, so using MapReduce in a non-distributed environment is not practical.
thanks