Using HDFS will not change the fact that you are having hadoop process a large number of small files. The best option in this case is probably to cat the files into a single file (or a few large files). This will reduce the number of mappers that run, which reduces the per-task overhead for the job.
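As a minimal sketch of that concatenation step (assuming the small files live under a hypothetical /input/small directory on HDFS and plain byte-level concatenation is acceptable for your data format), using the Hadoop FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/input/small");    // hypothetical directory of small files
        Path merged = new Path("/input/merged.txt"); // hypothetical single large output file

        try (FSDataOutputStream out = fs.create(merged, true)) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) continue;
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    // Append the contents of each small file to the merged file.
                    IOUtils.copyBytes(in, out, conf, false);
                }
            }
        }
    }
}
```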
Using HDFS can improve performance if you are working on a distributed system. If you are only running in pseudo-distributed mode (a single machine), HDFS will not improve performance; the limitation is the machine itself.
When you work with a large number of small files, a large number of mappers and reducers is required. The setup/teardown of each task can be comparable to the processing time of the file itself, causing a large overhead. cat'ing the files should reduce the number of mappers hadoop runs for the job, which should improve performance.
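A related option the original answer does not mention, if you would rather not physically concatenate the files, is Hadoop's CombineTextInputFormat, which packs many small files into fewer input splits so fewer mappers are launched. A rough, hedged sketch of the job setup (the 128 MB split cap and the identity pass-through mapper are just placeholders for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-files-job");
        job.setJarByClass(SmallFilesJob.class);

        // Pack many small files into each input split instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB (tune this to your block size).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // No mapper/reducer set: this runs the identity mapper as a map-only job,
        // so the key/value types below match the (offset, line) records it emits.
        // Your real mapper/reducer classes and output types would replace these.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```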
The advantage you could see from using HDFS to store the files would be in distributed mode, with multiple machines. The files would be stored in blocks (64 MB by default) across the machines, and each machine would be able to process the block of data that resides on it. This reduces network bandwidth use, so it doesn't become a bottleneck in the processing.
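If you want to see that block placement for yourself, here is a small sketch (the /input/merged.txt path is just a placeholder) that lists which hosts hold each block of a file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/input/merged.txt"); // placeholder path

        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per HDFS block (64 MB each by default on older versions).
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```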
Archiving the files, if hadoop is going to unarchive them anyway, will just result in hadoop still having a large number of small files.
Hope this helps your understanding.