Your limiting factor is the I/O speed of your storage device, so switching between simple line/pattern-counting programs will not help: any difference in execution speed between such programs is likely to be dwarfed by how slow the disk/storage/whatever-you-have is.
But if the same file is copied across several drives/devices, or the file is spread over those drives, you can perform the operation in parallel. I don't know specifically about Hadoop, but assuming you can read a 10 GB file from 4 different locations, you can run 4 different line-counting processes, each over one part of the file, and then sum their results:
$ dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file | wc -l &
$ dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file | wc -l &
$ dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file | wc -l &
$ dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file | wc -l &
Notice the & at the end of each command line, so they all run in parallel; dd here works like cat, but it lets you specify how many bytes to read (count * bs bytes) and how many to skip at the start of the input (skip * bs bytes). It works in blocks, hence the need to give bs as the block size. In this example I've partitioned the 10 GB file into 4 equal chunks of 4 KB * 655360 = 2684354560 bytes = 2.5 GB, one for each job; you may want a script to compute that split for you based on the file size and the number of parallel jobs you run. You also need to sum the results of the executions, which I haven't done for lack of shell-scripting ability; a sketch of both is shown below.
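A minimal sketch of that automation, assuming bash and GNU coreutils (stat, dd, wc, awk); the file path, job count and block size are placeholders, not anything from the commands above. Since wc -l just counts newline characters, splitting at arbitrary block boundaries still gives the right total once the partial counts are summed.

#!/bin/bash
# Sketch: divide the line count over N parallel jobs and sum the partial results.
FILE=/path/to/file   # hypothetical path; point each job at a per-disk copy if you have them
JOBS=4
BS=4096

SIZE=$(stat -c %s "$FILE")                  # file size in bytes (GNU stat)
BLOCKS=$(( (SIZE + BS - 1) / BS ))          # total number of bs-sized blocks
PER_JOB=$(( (BLOCKS + JOBS - 1) / JOBS ))   # blocks each job should read

for i in $(seq 0 $(( JOBS - 1 ))); do
    dd bs=$BS skip=$(( i * PER_JOB )) count=$PER_JOB if="$FILE" 2>/dev/null | wc -l &
done | awk '{ total += $1 } END { print total }'

The awk at the end only sees end-of-input once every backgrounded dd | wc pipeline has finished, so it effectively waits for all the jobs and then prints the summed count.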
If your file system is smart enough to spread a big file over many devices, like a RAID or a distributed file system or something similar, and to automatically parallelize I/O requests that can be parallelized, you can do the same split with many parallel jobs all reading from the same file path, and you may still get some speed-up.
EDIT: Another idea that occurred to me: if the lines inside the file all have the same size, you can get the exact number of lines by dividing the file size by the size of a line, both in bytes. You can do that almost instantly in a single job. If you only know an average line size and don't care about the exact count, but want an estimate, you can do the same operation and get a satisfactory result much faster than the exact count.
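A sketch of that estimate, assuming GNU stat; the path and the 80-byte line length are placeholders for your own fixed (or average) line size, newline included.

FILE=/path/to/file   # hypothetical path
LINE_BYTES=80        # assumed fixed or average line length, including the newline
SIZE=$(stat -c %s "$FILE")
echo $(( SIZE / LINE_BYTES ))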
lvella Oct 03 '12 at 22:25