Count lines in large files

I usually work with text files of ~20 GB in size, and I very often need to count the number of lines in a given file.

The way I do it now is just cat fname | wc -l, and it takes a lot of time. Is there any solution that would be much faster?

I work on a high-performance cluster with Hadoop installed. I was wondering if a map-reduce approach would help.

I would like the solution to be as simple as a one-line command, like the wc -l solution, but I'm not sure how feasible that is.

Any ideas?

+64
linux mapreduce
Oct 03
12 answers

Try: sed -n '$=' filename

Also, cat is not needed: wc -l filename is enough for what you're doing now.

+94
03 Oct '12 at 20:45

Your speed-limiting factor is the I/O speed of your storage device, so switching between simple newline/pattern-counting programs won't help, because the difference in execution speed between those programs is likely to be dwarfed by however slow your disk/storage/whatever is.

But if the same file is copied across disks/devices, or the file is distributed among several drives, you can perform the operation in parallel. I don't know about Hadoop specifically, but assuming you can read a 10 GB file from 4 different locations, you can run 4 different line-counting processes, each on one part of the file, and sum their results:

 $ dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file | wc -l &
 $ dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file | wc -l &
 $ dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file | wc -l &
 $ dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file | wc -l &

Notice the & at the end of each command line, which makes everything run in parallel; dd works like cat here, but it lets you specify how many bytes to read (count * bs bytes) and how many to skip at the beginning of the input (skip * bs bytes). It works in blocks, so bs has to be given as the block size. In this example I partitioned the 10 GB file into 4 equal chunks of 4 KB * 655360 = 2684354560 bytes = 2.5 GB, one for each job; you could write a script to do this for you based on the file size and the number of parallel jobs you run. You also need to sum the results, which I haven't done here for lack of shell-scripting ability.
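
A minimal sketch of how the summing step might look, reusing the placeholder paths and chunk sizes from the commands above (an illustration of the idea, not a tested script):

 #!/bin/bash
 # Count the lines of a 10 GB file in 4 parallel chunks and sum the partial counts.
 # The /path/to/copy/... paths and the chunk boundaries are placeholders from above.
 {
     dd bs=4k              count=655360 if=/path/to/copy/on/disk/1/file 2>/dev/null | wc -l &
     dd bs=4k skip=655360  count=655360 if=/path/to/copy/on/disk/2/file 2>/dev/null | wc -l &
     dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file 2>/dev/null | wc -l &
     dd bs=4k skip=1966080              if=/path/to/copy/on/disk/4/file 2>/dev/null | wc -l &
     wait                               # wait for all four background counts to finish
 } | awk '{ total += $1 } END { print total }'   # "reduce": add up the partial counts

Splitting at arbitrary byte offsets does not distort the total, because wc -l simply counts newline characters and every newline ends up in exactly one chunk.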

If your file system is smart enough to split a large file across many devices, like a RAID or a distributed file system, and to automatically parallelize I/O requests that can be parallelized, you can do such a split with many parallel jobs that all use the same file path, and still get some speed increase.

EDIT: Another idea that occurred to me: if all the lines inside the file have the same size, you can get the exact number of lines by dividing the file size by the line size, both in bytes. You can do that almost instantly in a single job. If you only have the mean line size and don't care about the exact count but want an estimate, you can do the same calculation and get a satisfactory result much faster than the exact operation.
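
A minimal sketch of that estimate, using the first line as the sample (my_file.txt is a placeholder; stat -c %s is the GNU coreutils form):

 #!/bin/bash
 # Rough line-count estimate: total file size divided by the size of one sample line.
 file=my_file.txt                        # placeholder filename
 filesize=$(stat -c %s "$file")          # file size in bytes (GNU stat)
 linesize=$(head -n 1 "$file" | wc -c)   # bytes in the first line, newline included
 echo $(( filesize / linesize ))         # estimated number of lines

The result is exact if every line has the same length, and only an estimate otherwise.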

+12
Oct 03 '12 at 22:25

On a multi-core server, use GNU parallel to count the lines of the files in parallel. After each per-file line count is printed, bc sums all the line counts.

 find . -name '*.txt' | parallel 'wc -l {}' 2>/dev/null | paste -sd+ - | bc 

To save space, you can even keep all your files compressed. The following line decompresses each file and counts its lines in parallel, then sums all the values.

 find . -name '*.xz' | parallel 'xzcat {} | wc -l' 2>/dev/null | paste -sd+ - | bc 
+8
Feb 25 '16 at 17:56

If your data is stored on HDFS, perhaps the fastest approach is to use Hadoop streaming. Apache Pig's COUNT UDF operates on a bag and therefore uses a single reducer to compute the number of rows. Instead, you can manually set the number of reducers in a simple Hadoop streaming script as follows:

 $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.reduce.tasks=100 -input <input_path> -output <output_path> -mapper /bin/cat -reducer "wc -l" 

Note that I manually set the number of reducers to 100, but you can tune this parameter. Once the map-reduce job is done, the result from each reducer is stored in a separate file. The final number of rows is the sum of the numbers returned by all the reducers; you can get the final row count as follows:

 $HADOOP_HOME/bin/hadoop fs -cat <output_path>/* | paste -sd+ | bc 
+6
Aug 27 '14 at

Based on my tests, I can verify that spark-shell (based on Scala) is faster than the other tools (grep, sed, awk, perl, wc). Here is the result of a test I ran on a file that had 23782409 lines:

 time grep -c $ my_file.txt; 

real 0m44.96s user 0m41.59s sys 0m3.09s

 time wc -l my_file.txt; 

real 0m37.57s user 0m33.48s sys 0m3.97s

 time sed -n '$=' my_file.txt; 

real 0m38.22s user 0m28.05s sys 0m10.14s

 time perl -ne 'END { $_=$.;if(!/^[0-9]+$/){$_=0;};print "$_" }' my_file.txt; 

real 0m23.38s user 0m20.19s sys 0m3.11s

 time awk 'END { print NR }' my_file.txt; 

real 0m19.90s user 0m16.76s sys 0m3.12s

 spark-shell
 import org.joda.time._
 val t_start = DateTime.now()
 sc.textFile("file://my_file.txt").count()
 val t_end = DateTime.now()
 new Period(t_start, t_end).toStandardSeconds()

res1: org.joda.time.Seconds = PT15S

+6
03 Oct '16 at 13:47

Hadoop essentially provides a mechanism for doing something similar to what @Ivella suggests.

Hadoop's HDFS (distributed file system) will take your 20 GB file and store it across the cluster in fixed-size blocks. Say you configure a block size of 128 MB; the file would be split into 20x8 = 160 blocks of 128 MB each.

You would then run a map-reduce program over this data, essentially counting the lines of each block (in the map stage) and then reducing the per-block line counts into the final line count for the whole file.

As for performance, in general the bigger your cluster, the better the performance (more wc's running in parallel, over more independent disks), but there is some overhead in job orchestration, which means that running the job on smaller files will not actually deliver faster throughput than running a local wc.
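
A sketch of what the per-block counting could look like with hand-written Hadoop streaming scripts; the script names and HDFS paths are made-up placeholders, and the submission line just follows the same pattern as the streaming command shown earlier:

 #!/bin/bash
 # count_map.sh (hypothetical mapper): each map task reads its input split on stdin
 # and emits a single record with that split's line count.
 echo -e "lines\t$(wc -l)"

 #!/bin/bash
 # sum_reduce.sh (hypothetical reducer): adds up the per-split counts from all mappers.
 awk -F'\t' '{ total += $2 } END { print total }'

 # Submit the job (paths are placeholders); with the default single reducer,
 # the one output file under the output path holds the final line count.
 $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
     -files count_map.sh,sum_reduce.sh \
     -input /hdfs/path/to/bigfile.txt -output /hdfs/path/to/line_count \
     -mapper count_map.sh -reducer sum_reduce.sh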

+3
04 Oct '12 at 1:20

I know the question has been around for several years, but expanding on Ivella's last idea, this bash script estimates the number of lines of a big file within seconds or less by measuring the size of one line and extrapolating from it:

 #!/bin/bash
 head -2 $1 | tail -1 > $1_oneline
 filesize=$(du -b $1 | cut -f -1)
 linesize=$(du -b $1_oneline | cut -f -1)
 rm $1_oneline
 echo $(expr $filesize / $linesize)

If you name this script lines.sh, you can call lines.sh bigfile.txt to get the estimated number of lines. In my case (about 6 GB, an export from a database), the deviation from the true line count was only 3%, but it ran about 1000 times faster. By the way, I used the second rather than the first line as the basis, because the first line had column names and the actual data started on the second line.

+3
May 6 '17 at 14:01

I'm not sure python is faster:

 [root@myserver scripts]# time python -c "print len(open('mybigfile.txt').read().split('\n'))"
 644306
 real    0m0.310s
 user    0m0.176s
 sys     0m0.132s

 [root@myserver scripts]# time cat mybigfile.txt | wc -l
 644305
 real    0m0.048s
 user    0m0.017s
 sys     0m0.074s
+2
May 5 '15 at 8:06

If your bottleneck is the disk, how you read from it matters. dd if=filename bs=128M | wc -l is a lot faster than wc -l filename or cat filename | wc -l for my machine, which has an HDD plus a fast CPU and RAM. You can play around with the block size and watch the throughput that dd reports. I cranked it up to 1 GiB.

Note: there is some debate about whether cat or dd is faster. All I claim is that dd can be faster, depending on the system, and that it is for me. Try it yourself.
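
If you want to experiment with the block size, a loop like this could work (filename is a placeholder; bash's time keyword times the whole pipeline):

 #!/bin/bash
 # Compare a few dd block sizes for the dd | wc -l pipeline.
 file=filename                          # placeholder path to the large file
 for bs in 4k 1M 128M 1G; do
     echo "== bs=$bs =="
     time dd if="$file" bs="$bs" 2>/dev/null | wc -l
 done

Bear in mind the page cache: after the first pass the file may be served from RAM, so repeated runs mostly compare memory bandwidth rather than disk speed.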

+2
May 6 '17 at 1:41 AM

If your computer has python, you can try this from the shell:

 python -c "print len(open('test.txt').read().split('\n'))" 

This uses python -c to pass a command that basically reads the file and splits it on the newline character, to get the count of newlines, i.e. the overall length of the file in lines.

@BlueMoon :

 bash-3.2$ sed -n '$=' test.txt
 519

Using the above:

 bash-3.2$ python -c "print len(open('test.txt').read().split('\n'))"
 519
+1
Jul 11 '14 at 2:16
 find -type f -name "filepattern_2015_07_*.txt" -exec ls -1 {} \; | cat | awk '//{ print $0 , system("cat " $0 "|" "wc -l")}' 

Output:

+1
Jul 30 '15 at 13:19

Assuming that:

  • Your file system is distributed
  • Your file system can easily fill a network connection on one node
  • You access your files like regular files

then you really want to slice the file into parts, count the parts in parallel on multiple nodes, and sum up the results from there (this is basically @Chris White's idea).

This is how you would do it with GNU Parallel (version > 20161222). You need to list the nodes in ~/.parallel/my_cluster_hosts, and you must have ssh access to all of them:

 parwc() {
     # Usage:
     #   parwc -l file
     # Give one chunk per host
     chunks=$(cat ~/.parallel/my_cluster_hosts | wc -l)
     # Build commands that take a chunk each and do 'wc' on that
     # ("map")
     parallel -j $chunks --block -1 --pipepart -a "$2" -vv --dryrun wc "$1" |
         # For each command:
         #   log into a cluster host
         #   cd to the current working dir
         #   execute the command
         parallel -j0 --slf my_cluster_hosts --wd . |
         # Sum up the number of lines
         # ("reduce")
         perl -ne '$sum += $_; END { print $sum,"\n" }'
 }

Use as:

 parwc -l myfile
 parwc -w myfile
 parwc -c myfile
0
May 19 '18 at 8:19


