Count the number of lines in a file without reading the entire file in memory?

I process huge data files (millions of lines each).

Before starting the processing, I would like to get the number of lines in the file, so I can indicate how far the processing has gone.

Due to the size of the files, it would be impractical to read the entire file into memory just to count the number of lines. Does anyone have a good suggestion on how to do this?

+41
ruby
Apr 16 '10 at 4:02
13 answers

If you are in a Unix environment, you can simply let wc -l do the work.

It will not load the entire file into memory; since wc is optimized for streaming through a file and counting words/lines, its performance is considerably better than streaming the file in Ruby.

SSCCE:

    filename = 'a_file/somewhere.txt'
    line_count = `wc -l "#{filename}"`.strip.split(' ')[0].to_i
    p line_count

Or, if you need a collection of files passed on the command line:

    wc_output = `wc -l "#{ARGV.join('" "')}"`
    line_count = wc_output.match(/^ *([0-9]+) +total$/).captures[0].to_i
    p line_count
+57
Apr 16 '10 at 4:53

Reading the file one line at a time:

    count = File.foreach(filename).inject(0) { |c, line| c + 1 }

or the Perl-ish:

    File.foreach(filename) {}
    count = $.

or

    count = 0
    File.open(filename) { |f| count = f.read.count("\n") }

All of these will be slower than:

 count = %x{wc -l #{filename}}.split.first.to_i 
+66
Apr 16 '10

No matter which language you use, you will have to read the entire file if the lines are of variable length. That is because the newlines can be anywhere, and there is no way to know where they are without reading the file (assuming it is not cached, which, generally speaking, it is not).

If you want to indicate progress, you have two realistic options. You can extrapolate progress from an assumed line length:

    assumed lines in file = size of file / assumed line size
    progress = lines processed / assumed lines in file * 100%

since you know the file size. Alternatively, you can measure progress as:

 progress = bytes processed / size of file * 100% 

That should be enough.
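A minimal Ruby sketch of the byte-based option, assuming a hypothetical file name huge_data_file.csv; it only needs the file size and a running byte counter, never the total line count:

    filename = "huge_data_file.csv"   # hypothetical path
    total_bytes = File.size(filename)
    bytes_read = 0

    File.foreach(filename) do |line|
      bytes_read += line.bytesize
      # ... process the line here ...
      printf("\rprogress: %.1f%%", 100.0 * bytes_read / total_bytes)
    end
    puts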

+13
Apr 16 '10 at 4:06

Using Ruby:

    file = File.open("path-to-file", "r")
    file.readlines.size

This was 39 milliseconds faster than wc -l on a file with 325,477 lines.

+8
Mar 26 '14 at 10:39

Summary of Published Solutions

    require 'benchmark'
    require 'csv'

    filename = "name.csv"

    Benchmark.bm do |x|
      x.report { `wc -l < #{filename}`.to_i }
      x.report { File.open(filename).inject(0) { |c, line| c + 1 } }
      x.report { File.foreach(filename).inject(0) { |c, line| c + 1 } }
      x.report { File.read(filename).scan(/\n/).count }
      x.report { CSV.open(filename, "r").readlines.count }
    end

File with 807,802 lines:

           user     system      total        real
       0.000000   0.000000   0.010000 (  0.030606)
       0.370000   0.050000   0.420000 (  0.412472)
       0.360000   0.010000   0.370000 (  0.374642)
       0.290000   0.020000   0.310000 (  0.315488)
       3.190000   0.060000   3.250000 (  3.245171)
+3
Jul 14 '17 at 12:00

For reasons that I don't quite understand, scanning the file for newlines with File.read seems to be a lot faster than CSV#readlines.count.

The following sample used a CSV file with 1,045,574 rows of data and 4 columns:

           user     system      total        real
       0.639000   0.047000   0.686000 (  0.682000)
      17.067000   0.171000  17.238000 ( 17.221173)

Code for reference below:

    require 'benchmark'
    require 'csv'

    file = "1-25-2013 DATA.csv"

    Benchmark.bm do |x|
      x.report { File.read(file).scan(/\n/).count }
      x.report { CSV.open(file, "r").readlines.count }
    end

As you can see, scanning the file for newlines is an order of magnitude faster.

+2
Mar 07 '13 at 17:31

Same as DJ's answer, but with actual Ruby code:

 count = %x{wc -l file_path}.split[0].to_i 

First part

 wc -l file_path 

Gives you

 num_lines file_path 

split and to_i turn that output into a number.

+2
Aug 11 '13 at 7:42

I have this one-liner:

 puts File.foreach('myfile.txt').count 
+2
Jan 17 '15 at 11:47

wc -l in Ruby, using less memory, the lazy way:

    (ARGV.length == 0 ? [["", STDIN]] :
        ARGV.lazy.map { |file_name| [file_name, File.open(file_name)] })
      .map { |file_name, file|
        "%8d %s\n" % [*file
                        .each_line
                        .lazy
                        .map { |line| 1 }
                        .reduce(:+),
                      file_name]
      }
      .each(&:display)

as originally shown by Shugo Maeda.

Example:

    $ curl -s -o wc.rb -L https://git.io/vVrQi
    $ chmod u+x wc.rb
    $ ./wc.rb huge_data_file.csv
    43217291 huge_data_file.csv
+2
Apr 6 '16 at 20:11

If the file is a CSV file, the record lengths should be fairly uniform when the contents of the file are numeric. Wouldn't it make sense to simply divide the file size by the record length, or by the average length of the first 100 records?
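A minimal sketch of that estimate, assuming a hypothetical name.csv that has at least one record; it averages the byte length of the first 100 lines and divides the file size by that average:

    filename = "name.csv"   # hypothetical path
    sample = File.foreach(filename).first(100)
    avg_line_size = sample.inject(0) { |sum, line| sum + line.bytesize } / sample.size.to_f
    estimated_lines = (File.size(filename) / avg_line_size).round
    puts "estimated lines: #{estimated_lines}"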

+1
Mar 30 '13 at 13:14

Test results for a file with more than 135 thousand lines are shown below. This is my test code.

    require 'benchmark'

    file_name = '100m.csv'

    Benchmark.bm do |x|
      x.report { File.new(file_name).readlines.size }
      x.report { `wc -l "#{file_name}"`.strip.split(' ')[0].to_i }
      x.report { File.read(file_name).scan(/\n/).count }
    end

Result:

           user     system      total        real
       0.100000   0.040000   0.140000 (  0.143636)
       0.000000   0.000000   0.090000 (  0.093293)
       0.380000   0.060000   0.440000 (  0.464925)

The wc -l approach has one problem: if the file contains only one line and that line does not end with \n, then the count is zero.

So, I recommend calling wc only when you will be counting more than one line.
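A quick sketch of that edge case, using a hypothetical no_newline.txt whose single line is not terminated by \n: wc -l counts newline characters and reports zero, while File.foreach still yields one line:

    File.write("no_newline.txt", "only line, no trailing newline")

    puts `wc -l "no_newline.txt"`.split.first.to_i   # => 0
    puts File.foreach("no_newline.txt").count        # => 1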

+1
Jan 11 '17 at 8:34

With UNIX-style text files, it's very simple:

    f = File.new("/path/to/whatever")
    num_newlines = 0
    while (c = f.getc) != nil
      num_newlines += 1 if c == "\n"
    end

That's it. For MS Windows text files, you would need to check for the sequence "\r\n" instead of "\n", but that is not much more difficult. For Mac OS Classic text files (as opposed to Mac OS X), you would check for "\r" instead of "\n".
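A minimal sketch of the Windows variant just described, with a hypothetical path; it counts the two-character sequence "\r\n" by remembering the previously read character (for Mac OS Classic you would count bare "\r" instead):

    f = File.new("/path/to/windows_file.txt")   # hypothetical path
    num_newlines = 0
    prev = nil
    while (c = f.getc) != nil
      num_newlines += 1 if prev == "\r" && c == "\n"
      prev = c
    end
    puts num_newlines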

So yes, this looks like C. So what? C is awesome, and Ruby is awesome, because when the C answer is the simplest, that is roughly what you can expect your Ruby code to look like. Hopefully your brain hasn't already been warped by Java.

By the way, please don't even consider any of the answers above that use IO#read or IO#readlines and then call a String method on whatever was read. You said you didn't want to read the entire file into memory, and that is exactly what they do. This is why Donald Knuth recommends that people understand how to program closer to the hardware: if they don't, they end up writing weird code. Obviously you don't want to code close to the hardware whenever you don't need to, but that should be common sense. You do, however, have to learn to recognize the cases, like this one, where you need to get close to the nuts and bolts.

And don't try to get more "object oriented" than the situation calls for. That is an embarrassing trap for beginners who want to seem more sophisticated than they really are. You should always be glad when the answer is really simple, and not disappointed that there is no complexity to give you the opportunity to write "impressive" code. However, if you want to look somewhat "object oriented" and don't mind reading each entire line into memory (i.e. you know the lines are short enough), you can do this:

    f = File.new("/path/to/whatever")
    num_newlines = 0
    f.each_line do
      num_newlines += 1
    end

This would be a good compromise, but only if the lines are not too long, in which case it might even run faster than my first solution.

0
Aug 29 '13 at 21:08

Using foreach without inject is about 3% faster than with inject. Both are much faster (more than 100 times faster, in my experience) than using getc.

Using foreach without inject can also be slightly simplified (relative to the snippet given elsewhere in this thread) as follows:

    count = 0
    File.foreach(path) { count += 1 }
    puts "count: #{count}"
0
Jul 26 '14 at 20:48


