Count the number of lines in a file without reading the entire file in memory?

I process huge data files (millions of lines each).

Before starting the processing, I would like to get the number of lines in the file, so I can indicate how far the processing has gone.

Due to the size of the files, it would be impractical to read the entire file into memory just to count the number of lines. Does anyone have a good suggestion on how to do this?

+41
ruby
Apr 16 '10 at 4:02
13 answers

If you are in a Unix environment, you can simply let wc -l do the work.

It will not load the entire file into memory; since wc is optimized for streaming through a file and counting words/lines, its performance is considerably better than streaming the file in Ruby.

SSCCE:

    filename = 'a_file/somewhere.txt'
    line_count = `wc -l "#{filename}"`.strip.split(' ')[0].to_i
    p line_count

Or, if you need a collection of files passed on the command line:

    wc_output = `wc -l "#{ARGV.join('" "')}"`
    line_count = wc_output.match(/^ *([0-9]+) +total$/).captures[0].to_i
    p line_count
+57
Apr 16 '10 at 4:53

Reading the file one line at a time:

    count = File.foreach(filename).inject(0) { |c, line| c + 1 }

or the Perl-ish:

    File.foreach(filename) {}
    count = $.

or

    count = 0
    File.open(filename) { |f| count = f.read.count("\n") }

All of these will be slower than:

 count = %x{wc -l #{filename}}.split.first.to_i 
+66
Apr 16 '10

No matter which language you use, you will have to read the entire file if the lines are of variable length. That is because the newlines can be anywhere, and there is no way to know where they are without reading the file (assuming it is not cached, which, generally speaking, it is not).

If you want to indicate progress, you have two realistic options. You can extrapolate progress from an assumed line length:

    assumed lines in file = size of file / assumed line size
    progress = lines processed / assumed lines in file * 100%

since you know the file size. Alternatively, you can measure progress as:

 progress = bytes processed / size of file * 100% 

That should be enough.
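A minimal Ruby sketch of the byte-based option, assuming a hypothetical file name huge_data_file.csv; it only needs the file size and a running byte counter, never the total line count:

    filename = "huge_data_file.csv"   # hypothetical path
    total_bytes = File.size(filename)
    bytes_read = 0

    File.foreach(filename) do |line|
      bytes_read += line.bytesize
      # ... process the line here ...
      printf("\rprogress: %.1f%%", 100.0 * bytes_read / total_bytes)
    end
    puts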

+13
Apr 16 '10 at 4:06

Using Ruby:

    file = File.open("path-to-file", "r")
    file.readlines.size

This was 39 milliseconds faster than wc -l on a file with 325,477 lines.

+8
Mar 26 '14 at 10:39

Summary of Published Solutions

    require 'benchmark'
    require 'csv'

    filename = "name.csv"

    Benchmark.bm do |x|
      x.report { `wc -l < #{filename}`.to_i }
      x.report { File.open(filename).inject(0) { |c, line| c + 1 } }
      x.report { File.foreach(filename).inject(0) { |c, line| c + 1 } }
      x.report { File.read(filename).scan(/\n/).count }
      x.report { CSV.open(filename, "r").readlines.count }
    end

File with 807,802 lines:

           user     system      total        real
       0.000000   0.000000   0.010000 (  0.030606)
       0.370000   0.050000   0.420000 (  0.412472)
       0.360000   0.010000   0.370000 (  0.374642)
       0.290000   0.020000   0.310000 (  0.315488)
       3.190000   0.060000   3.250000 (  3.245171)
+3
Jul 14 '17 at 12:00

For reasons that I don't quite understand, scanning the file for newlines with File.read seems to be a lot faster than CSV#readlines.count.

The following sample used a CSV file with 1,045,574 rows of data and 4 columns:

           user     system      total        real
       0.639000   0.047000   0.686000 (  0.682000)
      17.067000   0.171000  17.238000 ( 17.221173)

Code for reference below:

    require 'benchmark'
    require 'csv'

    file = "1-25-2013 DATA.csv"

    Benchmark.bm do |x|
      x.report { File.read(file).scan(/\n/).count }
      x.report { CSV.open(file, "r").readlines.count }
    end

As you can see, scanning the file for newlines is an order of magnitude faster.

+2
Mar 07 '13 at 17:31

Same as DJ's answer, but with actual Ruby code:

 count = %x{wc -l file_path}.split[0].to_i 

First part

 wc -l file_path 

Gives you

 num_lines file_path 

split and to_i turn that output into a number.

+2
Aug 11 '13 at 7:42

I have this one-liner:

 puts File.foreach('myfile.txt').count 
+2
Jan 17 '15 at 11:47

wc -l in Ruby, using less memory, the lazy way:

    (ARGV.length == 0 ? [["", STDIN]] :
        ARGV.lazy.map { |file_name| [file_name, File.open(file_name)] })
      .map { |file_name, file|
        "%8d %s\n" % [*file
                        .each_line
                        .lazy
                        .map { |line| 1 }
                        .reduce(:+),
                      file_name]
      }
      .each(&:display)

as originally shown by Shugo Maeda.

Example:

    $ curl -s -o wc.rb -L https://git.io/vVrQi
    $ chmod u+x wc.rb
    $ ./wc.rb huge_data_file.csv
    43217291 huge_data_file.csv
+2
Apr 6 '16 at 20:11

If the file is a CSV file, the record lengths should be fairly uniform when the contents of the file are numeric. Wouldn't it make sense to simply divide the file size by the record length, or by the average length of the first 100 records?
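A minimal sketch of that estimate, assuming a hypothetical name.csv that has at least one record; it averages the byte length of the first 100 lines and divides the file size by that average:

    filename = "name.csv"   # hypothetical path
    sample = File.foreach(filename).first(100)
    avg_line_size = sample.inject(0) { |sum, line| sum + line.bytesize } / sample.size.to_f
    estimated_lines = (File.size(filename) / avg_line_size).round
    puts "estimated lines: #{estimated_lines}"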

+1
Mar 30 '13 at 13:14

Test results for a file with more than 135 thousand lines are shown below. This is my test code.

    require 'benchmark'

    file_name = '100m.csv'

    Benchmark.bm do |x|
      x.report { File.new(file_name).readlines.size }
      x.report { `wc -l "#{file_name}"`.strip.split(' ')[0].to_i }
      x.report { File.read(file_name).scan(/\n/).count }
    end

Result:

           user     system      total        real
       0.100000   0.040000   0.140000 (  0.143636)
       0.000000   0.000000   0.090000 (  0.093293)
       0.380000   0.060000   0.440000 (  0.464925)

The wc -l approach has one problem: if the file contains only one line and that line does not end with \n, then the count is zero.

So, I recommend calling wc only when you will be counting more than one line.
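A quick sketch of that edge case, using a hypothetical no_newline.txt whose single line is not terminated by \n: wc -l counts newline characters and reports zero, while File.foreach still yields one line:

    File.write("no_newline.txt", "only line, no trailing newline")

    puts `wc -l "no_newline.txt"`.split.first.to_i   # => 0
    puts File.foreach("no_newline.txt").count        # => 1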

+1
Jan 11 '17 at 8:34

With UNIX-style text files, it's very simple:

    f = File.new("/path/to/whatever")
    num_newlines = 0
    while (c = f.getc) != nil
      num_newlines += 1 if c == "\n"
    end

That's it. For MS Windows text files, you would need to check for the sequence "\r\n" instead of "\n", but that is not much more difficult. For Mac OS Classic text files (as opposed to Mac OS X), you would check for "\r" instead of "\n".
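A minimal sketch of the Windows variant just described, with a hypothetical path; it counts the two-character sequence "\r\n" by remembering the previously read character (for Mac OS Classic you would count bare "\r" instead):

    f = File.new("/path/to/windows_file.txt")   # hypothetical path
    num_newlines = 0
    prev = nil
    while (c = f.getc) != nil
      num_newlines += 1 if prev == "\r" && c == "\n"
      prev = c
    end
    puts num_newlines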

So yes, this looks like C. So what? C is awesome, and Ruby is awesome, because when the C answer is the simplest, that is roughly what you can expect your Ruby code to look like. Hopefully your brain hasn't already been warped by Java.

By the way, please don't even consider any of the answers above that use IO#read or IO#readlines and then call a String method on whatever was read. You said you didn't want to read the entire file into memory, and that is exactly what they do. This is why Donald Knuth recommends that people understand how to program closer to the hardware: if they don't, they end up writing weird code. Obviously you don't want to code close to the hardware whenever you don't need to, but that should be common sense. You do, however, have to learn to recognize the cases, like this one, where you need to get close to the nuts and bolts.

And don't try to get more "object oriented" than the situation calls for. That is an embarrassing trap for beginners who want to seem more sophisticated than they really are. You should always be glad when the answer is really simple, and not disappointed that there is no complexity to give you the opportunity to write "impressive" code. However, if you want to look somewhat "object oriented" and don't mind reading each entire line into memory (i.e. you know the lines are short enough), you can do this:

    f = File.new("/path/to/whatever")
    num_newlines = 0
    f.each_line do
      num_newlines += 1
    end

This would be a good compromise, but only if the lines are not too long, in which case it might even run faster than my first solution.

0
Aug 29 '13 at 21:08

Using foreach without inject is about 3% faster than with inject. Both are much faster (more than 100 times faster, in my experience) than using getc.

Using foreach without inject can also be slightly simplified (relative to the snippet given elsewhere in this thread) as follows:

    count = 0
    File.foreach(path) { count += 1 }
    puts "count: #{count}"
0
Jul 26 '14 at 20:48


