Running "wc -l <filename>" in Python code
I want to do 10-fold cross-validation on huge files (each several hundred thousand lines). I want to run "wc -l" every time I start reading a file, then generate random numbers that many times, each time writing that line number to a separate file. I use this:
    import os
    for i in files:
        os.system("wc -l <insert filename>")

How do I insert the file name here? It's a variable. I looked through the documentation, but the examples all use ls, which doesn't have this problem.
    import subprocess
    for f in files:
        subprocess.call(['wc', '-l', f])

Also see http://docs.python.org/library/subprocess.html#convenience-functions. For example, if you want to capture the output as a string, use subprocess.check_output() instead of subprocess.call().
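For instance, here is a minimal sketch of turning the count into a Python int and then sampling random line numbers, as the question describes (assumes Python 2.7+, where check_output was added; the file name and sample size are made up for the example):

    from subprocess import check_output
    import random

    def line_count(filename):
        # wc -l prints "<count> <filename>"; the first field is the count
        return int(check_output(['wc', '-l', filename]).split()[0])

    # e.g. pick 10 random line numbers for a cross-validation split
    n = line_count('huge_file')
    chosen = random.sample(xrange(1, n + 1), 10)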
Let's compare:
    from subprocess import check_output

    def wc(filename):
        return int(check_output(["wc", "-l", filename]).split()[0])

    def native(filename):
        c = 0
        with open(filename) as file:
            while True:
                chunk = file.read(10 ** 7)
                if chunk == "":
                    return c
                c += chunk.count("\n")

    def iterate(filename):
        with open(filename) as file:
            for i, line in enumerate(file):
                pass
            return i + 1

Go, go, timeit!
    from timeit import timeit
    from sys import argv

    filename = argv[1]

    def testwc():
        wc(filename)

    def testnative():
        native(filename)

    def testiterate():
        iterate(filename)

    print "wc", timeit(testwc, number=10)
    print "native", timeit(testnative, number=10)
    print "iterate", timeit(testiterate, number=10)

Result:
    wc 1.25185894966
    native 2.47028398514
    iterate 2.40715694427

So wc is about twice as fast on a 150 MB file of compressed data with ~500,000 lines, which is what I tested on. However, testing on a file generated with seq 3000000 > bigfile, I get these numbers:
    wc 0.425990104675
    native 0.400163888931
    iterate 3.10369205475

Hey look, Python FTW! However, with longer lines (~70 characters):
    wc 1.60881590843
    native 3.24313092232
    iterate 4.92839002609

So the conclusion: it depends, but wc seems to be the best bet overall.
No need to use wc -l. Use the following Python function:
    def file_len(fname):
        i = 0  # so that an empty file returns 0
        with open(fname) as f:
            for i, l in enumerate(f, 1):
                pass
        return i

This is probably more efficient than calling out to an external utility (the input loop is buffered either way).
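For example, run against the 10,000,000-line huge_file generated with seq in the update below:

    >>> file_len('huge_file')
    10000000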
Update
I was wrong, wc -l is much faster!
    $ seq 10000000 > huge_file
    $ time wc -l huge_file
    10000000 huge_file

    real    0m0.267s
    user    0m0.110s
    sys     0m0.010s

    $ time ./p.py
    10000000

    real    0m1.583s
    user    0m1.040s
    sys     0m0.060s

My solution is very similar to lazyr's "native" function:
    import functools

    def file_len2(fname):
        with open(fname, 'rb') as f:
            lines = 0
            datum = '\n'  # so that an empty file returns 0
            reader = functools.partial(f.read, 131072)
            for datum in iter(reader, ''):
                lines += datum.count('\n')
            # a final line without a trailing '\n' still counts as a line
            last_wasnt_nl = datum[-1] != '\n'
            return lines + last_wasnt_nl

This, unlike wc, counts a final line that does not end with '\n' as a separate line. If you need exactly the same behaviour as wc, it can be written (completely unreadably :) as:
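A quick check of that difference (the file name is made up for the example; wc -l only counts '\n' characters, so it reports 1 here):

    >>> open('no_trailing_nl', 'wb').write('a\nb')
    >>> file_len2('no_trailing_nl')  # 'b' is an unterminated final line
    2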
    import functools as ft, itertools as it, operator as op

    def file_len3(fname):
        with open(fname, 'rb') as f:
            reader = ft.partial(f.read, 131072)
            counter = op.methodcaller('count', '\n')
            return sum(it.imap(counter, iter(reader, '')))

with times comparable to wc on all the test files I created.
Note: this applies to Windows and POSIX machines; classic Mac OS used '\r' as the end-of-line character.
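If you do need to count '\r'-terminated lines, one option (a sketch, not part of the answers above; the function name is made up) is Python 2's universal-newline mode, which translates '\r' and '\r\n' to '\n' on read:

    def file_len_universal(fname):
        # 'rU' opens with universal newlines: '\r' and '\r\n' become '\n',
        # at the cost of some translation overhead
        lines = 0
        with open(fname, 'rU') as f:
            for lines, _ in enumerate(f, 1):
                pass
        return lines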