
Running "wc -l <filename>" in Python code

I want to do a 10x crosscheck for huge files (hundreds of thousands of lines). I want to run "wc -l" every time I start reading a file, then generate random line numbers a fixed number of times, each time writing that line to a separate file. I use this:

    import os
    for i in files:
        os.system("wc -l <insert filename>")

How do I insert the file name here? It's a variable. I looked through the documentation, but the examples mostly use ls, which doesn't have this problem.

+6
source
7 answers
    import subprocess
    for f in files:
        subprocess.call(['wc', '-l', f])

Also see http://docs.python.org/library/subprocess.html#convenience-functions - for example, if you want the output as a string, use subprocess.check_output() instead of subprocess.call().
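As a concrete sketch (Python 3 assumed; `wc_lines` is a made-up helper name), parsing the line count out of `wc`'s output looks like this:

```python
import subprocess

def wc_lines(filename):
    # "wc -l FILE" prints "<count> <filename>"; the count is the first field.
    out = subprocess.check_output(["wc", "-l", filename])
    return int(out.split()[0])
```

On Python 3, check_output() returns bytes, but int() accepts a bytes digit string directly, so no explicit decode is needed.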

+6
source

Let's compare:

    from subprocess import check_output

    def wc(filename):
        return int(check_output(["wc", "-l", filename]).split()[0])

    def native(filename):
        c = 0
        with open(filename) as file:
            while True:
                chunk = file.read(10 ** 7)
                if chunk == "":
                    return c
                c += chunk.count("\n")

    def iterate(filename):
        with open(filename) as file:
            for i, line in enumerate(file):
                pass
            return i + 1

Now run them through timeit!

    from timeit import timeit
    from sys import argv

    filename = argv[1]

    def testwc():
        wc(filename)

    def testnative():
        native(filename)

    def testiterate():
        iterate(filename)

    print "wc", timeit(testwc, number=10)
    print "native", timeit(testnative, number=10)
    print "iterate", timeit(testiterate, number=10)

Result:

    wc      1.25185894966
    native  2.47028398514
    iterate 2.40715694427

So wc is about twice as fast on the ~150 MB compressed files with ~500,000 lines that I tested. However, testing on a file generated with seq 3000000 > bigfile, I get these numbers:

    wc      0.425990104675
    native  0.400163888931
    iterate 3.10369205475

Hey, look at that, Python FTW! However, with longer lines (~70 characters):

    wc      1.60881590843
    native  3.24313092232
    iterate 4.92839002609

So, the conclusion: it depends, but wc seems to be the best all-round option.
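The benchmark code above is Python 2 (print statements, str chunks). A minimal Python 3 port of the chunked "native" counter, reading in binary so newline translation cannot skew the count (a sketch; `native3` is a hypothetical name):

```python
def native3(filename, chunk_size=10 ** 7):
    # Read fixed-size binary chunks and count b"\n" in each;
    # an empty read signals EOF.
    count = 0
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return count
            count += chunk.count(b"\n")
```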

+8
source

No need to use wc -l. Use the following Python function:

    def file_len(fname):
        with open(fname) as f:
            i = 0  # so an empty file returns 0 instead of raising NameError
            for i, l in enumerate(f, 1):
                pass
        return i

This is probably about as efficient as calling an external utility (the input loop is buffered the same way).

Update

I was wrong: wc -l is much faster!

    $ seq 10000000 > huge_file
    $ time wc -l huge_file
    10000000 huge_file

    real    0m0.267s
    user    0m0.110s
    sys     0m0.010s

    $ time ./p.py
    10000000

    real    0m1.583s
    user    0m1.040s
    sys     0m0.060s
+5
source

os.system takes a string. Just build the command string explicitly:

    import os
    for i in files:
        os.system("wc -l " + i)
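One caveat with building the command by concatenation: a file name containing spaces or shell metacharacters breaks (or worse, injects into) the command. A safer sketch (Python 3; `wc_system` is a hypothetical wrapper) quotes the name first:

```python
import os
import shlex

def wc_system(filename):
    # shlex.quote() escapes spaces, quotes and other shell metacharacters,
    # so the shell sees the name as a single argument.
    # Returns the exit status of the command.
    return os.system("wc -l " + shlex.quote(filename))
```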
+3
source

Here is the Python approach I found to solve this problem:

 count_of_lines_in_any_textFile = sum(1 for l in open('any_textFile.txt')) 
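One caveat: the bare open() in that one-liner is never explicitly closed; CPython's reference counting usually cleans it up immediately, but a with block makes the close deterministic (a sketch; `count_lines` is a hypothetical name):

```python
def count_lines(path):
    # Same one-liner idea, but the file is closed as soon as sum() finishes.
    with open(path) as f:
        return sum(1 for _ in f)
```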
+3
source

My solution is very similar to lazyr's "native" function:

    import functools

    def file_len2(fname):
        with open(fname, 'rb') as f:
            lines = 0
            last_wasnt_nl = False  # stays False for an empty file
            reader = functools.partial(f.read, 131072)
            for datum in iter(reader, ''):
                lines += datum.count('\n')
                last_wasnt_nl = datum[-1] != '\n'
            return lines + last_wasnt_nl

This, unlike wc, treats a final line not ending with '\n' as a separate line. If you need exactly the same behavior as wc, it can be written (quite unreadably :) as:

    import functools as ft, itertools as it, operator as op

    def file_len3(fname):
        with open(fname, 'rb') as f:
            reader = ft.partial(f.read, 131072)
            counter = op.methodcaller('count', '\n')
            return sum(it.imap(counter, iter(reader, '')))

with times comparable to wc on all the test files I created.

Note: this applies to Windows and POSIX machines; old Mac OS used '\r' as the end-of-line character.
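On Python 3 this caveat mostly disappears: text mode uses universal newlines by default, translating '\r', '\r\n' and '\n' all to '\n', so a single count handles old Mac files too (a sketch; `count_any_newlines` is a hypothetical name):

```python
def count_any_newlines(path):
    # newline=None (the default) enables universal-newline translation,
    # so '\r', '\r\n' and '\n' each arrive as a single '\n'.
    with open(path, newline=None) as f:
        return sum(chunk.count("\n") for chunk in iter(lambda: f.read(131072), ""))
```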

0
source

I found a much simpler way:

    import os
    linux_shell = 'more /etc/hosts | wc -l'
    linux_shell_result = os.popen(linux_shell).read()
    print(linux_shell_result)
0
source

Source: https://habr.com/ru/post/891636/

