Recommended way to count lines, characters, and words in a file in Python

I found two ways to count the lines of a file, as seen below. (Note: I need to read the file as a whole, not line by line.)

I am trying to figure out which approach is better in terms of efficiency and/or good coding practice.

    import glob

    names = {}
    for each_file in glob.glob('*.cpp'):
        with open(each_file) as f:
            names[each_file] = sum(1 for line in f if line.strip())

(as seen here)

    data = open('test.cpp', 'r').read()
    print(len(data.splitlines()), len(data.split()), len(data))

(as seen here)

On the same topic, regarding counting the number of characters and the number of words in a file: is there a better way than the ones suggested above?

2 answers

Use a generator expression for memory efficiency (this approach avoids reading the entire file into memory). Here is a demo.

    def count(filename, what):
        strategy = {'lines': lambda x: bool(x.strip()),
                    'words': lambda x: len(x.split()),
                    'chars': len}
        strat = strategy[what]
        with open(filename) as f:
            return sum(strat(line) for line in f)

input.txt:

    this is
    a test file
    i just typed

output:

    >>> count('input.txt', 'lines')
    3
    >>> count('input.txt', 'words')
    8
    >>> count('input.txt', 'chars')
    33

Please note that character counts include newline characters. Also note that this uses a rather crude definition of a “word” (you did not provide one): it simply splits each line on whitespace and counts the elements of the resulting list.
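If all three counts are needed at once, calling `count` three times reads the file three times. As a minimal sketch (the name `count_all` is hypothetical and not part of the answer above), the same per-line strategies can be accumulated in a single pass, still without loading the whole file into memory:

```python
def count_all(filename):
    """Return (lines, words, chars) for a text file in one pass.

    Hypothetical helper using the same definitions as count() above:
    "lines" counts non-blank lines only, "words" are whitespace-separated
    tokens, and "chars" includes newline characters.
    """
    lines = words = chars = 0
    with open(filename) as f:
        for line in f:
            if line.strip():            # skip blank lines in the line count
                lines += 1
            words += len(line.split())  # split on any run of whitespace
            chars += len(line)          # the trailing newline is counted
    return lines, words, chars
```

For the `input.txt` shown above, this should return the same three numbers as the separate calls, as a tuple `(3, 8, 33)`. To leave newlines out of the character count, `len(line.rstrip('\n'))` could be used instead of `len(line)`.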


Create some test files and run both approaches over them in a large loop to see the average times. Make sure the test files match your use case.

I used this code:

    # Python 2 code; time.clock() was removed in Python 3.8
    # (time.perf_counter() is the replacement there).
    import glob
    import time

    times1 = []
    for i in range(0, 1000):
        names = {}
        t0 = time.clock()
        with open("lines.txt") as f:
            names["lines.txt"] = sum(1 for line in f if line.strip())
        print names
        times1.append(time.clock() - t0)

    times2 = []
    for i in range(0, 1000):
        names = {}
        t0 = time.clock()
        data = open("lines.txt", 'r').read()
        print("lines.txt", len(data.splitlines()), len(data.split()), len(data))
        times2.append(time.clock() - t0)

    print sum(times1) / len(times1)
    print sum(times2) / len(times2)

and got average timings of 0.0104755582104 and 0.0180650466201 seconds for the first and second approach, respectively.

The test file was a text file with about 23,000 lines. For instance:

    print("lines.txt", len(data.splitlines()), len(data.split()), len(data))

outputs: ('lines.txt', 23056, 161392, 1095160)

Run this test on your actual file set to get more accurate timing data.
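The timing code above is Python 2 and relies on `time.clock()`, which no longer exists in Python 3.8 and later. As a rough sketch of the same comparison for Python 3, the standard `timeit` module can do the looping. The function names `count_streaming`, `count_whole_file`, and `average_time` are made up for this illustration, and unlike the loops above, nothing is printed inside the timed region:

```python
import timeit


def count_streaming(filename):
    """Approach 1: iterate line by line and count non-blank lines."""
    with open(filename) as f:
        return sum(1 for line in f if line.strip())


def count_whole_file(filename):
    """Approach 2: read the whole file, then count lines, words, chars."""
    with open(filename) as f:
        data = f.read()
    return len(data.splitlines()), len(data.split()), len(data)


def average_time(func, filename, repeat=1000):
    """Return the average number of seconds per call over `repeat` calls."""
    total = timeit.timeit(lambda: func(filename), number=repeat)
    return total / repeat
```

Calling `average_time(count_streaming, "lines.txt")` and `average_time(count_whole_file, "lines.txt")` gives one average per approach. Keep in mind that the two functions do not compute the same thing: the first skips blank lines and counts only lines, while the second counts every line as well as words and characters, so the timings are not a like-for-like comparison.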


Source: https://habr.com/ru/post/1246826/

