Efficient way to read data in python

Question

Possible duplicate:
Lazy Method for reading a large file in Python?

I need to read 100 GB (400 million lines) of data from a file, line by line. This is my current code; is there a more efficient way to do this? By efficient I mean speed of execution.

    f = open(path, 'r')
    for line in f:
        ...
    f.close()

+4

python

Rohita khatiwada Apr 4 '11 at 14:14

3 answers

If the lines have a fixed byte length and do not need to be processed in any particular order (you can still work out each line's number from its offset), you can easily split the work into parallel subtasks run by several threads or processes. Each subtask only needs to know where to seek() to and how many bytes to read().

Also, in that case reading line by line is not optimal, since it has to scan for \n; just use read() with a fixed length instead.
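To make the seek()/read() idea concrete, here is a minimal sketch. It assumes every line is exactly RECORD_SIZE bytes including the newline; RECORD_SIZE, process_range, split_ranges and process_file are hypothetical names, and the per-record "work" is only a placeholder that sums lengths.

```python
import os
from concurrent.futures import ProcessPoolExecutor

RECORD_SIZE = 16  # assumption: bytes per line, including the trailing newline


def process_range(path, first_record, n_records):
    """Process n_records fixed-length records starting at first_record."""
    total = 0
    with open(path, "rb") as f:
        # Jump straight to the first record of this subtask.
        f.seek(first_record * RECORD_SIZE)
        for _ in range(n_records):
            record = f.read(RECORD_SIZE)  # no scanning for b"\n"
            if len(record) < RECORD_SIZE:
                break  # reached the end of the file
            total += len(record.rstrip(b"\n"))  # placeholder work
    return total


def split_ranges(n_records, n_parts):
    """Yield (first_record, count) pairs that together cover all records."""
    base, extra = divmod(n_records, n_parts)
    start = 0
    for i in range(n_parts):
        count = base + (1 if i < extra else 0)
        if count:
            yield start, count
        start += count


def process_file(path, workers=4):
    """Run one subtask per worker process and combine the results."""
    n_records = os.path.getsize(path) // RECORD_SIZE
    ranges = list(split_ranges(n_records, workers))
    with ProcessPoolExecutor(max_workers=workers) as ex:
        futures = [ex.submit(process_range, path, s, c) for s, c in ranges]
        return sum(f.result() for f in futures)
```

Each worker opens the file itself, so nothing but the path and two integers is passed between processes. Whether this is actually faster depends on the disk; on a single spinning drive, parallel readers may just compete for I/O.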

+2

vartec Apr 4 '11 at 14:25

If you have a multi-core machine and can use Python 3.2 (instead of Python 2), this would be a good use case for the new concurrent.futures module in Python 3.2, depending on the processing you need to do on each line. If you need the processing to happen in file order, you may have to worry about reassembling the results later.

Otherwise, concurrent.futures lets you schedule each line to be processed in a separate task without much effort. What output do you need to generate from this?

If you think parallelizing the processing of each line will not pay off, the most obvious way is the best way: that is, what you just did.

This example divides the processing among up to 5 subprocesses, each of which runs Python's built-in len function. Replace len with a function that takes a line as its parameter and does whatever processing you need on that line:

    from concurrent.futures import ProcessPoolExecutor as Executor

    with Executor(max_workers=5) as ex:
        with open("poeem_5.txt") as fl:
            results = list(ex.map(len, fl))

The list call is needed to force the mapping to be carried out inside the with statement. If you do not need a scalar value for each line, but rather want to write the result to a file, you can do this in a for loop instead:

    for line in fl:
        ex.submit(my_function, line)
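One caveat for a file with 400 million lines: as far as I know, CPython's map() and a bare submit() loop both queue a task for every line up front, which can use a lot of memory. A possible workaround, sketched below, is to feed the pool in bounded batches and write each batch's results out before reading the next. The names batches and process_to_file, the batch sizes, and the body of my_function are placeholders, and the chunksize argument assumes Python 3.5 or later.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice


def my_function(line):
    # Placeholder for the real per-line work; returns text to write out.
    return str(len(line.rstrip("\n")))


def batches(iterable, size):
    """Yield lists of up to `size` items until the iterable is exhausted."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_to_file(in_path, out_path, workers=5, batch_size=10000):
    """Process lines in parallel, one bounded batch at a time."""
    with ProcessPoolExecutor(max_workers=workers) as ex, \
            open(in_path) as fl, open(out_path, "w") as out:
        for batch in batches(fl, batch_size):
            # map() yields results in input order, so the output file
            # keeps the same line order as the input file.
            for result in ex.map(my_function, batch, chunksize=256):
                out.write(result + "\n")
```

Sending lines to workers in chunks also cuts the per-task overhead, which matters when the work per line is small.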

+1

jsbueno Apr 4 '11 at 16:52

Andreas Jung · Accepted Answer · 2011-04-04T14:24:38+0000

Duplicate

Lazy Method for reading a large file in Python?

Also of interest

http://effbot.org/zone/readline-performance.htm
