Python limits in concurrent file processing

I spent a week processing some gnarly text files, some in the hundred-million-row range.

I used Python to open, analyze, convert, and write out these files. I ran the jobs in parallel, often 6-8 at a time, on a large 8-processor, 16-core EC2 instance with SSD storage.

I would say the output is bad in about 0.001% of the records, for example:

Expected output: |1107|2013-01-01 00:00:00|PS|Johnson|etc.
Actual output:   |11072013-01-01 00:00:00|PS|Johnson|etc.
or:              |1107|2013-01-01 :00:00|PS|Johnson

Almost always the problem is not GIGO, but that Python failed to write a separator or part of a date field. So I assume I am overloading the SSD with these jobs, or rather that Python does not throttle its writes when the jobs compete for the disk.
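For what it's worth, here is a stripped-down sketch of the shape of each job. It is not my real code: the file names, the input format, and the assumption that the jobs share one output file are purely for illustration.

    # Illustrative sketch only: the names, the input format, and the shared
    # output file are assumptions, not my actual pipeline.  Each parallel
    # job does roughly this, and 6-8 of them run at once:
    SHARED_OUTPUT = "converted.txt"   # assumed: one output file shared by the jobs

    def convert_chunk(chunk_path):
        with open(chunk_path) as src, open(SHARED_OUTPUT, "a") as dst:
            for line in src:
                fields = line.rstrip("\n").split(",")      # parse the raw record
                dst.write("|" + "|".join(fields) + "|\n")  # emit a pipe-delimited row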

My question is this: how do I get the fastest processing out of this machine without causing these write errors?

1 answer

Are you using the multiprocessing module (separate processes), or just threads, for the parallel processing?
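If they are separate processes, the simplest way to rule out write interleaving is to give every worker its own output file and concatenate the parts afterwards. A minimal sketch (the file and function names are mine, purely for illustration):

    # Sketch: each worker writes only to its own part-file, so no two
    # writers ever share a file handle; the parts are merged at the end.
    import shutil
    from multiprocessing import Pool

    def convert_chunk(args):
        chunk_path, out_path = args
        with open(chunk_path) as src, open(out_path, "w") as dst:
            for line in src:
                fields = line.rstrip("\n").split(",")
                dst.write("|" + "|".join(fields) + "|\n")
        return out_path

    if __name__ == "__main__":
        jobs = [("part1.csv", "out1.txt"), ("part2.csv", "out2.txt")]
        with Pool(processes=8) as pool:
            parts = pool.map(convert_chunk, jobs)
        # Merge the per-worker outputs once everything has finished.
        with open("converted.txt", "wb") as merged:
            for path in parts:
                with open(path, "rb") as part:
                    shutil.copyfileobj(part, merged)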

I highly doubt the SSD is the problem. Or Python itself. But maybe the csv module has a race condition and is not thread-safe?
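If it is threads sharing a single csv.writer or file object, you would need to serialize access to it yourself. A minimal sketch, assuming a shared writer (the names are illustrative):

    # Sketch: a lock guarantees each row is written as one unit even when
    # several threads share the same writer.
    import csv
    import threading

    write_lock = threading.Lock()

    def write_row(writer, row):
        with write_lock:          # only one thread touches the writer at a time
            writer.writerow(row)

    # Usage (illustrative): open the output file once, build a csv.writer
    # with delimiter="|", and have every worker thread go through write_row().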

Also check your code. And your inputs. Are the bad records consecutive? Can you reproduce them? You mention GIGO, but you don't really rule it out ("Almost always, ...").


Source: https://habr.com/ru/post/950351/

