Python limits in concurrent file processing

I spent a week processing some gnarly text files, some in the hundred-million-row range.

I used Python to open, analyze, convert, and write out these files. I ran the jobs in parallel, often 6-8 at a time, on a large 8-processor, 16-core EC2 instance with SSD storage.

I would say the output is bad in about 0.001% of the records, for example:

Expected output: |1107|2013-01-01 00:00:00|PS|Johnson|etc.
Actual output:   |11072013-01-01 00:00:00|PS|Johnson|etc.
or:              |1107|2013-01-01 :00:00|PS|Johnson

Almost always the problem is not GIGO, but that Python failed to write a separator or part of a date field. So I assume I am overloading the SSD with these jobs, or rather that Python does not throttle its writes when the jobs compete for the disk.
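For what it's worth, here is a stripped-down sketch of the shape of each job. It is not my real code: the file names, the input format, and the assumption that the jobs share one output file are purely for illustration.

    # Illustrative sketch only: the names, the input format, and the shared
    # output file are assumptions, not my actual pipeline.  Each parallel
    # job does roughly this, and 6-8 of them run at once:
    SHARED_OUTPUT = "converted.txt"   # assumed: one output file shared by the jobs

    def convert_chunk(chunk_path):
        with open(chunk_path) as src, open(SHARED_OUTPUT, "a") as dst:
            for line in src:
                fields = line.rstrip("\n").split(",")      # parse the raw record
                dst.write("|" + "|".join(fields) + "|\n")  # emit a pipe-delimited row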

My question is this: how do I get the fastest processing out of this machine without causing these write errors?

1 answer

Are you using the multiprocessing module (separate processes), or just threads, for the parallel processing?
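If they are separate processes, the simplest way to rule out write interleaving is to give every worker its own output file and concatenate the parts afterwards. A minimal sketch (the file and function names are mine, purely for illustration):

    # Sketch: each worker writes only to its own part-file, so no two
    # writers ever share a file handle; the parts are merged at the end.
    import shutil
    from multiprocessing import Pool

    def convert_chunk(args):
        chunk_path, out_path = args
        with open(chunk_path) as src, open(out_path, "w") as dst:
            for line in src:
                fields = line.rstrip("\n").split(",")
                dst.write("|" + "|".join(fields) + "|\n")
        return out_path

    if __name__ == "__main__":
        jobs = [("part1.csv", "out1.txt"), ("part2.csv", "out2.txt")]
        with Pool(processes=8) as pool:
            parts = pool.map(convert_chunk, jobs)
        # Merge the per-worker outputs once everything has finished.
        with open("converted.txt", "wb") as merged:
            for path in parts:
                with open(path, "rb") as part:
                    shutil.copyfileobj(part, merged)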

I highly doubt the SSD is the problem. Or Python itself. But maybe the csv module has a race condition and is not thread-safe?
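If it is threads sharing a single csv.writer or file object, you would need to serialize access to it yourself. A minimal sketch, assuming a shared writer (the names are illustrative):

    # Sketch: a lock guarantees each row is written as one unit even when
    # several threads share the same writer.
    import csv
    import threading

    write_lock = threading.Lock()

    def write_row(writer, row):
        with write_lock:          # only one thread touches the writer at a time
            writer.writerow(row)

    # Usage (illustrative): open the output file once, build a csv.writer
    # with delimiter="|", and have every worker thread go through write_row().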

Also check your code. And your inputs. Are the bad records consecutive? Can you reproduce them? You mention GIGO, but you don't really rule it out ("Almost always, ...").


Source: https://habr.com/ru/post/950351/

