I spent a week processing some gnarly text files, some with a hundred million rows each.
I used Python to open, parse, convert, and write out these files. I ran jobs in parallel, often 6-8 at a time, on a large 8-processor, 16-core EC2 instance backed by SSD storage.
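For context, the jobs follow roughly this pattern (a simplified sketch; the file names and the comma-to-pipe conversion are placeholders, not my actual pipeline):

```python
import glob
from multiprocessing import Pool

def convert(in_path):
    """Read one input file, reformat each record, write a sibling .out file."""
    out_path = in_path + ".out"
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split(",")  # placeholder parsing step
            dst.write("|".join(fields) + "\n")     # pipe-delimited output
    return out_path

if __name__ == "__main__":
    with Pool(8) as pool:  # 6-8 jobs running at the same time
        pool.map(convert, sorted(glob.glob("input/*.txt")))
```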
I'd estimate the output is corrupted in about 0.001% of the records. For example:
Expected output: |1107|2013-01-01 00:00:00|PS|Johnson|etc.
Actual output:   |11072013-01-01 00:00:00|PS|Johnson|etc.
or:              |1107|2013-01-01 :00:00|PS|Johnson
Almost always the problem is not GIGO; rather, Python failed to write a field separator or part of a date field. So my assumption is that I'm overloading the SSD with these jobs, or more precisely, that Python doesn't throttle its writes when several jobs are competing for the disk.
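I don't yet know whether any two jobs ever touch the same output file, but if they do, here is a minimal sketch of the failure mode I suspect (the file name and record content are made up): several processes appending to one shared file with no lock. Python's buffered writer flushes at arbitrary byte boundaries rather than line boundaries, so another process's bytes can land mid-record, which would look exactly like a dropped separator or a torn date field.

```python
from multiprocessing import Process

def writer(worker_id):
    # Append mode, but no cross-process lock: buffer flushes from
    # different processes can interleave in the middle of a record.
    with open("shared_output.txt", "a") as f:
        for _ in range(10000):
            f.write(f"|{worker_id}|2013-01-01 00:00:00|PS|Johnson|\n")

if __name__ == "__main__":
    procs = [Process(target=writer, args=(w,)) for w in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Inspecting shared_output.txt afterwards typically shows a small
    # fraction of torn lines, similar to the corruption rate I'm seeing.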
My question is this: how can I get the fastest possible processing out of this box without causing these write errors?