I have a Python program that does something like this (rough sketch after the list):
1. Read a line from the CSV file.
2. Do some conversions on it.
3. Split it into the actual lines as they will be written to the database.
4. Write these lines to separate CSV files.
5. Return to step 1 if the file has not been fully read.
6. Run SQL*Loader and load these files into the database.
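In rough pseudocode, the loop looks something like this. The helper names and file names are just stand-ins for the real steps 2 and 3, and I'm pretending there's a single output file to keep it short:

```python
import csv

def convert(row):                 # step 2: stand-in for the real conversions
    return row

def split_into_db_lines(row):     # step 3: stand-in; yields the rows to be loaded
    yield row

with open("input.csv", newline="") as src, open("out.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):                             # step 1
        for out_row in split_into_db_lines(convert(row)):   # steps 2-3
            writer.writerow(out_row)                        # step 4
# step 6: sqlldr is then run against the generated file(s)
```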
Step 6 doesn't really take much time; it seems to be step 4 that takes up most of it. Mostly, I'd like to optimize this to handle record sets in the millions, running on a quad-core server with some kind of RAID setup.
There are several ideas I have for solving this:
- Read the entire file in step 1 (or at least read it in very large chunks) and write the output files to disk whole or in very large chunks as well. The idea is that the hard drive would spend less time seeking back and forth between files. Would this buy anything that buffering wouldn't already? (First sketch below.)
- Parallelize steps 1, 2, 3, and 4 into separate processes, so that steps 1, 2, and 3 don't have to wait for step 4 to finish. (Second sketch below.)
- Split the load file into separate pieces and process them in parallel; the lines don't need to be handled in any particular order. This would probably need to be tied in with step 2 somehow. (Third sketch below.)
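For the first idea, this is roughly what I have in mind: keep the converted rows in memory and flush them in big batches, so step 4 becomes a few large sequential writes instead of many small ones. `BATCH_ROWS` and the helper arguments are made-up names, not anything from the real program:

```python
import csv

BATCH_ROWS = 500_000   # made-up threshold; would be tuned to available memory

def chunked_pass(in_path, out_path, convert, split_into_db_lines):
    batch = []
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):                          # step 1
            batch.extend(split_into_db_lines(convert(row)))  # steps 2-3, kept in memory
            if len(batch) >= BATCH_ROWS:
                writer.writerows(batch)                      # step 4 as one big write
                batch.clear()
        if batch:
            writer.writerows(batch)                          # flush whatever is left
```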
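For the second idea, a minimal sketch using `multiprocessing`: steps 1-3 run in one process, step 4 in another, connected by a bounded queue so the reader can't run arbitrarily far ahead of the writer. The conversion and splitting here are placeholders, not the real logic:

```python
import csv
from multiprocessing import Process, Queue

SENTINEL = None

def produce(in_path, q):
    # steps 1-3 in this process: read, convert, split, hand rows to the writer
    with open(in_path, newline="") as src:
        for row in csv.reader(src):
            converted = row                 # placeholder for the real step 2
            for out_row in (converted,):    # placeholder for the real step 3
                q.put(out_row)
    q.put(SENTINEL)                         # tell the writer we're done

def consume(out_path, q):
    # step 4 in its own process, so steps 1-3 never block on disk writes
    with open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        while True:
            row = q.get()
            if row is SENTINEL:
                break
            writer.writerow(row)

if __name__ == "__main__":
    q = Queue(maxsize=100_000)              # bounded, to keep memory in check
    writer_proc = Process(target=consume, args=("out.csv", q))
    writer_proc.start()
    produce("input.csv", q)
    writer_proc.join()
```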
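For the third idea, a sketch assuming the file being split is the input CSV and that it has already been cut into pieces; each worker in a `Pool` runs steps 1-4 on its own piece and writes its own load file. All the names here are made up:

```python
import csv
from multiprocessing import Pool

def convert(row):                 # placeholder for step 2
    return row

def split_into_db_lines(row):     # placeholder for step 3
    return [row]

def process_piece(paths):
    in_path, out_path = paths
    # each worker runs steps 1-4 on its own piece and writes its own load file
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerows(split_into_db_lines(convert(row)))
    return out_path

if __name__ == "__main__":
    # assumes the input has already been split into piece_0.csv .. piece_3.csv
    pieces = [(f"piece_{i}.csv", f"load_{i}.csv") for i in range(4)]
    with Pool(processes=4) as pool:
        load_files = pool.map(process_piece, pieces)
    # sqlldr would then be pointed at each file in load_files
```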