How do I optimize this file-system I/O-bound program?

I have a Python program that does something like this:

  1. Read a line from a CSV file.
  2. Run some conversions on it.
  3. Break it up into the actual rows as they will be written to the database.
  4. Write those rows to separate CSV files.
  5. Go back to step 1 unless the file has been fully read.
  6. Run SQL*Loader and load those files into the database.
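For reference, steps 1 through 5 might look roughly like the sketch below. This is only an illustration of the loop as described: `convert`, `split_into_rows`, `process_file`, and the table names `main` and `detail` are hypothetical placeholders, since the actual conversions and output layout are not shown in the question.

```python
import csv


def convert(record):
    # Step 2 (hypothetical placeholder): trim whitespace from every field.
    return [field.strip() for field in record]


def split_into_rows(record):
    # Step 3 (hypothetical placeholder): the first field is treated as a key
    # that goes to "main"; every remaining field becomes a row in "detail".
    key, *values = record
    yield "main", [key]
    for value in values:
        yield "detail", [key, value]


def process_file(input_path, output_paths):
    """Steps 1-5: read each record, convert it, split it, write the rows.

    output_paths maps a table name to the CSV file that receives its rows.
    """
    files = {}
    try:
        # Open every output file once and keep it open for the whole run.
        for table, path in output_paths.items():
            files[table] = open(path, "w", newline="")
        writers = {table: csv.writer(f) for table, f in files.items()}

        with open(input_path, newline="") as src:
            for record in csv.reader(src):      # steps 1 and 5
                if not record:                  # skip blank lines
                    continue
                for table, row in split_into_rows(convert(record)):
                    writers[table].writerow(row)  # step 4
    finally:
        for f in files.values():
            f.close()
```

Step 6 (SQL*Loader) would then run separately on the files named in `output_paths`.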

Step 6 does not actually take much time. It seems to be step 4 that takes up most of the time. For the most part, I would like to optimize this to handle record sets in the millions, running on a quad-core server with some kind of RAID setup.

I have a few ideas for solving this:

  • Read the entire file in step 1 (or at least read it in very large chunks) and write the output to disk as a whole or in very large chunks. The idea is that the hard drive would spend less time switching between files. Would this do anything that buffering would not?
  • Parallelize steps 1, 2, and 3 on one side and step 4 on the other into separate processes. That way steps 1, 2, and 3 would not have to wait for step 4 to complete.
  • Split the load file into separate chunks and process them in parallel. The rows do not need to be handled in any sequential order. This would probably need to be combined with step 2 in some way.
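As a rough illustration of the second idea, the sketch below decouples writing (step 4) from reading and converting (steps 1 to 3) with a bounded queue. Note the swap: the question proposes separate processes, while this sketch uses a background thread and `queue.Queue` to stay short; `multiprocessing` would be the analogous route for processes. The names `writer_worker`, `run`, and `_DONE`, the 1 MiB buffer, and the queue size are all assumptions for illustration, and error handling is omitted.

```python
import csv
import queue
import threading

_DONE = object()  # sentinel: tells the writer that no more rows will arrive


def writer_worker(work_queue, output_paths, buffer_size=1024 * 1024):
    """Step 4: drain (table, row) items from the queue into per-table CSV files."""
    # A large buffer means each file is flushed to disk in big chunks.
    files = {
        table: open(path, "w", newline="", buffering=buffer_size)
        for table, path in output_paths.items()
    }
    writers = {table: csv.writer(f) for table, f in files.items()}
    try:
        while True:
            item = work_queue.get()
            if item is _DONE:
                break
            table, row = item
            writers[table].writerow(row)
    finally:
        for f in files.values():
            f.close()


def run(rows, output_paths, max_pending=10000):
    """Feed (table, row) pairs to the writer thread.

    rows is any iterable produced by steps 1-3; the bounded queue keeps
    memory use in check if the producer outpaces the disk.
    """
    work_queue = queue.Queue(maxsize=max_pending)
    writer = threading.Thread(target=writer_worker, args=(work_queue, output_paths))
    writer.start()
    try:
        for table, row in rows:
            work_queue.put((table, row))
    finally:
        work_queue.put(_DONE)
        writer.join()
```

Whether this helps depends on where the time really goes: if step 4 is limited by the disk rather than by Python, the producer will simply end up waiting on a full queue.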

, - " , ". , , . -, , - ?

+3
7

Python -, , , - - , . , , O_SYNC.

, ( open()). , 100- - 100 /, - 1 50% - , - 10 9% - . IO, , , . , / .

, 4 -. , , , , .

+3

-:

split, , .

batch Muncher.

cat, .

+5

/, , , / , .

, , / , -, . , .

Python, , , - , , .

, , . , .

+3

ramdisk 4? , kB .

+2

4.

, , , , 4k . , 32 . .

, "" .

+1

ram, ?

, 4.

, - .

+1

First of all, make sure that you actually need to optimize. You don't seem to know exactly where your time is going. Before spending more time speculating, use a profiler to see where the time goes.

http://docs.python.org/library/profile.html

Once you know exactly where the time is going, you will be in a better position to decide where to spend your effort on optimization.
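A minimal sketch of that advice, using the standard-library `cProfile` and `pstats` modules. The helper name `profile_call` is an assumption for illustration; it wraps any callable and returns the result together with a text report of the most expensive calls by cumulative time.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Collect the report in memory instead of printing it directly.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)  # top 10 entries only
    return result, stream.getvalue()
```

Calling `profile_call` on the program's entry point and printing the report should show whether the time is really spent in the file writes or somewhere else, such as the conversions.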

-2

Source: https://habr.com/ru/post/1720564/
