Is parallel file writing effective?

I would like to know whether concurrent file writing is effective. After all, a hard drive has only one usable read/write head at a time, so it can really only perform one task at a time. But the tests below (in Python) contradict what I expected:

The file to copy is about 1 GB

Script 1 (parallel tasks: read and write line by line, 10 copies of the same file):

    #!/usr/bin/env python
    from multiprocessing import Pool

    def read_and_write(copy_filename):
        with open("/env/cns/bigtmp1/ERR000916_2.fastq", "r") as fori:
            with open("/env/cns/bigtmp1/{}.fastq".format(copy_filename), "w") as fout:
                for line in fori:
                    fout.write(line + "\n")
        return copy_filename

    def main():
        f_names = ["test_jm_{}".format(i) for i in range(0, 10)]
        pool = Pool(processes=4)
        results = pool.map(read_and_write, f_names)

    if __name__ == "__main__":
        main()

Script 2 (serial task: read and write line by line, 10 copies of the same file):

    #!/usr/bin/env python

    def read_and_write(copy_filename):
        with open("/env/cns/bigtmp1/ERR000916_2.fastq", "r") as fori:
            with open("/env/cns/bigtmp1/{}.fastq".format(copy_filename), "w") as fout:
                for line in fori:
                    fout.write(line + "\n")
        return copy_filename

    def main():
        f_names = ["test_jm_{}".format(i) for i in range(0, 10)]
        for n in f_names:
            result = read_and_write(n)

    if __name__ == "__main__":
        main()

Script 3 (parallel tasks: copy the same file 10 times):

    #!/usr/bin/env python
    from shutil import copyfile
    from multiprocessing import Pool

    def read_and_write(copy_filename):
        copyfile("/env/cns/bigtmp1/ERR000916_2.fastq",
                 "/env/cns/bigtmp1/{}.fastq".format(copy_filename))
        return copy_filename

    def main():
        f_names = ["test_jm_{}".format(i) for i in range(0, 10)]
        pool = Pool(processes=4)
        results = pool.map(read_and_write, f_names)

    if __name__ == "__main__":
        main()

Script 4 (serial task: copy the same file 10 times):

    #!/usr/bin/env python
    from shutil import copyfile

    def read_and_write(copy_filename):
        copyfile("/env/cns/bigtmp1/ERR000916_2.fastq",
                 "/env/cns/bigtmp1/{}.fastq".format(copy_filename))
        return copy_filename

    def main():
        f_names = ["test_jm_{}".format(i) for i in range(0, 10)]
        for n in f_names:
            result = read_and_write(n)

    if __name__ == "__main__":
        main()

Results:

    $ # parallel: read and write the same file line by line, 10 copies
    $ time python read_write_1.py
    real    1m46.484s
    user    3m40.865s
    sys     0m29.455s
    $ rm test_jm*
    $ # serial: read and write the same file line by line, 10 copies
    $ time python read_write_2.py
    real    4m16.530s
    user    3m41.303s
    sys     0m24.032s
    $ rm test_jm*
    $ # parallel: copy the same file 10 times
    $ time python read_write_3.py
    real    1m35.890s
    user    0m10.615s
    sys     0m36.361s
    $ rm test_jm*
    $ # serial: copy the same file 10 times
    $ time python read_write_4.py
    real    1m40.660s
    user    0m7.322s
    sys     0m25.020s
    $ rm test_jm*

These basic results seem to show that parallel reading and writing is more efficient.

Thanks for any insight.

1 answer

I would like to know if concurrent file writing is effective.

Short answer: physically writing to the same disk from several threads at the same time will not be faster than writing to that disk from a single thread (we are talking about ordinary hard drives here). In some cases it can even be much slower.

But, as always, it depends on many factors:

  • OS cache: writes are usually cached by the operating system and then flushed to disk in chunks. So multiple threads can write into this cache at the same time without any problem, and gain speed by doing so, especially if processing/preparing the data takes longer than writing it to disk.

  • In some cases, even when writing directly to the physical disk from several threads, the OS optimizes this and writes only large blocks to each file.

  • In the worst case, however, smaller blocks may be written to disk each time, which requires a hard-disk seek (±10 ms on an ordinary HDD!) on every file switch (on an SSD this is far less of a problem, since access is more direct and no seek is needed). A small benchmark sketch illustrating this follows the list below.
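
A rough way to see this block-size / file-switch effect on a given machine is to write the same amount of data once in long per-file runs and once in small interleaved chunks, forcing the data out of the OS cache with fsync before taking the time. This is only a minimal sketch: the file names, sizes and chunk counts are made up for illustration, and on a system with aggressive caching or an SSD the measured difference may be small.

    import os
    import time

    DATA = b"x" * (1 << 20)          # 1 MiB payload per write call
    N_FILES = 4
    CHUNKS_PER_FILE = 50             # ~50 MiB per file

    def write_files(interleaved):
        files = [open("bench_{}.bin".format(i), "wb") for i in range(N_FILES)]
        start = time.time()
        if interleaved:
            # small writes that keep switching between files
            for _ in range(CHUNKS_PER_FILE):
                for f in files:
                    f.write(DATA)
        else:
            # finish each file in one long run of writes
            for f in files:
                for _ in range(CHUNKS_PER_FILE):
                    f.write(DATA)
        for f in files:
            f.flush()
            os.fsync(f.fileno())     # make sure the data really reaches the disk
            f.close()
        return time.time() - start

    print("interleaved:", write_files(True))
    print("sequential :", write_files(False))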

So in the general case, when writing to a disk from several threads at the same time, it can be a good idea to prepare (some of) the data in memory and write the final data to disk in large blocks, protected by some kind of lock or, perhaps, handled by a single dedicated write-thread. If the files grow while being written (i.e. the file size is not set in advance), writing the data in larger blocks can also reduce disk fragmentation (at least as far as possible).
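
One common way to implement that "single dedicated write-thread" idea in Python is a bounded queue that worker threads fill with already-prepared blocks while one thread does all the actual disk writes. A minimal Python 3 sketch, where the output path, block contents and counts are only placeholders:

    import threading
    import queue

    def writer(out_path, q):
        # the only thread that touches the disk; everything else just prepares data
        with open(out_path, "wb") as fout:
            while True:
                block = q.get()
                if block is None:       # sentinel: no more data
                    break
                fout.write(block)

    def producer(q, n_blocks):
        for i in range(n_blocks):
            # stands in for expensive processing that yields a large block
            block = ("record %d\n" % i).encode() * 10000
            q.put(block)

    q = queue.Queue(maxsize=8)          # bounded, so producers cannot outrun the disk
    writer_thread = threading.Thread(target=writer, args=("/tmp/output.bin", q))
    writer_thread.start()

    producers = [threading.Thread(target=producer, args=(q, 100)) for _ in range(4)]
    for t in producers:
        t.start()
    for t in producers:
        t.join()

    q.put(None)                         # tell the writer we are done
    writer_thread.join()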

On some systems there may be no difference whatsoever, while on others it can make a big difference and become much slower (it can even vary on the same system with different hard drives).

To get a good measure of the difference in write speed between one thread and several threads, the total file sizes would have to be larger than the available memory - or at least all buffers should be flushed to disk before taking the end time; measuring only the time needed to write the data into the OS cache does not tell you much here.
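
If you want the timing to reflect the physical disk rather than the OS cache, the Python-level buffer and the OS buffers have to be flushed before the end time is taken, for example with flush() plus os.fsync(). A sketch with a placeholder path and size:

    import os
    import time

    start = time.time()
    with open("/tmp/copy_test.bin", "wb") as fout:
        for _ in range(1000):
            fout.write(b"x" * (1 << 20))   # write ~1 GB in 1 MiB chunks
        fout.flush()                        # push Python's buffer to the OS
        os.fsync(fout.fileno())             # force the OS cache out to the disk
    elapsed = time.time() - start
    print("effective write speed: %.1f MiB/s" % (1000 / elapsed))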

Ideally, the total time taken to write all the data to disk should match the physical hard disk's write speed. If writing with one thread is slower than the disk's write speed (meaning that processing the data takes longer than writing it), using more threads will obviously speed things up. If writing from several threads becomes slower than the disk's write speed, time is being lost to disk seeks caused by switching between the different files (or between different blocks inside one large file).

To get an idea of how much time is lost on a large number of disk seeks, let's look at some numbers:

Let's say we have an HDD with a write speed of 50 MB/s:

  • Writing a single continuous 50 MB block takes 1 second (under ideal conditions).

  • Doing the same in 1 MB blocks, with a file switch and the resulting disk seek between each block, gives: 20 ms to write 1 MB + 10 ms seek time. Writing the 50 MB then takes 1.5 seconds, which is 50% more time just for the seeks in between (the same applies to reading from disk; given the higher read speed, the difference would be even greater).

In reality, it will be somewhere in the middle, depending on the system.
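
The same back-of-the-envelope calculation can be written out explicitly; the 50 MB/s write speed, 10 ms seek time and 1 MB block size are simply the example values assumed above:

    write_speed = 50.0   # MB/s, assumed sequential write speed of the HDD
    seek_time = 0.010    # s, assumed cost of one seek per file switch
    total_mb = 50.0
    block_mb = 1.0

    n_blocks = total_mb / block_mb                  # 50 blocks
    pure_write = total_mb / write_speed             # 1.0 s without seeks
    with_seeks = pure_write + n_blocks * seek_time  # 1.5 s with a seek per block
    print("overhead: %.0f%%" % (100 * (with_seeks / pure_write - 1)))  # 50%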

Although we might hope that the OS takes care of all this (or that, for example, using IOCP does), that is not always the case.


Source: https://habr.com/ru/post/1240132/

