PyPy writes files slowly

I have been trying out PyPy lately, and it is 25 times faster for my current project, which works very well. Unfortunately, writing files is incredibly slow: about 60 times slower than CPython.

I did a bit of searching, but I did not find anything useful. Is this a known issue? Is there a workaround?

In a simple test case:

 with file(path, 'w') as f:
     f.writelines(['testing to write a file\n' for i in range(5000000)])

I see a 60x slowdown in PyPy compared to regular Python. I am using 64-bit Python 2.7.3 and 32-bit PyPy 1.9 (which implements Python 2.7.2), both on the same machine and OS (Windows 7), of course.

Any help would be greatly appreciated. PyPy is much faster for what I am doing, but if file writing is limited to half a megabyte per second, it is clearly less useful.

+4
4 answers

It is slower, but not 60x slower, on this system.

TL;DR: use write('\n'.join(...)) instead of writelines(...)

 $ pypy -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines(['testing to write a file\n' for i in range(5000000)])"
 10 loops, best of 3: 1.15 sec per loop
 $ python -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines(['testing to write a file\n' for i in range(5000000)])"
 10 loops, best of 3: 434 msec per loop

Using xrange doesn't matter:

 $ pypy -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines(['testing to write a file\n' for i in xrange(5000000)])"
 10 loops, best of 3: 1.15 sec per loop

Using a generator expression is slower for PyPy but faster for Python:

 $ pypy -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines('testing to write a file\n' for i in xrange(5000000))"
 10 loops, best of 3: 1.62 sec per loop
 $ python -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines('testing to write a file\n' for i in xrange(5000000))"
 10 loops, best of 3: 407 msec per loop

Moving data creation outside of the timed statement amplifies the difference (~4.2x):

 $ pypy -m timeit -s "path='tst'; data=['testing to write a file\n' for i in range(5000000)]" "with file(path, 'w') as f:f.writelines(data)"
 10 loops, best of 3: 786 msec per loop
 $ python -m timeit -s "path='tst'; data=['testing to write a file\n' for i in range(5000000)]" "with file(path, 'w') as f:f.writelines(data)"
 10 loops, best of 3: 189 msec per loop

Using write() instead of writelines() is much faster for both:

 $ pypy -m timeit -s "path='tst'; data='\n'.join('testing to write a file\n' for i in range(5000000))" "with file(path, 'w') as f:f.write(data)"
 10 loops, best of 3: 51.9 msec per loop
 $ python -m timeit -s "path='tst'; data='\n'.join('testing to write a file\n' for i in range(5000000))" "with file(path, 'w') as f:f.write(data)"
 10 loops, best of 3: 52.4 msec per loop
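The same comparison can be reproduced in modern Python 3 syntax with the timeit module. This is a minimal sketch, not the original benchmark: the path, the iteration count (reduced to keep the run short), and the helper names are assumptions. Note that joining with '' rather than '\n' keeps the output byte-for-byte identical to the writelines variant, since each line already ends in a newline.

```python
import os
import tempfile
import timeit

LINE = 'testing to write a file\n'
N = 100_000  # illustrative; much smaller than the original 5,000,000

def with_writelines(path):
    # Many small strings handed to writelines() in one call.
    with open(path, 'w') as f:
        f.writelines([LINE for _ in range(N)])

def with_single_write(path):
    # Join everything into one big string and issue a single write().
    with open(path, 'w') as f:
        f.write(''.join(LINE for _ in range(N)))

path = os.path.join(tempfile.mkdtemp(), 'tst')
t_lines = timeit.timeit(lambda: with_writelines(path), number=3)
t_write = timeit.timeit(lambda: with_single_write(path), number=3)
print(f'writelines: {t_lines:.3f}s  single write: {t_write:.3f}s')

# Both variants produce identical file contents.
with open(path) as f:
    assert f.read() == LINE * N
```

The absolute numbers will differ from the shell transcripts above; the point is only that the two variants can be timed side by side with identical output.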

 $ uname -srvmpio
 Linux 3.2.0-26-generic #41-Ubuntu SMP Thu Jun 14 17:49:24 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
 $ python --version
 Python 2.7.3
 $ pypy --version
 Python 2.7.2 (1.8+dfsg-2, Feb 19 2012, 19:18:08)
 [PyPy 1.8.0 with GCC 4.6.2]
+2

xrange helps in this example because it does not build a list but yields values lazily. 64-bit Python is probably faster than 32-bit PyPy at creating a list with 5 million items.

If your real code is different, post the actual code, not just a test case.

0

First, get your benchmarking methodology right.

When the goal is to measure pure file-writing performance, it is a serious flaw (a systematic error) to create the data to be written inside the code segment you are timing. Creating the data also takes time, and that time is not what you want to measure.

Therefore, if you plan to keep all the dummy data in memory, create it before you start measuring time.
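A minimal sketch of that separation, in Python 3 syntax: the data is built before the clock starts, so timeit measures only the write. The path, list size, and function name are illustrative assumptions.

```python
import os
import tempfile
import timeit

path = os.path.join(tempfile.mkdtemp(), 'tst')

# Create the dummy data BEFORE timing, so its cost is excluded.
data = ['testing to write a file\n' for _ in range(100_000)]

def timed_write():
    # Only the file write is measured here, not the data creation.
    with open(path, 'w') as f:
        f.writelines(data)

elapsed = timeit.timeit(timed_write, number=3)
print(f'write-only time over 3 runs: {elapsed:.3f} s')
```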

However, in your case the on-the-fly data generation is likely faster than your I/O will ever be. So by using a Python generator, in this case a generator expression combined with the write call, you get rid of this systematic error.

I do not know exactly how writelines compares with write. However, adapting your writelines example:

 with file(path, 'w') as f:
     f.writelines('xxxx\n' for _ in xrange(10**6))

Writing large chunks of data with write can be faster:

 with file(path, 'w') as f:
     for chunk in ('x'*99999 for _ in xrange(10**3)):
         f.write(chunk)
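A runnable Python 3 version of that chunked approach follows. The chunk count is reduced from the original 10**3 to keep the example small, and the temporary path is an assumption; the check at the end only verifies that the expected number of bytes reached the file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'out.txt')
chunk = 'x' * 99999   # one large chunk instead of many tiny lines
n_chunks = 100        # illustrative; the answer above uses 10**3

with open(path, 'w') as f:
    for _ in range(n_chunks):
        f.write(chunk)   # each call hands the OS a large buffer

# The file should contain exactly n_chunks copies of the chunk.
assert os.path.getsize(path) == len(chunk) * n_chunks
print('wrote', os.path.getsize(path), 'bytes')
```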

Once you get the benchmarking right, I'm sure you will still find differences between Python and PyPy. Perhaps PyPy is even significantly slower in some circumstances. But with proper benchmarking, I believe you will find conditions under which PyPy writes files fast enough for your purposes.

0

In your test you create two lists: one with range() and one with the list comprehension.

List 1: one option is to replace the list returned by range with the lazy xrange. Another is to try PyPy's own optimization called range lists.

You can enable this feature with the --objspace-std-withrangelist option.

List 2: you build your entire output list before writing it. It should also be lazy, so turn the list comprehension into a generator expression:

 f.writelines('testing to write a file\n' for i in range(5000000)) 

As long as the generator expression is the only argument passed to the function, it does not even need its own pair of parentheses.
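The same parenthesization rule applies to any call, not just writelines. A small illustration with sum():

```python
# Sole argument: the generator expression needs no extra parentheses.
total = sum(i * i for i in range(10))
assert total == 285  # 0 + 1 + 4 + ... + 81

# With a second argument, the generator expression must be parenthesized.
total_offset = sum((i * i for i in range(10)), 100)
assert total_offset == 385
```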

-1

Source: https://habr.com/ru/post/1436160/

