How can I determine why my Racket code is so slow?

Question

Just for fun, I wrote a quick Racket command-line script to parse old Unix fortune files. Fortune files are just giant text files, with a single % on an otherwise empty line separating the entries.

As a quick first hack, I wrote the following Racket code:

 (define fortunes (with-input-from-file "fortunes.txt" (λ () (regexp-split #rx"%" (port->string)))))

I thought it would run almost instantly. Instead, it takes a very long time to run, on the order of minutes. For comparison, what I consider the equivalent Python:

 with open('fortunes.txt') as f: fortunes = f.read().split('%')

runs instantly, with results equivalent to those of the Racket code.

What am I doing wrong here? Yes, there are some obvious inefficiencies; for example, I'm sure it would be better not to slurp the whole file into RAM with port->string. But the behavior is so pathologically bad that I feel I must be doing something stupid at a much higher level than that.

Is there a more Racket-like way to do this with comparable performance? Is Racket I/O really poor for some operations? Is there a way to profile my code more deeply than the naive profiler in DrRacket, so I can figure out what about this line is causing the problem?

EDIT: The fortune file I'm using is FreeBSD's, found at http://fortunes.cat-v.org/freebsd/ , which is about 2 MB. The best runtime for Racket 5.1.3 x64 on OS X Lion was:

 real 1m1.479s
 user 0m57.400s
 sys 0m0.691s

For Python 2.7.1 x64, this was:

 real 0m0.057s
 user 0m0.029s
 sys 0m0.015s

Eli is correct that the time is spent almost entirely in regexp-split (although a full second appears to be spent in port->string), but it is not clear to me that there is a preferred yet equally simple method.

performance racket

Benjamin Pollack Aug 16 '11 at 23:21

2 answers

Most of the overhead seems to come from running regexp-split on a string. The fastest alternative I found is to split a byte string and then convert the results to strings:

 (map bytes->string/utf-8 (call-with-input-file "db" (λ (i) (regexp-split #rx#"%" (port->bytes i)))))

With a random DB of about 2 MB, your code took around 35 seconds, and this version took 33 ms.

(I'm not sure why it takes so long on a string, but it is definitely too slow.)
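To see where the time goes without the DrRacket profiler, one option is to wrap each approach in `time`. The following is only a minimal sketch: `sample`, `split-as-string`, and `split-as-bytes` are hypothetical names, and the input is a small synthetic stand-in built in memory rather than a real fortune file. Splitting the UTF-8 bytes on % is safe because % is a single ASCII byte and never appears inside a multi-byte sequence.

```racket
#lang racket/base
(require racket/string)

;; Hypothetical stand-in for a fortune file:
;; 1000 entries separated by "%" lines.
(define sample
  (string-join (for/list ([i (in-range 1000)])
                 (format "fortune number ~a\n" i))
               "%\n"))

;; Split as a string (the approach from the question).
(define (split-as-string s)
  (regexp-split #rx"%" s))

;; Split as bytes, then decode each piece (the workaround above).
(define (split-as-bytes s)
  (map bytes->string/utf-8
       (regexp-split #rx#"%" (string->bytes/utf-8 s))))

;; `time` prints cpu, real, and gc time, then returns the result.
(define by-string (time (split-as-string sample)))
(define by-bytes (time (split-as-bytes sample)))
```

The absolute numbers depend heavily on the Racket version and the input size, so treat the output as a relative comparison only.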

EDIT: We tracked this down to a performance bug. Rough description: when Racket does a regexp-match on a string, it converts large parts of the string to a byte string (in UTF-8) in order to search. That function is the core one, and it is implemented in C. regexp-split calls it repeatedly to find all matches, and therefore keeps redoing this conversion. I am looking for a way to do this better, but until it is fixed, use the workaround described above.
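To get a feel for why redoing the conversion hurts, here is a toy model. It is not Racket's actual matcher: `count-encoded-bytes` is a hypothetical helper that assumes the whole unsearched tail is re-encoded before every search (which may overstate the real cost), and it assumes ASCII input so that byte offsets and character offsets coincide.

```racket
#lang racket/base

;; Toy model only, NOT Racket's real implementation.
;; Finds every "%" in s, re-encoding the unsearched tail to UTF-8
;; before each search, and returns the total number of bytes encoded.
;; Assumes s is ASCII, so byte and character positions are the same.
(define (count-encoded-bytes s)
  (let loop ([start 0] [total 0])
    ;; Re-encode everything from `start` to the end of the string.
    (define tail (string->bytes/utf-8 s #f start))
    (define m (regexp-match-positions #rx#"%" tail))
    (define total* (+ total (bytes-length tail)))
    (if m
        ;; Continue just past the match; (cdar m) is its end offset.
        (loop (+ start (cdar m)) total*)
        total*)))

;; "a%b%c" is 5 bytes, but this model encodes 5 + 3 + 1 = 9 bytes.
(count-encoded-bytes "a%b%c")
```

In this model the encoded total grows roughly with the string length times the number of separators, while a single up-front conversion (as in the byte-string workaround) touches each byte once.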

Eli Barzilay Aug 17 '11 at 0:57

Sam Tobin-Hochstadt · Accepted Answer · 2011-08-19T03:09:08+0000

This is now fixed in the Git HEAD version of Racket; see github.com/plt/racket/commit/8eefaba . Your example now runs in 0.1 seconds for me.
