Just for fun, I wrote a quick Racket command-line script to parse old Unix fortune files. Fortune files are simply giant text files, with a single % on an otherwise empty line separating the entries.
As a quick first hack, I wrote the following Racket code:
(define fortunes (with-input-from-file "fortunes.txt" (λ () (regexp-split
I expected it to run almost instantly. Instead, it takes a very long time to run, on the order of a few minutes. For comparison, what I consider the equivalent Python:
with open('fortunes.txt') as f: fortunes = f.read().split('%')
runs immediately, with results equivalent to those of the Racket code.
What am I doing wrong here? Yes, there are some obvious inefficiencies; for example, I'm sure it would be better not to slurp the whole file into RAM with port->string, but the behavior is so pathologically bad that I feel like I must be doing something stupid at a much higher level than that.
Is there a more Racket-like way to do this with comparable performance? Is Racket I/O really that poor for some operations? Is there a way to profile my code a little deeper than the naive profiler in DrRacket, so I can figure out what it is about this line that causes the problem?
EDIT: The fortune file I'm using is FreeBSD's, found at http://fortunes.cat-v.org/freebsd/, which weighs in at about 2 MB. The best runtime for Racket 5.1.3 x64 on OS X Lion was:
real 1m1.479s
user 0m57.400s
sys 0m0.691s
For Python 2.7.1 x64, this was:
real 0m0.057s
user 0m0.029s
sys 0m0.015s
Eli is correct that the time is spent almost entirely in regexp-split (although a full second does appear to be spent in port->string), but it is not clear to me that there is a preferred yet equally simple alternative.
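For anyone who wants to reproduce that breakdown, here is roughly how the two stages can be timed separately. This is only a sketch: the `#rx"%"` pattern is an assumption, chosen to mirror the Python `split('%')` above, and `text` is just an illustrative name.

```racket
#lang racket

;; Stage 1: read the whole file into a single string.
;; `time` prints cpu/real/gc milliseconds and returns the value of its expression.
(define text
  (time (with-input-from-file "fortunes.txt" port->string)))

;; Stage 2: split the string on "%".
;; The pattern is an assumption that mirrors the Python version.
(define fortunes
  (time (regexp-split #rx"%" text)))

(printf "~a entries\n" (length fortunes))
```

Each `time` form prints its own line, so the cost of reading the file and the cost of splitting it show up separately.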