Quite a simple problem, really: you have one billion (1e+9) unsigned 32-bit integers stored as decimal ASCII strings in a TSV (tab-separated values) file. Converting them with int() is terribly slow compared to other tools working on the same dataset. Why? And, more importantly: how can it be done faster?
So the question is: what is the fastest way to convert a string to an integer in Python?
What I am really wondering about is some secondary Python functionality that could be (ab)used for this purpose, not unlike Guido's use of array.array in his "Optimization Anecdote".
Sample data (with tabs expanded to spaces):
38262904 "pfv" 2002-11-15T00:37:20+00:00
12311231 "tnealzref" 2008-01-21T20:46:51+00:00
26783384 "hayb" 2004-02-14T20:43:45+00:00
812874 "qevzasdfvnp" 2005-01-11T00:29:46+00:00
22312733 "bdumtddyasb" 2009-01-17T20:41:04+00:00
The time needed to read the data does not matter here; processing the data is the bottleneck.
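As a minimal sketch of the baseline being measured (assuming the integer is the first tab-separated field of each row; the helper name is made up for illustration):

```python
# Baseline sketch: take the first tab-separated field of each row
# and convert it with int() -- the conversion this question is about.
def parse_first_column(lines):
    """Convert the first TSV field of each line to an int."""
    return [int(line.split("\t", 1)[0]) for line in lines]

rows = [
    '38262904\t"pfv"\t2002-11-15T00:37:20+00:00',
    '12311231\t"tnealzref"\t2008-01-21T20:46:51+00:00',
]
print(parse_first_column(rows))  # [38262904, 12311231]
```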
Microbenchmarks
All of the following are interpreted languages. The host machine runs 64-bit Linux.
Python 2.6.2 with IPython 0.9.1, ~2,137k conversions per second (100%):
In [1]: strings = map(str, range(int(1e7)))
In [2]: %timeit map(int, strings);
10 loops, best of 3: 4.68 s per loop
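The same measurement can be reproduced without IPython using the standard timeit module (a sketch; the string count is scaled down here so it runs quickly, bump N back to int(1e7) to match the run above):

```python
import timeit

# Scaled-down reproduction of the %timeit run above.
N = int(1e5)
strings = list(map(str, range(N)))

# Best of 3 runs, one full pass over the strings per run.
seconds = min(timeit.repeat(lambda: list(map(int, strings)), number=1, repeat=3))
print("%.0f kcps" % (N / seconds / 1000))
```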
REBOL 3.0, version 2.100.76.4.2, ~2,310 kcps (108%):
>> strings: array n: to-integer 1e7 repeat i n [poke strings i mold (i - 1)]
== "9999999"
>> delta-time [map str strings [to integer! str]]
== 0:00:04.328675
REBOL 2.7.6.4.2 (15 March 2008), ~5,227 kcps (261%):
As John noted in the comments, this version does not build a list of the converted integers, so the speed ratio given is relative to Python's 4.99 s runtime for the loop-only variant for str in strings: int(str).
>> delta-time: func [c /local t] [t: now/time/precise do c now/time/precise - t]
>> strings: array n: to-integer 1e7 repeat i n [poke strings i mold (i - 1)]
== "9999999"
>> delta-time [foreach str strings [to integer! str]]
== 0:00:01.913193
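For comparison, the list-free Python variant that the 4.99 s figure refers to can be timed the same way (a sketch, scaled down like the earlier one):

```python
import timeit

N = int(1e5)  # use int(1e7) to match the original run
strings = list(map(str, range(N)))

def convert_only():
    # Mirrors `for str in strings: int(str)` -- no result list is built.
    for s in strings:
        int(s)

seconds = min(timeit.repeat(convert_only, number=1, repeat=3))
print("%.3f s for %d conversions" % (seconds, N))
```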
KDB+ 2.6t 2009.04.15, ~20,161 kcps (944%):
q)strings:string til "i"$1e7
q)\t "I"$strings
496
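The KDB+ line "I"$strings is a single vectorized cast over the whole list. The closest Python analogue I can think of is NumPy's bulk string-to-integer conversion (a sketch, assuming NumPy is installed; whether it actually beats map(int, ...) on this workload should be measured, not assumed):

```python
import numpy as np

strings = list(map(str, range(10)))

# An integer dtype makes NumPy parse each decimal string in C,
# loosely mirroring KDB's vectorized "I"$strings cast.
values = np.array(strings, dtype=np.uint32)
print(values)  # prints [0 1 2 3 4 5 6 7 8 9]
```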