Faster Pandas Dataframe Processing

Question


I am trying to process very large files (10,000+ cases) in which the zip codes are not consistently formatted. I need to truncate each one to its first 5 digits. Here is my current code:

def makezip(frame, zipcol):
    i = 0
    while i < len(frame):
        frame[zipcol][i] = frame[zipcol][i][:5]
        i += 1
    return frame

frame is the data frame, and zipcol is the name of the column containing the zip codes. Although this works, it takes a lot of time to process. Is there a faster way?


python pandas

whateveryousayiam May 19, '15 at 21:35


1 answer

joris · Accepted Answer · 2015-05-19T21:39:21+0000

You can use the .str accessor on string columns to access certain string methods. The accessor also supports slicing:

frame[zipcol] = frame[zipcol].str[:5]
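As a rough sketch, the makezip function from the question could be rewritten around that one-liner. This assumes the column already holds strings; the sample values below are hypothetical and only there for illustration.

```python
import pandas as pd


def makezip(frame, zipcol):
    # Slice every value in the column at once with the .str accessor,
    # instead of looping over the rows one by one.
    # Assumption: the column contains strings, not numbers.
    frame[zipcol] = frame[zipcol].str[:5]
    return frame


# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({"zip": ["12345-6789", "98765", "02134-1000"]})
df = makezip(df, "zip")
print(df["zip"].tolist())  # ['12345', '98765', '02134']
```

If some zip codes were read in as numbers, they would likely need converting to strings before slicing.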

Based on a small example, slicing with .str is about 50 times faster than looping over the rows one by one:

In [29]: s = pd.Series(['testtest']*10000)

In [30]: %timeit s.str[:5]
100 loops, best of 3: 3.06 ms per loop

In [31]: %timeit str_loop(s)
10 loops, best of 3: 164 ms per loop

with str_loop defined as:

In [27]: def str_loop(s):
   .....:     for i in range(len(s)):
   .....:         s[i] = s[i][:5]
   .....:
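To reproduce the comparison outside IPython, something like the following sketch with the standard timeit module should work. The return statement in str_loop and the s.copy() calls are additions for this sketch (so the original Series is not modified between runs); absolute timings will vary with machine and pandas version.

```python
import timeit

import pandas as pd


def str_loop(s):
    # Row-by-row version from the answer: slice each element in turn.
    for i in range(len(s)):
        s[i] = s[i][:5]
    return s  # added for this sketch so the result can be compared


s = pd.Series(["testtest"] * 10000)

# Both approaches should produce the same values.
vectorized = s.str[:5]
looped = str_loop(s.copy())
same = vectorized.tolist() == looped.tolist()

# Time each approach a few times. The loop timing also includes the
# cost of s.copy(), which is small next to the loop itself.
runs = 3
t_vec = timeit.timeit(lambda: s.str[:5], number=runs)
t_loop = timeit.timeit(lambda: str_loop(s.copy()), number=runs)
print(f"vectorized: {t_vec / runs * 1e3:.2f} ms per call")
print(f"loop:       {t_loop / runs * 1e3:.2f} ms per call")
```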
