Faster Pandas Dataframe Processing

I am trying to process very large files (10,000+ rows) in which the zip codes are not consistently formatted. I need to truncate them all to their first 5 digits, and here is my current code:

def makezip(frame, zipcol):
    i = 0
    while i < len(frame):
        frame[zipcol][i] = frame[zipcol][i][:5]
        i += 1
    return frame

frame is the data frame, and zipcol is the name of the column containing the zip codes. Although this works, it takes a lot of time to process. Is there a faster way?

1 answer

You can use the .str accessor on string columns to access certain string methods. It also supports slicing:

frame[zipcol] = frame[zipcol].str[:5]
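As a rough sketch of how this could replace the makezip function from the question: the name make_zip, the sample data, and the astype(str) cast are my own assumptions, not part of the original code. The cast is only there so that a column holding numbers or mixed types still exposes .str. It assumes there are no missing values, since those would otherwise be turned into text.

```python
import pandas as pd

def make_zip(frame, zipcol):
    """Return a copy of frame with zipcol truncated to its first 5 characters."""
    out = frame.copy()
    # Cast to str first so numeric or mixed columns also expose the .str accessor.
    # Assumption: the column has no missing values.
    out[zipcol] = out[zipcol].astype(str).str[:5]
    return out

# Hypothetical sample data: ZIP+4, plain, and space-separated forms.
df = pd.DataFrame({"zip": ["12345-6789", "54321", "98765 4321"]})
result = make_zip(df, "zip")
print(result["zip"].tolist())  # ['12345', '54321', '98765']
```

Returning a copy rather than modifying frame in place is a design choice here; assigning back to frame[zipcol] directly, as in the one-liner above, works as well.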

Based on a small example, this is about 50 times faster than a row-by-row loop:

In [29]: s = pd.Series(['testtest']*10000)

In [30]: %timeit s.str[:5]
100 loops, best of 3: 3.06 ms per loop

In [31]: %timeit str_loop(s)
10 loops, best of 3: 164 ms per loop

with str_loop defined as:

In [27]: def str_loop(s):
   .....:     for i in range(len(s)):
   .....:         s[i] = s[i][:5]
   .....:
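If you want to reproduce the comparison outside IPython, something like the following should work. It is only a sketch: it uses the standard timeit module instead of %timeit, the loop works on a copy and goes through .iloc (my choice, to avoid modifying the input), and the absolute timings will depend on your machine and pandas version.

```python
import timeit
import pandas as pd

def str_loop(s):
    # Row-by-row version: one Python-level slice and assignment per element.
    out = s.copy()
    for i in range(len(out)):
        out.iloc[i] = out.iloc[i][:5]
    return out

s = pd.Series(["testtest"] * 10000)

# Both approaches should produce the same values.
assert str_loop(s).tolist() == s.str[:5].tolist()

# Time a few runs of each; only the ratio between them is meaningful.
loop_time = timeit.timeit(lambda: str_loop(s), number=3)
vec_time = timeit.timeit(lambda: s.str[:5], number=3)
print(f"loop: {loop_time:.4f}s  .str[:5]: {vec_time:.4f}s")
```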

Source: https://habr.com/ru/post/1588886/

