Pandas DataFrame replace() is slow

I have an Excel file (.xlsx) with approximately 800 rows and 128 columns, with fairly dense data in the grid. There are about 9,500 cell values that I'm trying to replace using a Pandas DataFrame:

 xlsx = pandas.ExcelFile(filename)
 frame = xlsx.parse(xlsx.sheet_names[0])
 media_frame = frame[media_headers]     # just get the cols that need replacing
 from_filenames = get_from_filenames()  # returns ~9500 filenames to replace in DF
 to_filenames = get_to_filenames()
 media_frame = media_frame.replace(from_filenames, to_filenames)
 frame.update(media_frame)
 frame.to_excel(filename)

The replace() call takes 60 seconds. Is there any way to speed this up? This is not a huge amount of data or work, and I expected Pandas to be much faster. FYI, I tried the same processing on the same file saved as CSV, but the time savings were minimal (replace() still took about 50 seconds).

+5
2 answers

strategy

Create a pd.Series representing a map from old file names to new file names.
stack the data frame, map the values, then unstack.

setup

 import pandas as pd
 import numpy as np
 from string import ascii_letters as letters  # string.letters in Python 2

 # build a 9500 x 800 frame of random 3-letter "filenames"
 media_frame = pd.DataFrame(
     pd.DataFrame(
         np.random.choice(list(letters), 9500 * 800 * 3)
         .reshape(3, -1)).sum().values.reshape(9500, -1))

 # map: each unique value -> the same string rotated by one character
 u = np.unique(media_frame.values)
 from_filenames = pd.Series(u)
 to_filenames = from_filenames.str[1:] + from_filenames.str[0]
 m = pd.Series(to_filenames.values, from_filenames.values)

solution

 media_frame.stack().map(m).unstack() 
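
Applied back to the question's setup, the same idea would look roughly like this (a sketch that assumes the question's frame, media_headers, from_filenames and to_filenames variables):

 m = pd.Series(to_filenames, index=from_filenames)  # old filename -> new filename
 media_frame = frame[media_headers]
 # cells whose value is not in the map become NaN after map(),
 # and frame.update() skips NaN, so only mapped cells are overwritten
 frame.update(media_frame.stack().map(m).unstack())
 frame.to_excel(filename)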

timing

5 x 5 data frame
[timing plot]

100 x 100
[timing plot]

9500 x 800
[timing plot]

9500 x 800, map using a Series vs. a dict

 d = dict(zip(from_filenames, to_filenames))

[timing plot]
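
In other words, the last comparison is between handing map() the Series m and handing it the plain dict d; both are accepted (a small illustrative sketch, not timing code from the original answer):

 media_frame.stack().map(m).unstack()   # map via a pd.Series
 media_frame.stack().map(d).unstack()   # map via a plain dict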

+6

I got a 60-second task down to 10 seconds by dropping replace() altogether and using set_value() one element at a time.
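
A rough sketch of that cell-by-cell approach (variable names are illustrative, and set_value() has since been deprecated in newer pandas, where the .at accessor does the same job):

 mapping = dict(zip(from_filenames, to_filenames))
 for row in media_frame.index:
     for col in media_frame.columns:
         old = media_frame.at[row, col]               # df.get_value(row, col) in older pandas
         if old in mapping:
             media_frame.at[row, col] = mapping[old]  # df.set_value(row, col, new) in older pandas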

+1

Source: https://habr.com/ru/post/1257674/

