Pandas DataFrame replace() is slow

I have an Excel file (.xlsx) with approximately 800 rows and 128 columns, with fairly dense data in the grid. There are about 9,500 cell values that I'm trying to replace using a Pandas DataFrame:

 xlsx = pandas.ExcelFile(filename)
 frame = xlsx.parse(xlsx.sheet_names[0])
 media_frame = frame[media_headers]     # just get the cols that need replacing
 from_filenames = get_from_filenames()  # returns ~9500 filenames to replace in DF
 to_filenames = get_to_filenames()
 media_frame = media_frame.replace(from_filenames, to_filenames)
 frame.update(media_frame)
 frame.to_excel(filename)

The replace() call takes 60 seconds. Is there any way to speed this up? This is not a huge amount of data or work, and I expected Pandas to be much faster. FYI, I tried the same processing on the same file saved as CSV, but the time savings were minimal (replace() still took about 50 seconds).

+5
2 answers

strategy

Create a pd.Series representing a map from old file names to new file names.
stack the data frame, map the values, then unstack.

setup

 import pandas as pd
 import numpy as np
 from string import ascii_letters as letters  # string.letters in Python 2

 # build a 9500 x 800 frame of random 3-letter "filenames"
 media_frame = pd.DataFrame(
     pd.DataFrame(
         np.random.choice(list(letters), 9500 * 800 * 3)
         .reshape(3, -1)).sum().values.reshape(9500, -1))

 # map: each unique value -> the same string rotated by one character
 u = np.unique(media_frame.values)
 from_filenames = pd.Series(u)
 to_filenames = from_filenames.str[1:] + from_filenames.str[0]
 m = pd.Series(to_filenames.values, from_filenames.values)

solution

 media_frame.stack().map(m).unstack() 
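
Applied back to the question's setup, the same idea would look roughly like this (a sketch that assumes the question's frame, media_headers, from_filenames and to_filenames variables):

 m = pd.Series(to_filenames, index=from_filenames)  # old filename -> new filename
 media_frame = frame[media_headers]
 # cells whose value is not in the map become NaN after map(),
 # and frame.update() skips NaN, so only mapped cells are overwritten
 frame.update(media_frame.stack().map(m).unstack())
 frame.to_excel(filename)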

timing

5 x 5 data frame
[timing plot]

100 x 100
[timing plot]

9500 x 800
[timing plot]

9500 x 800, map using a Series vs. a dict

 d = dict(zip(from_filenames, to_filenames))

[timing plot]
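
In other words, the last comparison is between handing map() the Series m and handing it the plain dict d; both are accepted (a small illustrative sketch, not timing code from the original answer):

 media_frame.stack().map(m).unstack()   # map via a pd.Series
 media_frame.stack().map(d).unstack()   # map via a plain dict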

+6

I got a 60-second task down to 10 seconds by dropping replace() altogether and using set_value() one element at a time.
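
A rough sketch of that cell-by-cell approach (variable names are illustrative, and set_value() has since been deprecated in newer pandas, where the .at accessor does the same job):

 mapping = dict(zip(from_filenames, to_filenames))
 for row in media_frame.index:
     for col in media_frame.columns:
         old = media_frame.at[row, col]               # df.get_value(row, col) in older pandas
         if old in mapping:
             media_frame.at[row, col] = mapping[old]  # df.set_value(row, col, new) in older pandas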

+1

Source: https://habr.com/ru/post/1257674/

