I have an Excel spreadsheet that I need to import into SQL Server on a daily basis. The spreadsheet will contain about 250,000 rows and 50 columns. I tested both openpyxl and xlrd using nearly identical code.
Here's the code I'm using (minus the debugging statements):
import xlrd
import openpyxl

def UseXlrd(file_name):
    workbook = xlrd.open_workbook(file_name, on_demand=True)
    worksheet = workbook.sheet_by_index(0)
    first_row = []
    for col in range(worksheet.ncols):
        first_row.append(worksheet.cell_value(0,col))
    data = []
    for row in range(1, worksheet.nrows):
        record = {}
        for col in range(worksheet.ncols):
            if isinstance(worksheet.cell_value(row,col), str):
                record[first_row[col]] = worksheet.cell_value(row,col).strip()
            else:
                record[first_row[col]] = worksheet.cell_value(row,col)
        data.append(record)
    return data

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    first_row = []
    for col in range(1,sheet.max_column+1):
        first_row.append(sheet.cell(row=1,column=col).value)
    data = []
    for r in range(2,sheet.max_row+1):
        record = {}
        for col in range(sheet.max_column):
            if isinstance(sheet.cell(row=r,column=col+1).value, str):
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value.strip()
            else:
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value
        data.append(record)
    return data

xlrd_results = UseXlrd('foo.xls')
openpyxl_results = UseOpenpyxl('foo.xls')
Reading the same Excel file containing 3,500 rows gives drastically different run times. Using xlrd, I can read the entire file into a list of dictionaries in less than 2 seconds. Using openpyxl, I get the following results:
Reading Excel File...
Read 100 lines in 114.14509415626526 seconds
Read 200 lines in 471.43183994293213 seconds
Read 300 lines in 982.5288782119751 seconds
Read 400 lines in 1729.3348784446716 seconds
Read 500 lines in 2774.886833190918 seconds
Read 600 lines in 4384.074863195419 seconds
Read 700 lines in 6396.7723388671875 seconds
Read 800 lines in 7998.775000572205 seconds
Read 900 lines in 11018.460735321045 seconds
Although I could use xlrd in the final script, I would have to hard-code a lot of formatting to work around various problems (e.g. ints read as floats, dates read as ints, datetimes read as floats). Since I need to reuse this code for a few more import operations, it makes no sense to hard-code the formatting of specific columns and maintain the same code in 4 different scenarios.
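To illustrate the kind of per-cell clean-up I mean, here is a minimal sketch using only the standard library. The helper name `normalize_cell` and its `is_date` flag are hypothetical and not part of xlrd or openpyxl. It assumes the workbook uses Excel's 1900 date system (serial numbers counted from 1899-12-30) and that something else, such as the cell type xlrd reports, tells me which cells hold dates.

```python
from datetime import datetime, timedelta

# Assumption: the workbook uses the 1900 date system, so serial numbers
# count days from this epoch (valid for dates after 1900-03-01).
EXCEL_EPOCH = datetime(1899, 12, 30)

def normalize_cell(value, is_date=False):
    """Hypothetical helper: clean up one raw cell value."""
    if isinstance(value, str):
        # Same whitespace trimming as in the loops above.
        return value.strip()
    if isinstance(value, float):
        if is_date:
            # Date/datetime stored as a serial number; the fractional
            # part is the time of day.
            return EXCEL_EPOCH + timedelta(days=value)
        if value.is_integer():
            # Whole number that came back as a float.
            return int(value)
    return value
```

Even with a helper like this, I would still need to tell it which columns are dates in each of the 4 import scenarios, which is exactly the maintenance burden I am trying to avoid.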
Any tips on how to proceed?