Reading an Excel file is much slower with openpyxl than with xlrd

I have an Excel spreadsheet that I need to import into SQL Server on a daily basis. The spreadsheet will contain about 250,000 rows across 50 columns. I tested both openpyxl and xlrd using almost identical code.

Here's the code I'm using (minus the debugging statements):

    import xlrd
    import openpyxl

    def UseXlrd(file_name):
        workbook = xlrd.open_workbook(file_name, on_demand=True)
        worksheet = workbook.sheet_by_index(0)
        first_row = []
        for col in range(worksheet.ncols):
            first_row.append(worksheet.cell_value(0, col))
        data = []
        for row in range(1, worksheet.nrows):
            record = {}
            for col in range(worksheet.ncols):
                if isinstance(worksheet.cell_value(row, col), str):
                    record[first_row[col]] = worksheet.cell_value(row, col).strip()
                else:
                    record[first_row[col]] = worksheet.cell_value(row, col)
            data.append(record)
        return data

    def UseOpenpyxl(file_name):
        wb = openpyxl.load_workbook(file_name, read_only=True)
        sheet = wb.active
        first_row = []
        for col in range(1, sheet.max_column + 1):
            first_row.append(sheet.cell(row=1, column=col).value)
        data = []
        for r in range(2, sheet.max_row + 1):
            record = {}
            for col in range(sheet.max_column):
                if isinstance(sheet.cell(row=r, column=col + 1).value, str):
                    record[first_row[col]] = sheet.cell(row=r, column=col + 1).value.strip()
                else:
                    record[first_row[col]] = sheet.cell(row=r, column=col + 1).value
            data.append(record)
        return data

    xlrd_results = UseXlrd('foo.xls')
    openpyxl_results = UseOpenpyxl('foo.xls')

Processing the same Excel file containing 3,500 rows gives significantly different runtimes. Using xlrd, I can read the entire file into a list of dictionaries in less than 2 seconds. Using openpyxl, I get the following results:

    Reading Excel File...
    Read 100 lines in 114.14509415626526 seconds
    Read 200 lines in 471.43183994293213 seconds
    Read 300 lines in 982.5288782119751 seconds
    Read 400 lines in 1729.3348784446716 seconds
    Read 500 lines in 2774.886833190918 seconds
    Read 600 lines in 4384.074863195419 seconds
    Read 700 lines in 6396.7723388671875 seconds
    Read 800 lines in 7998.775000572205 seconds
    Read 900 lines in 11018.460735321045 seconds

Although I could just use xlrd for the final script, I would have to hard-code a lot of formatting because of various problems (e.g. ints read as floats, dates read as ints, datetimes read as floats). Given that I need to reuse this code for a few more import operations, it makes no sense to hard-code certain columns to format them correctly and then maintain the same code in 4 different scenarios.
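To illustrate the kind of hard-coding I mean, here is a rough sketch (not my actual script; the helper name and the int/date handling are just an example) of what coercing xlrd's raw values back into ints and datetimes would look like:

    import xlrd

    def convert_cell(cell, datemode):
        # xlrd returns dates as serial floats and ints as floats;
        # cell.ctype says how the raw value should be interpreted
        if cell.ctype == xlrd.XL_CELL_DATE:
            return xlrd.xldate_as_datetime(cell.value, datemode)
        if cell.ctype == xlrd.XL_CELL_NUMBER and cell.value == int(cell.value):
            return int(cell.value)
        if cell.ctype == xlrd.XL_CELL_TEXT:
            return cell.value.strip()
        return cell.value

    workbook = xlrd.open_workbook('foo.xls', on_demand=True)
    worksheet = workbook.sheet_by_index(0)
    first_value = convert_cell(worksheet.cell(1, 0), workbook.datemode)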

Any tips on how to proceed?

2 answers

In read-only mode, every random-access sheet.cell() call forces openpyxl to re-parse the worksheet up to that cell, which is why your timings grow roughly quadratically. You can simply iterate over the sheet's rows instead:

    def UseOpenpyxl(file_name):
        wb = openpyxl.load_workbook(file_name, read_only=True)
        sheet = wb.active
        rows = sheet.rows
        first_row = [cell.value for cell in next(rows)]
        data = []
        for row in rows:
            record = {}
            for key, cell in zip(first_row, row):
                if cell.data_type == 's':
                    record[key] = cell.value.strip()
                else:
                    record[key] = cell.value
            data.append(record)
        return data

This should scale to large files. You might want to write out your results in chunks if the data list gets too big to hold in memory.
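If memory does become an issue, one option is to yield the records in batches instead of building one big list; a rough sketch (the batch size and the insert_records() loader are placeholders, not part of the code above):

    import openpyxl

    def stream_openpyxl(file_name, batch_size=10000):
        # yield lists of row dicts so the whole sheet never sits in memory at once
        wb = openpyxl.load_workbook(file_name, read_only=True)
        rows = wb.active.rows
        header = [cell.value for cell in next(rows)]
        batch = []
        for row in rows:
            batch.append(dict(zip(header, (cell.value for cell in row))))
            if len(batch) >= batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    # for records in stream_openpyxl('foo.xlsx'):
    #     insert_records(records)  # hypothetical bulk insert into SQL Server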

The openpyxl version now takes only about twice as long as the xlrd one:

    %timeit xlrd_results = UseXlrd('foo.xlsx')
    1 loops, best of 3: 3.38 s per loop

    %timeit openpyxl_results = UseOpenpyxl('foo.xlsx')
    1 loops, best of 3: 6.87 s per loop

Note that xlrd and openpyxl may interpret what is an integer and what is a float slightly differently. For my test data, I needed to add float() to make the results comparable:

    def UseOpenpyxl(file_name):
        wb = openpyxl.load_workbook(file_name, read_only=True)
        sheet = wb.active
        rows = sheet.rows
        first_row = [float(cell.value) for cell in next(rows)]
        data = []
        for row in rows:
            record = {}
            for key, cell in zip(first_row, row):
                if cell.data_type == 's':
                    record[key] = cell.value.strip()
                else:
                    record[key] = float(cell.value)
            data.append(record)
        return data

Now both versions give the same results for my test data:

    >>> xlrd_results == openpyxl_results
    True

This sounds to me like the perfect candidate for the Pandas module:

    import pandas as pd
    from sqlalchemy import create_engine

    # pyodbc
    #
    # assuming the following:
    # username: scott
    # password: tiger
    # DSN: mydsn
    engine = create_engine('mssql+pyodbc://scott:tiger@mydsn')

    # pymssql
    #
    #engine = create_engine('mssql+pymssql://scott:tiger@hostname:port/dbname')

    df = pd.read_excel('foo.xls')

    # write the DataFrame to a table in the sql database
    df.to_sql("table_name", engine)

Documentation: DataFrame.to_sql()

PS It should be pretty fast and very easy to use.
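If writing 250,000 rows in one go is too heavy for a single insert, to_sql() can also append in batches; a minimal sketch (the connection string and table name are placeholders, as above):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine('mssql+pyodbc://scott:tiger@mydsn')
    df = pd.read_excel('foo.xls')

    # append to an existing table, sending 1,000 rows per batch
    df.to_sql("table_name", engine, if_exists="append", index=False, chunksize=1000)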

