How to quickly open an Excel file in Python?

I currently use PyExcelerator to read Excel files, but it is very slow. Since I regularly need to open Excel files larger than 100 MB, it takes more than 20 minutes to load a single file.

I need the following functionality:

  • Open Excel files, select specific tables, and load them into a Dict or List object.
  • Sometimes: select specific columns and load only the rows in which those columns have specific values.
  • Read password-protected Excel files.

And the code I'm using now is:

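    from collections import defaultdict
    from types import StringType, UnicodeType  # Python 2 only

    import pyExcelerator

    book = pyExcelerator.parse_xls(filepath)
    # Cells missing from the parsed sheet default to an empty string.
    parsed_dictionary = defaultdict(lambda: '', book[0][1])
    number_of_columns = 44
    result_list = []
    number_of_rows = 500000
    for i in range(0, number_of_rows):
        ok = False
        result_list.append([])
        for h in range(0, number_of_columns):
            item = parsed_dictionary[i,h]
            if type(item) is StringType or type(item) is UnicodeType:
                item = item.replace("\t","").strip()
            result_list[i].append(item)
            if item != '':
                ok = True
        # Stop at the first row in which every cell is empty.
        if not ok:
            break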

Any suggestions?

+6
4 answers

pyExcelerator does not appear to be maintained. To write xls files, use xlwt, which is a fork of pyExcelerator with bug fixes and many enhancements. The (very basic) xls reading capability of pyExcelerator was removed from xlwt. To read xls files, use xlrd.
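As a rough illustration of the xlrd route, here is a minimal Python 3 sketch that reads the first sheet row by row and stops at the first completely empty row. The helper names (`clean_cell`, `rows_until_blank`, `load_first_sheet`) are made up for this example, and xlrd is a third-party package that has to be installed separately. Unlike the loop in the question, the blank row itself is not added to the result.

```python
def clean_cell(value):
    """Strip tabs and surrounding whitespace from text cells; leave other types alone."""
    if isinstance(value, str):
        return value.replace("\t", "").strip()
    return value


def rows_until_blank(rows):
    """Collect cleaned rows, stopping at the first row whose cells are all empty."""
    result = []
    for row in rows:
        cleaned = [clean_cell(cell) for cell in row]
        if all(cell == "" for cell in cleaned):
            break
        result.append(cleaned)
    return result


def load_first_sheet(filepath):
    """Read the first sheet of an .xls file with xlrd (hypothetical helper)."""
    import xlrd  # imported here so the helpers above work without xlrd installed

    # on_demand=True asks xlrd to load a sheet only when it is requested.
    book = xlrd.open_workbook(filepath, on_demand=True)
    sheet = book.sheet_by_index(0)
    return rows_until_blank(sheet.row_values(r) for r in range(sheet.nrows))
```

Calling `load_first_sheet(filepath)` would then return a list of rows, each a list of cell values. Selecting only rows with particular column values could be done with a list comprehension over that result.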

If it takes 20 minutes to load a 100 MB xls file, you must be using one or more of: a slow computer, a computer with very little available memory, or an older version of Python.

Neither pyExcelerator nor xlrd reads password-protected files.

Here is a link that covers xlrd and xlwt.

Disclaimer: I am the author of xlrd and the maintainer of xlwt.

+5

xlrd is good for reading files, and xlwt is pretty good for writing. Both are superior to pyExcelerator in my experience.

+2

You can try preallocating the list to its final size in one expression instead of appending one element at a time (one large memory allocation should be faster than many small ones):

    from collections import defaultdict
    from types import StringType, UnicodeType  # Python 2 only

    import pyExcelerator

    book = pyExcelerator.parse_xls(filepath)
    parsed_dictionary = defaultdict(lambda: '', book[0][1])
    number_of_columns = 44
    number_of_rows = 500000
    # Preallocate one separate empty list per row.
    result_list = [[] for x in range(number_of_rows)]
    for i in range(0, number_of_rows):
        ok = False
        #result_list.append([])
        for h in range(0, number_of_columns):
            item = parsed_dictionary[i,h]
            if type(item) is StringType or type(item) is UnicodeType:
                item = item.replace("\t","").strip()
            result_list[i].append(item)
            if item != '':
                ok = True
        # Rows after the first blank row stay as empty lists.
        if not ok:
            break
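Whether preallocation actually helps is easy to measure before restructuring the loop. The following Python 3 sketch uses the standard `timeit` module to compare growing the outer list by appending with building it in one expression. The function names and the reduced row count are assumptions made for this example, and the difference may well be small next to the time spent parsing the file.

```python
import timeit

NUMBER_OF_ROWS = 100_000  # smaller than the 500000 in the question, to keep the run short


def build_by_append():
    """Grow the outer list one row at a time, as in the original loop."""
    result = []
    for _ in range(NUMBER_OF_ROWS):
        result.append([])
    return result


def build_preallocated():
    """Create all the row lists up front in a single expression."""
    return [[] for _ in range(NUMBER_OF_ROWS)]


if __name__ == "__main__":
    for func in (build_by_append, build_preallocated):
        seconds = timeit.timeit(func, number=10)
        print(f"{func.__name__}: {seconds:.3f} s for 10 runs")
```

Both functions produce the same structure, so whichever turns out faster on your machine can be swapped in without touching the rest of the code.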

If this gives a noticeable performance gain, you can also try preallocating each row of the list to the number of columns and then assigning cells by index instead of appending one value at a time. Here is a snippet that creates a 10x10 two-dimensional list in one expression with an initial value of 0:

 L = [[0] * 10 for i in range(10)] 
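One detail worth knowing about this idiom: the list comprehension is what makes each row a separate list. A short Python 3 sketch (the variable names are chosen just for this example) shows what happens if the outer list is built by multiplication instead:

```python
rows, cols = 3, 4

# Each iteration of the comprehension builds a new inner list.
independent = [[0] * cols for _ in range(rows)]
independent[0][0] = 1

# Multiplying the outer list copies references to one and the same inner list.
aliased = [[0] * cols] * rows
aliased[0][0] = 1

print(independent)  # [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(aliased)      # [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
```

With the multiplied form, writing to one row changes every row, so the comprehension form is the one to use for a preallocated table.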

So, folded into your code, it might work something like this:

    from collections import defaultdict
    from types import StringType, UnicodeType  # Python 2 only

    import pyExcelerator

    book = pyExcelerator.parse_xls(filepath)
    parsed_dictionary = defaultdict(lambda: '', book[0][1])
    number_of_columns = 44
    number_of_rows = 500000
    # One row per sheet row, each preallocated to the number of columns.
    result_list = [[''] * number_of_columns for x in range(number_of_rows)]
    for i in range(0, number_of_rows):
        ok = False
        #result_list.append([])
        for h in range(0, number_of_columns):
            item = parsed_dictionary[i,h]
            if type(item) is StringType or type(item) is UnicodeType:
                item = item.replace("\t","").strip()
            result_list[i][h] = item
            if item != '':
                ok = True
        # Rows after the first blank row keep their preallocated empty strings.
        if not ok:
            break

+1

Unrelated to your question: if you are trying to check that none of the columns is an empty string, set ok = True first and update it in the inner loop instead (ok = ok and item != ''). Also, you can simply use isinstance(item, basestring) to check whether the variable is a string.

Revised version

    for i in range(0, number_of_rows):
        ok = True
        result_list.append([])
        for h in range(0, number_of_columns):
            item = parsed_dictionary[i,h]
            if isinstance(item, basestring):
                item = item.replace("\t","").strip()
            result_list[i].append(item)
            ok = ok and item != ''
        if not ok:
            break
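Note that the two forms of the `ok` flag do not mean the same thing. The loop in the question sets `ok` as soon as one cell in the row is non-empty, so it stops only at a completely empty row. The `ok = ok and item != ''` form keeps `ok` true only if every cell is non-empty, so it stops at the first row with any empty cell. A small Python 3 sketch with made-up sample rows and helper names shows the difference using `any` and `all`:

```python
def has_any_value(row):
    """True if at least one cell is non-empty (the `ok` flag in the question's loop)."""
    return any(item != "" for item in row)


def has_all_values(row):
    """True only if every cell is non-empty (the `ok = ok and item != ''` flag)."""
    return all(item != "" for item in row)


sample_rows = [
    ["a", "b", "c"],  # fully populated
    ["a", "", "c"],   # one empty cell
    ["", "", ""],     # completely empty
]

for row in sample_rows:
    print(row, has_any_value(row), has_all_values(row))
```

Which check is right depends on whether a partly filled row should end the read or not.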
+1

Source: https://habr.com/ru/post/887161/