ExcelFile vs. read_excel in pandas

I'm diving into pandas and experimenting. When it comes to reading data from an Excel file, I wonder what the difference is between using ExcelFile and read_excel. Both seem to work (albeit with slightly different syntax, as you would expect), and the documentation supports both. In both cases, the documentation describes the method in the same way: "Reading an Excel table in a DataFrame" and "Reading an Excel table in a pandas DataFrame". (See the documentation for read_excel and for ExcelFile.)

I see answers here on SO that use either one without addressing the difference. A Google search also did not turn up anything that discusses this issue.

WRT my testing, they seem equivalent:

    import pandas as pd

    path = "test/dummydata.xlsx"
    xl = pd.ExcelFile(path)
    df = xl.parse("dummydata")  # sheet name

and

    path = "test/dummydata.xlsx"
    # newer pandas versions (0.21+) spell this keyword sheet_name
    df = pd.io.excel.read_excel(path, sheetname=0)

Apart from the latter saving me a line, is there a difference between the two, and is there a reason to use one over the other?

Thanks!

+13
3 answers

There is not much difference apart from the syntax. Technically, ExcelFile is a class and read_excel is a function. Either way, the actual parsing is handled by the _parse_excel method defined on ExcelFile.

In earlier versions of pandas, read_excel consisted entirely of a single statement (apart from comments):

 return ExcelFile(path_or_buf,kind=kind).parse(sheetname=sheetname, kind=kind, **kwds) 

And ExcelFile.parse did not do much more than call ExcelFile._parse_excel.

In recent versions of pandas, read_excel makes sure that it has an ExcelFile object (creating one if it does not) and then calls the _parse_excel method directly:

    if not isinstance(io, ExcelFile):
        io = ExcelFile(io, engine=engine)
    return io._parse_excel(...)

and, with updated (and unified) parameter handling, ExcelFile.parse really is just a single statement:

 return self._parse_excel(...) 

This is why the docs for ExcelFile.parse now say:

Equivalent to read_excel(ExcelFile, ...). See the read_excel docstring for more information on accepted parameters.
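
The delegation described above can be illustrated with a toy model. This is only a minimal sketch and not pandas source code: ToyExcelFile, toy_read_excel, and the in-memory dict standing in for a workbook are hypothetical names invented for illustration. It shows a function that makes sure it has the class instance and then calls the same worker method that the class's own parse method calls.

```python
class ToyExcelFile:
    """Hypothetical stand-in for pandas.ExcelFile; 'sheets' live in a dict."""

    def __init__(self, workbook):
        # A real ExcelFile would open and index a file here.
        self._sheets = dict(workbook)

    @property
    def sheet_names(self):
        return list(self._sheets)

    def _parse_excel(self, sheet_name=0):
        # Accept either a sheet position or a sheet name.
        if isinstance(sheet_name, int):
            sheet_name = self.sheet_names[sheet_name]
        return list(self._sheets[sheet_name])

    def parse(self, sheet_name=0):
        # The method is a thin wrapper around the shared worker.
        return self._parse_excel(sheet_name)


def toy_read_excel(io, sheet_name=0):
    """Hypothetical stand-in for pandas.read_excel."""
    # Make sure we have a ToyExcelFile, creating one if necessary.
    if not isinstance(io, ToyExcelFile):
        io = ToyExcelFile(io)
    return io._parse_excel(sheet_name)


workbook = {"dummydata": [1, 2, 3], "other": [4, 5]}
via_class = ToyExcelFile(workbook).parse("dummydata")
via_function = toy_read_excel(workbook, sheet_name=0)
```

Both entry points end up in the same worker method, so via_class and via_function hold the same data.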

As for the other answer, which claims that ExcelFile.parse is faster in a loop, it really comes down to whether you create an ExcelFile object from scratch on every iteration. You can of course create your ExcelFile once, outside the loop, and pass it to read_excel inside the loop:

    xl = pd.ExcelFile(path)
    for name in xl.sheet_names:
        df = pd.read_excel(xl, name)

That would be equivalent to

    xl = pd.ExcelFile(path)
    for name in xl.sheet_names:
        df = xl.parse(name)

If your loop involves different paths (in other words, you are reading many different workbooks, not just several sheets within one workbook), then you cannot avoid creating a new ExcelFile instance for each path, and once again ExcelFile.parse and read_excel will be equivalent (and equally slow).
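
The cost argument can be made concrete by counting how often a workbook object gets constructed. This is a rough sketch under stated assumptions: CountingWorkbook, read_sheet, count_opens, and the file name book.xlsx are hypothetical stand-ins, not pandas APIs, and no real file is read. It only shows that the number of opens depends on what is passed inside the loop, not on whether the method or the function is called.

```python
class CountingWorkbook:
    """Hypothetical stand-in for pandas.ExcelFile that counts constructions."""

    opens = 0  # class-level counter, incremented once per construction

    def __init__(self, path):
        CountingWorkbook.opens += 1
        self.path = path
        # Pretend every workbook has the same three sheets.
        self.sheet_names = ["a", "b", "c"]

    def parse(self, name):
        return f"{self.path}:{name}"


def read_sheet(io, name):
    """Hypothetical stand-in for pandas.read_excel."""
    if not isinstance(io, CountingWorkbook):
        io = CountingWorkbook(io)  # a path was passed, so open it now
    return io.parse(name)


def count_opens(use_path):
    """Read every sheet and report how many times the workbook was opened."""
    CountingWorkbook.opens = 0
    xl = CountingWorkbook("book.xlsx")
    for name in xl.sheet_names:
        read_sheet("book.xlsx" if use_path else xl, name)
    return CountingWorkbook.opens


opens_with_path = count_opens(use_path=True)     # path passed on every iteration
opens_with_object = count_opens(use_path=False)  # one object reused
```

In this toy run opens_with_path is 4 (one open to list the sheets, then one per sheet), while opens_with_object is 1.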

+13

ExcelFile.parse is faster.

Suppose you are reading data in a loop. With ExcelFile.parse you just pass in the ExcelFile object (xl in your case). This way the workbook is loaded only once and you use it to get your data. With read_excel, you pass the path instead of the ExcelFile object, so the workbook is loaded again every time. That becomes a problem if your workbook has many sheets and tens of thousands of rows.

+9

I believe that pandas' first Excel implementation used a two-step process, and a one-step function called read_excel was added later. The first approach was probably kept because people were already using it.

+4

Source: https://habr.com/ru/post/977000/

