ExcelFile Vs. read_excel in pandas

Question

I'm diving into pandas and experimenting. Regarding reading data from an Excel file, I wonder what the difference is between using ExcelFile and read_excel. Both seem to work (albeit with slightly different syntax, as you would expect), and the documentation supports both. In both cases, the documentation describes the method the same way: "Reading an Excel table in a DataFrame" and "Reading an Excel table in a pandas DataFrame". (documentation for read_excel and for excel_file)

I've seen answers here on SO that use either one, without addressing the difference. A Google search also did not turn up anything that discusses this issue.

WRT my testing, they seem equivalent:

 path = "test/dummydata.xlsx"
 xl = pd.ExcelFile(path)
 df = xl.parse("dummydata")  # sheet name

and

 path = "test/dummydata.xlsx"
 df = pd.io.excel.read_excel(path, sheetname=0)

Other than the fact that the latter saves me a line, is there a difference between the two, and is there a reason to use one over the other?

Thanks!

python pandas excel

Optimesh Oct 20 '14 at 20:51

3 answers

John y · Answer 1 · 2018-04-23T22:58:27+0000

There is no real difference apart from the syntax. Technically, ExcelFile is a class and read_excel is a function. In either case, the actual parsing is handled by the _parse_excel method defined on ExcelFile.

In earlier versions of pandas, read_excel consisted entirely of a single statement (not counting comments):

 return ExcelFile(path_or_buf,kind=kind).parse(sheetname=sheetname, kind=kind, **kwds)

And ExcelFile.parse did not do much more than call ExcelFile._parse_excel.

In recent versions of pandas, read_excel makes sure it has an ExcelFile object (creating one if it does not) and then calls the _parse_excel method directly:

 if not isinstance(io, ExcelFile):
     io = ExcelFile(io, engine=engine)
 return io._parse_excel(...)

and, with the updated (and unified) parameter handling, ExcelFile.parse really is just a single statement:

 return self._parse_excel(...)

This is why the docs for ExcelFile.parse now say

Equivalent to read_excel(ExcelFile, ...). See the read_excel docstring for more information on accepted parameters.
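To make the equivalence concrete, here is a minimal sketch that reads the same sheet three ways and compares the results. It assumes a recent pandas with openpyxl installed for .xlsx support, and it uses the newer sheet_name keyword (the sheetname spelling from the question was later renamed). The temporary workbook, its file name, and its column names are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Build a small throwaway workbook so the sketch is self-contained.
# The file name, sheet name and columns are hypothetical.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "dummydata.xlsx")
pd.DataFrame({"a": [1, 2, 3], "b": [4.5, 5.5, 6.5]}).to_excel(
    path, sheet_name="dummydata", index=False
)

# Two-step form: open the workbook, then parse one sheet from it.
with pd.ExcelFile(path) as xl:
    df_parse = xl.parse("dummydata")
    # read_excel also accepts the already-open ExcelFile object.
    df_from_obj = pd.read_excel(xl, sheet_name="dummydata")

# One-step form: read_excel opens the workbook itself.
df_read = pd.read_excel(path, sheet_name="dummydata")

# All three reads should yield identical DataFrames.
print(df_parse.equals(df_read), df_parse.equals(df_from_obj))
```

Which form to use is then mostly a question of whether you already hold an ExcelFile object.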

As for the other answer, which claims that ExcelFile.parse is faster in a loop, it really comes down to whether you create the ExcelFile object from scratch on every iteration. You can certainly create your ExcelFile once, outside the loop, and pass it to read_excel inside the loop:

 xl = pd.ExcelFile(path)
 for name in xl.sheet_names:
     df = pd.read_excel(xl, name)

That would be equivalent to

 xl = pd.ExcelFile(path)
 for name in xl.sheet_names:
     df = xl.parse(name)

If your loop runs over different paths (in other words, you are reading many different workbooks, not just several sheets in one workbook), then you cannot avoid creating a new ExcelFile instance for each path, and once again ExcelFile.parse and read_excel will be equivalent (and equally slow).
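The "open once, parse many sheets" pattern can be wrapped in a small helper. This is only a sketch: read_all_sheets is a hypothetical name, the two-sheet workbook is invented for the example, and openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd


def read_all_sheets(path):
    """Open the workbook once and parse every sheet into a dict of DataFrames."""
    with pd.ExcelFile(path) as xl:
        return {name: xl.parse(name) for name in xl.sheet_names}


# Hypothetical two-sheet workbook, written to a temp directory for the demo.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "book.xlsx")
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"x": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"y": [3, 4, 5]}).to_excel(writer, sheet_name="second", index=False)

frames = read_all_sheets(path)
print({name: df.shape for name, df in frames.items()})
```

With several workbooks you would call the helper once per path, so each file is still opened exactly once.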

Pranav kohli · Answer 2 · 2016-07-25T05:20:50+0000

ExcelFile.parse is faster.

Suppose you are reading data in a loop. With ExcelFile.parse you just use the ExcelFile object (xl in your case), so the workbook is loaded only once and you reuse it to get your data. With read_excel you pass the path instead of the ExcelFile object, so the workbook is loaded again on every call. That makes a mess if your workbook has many sheets and tens of thousands of rows.
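If the goal is simply every sheet of one workbook, another option in newer pandas versions is to skip the loop and pass sheet_name=None to read_excel, which returns a dict of DataFrames keyed by sheet name from a single call. The sketch below assumes a recent pandas with openpyxl installed; the workbook and its sheet names are made up.

```python
import os
import tempfile

import pandas as pd

# Hypothetical workbook with three small sheets.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "many_sheets.xlsx")
with pd.ExcelWriter(path) as writer:
    for i in range(3):
        pd.DataFrame({"value": [i, i + 1]}).to_excel(
            writer, sheet_name=f"sheet{i}", index=False
        )

# sheet_name=None asks for all sheets; the result maps sheet name -> DataFrame.
all_frames = pd.read_excel(path, sheet_name=None)
for name, df in all_frames.items():
    print(name, df.shape)
```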

Bob haffner · Answer 3 · 2014-10-20T21:11:03+0000

I believe that pandas's first Excel implementation used the two-step process, and the one-step read_excel was added later. The first was probably left in place because people were already using it.
