Reading an Excel sheet with multiple headers using pandas

I have an Excel sheet with several header rows, for example:

    _________________________________________________________________________
    ____|_____|      Header1      |      Header2      |      Header3       |
    ColX|ColY |ColA|ColB|ColC|ColD||ColD|ColE|ColF|ColG||ColH|ColI|ColJ|ColDK|
    1   | ds  | 5  | 6  |9   |10  | .......................................
    2   | dh  | ..........................................................
    3   | ge  | ..........................................................
    4   | ew  | ..........................................................
    5   | er  | ..........................................................

As you can see, the first two columns have no top-level header (those cells are empty), while the other columns sit under headers such as Header1, Header2 and Header3. I want to read this sheet and combine it with another sheet that has a similar structure.

I want to merge them on the first column, "ColX". Currently I am doing this:

    import pandas as pd

    totalMergedSheet = pd.DataFrame([1,2,3,4,5], columns=['ColX'])
    file = pd.ExcelFile('ExcelFile.xlsx')
    for i in range (1, len(file.sheet_names)):
        df1 = file.parse(file.sheet_names[i-1])
        df2 = file.parse(file.sheet_names[i])
        newMergedSheet = pd.merge(df1, df2, on='ColX')
        totalMergedSheet = pd.merge(totalMergedSheet, newMergedSheet, on='ColX')

But I suspect it is not reading the columns correctly, and I do not think it will return the results the way I want. I want the resulting frame to look like this:

    ________________________________________________________________________________________________________
    ____|_____|      Header1      |      Header2      |      Header3      |      Header4       |      Header5      |
    ColX|ColY |ColA|ColB|ColC|ColD||ColD|ColE|ColF|ColG||ColH|ColI|ColJ|ColK| ColL|ColM|ColN|ColO||ColP|ColQ|ColR|ColS|
    1   | ds  | 5  | 6  |9   |10  | ..................................................................................
    2   | dh  | ...................................................................................
    3   | ge  | ....................................................................................
    4   | ew  | ...................................................................................
    5   | er  | ......................................................................................

Any suggestions, please? Thanks.

1 answer

Pandas already has a function that will read the entire Excel workbook for you, so you do not need to parse and merge each sheet manually. Take a look at pandas.read_excel(). It not only lets you read an Excel file in a single line, but also provides options that help with the problem you are facing.

Since you have two levels of column headers, you are looking for a MultiIndex. By default, pandas reads only the top row as a single header row. You can pass the header argument to pandas.read_excel() to indicate which rows (counting from 0) should be used as headers. In your particular case, you need header=[0, 1], which specifies the first two rows. You also have multiple sheets, so you can pass sheet_name=None, which means reading all the sheets (older pandas releases spelled this argument sheetname). The command will be as follows:

    import pandas

    df_dict = pandas.read_excel('ExcelFile.xlsx', header=[0, 1], sheet_name=None)
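
To make the two-level header more concrete, here is a minimal sketch of how such a frame can be handled once it is loaded. The frame is built in memory as a hypothetical stand-in for one sheet, with made-up values. The "Unnamed: ..." labels for the blank cells above ColX and ColY are an assumption about how pandas names empty header cells, so it is worth checking df.columns on the real file.

```python
import pandas as pd

# Hypothetical stand-in for one sheet as read with header=[0, 1].
# The "Unnamed: ..." top-level labels are an assumption; inspect
# df.columns on your own file to see the actual labels.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "ColX"),
    ("Unnamed: 1_level_0", "ColY"),
    ("Header1", "ColA"),
    ("Header1", "ColB"),
    ("Header2", "ColE"),
])
sheet = pd.DataFrame(
    [[1, "ds", 5, 6, 9],
     [2, "dh", 7, 8, 10]],
    columns=columns,
)

# Select every column under one top-level header.
header1 = sheet["Header1"]

# Select a single column by its (top, bottom) label pair.
col_b = sheet[("Header1", "ColB")]

# Move the two header-less key columns into the row index and
# name the index levels after their second-level labels.
keys = [c for c in sheet.columns if c[0].startswith("Unnamed")]
indexed = sheet.set_index(keys)
indexed.index.names = [c[1] for c in keys]
```

With ColX and ColY in the index, only the columns that actually sit under Header1, Header2 and so on remain in the column MultiIndex, which may make later alignment between sheets easier.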

This returns a dictionary in which the keys are sheet names and the values are DataFrames, one per sheet. If you want to collapse all of this into one DataFrame, you can simply use pandas.concat (axis=0 stacks the sheets on top of each other):

    df = pandas.concat(df_dict.values(), axis=0)
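
The layout requested in the question places the headers of the second sheet to the right of the first, which corresponds to axis=1 rather than axis=0. The following is only a sketch of that variation, not part of the original answer. It assumes every sheet has already had ColX and ColY moved into the row index and that the sheets share the same keys. The helper make_sheet, the sheet names and all values are hypothetical.

```python
import pandas as pd

def make_sheet(headers, rows):
    """Build a hypothetical sheet keyed by ColX/ColY with two-level columns."""
    index = pd.MultiIndex.from_tuples(
        [(1, "ds"), (2, "dh"), (3, "ge")], names=["ColX", "ColY"]
    )
    return pd.DataFrame(
        rows, index=index, columns=pd.MultiIndex.from_tuples(headers)
    )

# Stand-in for the dictionary returned by read_excel(..., sheet_name=None),
# after the key columns have been moved into the index (an assumption).
df_dict = {
    "Sheet1": make_sheet([("Header1", "ColA"), ("Header1", "ColB")],
                         [[5, 6], [7, 8], [9, 10]]),
    "Sheet2": make_sheet([("Header4", "ColL"), ("Header4", "ColM")],
                         [[11, 12], [13, 14], [15, 16]]),
}

# axis=0 stacks the sheets vertically; columns missing from a sheet become NaN.
stacked = pd.concat(list(df_dict.values()), axis=0)

# axis=1 places the sheets side by side, aligning rows on the shared index.
wide = pd.concat(list(df_dict.values()), axis=1)
```

Here wide keeps one row per (ColX, ColY) pair and gains the columns of every sheet, while stacked repeats the keys once per sheet. If the sheets do not share exactly the same keys, concat with axis=1 performs an outer join by default, so missing rows show up as NaN.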

Source: https://habr.com/ru/post/1012253/

