How to reindex invalid columns retrieved from pandas read_html?

Question


I am extracting content from a website with pandas read_html. A single link contains multiple tables with the same number of columns, and for one link pandas effectively reads all of them as one flat (normalized) table. I would like to do the same for a list of links from the site (i.e. one flat table for multiple links), so I tried the following:

IN:

 import multiprocessing
 import pandas as pd

 def process(url):
     df_url = pd.read_html(url)
     df = pd.concat(df_url, ignore_index=False)
     return df_url

 links = ['link1.com','link2.com','link3.com',...,'linkN.com']

 pool = multiprocessing.Pool(processes=6)
 df = pool.map(process, links)
 df

However, I assume that I am not using read_html() correctly with respect to the columns, so I get this invalid list of lists:

OUT:

 [[ Form Disponibility \ 0 290090 01780-500-01) Unavailable - no product available for release. Relation \ Relation drawbacks 0 NaN Removed 1 NaN Removed ], [ Form \ Relation \ 0 American Regent is currently releasing the 0.4... 1 American Regent is currently releasing the 1mg... drawbacks 0 Demand increase for the drug 1 Removed , Form \ 0 0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N... Disponibility Relation \ 0 Product available NaN 2 Removed 3 Removed ]]

So my question is: which parameter do I need to change to get a flat pandas DataFrame from the nested list above? I tried header=0, index_col=0, and match='"columns"', but none of them worked. Or do I need to do the alignment when creating the DataFrame with pd.DataFrame()? My main goal is to have a pandas DataFrame with these columns:

 form, Disponibility, Relation, drawbacks
 1
 2
 ...
 n


python python-3.x pandas multiprocessing dataframe

tumbleweed Nov 05 '16 at 5:50


1 answer

Maxu · Accepted Answer · 2016-11-05T10:21:10+0000

IIUC you can do it like this:

First, return a concatenated DataFrame instead of a list of DataFrames (read_html returns a list of DataFrames):

 def process(url):
     return pd.concat(pd.read_html(url), ignore_index=False)

and then combine them for all urls:

 df = pd.concat(pool.map(process, links), ignore_index=True)
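If the nested list from the question is already in memory, it can probably be flattened without fetching the pages again. The following is only a minimal sketch under that assumption: the names `nested` and `flat` and all cell values are made up for illustration, and the small hand-built frames stand in for what read_html would return for each URL.

```python
import itertools

import pandas as pd

# Stand-in for the result of pool.map(...) in the question:
# one list of DataFrames per URL. All values here are invented.
nested = [
    [
        pd.DataFrame({"Form": ["a"], "Disponibility": ["Unavailable"]}),
        pd.DataFrame({"Relation": ["x", "y"], "drawbacks": ["Removed", "Removed"]}),
    ],
    [
        pd.DataFrame({"Form": ["b"], "Relation": ["z"], "drawbacks": ["Removed"]}),
    ],
]

# Flatten the list of lists into a single list of DataFrames.
tables = list(itertools.chain.from_iterable(nested))

# Concatenate once. pd.concat aligns on column names, so a column that is
# missing from a given table is filled with NaN for that table's rows.
# ignore_index=True replaces the repeated per-table indexes with 0..n-1.
flat = pd.concat(tables, ignore_index=True, sort=False)

print(flat)
```

Because the alignment is done by column name, this only gives one clean set of columns if the tables use the same header labels; tables whose headers differ would end up with extra columns rather than being merged.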
