I am extracting content from a website whose pages contain multiple tables with the same number of columns, using pandas read_html. When I read a single link, which has multiple tables with the same number of columns, pandas effectively reads all the tables as one (something like a flat / normalized table). However, I want to do the same for a list of links from the website (i.e. one flat table for multiple links), so I tried the following:
IN:
import multiprocessing
import pandas as pd

def process(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=False)
    return df_url

links = ['link1.com','link2.com','link3.com',...,'linkN.com']
pool = multiprocessing.Pool(processes=6)
df = pool.map(process, links)
df
However, I assume that I am not passing the right arguments to read_html() (for the columns), so I get this invalid list of lists:
OUT:
[[ Form Disponibility \ 0 290090 01780-500-01) Unavailable - no product available for release. Relation \ Relation drawbacks 0 NaN Removed 1 NaN Removed ], [ Form \ Relation \ 0 American Regent is currently releasing the 0.4... 1 American Regent is currently releasing the 1mg... drawbacks 0 Demand increase for the drug 1 Removed , Form \ 0 0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N... Disponibility Relation \ 0 Product available NaN 2 Removed 3 Removed ]]
So my question is: which parameter do I need to change to get a flat pandas DataFrame from the above nested list? I tried header=0, index_col=0 and match='"columns"', but none of them worked. Or do I need to do the alignment when creating the DataFrame with pd.DataFrame()? My main goal is to have a pandas DataFrame with these columns:
     form, Disponibility, Relation, drawbacks
1
2
...
n
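
To make the target concrete, here is a minimal sketch of the flattening I am after. It uses small hand-made frames instead of real links, so `nested` and every cell value in it are hypothetical placeholders; only the column names come from the output above. It assumes each link yields a list of DataFrames with identical columns, and it may not be the right approach for the real pages:

```python
import itertools

import pandas as pd

COLUMNS = ["Form", "Disponibility", "Relation", "drawbacks"]


def make_table(n_rows, tag):
    """Build a placeholder table with the four columns (values are made up)."""
    return pd.DataFrame(
        {col: [f"{tag}-{col}-{i}" for i in range(n_rows)] for col in COLUMNS}
    )


# Hypothetical stand-in for the result of pool.map(process, links) when
# process() returns the raw read_html() output: one list of tables per link.
nested = [
    [make_table(1, "link1")],
    [make_table(2, "link2a"), make_table(1, "link2b")],
]

# Flatten the list of lists into a single list of DataFrames ...
tables = list(itertools.chain.from_iterable(nested))

# ... then stack them vertically and renumber the rows from 0.
flat = pd.concat(tables, ignore_index=True)
print(flat)
```

With this shape, the result is one table with the four columns and a single running index, which is what I would like to get from the real links.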