I would take a completely different approach. We use lxml to scrape HTML pages.
One of the reasons we switched was that BeautifulSoup (BS) was not maintained for a while, or rather, it was not being updated.
In my test, I ran the following:

    import requests
    from lxml import html
    from collections import OrderedDict

    page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content
    tree = html.fromstring(page_as_string)
Looking at the page, I see that the data is split into two tables. Since you want EPS, I noted that it is in the second table. We could write some code to work this out programmatically, but I will leave that to you.

    tables = [e for e in tree.iter() if e.tag == 'table']
    eps_table = tables[-1]
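For what it's worth, picking the table programmatically could look something like the sketch below. It is not what I ran: `find_table_with_label` is a hypothetical helper, and `SAMPLE_PAGE` is a made-up stand-in for the downloaded page, so treat it as an illustration only.

```python
from lxml import html

# Hypothetical stand-in for the downloaded page: two tables, EPS in the second.
SAMPLE_PAGE = """
<html><body>
  <table>
    <tr><th></th><th>2012</th></tr>
    <tr><td>Some other row</td><td>1.00</td></tr>
  </table>
  <table>
    <tr><th></th><th>2012</th></tr>
    <tr><td> EPS (Basic) </td><td>32.81</td></tr>
  </table>
</body></html>
"""


def find_table_with_label(tree, label):
    """Return the first <table> with a <td> whose stripped text equals label, else None."""
    for table in tree.iter('table'):
        for cell in table.iter('td'):
            if cell.text_content().strip() == label:
                return table
    return None


tree = html.fromstring(SAMPLE_PAGE)
eps_table = find_table_with_label(tree, 'EPS (Basic)')
```

This way the code does not depend on the EPS table being the last one on the page. It does assume the row label is spelled exactly as given.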
I noticed that the first row holds the column headings, so I want to separate out all of the rows:

    table_rows = [e for e in eps_table.iter() if e.tag == 'tr']
Now let's get the column headings:

    column_headings = [e.text_content() for e in table_rows[0].iter() if e.tag == 'th']
Finally, we can match the column headings with the row labels and cell values:

    my_results = []
    for row in table_rows[1:]:
        cell_content = [e.text_content() for e in row.iter() if e.tag == 'td']
        temp_dict = OrderedDict()
        for numb, cell in enumerate(cell_content):
            if numb == 0:
                # The first cell of each row is the row label
                temp_dict['row_label'] = cell.strip()
            else:
                # Every other cell is keyed by its column heading
                dict_key = column_headings[numb]
                temp_dict[dict_key] = cell
        my_results.append(temp_dict)
Now, to access the results:

    for row_dict in my_results:
        if row_dict['row_label'] == 'EPS (Basic)':
            for key in row_dict:
                print(key, ':', row_dict[key])

which prints:

    row_label : EPS (Basic)
    2008 : 13.46
    2009 : 20.62
    2010 : 26.69
    2011 : 30.17
    2012 : 32.81
    5-year trend :
There is still more to do. For example, I have not tested for squareness (that the number of cells in each row is equal).
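A squareness check could be as small as the sketch below. `check_squareness` is a hypothetical helper and the rows are made-up sample data. It takes the per-row cell lists (what `cell_content` holds for each row) and reports the rows that do not match the number of column headings.

```python
def check_squareness(rows, expected_width):
    """Return the indexes of rows whose cell count differs from expected_width."""
    return [i for i, cells in enumerate(rows) if len(cells) != expected_width]


# Hypothetical sample data shaped like the per-row cell lists.
column_headings = ['', '2011', '2012']
rows = [
    ['EPS (Basic)', '30.17', '32.81'],
    ['Ragged row', '1.00'],  # one cell short
]

ragged = check_squareness(rows, len(column_headings))
print(ragged)  # [1]
```

An empty list means every row lines up with the headings, so indexing `column_headings[numb]` is safe.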
Finally, I'm a newbie, and I suspect others will suggest more direct ways of getting at these elements (XPath or cssselect), but this does work, and you get everything from the table in a nicely structured form.
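As a rough idea of what the XPath route might look like, here is a sketch run against a small made-up table rather than the real page. `SAMPLE_TABLE` and the variable names are assumptions for illustration; `xpath()` itself is a standard method on lxml elements.

```python
from lxml import html

# Hypothetical stand-in for the EPS table on the page.
SAMPLE_TABLE = """
<table>
  <tr><th></th><th>2011</th><th>2012</th></tr>
  <tr><td>EPS (Basic)</td><td>30.17</td><td>32.81</td></tr>
</table>
"""

table = html.fromstring(SAMPLE_TABLE.strip())

# Column headings: the <th> cells of the first row.
headings = [th.text_content().strip() for th in table.xpath('.//tr[1]/th')]

# All <td> cells of the row whose first cell reads "EPS (Basic)".
eps_cells = table.xpath('.//tr[td[1][normalize-space()="EPS (Basic)"]]/td')
eps_values = [td.text_content().strip() for td in eps_cells[1:]]

# Pair each year heading with its value, skipping the label column.
eps_by_year = dict(zip(headings[1:], eps_values))
```

This goes straight to one row instead of building a dictionary for every row, which is shorter but gives you less to work with afterwards.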
I should add that every row from the table is available, in the original row order. The first element of the my_results list (a dictionary) contains the data from the first row, the second element contains the data from the second row, and so on.
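To illustrate that structure, the sketch below uses a made-up two-row `my_results` (the second row is invented) and shows both positional access and a lookup by label. `rows_by_label` is a hypothetical name, and it assumes row labels are unique.

```python
from collections import OrderedDict

# Hypothetical two-row sample shaped like my_results.
my_results = [
    OrderedDict([('row_label', 'EPS (Basic)'), ('2011', '30.17'), ('2012', '32.81')]),
    OrderedDict([('row_label', 'Another row'), ('2011', '1.00'), ('2012', '2.00')]),
]

# Positional access follows the order of the rows in the table.
first_row = my_results[0]

# Index the rows by label to avoid looping over the list for each lookup.
rows_by_label = {row['row_label']: row for row in my_results}
eps_2012 = rows_by_label['EPS (Basic)']['2012']
```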
When I need a new lxml build, I visit a page maintained by a very nice guy at UC Irvine.
I hope this helps.