Parsing HTML data into a Python list for manipulation

I am trying to read HTML pages and retrieve their data. For example, I would like to read in the EPS (earnings per share) for the last 5 years of companies. I can fetch the page and use either BeautifulSoup or html2text to create a huge text block. Then I search that text with re.search, but I can't get it to work correctly. Here is the line I'm trying to access:

EPS (Basic)\n13.4620.6226.6930.1732.81\n\n

So, I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].
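Since the five yearly values are run together in the scraped text, one way to recover them (assuming every value keeps exactly two decimal places, which holds for this line but may not for other fields) is:

```python
import re

# The scraped line runs the five yearly values together:
raw = "EPS (Basic)\n13.4620.6226.6930.1732.81\n\n"

# Each value has exactly two decimal places, so \d+\.\d{2} can
# pull them apart again.
eps = [float(m) for m in re.findall(r"\d+\.\d{2}", raw)]
print(eps)  # [13.46, 20.62, 26.69, 30.17, 32.81]
```

This is fragile by design: it relies on the two-decimal format, so a value like 7.5 or 1,234.56 would break it.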

Thanks for any help.

    from stripogram import html2text
    from urllib import urlopen
    import re
    from BeautifulSoup import BeautifulSoup

    ticker_symbol = 'goog'
    url = 'http://www.marketwatch.com/investing/stock/'
    full_url = url + ticker_symbol + '/financials'  # build url
    text_soup = BeautifulSoup(urlopen(full_url).read())  # read in
    text_parts = text_soup.findAll(text=True)
    text = ''.join(text_parts)
    eps = re.search("EPS\s+(\d+)", text)
    if eps is not None:
        print eps.group(1)
3 answers

It's bad practice to use regex to parse HTML. Use the BeautifulSoup parser instead: find the cell with the rowTitle class and the 'EPS (Basic)' text in it, then move on to its following siblings with the valueCell class:

    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup

    url = 'http://www.marketwatch.com/investing/stock/goog/financials'
    text_soup = BeautifulSoup(urlopen(url).read())  # read in

    titles = text_soup.findAll('td', {'class': 'rowTitle'})
    for title in titles:
        if 'EPS (Basic)' in title.text:
            print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]

prints:

 ['13.46', '20.62', '26.69', '30.17', '32.81'] 
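To see the same sibling walk without hitting the live page, here is a minimal, self-contained sketch using the modern bs4 package; the class names come from the answer above, but the markup is a stand-in for the real page, and the values are converted to floats as the question asked:

```python
from bs4 import BeautifulSoup

# Stand-in for the MarketWatch markup (class names per the answer above).
html_doc = """
<table>
  <tr>
    <td class="rowTitle">EPS (Basic)</td>
    <td class="valueCell">13.46</td>
    <td class="valueCell">20.62</td>
    <td class="valueCell">26.69</td>
    <td class="valueCell">30.17</td>
    <td class="valueCell">32.81</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")

eps = []
for title in soup.find_all("td", {"class": "rowTitle"}):
    if "EPS (Basic)" in title.text:
        # Walk the following siblings and convert each cell to a float.
        eps = [float(td.text)
               for td in title.find_next_siblings("td", {"class": "valueCell"})]
print(eps)  # [13.46, 20.62, 26.69, 30.17, 32.81]
```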

Hope this helps.


I would take a completely different approach. We use lxml for parsing HTML pages.

One of the reasons we switched was that BeautifulSoup went unmaintained for a while; or, I should say, it stopped being updated.

In my test, I ran the following

    import requests
    from lxml import html
    from collections import OrderedDict

    page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content
    tree = html.fromstring(page_as_string)

Now I looked at the page and saw that the data is split across two tables. Since you want EPS, I noticed that it is in the second table. We could write code to pick the right table programmatically, but I will leave that to you.
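One possible sketch of "picking the table programmatically" is to select the table whose text contains the row label we want, rather than relying on it being last; the markup below is a stand-in, and the structure of the real page is assumed:

```python
from lxml import html

# Two stand-in tables; only the second contains the EPS row.
page = """
<div>
  <table><tr><td>Revenue</td><td>21,796</td></tr></table>
  <table><tr><td>EPS (Basic)</td><td>13.46</td></tr></table>
</div>
"""
tree = html.fromstring(page)

# Keep the first table whose text mentions the row we are after.
eps_table = next(t for t in tree.iter('table')
                 if 'EPS (Basic)' in t.text_content())
```

This is sturdier than `tables[-1]` if the site ever adds or reorders tables, though it still assumes the label text stays stable.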

    tables = [e for e in tree.iter() if e.tag == 'table']
    eps_table = tables[-1]

Now I noticed that the first row holds the column headings, so I want to separate out all of the rows:

    table_rows = [e for e in eps_table.iter() if e.tag == 'tr']

Now let's get the column headings:

    column_headings = [e.text_content() for e in table_rows[0].iter() if e.tag == 'th']

Finally, we can match the column headings up with the row labels and cell values:

    my_results = []
    for row in table_rows[1:]:
        cell_content = [e.text_content() for e in row.iter() if e.tag == 'td']
        temp_dict = OrderedDict()
        for numb, cell in enumerate(cell_content):
            if numb == 0:
                temp_dict['row_label'] = cell.strip()
            else:
                dict_key = column_headings[numb]
                temp_dict[dict_key] = cell
        my_results.append(temp_dict)

Now, to access the results:

    for row_dict in my_results:
        if row_dict['row_label'] == 'EPS (Basic)':
            for key in row_dict:
                print key, ':', row_dict[key]

which prints:

    row_label : EPS (Basic)
    2008 : 13.46
    2009 : 20.62
    2010 : 26.69
    2011 : 30.17
    2012 : 32.81
    5-year trend :

There is still much to do; for example, I have not tested for squareness (that the number of cells in each row is equal).

Finally, I'm a newbie and I suspect others will advise more direct methods of getting at these elements (XPath or cssselect), but this works and gets you everything from the table in a nicely structured way.
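For completeness, here is what one of those more direct methods might look like: a single XPath query that does the work of the loops above, run against a small stand-in table (the real page's structure is assumed):

```python
from lxml import html

# Stand-in for the financials table: a heading row, then a data row.
page = """
<table>
  <tr><th></th><th>2008</th><th>2009</th></tr>
  <tr><td>EPS (Basic)</td><td>13.46</td><td>20.62</td></tr>
</table>
"""
tree = html.fromstring(page)

# Select the row whose first cell reads 'EPS (Basic)', then take the
# text of every cell after the label.
cells = tree.xpath('//tr[td[1]="EPS (Basic)"]/td[position()>1]/text()')
print(cells)  # ['13.46', '20.62']
```

The trade-off is the usual one: the XPath is compact, but it encodes the table layout in one opaque string instead of explicit, testable steps.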

I should add that every row of the table is available, in the original row order. The first item (which is a dictionary) in the my_results list has the data from the first row, the second item has the data from the second row, and so on.

When I need a new build of lxml, I visit a page maintained by a very nice guy at UC Irvine.

I hope this helps

    from bs4 import BeautifulSoup
    import urllib2
    import lxml
    import pandas as pd

    url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    table = soup.find('table', {'data-ajax-content': 'true'})

    data = []
    for row in table.findAll('tr'):
        cells = row.findAll('td')
        cols = [ele.text.strip() for ele in cells]
        data.append([ele for ele in cols if ele])

    df = pd.DataFrame(data)
    print df
    dictframe = df.to_dict()
    print dictframe

The above code gives you a DataFrame from the web page and then uses it to create a Python dictionary.
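It may be worth noting that pandas can also parse an HTML table directly via pandas.read_html, which collapses most of the row-and-cell looping above into one call (it uses lxml or bs4 under the hood). A small sketch on stand-in markup, since the live FT page may not be reachable:

```python
from io import StringIO
import pandas as pd

# Stand-in for a financials table; read_html treats the <th> row
# as the header and returns a list of DataFrames, one per <table>.
html_doc = """
<table>
  <tr><th>Item</th><th>2008</th><th>2009</th></tr>
  <tr><td>EPS (Basic)</td><td>13.46</td><td>20.62</td></tr>
</table>
"""
df = pd.read_html(StringIO(html_doc))[0]
dictframe = df.to_dict()
print(df)
```

Numeric cells come back as floats rather than strings, which saves a conversion step later.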


Source: https://habr.com/ru/post/1491964/

