Get variable data inside script tag in Python or Content added from js

Question


I want to get data from another URL, for which I am using urllib and Beautiful Soup. The data is inside a table tag (which I found using the Firefox console). But when I try to get the table by its id, the result is None. So I think the table must be added dynamically by some JS code.

I tried both the 'lxml' and 'html5lib' parsers, but I still cannot get the table data.

I also tried one more thing:

    web = urllib.urlopen("my url")
    html = web.read()
    soup = BeautifulSoup(html, 'lxml')
    js = soup.find("script")
    ss = js.prettify()
    print ss

Result:

    <script type="text/javascript">
     myPage = 'ETFs';
     sectionId = 'liQuotes'; //section tab
     breadCrumbId = 'qQuotes'; //page
     is_dartSite = "quotes";
     is_dartZone = "news";
     propVar = "ETFs";
    </script>

But now I do not know how to read the values of these JS variables.

So I have two options: get the contents of the table, or get the JS variables. Either would do the job, but unfortunately I do not know how to get either of them. Please tell me how I can solve either of these problems.

thanks


javascript python web-scraping urllib2 beautifulsoup

Inforian Jun 09 '14 at 10:29


2 answers

mhawke · Answer 1 · 2014-06-09T12:42:06+0000

EDIT

This will do the trick, using the re module to extract the data and the json module to load it:

    import urllib
    import json
    import re
    from bs4 import BeautifulSoup

    web = urllib.urlopen("http://www.nasdaq.com/quotes/nasdaq-financial-100-stocks.aspx")
    soup = BeautifulSoup(web.read(), 'lxml')
    data = soup.find_all("script")[19].string
    p = re.compile('var table_body = (.*?);')
    m = p.match(data)
    stocks = json.loads(m.groups()[0])

    >>> for stock in stocks:
    ...     print stock
    ...
    [u'ASPS', u'Altisource Portfolio Solutions SA', 116.96, 2.2, 1.92, 86635, u'N', u'N']
    [u'AGNC', u'American Capital Agency Corp.', 23.76, 0.13, 0.55, 3184303, u'N', u'N']
    .
    .
    .
    [u'ZION', u'Zions Bancorporation', 29.79, 0.46, 1.57, 2154017, u'N', u'N']

The problem with this is that the script tag offset (19) is hard-coded, and there is no reliable way to locate that tag within the page. Changes to the page could break your code.

ORIGINAL answer

Instead of trying to scrape the data, you could download a CSV representation of the same data from http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx?render=download .

Then use Python's csv module to parse and process it. Not only is this more convenient, it will also be more resilient, because any change to the HTML could easily break your screen-scraping code.
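A minimal sketch of the CSV route, written for Python 3 with only the standard library. The sample text and its column names (`Symbol`, `Name`, `LastSale`) are assumptions made up for illustration; the real download's header row may well differ, so check it before relying on any column name.

```python
import csv
import io

# Hypothetical sample standing in for the downloaded file; the real
# header row and columns may differ.
sample_csv = (
    "Symbol,Name,LastSale\n"
    "ATVI,\"Activision Blizzard, Inc\",20.92\n"
    "ADBE,Adobe Systems Incorporated,66.91\n"
)


def parse_quotes(text):
    """Return one dict per data row, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


rows = parse_quotes(sample_csv)
for row in rows:
    # csv yields strings, so numeric columns need an explicit conversion.
    print(row["Symbol"], float(row["LastSale"]))
```

In real use the text would come from the download URL rather than a literal, for example via `urllib.request.urlopen(url).read().decode()`, assuming the response decodes with the default encoding.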

Otherwise, if you look at the actual HTML, you will find that the data is available on the page in the following script tag:

    <script type="text/javascript">var table_body = [["ATVI", "Activision Blizzard, Inc", 20.92, 0.21, 1.01, 6182877, .1, "N", "N"],
    ["ADBE", "Adobe Systems Incorporated", 66.91, 1.44, 2.2, 3629837, .6, "N", "N"],
    ["AKAM", "Akamai Technologies, Inc.", 57.47, 1.57, 2.81, 2697834, .3, "N", "N"],
    ["ALXN", "Alexion Pharmaceuticals, Inc.", 170.2, 0.7, 0.41, 659817, .1, "N", "N"],
    ["ALTR", "Altera Corporation", 33.82, -0.06, -0.18, 1928706, .0, "N", "N"],
    ["AMZN", "Amazon.com, Inc.", 329.67, 6.1, 1.89, 5246300, 2.5, "N", "N"],
    ....
    ["YHOO", "Yahoo! Inc.", 35.92, 0.98, 2.8, 18705720, .9, "N", "N"]];
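One caveat with this particular snippet: values such as `.1` and `.0` are valid JavaScript but not valid JSON, so `json.loads` would reject the array as shown. The sketch below (Python 3) is one possible workaround, not part of the original answer: it prefixes a `0` to bare leading-dot numbers before parsing. The helper name `parse_js_array` is hypothetical, and the substitution is deliberately naive; it would also rewrite text like ` .5` inside a string value and does not handle forms such as `-.5`.

```python
import json
import re


def parse_js_array(literal):
    """Parse a JS array literal after rewriting `.1` style numbers as `0.1`."""
    # Only touch a dot that follows '[', ',' or whitespace and precedes a
    # digit, so ordinary numbers like 20.92 are left alone.
    fixed = re.sub(r"(?<=[\[,\s])\.(?=\d)", "0.", literal)
    return json.loads(fixed)


# Two rows copied from the script tag shown above.
sample = (
    '[["ATVI", "Activision Blizzard, Inc", 20.92, 0.21, 1.01, 6182877, .1, "N", "N"], '
    '["ALTR", "Altera Corporation", 33.82, -0.06, -0.18, 1928706, .0, "N", "N"]]'
)

rows = parse_js_array(sample)
for row in rows:
    print(row)
```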

parkerproject · Answer 2 · 2014-11-26T19:42:10+0000

Just to add to @mhawke's answer: rather than hard-coding the script tag offset, you can loop through all the script tags and pick the one that matches your pattern:

    import urllib
    import json
    import re
    from bs4 import BeautifulSoup

    web = urllib.urlopen("http://www.nasdaq.com/quotes/nasdaq-financial-100-stocks.aspx")
    pattern = re.compile('var table_body = (.*?);')
    soup = BeautifulSoup(web.read(), "lxml")
    scripts = soup.find_all('script')
    for script in scripts:
        # script.string is None for empty or external scripts
        data = pattern.match(script.string or '')
        if data:
            stock = json.loads(data.groups()[0])
            print stock
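For readers on Python 3, the same idea can be sketched with the standard library alone. This is an illustration under stated assumptions rather than a drop-in replacement: `ScriptCollector`, `find_table_body`, and `sample_html` are hypothetical names, the sample page is made up from rows shown in the first answer, and the pattern assumes the array ends at the first `];` in the script.

```python
import json
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


# search() is used instead of match() so leading whitespace or other
# statements before the assignment do not prevent a hit.
PATTERN = re.compile(r"var\s+table_body\s*=\s*(\[.*?\]);", re.DOTALL)


def find_table_body(html):
    """Return the parsed table_body array, or None if no script defines it."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for body in collector.scripts:
        m = PATTERN.search(body)
        if m:
            return json.loads(m.group(1))
    return None


# Made-up page for illustration; the rows are taken from the first answer.
sample_html = """
<html><head>
<script type="text/javascript">myPage = 'ETFs';</script>
<script type="text/javascript">var table_body = [["ASPS", "Altisource Portfolio Solutions SA", 116.96, 2.2, 1.92, 86635, "N", "N"], ["ZION", "Zions Bancorporation", 29.79, 0.46, 1.57, 2154017, "N", "N"]];</script>
</head><body></body></html>
"""

stocks = find_table_body(sample_html)
for stock in stocks:
    print(stock)
```

Returning None when nothing matches lets the caller distinguish "page layout changed" from an empty table.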
