Setting up a Python screen scraper that can run on Google App Engine

Question

I want to set up an automated screen scraper that will run on Google App Engine using Python. I want it to scrape a site and store the specified results in an Entity in App Engine. I am looking for some guidance on what to use. I have seen BeautifulSoup, but I wonder whether people can recommend anything else that works on Google App Engine.

+3

python google-app-engine screen-scraping

cozza Mar 09 '10 at 1:38

4 answers

I have had good (albeit slow) results using mechanize and BeautifulSoup. In fact, to save space on Google App Engine, I use the (old) version of BeautifulSoup that is bundled with mechanize.

I keep mechanize in a zip file, mechanize.zip. The index of this zip file looks like this:

mechanize/
mechanize/__init__.py
mechanize/_auth.py
mechanize/_beautifulsoup.py
mechanize/_clientcookie.py
... etc

import sys
sys.path.insert(0, 'mechanize.zip')

import mechanize
from mechanize._beautifulsoup import BeautifulSoup
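
The zip trick above relies on Python's ability to import packages straight from a zip archive listed on sys.path. Here is a minimal, self-contained sketch of that mechanism. The package name demo_pkg and its contents are hypothetical stand-ins for mechanize, and the code targets Python 3 so it runs without any App Engine setup:

```python
import os
import sys
import tempfile
import zipfile


def build_demo_zip(zip_path):
    """Write a tiny package into a zip, mirroring the mechanize.zip layout."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        # demo_pkg/ is a hypothetical stand-in for mechanize/
        zf.writestr("demo_pkg/__init__.py", "VERSION = '0.1'\n")
        zf.writestr(
            "demo_pkg/_helper.py",
            "def greet(name):\n    return 'hello ' + name\n",
        )


# Build the archive in a temporary directory.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "demo_pkg.zip")
build_demo_zip(zip_path)

# Same trick as above: put the zip file itself at the front of the import path.
sys.path.insert(0, zip_path)

# Both the package and a private submodule are now importable from the zip.
import demo_pkg
from demo_pkg._helper import greet

print(demo_pkg.VERSION)
print(greet("scraper"))
```

The zip must contain the package directory at its top level (demo_pkg/__init__.py here, mechanize/__init__.py in the answer), otherwise the import will not find it.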

+1

pix 16 . '10 1:22

There is also lxml, but it is written in C, so it cannot be used on GAE.

0

Ignacio Vazquez-Abrams Mar 09 '10 at 1:42

I have used BeautifulSoup with great success for parsing HTML. The catch is that parsing HTML is all BeautifulSoup does, so I ended up writing all of the HTTP interactions myself using urlfetch.
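
The split described here (one piece fetches, another parses) might look roughly like the sketch below. It is not the code from this answer. The fetch step is passed in as a callable so that on App Engine it could wrap urlfetch.fetch, and the parser uses the standard library's HTMLParser as a dependency-free stand-in for BeautifulSoup. The names LinkCollector, scrape_links and fake_fetch are hypothetical, and the code targets Python 3:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (stand-in for a BeautifulSoup query)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def scrape_links(url, fetch):
    """Fetch `url` with the injected `fetch` callable and return its links.

    `fetch` must return a (status_code, html_text) pair. On App Engine it
    could wrap urlfetch.fetch(url) and return (result.status_code,
    result.content); that wrapper is an assumption and is not exercised here.
    """
    status, html = fetch(url)
    if status != 200:
        return []  # Treat any non-200 response as "nothing scraped".
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fake_fetch(url):
    """Canned response so the sketch runs without network access."""
    return 200, '<html><body><a href="/a">A</a><p><a href="/b">B</a></p></body></html>'


links = scrape_links("http://example.com/", fake_fetch)
print(links)
```

Keeping the fetch injectable also makes the parsing logic easy to test offline, which helps when the real requests are slow or rate-limited.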

To scrape my target, I need a full headless browser that can execute the JavaScript on the target site's pages. I think I need to drop the Python application and move to Java so that I can use HtmlUnit. Prototyping continues. - mattb

0

Matt Brown Apr 17 '10 at 22:41

Alex Martelli · Accepted Answer · 2010-03-09T02:24:30+0000

BeautifulSoup works fine on App Engine (just make sure you're using 3.0.8, not the iffy 3.1.0). The main alternative, I think, would be html5lib. I have not tried using it on App Engine myself, but I believe it works there, if rather slowly (if that is a problem, I think you need to stick with BeautifulSoup). For example, this service runs on App Engine and is based on html5lib.
